Should Answer Immediately or Wait for Further Information? A Novel Wait-or-Answer Task and Its Predictive Approach

by Zehao Lin, et al.
Zhejiang University

People differ in how they express their intents in conversation. Some tend to deliver their full intent over several successive utterances, i.e., they send several coherent short messages for readability instead of one long sentence. This creates a predicament for dialogue systems, especially in real-world industrial scenarios: the system is unsure whether it should answer the user's query immediately or wait for the user's further supplementary input. Motivated by this quandary, we define a novel task, Wait-or-Answer, to better tackle this dilemma, and we shed light on a new research topic: how a dialogue system can behave more competently when facing it. Further, we propose a predictive approach dubbed Imagine-then-Arbitrate (ITA) to resolve the Wait-or-Answer task. More specifically, we use an arbitrator model to help the dialogue system decide whether to wait or answer. The arbitrator's decision is made with the assistance of two ancillary imaginator models: a wait imaginator and an answer imaginator. The wait imaginator tries to predict what the user would supplement and uses its prediction to persuade the arbitrator that the user has more information to add, so the dialogue system should wait. The answer imaginator, in contrast, strives to predict the dialogue system's answer and to convince the arbitrator that answering the user's query immediately is the superior choice. To the best of our knowledge, our paper is the first work to explicitly define the Wait-or-Answer task for dialogue systems. Additionally, our proposed ITA approach significantly outperforms existing models on this Wait-or-Answer problem.




1. Introduction

Artificial dialogue systems are fundamental to natural user interfaces (gao2019neural) and have attracted more and more attention. Building a dialogue system is an emerging interdisciplinary problem at the intersection of Machine Learning (ML), Natural Language Processing (NLP), Information Retrieval (IR), etc., attracting many researchers in AI and IR, especially those targeting Question Answering (QA), deep semantics, and dialogue with intelligent agents. With the availability of large-scale dialogue corpora and advances in deep learning and reinforcement learning, conversational artificial intelligence has improved greatly. Ritter et al. (DBLP:conf/emnlp/RitterCD11) first treated the conversation system as a translation problem and applied phrase-based Statistical Machine Translation (DBLP:conf/ijcnlp/SetiawanLZO05). In recent years, many works have tried to improve the performance of data-driven dialogue systems along multiple dimensions, e.g., dialogue states (Wu2019TransferableMS; Le2020NonAutoregressiveDS), dialogue generation (Li2017AdversarialLF), emotion integration (Hasegawa2013PredictingAE), knowledge integration (ghazvininejad2018knowledge; wu2019global), etc.

In industrial conversation services, we observe an interesting phenomenon: a large portion of users are accustomed to delivering their intents in several successive utterances rather than in a single utterance. This creates a critical dilemma for dialogue systems, in which the system is not sure whether it should wait for further input from the user or simply answer the question right away. In this paper, we name this issue the Wait-or-Answer task. The dialogue system's decision on this issue is crucial because (1) without such a mechanism, the user must express a whole intention in one breath to cooperate with the rigid dialogue system, and (2) cutting in too early, or waiting for further input after the intent is already complete, confuses the user and often forces the conversation to be repeated. This Wait-or-Answer quandary becomes even more complicated in multi-turn dialogue systems. Despite the surge of attention to dialogue system models, very few research works have investigated the Wait-or-Answer problem. To the best of our knowledge, our paper is the first work that clearly defines and investigates this problem.

To address the aforementioned Wait-or-Answer problem, an obvious way is to directly apply a classifier model such as TextCNN or BERT. Such methods only consider the information in the past dialogue history and omit the possible future intentions of the user and the agent. Intuitively, suppose that we can predict (1) what the user would supplement if the user wants to add further information, and (2) what the dialogue system would answer if the user has finished his or her question and is waiting for the answer; then the dialogue system can decide with more confidence whether to wait or to answer. Motivated by these intuitions, we propose a model named Imagine-then-Arbitrate (ITA). As shown in Figure 1, an arbitrator model controls whether the bot should answer the user query or wait for further information. Besides the arbitrator model, there are two auxiliary imaginator models: the wait imaginator and the answer imaginator. The wait imaginator persuades the arbitrator to decide that the bot should wait for further input from the user, whereas the answer imaginator tries to convince the arbitrator that the bot should immediately answer the user's query. More specifically, the two imaginators are generative models: (1) The wait imaginator tries to predict what the user will add to the existing input. Its input is the current dialogue history, and its output is a prediction of the user's supplementary information. This output is used to convince the arbitrator that the bot should wait. (2) The answer imaginator strives to predict what the bot should reply to the user's query. Its input is also the current dialogue history, and its output is a prediction of the agent's answer. This output is used to make the arbitrator believe that answering the user's query immediately is the superior choice. Given the suggestions from these two imaginators, the arbitrator decides whether the bot should wait or answer.

Figure 1. An overview of the ITA framework. The wait imaginator and the answer imaginator predict the possible future utterances of the user and the agent, respectively, and help the arbitrator solve the Wait-or-Answer task.

In summary, this paper makes the following contributions:

  1. To the best of our knowledge, our paper is the first work to explicitly define the Wait-or-Answer task, which is crucial to further enhancing the capacity of dialogue systems.

  2. We propose a novel model, dubbed Imagine-then-Arbitrate (ITA), to solve the Wait-or-Answer task, which uses two imaginator models and one arbitrator model to help the dialogue system decide whether to wait or to answer.

  3. We further propose a dataset construction method that prepares existing public datasets for the Wait-or-Answer task.

  4. Experimental results demonstrate that our model significantly outperforms the baselines, which shows the benefits brought by our ITA framework.

The rest of this paper is organized as follows: we first give some background knowledge in Section 2, after which we describe the significance and detailed formulation of the Wait-or-Answer task in Section 3. We present our proposed Imagine-then-Arbitrate (ITA) framework in Section 4. Section 5 and Section 6.1 cover the experimental setup and the analysis of results. Finally, we conclude in Section 7.

2. Background

Since our paper mainly focuses on Wait-or-Answer in a dialogue system, we give some background knowledge about dialogue systems in Section 2.1. Besides, since our proposed ITA framework involves both generative models (for the imaginators) and classification models (for the arbitrator), we present some preliminaries on generative and classification models in NLP in Section 2.2.

2.1. Dialogue Systems

Creating a perfect artificial human-computer dialogue system has always been an ultimate goal of natural language processing. Research on dialogue systems is mainly divided into two groups: task-oriented dialogue systems and chit-chat dialogue systems. Task-oriented dialogue systems (DBLP:conf/eacl/ManningE17; DBLP:conf/sigdial/EricKCM17; YanDCZZL17) aim at solving tasks in specific domains with grounding knowledge, while chit-chat bots (yan2018chitty; li2019follow; hancock2019learning) mainly concentrate on interacting with humans to provide reasonable responses and entertainment (DBLP:journals/sigkdd/ChenLYT17). In recent years, research on task-oriented dialogue systems has mainly concentrated on dialogue states (budzianowski-etal-2018-multiwoz) and knowledge integration (wu2019global; lin-etal-2019-task). Chit-chat bots focus on conversing with humans in open domains. Though chit-chat bots seem to behave very differently from task-oriented dialogue systems, as revealed by Yan et al. (yan2017building), nearly 80% of utterances in the online shopping scenario are chit-chat messages, and handling those queries is closely related to user experience.

The recent development of big data and deep learning techniques has greatly advanced both task-oriented dialogue systems and chit-chat bots, and has encouraged a large body of deep-learning-based research on dialogue systems. Much work has investigated applying neural networks to the components of a dialogue system or to end-to-end dialogue frameworks (YanDCZZL17; lipton2018bbq-networks). The advantage of deep learning is its ability to leverage large amounts of data from the internet, sensors, etc. Big conversation data and deep learning techniques like SEQ2SEQ (NIPS2014_5346) and the attention mechanism (DBLP:conf/emnlp/LuongPM15) help the model understand utterances, retrieve background knowledge, and generate responses.

2.2. Generative and Classification Models

Dialogue Generation. In general, two major approaches to dialogue have been developed, distinguished by the type of reply: (1) generative methods such as sequence-to-sequence models, which generate proper responses during the conversation; and (2) retrieval-based methods, which learn to select responses for the current conversation from a repository.

The generative method has been attracting more and more attention (Liu2018KnowledgeDF; Zhao2020LowResourceKD). The main reason is that, compared with retrieval-based dialogue systems, generative models can produce more accurate and more flexible replies, which are more user-friendly. Unlike the retrieval method, Natural Language Generation (NLG) tries to convert a communication goal, selected by the dialogue manager, into natural language form. It reflects the naturalness of a dialogue system, and thus the user experience. Another reason is that, beyond the fluency and accuracy of the responses, generative systems are much easier for ordinary users to use than rigid bots.

Conventional template-based or rule-based approaches mainly consist of a set of templates, rules, and hand-crafted heuristics designed by domain experts. This makes them labor-intensive yet rigid, motivating researchers to find more data-driven approaches (ghazvininejad2018knowledge; lin-etal-2019-task) that aim to optimize a generation module from corpora. One of them, the Semantically Controlled LSTM (SC-LSTM) (wen2015semantically), a variant of LSTM (hochreiter1997long), provides semantic control over language generation through an extra component. As for fully data-driven dialogue systems, SEQ2SEQ-based (NIPS2014_5346) encoder-decoder frameworks and the attention mechanism (DBLP:conf/emnlp/LuongPM15) are still the most widely adopted techniques (lin-etal-2019-task; ghazvininejad2018knowledge; DBLP:conf/www/ChenRTZY18).

Text Classification. Text classification is a critical problem across NLP tasks. Text classification problems in various situations have been widely investigated (jiang2018text; kowsari2017hdltex; lai2015recurrent) over the last few decades (kowsari2019text). Classification can be applied to text at multiple levels, e.g., document classification (yang2016hierarchical; manevitz2001one), sentence classification (komninos2016dependency), emotion classification (xia-ding-2019-emotion), etc.

Though end-to-end methods play an increasingly important role in dialogue systems, text classification modules (jiang2018text; kowsari2017hdltex) remain very useful in many problems such as emotion recognition (song-etal-2019-generating), gender recognition (hoyle-etal-2019-unsupervised), verbal intelligence, etc. Several widely used text classification methods have been proposed, e.g., Recurrent Neural Networks (RNNs) and CNNs. Typically, an RNN is trained to recognize patterns across time, while a CNN learns to recognize patterns across space. TextCNNs, trained on top of pre-trained word vectors for sentence-level classification tasks, achieved excellent results on multiple benchmarks.

Besides RNNs and CNNs, Vaswani et al. (vaswani2017attention) proposed a new network architecture called the Transformer, based solely on the attention mechanism, which obtained promising performance on many NLP tasks. To make the best use of unlabeled data, Devlin et al. (devlin2018bert) introduced a new language representation model called BERT, based on the Transformer, which obtained state-of-the-art results.

Figure 2. A multi-turn dialogue fragment. In this case, a user sends split utterances in a turn, e.g. split U1 to {U11, U12 and U13}

3. The Wait-or-Answer Task

In this section, we first describe why we investigate the Wait-or-Answer task in Section 3.1, after which we present the detailed task formulation in Section 3.2.

3.1. Why We Study The Wait-or-Answer Task?

Conventional dialogue systems mainly concentrate on the accuracy and fluency of generated or retrieved answers. These dialogue systems, including most commercial chatbots, require users to strictly follow the designed conversation instructions. In particular, users are required to describe their intents in a single sentence.

However, this setting does NOT hold in real life. For instance, as shown in Figure 2, in a real-world scenario in which a user asks for information about a theater, the agent first starts the conversation with “Good morning. Vane Theater at your service.” (A1), and the user then replies with three sentences: first “Hello” (U11), second “I’m thinking about watching a Chinese traditional opera with a foreign girl.” (U12), and third “What’s on this weekend?” (U13). Generally speaking, users do not deliver all sentences in one breath. If the agent cuts in at the wrong moment, e.g., by replying immediately to the user’s second sentence, it has to guess what the user really wants and misses the important information “on this weekend” in the third sentence. So in this case, the agent should wait until the user has finished the last message; otherwise, the pace of the conversation will be disrupted. However, existing dialogue agents cannot handle this scenario well and reply immediately to every utterance they receive.

There are mainly two issues when applying existing dialogue agents to the real-life conversation:

  1. When receiving a short utterance from the user at the start of a conversation, existing dialogue systems lack the capability to hold back, and thus generate bad responses based on a semantically incomplete utterance.

  2. Existing dialogue systems may cut into the conversation at an unreasonable time, which can confuse the user, disrupt the pace of the conversation, and thus lead to nonsensical interactions. In other words, existing dialogue systems can NOT catch the right opportunity to Answer or Wait.

As stated above, it is worthwhile to investigate the Wait-or-Answer task, which would enable dialogue systems to decide appropriately in the wait-or-answer dilemma.

Figure 3. Model Overview. (a) The answer and wait imaginators are trained on the same dialogues but with different samples. (b) During training and inference, the arbitrator uses the dialogue history and the two trained imaginators’ predictions.

3.2. Task Formulation

To this end, we propose the Wait-or-Answer task and test on our modified datasets. Unlike traditional conversation datasets, which combine all of a user's successive utterances into one, in our setting a user's message does not end with an explicit ending signal but may be split into several utterances, which we call subturns, that are sent sequentially. The task of the agent is not to generate a reply but to recognize whether the user has sent the complete message, in which case it should reply immediately, or whether it should wait for the user's subsequent subturns.

Our problem is formulated as follows. There is a conversation history represented as a sequence of utterances $X = \{x_1, x_2, \dots, x_m\}$, where each utterance $x_i$ is itself a sequence of words $x_i = (w_{i,1}, w_{i,2}, \dots, w_{i,n})$. Besides, each utterance has some additional tags:

  • turn tags $t_0, t_1, \dots, t_k$ to show which turn this utterance is in the whole conversation.

  • speakers’ identification tags $agent$ or $user$ to show who sends this utterance.

  • subturn tags $st_0, st_1, \dots, st_j$ for the user to indicate which subturn an utterance is in. Note that an utterance is labelled $st_0$ even if its turn is not split.

Now, given a dialogue history $X$ and tags $T$, the goal of the model is to predict a label $Y \in \{0, 1\}$, the action the agent would take, where $Y = 0$ means the agent will wait for the user's next message, and $Y = 1$ means the agent will reply immediately. Formally, we are going to maximize the following probability:

$$P(Y \mid X, T) \tag{1}$$
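To make the formulation concrete, the following minimal Python sketch represents tagged utterances and derives a wait/answer label for each user utterance. The `Utterance` class, its field names, and `wait_or_answer_labels` are illustrative names, not part of the original formulation. Two assumptions are made here: the label encoding is 0 for wait and 1 for answer, and the label is derived from who speaks next in the recorded dialogue.

```python
from dataclasses import dataclass
from typing import List, Tuple

WAIT, ANSWER = 0, 1  # assumed label encoding: 0 = wait, 1 = answer


@dataclass
class Utterance:
    words: List[str]   # the utterance as a sequence of words
    turn: int          # turn tag
    speaker: str       # speaker tag: "agent" or "user"
    subturn: int = 0   # subturn tag; 0 when the turn is not split


def wait_or_answer_labels(dialogue: List[Utterance]) -> List[Tuple[int, int]]:
    """Return (index, label) for every user utterance that has a successor.

    Assumption of this sketch: the agent should wait if the user keeps
    talking next, and answer if the next utterance belongs to the agent.
    """
    labels = []
    for i in range(len(dialogue) - 1):
        if dialogue[i].speaker != "user":
            continue
        next_speaker = dialogue[i + 1].speaker
        labels.append((i, WAIT if next_speaker == "user" else ANSWER))
    return labels


# The theater example, with a placeholder for the agent's second utterance.
dialogue = [
    Utterance("Good morning . Vane Theater at your service .".split(), 0, "agent"),
    Utterance(["Hello"], 0, "user", 0),
    Utterance("I'm thinking about watching a Chinese traditional opera".split(), 0, "user", 1),
    Utterance("What's on this weekend ?".split(), 0, "user", 2),
    Utterance(["<agent-reply>"], 1, "agent"),
]
labels = wait_or_answer_labels(dialogue)
```

Under these assumptions, the first two user subturns are labelled wait and the third is labelled answer.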


4. The Imagine-then-Arbitrate (ITA) Framework

In this section, we first present an overview of our ITA framework in Section 4.1. Then we describe the overall training and inference phases in Section 4.2. Section 4.3 and Section 4.4 detail the model structures of the imaginators and the arbitrator, respectively.

4.1. The Overview of ITA Framework

As shown in Figure 1, there are two imaginators and one arbitrator in the ITA framework. The arbitrator makes the final decision about whether the dialogue system should answer the user's query immediately or wait for further information from the user. We use two imaginator models to assist the arbitrator. The wait imaginator tries to predict what the user might supplement. It then uses this simulated query to convince the arbitrator to wait for the user's following input, since the user does have some information to add. The answer imaginator, in contrast, predicts what the dialogue system would answer to the user's present query. It then uses this simulated answer to make the arbitrator believe that the dialogue system should answer immediately, because the user has finished his or her input.

In fact, these two imaginators act as a world model (ha2018recurrent) for the arbitrator, i.e., they create a virtual environment that simulates possible future changes in order to train the agent. More specifically, the output of the wait imaginator (the simulated query) and the output of the answer imaginator (the simulated answer) both function as simulated experience. Peng et al. (peng2018deep) first proposed Deep Dyna-Q, which incorporates a world model into the dialogue agent to mimic real user responses and generate simulated experience. Compared with models that directly apply a classifier, e.g., TextCNN or BERT, to solve the Wait-or-Answer problem, our proposed imaginators are better at learning semantic information, covering both the history and possible future utterances, by training on the corpus, and they give supplemental predictions to the arbitrator. The imaginators also magnify errors, which provides negative feedback and makes it easier for ITA to learn to distinguish which decision is better.

4.2. Training and Inference of ITA Framework

Figure 3 shows the procedure of the model's training and inference. We first train the wait imaginator on dialogue histories whose ground truth is the user's next utterance (e.g., from [A1, U11] to U12 in Figure 2) and the answer imaginator on dialogue histories whose ground truth is the agent's next utterance (e.g., from [A1, U11, U12, U13] to A2). We then run the two trained imaginators to infer the possible future user and agent utterances, which serve as the arbitrator's training data. With this design, the two imaginators not only simulate possible future dialogue to support the wait and answer actions, but also magnify errors: for example, the wait imaginator performs poorly when the ground truth is answer, because it never learned how to speak like an agent, and this makes it easier for the arbitrator to distinguish which decision is better. Finally, we feed the predicted utterances together with the original dialogue history to train the arbitrator. During inference, we simply use the two imaginators to predict the possible user and agent utterances and combine them with the dialogue history (e.g., [A1, U11, U12]) for the arbitrator to decide whether the model should wait or answer directly.
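The inference-time data flow described above can be sketched as follows. This is only an illustration of how the three components are wired together: `ita_decide` and the three toy functions are hypothetical stand-ins, whereas the real imaginators and arbitrator are trained neural models. The encoding 0 = wait, 1 = answer is likewise an assumption of the sketch.

```python
from typing import Callable, List

History = List[str]


def ita_decide(
    history: History,
    wait_imaginator: Callable[[History], str],
    answer_imaginator: Callable[[History], str],
    arbitrator: Callable[[History, str, str], int],
) -> int:
    """Return 0 (wait) or 1 (answer) for the given dialogue history."""
    imagined_user = wait_imaginator(history)      # what the user might add
    imagined_agent = answer_imaginator(history)   # what the agent might reply
    return arbitrator(history, imagined_user, imagined_agent)


# Toy stand-ins, purely for illustration.
def toy_wait_imaginator(history: History) -> str:
    return "what's on this weekend ?"


def toy_answer_imaginator(history: History) -> str:
    return "which show would you like ?"


def toy_arbitrator(history: History, user_next: str, agent_next: str) -> int:
    # Hypothetical rule that ignores the imagined utterances:
    # answer only if the last utterance looks like a question.
    return 1 if history[-1].rstrip().endswith("?") else 0


decision = ita_decide(
    ["good morning .", "hello", "i'm thinking about watching an opera ."],
    toy_wait_imaginator,
    toy_answer_imaginator,
    toy_arbitrator,
)
```

The point of the sketch is the interface: both imaginators see the same history, and the arbitrator sees the history plus both imagined continuations.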

4.3. Imaginator

An imaginator is a natural language generator generating the next sentence given the dialogue history. There are two imaginators in our method, the wait imaginator, and the answer imaginator. The goal of the two imaginators is to learn the user’s and agent’s speaking style respectively and generate possible future utterances.

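As shown in Figure 3 (a), the imaginator itself is a sequence generation model. We use one-hot embedding to convert all words and related tags, e.g., turn tags and placeholders, to one-hot vectors $w \in \mathbb{R}^{V}$, where $V$ is the length of the vocabulary list. Then we extend each word in an utterance by concatenating the token itself with its turn tag, identity tag, and subturn tag. We adopt SEQ2SEQ as the basic architecture and LSTMs as the encoder and decoder networks. The LSTMs encode each extended word $\tilde{w}_t$ as a continuous vector $h_t$ at each time step $t$. The process can be formulated as follows:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}; e_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}; e_t] + b_i) \\
o_t &= \sigma(W_o [h_{t-1}; e_t] + b_o) \\
\tilde{s}_t &= \tanh(W_s [h_{t-1}; e_t] + b_s) \\
s_t &= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t \\
h_t &= o_t \odot \tanh(s_t)
\end{aligned} \tag{2}
$$

where $e_t$ is the embedding of the extended word $\tilde{w}_t$, $s_t$ is the cell state, $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $W_f$, $W_i$, $W_o$, $W_s$, $b_f$, $b_i$, $b_o$, and $b_s$ are learnt parameters.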

Though trained on the same dataset, the two imaginators learn different roles independently. So we split the same piece of dialogue into different samples for the different imaginators. For example, as shown in Figures 2 and 3 (a), we use utterances (A1, U11, U12) as the dialogue history input and U13 as ground truth to train the wait imaginator, and we use utterances (A1, U11, U12, U13) as the dialogue history and A2 as ground truth to train the answer imaginator.
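The sample split in the example above can be sketched in a few lines of Python. The function name `split_imaginator_samples` and the `(speaker, utterance)` tuple layout are illustrative. The sketch also assumes that the wait imaginator is trained only on user utterances that continue a user turn, and the answer imaginator only on agent utterances that follow a user utterance.

```python
from typing import List, Tuple

Utt = Tuple[str, str]            # (speaker, utterance id or text)
Sample = Tuple[List[str], str]   # (dialogue history, ground-truth next utterance)


def split_imaginator_samples(dialogue: List[Utt]) -> Tuple[List[Sample], List[Sample]]:
    """Split one dialogue into wait- and answer-imaginator training samples."""
    wait_samples: List[Sample] = []
    answer_samples: List[Sample] = []
    for i in range(1, len(dialogue)):
        prev_speaker = dialogue[i - 1][0]
        speaker, target = dialogue[i]
        history = [text for _, text in dialogue[:i]]
        if speaker == "user" and prev_speaker == "user":
            wait_samples.append((history, target))     # the user supplements
        elif speaker == "agent" and prev_speaker == "user":
            answer_samples.append((history, target))   # the agent replies
    return wait_samples, answer_samples


dialogue = [("agent", "A1"), ("user", "U11"), ("user", "U12"),
            ("user", "U13"), ("agent", "A2")]
wait_samples, answer_samples = split_imaginator_samples(dialogue)
```

For this dialogue the wait imaginator receives (A1, U11) with target U12 and (A1, U11, U12) with target U13, while the answer imaginator receives (A1, U11, U12, U13) with target A2.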

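During training, the encoder runs as in equation 2, and the decoder is an LSTM with the same structure whose hidden state $h_t$ is fed to a Softmax layer with parameters $W_v$ and $b_v$, which produces a probability distribution $p_t$ over all words, formally:

$$p_t = \mathrm{softmax}(W_v h_t + b_v) \tag{3}$$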

| Datasets | MultiWoz | | | DailyDialogue | | | CCPE | | |
|---|---|---|---|---|---|---|---|---|---|
| Split | Train | Valid | Test | Train | Valid | Test | Train | Valid | Test |
| Vocabulary Size | 2443 | | | 6219 | | | 4855 | | |
| Dialogues | 8423 | 1000 | 1000 | 11118 | 1000 | 1000 | 398 | 49 | 52 |
| Avg. Turns/Dialogue | 6.32 | 6.97 | 6.98 | 4.09 | 4.21 | 4.03 | 9.7 | 9.96 | 9.92 |
| Avg. Split User Turns | 1.89 | 1.92 | 1.94 | 2.09 | 2.12 | 2.12 | 3.12 | 3.02 | 2.73 |
| Avg. Utterance Length | 10.54 | 10.7 | 10.56 | 8.71 | 8.54 | 8.75 | 7.93 | 8.02 | 7.78 |
| Avg. Agent’s Utterances Per Dialogue | 14.43 | 14.78 | 14.69 | 12.04 | 11.81 | 12.17 | 8.7 | 8.84 | 8.19 |
| Avg. User’s Utterances Per Dialogue | 6.18 | 6.28 | 6.17 | 5.91 | 5.87 | 5.96 | 7.61 | 7.66 | 7.56 |
| Agent Wait Sample Size | 47341 | 6410 | 6573 | 49540 | 4717 | 4510 | 8183 | 973 | 894 |
| Agent Answer Sample Size | 53249 | 6970 | 6983 | 41547 | 3846 | 3689 | 3455 | 436 | 464 |

Table 1. Datasets Statistics. The vocabulary size is reported once per dataset. Note that the statistics are based on the modified dataset described in Section 5.1.2.

The decoder at time step $t$ selects the word with the highest probability in the output distribution $p_t$, and our imaginator’s loss is the sum of the negative log-likelihood of the correct word $w_t^{*}$ at each step:

$$\mathcal{L}_{IG} = -\sum_{t=1}^{N} \log p_t(w_t^{*}) \tag{4}$$

where $N$ is the length of the generated sentence. During inference, we also apply beam search to improve generation performance.
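As a small worked example of this loss, the sketch below sums the negative log-probabilities of the correct words over the decoding steps and also shows the per-step selection of the most probable word. The function names and the toy probabilities are illustrative only; a real imaginator would obtain the distributions from its decoder, and the sketch uses greedy selection rather than beam search.

```python
import math
from typing import List


def sequence_nll(step_distributions: List[List[float]], target_ids: List[int]) -> float:
    """Sum of negative log-probabilities of the correct word at each step."""
    assert len(step_distributions) == len(target_ids)
    return -sum(math.log(p[t]) for p, t in zip(step_distributions, target_ids))


def greedy_pick(distribution: List[float]) -> int:
    """Index of the most probable word in one decoder step."""
    return max(range(len(distribution)), key=distribution.__getitem__)


# Toy example: a 3-word vocabulary and a 2-step target sentence.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = sequence_nll(probs, [0, 1])            # -(log 0.7 + log 0.8)
picked = [greedy_pick(p) for p in probs]      # most probable word per step
```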

Finally, the trained answer imaginator and wait imaginator are obtained.

4.4. Arbitrator

The arbitrator module is fundamentally a text classifier. However, in this task, we make the module maximally utilize the semantic information of both the dialogue history and the possible future utterances. So we turn the problem of maximizing $P(Y \mid X, T)$ in equation (1) into:

$$P(Y \mid X, T, R_{ans}, R_{wait}), \quad R_{ans} = IG_{ans}(X, T), \quad R_{wait} = IG_{wait}(X, T) \tag{5}$$

where $IG_{ans}$ and $IG_{wait}$ are the trained answer imaginator and wait imaginator respectively, and $Y$ acts as a selection indicator, where $Y = 1$ means selecting $R_{ans}$ whereas $Y = 0$ means selecting $R_{wait}$. Thus we (1) introduce the supervision information in the imaginators’ training data, together with the predicted possible future utterances, and (2) turn the label prediction problem into a response selection problem.

We adopt several architectures like Bi-GRUs, TextCNNs, and BERT as the basis of the arbitrator module. We will show how to build an arbitrator by taking TextCNNs as an example.

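As shown in Figure 3, three CNNs with the same structure take the inferred responses $R_{ans}$ and $R_{wait}$ and the dialogue history $X$ with tags $T$. For each raw word sequence $(w_1, \dots, w_n)$, we embed each word as a one-hot vector $w_i \in \mathbb{R}^{V}$. By looking up a word embedding matrix $E \in \mathbb{R}^{V \times d}$, the input text is represented as an input matrix $Q \in \mathbb{R}^{n \times d}$, where $n$ is the length of the sequence of words and $d$ is the dimension of the word embedding features. The matrix is then fed into a convolution layer where a filter $F \in \mathbb{R}^{h \times d}$ is applied:

$$c_i = f(F \cdot Q_{i:i+h-1} + b_F) \tag{6}$$

where $Q_{i:i+h-1}$ is the window of $h$ token representations starting at position $i$, $f$ is a non-linear activation function, and $F$ and $b_F$ are learnt parameters. Applying this filter to every possible window $Q_{i:i+h-1}$ yields a feature map:

$$c = [c_1, c_2, \dots, c_{n-h+1}] \tag{7}$$

where $c \in \mathbb{R}^{n-h+1}$. We use filters of different sizes in parallel in the same convolution layer. This means we have $K$ window sizes $h_1, \dots, h_K$ at the same time, so formally:

$$C = [c^{(1)}, c^{(2)}, \dots, c^{(K)}] \tag{8}$$

Then we apply a max-over-time pooling operation to capture the most important feature of each feature map:

$$\hat{C} = [\max(c^{(1)}), \max(c^{(2)}), \dots, \max(c^{(K)})] \tag{9}$$

and thus we get the final feature map $\hat{C}$ of the input sequence.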
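The convolution and max-over-time pooling steps can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names are hypothetical, ReLU is assumed as the activation function, each filter size is represented by a single filter, and the input is assumed to contain at least as many tokens as the largest window.

```python
from typing import List

Matrix = List[List[float]]


def relu(x: float) -> float:
    return max(0.0, x)


def conv_feature_map(q: Matrix, filt: Matrix, bias: float) -> List[float]:
    """Slide an h x d filter over the n x d input; returns n - h + 1 features."""
    h = len(filt)
    features = []
    for i in range(len(q) - h + 1):
        total = bias
        for r in range(h):
            total += sum(a * b for a, b in zip(q[i + r], filt[r]))
        features.append(relu(total))
    return features


def text_cnn_features(q: Matrix, filters: List[Matrix], biases: List[float]) -> List[float]:
    """Max-over-time pooling of each filter's feature map."""
    return [max(conv_feature_map(q, f, b)) for f, b in zip(filters, biases)]


# Toy input: n = 4 tokens with d = 2 embedding dimensions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 1.0], [1.0, 1.0]],               # window size h = 2
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],   # window size h = 3
]
features = text_cnn_features(q, filters, [0.0, -1.0])
```

Each filter contributes one pooled value, so the length of the final feature vector equals the number of filters regardless of the input length.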

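We apply the same CNNs to get the feature maps of the dialogue history $X$ and the inferred responses $R_{ans}$ and $R_{wait}$:

$$\hat{C}_{X} = \mathrm{TextCNNs}(X, T), \quad \hat{C}_{ans} = \mathrm{TextCNNs}(R_{ans}), \quad \hat{C}_{wait} = \mathrm{TextCNNs}(R_{wait}) \tag{10}$$

where the function TextCNNs() follows equations 6 to 9. Then we have two possible dialogue paths, $X$ with $R_{ans}$ and $X$ with $R_{wait}$, with representations $D_{ans}$ and $D_{wait}$:

$$D_{ans} = [\hat{C}_{X}; \hat{C}_{ans}], \quad D_{wait} = [\hat{C}_{X}; \hat{C}_{wait}] \tag{11}$$

Then the arbitrator calculates the probability of the two possible dialogue paths:

$$P = \mathrm{softmax}(W_D [D_{wait}; D_{ans}] + b_D) \tag{12}$$

Through the learned parameters $W_D$ and $b_D$, we get a two-dimensional probability distribution $P$, in which the most reasonable response has the maximum probability. This also indicates whether the agent should wait or not.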
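A minimal sketch of this last step is given below. It assumes that each dialogue path is represented by concatenating the history features with the features of one imagined response, and that a single linear layer followed by a softmax scores the two paths. Both are assumptions of the sketch, as are the function names and the toy weights.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def arbitrate(hist: List[float], wait_feat: List[float], ans_feat: List[float],
              weights: List[List[float]], bias: List[float]) -> List[float]:
    """Return [P(wait), P(answer)] from the two dialogue-path representations."""
    path_wait = hist + wait_feat      # history followed by imagined user utterance
    path_answer = hist + ans_feat     # history followed by imagined agent reply
    joint = path_wait + path_answer
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


hist, wait_feat, ans_feat = [1.0], [2.0], [0.5]
weights = [[0.0, 1.0, 0.0, 0.0],   # wait score reads the wait-path feature
           [0.0, 0.0, 0.0, 1.0]]   # answer score reads the answer-path feature
probs = arbitrate(hist, wait_feat, ans_feat, weights, [0.0, 0.0])
decision = max(range(2), key=probs.__getitem__)   # 0 = wait, 1 = answer
```

With these toy values the wait path receives the higher score, so the sketch decides to wait.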

| Original Dialogue | | Modified Dialogue | |
|---|---|---|---|
| Roles | Utterance | Label | Utterance |
| Agent | Good morning. Vane Theater at your service. | Role: Agent, Turn: 0 | Good morning. Vane Theater at your service. |
| User | Hello. I’m thinking about watching a Chinese traditional opera with a foreign girl. What’s on this weekend? | Role: User, Turn: 0, Subturn: 0 | Hello. |
| | | Role: User, Turn: 0, Subturn: 1 | I’m thinking about watching a Chinese traditional opera with a foreign girl. |
| | | Role: User, Turn: 0, Subturn: 2 | What’s on this weekend? |

Table 2. Comparison of a piece of the original dialogue and our modified dialogue following Section 5.1.2. Note that the slot values, knowledge base, and ontology content of task-oriented corpora like MultiWoz are not shown here.

The total loss function of the whole arbitrator module is the negative log-likelihood of the probability of choosing the correct action:

$$\mathcal{L}_{arb} = -\sum_{i=1}^{M} \log P(Y_i \mid X_i, T_i, R_{ans,i}, R_{wait,i}) \tag{13}$$

where $M$ is the number of samples and $Y_i$ is the ground truth label of the $i$-th sample.

The arbitrator module based on Bi-GRU and BERT is implemented similarly to TextCNNs.

5. Experimental Setup

In this section, we first present the process of data construction in Section 5.1, after which we give the evaluation metrics we use in Section 5.2. Finally, we detail the training setup of the baselines and our ITA models in Section 5.3 and Section 5.4 respectively.

5.1. Datasets Construction

5.1.1. Original Datasets

Since the proposed approach mainly concentrates on human-computer interaction, we select and modify three datasets with very different styles to test the performance and generalization of our method. Two of them are task-oriented dialogue datasets: the fairly large MultiWoz 2.0 and the smaller Coached Conversational Preference Elicitation (CCPE), which has many more turns per dialogue. The third is a chit-chat dataset, DailyDialogue. All datasets are collected from human-to-human conversations. We evaluate and compare the results with the baseline methods in multiple dimensions. Table 1 shows the statistics of the datasets, and the details of each dataset are described below:

  • MultiWOZ 2.0 (budzianowski-etal-2018-multiwoz). The Multi-Domain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations. Compared with previous task-oriented dialogue datasets, e.g., DSTC 2 (henderson-etal-2014-second) and KVR (DBLP:conf/sigdial/EricKCM17), it is a much larger multi-turn conversational corpus that spans several domains and topics.

  • DailyDialogue (li-etal-2017-dailydialog). DailyDialogue is a high-quality multi-turn dialogue dataset that contains conversations about daily life. In this dataset, humans often first respond to the previous context and then propose their own questions and suggestions. In this way, people pay more attention to others’ words and are willing to continue the conversation. Compared to the task-oriented dialogue datasets, the speakers’ behavior is more unpredictable and complex.

  • CCPE (radlinski2019coached). CCPE is a dataset consisting of 502 English dialogues. Though it seems much smaller than MultiWoz 2.0 and DailyDialogue, it has 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers and concentrates on the movie domain. We select this dataset to test whether our model runs well on both larger and smaller datasets.

5.1.2. The Pipeline of Dataset Construction

| Dataset | MultiWoz | DailyDialogue | CCPE |
|---|---|---|---|
| Random | 50.91 | 50.77 | 55.72 |
| Bi-GRU | 79.12 | 75.23 | 67.53 |
| GRU-ITA | 82.03 | 77.80 | 72.69 |
| TextCNNs | 77.68 | 75.79 | 68.65 |
| TextCNN-ITA | 80.75 | 79.02 | 73.32 |
| BERT | 80.75 | 78.68 | 70.99 |
| BERT-ITA | 82.73 | 79.35 | 75.41 |

Table 3. Accuracy results on the three datasets. Each ITA model outperforms its corresponding baseline, and BERT-ITA achieves the best result on every dataset. Random is a script that makes random decisions according to the rate of positive/negative samples.

The task we concentrate on, deciding whether to wait or to answer, is quite different from what traditional dialogue systems address, so existing dialogue datasets cannot directly provide the information needed for training and testing. Thus we propose a fairly simple and general dataset construction method that directly rebuilds existing public dialogue corpora.

We modify the datasets with the following steps:

  1. Delexicalisation: For task-oriented dialogue, slot labels are important for guiding the system to complete a specific task. However, those labels and the exact values from the ontology files do not essentially benefit our task, so we replace all specific values with a slot placeholder in the preprocessing step.

  2. Utterance segmentation: Existing datasets concentrate on the dialogue content and combine multiple sentences into one utterance per turn when gathering the data. In this step, we randomly split each combined utterance into multiple utterances at punctuation marks with a predetermined probability, which decides whether the pre-processing program should split at a given sentence.

  3. Extra labeling: We add several labels, including turn tags, subturn tags, and role tags, to each split and original sentence in order to (1) label the speaker role and dialogue turns and (2) mark the ground truth for supervised training and for evaluating the baselines and our model.

Finally, we obtain modified datasets that imitate real-life human chatting behavior. As shown in Table 2, we compare one original dialogue (in this example, from DailyDialogue) with our modified one. Our modified datasets and code will be open-sourced to both academic and industrial communities.
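The segmentation and labeling steps above can be sketched roughly as follows. This is an illustrative sketch rather than the released preprocessing code: the function names, the `split_prob` parameter, the punctuation set, the choice to split only user turns, and the labeling rule (the label of a user sub-utterance is "answer" if the agent speaks next and "wait" if the user continues) are all assumptions made for the example.

```python
import random
import re


def split_utterance(text, split_prob, rng):
    """Split text at punctuation; each boundary becomes a split with probability split_prob."""
    # Cut the text into punctuation-terminated chunks (the punctuation set is an assumption).
    chunks = [c.strip() for c in re.findall(r"[^.!?,]+[.!?,]*", text) if c.strip()]
    pieces = []
    for chunk in chunks:
        if pieces and rng.random() >= split_prob:
            pieces[-1] += " " + chunk  # no split here: keep the chunk in the same sub-utterance
        else:
            pieces.append(chunk)       # split here: start a new sub-utterance
    return pieces


def build_samples(dialogue, split_prob=0.5, seed=0):
    """dialogue is a list of (role, text) turns; returns records with turn, subturn and role tags."""
    rng = random.Random(seed)
    records = []
    for turn_id, (role, text) in enumerate(dialogue):
        # Assumption: only user turns are segmented; agent turns stay whole.
        pieces = split_utterance(text, split_prob, rng) if role == "user" else [text]
        for subturn_id, piece in enumerate(pieces):
            records.append({"turn": turn_id, "subturn": subturn_id, "role": role, "text": piece})
    # Ground truth for a user record: does the agent speak next (answer) or the user again (wait)?
    for current, following in zip(records, records[1:]):
        if current["role"] == "user":
            current["label"] = "answer" if following["role"] == "agent" else "wait"
    return records
```

With `split_prob=1.0` every punctuation boundary becomes a split, and with `split_prob=0.0` the original utterance is kept whole. A trailing user record with nothing after it receives no label in this sketch.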

5.2. Evaluation Metrics

In our Wait-or-Answer task, we define the agent’s Answer action as the positive class and the Wait action as the negative class. As both the positive and negative actions are important in this task, we select models by accuracy rather than by precision or recall.

To compare with the baselines along multiple dimensions and test the models’ performance, we use the overall Bilingual Evaluation Understudy (BLEU) (DBLP:conf/acl/PapineniRWZ02) to evaluate the imaginators’ generation performance. For the arbitrator, we use accuracy as the main metric to evaluate the wait-or-answer decision and to select models. Apart from BLEU and accuracy, we also adopt Precision, Recall and F1 to evaluate the baselines and our models from multiple perspectives. Details are as follows:

  • Bilingual Evaluation Understudy (BLEU) (DBLP:conf/acl/PapineniRWZ02). BLEU has been widely employed in evaluating sequence generation, including machine translation, text summarization, and dialogue systems. BLEU calculates the n-gram precision, which is the fraction of n-grams in the candidate text that are present in any of the reference texts.

  • Accuracy. Accuracy measures how often the arbitrator model’s decision matches the ground truth in the test dataset. The accuracy score in our experiments is the ratio of correctly classified samples to all samples.

  • Precision. Also called positive predictive value, precision is the fraction of relevant instances among the retrieved instances. In our case, we calculate precision as the ratio of correctly predicted answer actions to all predicted answer actions in the test dataset.

  • Recall. Also known as sensitivity, recall is the fraction of all relevant instances that were actually retrieved. In our case, we calculate recall as the ratio of correctly predicted answer actions to all answer actions in the test dataset.

  • F1 Score. When one model is not better than another on both the precision p and the recall r, considering only one of them makes it difficult to determine which model is really better. The F1 score considers both p and r: we calculate it as the harmonic mean of the precision and the recall, F1 = 2pr / (p + r).
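The metric definitions above can be illustrated with a small sketch. It is not the evaluation script used in the experiments: the function names and the "answer"/"wait" label strings are assumptions, and `clipped_ngram_precision` covers only the n-gram precision component of BLEU (the full metric also combines several n-gram orders and applies a brevity penalty).

```python
from collections import Counter


def classification_metrics(gold, predicted, positive="answer"):
    """Accuracy, precision, recall and F1 with the Answer action as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def clipped_ngram_precision(candidate, references, n):
    """Fraction of candidate n-grams found in a reference, with counts clipped per reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```

For instance, comparing the token list of "thanks for all your help" against the single reference "thank you for all your help" matches four of the five candidate unigrams and three of the four candidate bigrams.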

| Model | MultiWoz Precision | MultiWoz Recall | MultiWoz F1 | DailyDialogue Precision | DailyDialogue Recall | DailyDialogue F1 | CCPE Precision | CCPE Recall | CCPE F1 |
|---|---|---|---|---|---|---|---|---|---|
| Bi-GRU | 75.69 | 87.70 | 80.85 | 72.07 | 72.94 | 71.86 | 54.83 | 31.15 | 38.49 |
| GRU-ITA | 79.02 | 88.87 | 83.27 | 77.97 | 77.29 | 75.71 | 63.40 | 48.47 | 53.50 |
| TextCNN | 73.61 | 88.85 | 80.03 | 71.03 | 78.49 | 73.91 | 59.78 | 27.63 | 36.35 |
| TextCNN-ITA | 77.17 | 89.08 | 82.52 | 77.14 | 74.87 | 75.35 | 68.99 | 41.90 | 51.43 |
| BERT | 76.93 | 89.46 | 82.73 | 75.03 | 78.86 | 76.90 | 59.21 | 48.49 | 53.31 |
| BERT-ITA | 80.36 | 87.99 | 84.00 | 75.92 | 79.23 | 77.54 | 67.86 | 53.23 | 59.66 |

Table 4. Multiple metrics results on three datasets. Better results between baselines and corresponding ITA models are in BOLD and best results on datasets are in RED.

5.3. Baselines and Their Training Setup

To ensure that the baselines and our models use the best hyper-parameter settings for each training set, we conduct experiments on the following baselines with tuned parameters:

  • Gated Recurrent Units (GRU) (chung2014empirical): we test hidden sizes from 200 to 600, dropout rates from 0.2 to 0.8, and batch sizes in [32, 64, 128, 256].

  • TextCNN (kim2014convolutional): we search for the best performance over batch sizes in [32, 64, 128, 256], dropout rates from 0.3 to 0.7, kernel numbers (the number of convolution kernels of each size) from 100 to 600, and kernel sizes in [(1,2,3),(3,4,5),(5,6,7),(7,8,9)].

  • BERT (devlin2018bert): we test learning rates in [2e-5, 3e-5, 5e-5], training epochs in [2.0, 3.0, 4.0] and batch sizes in [16, 32].
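As a rough illustration of how such a search space can be enumerated, the sketch below builds the TextCNN grid listed above. The step sizes for the dropout rate (0.1) and the kernel numbers (100) are not stated in the text and are assumptions here, as are the names `TEXTCNN_GRID`, `iter_configs` and `grid_search`; `evaluate` stands for a hypothetical function that trains a model with a configuration and returns its validation accuracy.

```python
from itertools import product

# Search space for the TextCNN baseline; ranges come from the text, step sizes are assumed.
TEXTCNN_GRID = {
    "batch_size": [32, 64, 128, 256],
    "dropout": [0.3, 0.4, 0.5, 0.6, 0.7],            # assumed step of 0.1
    "num_kernels": [100, 200, 300, 400, 500, 600],   # assumed step of 100
    "kernel_sizes": [(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9)],
}


def iter_configs(grid):
    """Yield every combination of hyper-parameter values as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, evaluate):
    """Return the configuration with the highest score under evaluate(config)."""
    return max(iter_configs(grid), key=evaluate)
```

Under these assumed step sizes the grid contains 4 x 5 x 6 x 4 = 480 configurations, so an exhaustive search means training 480 models per dataset.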

5.4. ITA Models and Their Training Setup

To test the performance of our proposed ITA framework, we apply it to the baselines and obtain GRU-ITA, TextCNN-ITA and BERT-ITA. The detailed settings are as follows:

  • GRU-ITA: on MultiWOZ, the batch size is 32, the hidden size is 300, and the dropout rate is 0.3. On DailyDialogue, the batch size is 64, the hidden size is 500, and the dropout rate is 0.5. On CCPE, the batch size is 32, the hidden size is 200, and the dropout rate is 0.8.

  • TextCNN-ITA: on MultiWOZ, the batch size is 64, the kernel number is 400, the kernel size is (7,8,9), and the dropout rate is 0.3. On DailyDialogue, the batch size is 32, the kernel number is 400, the kernel size is (5,6,7), and the dropout rate is 0.5. On CCPE, the batch size is 64, the kernel number is 600, the kernel size is (5,6,7), and the dropout rate is 0.4.

  • BERT-ITA: the maximum sequence length is 128, the batch size is 32, and the number of training epochs is 3.0 to 4.0.

During training, we also adopt a learning rate decay factor of 0.5. All experiments employ the teacher-forcing scheme, feeding the gold target of the previous time step. We also perform early stopping for the arbitrator when the number of validation epochs without improvement exceeds 6. We test hidden sizes in [32, 64, 128, 256] and dropout rates in [0.1, 0.2]. The learning rate is initialized to 0.001 and the training batch size is set to 64. The reported metrics come from the best-performing settings for each dataset.
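The early stopping and learning rate decay described above can be sketched as a generic training loop. The text does not say when the decay factor is applied, so decaying on every non-improving validation epoch is an assumption, as are the function name and the `run_epoch` callback (a hypothetical function that trains for one epoch at the given learning rate and returns the validation accuracy).

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=6,
                              initial_lr=0.001, decay=0.5):
    """Train until validation accuracy has failed to improve for more than `patience` epochs."""
    lr = initial_lr
    best_score, best_epoch, stale_epochs = float("-inf"), -1, 0
    for epoch in range(max_epochs):
        score = run_epoch(lr)
        if score > best_score:
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            lr *= decay  # assumption: halve the learning rate after each non-improving epoch
            if stale_epochs > patience:
                break  # the count of non-improving epochs has gone past the patience
    return best_score, best_epoch, lr
```

The returned `best_epoch` identifies which checkpoint would be kept for testing; saving and restoring the checkpoint itself is left out of the sketch.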

6. Experimental Results and Analysis

In this section, we present the results of the baselines and our models in Section 6.1 and analyze these results and the reasons behind them in Section 6.2.

6.1. Results

To illustrate the benefits brought by our ITA framework, we compare our ITA models with the baselines in Table 3. (Without loss of generality, the imaginators in our ITA models, GRU-ITA, TextCNN-ITA and BERT-ITA, all adopt an LSTM structure with an attention mechanism.) Each ITA model achieves higher accuracy than its counterpart on all three datasets. Additionally, to evaluate our models and baselines along other dimensions, we show the comparison results on precision, recall and F1 in Table 4 (these models are selected by their accuracy scores), which also support the benefits brought by our ITA framework. To better analyze the effects of the imaginators in our ITA framework, we present the performance of ITA models with different types of imaginators in Table 5.

For a better understanding of our ITA framework, we also present an example of how ITA acts in Table 6. The wait imaginator predicts what the user might supplement based on the dialogue history: “Thanks for all your help”. The answer imaginator, however, predicts what the dialogue system might answer: “Would you like me to book it for you”. Based on both predictions, the arbitrator concludes that it is a better choice to wait for the user’s supplementary input. So, the dialogue system decides to wait rather than to answer.
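The decision flow in this example can be sketched as follows. The sketch only shows how the three components interact: the interfaces (`Imaginator`, `Arbitrator`, `wait_or_answer`), the probability output and the 0.5 threshold are assumptions, and the lambdas are placeholders standing in for the trained generators and classifier.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: an imaginator maps the dialogue history to a predicted utterance,
# and the arbitrator maps (history, imagined user text, imagined agent text) to P(answer).
Imaginator = Callable[[List[str]], str]
Arbitrator = Callable[[List[str], str, str], float]


def wait_or_answer(history: List[str],
                   wait_imaginator: Imaginator,
                   answer_imaginator: Imaginator,
                   arbitrator: Arbitrator,
                   threshold: float = 0.5) -> Tuple[str, str, str]:
    """Imagine both continuations, then let the arbitrator choose between waiting and answering."""
    imagined_user = wait_imaginator(history)      # what the user might add next
    imagined_agent = answer_imaginator(history)   # what the agent might reply now
    p_answer = arbitrator(history, imagined_user, imagined_agent)
    decision = "answer" if p_answer >= threshold else "wait"
    return decision, imagined_user, imagined_agent


# Placeholder components that mimic the example in Table 6.
history = ["User: No", "User: That s OK", "User: I can take it from here"]
decision, imagined_user, imagined_agent = wait_or_answer(
    history,
    wait_imaginator=lambda h: "Thanks for all your help",
    answer_imaginator=lambda h: "Would you like me to book it for you",
    arbitrator=lambda h, user_text, agent_text: 0.2,  # stand-in score favouring "wait"
)
```

In the real system the arbitrator score would come from the trained classifier (GRU, TextCNN or BERT) rather than from a constant.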

6.2. Analysis

In this section, we firstly discuss the comparison results between the baselines and our ITA models to illustrate the benefits brought by our ITA framework. Then we analyse the extra advantages brought by ITA for small-scale datasets. Finally, we discuss the effects of imaginator models in the ITA framework.

| Type of imaginator | Imaginator | MultiWoz Agent | MultiWoz User | MultiWoz Wait-or-Answer | DailyDialogue Agent | DailyDialogue User | DailyDialogue Wait-or-Answer | CCPE Agent | CCPE User | CCPE Wait-or-Answer |
|---|---|---|---|---|---|---|---|---|---|---|
| N/A | | N/A | N/A | 77.68 | N/A | N/A | 75.79 | N/A | N/A | 68.65 |
| LSTM | Answer Imaginator | 11.77 | 0.80 | | 4.51 | 0.61 | | 15.71 (37.8) | 0.00 (8.4) | |
| LSTM | Wait Imaginator | 0.3 | 8.87 | 80.04 | 0.15 | 8.70 | 76.37 | 0.00 (9.9) | 1.14 (19.9) | 70.04 |
| LSTM+Attn. | Answer Imaginator | 12.47 | 0.72 | | 19.19 | 0.60 | | 23.86 (45.2) | 0.00 (8.4) | |
| LSTM+Attn. | Wait Imaginator | 0.24 | 9.71 | 80.75 | 0.26 | 24.52 | 79.02 | 0.00 (12.3) | 1.46 (24.3) | 73.32 |
| LSTM (with GloVe) + Attn. | Answer Imaginator | 13.37 | 0.67 | | 19.01 | 0.67 | | 19.56 (43.0) | 0.00 (8.0) | |
| LSTM (with GloVe) + Attn. | Wait Imaginator | 0.51 | 10.61 | 80.38 | 0.21 | 24.65 | 78.56 | 0.00 (13.0) | 1.77 (22.5) | 71.62 |

Table 5. The effects of different types of imaginators in the ITA framework. All models adopt TextCNN as the classification model. The baseline is the TextCNN arbitrator without imaginators. The Agent and User columns give the BLEU scores of the queries or answers generated by the imaginators, and the Wait-or-Answer columns give the ITA model’s accuracy, which is shared by the two imaginators of each type. Better results between imaginators are in BOLD and best results on datasets are in RED.

Benefits Brought By ITA Framework  From Table 3, we can see that our BERT-ITA model achieves the best accuracy on all datasets. The other two ITA models, GRU-ITA and TextCNN-ITA, also outperform their corresponding baselines. Even the most rudimentary ITA model, GRU-ITA, beats all baselines (GRU, TextCNN and BERT) on MultiWOZ and CCPE, although on DailyDialogue the BERT baseline remains slightly ahead of it (78.68 vs. 77.80). The results on more evaluation metrics in Table 4 also support the view that our ITA framework is a more suitable choice for the Wait-or-Answer task.

User: Actually
User: Can you suggest [value_count] of them
User: Can I get their contact info as well
Agent: Sure , I would suggest the [restaurant_name] at [restaurant_address] . You can reach them at [restaurant_phone] . I could, reserve it for you .
User: No
User: That s OK
User: I can take it from here
Ground Truth: User: Thank you for all your help
Answer Imaginator: Would you like me to book it for you
Wait Imaginator: Thanks for all your help
Arbitrator Selection: Wait Imaginator
Table 6. An example of the imaginators’ generation and the arbitrator’s selection.

ITA’s Advantage on Small-scale Datasets  One of the most crucial limits on the application of dialogue systems is the lack of high-quality datasets. We therefore analyze ITA’s effects on small-scale datasets. As shown in Table 1, CCPE is a relatively small-scale dataset: it consists of only 502 dialogues but has many more turns per dialogue on average (9.7 in the train set, compared with 4.09 in DailyDialogue and 6.32 in MultiWOZ). The numbers of positive (Agent Answer) and negative (Agent Wait) samples are also more imbalanced. This makes it much more difficult to train a satisfying model.

As shown in Table 3, the baselines Bi-GRU, TextCNN, and BERT achieve accuracy scores of 67.53, 68.65 and 70.99 on CCPE, considerably worse than their scores on large-scale datasets such as MultiWOZ. In contrast, our ITA models GRU-ITA, TextCNN-ITA, and BERT-ITA reach 72.69, 73.32 and 75.41. As shown in Table 5, the imaginators’ BLEU scores on this small dataset are not worse than those on larger datasets like MultiWOZ. All imaginators help the arbitrator improve over the arbitrator alone, and the improvement is positively correlated with the imaginators’ performance: the LSTM imaginators with attention get the best generation scores and the best arbitrator results.

We therefore conclude that on small-scale and imbalanced datasets, the baselines have more difficulty achieving high results. Our ITA models, however, can learn more semantic information from the dialogue history with the help of the wait imaginator and the answer imaginator, and in this way achieve considerably better results than the baselines.

Effects of Imaginators in ITA Framework  Another interesting issue is the effect of the imaginators on the ITA framework. We investigate this issue by answering the following questions:

  1. Do the imaginator models work as we expect? We want to check whether the wait imaginator can truly predict the user’s supplementary input and whether the answer imaginator can precisely predict the dialogue system’s answer. We conduct an experiment on the MultiWOZ dataset. As shown in Table 5, the LSTM-based answer imaginator gets a BLEU score of 11.77 on agent samples, in which the ground truth is the agent’s utterance, whereas the wait imaginator gets a BLEU score of only 0.3 on agent samples. Similar results are shown in the other imaginators’ experiments. This does not mean that the wait imaginator performs poorly. Rather, these results show that our wait imaginator successfully behaves like a user, and its difficulty in generating agent utterances matches our design. In the example in Table 6, the agent utterance predicted by the answer imaginator is a fluent, high-quality sentence that suits the scene. However, given the dialogue history, it is not a good choice: the user has already said “I can take it from here” in the last turn, so the answer imaginator’s prediction “Would you like me to book it for you” is a poor candidate for arbitration, and the arbitrator prefers to wait for the user’s further utterance.

    From the above, we conclude that the two imaginators produce contrasting results as we expect, which helps the arbitrator in the Wait-or-Answer task.

  2. Can a better imaginator lead to a better ITA model? Another interesting question is whether improving the imaginators always leads to better performance of the ITA models. Taking DailyDialogue as an example, the imaginators’ performance increases with the attention mechanism and pre-trained GloVe vectors: with the attention mechanism, the answer imaginator’s BLEU score increases from 4.51 to 19.19, and with the attention mechanism and pre-trained GloVe vectors, it increases from 4.51 to 19.01. The accuracy of the ITA models also increases, from 76.37 to 79.02 and from 76.37 to 78.56, respectively. We observe the same phenomenon on MultiWOZ. These results suggest a positive correlation between the performance of the imaginators and that of the final ITA models.

From the above analysis, we can conclude that both the wait imaginator and the answer imaginator can significantly enhance the arbitrator models by predicting the dialogue interaction behavior.

7. Conclusion

Conventional dialogue systems require users to describe their intents in a single utterance; otherwise, the system answers immediately, which may cause misunderstanding or a reply to the wrong question. Motivated by this problem, we explicitly define a novel task dubbed Wait-or-Answer, which sheds light on enhancing existing dialogue systems’ ability to handle the wait-or-answer plight. Additionally, we propose an Imagine-then-Arbitrate (ITA) model to tackle this Wait-or-Answer task, which uses two imaginator models and an arbitrator model to decide whether to answer or to wait. Experimental results demonstrate that our ITA models achieve clear improvements over the baselines on this Wait-or-Answer task. We believe that our proposed Wait-or-Answer task provides an interesting topic for both academic and industrial NLP communities. We are optimistic about the future of our ITA framework.