All species are unique, but language makes humans the uniquest [premack2004language]. Dialogue, especially spoken and written dialogue, is a fundamental communication mechanism for human beings. In real life, a great deal of business and entertainment is conducted via dialogue, which makes it significant and valuable to build intelligent dialogue products. So far there are quite a few business applications of dialogue techniques, e.g. personal assistants, intelligent customer service and chitchat companions.
The quality of the response has always been the most important metric for dialogue agents, targeted by most existing work and models that search for the best response. Some works incorporate knowledge [DBLP:conf/acl/FungWM18, lin-etal-2019-task] to improve the success rate of task-oriented dialogue models, while others [NIPS2015_5866] solve the rare-word problem to make responses more fluent and informative.
Despite the heated competition among models, the pace of interaction is also important for a human-computer dialogue agent, yet it has drawn little or no attention. Figure 1 shows a typical dialogue fragment in an instant messaging program. A user is asking the service about the schedule of a theater. The user first says hello (U11), follows with a description of his demand (U12), and then asks for a suggested arrangement (U13), each sent as a single message within one turn. The agent does not answer (A2) until the user finishes his description and asks his question. The user then makes a decision (U21) and asks a new question (U22), and the agent replies with (A3). It is quite normal and natural for the user to send several messages in one turn while the agent waits until the user finishes his last message; otherwise the pace of the conversation will be disrupted. However, existing dialogue agents cannot handle this scenario well and will reply immediately to every utterance received.
There are two issues when applying existing dialogue agents to real-life conversation. Firstly, when a user sends a short utterance as the start of a conversation, the agent has to make a decision to avoid generating bad responses based on a semantically incomplete utterance. Secondly, a dialogue agent cutting into the conversation at an unreasonable time could confuse the user and disrupt the pace of the conversation, leading to nonsensical interactions.
To address these two issues, in this paper we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to recognize whether it is the appropriate moment for the agent to reply when it receives a message from the user. Our method has two imaginator modules and an arbitrator module. The imaginators learn the agent's and the user's speaking styles respectively. The arbitrator uses the dialogue history and the future utterances imagined by the two imaginators to decide whether the agent should wait for the user or make a response directly.
In summary, this paper makes the following contributions:
We first address an interaction problem: whether the dialogue model should wait for the end of the user's utterance or make a response directly, in order to simulate real-life conversation, and we try several popular baseline models to solve it.
We propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to solve the problem above, based on both the historical conversation and the predicted possible future utterances.
We modified two popular dialogue datasets to simulate the real human dialogue interaction behavior.
Experimental results demonstrate that our model performs well on the ending-prediction issue, and the proposed imaginator modules significantly help the arbitrator outperform the baseline models.
2 Related Work
2.1 Dialogue System
Creating a perfect artificial human-computer dialogue system has always been an ultimate goal of natural language processing. In recent years, deep learning has become a basic technique in dialogue systems. Much work has investigated applying neural networks to dialogue system components or to end-to-end dialogue frameworks [YanDCZZL17, lipton2018bbq-networks]. The advantage of deep learning is its ability to leverage large amounts of data from the internet, sensors, etc. Large-scale conversation data and deep learning techniques like SEQ2SEQ [NIPS2014_5346] and the attention mechanism [DBLP:conf/emnlp/LuongPM15] help models understand utterances, retrieve background knowledge and generate responses.
2.2 Classification in Dialogue
Though end-to-end methods play an increasingly important role in dialogue systems, text classification modules [jiang2018text, kowsari2017hdltex] remain very useful for many problems such as emotion recognition [song-etal-2019-generating], gender recognition [hoyle-etal-2019-unsupervised], verbal intelligence, etc. Several widely used text classification methods have been proposed, e.g. Recurrent Neural Networks (RNNs) and CNNs. Typically, an RNN is trained to recognize patterns across time, while a CNN learns to recognize patterns across space. [kim2014convolutional] proposed TextCNNs trained on top of pre-trained word vectors for sentence-level classification tasks and achieved excellent results on multiple benchmarks.
Besides RNNs and CNNs, [vaswani2017attention] proposed a new network architecture called the Transformer, based solely on the attention mechanism, and obtained promising performance on many NLP tasks. To make the best use of unlabeled data, [devlin2018bert] introduced a new language representation model called BERT, based on the Transformer, and obtained state-of-the-art results.
2.3 Dialogue Generation
Different from retrieval methods, Natural Language Generation (NLG) converts a communication goal, selected by the dialogue manager, into natural language form. It reflects the naturalness of a dialogue system, and thus the user experience. The traditional template- or rule-based approach mainly contains a set of templates, rules and hand-crafted heuristics designed by domain experts, which makes it labor-intensive yet rigid. This motivates researchers to find more data-driven approaches [ghazvininejad2018knowledge, lin-etal-2019-task] that optimize a generation module from corpora. One of these, Semantically Controlled LSTM (SC-LSTM) [wen2015semantically], a variant of LSTM [hochreiter1997long], provides semantic control over language generation with an extra component.
3 Task Definition
In this section we describe the task through a scenario and then define it formally.
As shown in Figure 1, we have two participants in a conversation. One is the dialogue agent, and the other is a real human user. The agent's behavior is similar to that of most chatbots, except that it does not reply to every sentence received. Instead, the agent judges the right time to reply.
Our problem is formulated as follows. There is a conversation history represented as a sequence of utterances $U = \{u_1, u_2, \dots, u_n\}$, where each utterance $u_i$ is itself a sequence of words $u_i = \{w_1, w_2, \dots, w_m\}$. Besides, each utterance has some additional tags:
turn tags $t_i$ to show which turn this utterance is in within the whole conversation.
speakers' identification tags $r_{agent}$ or $r_{user}$ to show who sends this utterance.
subturn tags $s_j$ for the user to indicate which subturn an utterance is in. Note that an utterance will be labelled as $s_1$ even if it does not have subturns.
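The tagging scheme above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, tag formats and token layout are hypothetical, not the paper's exact preprocessing.

```python
# Hypothetical sketch of the tagging scheme: each token of an utterance is
# paired with a turn tag, a role tag (agent/user) and a subturn tag.
def tag_utterance(words, turn, role, subturn=1):
    """Attach turn/role/subturn tags to every token of one utterance."""
    return [(w, f"t{turn}", role, f"s{subturn}") for w in words]

history = [
    tag_utterance(["hello"], turn=1, role="user", subturn=1),
    tag_utterance(["i", "need", "two", "tickets"], turn=1, role="user", subturn=2),
]

# Flatten the tagged history into one token stream for a model.
flat = [tok for utt in history for tok in utt]
print(flat[0])  # ('hello', 't1', 'user', 's1')
```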
Now, given a dialogue history $U$ and tags $T$, the goal of the model is to predict a label $y \in \{0, 1\}$, the action the agent would take, where $y = 0$ means the agent will wait for the user's next message and $y = 1$ means the agent will reply immediately. Formally, we are going to maximize the following probability:

$$P(y \mid U, T)$$
4 Proposed Framework
Basically, the task can be simplified as a text classification problem. However, traditional classification models only use the dialogue history and predict the ground truth label, which ignores all context information in the next utterance. To make the best use of the training data, we propose a novel Imagine-then-Arbitrate (ITA) model that takes the dialogue history, the ground truth label, and the possible future utterances into consideration. In this section, we describe the architecture of our model and how it works in detail.
An imaginator is a natural language generator generating next sentence given the dialogue history. There are two imaginators in our method, agent’s imaginator and user’s imaginator. The goal of the two imaginators are to learn the agent’s and user’s speaking style respectively and generate possible future utterances.
As shown in Figure 2 (a), the imaginator itself is a sequence generation model. We use one-hot embedding to convert all words and related tags, e.g. turn tags and placeholders, to one-hot vectors $e \in \mathbb{R}^{|V|}$, where $|V|$ is the size of the vocabulary. Then we extend each word in an utterance by concatenating the token itself with its turn tag, identity tag and subturn tag. We adopt SEQ2SEQ as the basic architecture and LSTMs as the encoder and decoder networks. The LSTM encodes each extended word as a continuous vector $h_t$ at each time step $t$. The process can be formulated as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $x_t$ is the embedding of the extended word at step $t$, and $W_i$, $W_f$, $W_o$, $W_c$, $U_i$, $U_f$, $U_o$, $U_c$, $b_i$, $b_f$, $b_o$ and $b_c$ are learnt parameters.
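The gate equations above can be sketched as a single NumPy time step. This is a generic LSTM-cell illustration, not the paper's implementation; dimensions and parameter names are arbitrary.

```python
import numpy as np

# Minimal NumPy sketch of a standard LSTM cell: gates -> cell update -> hidden state.
def lstm_step(x, h_prev, c_prev, P):
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sig(P["Wi"] @ x + P["Ui"] @ h_prev + P["bi"])      # input gate
    f = sig(P["Wf"] @ x + P["Uf"] @ h_prev + P["bf"])      # forget gate
    o = sig(P["Wo"] @ x + P["Uo"] @ h_prev + P["bo"])      # output gate
    g = np.tanh(P["Wc"] @ x + P["Uc"] @ h_prev + P["bc"])  # candidate cell
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

d, hdim = 4, 3                                             # toy dimensions
rng = np.random.default_rng(0)
P = {f"W{k}": rng.normal(size=(hdim, d)) for k in "ifoc"}
P |= {f"U{k}": rng.normal(size=(hdim, hdim)) for k in "ifoc"}
P |= {f"b{k}": np.zeros(hdim) for k in "ifoc"}

h = c = np.zeros(hdim)
for t in range(5):                                         # encode a 5-step sequence
    h, c = lstm_step(rng.normal(size=d), h, c, P)
print(h.shape)  # (3,)
```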
Though trained on the same dataset, the two imaginators learn different roles independently. So from the same piece of dialogue, we create different samples for the different imaginators. For example, as shown in Figures 1 and 2 (a), we use utterances (A1, U11, U12) as the dialogue history input and U13 as the ground truth to train the user imaginator, and utterances (A1, U11, U12, U13) as the dialogue history and A2 as the ground truth to train the agent imaginator.
During training, the encoder runs as the equations above, and the decoder is a same-structured LSTM whose hidden state $h_t$ is fed to a Softmax layer, which produces a probability distribution $p_t$ over all words, formally:

$$p_t = \mathrm{Softmax}(W_s h_t + b_s)$$

The decoder at time step $t$ selects the highest-probability word in $p_t$, and the imaginator's loss is the sum of the negative log likelihood of the correct word $w_t^{*}$ at each step:

$$L = -\sum_{t=1}^{m} \log p_t(w_t^{*})$$

where $m$ is the length of the generated sentence. During inference, we also apply beam search to improve the generation quality.
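Beam search at inference time can be sketched as follows. This is a generic illustration, not the paper's decoder: `next_probs` is a toy stand-in for the decoder's softmax distribution.

```python
import math

# Generic beam search: keep the `beam_size` highest-scoring partial sequences,
# extend each with every candidate token, and carry finished beams forward.
def beam_search(next_probs, beam_size=2, max_len=4, eos="</s>"):
    beams = [([], 0.0)]                       # (token list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy distribution that prefers "yes" for two steps, then stops.
def next_probs(seq):
    return {"yes": 0.6, "no": 0.3, "</s>": 0.1} if len(seq) < 2 else {"</s>": 1.0}

print(beam_search(next_probs))  # ['yes', 'yes', '</s>']
```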
Finally, the trained agent imaginator and user imaginator are obtained.
The arbitrator module is fundamentally a text classifier. However, in this task, we make the module maximally utilize the semantic information of both the dialogue history and the ground truth. So we turn the problem of maximizing $P(y \mid U, T)$ in equation (1) into maximizing:

$$P(c \mid U, T, \hat{u}_{agent}, \hat{u}_{user})$$

where $\hat{u}_{agent}$ and $\hat{u}_{user}$ are the utterances generated by the trained agent imaginator and user imaginator respectively, and $c$ is a selection indicator, where $c = 0$ means selecting $\hat{u}_{user}$ whereas $c = 1$ means selecting $\hat{u}_{agent}$. Thus we (1) introduce the semantic information of the ground truth and the predicted possible future utterances and (2) turn the label prediction problem into a response selection problem.
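The decision rule can be illustrated with a toy scoring function. Note that `score` below is a simple word-overlap stand-in for the learned arbitrator described in this section; the function names and similarity measure are illustrative only.

```python
# Toy illustration of the ITA decision: score each imagined continuation
# against the history; the winner maps back to the wait/reply action.
def score(history, candidate):
    h = set(" ".join(history).split())
    return len(h & set(candidate.split())) / max(len(candidate.split()), 1)

def arbitrate(history, imagined_user, imagined_agent):
    """Return 'wait' if the user's imagined utterance fits better, else 'reply'."""
    if score(history, imagined_user) >= score(history, imagined_agent):
        return "wait"
    return "reply"

history = ["hello", "i want to book a theater ticket"]
print(arbitrate(history,
                imagined_user="a ticket for the evening show",
                imagined_agent="which theater would you like"))  # wait
```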
We adopt several architectures, such as Bi-GRUs, TextCNNs and BERT, as the basis of the arbitrator module. We show how to build an arbitrator by taking TextCNNs as an example.
As shown in Figure 2, three CNNs with the same structure take as input the inferred responses $\hat{u}_{agent}$, $\hat{u}_{user}$ and the dialogue history $U$ with tags $T$. For each raw word sequence, we embed each word as a one-hot vector. By looking up a word embedding matrix $E$, the input text is represented as an input matrix $X \in \mathbb{R}^{l \times d}$, where $l$ is the length of the word sequence and $d$ is the dimension of the word embedding features. The matrix is then fed into a convolution layer where a filter $W \in \mathbb{R}^{h \times d}$ is applied:

$$c_i = f(W \cdot X_{i:i+h-1} + b)$$

where $X_{i:i+h-1}$ is a window of $h$ token representations, the activation function $f$ is ReLU, and $W$ and $b$ are learnt parameters. Applying this filter to every possible window obtains a feature map:

$$\mathbf{c} = [c_1, c_2, \dots, c_{l-h+1}]$$

We use filters of different window sizes $h$ in parallel in the same convolution layer. For each feature map, we then apply a max-over-time pooling operation to capture the most important feature:

$$\hat{c} = \max(\mathbf{c})$$

Concatenating the pooled features of all filters, we get the final feature map of the input sequence.
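The convolution-plus-pooling step can be sketched in NumPy. This is a minimal illustration of the TextCNN feature extractor with arbitrary toy dimensions, not the paper's trained model.

```python
import numpy as np

def conv_maxpool(X, W, b):
    """X: (l, d) embedded sequence; W: (h, d) filter; returns one pooled feature."""
    h = W.shape[0]
    feats = [np.maximum(0.0, np.sum(W * X[i:i + h]) + b)   # ReLU(conv at position i)
             for i in range(X.shape[0] - h + 1)]
    return max(feats)                                      # max-over-time pooling

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))                               # 10 tokens, embedding dim 8
filters = [rng.normal(size=(h, 8)) for h in (2, 3, 4)]     # parallel window sizes
feature_map = np.array([conv_maxpool(X, W, 0.0) for W in filters])
print(feature_map.shape)  # (3,)
```

Each filter contributes one pooled scalar, so the final feature map has one entry per filter regardless of the sequence length.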
We apply the same CNNs to get the feature maps $z_{agent}$, $z_{user}$ and $z_{h}$ of $\hat{u}_{agent}$, $\hat{u}_{user}$ and the dialogue history respectively. The arbitrator then calculates the probability of the two possible dialogue paths:

$$p = \mathrm{Softmax}(W_p [z_{agent}; z_{user}; z_{h}] + b_p)$$

Through the learnt parameters $W_p$ and $b_p$, we get a two-dimensional probability distribution $p$, in which the most reasonable response has the maximum probability. This also indicates whether the agent should wait or not. The total loss function of the whole arbitrator module is the negative log likelihood of the probability of choosing the correct action:

$$L = -\sum_{i=1}^{N} \log p(c_i^{*})$$

where $N$ is the number of samples and $c_i^{*}$ is the ground truth label of the i-th sample.
The arbitrator modules based on Bi-GRU and BERT are implemented similarly to the TextCNN one.
Table 1: Statistics of the modified datasets.

| Statistic | MultiWOZ Train | Valid | Test | DailyDialogue Train | Valid | Test |
| Avg. Split User Turns | 1.89 | 1.92 | 1.94 | 2.09 | 2.12 | 2.12 |
| Avg. Utterance Length | 10.54 | 10.7 | 10.56 | 8.71 | 8.54 | 8.75 |
| Avg. Agent's Utterance | 14.43 | 14.78 | 14.69 | 12.04 | 11.81 | 12.17 |
| Avg. User's Utterance | 6.18 | 6.28 | 6.17 | 5.91 | 5.87 | 5.96 |
| Agent Wait Samples | 53249 | 6970 | 6983 | 41547 | 3846 | 3689 |
| Agent Reply Samples | 47341 | 6410 | 6573 | 49540 | 4717 | 4510 |
5 Experimental Setup
As the proposed approach mainly concentrates on human-computer interaction, we select and modify two datasets with very different styles to test the performance of our method. One is the task-oriented dialogue dataset MultiWOZ 2.0 (http://dialogue.mi.eng.cam.ac.uk/index.php/corpus/) and the other is the chitchat dataset DailyDialogue (http://yanran.li/dailydialog.html). Both datasets are collected from human-to-human conversations. We evaluate and compare the results with the baseline methods in multiple dimensions. Table 1 shows the statistics of the datasets.
5.1 Datasets
MultiWOZ 2.0 [budzianowski-etal-2018-multiwoz]. The MultiDomain Wizard-of-Oz dataset (MultiWOZ) is a fully-labeled collection of human-human written conversations. Compared with previous task-oriented dialogue datasets, e.g. DSTC 2 [henderson-etal-2014-second] and KVR [DBLP:conf/sigdial/EricKCM17], it is at least one order of magnitude larger, a multi-turn conversational corpus with dialogues spanning several domains and topics.
DailyDialogue [li-etal-2017-dailydialog]. DailyDialogue is a high-quality multi-turn dialogue dataset containing conversations about daily life. In this dataset, humans often first respond to the previous context and then propose their own questions and suggestions. In this way, people show their attention to others' words and their willingness to continue the conversation. Compared to task-oriented dialogue datasets, the speakers' behavior is more unpredictable and thus more complex for the arbitrator.
5.2 Datasets Modification
Because the task we concentrate on differs from traditional ones, to make the datasets fit our problem and real life, we modify them with the following steps:
Drop Slots and Values For task-oriented dialogue, slot labels are important for guiding the system to complete a specific task. However, those labels and the exact values from ontology files do not essentially benefit our task. So we replace all specific values with a slot placeholder in the preprocessing step.
Split Utterances Existing datasets concentrate on the dialogue content, combining multiple sentences into one utterance per turn when gathering the data. In this step, we randomly split each combined utterance into multiple utterances according to punctuation, using a predetermined probability to decide whether the preprocessing program should split at a given sentence boundary.
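The splitting step can be sketched as follows. The probability value, regex and function name are illustrative assumptions, not the paper's exact preprocessing script.

```python
import random
import re

# Sketch of the "Split Utterances" step: break a combined turn at punctuation
# boundaries, splitting at each boundary with a fixed probability p_split.
def split_turn(utterance, p_split=0.5, seed=None):
    rng = random.Random(seed)
    pieces = [s for s in re.split(r"(?<=[.!?])\s+", utterance) if s]
    merged = [pieces[0]]
    for piece in pieces[1:]:
        if rng.random() < p_split:
            merged.append(piece)          # start a new sub-utterance (message)
        else:
            merged[-1] += " " + piece     # keep in the same message
    return merged

turn = "Hello! I need two tickets. What time works?"
print(split_turn(turn, p_split=1.0))  # always split: three messages
```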
Add Turn Tag We add turn tags, subturn tags and role tags to each split and original sentence to (1) label the speaker role and dialogue turns and (2) tag the ground truth for training and testing the supervised baselines and our model.
Finally, we have the modified datasets, which imitate real-life human chatting behavior as shown in Figure 1. Our datasets and code (https://github.com/mumeblossom/ITA) will be released to the public for further research in both academia and industry.
5.3 Evaluation Method
To compare with the dataset baselines in multiple dimensions and test the model's performance, we use the overall Bilingual Evaluation Understudy (BLEU) [DBLP:conf/acl/PapineniRWZ02] score to evaluate the imaginators' generation performance. As for the arbitrator, we use the classification accuracy, i.e. the ratio of correct predictions over all samples.
5.4 Baselines and Training Setup
The hyper-parameter settings adopted in baselines and our model are the best practice settings for each training set. All models are tested with various hyper-parameter settings to get their best performance. Baseline models are Bidirectional Gated Recurrent Units (Bi-GRUs)[chung2014empirical], TextCNNs [kim2014convolutional] and BERT [devlin2018bert].
6 Experimental Results and Analysis
In Table 2, we show different imaginators' generation abilities and their performance with the same TextCNN-based arbitrator. Firstly, we gather the generation results of the agent and user imaginators based on LSTM, LSTM-attention and LSTM-attention with GloVe pretrained word embeddings. According to the BLEU metric, the latter two models achieve higher but similar results. Secondly, when the arbitrator is fixed to the TextCNN model, the latter two also achieve similar accuracy and significantly outperform the others, including the TextCNN baseline.
The performance of different arbitrators with the same LSTM-attention imaginators is shown in Table 3. From those results, we can directly compare with the corresponding baseline models. The imaginators with the BERT-based arbitrator achieve the best results on both datasets, while all ITA models beat the baseline models.
We also present an example of how our model runs in Table 4. The imaginators predict the agent's and user's utterances according to the dialogue history (shown as model predictions), and the arbitrator then selects the user imaginator's prediction, which fits the dialogue history better. It is worth noting that the agent imaginator also generates a high-quality sentence if only the generation quality is considered. However, referring to the dialogue history, it is not a good choice, since its semantics are already repeated in the agent's last turn.
6.2.1 Imaginators Benefit the Performance
From Table 3, we can see that not only does our BERT-based model get the best results on both datasets, but the other two models also significantly beat their corresponding baselines. Even the TextCNN-based model beats all baselines on both datasets.
Table 2 presents the experimental results on the MultiWOZ dataset. The LSTM-based agent imaginator gets a BLEU score of 11.77 on agent samples, in which the ground truth is the agent's utterance, and 0.80 on user samples. Meanwhile, the user imaginator gets a BLEU score of 0.3 on agent samples and 8.87 on user samples. Similar results are shown in the other imaginators' experiments. Although these comparisons seem somewhat unfair, since we do not have the agent's and user's real utterances at the same time under the same dialogue history, these results show that the imaginators did learn the speaking styles of the agent and user respectively. So the suitable imaginator's generation will be more similar to the ground truth, as in the example shown in Table 4, which means the response is more semantically suitable given the dialogue history.
If we fix the agent and user imaginators to the LSTM-attention model, the arbitrators achieve different performance with different base models, as shown in Table 3. As expected, ITA models beat their base models by nearly 2 to 3%, and the ITA-BERT model beats all other ITA models.
From all these results, we can conclude that imaginators significantly help the arbitrator predict the dialogue interaction behavior by providing the semantic information of possible future agent and user responses.
6.2.2 Relation of Imaginators and Arbitrator’s Performance
As shown in the DailyDialogue results of Table 2, the attention mechanism helps the generation task. The LSTM-attention and LSTM-attention-GloVe based imaginators get BLEU scores of more than 19 and 24 on their corresponding targets, while the LSTM without attention gets only 4.51 and 8.70. These results also affect the arbitrator: the imaginators with the attention mechanism achieve accuracy scores of 79.02 and 78.56, significantly better than the others. The same evidence exists in the results on MultiWOZ: all imaginators achieve similar generation performance, so the arbitrators achieve similar accuracy scores.
From those results, we can conclude that there is a positive correlation between the performance of the imaginators and that of the arbitrator. However, problems remain: it is not easy to evaluate dialogue generation performance. In the MultiWOZ results, the LSTM-GloVe based imaginators perform a little better than the LSTM-attention based ones, but the arbitrator results are the opposite. This may indicate that (1) when the imaginators' performance is high enough, the arbitrator's performance becomes stable, and (2) the BLEU score does not perfectly reflect the contribution to the arbitrator. We leave these hypotheses for future work.
7 Conclusion
In this paper, we first address an interaction problem, whether the dialogue model should wait for the end of the user's utterance or reply directly, in order to simulate real-life conversation behavior, and we propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to deal with it. Our model introduces imagined future utterances as additional semantic information for the prediction. We modified two popular dialogue datasets to fit the real situation. The results confirm that this additional information helps the arbitrator, even though it is imagined.