TripleNet: Triple Attention Network for Multi-Turn Response Selection in Retrieval-based Chatbots (CoNLL2019)
We argue that the importance of different utterances in the context for selecting the response usually depends on the current query. In this paper, we propose the model TripleNet to fully model the task with the triple <context, query, response> instead of the pair <context, response> used in previous works. The heart of TripleNet is a novel attention mechanism named triple attention, which models the relationships within the triple at four levels. The new mechanism updates the representation of each element based on the attention with the other two concurrently and symmetrically. We match the triple <C, Q, R> centered on the response from the char level to the context level for prediction. Experimental results on two large-scale multi-turn response selection datasets show that the proposed model significantly outperforms state-of-the-art methods. The TripleNet source code is available at https://github.com/wtma/TripleNet
Building a human-machine dialogue system is one of the most challenging tasks in Artificial Intelligence (AI). Existing works on building dialogue systems are mainly divided into two categories: retrieval-based methods Yan et al. (2016); Zhou et al. (2016) and generation-based methods Vinyals and Le (2015). The retrieval-based method retrieves multiple candidate responses from a massive repository and selects the best one as the system's response, while the generation-based method uses an encoder-decoder framework to generate the response, similar to machine translation.
|A: i downloaded angry ip scanner and now it doesn’t work and i can’t uninstall it|
|B: you installed it via package or via some binary installer|
|A: i installed from ubuntu soft center|
|B: hm i do n’t know what package it is but it should let you remove it the same way|
|A: ah makes sense then … hm was it a deb file|
|True Response: i think it was another format mayge sth starting with r|
|False Response: thanks i appreciate it try sudo apt-get install libxine-extracodecs|
In this paper, we focus on the retrieval-based method because it is more practical in applications. Selecting a response from a set of candidates is an important and challenging task for the retrieval-based method. Many previous approaches use Deep Neural Networks (DNNs) to select the response for single-turn conversation Lu and Li (2013). We study multi-turn response selection in this paper, which is considerably harder because it requires not only identifying important information such as keywords, phrases, and sentences, but also modeling the latent dependencies between the context, query, and candidate response.
Previous works Zhou et al. (2018); Wu et al. (2017) show that representing the context at different granularities is vital for multi-turn response selection. However, this alone is not enough. Figure 1 illustrates the problem with a real example from the Ubuntu Corpus. As demonstrated, two points should be modeled to solve the problem: (1) the importance of the current query should be highlighted, because it has a great impact on the importance of different utterances in the context. For example, the query in this case is about the format of the file ('deb file'), which makes the last two utterances (including the query itself) more important than the previous ones. If we only match the response with the context, the model may be misled by the high-frequency word 'install' and choose the false candidate. (2) Information at different granularities is important, including not only the word, utterance, and context levels, but also the char level. For example, different tenses ('install,' 'installed') and misspelled words ('mayge') appear constantly in the conversation. Similar to the role of the question in machine reading comprehension Seo et al. (2016); Cui et al. (2017); Chen et al. (2019), the query in this task is the key to selecting the response. In this paper, we propose a model named TripleNet to fully exploit the role of the query. The main contributions of our work are as follows.
We use a novel triple attention mechanism to model the relationships within the triple <C, Q, R> instead of the pair <C, R>;
we propose a hierarchical representation module to fully model the conversation from the char level to the context level;
the experimental results on the Ubuntu and Douban corpora show that TripleNet significantly outperforms the state-of-the-art methods.
Earlier works on building conversation systems are generally based on rules or templates Walker et al. (2001), which are designed for a specific domain and need much human effort to collect the rules and domain knowledge. As the portability and coverage of such systems are far from satisfactory, people pay more attention to data-driven approaches for open-domain conversation systems Ritter et al. (2011); Higashinaka et al. (2014). The main challenge for open-domain conversation is to produce a corresponding response based on the current context. As mentioned previously, the retrieval-based and generation-based methods are the mainstream approaches to conversational response generation. In this paper, we focus on the task of response selection, which belongs to the retrieval-based approach.
The early studies of response selection generally focus on single-turn conversation, using only the current query to select the response Lu and Li (2013); Ji et al. (2014); Wang et al. (2015). Since it is hard to get the topic and intention of the conversation from a single turn, researchers turned their attention to multi-turn conversation, modeling the context instead of just the current query to predict the response. First, Lowe et al. (2015) released the Ubuntu Dialogue dataset and proposed a neural model which matches the context and response with corresponding representations via RNNs and LSTMs. Kadlec et al. (2015) evaluated the performance of various models on the dataset, such as LSTMs, Bi-LSTMs, and CNNs. Later, Yan et al. (2016) concatenated utterances with the reformulated query and various features in a deep neural network. Baudiš et al. (2016) regarded the task as sentence pair scoring and implemented an RNN-CNN neural network model with attention. Zhou et al. (2016) proposed a multi-view model with CNN and RNN, modeling the context in both word and utterance view. Further, Xu et al. (2017) proposed a deep neural network that incorporates background knowledge for conversation via an LSTM with a specially designed recall gate. Wu et al. (2017) proposed matching the context and response by their word and phrase representations, which was a significant improvement over previous work. Zhang et al. (2018) introduced a self-matching attention to route the vital information in each utterance, and used an RNN to fuse the matching result. Zhou et al. (2018) used self-attention and cross-attention to construct the representations at different granularities, achieving a state-of-the-art result.
Our model differs from the previous methods: first, we model the task with the triple <C, Q, R> instead of the pair <C, R> used in earlier works, and use a novel triple attention matching mechanism to model the relationships within the triple. Then we represent the context from the low (character) to the high (context) level, which constructs the representations of the context more comprehensively.
In this section, we will give a detailed introduction of the proposed model TripleNet. We first formalize the problem of the response selection for multi-turn conversation. Then we briefly introduce the overall architecture of the proposed model. Finally, the details of each part of our model will be illustrated.
We define the task of response selection as follows: given the context $C$, the current query $Q$, and a candidate response $R$, which differs from almost all previous works Zhou et al. (2018); Wu et al. (2017), we aim to build a model $g(C, Q, R)$ to predict the probability that the candidate response is the correct one.
The information in the context is composed of four levels: context, utterances, words, and characters, which can be formulated as $C = \{U_1, U_2, \ldots, U_n\}$, where $U_i$ represents the $i$-th utterance and $n$ is the maximum utterance number. The last utterance $U_n$ in the context is the query $Q$; we still keep the query at the end of the context to maintain the integrity of the information in the context. Each utterance can be formulated as $U_i = \{w_{i,1}, \ldots, w_{i,m}\}$, where $w_{i,j}$ is the $j$-th word in the utterance and $m$ is the maximum word number in the utterance. Each word can be represented by multiple characters $w = \{c_1, \ldots, c_l\}$, where $c_k$ is the $k$-th character and $l$ is the length of the word in characters. The latter two levels are defined similarly for the query and response.
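As an illustration of this four-level formalization, the following Python sketch (ours, not the authors' code; the function name and truncation limits are assumptions) structures a conversation into context, utterance, word, and character levels, with the last utterance kept as the query:

```python
# Illustrative sketch of the four-level structure: context -> utterances
# -> words -> characters, with the last utterance also serving as the query.
# The function name and the truncation limits are our own assumptions.

def build_context(utterances, max_turns=12, max_words=20):
    """Structure a conversation into the levels used by the model."""
    context = []
    for utt in utterances[-max_turns:]:      # keep only the most recent turns
        words = utt.split()[:max_words]      # word level
        chars = [list(w) for w in words]     # char level
        context.append({"words": words, "chars": chars})
    query = context[-1]                      # the last utterance is the query Q
    return context, query

context, query = build_context([
    "i installed from ubuntu soft center",
    "hm was it a deb file",
])
```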
The overall architecture of TripleNet is displayed in Figure 2. The model has a bottom-up architecture that organizes the computation from the char level to the context level. At each level, we first use the hierarchical representation module to construct the representations of the context, response, and query. Then the triple attention mechanism is applied to update the representations. Finally, the model matches them centered on the response and fuses the results for prediction.
In the hierarchical representation module, we represent the conversation in four perspectives including char, word, utterance, and context. In the char-level, a convolutional neural network (CNN) is applied to the embedding matrix of each word and produces the embedding of the word by convolution and maxpooling operations as the char-level representation. In word-level, we use a shared LSTM layer to obtain the word-level embedding for each word. After that, we use self-attention to encode the representation of each utterance into a vector which is the utterance-level representation. At last, the utterance-level representation of each utterance is fed into another LSTM layer to further model the information among different utterances, forming the context-level representation.
The structure of the triple attention mechanism is shown in the right part of Figure 2. We first design a bi-directional attention function (BAF) to calculate the attention between two sequences and output their new representations. To model the relationships within the triple <C, Q, R>, we apply BAF to each pair within the triple, which yields two new representations for each element; we then add them together as that element's final attention-based representation. In this way, the representation of each element is updated based on the attention with the other two simultaneously, and each element participates in the whole calculation in the same way.
At first, we embed the characters of each word into fixed-size vectors and use a CNN followed by max-pooling to get character-derived embeddings for each word, which can be formulated by
$e^{c} = \max_{j} \tanh\big(W_k\,[c_j; \ldots; c_{j+s_k-1}] + b_k\big)$
where $W_k$ and $b_k$ are parameters, $[c_j; \ldots; c_{j+s_k-1}]$ refers to the concatenation of the embeddings of $(c_j, \ldots, c_{j+s_k-1})$, $s_k$ is the window size of the $k$-th filter, and $e^{c}$ is the representation of the word at the char level.
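A minimal numpy sketch of this char-level encoder (ours, not the paper's Keras implementation; shapes and initialization are illustrative): one convolutional filter bank with window size $s_k$ slides over the character embeddings of a word, and max-pooling over positions yields a fixed-size vector:

```python
import numpy as np

# Minimal numpy sketch of the char-level encoder: a 1-D convolution over the
# character embeddings of one word, followed by max-pooling over positions.
# Shapes and initialization are illustrative, not the paper's exact values.

def char_cnn(char_emb, W, b):
    """char_emb: (L, d_c) char embeddings of one word;
    W: (s, d_c, f) filters with window size s; b: (f,).
    Returns an (f,)-dim char-level word representation."""
    L, d_c = char_emb.shape
    s, _, f = W.shape
    conv = np.stack([
        np.tanh(np.tensordot(char_emb[j:j + s], W, axes=([0, 1], [0, 1])) + b)
        for j in range(L - s + 1)
    ])                                   # (L - s + 1, f) feature maps
    return conv.max(axis=0)              # max-pool over positions

rng = np.random.default_rng(0)
emb = rng.standard_normal((7, 8))        # a 7-char word, char embedding dim 8
W = rng.standard_normal((3, 8, 200)) * 0.1   # window 3, 200 filters (as in §4)
b = np.zeros(200)
e_char = char_cnn(emb, W, b)
```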
Word-level Representation. Furthermore, we embed each word with pre-trained word vectors, and we also introduce a word-matching feature (MF) into the embedding to make the model more sensitive to co-occurring words: if a word appears in the response and in the context or query simultaneously, we set the feature to 1, otherwise to 0.
$e_{i,j} = [e^{w}_{i,j};\ e^{c}_{i,j};\ mf_{i,j}]$
where $e_{i,j}$ denotes the embedding representation, $e^{w}_{i,j}$ is the pre-trained word embedding, $e^{c}_{i,j}$ is the character-derived embedding, and $mf_{i,j}$ is the matching feature. We use a shared bi-directional LSTM to get contextual word representations in each utterance, the query, and the response. The representation of each word is formed by concatenating the forward and backward LSTM hidden outputs.
$h_{i,j} = [\overrightarrow{h}_{i,j};\ \overleftarrow{h}_{i,j}]$
where $h_{i,j}$ is the representation of the word. We denote the word-level representation of the context as $C^{w}$ and that of the response as $R^{w}$, where $d$ is the hidden dimension of the Bi-LSTMs. Up to now, we have constructed the representations of the context, query, and response at the char and word levels; we only represent the latter two at these two levels because they do not have contextual information as rich as the context's.
Utterance-level Representation. Given the utterance $U_i$, we construct the utterance-level representation by self-attention Lin et al. (2017):
$\alpha_{i,j} = \mathrm{softmax}\big(w_2 \tanh(W_1 h_{i,j})\big), \qquad u_i = \sum_{j} \alpha_{i,j}\, h_{i,j}$
where $W_1$ and $w_2$ are trainable weights, $d_a$ is a hyperparameter, $u_i$ is the utterance-level representation, and $\alpha_{i,j}$ is the attention weight for the $j$-th word in the $i$-th utterance, which signifies the importance of the word in the utterance.
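The self-attention pooling of Lin et al. (2017) can be sketched in numpy as follows (our simplification of the paper's setup; the single-vector form with weights $W_1$ and $w_2$ is assumed):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def self_attn_pool(H, W1, w2):
    """H: (m, 2d) word representations of one utterance;
    W1: (2d, d_a), w2: (d_a,).
    Returns (utterance vector of shape (2d,), attention weights)."""
    scores = np.tanh(H @ W1) @ w2        # one score per word, shape (m,)
    alpha = softmax(scores)              # normalized attention over words
    return alpha @ H, alpha              # weighted sum of word representations

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 6))          # 5 words, representation dim 6
u, alpha = self_attn_pool(H, rng.standard_normal((6, 4)), rng.standard_normal(4))
```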
Context-level Representation. To further model the continuity and contextual information among the utterances, we feed the utterance-level representations into another bi-directional LSTM layer to obtain the representation of each utterance from the context perspective.
$u^{ctx}_i = \mathrm{BiLSTM}(u_i)$
where $u^{ctx}_i$ is the context-level representation of the $i$-th utterance in the context, and its dimension is twice the output size of the Bi-LSTM.
In this part, we update the representations of the context, query, and response at each level by triple attention, whose motivation is to model the latent relationships within <C, Q, R>.
Given the triple <C, Q, R>, we feed each of its pairs into the bi-directional attention function (BAF) and sum the two results for each element, for example:
$C' = \mathrm{BN}\big(\mathrm{BAF}(C, Q) + \mathrm{BAF}(C, R)\big)$
Here $\mathrm{BN}$ denotes the batch normalization layer Ioffe and Szegedy (2015), which helps prevent vanishing or exploding gradients. $\mathrm{BAF}$ produces the new representations of two sequences $(P, Q)$ by attention in two directions, inspired by Seo et al. (2016). We can formulate it by
$A_{pq} = \mathrm{softmax}(P Q^{\top}), \quad A_{qp} = \mathrm{softmax}(Q P^{\top}), \quad P' = A_{pq} Q, \quad Q' = A_{qp} P$
where $A_{pq}$ and $A_{qp}$ are the attention between $P$ and $Q$ in the two directions, $P'$ and $Q'$ are the new representations of the two sequences, and we apply a batch normalization layer to them as well.
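The pairwise BAF and its combination into triple attention can be sketched as below (a numpy simplification of ours; the dot-product similarity and the omission of batch normalization are assumptions made for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def baf(P, Q):
    """Bi-directional attention between P (m, d) and Q (n, d):
    returns attention-updated P' (m, d) and Q' (n, d)."""
    S = P @ Q.T                              # (m, n) similarity matrix
    return softmax(S, axis=1) @ Q, softmax(S.T, axis=1) @ P

def triple_attention(C, Q, R):
    """Update each element from the attention with the other two,
    symmetrically; output dimensions match the inputs."""
    C_q, Q_c = baf(C, Q)
    C_r, R_c = baf(C, R)
    Q_r, R_q = baf(Q, R)
    return C_q + C_r, Q_c + Q_r, R_c + R_q   # sum the two pairwise updates

rng = np.random.default_rng(2)
C = rng.standard_normal((4, 5))
Q = rng.standard_normal((3, 5))
R = rng.standard_normal((2, 5))
C2, Q2, R2 = triple_attention(C, Q, R)
```

Because the outputs keep the input dimensions, several such layers could be stacked, matching the "unchanged dimension" property described below.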
We find that triple attention has some interesting properties: (1) triple: the representation of each element in the triple is updated based on the attention to the other two concurrently; (2) symmetrical: each element in the triple plays the same role in the structure because their contents are similar in the whole conversation; (3) unchanged dimension: all the outputs of triple attention have the same dimensions as the inputs, so we can stack multiple layers as needed.
Triple Matching. We match the triple <C, Q, R> at each level with the cosine distance, using the new representations produced by triple attention. This process centers on the response because it is our target. For example, at the char level, we match the triple by
$M^{char}_{cr} = \mathrm{maxpool}\big(\cos(R'^{c}, C'^{c})\big), \qquad M^{char}_{qr} = \mathrm{maxpool}\big(\cos(R'^{c}, Q'^{c})\big)$
where the primed symbols are the representations updated by triple attention and $M^{char}$ is the char-level matching result. The word level matches the triple in the same way, and the utterance and context levels match the triple without the max-pooling operation. We use $M^{word}$, $M^{utt}$, and $M^{ctx}$ as the matching results at the word, utterance, and context levels.
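The cosine matching step can be sketched as follows (ours; a plain pairwise cosine-similarity matrix between response representations and those of the context or query):

```python
import numpy as np

def cosine_match(R, X, eps=1e-12):
    """Cosine similarity between every response word representation in
    R (m, d) and every representation in X (k, d); returns an (m, k) matrix."""
    Rn = R / (np.linalg.norm(R, axis=1, keepdims=True) + eps)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    return Rn @ Xn.T

rng = np.random.default_rng(3)
R = rng.standard_normal((2, 5))
M = cosine_match(R, R)       # self-match: the diagonal should be ~1
```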
Fusion. After obtaining the four-level matching matrices, we use a hierarchical RNN to extract highly abstract features. Firstly, we concatenate the four matrices to form a 3D cube $M = [M^{char}; M^{word}; M^{utt}; M^{ctx}]$, in which each row denotes the matching result for one word of the response at the four levels. We feed the rows of the cube into an LSTM and merge the results from the different time steps of the LSTM outputs by a max-pooling operation. In this way, we encode the matching result into a single feature vector $v$.
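The fusion step can be sketched as follows (ours; a vanilla RNN stands in for the paper's hierarchical LSTM, and flattening each row of the cube into one input vector is a simplifying assumption):

```python
import numpy as np

def fuse(match_levels, Wx, Wh, b):
    """match_levels: four (m, k) matching matrices (char/word/utt/context).
    Stack them into an (m, k, 4) cube, run a simple RNN over the m response
    words, and max-pool the hidden states into one feature vector."""
    cube = np.stack(match_levels, axis=-1)       # (m, k, 4)
    x = cube.reshape(cube.shape[0], -1)          # one input vector per word
    h = np.zeros(Wh.shape[0])
    hs = []
    for t in range(x.shape[0]):                  # vanilla RNN over time steps
        h = np.tanh(Wx @ x[t] + Wh @ h + b)
        hs.append(h)
    return np.max(hs, axis=0)                    # max-pool over time steps

rng = np.random.default_rng(4)
levels = [rng.standard_normal((3, 4)) for _ in range(4)]   # toy sizes
Wx = rng.standard_normal((5, 16)) * 0.1
Wh = rng.standard_normal((5, 5)) * 0.1
v = fuse(levels, Wx, Wh, np.zeros(5))
```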
Final Prediction. For the final prediction, we feed the vector $v$ into a fully-connected layer with a sigmoid output activation.
$g(C, Q, R) = \sigma(w^{\top} v + b)$
where $w$ and $b$ are trainable weights. Our goal is to predict the matching score between the context, query, and candidate response, which can be seen as a binary classification task. To train our model, we minimize the cross-entropy loss between the prediction and the ground truth.
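The final scoring and training objective reduce to a sigmoid output with binary cross-entropy, sketched here (ours; parameter names are illustrative):

```python
import numpy as np

def predict(v, w, b):
    """Matching score g(C, Q, R) = sigmoid(w . v + b) for fused features v."""
    return 1.0 / (1.0 + np.exp(-(w @ v + b)))

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy between the ground truth and the prediction."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

score = predict(np.zeros(4), np.zeros(4), 0.0)   # zero weights give 0.5
```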
(Table of dataset statistics for the Ubuntu Dialogue Corpus and the Douban Conversation Corpus.)
Experimental results on two public dialogue datasets. The table is segmented into three sections: non-attention models, attention-based models, and our models. Italics denote the previous best results, and scores in bold denote the new state-of-the-art results for a single model without any pre-training layer.
We first evaluate our model on the Ubuntu Dialogue Corpus Lowe et al. (2015) because it is the largest public multi-turn dialogue corpus, consisting of about one million conversations in a specific domain. To reduce the number of unknown words, we use the shared copy of the Ubuntu corpus from Xu et al. (2017), which replaces numbers, paths, and URLs with special symbols (https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntudata.zip). Furthermore, to verify the generalization of our model, we also carry out experiments on the Douban Conversation Corpus Wu et al. (2017), which shares a similar format with the Ubuntu corpus but is open-domain and in Chinese.
We implement our model in Keras Chollet and others (2015) with the TensorFlow backend. In the embedding layer, the word embeddings are pre-trained on the training set via GloVe Pennington et al. (2014), and their weights are trainable. For char embedding, we set the kernel size to 3 and the filter number to 200 in the CNN layer. For all the bi-directional LSTM layers, we set the hidden size to 200. We use Adamax Kingma and Ba (2014) for weight updating, with an initial learning rate of 0.002. For ensemble models, we generate 6 models for each corpus using different random seeds and merge the results by voting.
For better comparison with the baseline models, the main hyperparameters of TripleNet, such as the embedding size, the maximum length of each turn, and the vocabularies, are the same as those of the baseline models. The maximum number of conversation turns, which varies across models, is 12 in our model, 9 in DAM Zhou et al. (2018), and 10 in SMN Wu et al. (2017).
We divide the baseline models into two categories for comparison.
Non-Attention Models. The majority of previous works on this task are designed without attention mechanisms, including the Sequential Matching Network (SMN) Wu et al. (2017), the Multi-View model Zhou et al. (2016), Deep Learning to Respond (DL2R) Yan et al. (2016), Match-LSTM Wang and Jiang (2016), MV-LSTM Wan et al. (2016), and DualEncoder Lowe et al. (2015).
The overall results on the two datasets are shown in Table 1. Our results are clearly better than those of the recent attention-based model DAM on both datasets, exceeding it by 2.3% in @1 on Ubuntu and by 2.6% on Douban. Furthermore, our scores significantly exceed DUA in almost all metrics except @5 on Douban, which may be because that metric is not very stable, as the Douban test set is very small (1,000 samples).
To further improve the performance, we utilize pre-trained ELMo Peters et al. (2018) and fine-tune it on the training set for Ubuntu, while we train ELMo from scratch on the Douban training set. As the baselines on the Douban corpus are relatively lower, we observe much bigger improvements on that corpus when using ELMo. The model ensemble further improves upon the single model with ELMo; the @1 score on Ubuntu is close to the average performance of human experts at 83.8 Lowe et al. (2016).
Compared to non-attention models such as SMN and Multi-View, which match the context and response at two levels, TripleNet shows substantial improvements: on @1 for the Ubuntu corpus, there is a 6.3% absolute improvement over SMN and 12.8% over Multi-View, showing the effectiveness of triple attention.
To better demonstrate the effectiveness of TripleNet, we conduct ablations on the Ubuntu corpus because of its larger data size.
We first remove the triple attention and matching parts (-TAM); the result shows a marked decline (2.4% in @1), as shown in the second part of Table 2. The performance of this model is similar to the baseline model DAM, which indicates that our four-level hierarchical representation may play a role similar to the five stacked Transformer layers in DAM. We then remove only the triple attention part, which means we match the pairs <C, R> and <Q, R> with their original representations at each level; the @1 score drops 1.4%, which shows the effect of triple attention. We also remove all the parts related to the query (-Query), meaning that the attention and matching parts are calculated only within the pair <C, R>. It is worth mentioning that the information of the query is still contained at the end of the context. The performance again drops markedly (1.6% in @1), which shows that it is necessary to model the query separately. To find out which subsection of these parts is more important, we remove each of them in turn.
Triple attention matching ablation. As we can see in the third part of Table 2, when the attention between context and response is removed (-A), the largest decrease (0.6% in @1) appears, which indicates that the relationship between context and response is the most important one in the triple. Removing the attention in the other two pairs, <C, Q> and <Q, R>, leads to slighter performance drops (0.3% and 0.5% in @1), which may be because they overlap with each other in updating the representations of the triple.
When we remove the matching between context and response, the performance of the model drops markedly (2.1% in @1), which shows that the relationship within <C, R> is the basis for selecting the response. Removing the query-response matching also leads to a significant decline, which shows that we should pay more attention to the query within the whole context.
Hierarchical representation ablation. To find out which level's calculation is most important, we also try removing each level's calculation from the hierarchical representation module, as shown in the fourth part of Table 2. To our surprise, when we remove the char-level (-char) and context-level calculations (-context), the reduction (0.5% in @1) is more significant than for the other two, indicating that we should pay more attention to the lowest- and highest-level information. Removing either of the other two levels also leads to a significant reduction from TripleNet, which means each level is indispensable for TripleNet.
From the experiments in this part, we find that removing any single subsection of the hierarchical representation module only leads to a slight performance drop. This may be because the representation from each level represents the conversation from a unique and indispensable perspective, while the information conveyed by the different representations has some overlap.
Running our model on the case in Figure 1, we find that TripleNet chooses the true response. To analyze in detail how triple attention works, we take the word-level attention as an example and visualize it in Figure 3. As there are many words in the context, we only use the second utterance in the upper part of Figure 1, for its relatively rich semantics.
In the query-context attention, the query mainly attends to the keyword 'package,' which helps capture the topic of the conversation. The attention of the context focuses on the word 'a,' which is near the key phrase 'deb file'; this may be because the representation of the word captures some information from nearby words through the Bi-LSTM. In the query-response attention, the result shows that the attention of the query mainly focuses on the word 'format,' which is the most important word in the response. However, we can also see that the response does not catch the important words in the query. In the response-context attention, the response attends most to the word 'binary,' which is another important word in the context.
From the three maps, we find that each attention can catch some important information but also miss some useful information. If we join the information in the query-context and response-context attention, we can catch the most important information in the context; furthermore, the query-response attention helps catch the most important word in the response. So it is natural for TripleNet to select the right response, because the model integrates the three attentions together.
In this section, we discuss the importance of different utterances in the context. To find this out, we conduct an experiment that removes each utterance in turn, using the (-Query) model from the ablation experiments, because that model treats all the utterances, including the query, in the same way. For each experiment in this part, we remove the $i$-th utterance in the context in both the training and evaluation processes and report the decrease in performance in Figure 4. We find that removing the query leads to the most significant decline (more than 6% in @1), which indicates that the query is much more important than any other utterance. Furthermore, the decrease is stable for earlier utterances and rises rapidly over the last 3 utterances, from which we can deduce that the last three utterances are more important than the other ones.
From the whole result, we can conclude that it is better to model the query separately than to treat all the utterances in the same way, given their significantly different importance; we also find that we should pay more attention to the utterances near the query, because they are more important.
In this paper, we propose TripleNet for multi-turn response selection. We model the context from the low (character) to the high (context) level, update the representations by triple attention within <C, Q, R>, match the triple centered on the response, and fuse the matching results with a hierarchical LSTM for prediction. Experimental results show that the proposed model achieves state-of-the-art results on both the Ubuntu and Douban corpora, which range from a specific domain to the open domain and from English to Chinese, demonstrating the effectiveness and generalization of our model. In the future, we will apply the proposed triple attention mechanism to other NLP tasks to further test its extensibility.
We would like to thank all anonymous reviewers for their hard work on reviewing and providing valuable comments on our paper. We also would like to thank Yunyi Anderson for proofreading our paper thoroughly. This work is supported by National Key R&D Program of China via grant 2018YFC0832100.
Higashinaka et al. (2014). Towards an open-domain conversational system fully based on natural language processing. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 928–939.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.