Data and Code on Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems (SIGIR 2018)
Intelligent personal assistant systems with either text-based or voice-based conversational interfaces are becoming increasingly popular around the world. Retrieval-based conversation models have the advantage of returning fluent and informative responses. Most existing studies in this area are on open domain "chit-chat" conversations or task / transaction oriented conversations. More research is needed for information-seeking conversations. There is also a lack of modeling external knowledge beyond the dialog utterances among current conversational models. In this paper, we propose a learning framework on top of deep neural matching networks that leverages external knowledge for response ranking in information-seeking conversation systems. We incorporate external knowledge into deep neural models with pseudo-relevance feedback and QA correspondence knowledge distillation. Extensive experiments with three information-seeking conversation data sets, including both open benchmarks and commercial data, show that our methods outperform various baseline methods, including several deep text matching models and the state-of-the-art method on response selection in multi-turn conversations. We also perform analysis over different response types, model variations and ranking examples. Our models and research findings provide new insights on how to utilize external knowledge with deep neural models for response selection and have implications for the design of the next generation of information-seeking conversation systems.
Personal assistant systems, such as Apple Siri, Google Now, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used (for example, over 100M installations of Google Now (Google, http://bit.ly/1wTckVs); 15M sales of Amazon Echo (GeekWire, http://bit.ly/2xfZAgX); more than 141M monthly users of Microsoft Cortana (Windowscentral, http://bit.ly/2Dv6TVT)). These systems, with either text-based or voice-based conversational interfaces, are capable of voice interaction, information search, question answering and voice control of smart devices. This trend has led to an interest in developing conversational search systems, where users would be able to ask questions to seek information with conversation interactions. Research on speech and text-based conversational search has also recently attracted significant attention in the information retrieval (IR) community.
Existing approaches to building conversational systems include generation-based methods (Ritter et al., 2011; Shang et al., 2015) and retrieval-based methods (Ji et al., 2014; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017). Compared with generation-based methods, retrieval-based methods have the advantages of returning fluent and informative responses. Most work on retrieval-based conversational systems studies response ranking for single-turn conversation (Wang et al., 2013), which only considers a current utterance for selecting responses. Recently, several researchers have been studying multi-turn conversation (Yan et al., 2016a; Zhou et al., 2016; Wu et al., 2017; Yan et al., 2017), which considers the previous utterances of the current message as the conversation context to select responses by jointly modeling context information, current input utterance and response candidates. However, existing studies are still suffering from the following weaknesses:
(1) Most existing studies are on open domain chit-chat conversations or task / transaction oriented conversations. Most current work (Ritter et al., 2011; Shang et al., 2015; Ji et al., 2014; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017) looks at open domain chit-chat conversations as in microblog data like Twitter and Weibo. There is some research on task oriented conversations (Young et al., 2010; Wen et al., 2017; Bordes et al., 2017), where there is a clear goal to be achieved through conversations between the human and the agent. However, the typical applications and data are related to completing transactions like ordering from a restaurant or booking a flight ticket. Much less attention has been paid to information oriented conversations, which are referred to as information-seeking conversations in this paper. Information-seeking conversations, where the agent is trying to satisfy the information needs of the user through conversation interactions, are closely related to conversational search systems. More research is needed on response selection in information-seeking conversation systems.
(2) Lack of modeling external knowledge beyond the dialog utterances. Most research on response selection in conversation systems purely models the matching patterns between the user input message (either with context or not) and response candidates, which ignores external knowledge beyond the dialog utterances. Similar to Web search, information-seeking conversations could be associated with massive external data collections that contain rich knowledge that could be useful for response selection. This is especially critical for information-seeking conversations, since there may not be enough signals in the current dialog context and candidate responses to discriminate a good response from a bad one, due to the wide range of topics for user information needs. An obvious research question is how to utilize external knowledge effectively for response ranking. This question has not been well studied, despite the potential benefits for the development of information-seeking conversation systems.
To address these research issues, we propose a learning framework on top of deep neural matching networks that leverages external knowledge for response ranking in information-seeking conversation systems. We study two different methods on integrating external knowledge into deep neural matching networks as follows:
(1) Incorporating external knowledge via pseudo-relevance feedback. Pseudo-relevance feedback (PRF) has been proven effective in improving the performance of many retrieval models (Lavrenko and Croft, 2001; Lv and Zhai, 2009; Zamani et al., 2016; Zhai and Lafferty, 2001; Rocchio, 1971; Cao et al., 2008). The motivation of PRF is to assume a certain number of top-ranked documents from the initial retrieval run to be relevant and use these feedback documents to improve the original query representation. For conversation response ranking, many candidate responses are much shorter compared with conversation context, which could have negative impacts on deep neural matching models. Inspired by the key idea of PRF, we propose using the candidate response as a query to run a retrieval round on a large external collection. Then we extract useful information from the (pseudo) relevant feedback documents to enrich the original candidate response representation.
(2) Incorporating external knowledge via QA correspondence knowledge distillation. Previous neural ranking models enhanced the performance of retrieval models such as BM25 and QL, which mainly rely on lexical match information, via modeling semantic match patterns in text (Guo et al., 2016; Huang et al., 2013; Mitra et al., 2017). For response ranking in information-seeking conversations, the match patterns between candidate responses and conversation context can be quite different from the well studied lexical and semantic matching. Consider the sample utterance and response from the conversations in the Microsoft Answers community (https://answers.microsoft.com/) shown in Table 1. A Windows user posed a question about the Windows update failure on “restart install”. An expert replied with a response pointing to a potential cause, “Norton leftovers”. The match signals between the problem “restart install” and the cause “Norton leftovers” may not be captured by simple lexical and semantic matching. To derive such match patterns, we need to rely on external knowledge to distill QA correspondence information. We propose to extract the “correspondence” regularities between question and answer terms from retrieved external QA pairs. We define this type of match pattern as a “correspondence match”, which will be incorporated into deep matching networks as external knowledge to help response selection in information-seeking conversations.
|QA Dialog Title: Windows Update Failure|
|Dialog Tags: Windows, Windows 10, Windows update, recovery, backup, PC|
|USER: I have Windows10, version 1511, OS Build 10586.1106. For the past year I have tried to upgrade from this without success. Upgrade download OK but on installing only get to 85 - 93% and then on restart install previous version of windows (the 1511 version), I have Windows update assistant installed. Any help or advice on this would be most welcome.|
|AGENT: James (Microsoft MVP - Windows Client). Response: There’s not a doubt in my mind that those Norton “leftovers” is your troublemaker here - but now that the Norton Removal Tool has been deprecated and especially since the new-fangled Norton Remove and Reinstall tool doesn’t get rid of the leftovers, a manual upgrade or a clean install of Microsoft Win10 appears to be your only possible resolution here. Feel free to give Norton/Symantec a piece of your mind!|
|Term Match: Magenta; Semantic Match: Blue; Correspondence Match: Red|
We conduct extensive experiments with three information-seeking conversation data sets: the MSDialog data, which contains crawled customer service dialogs from the Microsoft Answers community; a popular benchmark data set, the Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015); and another commercial customer service data set, AliMe, from Alibaba group. We compare our methods with various deep text matching models and the state-of-the-art baseline on response selection in multi-turn conversations. Our methods outperform all baseline methods regarding a variety of metrics.
Our contributions can be summarized as follows:
(1) Focusing on information-seeking conversations and building a new benchmark data set. We target information-seeking conversations to push the boundaries of conversational search models. To this end, we created a new information-seeking conversation data set, MSDialog, on technical support dialogs of Microsoft products and released it to the research community. (The MSDialog dataset can be downloaded from https://ciir.cs.umass.edu/downloads/msdialog. We also released our source code at https://github.com/yangliuy/NeuralResponseRanking.)
(2) Integrating external knowledge into deep neural matching networks for response ranking. We propose a new response ranking paradigm for multi-turn conversations by incorporating external knowledge into the matching process of dialog context and candidate responses. Under this paradigm, we design two different methods with pseudo relevance feedback and QA correspondence knowledge distillation to integrate external knowledge into deep neural matching networks for response ranking.
(3) Extensive experimental evaluation on benchmark / commercial data sets and promising results. Experimental results with three different information-seeking conversation data sets show that our methods outperform various baseline methods including the state-of-the-art method on response selection in multi-turn conversations. We also perform analysis over different response types, model variations and ranking examples to provide insights.
Our work is related to research on conversational search, neural conversational models and neural ranking models.
Conversational Search. Conversational search has received significant attention with the emergence of conversational devices in recent years. Radlinski and Craswell described the basic features of conversational search systems (Radlinski and Craswell, 2017). Thomas et al. (Thomas et al., 2017) released the Microsoft Information-Seeking Conversation (MISC) data set, which contains information-seeking conversations with a human intermediary, in a setup designed to mimic software agents such as Siri or Cortana. But this data set is quite small (in terms of the number of dialogs) for the training of neural models. Based on state-of-the-art advances in machine reading, Kenter and de Rijke (Kenter and de Rijke, 2017) adopted a conversational search approach to question answering. Beyond conversational search models, researchers have also studied the medium of conversational search. Arguello et al. (Arguello et al., 2017) studied how the medium (e.g., voice interaction) affects user requests in conversational search. Spina et al. studied the ways of presenting search results over speech-only channels to support conversational search (Spina et al., 2017; Trippas et al., 2015). Yang et al. (Yang et al., 2017) investigated predicting the new question that the user will ask given the past conversational context. Our research targets the response ranking of information-seeking conversations, with deep matching networks and integration of external knowledge.
Neural Conversational Models. (... 2015; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017; Li et al., 2016a, b; Sordoni et al., 2015; Bordes et al., 2017). Existing work includes retrieval-based methods (Wu et al., 2017; Zhou et al., 2016; Yan et al., 2016a, 2017; Ji et al., 2014; Lowe et al., 2015) and generation-based methods (Shang et al., 2015; Tian et al., 2017; Ritter et al., 2011; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016b; Bordes et al., 2017; Dhingra et al., 2017; Qiu et al., 2017). Sordoni et al. (Sordoni et al., 2015) proposed a neural network architecture for response generation that is both context-sensitive and data-driven, utilizing the Recurrent Neural Network Language Model architecture. Our work is a retrieval-based method. There is some research on multi-turn conversations with retrieval-based methods. Wu et al. (Wu et al., 2017) proposed a sequential matching network that matches a response with each utterance in the context on multiple levels of granularity to distill important matching information. The main difference between our work and their research is that we consider external knowledge beyond dialog context for multi-turn response selection. We show that incorporating external knowledge with pseudo-relevance feedback and QA correspondence knowledge distillation is important and effective for response selection.
Neural Ranking Models. Recently a number of neural ranking models have been proposed for information retrieval, question answering and conversation response ranking. These models can be classified into three categories (Guo et al., 2016). The first category is the representation focused models. These models first learn the representations of queries and documents separately and then calculate the similarity score of the learned representations with functions such as cosine, dot, bilinear or tensor layers. A typical example is the DSSM (Huang et al., 2013) model, which is a feed forward neural network with a word hashing phase as the first layer to predict the click probability given a query string and a document title. The second category is the interaction focused models, which build a query-document term pairwise interaction matrix to capture the exact matching and semantic matching information between the query-document pairs. The interaction matrix is then fed into deep neural networks, which could be a CNN (Hu et al., 2014; Pang et al., 2016; Yu et al., 2018) or a term gating network with histogram or value shared weighting mechanism (Guo et al., 2016; Yang et al., 2016), to generate the final ranking score. Finally, the neural ranking models in the third category combine the ideas of the representation focused models and interaction focused models to jointly learn the lexical matching and semantic matching between queries and documents (Mitra et al., 2017; Yu et al., 2018). The deep matching networks used in our research belong to the interaction focused models, due to their better performance on a variety of text matching tasks compared with representation focused models (Hu et al., 2014; Pang et al., 2016; Guo et al., 2016; Yang et al., 2016; Wu et al., 2017; Xiong et al., 2017). We study different ways to build the interaction matching matrices to capture the matching patterns in term spaces, sequence structures and external knowledge signals between dialog context utterances and response candidates.
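The research problem of response ranking in information-seeking conversations is defined as follows. We are given an information-seeking conversation data set $\mathcal{D} = \{(\mathcal{U}_i, \mathcal{R}_i, \mathcal{Y}_i)\}_{i=1}^{N}$, where $\mathcal{U}_i = \{u_i^1, u_i^2, \ldots, u_i^{t-1}, u_i^t\}$, in which $\{u_i^1, u_i^2, \ldots, u_i^{t-1}\}$ is the dialog context and $u_i^t$ is the input utterance in the $t$-th turn. $\mathcal{R}_i = \{r_i^1, \ldots, r_i^M\}$ and $\mathcal{Y}_i = \{y_i^1, \ldots, y_i^M\}$ are a set of response candidates and the corresponding binary labels, where $y_i^k = 1$ denotes that $r_i^k$ is a true response for $\mathcal{U}_i$; otherwise $y_i^k = 0$. In order to integrate external knowledge, we are also given an external collection $\mathcal{E}$, which is related to the topics discussed in conversation $\mathcal{U}$. Our task is to learn a ranking model $f(\cdot)$ with $\mathcal{D}$ and $\mathcal{E}$. For any given $\mathcal{U}_i$, the model should be able to generate a ranking list for the candidate responses $\mathcal{R}_i$ with $f(\cdot)$. The external collection $\mathcal{E}$ could be any massive text corpus. In our paper, $\mathcal{E}$ consists of historical QA posts in the Stack Overflow data dump (https://stackoverflow.com/) for MSDialog, the AskUbuntu data dump (https://askubuntu.com/) for the Ubuntu Dialog Corpus, and product QA pairs for the AliMe data.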
|Notation||Description|
|$\mathcal{D}$||The conversation data set used for training/validation/testing|
|$\mathcal{E}$||The collection for the retrieval and distillation of external knowledge|
|$u_i^t, \mathcal{U}_i, \mathcal{U}$||The $t$-th utterance of the $i$-th dialog, all utterances of the $i$-th dialog and the set of all dialog utterances|
|$r_i^k, \mathcal{R}_i, \mathcal{R}$||The $k$-th response candidate for the $i$-th dialog, all response candidates of the $i$-th dialog and the set of all candidate responses|
|$r_i^{k'}$||The $k$-th expanded response candidate for the $i$-th dialog|
|$y_i^k, \mathcal{Y}$||The label for the $k$-th response candidate for the $i$-th dialog and the set of all labels|
|$f(\cdot)$||The ranking model learnt with $\mathcal{D}$ and $\mathcal{E}$|
|$f(\mathcal{U}_i, r_i^k)$||The predicted matching score between $\mathcal{U}_i$ and $r_i^k$|
|$N$||The total number of dialogs in $\mathcal{D}$|
|$M$||The total number of response candidates for $\mathcal{U}_i$|
|$W$||The number of expanded words in response candidates|
|$\theta$||The language model constructed from the pseudo relevance feedback document set for response candidate expansion|
|$P, \mathcal{P}$||The number of top ranked QA posts retrieved from $\mathcal{E}$ and the top ranked QA post set|
|$l_r, l_u$||The length of a response candidate and the length of an utterance|
|$d$||The number of dimensions of word embedding vectors|
|$\mathbf{M}_1, \mathbf{M}_2, \mathbf{M}_3$||Interaction matrices between dialog utterance $u_i^t$ and candidate response $r_i^k$ or $r_i^{k'}$ for word embedding similarity, sequence hidden representation similarity and QA correspondence matching similarity|
|$m_{1,i,j}$||The $(i,j)$-th element in the interaction matrix $\mathbf{M}_1$|
|$c$||The window size for the utterances in dialog context, which is the maximal number of previous utterances modeled|
In the following sections, we describe the proposed learning framework built on the top of deep matching networks and external knowledge for response ranking in information-seeking conversations. A summary of key notations in this work is presented in Table 2. In general, there are three modules in our learning framework:
(1) Information retrieval (IR) module: Given the information-seeking conversation data $\mathcal{D}$ and the external QA text collection $\mathcal{E}$, this module retrieves a small relevant set of QA pairs $\mathcal{P}$ from $\mathcal{E}$ with the response candidates $\mathcal{R}$ as the queries. These retrieved QA pairs $\mathcal{P}$ become the source of external knowledge.
(2) External knowledge extraction (KE) module: Given the retrieved QA pairs $\mathcal{P}$ from the IR module, this module extracts useful information as term distributions, term co-occurrence matrices or other forms as external knowledge.
(3) Deep matching network (DMN) module: This module models the extracted external knowledge from $\mathcal{P}$, the dialog utterances $\mathcal{U}_i$ and the response candidate $r_i^k$ to learn the matching pattern, over which it accumulates and predicts a matching score $f(\mathcal{U}_i, r_i^k)$ for $\mathcal{U}_i$ and $r_i^k$.
We explore two different implementations under this learning framework as follows: 1) Incorporating external knowledge into deep matching networks via pseudo-relevance feedback (DMN-PRF). The architecture of DMN-PRF model is presented in Figure 1. 2) Incorporating external knowledge via QA correspondence knowledge distillation (DMN-KD). The architecture of DMN-KD model is presented in Figure 2. We will present the details of these two models in Section 3.3 and Section 3.4.
We adopt different QA text collections for different conversation data (e.g. Stack Overflow data for MSDialog, AskUbuntu for UDC). The statistics of these external collections are shown in Table 3. We download the data dumps for Stack Overflow and AskUbuntu from archive.org (https://archive.org/download/stackexchange). We index the QA posts in Stack Overflow from the most recent two years and all the QA posts in AskUbuntu. Then we use the response candidate $r_i^k$ as the query to retrieve the top $P$ QA posts $\mathcal{P}$ with BM25 as the source for external knowledge, where $P$ is a fixed setting in our experiments.
|Collection||Stack Overflow||AskUbuntu|
|Time span||2 years||7 years|
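The retrieval step described above can be sketched in plain Python. This is only an illustrative BM25 scorer over tokenized posts, not the indexing setup used in the experiments. The function name `bm25_top_posts`, the defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_top_posts(query_tokens, posts_tokens, top_p=10, k1=1.2, b=0.75):
    """Return indices of the top_p posts ranked by BM25 score for the query.

    query_tokens: the tokenized response candidate used as the query.
    posts_tokens: list of tokenized QA posts (the external collection).
    Posts with a zero score (no shared term) are not returned.
    """
    n_posts = len(posts_tokens)
    avg_len = sum(len(post) for post in posts_tokens) / n_posts
    # Document frequency: number of posts containing each term.
    doc_freq = Counter()
    for post in posts_tokens:
        doc_freq.update(set(post))

    scored = []
    for idx, post in enumerate(posts_tokens):
        term_freq = Counter(post)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative (an assumed variant).
            idf = math.log(1.0 + (n_posts - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(post) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scored.append((score, idx))

    # Highest score first; ties broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored[:top_p] if score > 0.0]
```

In practice a search engine index would replace the linear scan; the sketch only shows how a response candidate acts as the query.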
The motivation of Pseudo-Relevance Feedback (PRF) is to extract terms from the top-ranked documents in the first retrieval results to help discriminate relevant documents from irrelevant ones (Cao et al., 2008). The expansion terms are extracted either according to the term distributions (e.g. extract the most frequent terms) or from the most specific terms (e.g. extract terms with the maximal IDF weights) in feedback documents. Given the retrieved top $P$ QA posts $\mathcal{P}$ from the previous step, we compute a language model $\theta = p(w \mid \mathcal{P})$ using $\mathcal{P}$. Then we extract the most frequent $W$ terms from $\theta$ as expansion terms for response candidate $r_i^k$ and append them at the end of $r_i^k$, where $W$ is a fixed setting in our experiments. For the query $r_i^k$, we perform several preprocessing steps including tokenization, punctuation removal and stop word removal. QA posts in both Stack Overflow and AskUbuntu have two fields: “Body” and “Title”. We choose to search the “Body” field since we found it more effective in experiments.
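The expansion step can be illustrated with a short sketch. It assumes the feedback posts are already tokenized, uses raw term counts as the language model, and uses a tiny hypothetical stop word list; the name `expand_response` and the parameter `num_terms` are not from the paper.

```python
from collections import Counter

# A tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "is", "to", "of", "and"}


def expand_response(response_tokens, feedback_posts, num_terms=3):
    """Append the most frequent feedback terms to a response candidate.

    response_tokens: tokenized response candidate (the original query).
    feedback_posts: tokenized top-ranked posts from the retrieval step.
    num_terms: number of expansion terms to append.
    """
    counts = Counter()
    for post in feedback_posts:
        counts.update(token for token in post if token not in STOP_WORDS)
    # most_common sorts by count; equal counts keep first-seen order.
    expansion = [term for term, _ in counts.most_common(num_terms)]
    return list(response_tokens) + expansion
```

Whether terms already present in the response should be skipped is a design choice the text leaves open; this sketch appends them as they come.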
The expanded response candidates and dialog contexts will be modeled by a deep neural matching network. Given an expanded response $r_i^{k'}$ and an utterance $u_i^t$ in the context $\mathcal{U}_i$, the model first looks up a global embedding dictionary to represent $r_i^{k'}$ and $u_i^t$ as two sequences of embedding vectors $\mathbf{E}(r_i^{k'}) = [\mathbf{e}_{r,1}, \mathbf{e}_{r,2}, \ldots, \mathbf{e}_{r,l_r}]$ and $\mathbf{E}(u_i^t) = [\mathbf{e}_{u,1}, \mathbf{e}_{u,2}, \ldots, \mathbf{e}_{u,l_u}]$, where $\mathbf{e}_{r,i} \in \mathbb{R}^d$ and $\mathbf{e}_{u,i} \in \mathbb{R}^d$ are the embedding vectors of the $i$-th word in $r_i^{k'}$ and $u_i^t$ respectively. Given these two word embedding sequences, there are two different methods to learn matching patterns: representation focused methods and interaction focused methods (Guo et al., 2016). Here we adopt the interaction focused methods due to their better performance over a number of text matching tasks (Hu et al., 2014; Pang et al., 2016; Wan et al., 2016; Yang et al., 2016). Specifically, the model builds two interaction matrices with $\mathbf{E}(r_i^{k'})$ and $\mathbf{E}(u_i^t)$: a word pairwise similarity matrix $\mathbf{M}_1$ and a sequence hidden representation similarity matrix $\mathbf{M}_2$. $\mathbf{M}_1$ and $\mathbf{M}_2$ will be two input channels of a convolutional neural network (CNN) to learn important matching features, which will be aggregated by the final BiGRU layer and a multi-layer perceptron (MLP) to generate a matching score.
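Specifically, in input channel one, $\forall i, j$, the element $m_{1,i,j}$ in $\mathbf{M}_1$ is defined by $m_{1,i,j} = \mathbf{e}_{r,i}^{\top} \cdot \mathbf{e}_{u,j}$, where $\mathbf{e}_{r,i}$ and $\mathbf{e}_{u,j}$ are the embedding vectors of the $i$-th response word and the $j$-th utterance word. $\mathbf{M}_1$ models the word pairwise similarity between $r_i^{k'}$ and $u_i^t$ via the dot product similarity between the embedding representations.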
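As a small illustration of the word pairwise similarity channel, the sketch below builds the dot-product interaction matrix from two lists of embedding vectors. It uses plain Python lists for self-containment; the function names are hypothetical, and the row/column orientation (response words as rows, utterance words as columns) is an assumption of this example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_similarity_matrix(response_vecs, utterance_vecs):
    """Build the matrix whose (i, j) entry is the dot product between the
    embedding of the i-th response word and the j-th utterance word."""
    return [[dot(r, u) for u in utterance_vecs] for r in response_vecs]
```

For a response of two words and an utterance of three words, the result is a 2 x 3 matrix, which is the shape of one CNN input channel for that utterance-response pair.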
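For input channel two, we first employ bidirectional gated recurrent units (BiGRU) (Chung et al., 2014) to encode $r_i^{k'}$ and $u_i^t$ into two hidden representations. A BiGRU consists of two GRUs that run in opposite directions on the sequence $\mathbf{E}(r_i^{k'})$: a forward GRU processing the sequence as it is ordered, and a backward GRU processing the sequence in its reverse order. These two GRUs generate two sequences of hidden states $(\overrightarrow{\mathbf{h}}_1, \ldots, \overrightarrow{\mathbf{h}}_{l_r})$ and $(\overleftarrow{\mathbf{h}}_1, \ldots, \overleftarrow{\mathbf{h}}_{l_r})$. The BiGRU then concatenates the forward and the backward hidden states to form the final hidden vectors for $r_i^{k'}$ as $\mathbf{h}_{r,i} = [\overrightarrow{\mathbf{h}}_i, \overleftarrow{\mathbf{h}}_i]$, $i = 1, \ldots, l_r$. More specifically, $\forall i$, the hidden state vector $\overrightarrow{\mathbf{h}}_i$ is calculated by the following formulas:
$$
\begin{aligned}
\mathbf{z}_i &= \sigma(\mathbf{W}_z \mathbf{e}_{r,i} + \mathbf{U}_z \overrightarrow{\mathbf{h}}_{i-1} + \mathbf{b}_z) \\
\mathbf{r}_i &= \sigma(\mathbf{W}_r \mathbf{e}_{r,i} + \mathbf{U}_r \overrightarrow{\mathbf{h}}_{i-1} + \mathbf{b}_r) \\
\tilde{\mathbf{h}}_i &= \tanh(\mathbf{W}_h \mathbf{e}_{r,i} + \mathbf{U}_h (\mathbf{r}_i \circ \overrightarrow{\mathbf{h}}_{i-1}) + \mathbf{b}_h) \\
\overrightarrow{\mathbf{h}}_i &= (1 - \mathbf{z}_i) \circ \overrightarrow{\mathbf{h}}_{i-1} + \mathbf{z}_i \circ \tilde{\mathbf{h}}_i
\end{aligned}
\tag{1}
$$
where $\mathbf{z}_i$ and $\mathbf{r}_i$ are an update gate and a reset gate respectively. $\mathbf{e}_{r,i}$ and $\overrightarrow{\mathbf{h}}_i$ are the input and hidden state output of the network at time step $i$. $\mathbf{W}_z, \mathbf{W}_r, \mathbf{W}_h, \mathbf{U}_z, \mathbf{U}_r, \mathbf{U}_h$ and $\mathbf{b}_z, \mathbf{b}_r, \mathbf{b}_h$ are parameter matrices and bias vectors to be learned. The backward hidden state $\overleftarrow{\mathbf{h}}_i$ is computed in a similar way according to Equation 1. The hidden vectors $\mathbf{h}_{u,j}$ for the dialog utterance $u_i^t$ can be obtained with the same procedure. Given the hidden vectors of $r_i^{k'}$ and $u_i^t$, we calculate the element $m_{2,i,j}$ in the sequence hidden representation similarity matrix $\mathbf{M}_2$ by $m_{2,i,j} = \mathbf{h}_{r,i}^{\top} \cdot \mathbf{h}_{u,j}$. The BiGRU models the neighbor context information around words from two directions and encodes the text sequences into hidden vectors. Thus $\mathbf{M}_2$ matches $r_i^{k'}$ and $u_i^t$ with local sequence structures such as phrases or text segments.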
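The bidirectional encoding can be sketched in plain Python as below. This is a didactic version of a standard GRU step with update and reset gates, assuming the common convention in which the new state interpolates between the previous state and the candidate state. The parameter dictionary layout and the helper `zero_params` are hypothetical, and a real model would use a deep learning library with trained weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Compute W x + U h + b using plain lists (one output per row)."""
    return [
        sum(w * xi for w, xi in zip(W[k], x))
        + sum(u * hi for u, hi in zip(U[k], h))
        + b[k]
        for k in range(len(b))
    ]


def gru_step(x, h_prev, p):
    """One GRU step: update gate z, reset gate r, candidate state, new state."""
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def bigru_encode(seq, fwd_params, bwd_params, hidden_size):
    """Run a forward and a backward GRU and concatenate states per position."""
    h = [0.0] * hidden_size
    forward = []
    for x in seq:
        h = gru_step(x, h, fwd_params)
        forward.append(h)
    h = [0.0] * hidden_size
    backward = []
    for x in reversed(seq):
        h = gru_step(x, h, bwd_params)
        backward.append(h)
    backward.reverse()  # align backward states with the original positions
    return [f + b for f, b in zip(forward, backward)]


def zero_params(input_size, hidden_size):
    """Hypothetical helper: all-zero parameters, handy for sanity checks."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = [[0.0] * input_size for _ in range(hidden_size)]
        params["U" + gate] = [[0.0] * hidden_size for _ in range(hidden_size)]
        params["b" + gate] = [0.0] * hidden_size
    return params
```

With all-zero parameters the update gate equals 0.5 and the candidate state is zero, so each step simply halves the previous state, which is an easy way to check the gate arithmetic.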
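The interaction matrices $\mathbf{M}_1$ and $\mathbf{M}_2$ are then fed into a CNN to learn high level matching patterns as features. The CNN alternates convolution and max-pooling operations over these input channels. Let $\mathbf{z}^{(l,k)}$ denote the output feature map of the $l$-th layer and $k$-th kernel; the model performs convolution operations and max-pooling operations according to the following equations.
Convolution. Let $r_w^{(l+1,k)} \times r_h^{(l+1,k)}$ denote the shape of the $k$-th convolution kernel in the $(l+1)$-th layer. The convolution operation can be defined as:
$$
z_{i,j}^{(l+1,k)} = \sigma\left(\sum_{k'=0}^{K_l - 1} \sum_{s=0}^{r_w^{(l+1,k)} - 1} \sum_{t=0}^{r_h^{(l+1,k)} - 1} w_{k',s,t}^{(l+1,k)} \cdot z_{i+s,j+t}^{(l,k')} + b^{(l+1,k)}\right)
$$
where $\sigma$ is the activation function, and $w^{(l+1,k)}$ and $b^{(l+1,k)}$ are the parameters of the $k$-th kernel on the $(l+1)$-th layer to be learned. $K_l$ is the number of kernels on the $l$-th layer.
Max Pooling. Let $p_w^{(l+1,k)} \times p_h^{(l+1,k)}$ denote the shape of the $k$-th pooling kernel in the $(l+1)$-th layer. The max pooling operation can be defined as:
$$
z_{i,j}^{(l+1,k)} = \max_{0 \le s < p_w^{(l+1,k)}} \; \max_{0 \le t < p_h^{(l+1,k)}} z_{i+s,j+t}^{(l,k)}
$$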
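The two operations can be sketched for a single input channel in plain Python. The sketch assumes a ReLU activation, no padding, a convolution stride of one and non-overlapping pooling windows; these choices and the function names are assumptions made for the example rather than settings confirmed by the text.

```python
def conv2d_relu(matrix, kernel, bias=0.0):
    """Valid 2D convolution (no padding, stride 1) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias
            for s in range(kh):
                for t in range(kw):
                    total += kernel[s][t] * matrix[i + s][j + t]
            row.append(max(0.0, total))  # ReLU
        out.append(row)
    return out


def max_pool2d(matrix, ph, pw):
    """Max pooling over non-overlapping ph x pw windows."""
    out = []
    for i in range(0, len(matrix) - ph + 1, ph):
        row = []
        for j in range(0, len(matrix[0]) - pw + 1, pw):
            row.append(max(matrix[i + s][j + t] for s in range(ph) for t in range(pw)))
        out.append(row)
    return out
```

Applied to an interaction matrix, the convolution picks up local matching patterns between neighboring word pairs, and pooling keeps the strongest signal in each region.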
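Given the output feature representation vectors learned by the CNN for utterance-response pairs $(r_i^{k'}, u_i^t)$, we add another BiGRU layer to model the dependency and temporal relationship of utterances in the conversation according to Equation 1, following previous work (Wu et al., 2017). The output hidden states $\mathbf{H}_c = [\mathbf{h}'_1, \ldots, \mathbf{h}'_c]$ will be concatenated as a vector and fed into a multi-layer perceptron (MLP) to calculate the final matching score $f(\mathcal{U}_i, r_i^{k'})$ as
$$
f(\mathcal{U}_i, r_i^{k'}) = \sigma_2\left(\mathbf{w}_2^{\top} \cdot \sigma_1(\mathbf{w}_1^{\top} \mathbf{H}_c + \mathbf{b}_1) + \mathbf{b}_2\right)
$$
where $\mathbf{w}_1, \mathbf{w}_2, \mathbf{b}_1, \mathbf{b}_2$ are model parameters. $\sigma_1$ and $\sigma_2$ are tanh and softmax functions respectively.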
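For model training, we consider a pairwise ranking learning setting. The training data consists of triples $(\mathcal{U}_i, r_i^{k+}, r_i^{k-})$, where $r_i^{k+}$ and $r_i^{k-}$ denote the positive and the negative response candidate for dialog utterances $\mathcal{U}_i$. Let $\Theta$ denote all the parameters of our model. The pairwise ranking-based hinge loss function is defined as:
$$
\mathcal{L}(\mathcal{D}, \mathcal{E}; \Theta) = \sum_{i=1}^{I} \max\left(0, \epsilon - f(\mathcal{U}_i, r_i^{k+}) + f(\mathcal{U}_i, r_i^{k-})\right) + \lambda \lVert \Theta \rVert_2^2
$$
where $I$ is the total number of triples in the training data $\mathcal{D}$. $\lambda \lVert \Theta \rVert_2^2$ is the regularization term, where $\lambda$ denotes the regularization coefficient. $\epsilon$ denotes the margin in the hinge loss. The parameters of the deep matching network are optimized using back-propagation with the Adam algorithm (Kingma and Ba, 2014).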
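A pairwise hinge loss of this kind can be written in a few lines. The sketch below assumes the loss is summed over triples and that the regularizer is a squared L2 norm over a flat list of parameters; the function name and argument names are hypothetical.

```python
def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0, params=(), reg_coeff=0.0):
    """Sum of hinge terms over (positive, negative) score pairs plus L2 penalty.

    pos_scores[i] and neg_scores[i] are the model scores of the positive and
    negative response candidate of the i-th training triple.
    """
    hinge = sum(max(0.0, margin - pos + neg) for pos, neg in zip(pos_scores, neg_scores))
    l2_penalty = sum(w * w for w in params)
    return hinge + reg_coeff * l2_penalty
```

A triple contributes zero loss once the positive candidate outscores the negative one by at least the margin, so training effort concentrates on pairs that are still ranked wrongly or too closely.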
In addition to the DMN-PRF model presented in Section 3.3, we also propose another model for incorporating external knowledge into conversation response ranking via QA correspondence knowledge distillation, which is referred to as the DMN-KD model in this paper. The architecture of the DMN-KD model is presented in Figure 2. Compared with DMN-PRF, the main difference is that the CNN of DMN-KD runs on an additional input channel $\mathbf{M}_3$, denoted as blue matrices in Figure 2, which captures the correspondence matching patterns of utterance terms and response terms in relevant external QA pairs retrieved from $\mathcal{E}$. Specifically, we first use the response candidate $r_i^k$ as the query to retrieve a set of relevant QA pairs $\mathcal{P}$. (Note that we want QA pairs here instead of question posts or answer posts, since we would like to extract QA term co-occurrence information with these QA pairs.) Suppose $\mathcal{P} = \{(Q_1, A_1), (Q_2, A_2), \ldots, (Q_P, A_P)\}$, where $(Q_p, A_p)$ denotes the $p$-th QA pair. Given a response candidate $r_i^k$ and a dialog utterance $u_i^t$ in dialog $\mathcal{U}_i$, the model computes the term co-occurrence information as the Positive Pointwise Mutual Information (PPMI) of words of $r_i^k$ and $u_i^t$ in the retrieved QA pair set $\mathcal{P}$. Let $[w_{r,1}, \ldots, w_{r,l_r}]$ and $[w_{u,1}, \ldots, w_{u,l_u}]$ denote the word sequences in $r_i^k$ and $u_i^t$. We construct a QA term correspondence matching matrix $\mathbf{M}_3$ as the third input channel of the CNN for $r_i^k$ and $u_i^t$ with the PPMI statistics from $\mathcal{P}$. More specifically, $\forall i, j$, the element $m_{3,i,j}$ in $\mathbf{M}_3$ is computed as
$$
m_{3,i,j} = \mathrm{PPMI}(w_{r,i}, w_{u,j} \mid \mathcal{P}) = \max\left(0, \log \frac{p(w_{r,i}, w_{u,j} \mid \mathcal{P})}{p(w_{r,i} \mid \mathcal{P}) \cdot p(w_{u,j} \mid \mathcal{P})}\right)
$$
where $w_{r,i}$ and $w_{u,j}$ denote the $i$-th word in the response candidate and the $j$-th word in the dialog utterance. The intuition is that the PPMI between $w_{r,i}$ and $w_{u,j}$ in the top retrieved relevant QA pair set $\mathcal{P}$ could encode the correspondence matching patterns between $w_{r,i}$ and $w_{u,j}$ in external relevant QA pairs. Thus $\mathbf{M}_3$ is the extracted QA correspondence knowledge from the external collection $\mathcal{E}$ for $r_i^k$ and $u_i^t$. This correspondence matching knowledge captures relationships such as “(Problem Descriptions, Solutions)”, “(Symptoms, Causes)”, “(Information Request, Answers)”, etc. in the top ranked relevant QA pair set $\mathcal{P}$. It helps the model better discriminate a good response candidate from a bad response candidate given the dialog context utterances. To compute the co-occurrence count between $w_{r,i}$ and $w_{u,j}$, we count all word co-occurrences considering $Q_p$ and $A_p$ as bags of words, as we found this setting more effective in experiments.
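The correspondence channel can be sketched as follows. The estimation details are assumptions made for illustration: each retrieved QA pair is treated as two bags of words, an answer word and a question word co-occur once per pair that contains both, and probabilities are relative frequencies over all counted co-occurrences. Response words are looked up on the answer side and utterance words on the question side. The function names are hypothetical.

```python
import math
from collections import Counter


def ppmi_table(qa_pairs):
    """Compute PPMI for (answer word, question word) pairs.

    qa_pairs: list of (question_tokens, answer_tokens) tuples.
    Returns a dict mapping (answer_word, question_word) to its PPMI value.
    """
    pair_counts = Counter()
    for question, answer in qa_pairs:
        for a_word in set(answer):
            for q_word in set(question):
                pair_counts[(a_word, q_word)] += 1
    total = sum(pair_counts.values())
    # Marginal counts of each word on its own side.
    a_counts, q_counts = Counter(), Counter()
    for (a_word, q_word), count in pair_counts.items():
        a_counts[a_word] += count
        q_counts[q_word] += count
    table = {}
    for (a_word, q_word), count in pair_counts.items():
        pmi = math.log((count * total) / (a_counts[a_word] * q_counts[q_word]))
        table[(a_word, q_word)] = max(0.0, pmi)  # keep only positive PMI
    return table


def correspondence_matrix(response_tokens, utterance_tokens, table):
    """Entry (i, j) is the PPMI of the i-th response word and j-th utterance word;
    pairs never seen together get 0."""
    return [[table.get((r, u), 0.0) for u in utterance_tokens] for r in response_tokens]
```

With QA pairs about failed installs answered by mentions of Norton, the pair ("norton", "restart") receives a positive value even though the two words share neither surface form nor meaning, which is the kind of signal the correspondence channel is meant to add.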
We evaluated our method with three data sets: Ubuntu Dialog Corpus (UDC), MSDialog, and AliMe data consisting of a set of customer service conversations in Chinese from Alibaba.
|Data||UDC||UDC||UDC||MSDialog||MSDialog||MSDialog||AliMe||AliMe||AliMe|
|Items||Train||Valid||Test||Train||Valid||Test||Train||Valid||Test|
|# C-R pairs||1000k||500k||500k||173k||37k||35k||51k||6k||6k|
|# Cand. per C||2||10||10||10||10||10||15||15||15|
|# + Cand. per C||1||1||1||1||1||1||2.9||2.8||2.9|
|Min # turns per C||1||2||1||2||2||2||2||2||2|
|Max # turns per C||19||19||19||11||11||11||3||3||3|
|Avg # turns per C||10.1||10.1||10.1||5.0||4.9||4.4||2.4||2.1||2.2|
|Avg # words per C||116||116||116||271||263||227||38||35||34|
|Avg # words per R||22.1||22.1||22.1||66.7||67.6||66.8||4.9||4.7||4.6|
The Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015) contains multi-turn technical support conversation data collected from the chat logs of the Freenode Internet Relay Chat (IRC) network. We used the data copy shared by Xu et al. (Xu et al., 2016), in which numbers, URLs and paths are replaced by special placeholders. It is also used in several previous related works (Wu et al., 2017). (The data can be downloaded from https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu%20data.zip?dl=0.) It consists of 1 million context-response pairs for training, 0.5 million pairs for validation and 0.5 million pairs for testing. The statistics of this data are shown in Table 4. The positive response candidates in this data come from the true responses by humans, and negative response candidates are randomly sampled.
In addition to UDC, we also crawled another technical support conversation data set from the Microsoft Answers community, which is a QA forum on topics about a variety of Microsoft products. We first crawled dialogs about different categories of Microsoft products including “Windows”, “IE”, “Office”, “Skype”, “Surface”, “Xbox”, etc. (Note that some categories are more fine-grained, such as “Outlook_Calendar”, “Outlook_Contacts”, “Outlook_Email”, “Outlook_Messaging”, etc.) Then we filtered out dialogs whose number of turns falls outside a fixed range. After that we split the data into training/validation/testing partitions by time. Specifically, the training data contains dialogs from “2005-11-12” to “2017-08-20”. The validation data contains dialogs from “2017-08-21” to “2017-09-20”. The testing data contains dialogs from “2017-09-21” to “2017-10-04”.
The next step is to generate the dialog context and response candidates. For each dialog, we assigned the “User” label to the first participant, who posed the question leading to this information-seeking conversation, and the “Agent” label to the other participants who provided responses. The “Agent” in our data could be Microsoft customer service staff, a Microsoft MVP (Most Valuable Professional) or a user from the Microsoft Answers community. Then for each utterance by the “User” (we consider the utterances by the user except the first utterance, since there is no associated dialog context with it), we collected up to $c$ previous utterances as the dialog context, limited by the total number of utterances before it. The true response by the “Agent” becomes the positive response candidate. For the negative response candidates, we adopted negative sampling to construct them following previous work (Wan et al., 2016; Lowe et al., 2015; Wu et al., 2017). For each dialog context, we first used the true response as the query to retrieve the top-ranked results from the whole response set of agents with BM25. Then we randomly sampled responses from them to construct the negative response candidates. The statistics of the MSDialog data are presented in Table 4. For data preprocessing, we performed tokenization and punctuation removal. Then we removed stop words and performed word stemming. For neural models, we also removed words that appear fewer than a threshold number of times in the whole corpus.
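The negative sampling procedure can be sketched as below. The sketch assumes some retriever has already ranked the agent responses for the true response used as the query; `top_k` and `num_negatives` are hypothetical parameter names, and their values in the experiments are not given here.

```python
import random


def sample_negatives(true_response, ranked_responses, top_k, num_negatives, seed=None):
    """Sample negative candidates from the top_k retrieved responses.

    ranked_responses: responses ordered by retrieval score for the query
    (the true response), best first. The true response itself is excluded.
    """
    pool = [resp for resp in ranked_responses[:top_k] if resp != true_response]
    rng = random.Random(seed)
    return rng.sample(pool, min(num_negatives, len(pool)))
```

Sampling from retrieved responses rather than from the whole corpus yields negatives that share vocabulary with the true response, which makes the ranking task harder than with purely random negatives.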
We collected the chat logs between customers and a chatbot AliMe from “2017-10-01” to “2017-10-20” in Alibaba. The chatbot is built based on a question-to-question matching system (Li et al., 2017), where for each query, it finds the most similar candidate question in a QA database and returns its answer as the reply. (Interested readers can access AliMe Assist through the Taobao App, or the web version via https://consumerservice.taobao.com/online-help.) It indexes all the questions in our QA database using Lucene (https://lucene.apache.org/core/). For each given query, it uses the TF-IDF ranking algorithm to call back candidates. To form our data set, we concatenated utterances within three turns (the majority (around ) of conversations in the data set are within 3 turns) to form a query, and used the chatbot system to call back the top-K (we set K=15) most similar candidate questions as candidate “responses” (a “response” here is a question in our system). We then asked a business analyst to annotate the candidate responses, where a “response” is labeled as positive if it matches the query, otherwise negative. In all, we have annotated 63,000 context-response pairs, where we use 51,000 for training, 6,000 for testing, and 6,000 for validation, as shown in Table 4. Note that we have included human evaluation in AliMe data. Furthermore, if the confidence score of answering a given user query is low, the system will prompt three top related questions for users to choose. We collected such user click logs as our external data, where we treat the clicked question as positive and the others as negative. We collected 510,000 clicked questions with answers from the click logs in total as the source of external knowledge.
We consider different types of baselines for comparison, including traditional retrieval models, deep text matching models and the state-of-the-art multi-turn conversation response ranking method, as follows:
BM25. This method uses the dialog context as the query to retrieve response candidates for response selection. We use the BM25 model (Robertson and Walker, 1994) as the retrieval model.
ARC-II. ARC-II is an interaction-focused deep text matching architecture proposed by Hu et al. (Hu et al., 2014), which is built directly on the interaction matrix between the dialog context and response candidates. A CNN runs over the interaction matrix to learn the matching representation score.
MV-LSTM. MV-LSTM (Wan et al., 2016) is a neural text matching model that matches two sequences with multiple positional representations learned by a Bi-LSTM layer.
DRMM. DRMM (Guo et al., 2016) is a deep relevance matching model for ad-hoc retrieval. We implemented a variant of DRMM for short text matching. Specifically, the matching histogram is replaced by a top-k max pooling layer and the remaining part is the same as the original model.
Duet. Duet (Mitra et al., 2017) is the state-of-the-art deep text matching model that jointly learns local lexical matching and global semantic matching between the two text sequences.
SMN. Sequential Matching Network (SMN) (Wu et al., 2017) is the state-of-the-art deep neural architecture for multi-turn conversation response selection. It matches a response candidate with each utterance in the context on multiple levels of granularity and then adopts a CNN to distill matching features. We used the TensorFlow (https://www.tensorflow.org/) implementation of SMN shared by the authors (Wu et al., 2017). (The reported SMN results with the code from the authors are on the raw data sets of UDC and MSDialog without any oversampling of negative training data.)
We also consider a degenerated version of our model, denoted as DMN, where we do not incorporate external knowledge via pseudo-relevance feedback or QA correspondence knowledge distillation. Finally, we consider a baseline BM25-PRF, where we incorporate external knowledge into BM25 by matching conversation context with the expanded responses as in Section 3.3.2 using BM25 model.
For the evaluation metrics, we adopted mean average precision (MAP), Recall@1, Recall@2, and Recall@5 following previous related work (Wu et al., 2017; Lowe et al., 2015). For UDC and MSDialog, MAP is equivalent to the mean reciprocal rank (MRR) since there is only one positive response candidate per dialog context. For AliMe data, each dialog context could have more than one positive response candidate.
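The metrics above can be written down compactly. The following is a minimal sketch using the standard definitions, where each dialog context is represented by the list of 0/1 labels of its candidates in ranked order; the function names are chosen for this example. It also makes the MAP/MRR equivalence concrete: with a single positive candidate, average precision reduces to the reciprocal of that candidate's rank.

```python
def average_precision(ranked_labels):
    """Average precision for one context; labels are 1 (positive) or 0, in ranked order."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """Reciprocal of the rank of the first positive candidate (0.0 if there is none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_labels, k):
    """Fraction of the positive candidates that appear in the top k positions."""
    positives = sum(ranked_labels)
    return sum(ranked_labels[:k]) / positives if positives else 0.0


def mean_over_contexts(metric, all_ranked_labels, **kwargs):
    """Average a per-context metric over all dialog contexts."""
    values = [metric(labels, **kwargs) for labels in all_ranked_labels]
    return sum(values) / len(values) if values else 0.0
```

When a context has several positives, as in the AliMe data, average precision and reciprocal rank generally differ, which is why MAP is the reported metric there.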
All models were implemented with TensorFlow and the MatchZoo toolkit (https://github.com/faneshion/MatchZoo). Hyper-parameters are tuned with the validation data. For the hyper-parameter settings of DMN-KD and DMN-PRF models, we set the window size of the convolution and pooling kernels as . The number of convolution kernels is for UDC and for MSDialog. The dimension of the hidden states of BiGRU layer is set as for UDC and for MSDialog . The dropout rate is set as for UDC and for MSDialog . All models are trained on a single Nvidia Titan X GPU by stochastic gradient descent with the Adam (Kingma and Ba, 2014) algorithm. The initial learning rate is . The parameters of Adam, and are and respectively. The batch size is for UDC and for MSDialog. The maximum utterance length is for UDC and for MSDialog. The maximum conversation context length is set as following previous work (Wu et al., 2017). We padded zeros if the number of utterances in a context is less than . Otherwise the most recent utterances will be kept. For DMN-PRF, we retrieved top QA posts and extracted terms as response expansion terms. For DMN-KD, we retrieved top question posts with accepted answers. For the word embeddings used in our experiments, we trained word embeddings with the Word2Vec tool (Mikolov et al., 2013) with the Skip-gram model using our training data. The max skip length between words and the number of negative examples is set as and respectively. The dimension of word vectors is . Word embeddings will be initialized by these pre-trained word vectors and updated during the training process.
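The context padding and truncation rule can be sketched as follows. The text only states that contexts with too few utterances are zero-padded and that otherwise the most recent utterances are kept. Placing the padding before the real utterances, clipping long utterances to their first tokens, and the parameter names `max_turns`, `max_len`, and `pad_id` are assumptions made for this example.

```python
def pad_context(utterances, max_turns, max_len, pad_id=0):
    """Return exactly max_turns utterances, each with exactly max_len token ids.

    utterances: list of token-id lists, ordered from oldest to most recent.
    max_turns must be a positive integer.
    """
    # Keep only the most recent max_turns utterances.
    recent = utterances[-max_turns:]
    fixed = []
    for utt in recent:
        clipped = utt[:max_len]  # assumption: long utterances keep their first tokens
        fixed.append(clipped + [pad_id] * (max_len - len(clipped)))
    # Assumption: all-zero utterances are added in front of the real ones.
    padding = [[pad_id] * max_len for _ in range(max_turns - len(fixed))]
    return padding + fixed
```

For example, `pad_context([[1, 2, 3], [4]], max_turns=3, max_len=2)` yields one all-zero utterance followed by the two real ones, each clipped or padded to two tokens.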
We present evaluation results over different methods on UDC and MSDialog in Table 5. We summarize our observations as follows: (1) The DMN-PRF model outperforms all the baseline methods including traditional retrieval models, deep text matching models and the state-of-the-art SMN model for response ranking on both conversation datasets. The results demonstrate that candidate response expansion with pseudo-relevance feedback could improve the ranking performance of responses in conversations. The main difference between the DMN-PRF model and the SMN model is the information extracted from retrieved feedback QA posts as external knowledge. This indicates the importance of modeling external knowledge with pseudo-relevance feedback beyond the dialog context for response selection. (2) The DMN-KD model also outperforms all the baseline methods on MSDialog and UDC. These results show that the extracted QA correspondence matching knowledge could help the model select better responses. The performances of DMN-KD and DMN-PRF are very close. (3) If we compare the performances of DMN-PRF and DMN-KD with the degenerated model DMN, we can see that incorporating external knowledge via both pseudo-relevance feedback and QA correspondence knowledge distillation could improve the performance of the deep neural networks for response ranking by large margins. For example, the improvement of DMN-PRF against DMN on UDC is for MAP, for Recall@5, for Recall@1, for Recall@2 respectively. The differences are statistically significant with measured by the Student’s paired t-test.
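For readers who want to reproduce this kind of comparison, the two quantities mentioned above (relative improvement and the paired t-test statistic) can be computed as in the sketch below. It uses the textbook definitions over per-context metric values; the function names are chosen for this example and no numbers from the paper are assumed.

```python
import math
import statistics


def relative_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def paired_t_statistic(scores_a, scores_b):
    """Student's paired t statistic and degrees of freedom.

    scores_a and scores_b are per-context metric values of two systems,
    aligned so that position i refers to the same dialog context in both.
    Requires at least two pairs and non-constant differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be aligned and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1
```

The resulting t value is then compared against the Student's t distribution with the returned degrees of freedom to obtain a p-value.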
We further compare our models with the competing methods on AliMe data in Table 5. We find that: (1) our DMN model has comparable results in terms of MAP when compared with SMN, but has better Recall; (2) DMN-KD shows comparable or better results than all the baseline methods; (3) DMN-PRF significantly outperforms other competing baselines which shows the effectiveness of adding external pseudo-relevance feedback to the task; (4) both DMN-PRF and DMN-KD show better results than DMN, which demonstrates the importance of incorporating external knowledge via both pseudo-relevance feedback and QA correspondence knowledge distillation.
We conduct fine-grained analysis on the performance of different models on different response types. We annotated the user intents in MSDialog utterances using Amazon Mechanical Turk (https://www.mturk.com/). We defined user intent types including several types related to “questions” (original question, follow-up question, information request, clarifying question, etc.), “answers” (potential answer and further details), “gratitude” (expressing thanks, greetings) and “feedback” (positive feedback and negative feedback). Then we trained a Random Forest classifier with TF-IDF features and applied this classifier to predict the response candidate types in the testing data of MSDialog. The dialog contexts were grouped by the type of the true response candidate. Finally we computed the average Recall@1 over different groups. Figure 3 shows the results. We find that both DMN-KD and DMN-PRF improve on the performance of SMN for responses with type “questions”, “answers” and “gratitude”. This indicates that incorporating external knowledge with PRF or QA correspondence knowledge distillation can help the model select better responses, especially for QA related responses. For responses with type “feedback”, DMN-KD and DMN-PRF achieved similar performance compared with SMN.
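The grouping step of this analysis can be sketched as follows, assuming the response type of the true response has already been predicted for each test context. The input layout (a type string paired with the 0/1 labels of the ranked candidates) is an assumption made for this example.

```python
from collections import defaultdict


def recall_at_1_by_type(examples):
    """Average Recall@1 per response type.

    examples: iterable of (response_type, ranked_labels) pairs, where
    ranked_labels holds the 0/1 labels of the candidates in ranked order
    and each context has exactly one positive candidate.
    """
    hits_by_type = defaultdict(list)
    for response_type, ranked_labels in examples:
        # With one positive per context, Recall@1 is 1 if it is ranked first.
        top_is_positive = bool(ranked_labels) and ranked_labels[0] == 1
        hits_by_type[response_type].append(1.0 if top_is_positive else 0.0)
    return {t: sum(v) / len(v) for t, v in hits_by_type.items()}
```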
We investigate the effectiveness of different components of DMN-PRF and DMN-KD by removing them one by one from the original model with UDC and MSDialog data. We also study the effectiveness of different interaction types for . Table 6 shows the results. We summarize our observations as follows: 1) For the interaction matrices, we find that the performance will drop if we remove any one of for DMN-PRF or for DMN-KD. This indicates that all of word level interaction matching, sequence level interaction matching and external QA correspondence interaction matching are useful for response selection in information-seeking conversation. 2) For interaction types, we find that dot product is the best setting on both UDC and MSDialog except for the results of DMN-KD on MSDialog. The next best one is cosine similarity. Bilinear product is the worst, especially on MSDialog data. This is because bilinear product introduces a transformation matrix as an additional model parameter, leading to higher model complexity. Thus the model is more likely to overfit the training data, especially for the relatively small MSDialog data. 3) If we only leave one channel in the interaction matrices, we find that is more powerful than for DMN-PRF. For DMN-KD, is also the best one, followed by . is the last one, but it still adds additional matching signals when it is combined with and . The matching signals from the external collection could be supplementary features to the word embedding based matching matrix and BiGRU representation based matching matrix .
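The three interaction types compared above can be sketched in plain Python as follows. This is an illustrative sketch rather than the model code: vectors are lists of floats, the choice of response terms as rows and utterance terms as columns is an assumption, and the `weight` argument stands for the additional transformation matrix that the bilinear product introduces.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def bilinear(u, v, weight):
    """Bilinear product u^T W v, where weight is a len(u) x len(v) matrix."""
    return sum(
        u[i] * weight[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )


def interaction_matrix(response_vecs, utterance_vecs, kind="dot", weight=None):
    """Pairwise interaction scores between response and utterance term vectors."""
    if kind == "dot":
        score = dot
    elif kind == "cosine":
        score = cosine
    elif kind == "bilinear":
        if weight is None:
            raise ValueError("bilinear interaction needs a weight matrix")

        def score(u, v):
            return bilinear(u, v, weight)
    else:
        raise ValueError("unknown interaction type: " + kind)
    return [[score(r, u) for u in utterance_vecs] for r in response_vecs]
```

Dot product and cosine similarity add no parameters, whereas the bilinear form adds one d × d matrix for d-dimensional vectors, which is the extra capacity the overfitting argument above refers to.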
We further analyze the impact of the conversation context length on the performances of our proposed DMN-KD and DMN-PRF models. As presented in Figure 4, we find that the performance first increases and then decreases as the conversation context length increases. The reason for these trends is that the context length controls the available previous utterances in the dialog context modeled by DMN-KD and DMN-PRF. If the context length is too small, there would not be enough information for the model to learn the matching patterns between the context and response candidates. However, setting the context length too large will also bring noise into the model results, since the words in utterances a few turns ago could be very different due to the topic changes during conversations.
|Context||[User] I open Excel and it automatically formats my dates into American formatting. I have changed and saved the formatting to NZ style. However everytime I pull the document out of office 365 it reverts back to the American format. How do I stop this ? [Agent] Is it one file or all files in Excel ? [User] It does seem to be all Excel files. How do I change the global date format setting ?|
|Method||Top-1 Ranked Response|
|SMN||0||Go to Settings ->System ->Tablet Mode….Change setting as indicated in the snapshot below.|
|DMN-KD||1||That is a Windows setting. Go to Control Panel >Regional settings. This will change date settings for all applications.|
|DMN-PRF||1||That is a Windows setting. Go to Control Panel >Regional settings. This will change date settings for all applications.|
We perform a case study in Table 7 on the top ranked responses by different methods including SMN, DMN-KD and DMN-PRF. In this example, both DMN-KD and DMN-PRF produced correct top ranked responses. We checked the QA posts retrieved for the correct response candidate and found that “settings, regional, change, windows, separator, format, excel, panel, application” are the most frequent terms. Among them “excel” is especially useful for promoting the rank of the correct response candidate, since this term, which appears multiple times in the dialog context, does not actually appear in the raw text of the correct response candidate. This gives an example of the effectiveness of incorporating external knowledge from the retrieved QA posts into response candidates.
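The effect seen in this case study can be illustrated with a simplified form of response expansion. The sketch below picks expansion terms by raw frequency in the retrieved posts, which is a stand-in chosen for illustration and not necessarily the term selection method used by DMN-PRF. The toy stop word list, the whitespace tokenizer, and the function names are likewise assumptions.

```python
from collections import Counter

# Toy stop word list for illustration only.
STOP_WORDS = {"the", "a", "to", "is", "in", "of", "and", "this", "for", "that"}


def expansion_terms(retrieved_posts, num_terms):
    """Return the num_terms most frequent non-stop-word terms in the retrieved posts."""
    counts = Counter(
        token
        for post in retrieved_posts
        for token in post.lower().split()
        if token not in STOP_WORDS
    )
    return [term for term, _ in counts.most_common(num_terms)]


def expand_response(response, retrieved_posts, num_terms):
    """Append expansion terms that the response does not already contain."""
    present = set(response.lower().split())
    extra = [t for t in expansion_terms(retrieved_posts, num_terms) if t not in present]
    return response + " " + " ".join(extra) if extra else response
```

In the example above, a term such as “excel” would be appended to the response candidate in this way, giving the matching model a lexical link to the dialog context that the raw response text lacks.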
In this paper, we propose a learning framework based on deep matching networks to leverage external knowledge for response ranking in information-seeking conversation systems. We incorporate external knowledge into deep neural models with pseudo-relevance feedback and QA correspondence knowledge distillation. Extensive experiments on both open benchmarks and commercial data show our methods outperform various baselines including the state-of-the-art methods. We also perform analysis on different response types and model variations to provide insights on model applications. For future work, we plan to model user intent in information-seeking conversations and learn meaningful patterns from user intent dynamics to help response selection. Incorporating both structured and unstructured knowledge into deep matching networks for response ranking is also interesting to explore.
This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS-1419693. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
A Comparative Study of Methods for Estimating Query Language Models with Pseudo Feedback. In CIKM ’09.
Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. In WSDM ’18.