The rapid development of artificial intelligence has enabled many intelligent personal assistant systems, such as Amazon Alexa, Apple Siri, Alibaba AliMe, Microsoft Cortana, Google Now and Samsung Bixby. These systems are widely deployed: for example, there are over 100M installations of Google Now (Google, http://bit.ly/1wTckVs), 100M sales of Amazon Alexa devices (TheVerge, https://bit.ly/2FbnzTN), and more than 141M monthly users of Microsoft Cortana (Windowscentral, http://bit.ly/2Dv6TVT).
As a natural interface for human-computer interaction, conversation systems have attracted the attention of researchers in the Information Retrieval (IR), Natural Language Processing (NLP) and Machine Learning (ML) communities, leading to a rapidly growing field referred to as Conversational AI (Gao et al., 2018).
Typical conversation systems are modularized systems with a natural language understanding module, a dialog state tracker, a dialog policy learning module, and a natural language generation module (Henderson, 2015). In recent years, fully data-driven end-to-end conversation models have been proposed to reduce hand-crafted features, rules or templates. These methods could be grouped into two different categories: generation-based approaches (Ritter et al., 2011; Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016b) and retrieval-based approaches (Ji et al., 2014; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017; Yang et al., 2018).
Given some conversation context, retrieval-based models try to find the most relevant context-response pairs in a pre-constructed conversational history repository. Some of these methods achieve this in two steps: 1) retrieve a candidate response set with basic retrieval models such as BM25 (Robertson and Walker, 1994) or QL (Ponte and Croft, 1998); and 2) re-rank the candidate response set with neural ranking models to find the best matching response (Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017; Wu et al., 2017; Yang et al., 2018). These methods can return natural human utterances in the conversational history repository, which is controllable and explainable. Retrieved responses often come with better diversity and richer information compared to generated responses (Song et al., 2018). However, the performance of retrieval-based methods is limited by the size of the conversational history repository, especially for long tail contexts that are not covered in the history. Retrieval-based models lack the flexibility of generation-based models, since the set of responses of a retrieval system is fixed once the historical context/response repository is constructed.
|Item||Retrieval-based methods||Generation-based methods|
|Main techniques||Retrieval models; Neural ranking models||Seq2Seq models|
|Diversity||Usually good if similar contexts have diverse responses in the repository||Easy to generate bland or universal responses|
|Response length||Can be very long||Usually short|
|Context property||Easy for similar context in the repository; Hard for unseen context||Easy to generalize to unseen context|
|Efficiency||Building index takes long time; Retrieval is fast||Training takes long time; Decoding is fast|
|Flexibility||Fixed response set once the repository is constructed||Can generate new responses not covered in history|
|Fluency||Natural human utterances||Sometimes bad or contain grammar errors|
|Bottleneck||Size and coverage of the repository||Specific responses; Long text; Sparse data|
|Informativeness||Easy to retrieve informative content||Hard to integrate external factual knowledge|
|Controllability||Easy to control and explain||Difficult to control the actual generated content|
On the other hand, generation-based methods can generate highly coherent new responses given the conversation context. Much previous research along this line was based on the Seq2Seq model (Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015), in which an encoder learns the representation of the conversation context as a contextual vector, and a decoder generates a response sequence conditioned on the contextual vector as well as the already generated part of the sequence. The encoder/decoder can be implemented by an RNN with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) hidden units or gated recurrent units (GRU) (Chung et al., 2014). Although generation-based models can generate new responses for a conversation context, a common problem is that they are likely to generate very general or universal responses with insufficient information, such as “I don’t know”, “I have no idea”, “Me too”, or “Yes please”. The generated responses may also contain grammatical errors. Ghazvininejad et al. (Ghazvininejad et al., 2018) proposed a knowledge-grounded neural conversation model in order to infuse the generated responses with more factual information relevant to the conversation context without slot filling. Although they show that the generated responses from the knowledge-grounded neural conversation model are more informative than responses from the vanilla Seq2Seq model, their model is still generation-based, and it is not clear how well it performs compared to retrieval-based methods. A comparison of retrieval-based methods and generation-based methods for end-to-end data-driven conversation models is shown in Table 1. Clearly, these two types of methods have their own advantages and disadvantages; it is thus necessary to integrate the merits of these two methods.
To this end, in this paper we study the integration of retrieval-based and generation-based conversation models in a unified framework. The closest prior research to our work is the study on the ensemble of retrieval-based and generation-based conversation models by Song et al. (Song et al., 2018). Their proposed system uses a multi-seq2seq model to generate a response and then adopts a Gradient Boosting Decision Tree (GBDT) ranker to re-rank the generated responses and retrieved responses. However, their method still requires heavy feature engineering to encode the context/response candidate pairs in order to train the GBDT ranker. They constructed the training data by negative sampling, which may lead to sub-optimal performance, since the sampled negative response candidates can be easily discriminated from the positive response candidates by simple term-matching-based features.
We address these issues by proposing a hybrid neural conversational model with a generation module, a retrieval module and a hybrid ranking module. The generation module generates a response candidate given a conversation context, using a Seq2Seq model consisting of a conversation context encoder, a facts encoder and a response decoder. The retrieval module adopts a “context-context match” approach to recall a set of response candidates from the historical context/response repository. The hybrid ranking module is built on top of neural ranking models to select the best response candidate among the retrieved and generated response candidates. The integration of neural ranking models, which can learn representations and matching features for conversation context/response candidate pairs, enables us to minimize feature engineering costs during model development. To construct the training data of the neural ranker for response selection, we propose a distant supervision approach to automatically infer labels for retrieved and generated response candidates. We evaluate our proposed approach with experiments on Twitter and Foursquare (Ghazvininejad et al., 2018) data. Experimental results show that the proposed model can outperform both retrieval-based models and generation-based models (including a recently proposed knowledge-grounded neural conversation model (Ghazvininejad et al., 2018)) on both automatic evaluation and human evaluation. Code will be released on GitHub upon paper acceptance.
In all, our contributions can be summarized as follows:
(1) We perform a comparative study of retrieval-based models and generation-based models for the conversational response generation task.
(2) We propose a hybrid neural conversational model to combine response generation and response retrieval with a neural ranking model to reduce feature engineering costs.
(3) For model training, we propose a distant supervision approach to automatically infer labels for retrieved and generated response candidates. We evaluate the effectiveness of different kinds of distant supervision signals and settings for the hybrid ranking of response candidates.
(4) We run an extensive experimental evaluation on retrieval-based, generation-based and hybrid models using the Twitter and Foursquare data. Experimental results show that the proposed hybrid neural conversation model can outperform both retrieval-based and generation-based models on both automatic evaluation and human evaluation. We also perform qualitative analysis on top responses selected by the neural re-ranker and response generation examples to provide insights.
Roadmap. The rest of our paper is organized as follows. We will review related work in Section 2. Then Section 3 will present the details of the generation module, retrieval module and hybrid ranking module in the proposed method. Section 4 contains the experiments and results analysis. We will conclude in Section 5.
2. Related Work
Our work is related to research on retrieval-based conversation models, generation-based conversation models and neural ranking models.
Retrieval-based Conversation Models. There have been several recent studies on retrieval-based conversation models (Wu et al., 2017; Zhou et al., 2016; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017; Ji et al., 2014; Lowe et al., 2015; Yang et al., 2017; Yang et al., 2018). Yan et al. (Yan et al., 2016a) proposed a retrieval-based conversation system with the deep learning-to-respond schema by concatenating context utterances with the input message as reformulated queries. Wu et al. (Wu et al., 2017) proposed a sequential matching network that matches a response with each utterance in the context on multiple levels of granularity to distill important matching information. Yang et al. (Yang et al., 2018) considered external knowledge beyond the dialog context through pseudo-relevance feedback and QA correspondence knowledge distillation for multi-turn response ranking. Although retrieval-based methods can return fluent responses with great diversity, these approaches lack the flexibility of generation-based methods, since the set of responses of a retrieval system is fixed once the historical context/response repository is constructed. Thus retrieval systems may fail to return any appropriate response for unseen conversation context inputs (Gao et al., 2018). In our work, we study the integration of retrieval-based methods and generation-based methods for conversation response generation to combine the merits of these two types of methods.
Generation-based Conversation Models. There have also been a number of recent studies on conversation response generation with deep learning and reinforcement learning (Ritter et al., 2011; Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016b, a; Tian et al., 2017; Bordes et al., 2017; Dhingra et al., 2017; Qiu et al., 2017; Zhang et al., 2018a; Pandey et al., 2018; Wu et al., 2018; Zhang et al., 2018b). Early generation-based conversation models were inspired by statistical machine translation (SMT) (Ritter et al., 2011), which applied a phrase-based translation approach (Koehn et al., 2003) to conversation response generation. In order to utilize longer conversation context, Sordoni et al. (Sordoni et al., 2015) proposed an RNN-based architecture that encodes a sequential context for response generation. Shang et al. (Shang et al., 2015) proposed the Neural Responding Machine (NRM), an RNN encoder-decoder framework for short text conversations, and showed that it outperformed retrieval-based methods and SMT-based methods for single-round conversations. In order to mitigate the blandness problem of universal responses generated by Seq2Seq models, Li et al. (Li et al., 2015) proposed the Maximum Mutual Information (MMI) objective function for conversation response generation. The approach first generates N-best lists and rescores them with MMI during the decoding process. Zhang et al. (Zhang et al., 2018b) proposed a model that introduces an additional variable, modeled using a Gaussian kernel layer, to control the level of specificity of the response. Some previous work augments the context encoder to represent not only the conversation history, but also additional input from external knowledge. Ghazvininejad et al. (Ghazvininejad et al., 2018) proposed a knowledge-grounded neural conversation model which infuses factual content that is relevant to the conversation context.
Our research shares a similar motivation with this work, but we do not adopt a pure generation-based approach. Instead, we look at a hybrid approach of retrieval-based models and generation-based models. Similar hybrid approaches are also used in some popular intelligent personal assistant systems, including the “Core Chat” component of Microsoft XiaoIce (Zhou et al., 2018). Our proposed model differs from prior work using a boosted tree ranker (Zhou et al., 2018; Song et al., 2018) by using a neural ranking model, which has the advantage of reducing feature engineering effort for the conversation context/response candidate pairs during the hybrid re-ranking process.
Neural Ranking Models. A number of neural ranking models have been proposed for information retrieval, question answering and conversation response ranking (Huang et al., 2013; Hu et al., 2014; Pang et al., 2016; Guo et al., 2016; Yang et al., 2016; Wu et al., 2017; Xiong et al., 2017; Mitra et al., 2017; Yu et al., 2018). These models can be classified into three categories (Guo et al., 2016). The first category is the representation-focused models. These models learn the representations of queries and documents separately and then calculate the similarity score of the learned representations with functions such as cosine, dot, bilinear or tensor layers. A typical example is the DSSM (Huang et al., 2013). The second category is the interaction-focused models, which build a query-document pairwise interaction matrix to capture the exact matching and semantic matching information between the query-document pairs. The interaction matrix is further fed into deep neural networks, which could be a CNN (Hu et al., 2014; Pang et al., 2016; Yu et al., 2018) or a term gating network with a histogram or value-shared weighting mechanism (Guo et al., 2016; Yang et al., 2016), to generate the final ranking score. Finally, the neural ranking models in the third category combine the ideas of the representation-focused models and interaction-focused models to jointly learn the lexical matching and semantic matching between queries and documents (Mitra et al., 2017; Yu et al., 2018). The neural ranking model used in our research belongs to the interaction-focused models, chosen for their better performance on a variety of text matching and ranking tasks compared with representation-focused models (Hu et al., 2014; Pang et al., 2016; Guo et al., 2016; Yang et al., 2016; Wu et al., 2017; Xiong et al., 2017).
3. Our Approach
3.1. Problem Formulation
We define the task of conversational response generation following the previous literature (Ghazvininejad et al., 2018). We are given a conversation context $u_i \in \mathcal{U}$, where $u_i$ is the $i$-th context sequence, which contains one or multiple utterances. There are also $F$ factual snippets of text $\mathcal{F}_i = \{f_i^1, f_i^2, \dots, f_i^F\}$ that are relevant to the $i$-th conversation context $u_i$. Based on the conversation context $u_i$ and the set of external facts $\mathcal{F}_i$, the system outputs an appropriate response which provides useful information to users. Figure 1 shows an example of the conversational response generation task. Given a conversation context “Going to Din Tai Fung Dumpling House tonight!”, we can associate it with several contextually relevant facts from a much larger collection of external knowledge text (e.g. the Wikipedia dump, tips on Foursquare, product customer reviews on Amazon, etc.). A response that is both appropriate and informative in the given example could be “The shrimp and pork wontons with spicy sauce are amazing!”.
|$u_i$, $\mathcal{U}$||The context of the $i$-th conversation and the set of all conversation contexts|
|$f_i^k$, $\mathcal{F}_i$, $\mathcal{F}$||The $k$-th factual text relevant to context $u_i$, the factual texts relevant to context $u_i$ and the set of all factual texts|
|$r_i^k$, $\mathcal{R}_i$, $\mathcal{R}$||The $k$-th retrieved response candidate to context $u_i$, the set of all retrieved response candidates for context $u_i$ and the set of all retrieved response candidates|
|$g_i^k$, $\mathcal{G}_i$, $\mathcal{G}$||The $k$-th generated response candidate to context $u_i$, the set of all generated response candidates for context $u_i$ and the set of all generated response candidates|
|$y_i^k$, $\mathcal{Y}_i$||The $k$-th response candidate and the union set of all the candidates for the $i$-th context, i.e., $\mathcal{Y}_i = \mathcal{G}_i \cup \mathcal{R}_i$|
|$y_i^*$, $\mathcal{Y}^*$||The ground truth response candidate for the $i$-th context and the set of all ground truth response candidates|
|$f(\cdot)$||The neural ranking model learned in the hybrid ranking module|
|$f(u_i, y_i^k)$||The predicted matching score between $u_i$ and $y_i^k$|
3.2. Method Overview
In the following sections, we describe the proposed Hybrid Neural Conversation Model (HybridNCM) for response generation. Figure 2 shows the architecture of the hybrid neural conversation model. In general, there are three modules in our proposed model:
(1) Generation Module: Given the conversation context $u_i$ and the relevant facts $\mathcal{F}_i$, this module generates a set of response candidates $\mathcal{G}_i$ using a Seq2Seq model which consists of a conversation context encoder, a facts encoder and a response decoder.
(2) Retrieval Module: This module adopts a “context-context match” approach to retrieve a few response candidates $\mathcal{R}_i$. The “context-context match” approach matches the conversation context $u_i$ with all historical conversation contexts. It then returns the corresponding responses of the top ranked historical conversation contexts as the set of retrieved response candidates $\mathcal{R}_i$.
(3) Hybrid Ranking Module: Given the generated and retrieved response candidates, i.e., $\mathcal{Y}_i = \mathcal{G}_i \cup \mathcal{R}_i$, this module re-ranks all the response candidates with a hybrid neural ranker trained with labels from distant supervision to find the best response as the final system output.
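The flow through the three modules can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `retrieve` and `score` are hypothetical stand-ins for the generation module, the retrieval module and the trained neural ranker, and the toy functions below exist only to make the example runnable.

```python
from typing import Callable, List


def hybrid_respond(
    context: str,
    facts: List[str],
    generate: Callable[[str, List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    score: Callable[[str, str], float],
) -> str:
    """Return the top-ranked candidate among generated and retrieved responses."""
    generated = generate(context, facts)   # candidates from the generation module
    retrieved = retrieve(context)          # candidates from the retrieval module
    candidates = generated + retrieved     # union of both candidate sets
    if not candidates:
        raise ValueError("no response candidates available")
    # Hybrid ranking: keep the candidate with the highest matching score.
    return max(candidates, key=lambda y: score(context, y))


# Toy stand-ins (assumptions for illustration only).
def toy_generate(context: str, facts: List[str]) -> List[str]:
    return ["I don't know"]


def toy_retrieve(context: str) -> List[str]:
    return ["The shrimp and pork wontons with spicy sauce are amazing!"]


def toy_score(context: str, response: str) -> float:
    # Hypothetical scorer: longer responses get higher scores.
    return float(len(response.split()))


best = hybrid_respond(
    "Going to Din Tai Fung Dumpling House tonight!",
    ["Try the wontons."],
    toy_generate,
    toy_retrieve,
    toy_score,
)
```

Passing the three modules as callables keeps the ranking step independent of how candidates are produced, which mirrors the modular design described in this section.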
3.3. Generation Module
We map a sequence of words to a sequence of embeddings by looking up the indices in an embedding matrix, e.g., a word sequence $x$ of length $L$ is mapped to the embedding sequence $E(x) = [e_1, e_2, \dots, e_L]$.
3.3.1. Context Encoder
Inspired by previous work on response generation with Seq2Seq models (Vinyals and Le, 2015; Shang et al., 2015; Ghazvininejad et al., 2018), we adopt a Seq2Seq architecture with an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015) in the hybrid neural conversation model. In the Seq2Seq architecture, a context encoder is used to transform a sequence of context vectors into contextual hidden vectors in Eq. (1), where $h_t^u$ is the hidden state at time step $t$. In our implementation, we stack two layers of LSTM networks as the recurrent neural network. With the context encoder, we can summarize the conversation context by the last hidden vector and maintain the detailed information at each time step by each hidden state $h_t^u$.
3.3.2. Facts Encoder
For the facts encoder, we use the same stacked LSTM architecture as the context encoder in Section 3.3.1 to generate the hidden representations of relevant facts. Note that for each conversation context, there are $F$ sequences of facts $\{f_i^1, f_i^2, \dots, f_i^F\}$. We encode these facts into $F$ sequences of hidden vectors with the stacked LSTM. We summarize each fact into a fixed-size vector by averaging its hidden vectors.
3.3.3. Response Decoder
The response decoder is trained to predict the next word $y_t$ given the representations of the conversation context $u_i$, the facts $\mathcal{F}_i$, and all the previously generated words $y_{1:t-1}$, i.e., it models $p(y_t \mid y_{1:t-1}, u_i, \mathcal{F}_i)$ (Eq. (2)).
For the decoder, we stack two layers of LSTM networks with the attention mechanism proposed in (Luong et al., 2015). More specifically, we concatenate the hidden vectors of a context and all factual vectors into a matrix in Eq. (3). We then compute the attention weights by the dot product between the decoder’s previous hidden state $h_{t-1}$ and all vectors in this matrix, followed by a softmax function in Eq. (4). The attention context summarizes the conversation context and facts by the weighted sum of the vectors in this matrix in Eq. (5). For the input to the decoder’s RNN, we concatenate the attention context and the previous hidden state $h_{t-1}$, which summarizes the partially generated response $y_{1:t-1}$, and apply a tanh function afterwards in Eq. (6). The initial hidden vector of the decoder is initialized by the last hidden state of the context encoder and the average of the factual vectors in Eq. (8), using a linear function that maps a vector from the encoder’s hidden space to the decoder’s hidden space. The conditional probability at the $t$-th time step can be computed by a linear function, which is a fully connected layer that maps the decoder’s hidden state to a distributional vector over the vocabulary, followed by a softmax function in Eq. (9).
Here, $h_t$ denotes the hidden state of the decoder RNN at time step $t$.
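The attention step described above (stack context hidden vectors and averaged fact vectors, take dot products with the previous decoder state, apply a softmax, then form a weighted sum) can be illustrated with a small pure-Python sketch. The function names, the toy two-dimensional vectors and the use of plain lists instead of tensors are assumptions made for readability; an actual implementation would use a deep learning framework.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Average a fact's hidden vectors into one fixed-size vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attend(memory: List[Vector], prev_state: Vector) -> Tuple[List[float], Vector]:
    """Dot-product attention over the rows of `memory`.

    Returns the attention weights and the attention context (weighted sum of rows).
    """
    scores = [sum(m * h for m, h in zip(row, prev_state)) for row in memory]
    weights = softmax(scores)
    dim = len(prev_state)
    context = [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]
    return weights, context


# Toy example: two context hidden vectors plus one averaged fact vector.
context_hidden = [[1.0, 0.0], [0.0, 1.0]]
fact_hidden = [[2.0, 2.0], [0.0, 0.0]]                 # hidden vectors of one fact
memory = context_hidden + [mean_vector(fact_hidden)]   # rows: context states, then facts
weights, attn_context = attend(memory, prev_state=[1.0, 0.0])
```

In this toy case the first context vector and the fact vector both have dot product 1 with the previous state, so they receive equal attention weight, and the second context vector receives less.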
3.3.4. Train and Decode
Given the ground-truth response $y_i^*$ to a conversation context $u_i$ with facts $\mathcal{F}_i$, the training objective is to minimize the negative log-likelihood over all the training data in Eq. (10).
During prediction, we use beam search to generate response candidates and perform length normalization by dividing the output log-likelihood score by the length of the generated sequence, which penalizes short generated sequences.
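The effect of length normalization on beam hypotheses can be sketched as follows. This is a minimal illustration, assuming each hypothesis comes with per-token log-probabilities; the function names and the toy beam are hypothetical.

```python
from typing import List, Tuple


def length_normalized_score(log_probs: List[float]) -> float:
    """Average per-token log-probability of a hypothesis."""
    if not log_probs:
        raise ValueError("empty hypothesis")
    return sum(log_probs) / len(log_probs)


def rank_hypotheses(hyps: List[Tuple[str, List[float]]]) -> List[str]:
    """Sort beam hypotheses by length-normalized log-likelihood, best first."""
    ordered = sorted(hyps, key=lambda h: length_normalized_score(h[1]), reverse=True)
    return [text for text, _ in ordered]


# Hypothetical beam output: (text, per-token log-probabilities).
beam = [
    ("me too", [-0.9, -0.9]),                              # raw sum -1.8
    ("the wontons are great", [-0.5, -0.5, -0.5, -0.5]),   # raw sum -2.0
]
ranked = rank_hypotheses(beam)
```

Ranking by the raw sum would prefer the shorter hypothesis, whereas the normalized score prefers the longer one because its average per-token log-probability is higher.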
3.4. Retrieval Module
The retrieval module retrieves a set of response candidates from the historical conversation context/response repository. It adopts a “context-context match” approach to retrieve a few response candidates. We first index all context/response pairs in the training data with Lucene (http://lucene.apache.org/). Then, for each conversation context $u_i$, we match it against the “conversation context” text field in the index with BM25. We return the “response” text field of the top ranked context/response pairs as the retrieved response candidates. (We set in our experiments.) We would like to keep the retrieval module simple and efficient. The re-ranking of response candidates is performed in the hybrid ranking module, as presented in Section 3.5.
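The “context-context match” idea can be sketched with a small in-memory BM25 index. This is an illustration only: the paper uses Lucene, whereas the class below is a hand-written Okapi BM25 variant with whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` as well as the toy history are assumptions.

```python
import math
from collections import Counter
from typing import List, Tuple


class BM25Index:
    """Tiny in-memory BM25 index over historical (context, response) pairs."""

    def __init__(self, pairs: List[Tuple[str, str]], k1: float = 1.2, b: float = 0.75):
        self.pairs = pairs
        self.k1, self.b = k1, b
        # Only the context field is indexed; responses are returned as payload.
        self.docs = [context.lower().split() for context, _ in pairs]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        self.df = Counter(term for d in self.docs for term in set(d))

    def _idf(self, term: str) -> float:
        n = len(self.docs)
        df = self.df.get(term, 0)
        return math.log(1.0 + (n - df + 0.5) / (df + 0.5))

    def score(self, query: List[str], doc: List[str]) -> float:
        tf = Counter(doc)
        norm = self.k1 * (1.0 - self.b + self.b * len(doc) / self.avg_len)
        return sum(
            self._idf(t) * tf[t] * (self.k1 + 1.0) / (tf[t] + norm)
            for t in set(query)
            if t in tf
        )

    def retrieve(self, context: str, k: int) -> List[str]:
        """Return the responses of the k historical contexts that best match `context`."""
        query = context.lower().split()
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query, self.docs[i]),
            reverse=True,
        )
        return [self.pairs[i][1] for i in ranked[:k]]


history = [
    ("going to the dumpling house tonight", "get the spicy wontons"),
    ("anyone watching the game tonight", "yes, go team"),
    ("best coffee near the station", "try the corner cafe"),
]
index = BM25Index(history)
top = index.retrieve("dumpling house for dinner tonight", k=2)
```

The query is matched against historical contexts rather than responses, so the returned responses are those that followed similar contexts in the repository.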
3.5. Hybrid Ranking Module
3.5.1. Interaction Matching Matrix
We combine a set of generated response candidates $\mathcal{G}_i$ and a set of retrieved response candidates $\mathcal{R}_i$ as the set of all response candidates $\mathcal{Y}_i = \mathcal{G}_i \cup \mathcal{R}_i$. The hybrid ranking module re-ranks all candidates in $\mathcal{Y}_i$ to find the best one as the final system output. In our implementation, $\mathcal{G}_i$ contains one generated response and $\mathcal{R}_i$ contains the retrieved responses. We adopt a neural ranking model following the previous work (Pang et al., 2016; Yang et al., 2018). Specifically, for each conversation context $u_i$ and response candidate $y_i^k$, we first build an interaction matching matrix. Given $u_i$ and $y_i^k$, the model looks up a global embedding dictionary to represent $u_i$ and $y_i^k$ as two sequences of embedding vectors $E(u_i)$ and $E(y_i^k)$, whose $j$-th elements are the embedding vectors of the $j$-th word in the word sequences $u_i$ and $y_i^k$ respectively. The model then builds an interaction matrix $\mathbf{M}$, which computes the pairwise similarity between words in $u_i$ and $y_i^k$ via the dot product similarity between the embedding representations. The interaction matching matrix is used as the input of a convolutional neural network (CNN) to learn important matching features, which are aggregated by the final multi-layer perceptron (MLP) to generate a matching score.
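Building the interaction matching matrix amounts to taking the dot product between every context word embedding and every response word embedding. The sketch below illustrates this with a hypothetical two-dimensional embedding dictionary; the embedding values and the function name are assumptions for illustration.

```python
from typing import Dict, List

Vector = List[float]


def interaction_matrix(
    context: List[str], response: List[str], emb: Dict[str, Vector]
) -> List[List[float]]:
    """Entry (i, j) is the dot product of the i-th context word embedding
    and the j-th response word embedding."""

    def dot(a: Vector, b: Vector) -> float:
        return sum(x * y for x, y in zip(a, b))

    return [[dot(emb[cw], emb[rw]) for rw in response] for cw in context]


# Toy embedding dictionary (hypothetical values).
emb = {
    "dumpling": [1.0, 0.0],
    "tonight": [0.0, 1.0],
    "wontons": [0.9, 0.1],
    "amazing": [0.2, 0.3],
}
matrix = interaction_matrix(["dumpling", "tonight"], ["wontons", "amazing"], emb)
```

Here the entry for ("dumpling", "wontons") is the largest, reflecting that these two toy embeddings point in nearly the same direction.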
3.5.2. CNN Layers and MLP
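The interaction matrices are fed into a CNN to learn high-level matching patterns as features. The CNN alternates convolution and max-pooling operations over these inputs. Let $z^{(l,k)}$ denote the output feature map of the $l$-th layer and $k$-th kernel; the model performs convolution operations and max-pooling operations in Eq. (11) and Eq. (12), respectively.
Convolution. Let $r_w^{(l,k)} \times r_h^{(l,k)}$ denote the shape of the $k$-th convolution kernel in the $l$-th layer. The convolution operation can be defined as:
$$z^{(l+1,k')}_{i,j} = \sigma\Big(\sum_{k=0}^{K_l-1}\ \sum_{s=0}^{r_w^{(l+1,k')}-1}\ \sum_{t=0}^{r_h^{(l+1,k')}-1} w^{(l+1,k')}_{s,t}\, z^{(l,k)}_{i+s,j+t} + b^{(l+1,k')}\Big) \qquad (11)$$
where $\sigma$ is the activation function, and $w^{(l+1,k')}$ and $b^{(l+1,k')}$ are the parameters of the $k'$-th kernel on the $(l+1)$-th layer to be learned. $K_l$ is the number of kernels on the $l$-th layer.
Max Pooling. Let $p_w^{(l,k)} \times p_h^{(l,k)}$ denote the shape of the $k$-th pooling kernel in the $l$-th layer. The max pooling operation can be defined as:
$$z^{(l+1,k)}_{i,j} = \max_{0 \le s < p_w^{(l+1,k)}}\ \max_{0 \le t < p_h^{(l+1,k)}} z^{(l,k)}_{i+s,j+t} \qquad (12)$$
Finally, we feed the output feature representation vectors learned by the CNN into a multi-layer perceptron (MLP) to calculate the final matching score $f(u_i, y_i^k)$.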
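The convolution, max-pooling and scoring steps can be illustrated on a tiny interaction matrix. This is a simplified sketch under several assumptions: a single input channel and a single kernel, ReLU as the activation, non-overlapping pooling windows, and a one-layer linear scorer standing in for the MLP. All names and values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_relu(x: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """Valid 2-D convolution (cross-correlation form) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = bias + sum(
                kernel[s][t] * x[i + s][j + t] for s in range(kh) for t in range(kw)
            )
            row.append(max(0.0, total))
        out.append(row)
    return out


def max_pool(x: Matrix, ph: int, pw: int) -> Matrix:
    """Non-overlapping max pooling with a ph x pw window."""
    return [
        [
            max(x[i + s][j + t] for s in range(ph) for t in range(pw))
            for j in range(0, len(x[0]) - pw + 1, pw)
        ]
        for i in range(0, len(x) - ph + 1, ph)
    ]


def linear_score(features: List[float], weights: List[float], bias: float = 0.0) -> float:
    """One linear layer standing in for the final MLP."""
    return bias + sum(w * f for w, f in zip(weights, features))


# A 3x3 interaction matrix with a strong diagonal (word-by-word match).
interaction = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
diag_kernel = [[1.0, 0.0], [0.0, 1.0]]                 # responds to diagonal patterns
feature_map = conv2d_relu(interaction, diag_kernel)    # 2x2 feature map
pooled = max_pool(feature_map, 2, 2)                   # 1x1 after pooling
score = linear_score([v for row in pooled for v in row], weights=[0.5])
```

The diagonal kernel fires where consecutive context words match consecutive response words, which is the kind of matching pattern a learned kernel could pick up from the interaction matrix.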
3.5.3. Distant Supervision for Model Training
For model training, we consider a pairwise ranking learning setting. The training data consists of triples $(u_i, y_i^+, y_i^-)$, where $y_i^+$ and $y_i^-$ denote the positive and the negative response candidate for dialog context $u_i$. A challenging problem here is that there is no ground truth ranking for all the candidate responses in $\mathcal{Y}_i$ given a conversation context $u_i$. The cost of annotating all context/response candidate pairs for model training would be very high. Thus, we generate training data to train the hybrid ranking module with distant supervision, inspired by previous work on relation extraction (Mintz et al., 2009). Specifically, we construct $\mathcal{Y}_i$ by mixing the retrieved response candidates $\mathcal{R}_i$ and one generated response candidate $\mathcal{G}_i$. We then score these response candidates with metrics like BLEU/ROUGE-L by comparing them with the ground truth responses in the training data. Finally, we treat the top response candidates (we set in our experiments) ranked by BLEU/ROUGE-L as positive candidates and the other responses as negative candidates. In this way, the training labels of response candidates can be inferred by distant supervision from the ground truth responses in the training data. Note that we do not have to do such inference during model testing, since we only need to use the trained ranking model to score response candidates instead of computing the training loss. We perform experiments to evaluate the effectiveness of different kinds of distant supervision signals. In practice, there could be multiple appropriate and diverse responses for a given conversation context. Ideally, we need multiple reference responses for each conversation context, each for a different and relevant response. We leave generating multiple references for a conversation context for distant supervision to future work. We have to point out that it is difficult to collect data where each context is paired with comprehensive reference responses.
Our proposed method can also be easily adapted to the scenario where we have multiple reference responses for a conversation context. Given inferred training labels, we can compute the pairwise ranking-based hinge loss, which is defined as:
$$\mathcal{L} = \sum_{i=1}^{I} \max\big(0,\ \epsilon - f(u_i, y_i^+) + f(u_i, y_i^-)\big) + \lambda \lVert \Theta \rVert_2^2$$
where $I$ is the total number of triples in the training data, $\lambda \lVert \Theta \rVert_2^2$ is the regularization term over the model parameters $\Theta$ with regularization coefficient $\lambda$, and $\epsilon$ denotes the margin in the hinge loss.
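The distant supervision labeling and the pairwise hinge loss can be sketched together. Several parts are assumptions made for illustration: clipped unigram precision is used as a simplified stand-in for BLEU-1 (no brevity penalty), every positive is paired with every negative, the margin defaults to 1.0, the regularization term is omitted, and `toy_score` is a hypothetical ranker.

```python
from collections import Counter
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision, a simplified stand-in for BLEU-1."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return overlap / len(cand)


def build_triples(
    context: str,
    candidates: List[str],
    reference: str,
    num_pos: int,
    metric: Callable[[str, str], float] = unigram_precision,
) -> List[Triple]:
    """Label the top `num_pos` candidates (by metric against the reference)
    as positives and pair each of them with every remaining negative."""
    ranked = sorted(candidates, key=lambda y: metric(y, reference), reverse=True)
    positives, negatives = ranked[:num_pos], ranked[num_pos:]
    return [(context, p, n) for p in positives for n in negatives]


def hinge_loss(
    triples: List[Triple], score: Callable[[str, str], float], margin: float = 1.0
) -> float:
    """Sum of max(0, margin - score(u, y+) + score(u, y-)); regularization omitted."""
    return sum(max(0.0, margin - score(u, p) + score(u, n)) for u, p, n in triples)


context = "going to the dumpling house tonight"
reference = "the wontons there are amazing"
candidates = ["the wontons are amazing", "me too", "i have no idea"]
triples = build_triples(context, candidates, reference, num_pos=1)


def toy_score(u: str, y: str) -> float:
    # Hypothetical ranker: 0.25 points per word in the response.
    return 0.25 * len(y.split())


loss = hinge_loss(triples, toy_score)
```

With this toy ranker, the positive candidate scores 1.0 and the negatives score 0.5 and 1.0, so both triples violate the margin and contribute 0.5 and 1.0 to the loss.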
|Items||Train||Valid||Test|
|# Context-response pairs||1,059,370||2,067||2,066|
|Avg # facts per context||40.70||38.68||38.68|
|Avg # words per facts||17.58||17.42||17.47|
|Avg # words per context||16.66||17.85||17.66|
|Avg # words per response||11.65||15.58||15.89|
4.1. Data Set Description
We used the same grounded Twitter conversation data set from the study by Ghazvininejad et al. (Ghazvininejad et al., 2018). The data contains 1 million two-turn Twitter conversations. Foursquare tips (https://foursquare.com/) are used as the fact data, which is relevant to the conversation context in the Twitter data. The Twitter conversations contain entities that tie to Foursquare. The conversation data is associated with the fact data by identifying Twitter conversation pairs in which the first turn contains either a handle of the entity name or a hashtag that matches a handle appearing in the Foursquare tip data. The validation and test sets (around 4K conversations) are created to contain responses that are informative and useful, in order to evaluate conversation systems on their ability to produce contentful responses. The statistics of the data are shown in Table 3.
4.2. Experimental Setup
4.2.1. Competing Methods.
We consider different types of methods for comparison, including retrieval-based, generation-based and hybrid retrieval-generation methods. We did not compare with (Song et al., 2018), since the code of both the state-of-the-practice IR system (Yan et al., 2016b) and the multi-seq2seq model, which are the two main components of the proposed ensemble model in (Song et al., 2018), is not available. The experimental data used in (Song et al., 2018) is also not available. The competing methods are as follows:
Seq2Seq. This is the standard Seq2Seq model with a conversation context encoder and a response decoder, which is the method proposed in (Vinyals and Le, 2015).
Seq2Seq-Facts. This is the Seq2Seq model with an additional facts encoder, which is the generation module in the proposed hybrid neural conversational model.
KNCM-MTask-R. KNCM-MTask-R is the best setting of the knowledge-grounded neural conversation model proposed in the research by Ghazvininejad et al. (Ghazvininejad et al., 2018) with multi-task learning. This system is trained with 23 million general Twitter conversation data to learn the conversation structure or backbone and 1 million grounded conversation data with associated facts from Foursquare tips. Since we used the same 1 million grounded Twitter conversation data set from this work, our experimental results are directly comparable with response generation results reported by Ghazvininejad et al. (Ghazvininejad et al., 2018).
Retrieval. This method uses the BM25 model (Robertson and Walker, 1994) to match the conversation context with conversation context/response pairs in the historical conversation repository to find the best pair. It is the retrieval module in the proposed hybrid neural conversational model.
HybridNCM. This is the method proposed in this paper. It contains two different variations: 1) HybridNCM-RS is a hybrid method by mixing generated response candidates from Seq2Seq and retrieved response candidates from the retrieval module in HybridNCM; 2) HybridNCM-RSF is a hybrid method by mixing generated response candidates from Seq2Seq-Facts and retrieved response candidates from the retrieval module in HybridNCM.
4.2.2. Evaluation Methodology.
Following previous related work (Sordoni et al., 2015; Li et al., 2016a; Ghazvininejad et al., 2018), we use BLEU and ROUGE-L for the automatic evaluation of the generated responses. Corpus-level BLEU is known to correlate better with human judgments than sentence-level BLEU, including for conversation response generation (Galley et al., 2015). We also report lexical diversity as an automatic measure of informativeness and diversity. The lexical diversity metrics include Distinct-1 and Distinct-2, which are respectively the number of distinct unigrams and bigrams divided by the total number of generated words in the responses. In addition to automatic evaluation, we also perform human evaluation (Section 4.3.2) of the generated responses of different systems on appropriateness and informativeness, following previous work (Ghazvininejad et al., 2018).
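The Distinct-1 and Distinct-2 definitions above translate directly into code. The sketch below assumes lowercased whitespace tokenization, which is an illustrative choice rather than the tokenization used in the experiments.

```python
from typing import List


def distinct_n(responses: List[str], n: int) -> float:
    """Number of distinct n-grams divided by the total number of generated words."""
    ngrams = set()
    total_words = 0
    for response in responses:
        tokens = response.lower().split()
        total_words += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_words if total_words else 0.0


responses = ["i do not know", "i do not care"]
distinct_1 = distinct_n(responses, 1)   # 5 distinct unigrams over 8 words
distinct_2 = distinct_n(responses, 2)   # 4 distinct bigrams over 8 words
```

Because both responses share the prefix "i do not", the repeated unigrams and bigrams are counted once, which lowers both scores; a system that always produces the same bland response would score close to zero.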
|# LSTM layers in encoder||2||2|
|# LSTM layers in decoder||2||2|
|LSTM hidden state size||512||256|
|Learning rate decay||0.5||0.5|
|# Steps between validation||10000||5000|
|Patience of early stopping||10||10|
4.2.3. Parameter Settings
All models are implemented with PyTorch (https://pytorch.org/) and the MatchZoo (https://github.com/NTMC-Community/MatchZoo) toolkit. Hyper-parameters are tuned with the validation data. The hyper-parameter settings of the generation-based baselines and of the generation module in the proposed hybrid neural conversation model are shown in Table 4. For the hyper-parameter settings in the hybrid ranking module, we set the window size of the convolution and pooling kernels as . The number of convolution kernels is . The dropout rate is set to . The margin in the pairwise-ranking hinge loss is . The distant supervision signals and the number of positive samples per context in the hybrid ranking module are tuned with validation data. The distant supervision signal used is BLEU-1, and we treat top response candidates ranked by BLEU-1 as positive samples. All models are trained on a single Nvidia Titan X GPU by stochastic gradient descent with the Adam algorithm (Kingma and Ba, 2014). The initial learning rate is . The parameters of Adam, and are and respectively. The batch size is . The maximum conversation context/response length is . Word embeddings in the neural ranking model are initialized with the pre-trained GloVe (https://nlp.stanford.edu/projects/glove/) word vectors and updated during the training process.
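To make the role of the margin concrete, the following sketch computes a standard pairwise-ranking hinge loss over scored positive and negative candidates. It is an illustration under assumptions: the function name `pairwise_hinge_loss`, the averaging over all pairs, and the margin value of 1.0 are choices made for this example, not settings taken from the experiments, and any regularization term is omitted.

```python
from typing import List


def pairwise_hinge_loss(pos_scores: List[float], neg_scores: List[float], margin: float) -> float:
    """Mean of max(0, margin - s_pos + s_neg) over all positive/negative pairs."""
    losses = [
        max(0.0, margin - pos + neg)
        for pos in pos_scores
        for neg in neg_scores
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Illustrative scores and margin (assumed values).
loss = pairwise_hinge_loss(pos_scores=[2.0, 0.5], neg_scores=[1.0, -1.0], margin=1.0)
```

Only pairs in which the positive candidate fails to beat the negative candidate by at least the margin contribute to the loss; here that is the single pair (0.5, 1.0).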
4.3. Evaluation Results
4.3.1. Automatic Evaluation
We present evaluation results of different methods on the Twitter/Foursquare data in Table 5. We summarize our observations as follows: (1) If we compare retrieval-based methods and HybridNCM with pure generation-based methods such as Seq2Seq, Seq2Seq-Facts and KNCM-MTask-R, we find that retrieval-based methods and HybridNCM with a retrieval module achieve better performance in terms of BLEU and ROUGE-L. This verifies the competitive performance of retrieval-based methods for conversation response generation reported in previous related work (Song et al., 2018). (2) Both HybridNCM-RS and HybridNCM-RSF outperform all the baselines, including KNCM-MTask-R with multi-task learning proposed recently by Ghazvininejad et al. (2018), under BLEU and ROUGE-L. The results demonstrate that combining retrieved and generated response candidates can help produce better responses in conversation systems. Of the two variations of HybridNCM, HybridNCM-RSF achieves better BLEU and worse ROUGE-L. Overall, the performances of the two variations are similar. One possible reason is that the main gain over the baselines comes from the retrieval module and the re-ranking process in the hybrid ranking module, so the differences in the generation module do not change the results much. (3) For lexical diversity metrics like 1-gram/2-gram diversity, generation-based methods are far behind retrieval-based methods and HybridNCM, even for KNCM-MTask-R with external grounded knowledge and multi-task learning. This result shows that retrieved response candidates have much better diversity compared with response candidates generated by Seq2Seq models. Researchers have studied Maximum Mutual Information (MMI) objective functions (Li et al., 2015) in neural models in order to generate more diverse responses. It would be interesting to compare MMI models with IR models for conversation response generation. We leave this study to future work.
measured by the Student’s paired t-test. Note that we can only perform the significance test for ROUGE-L, since the other metrics are corpus-level metrics. The results of KNCM-MTask-R are cited directly from Ghazvininejad et al. (2018), since we used the same 1 million grounded Twitter conversation data set from that work. Thus we do not have the ROUGE-L result for this baseline method.
|HNCM-RS v.s. Seq2Seq||0.71/0.15/0.14||0.84/0.10/0.06|
|HNCM-RSF v.s. Seq2Seq||0.68/0.16/0.16||0.82/0.11/0.07|
|HNCM-RS v.s. Seq2Seq-F||0.70/0.15/0.15||0.80/0.12/0.08|
|HNCM-RSF v.s. Seq2Seq-F||0.65/0.19/0.17||0.77/0.15/0.09|
|HNCM-RS v.s. Retrieval||0.43/0.31/0.26||0.50/0.31/0.18|
|HNCM-RSF v.s. Retrieval||0.41/0.30/0.29||0.50/0.28/0.22|
4.3.2. Human Evaluation
Automatic evaluation of response generation is still a challenging problem. To complement the automatic evaluation results, we also perform human evaluation to compare the performance of different methods, following previous related work (Shang et al., 2015; Ghazvininejad et al., 2018; Song et al., 2018). We ask three educated annotators to perform the human evaluation. We randomly sample conversation contexts from the test data and instruct the annotators to rate the output responses of different systems. (We mainly performed human evaluation on our methods and three baselines: Seq2Seq, Seq2Seq-Facts and Retrieval. We did not include KNCM-MTask-R in the human evaluation since there is no open-source code or official implementation from Ghazvininejad et al. (2018). The results of KNCM-MTask-R in Table 5 are numbers cited from Ghazvininejad et al. (2018), since we used the same experimental data sets.) We hide the system IDs and randomly permute the output responses to rule out human bias. In the annotation guidelines, we ask the annotators to evaluate the quality of the output responses of different systems along the following two dimensions:
Appropriateness: evaluate whether the output response is appropriate and relevant to the given conversation context.
Informativeness: evaluate whether the output response can provide useful and factual information for the users.
Three different labels, “0” (bad), “+1” (neutral) and “+2” (good), are used to evaluate the quality of system output responses. Table 6 shows the comparison of different models under human evaluation. The table contains the mean score, the ratio of the three categories of labels, and the agreement scores among the three annotators. The agreement score is measured by Fleiss’ kappa (Fleiss et al., 1971), a statistical measure of inter-rater consistency. Most agreement scores are in the range from 0.2 to 0.5, which can be interpreted as “fair agreement” or “moderate agreement” (https://en.wikipedia.org/wiki/Fleiss%27_kappa). The annotators have relatively higher agreement scores for the informativeness of generation-based methods like Seq2Seq and Seq2Seq-Facts, since these methods are likely to generate short responses or even responses with fluency and grammatical problems.
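For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa from a table of label counts per rated item. It follows the standard textbook formula; the function name `fleiss_kappa` and the toy rating table are assumptions made for illustration and are not the annotation data from this study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned label j to item i.

    Every item must be rated by the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_labels = len(counts[0])
    # Share of all assignments that went to each label.
    p_label = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    # Per-item agreement: fraction of rater pairs that chose the same label.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_label)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy table: 4 responses, 3 annotators, labels 0 / +1 / +2 (invented for illustration).
toy_counts = [[3, 0, 0], [0, 3, 0], [0, 0, 3], [1, 1, 1]]
kappa = fleiss_kappa(toy_counts)
```

In the toy table the annotators agree fully on three responses and split three ways on the fourth, which gives an observed agreement of 0.75 against a chance agreement of 1/3.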
We summarize our observations on the human evaluation results in Table 6 as follows: (1) For the mean scores, we can see both HybridNCM-RS and HybridNCM-RSF achieve higher average rating scores compared with all baselines, in terms of both appropriateness and informativeness. These results from human evaluation verify that hybrid models could help improve the response generation performances of conversation systems. For baselines, the retrieval-based baseline is stronger than generation-based baselines. For HybridNCM-RS and HybridNCM-RSF, HybridNCM-RS achieves relatively higher average human rating scores with a small gap. (2) For the ratios of different categories of labels, we can see more than of output responses by HybridNCM-RS ( for HybridNCM-RSF) are labeled as “good (+2)” or “neutral (+1)” for appropriateness, which means that most output responses of hybrid models are semantically relevant to the conversation contexts. Generation-based methods like Seq2Seq and Seq2Seq-Facts perform worse than both the retrieval-based method and hybrid models. The retrieval-based method, although quite simple, achieves much higher ratios for the categories “good (+2)” and “neutral (+1)” compared with generation-based methods. For informativeness, the hybrid models HybridNCM-RS and HybridNCM-RSF are still the best, beating both generation-based baselines and retrieval-based baselines. These results show that the re-ranking process in the hybrid ranking module trained with distant supervision in hybrid conversation models can further increase the informativeness of results by promoting response candidates with more factual content. (3) For the statistical significance test, both HybridNCM-RS and HybridNCM-RSF outperform all baseline methods with measured by the Student’s paired t-test in terms of human evaluation scores. We also show the side-by-side human evaluation results in Table 7. 
The results clearly confirm that performances of hybrid models are better than or comparable to the performances of all baselines for most test conversation contexts.
4.4. Analysis of Top Responses Selected by Re-ranker
The number and percentage of top responses selected from retrieved/generated response candidates by the neural ranking model are shown in Table 8. We summarize our observations as follows: (1) Most picked results ( for HybridNCM-RS and for HybridNCM-RSF) are from the retrieved response candidates. This is reasonable because we have multiple retrieved response candidates but only one generated response candidate. In some cases, generated responses are preferred to retrieved responses. (2) Although the percentage of generated responses is not high, this does not mean we can directly use the results returned by the retrieval method. If we look at the row “PickedTop1BM25”, we find that only very few responses ranked first by BM25 are ranked first again by HybridNCM. Thus, HybridNCM changes the order of these response candidates significantly. In particular, the hybrid ranking module in HybridNCM performs the following two tasks: a) it re-evaluates and re-ranks the previously generated/retrieved responses to promote good responses; b) it tries to inject generated responses from Seq2Seq models into the retrieved results where possible. (3) We notice that response candidates generated by the Seq2Seq-Facts model are more likely to be picked than those generated by Seq2Seq. When a generated response contains rich factual content, the hybrid ranking module is more likely to pick it, which also helps boost the BLEU metrics.
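The bookkeeping behind this kind of analysis can be sketched as follows: each context has a mixed pool of retrieved ("r") and generated ("g") candidates, a scorer picks the top one, and the origins of the picks are tallied. The scorer `overlap_score` is a hypothetical word-overlap stand-in for the trained hybrid ranking module, and the contexts and candidates are invented for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (origin tag "r" or "g", response text)


def overlap_score(context: str, response: str) -> float:
    """Hypothetical stand-in scorer: share of response words that occur in the context."""
    context_words = set(context.lower().split())
    response_words = response.lower().split()
    if not response_words:
        return 0.0
    return sum(w in context_words for w in response_words) / len(response_words)


def pick_top_response(context: str, candidates: List[Candidate],
                      score_fn: Callable[[str, str], float]) -> Candidate:
    """Return the highest-scoring candidate from the mixed retrieved/generated pool."""
    return max(candidates, key=lambda cand: score_fn(context, cand[1]))


pools = [
    ("my phone keeps crashing",
     [("r", "try restarting the router"), ("r", "is your phone crashing often"), ("g", "i am sorry")]),
    ("where is my order",
     [("r", "thanks for the follow"), ("g", "we will check your order")]),
]
# Tally how many top picks came from retrieval versus generation.
origin_counts = Counter(pick_top_response(ctx, cands, overlap_score)[0] for ctx, cands in pools)
```

With a real ranking model in place of `overlap_score`, the same tally yields the counts and percentages of picked responses by origin.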
4.5. Impact of Distant Supervision Signals
We investigate the impact of different distant supervision signals on the response generation performance in Table 10. We find that distant supervision signals like BLEU-1 are quite effective for training the hybrid ranking module. Sentence-level BLEU is not a good choice for the distant supervision signal. The reason is that sentence-level BLEU is computed only from the n-gram precision statistics of a given sentence pair, so this score has a larger variance than corpus-level BLEU. In particular, sentence-level BLEU scores become very small smoothed values when there are no 4-gram or trigram matches between two sentences, which may happen frequently for short text pairs.
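The effect described above can be seen on a single short sentence pair by computing the clipped n-gram precisions that BLEU is built from. This is a simplified sketch: `ngram_precision` is a helper written for this example, it handles a single reference, and it omits the brevity penalty and smoothing of a full BLEU implementation. The sentence pair is invented.

```python
from collections import Counter
from typing import List


def ngram_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it appears in the reference.
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return matched / total


candidate = "please dm us your email address".split()
reference = "please send us your contact info".split()
precisions = [ngram_precision(candidate, reference, n) for n in range(1, 5)]
```

For this pair the unigram precision is 0.5 and the bigram precision is 0.2, while the trigram and 4-gram precisions are both zero. A score that takes the geometric mean over all four orders therefore collapses to zero, or to a small smoothed value, even though the unigram overlap still carries a usable signal.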
4.6. Impact of Ratios of Positive Samples
We further analyze the impact of the ratios of positive/ negative training samples on the response generation performance. Table 9 shows the results. The value of is the number of positive response candidates for each conversation context when we train the hybrid ranking module. When , we select one positive candidate from the ground truth responses in the training data, which is equivalent to the negative sampling technique. As increases, we construct the positive candidates by selecting one positive sample from the ground truth responses and positive samples from the top ranked candidates by distant supervision. We find that larger can improve the response generation performance. This is reasonable since larger means the model can observe more positive training samples and positive/ negative response pairs in the pairwise ranking loss minimization process. However, increasing the value of also adds risks of introducing noisy positive training data. Thus, there is a trade-off for choices of values of in the practice of training with distant supervision.
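The construction of positive and negative training samples can be sketched as below. Several things here are assumptions for illustration: the parameter name `num_positives`, the reading that the ground truth counts as one of the positives so that `num_positives - 1` come from the top-ranked candidates, and the use of clipped unigram precision as a simplified stand-in for BLEU-1 (no brevity penalty).

```python
from collections import Counter
from typing import List, Tuple


def unigram_precision(candidate: str, reference: str) -> float:
    """Simplified BLEU-1 stand-in: clipped unigram precision without brevity penalty."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[word]) for word, count in cand.items()) / total


def build_training_samples(ground_truth: str, candidates: List[str],
                           num_positives: int) -> Tuple[List[str], List[str]]:
    """Ground truth plus the top-ranked candidates become positives; the rest are negatives."""
    ranked = sorted(candidates, key=lambda c: unigram_precision(c, ground_truth), reverse=True)
    extra = max(num_positives - 1, 0)  # assumed: ground truth already counts as one positive
    positives = [ground_truth] + ranked[:extra]
    negatives = ranked[extra:]
    return positives, negatives


# Toy example, invented for illustration.
ground_truth = "please dm us your email address"
candidates = ["i do not know", "please dm us", "send us your email"]
positives, negatives = build_training_samples(ground_truth, candidates, num_positives=2)
```

With `num_positives=1` only the ground truth is positive and every candidate is a negative, which matches the negative-sampling case described above. Raising the value promotes candidates with progressively lower overlap, which is where the noise risk comes from.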
|Context||Donated to the [ENTITY] last night and now I have to listen to automated phone calls. It’s enough to make me want to cancel.|
|Method||r/g||System Output Response|
|Ground Truth||-||Ask them to put you on their internal dnc list. They will likely respect this, because future calls can get them charged.|
|Seq2Seq||g||I didn’t get it. I didn’t.|
|Seq2Seq-Facts||g||I’m sorry to hear that. Please dm us your email address so we can look into this. Thanks!|
|Retrieval||r||It’s a known issue in LA county. I just got an email from my dm and tech and it should be good in 30 minutes or so.|
|HNCM-RS||r||We’re listening and would like to know more and help with your experience. Please follow us so i can dm you our contact info. [ENTITY]|
|HNCM-RSF||g||We’re sorry to hear this. Please dm us if you need assistance. Please dm us your contact info so we can look into this.|
4.7. Examples and Case Study
We perform a case study in Table 11 on the outputs of different methods. In this example, the response produced by Seq2Seq is very general and does not provide any useful information for the user. Seq2Seq-Facts generates a much better response by injecting more factual content into the response generation process. The response returned by the Retrieval method is also relevant to the context. However, it provides very specific information like “LA county” and “30 minutes”, which may have a negative impact on the appropriateness of this response for some users. The responses produced by the hybrid models achieve a good balance between specificity and generalization. The response by HybridNCM-RS is from the retrieved results and the response by HybridNCM-RSF is from the generated results, which shows that both retrieval-based methods and generation-based methods have the capacity to produce good responses for certain contexts.
5. Conclusions and Future Work
In this paper, we perform a comparative study of retrieval-based and generation-based methods for building conversation systems. We propose a hybrid neural conversation model with the capability of both response retrieval and generation in order to combine the merits of these two types of methods. For the training of the hybrid ranking module, we propose a distant supervision approach to automatically infer labels for retrieved/generated response candidates. Experimental results with Twitter/Foursquare data show that the proposed model can outperform both retrieval-based and generation-based methods, including a recently proposed knowledge-grounded neural conversation model, under both automatic evaluation and human evaluation. Our research findings provide insights on how to integrate text retrieval and text generation models for building conversation systems. For future work, we would like to study reinforcement learning methods for response selection in order to directly optimize metrics like BLEU/ROUGE. User intent modeling for response ranking in conversation systems is another interesting direction to explore.
This work was primarily done during Liu Yang’s internship at Microsoft Research Redmond. This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF IIS-1715095. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. The authors would like to thank Kaixi Zhang for the contribution on data annotation and proofreading on this work.
- Bahdanau et al. (2014) D. Bahdanau, K. Cho, and Y. Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014).
- Bordes et al. (2017) A. Bordes, Y. Boureau, and J. Weston. 2017. Learning end-to-end goal-oriented Dialog. ICLR ’17.
- Chung et al. (2014) J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR (2014).
- Dhingra et al. (2017) B. Dhingra, L. Li, X. Li, J. Gao, Y. Chen, F. Ahmed, and L. Deng. 2017. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access. In ACL’17.
- Fleiss et al. (1971) J.L. Fleiss et al. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin 76, 5 (1971), 378–382.
- Galley et al. (2015) M. Galley, C. Brockett, A. Sordoni, Y. Ji, M. Auli, C. Quirk, M. Mitchell, J. Gao, and B. Dolan. 2015. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets. CoRR abs/1506.06863 (2015).
- Gao et al. (2018) J. Gao, M. Galley, and L. Li. 2018. Neural Approaches to Conversational AI. CoRR abs/1809.08267 (2018).
- Ghazvininejad et al. (2018) M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley. 2018. A Knowledge-Grounded Neural Conversation Model. In AAAI ’18.
- Guo et al. (2016) J. Guo, Y. Fan, Q. Ai, and W. B. Croft. 2016. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM ’16.
- Henderson (2015) M. Henderson. 2015. Machine Learning for Dialog State Tracking : a Review.
- Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997).
- Hu et al. (2014) B. Hu, Z. Lu, H. Li, and Q. Chen. 2014. Convolutional Neural Network Architectures for Matching Natural Language Sentences. In NIPS ’14.
- Huang et al. (2013) P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. P. Heck. 2013. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data. In CIKM ’13.
- Ji et al. (2014) Z. Ji, Z. Lu, and H. Li. 2014. An Information Retrieval Approach to Short Text Conversation. CoRR abs/1408.6988 (2014).
- Kingma and Ba (2014) D. P. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR (2014).
- Koehn et al. (2003) P. Koehn, F. Och, and D. Marcu. 2003. Statistical Phrase-based Translation. In NAACL ’03.
- Li et al. (2015) J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2015. A Diversity-Promoting Objective Function for Neural Conversation Models. CoRR abs/1510.03055 (2015).
- Li et al. (2016a) J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and W. B. Dolan. 2016a. A Persona-Based Neural Conversation Model. In ACL’16.
- Li et al. (2016b) J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao. 2016b. Deep Reinforcement Learning for Dialogue Generation. In EMNLP’16.
- Lowe et al. (2015) R. Lowe, N. Pow, I. Serban, and J. Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. CoRR abs/1506.08909 (2015).
- Luong et al. (2015) T. Luong, H. Pham, and C. D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP ’15.
- Mintz et al. (2009) M. Mintz, S. Bills, R. Snow, and D. Jurafsky. 2009. Distant Supervision for Relation Extraction without Labeled Data. In ACL ’09.
- Mitra et al. (2017) B. Mitra, F. Diaz, and N. Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In WWW ’17.
- Pandey et al. (2018) G. Pandey, D. Contractor, V. Kumar, and S. Joshi. 2018. Exemplar Encoder-Decoder for Neural Conversation Generation. In ACL ’18.
- Pang et al. (2016) L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. 2016. Text Matching as Image Recognition. In AAAI ’16.
- Ponte and Croft (1998) Jay M. Ponte and W. B. Croft. 1998. A Language Modeling Approach to Information Retrieval. In SIGIR ’98.
- Qiu et al. (2017) M. Qiu, F. Li, S. Wang, X. Gao, Y. Chen, W. Zhao, H. Chen, J. Huang, and W. Chu. 2017. AliMe Chat: A Sequence to Sequence and Rerank based Chatbot Engine. In ACL ’17.
- Ritter et al. (2011) A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-Driven Response Generation in Social Media. In ACL ’11.
- Robertson and Walker (1994) S. Robertson and S. Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In SIGIR ’94.
- Shang et al. (2015) L. Shang, Z. Lu, and H. Li. 2015. Neural Responding Machine for Short-Text Conversation. In ACL ’15.
- Song et al. (2018) Y. Song, C. Li, J. Nie, M. Zhang, D. Zhao, and R. Yan. 2018. An Ensemble of Retrieval-Based and Generation-Based Human-Computer Conversation Systems. In IJCAI ’18.
- Sordoni et al. (2015) A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In NAACL ’15.
- Tian et al. (2017) Z. Tian, R. Yan, L. Mou, Y. Song, Y. Feng, and D. Zhao. 2017. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In ACL ’17.
- Vinyals and Le (2015) O. Vinyals and Q. V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869 (2015).
- Wu et al. (2018) Y. Wu, F. Wei, S. Huang, Z. Li, and M. Zhou. 2018. Response Generation by Context-aware Prototype Editing. CoRR (2018).
- Wu et al. (2017) Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In ACL ’17.
- Xiong et al. (2017) C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power. 2017. End-to-End Neural Ad-hoc Ranking with Kernel Pooling. In SIGIR ’17.
- Yan et al. (2016a) R. Yan, Y. Song, and H. Wu. 2016a. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In SIGIR ’16.
- Yan et al. (2016b) R. Yan, Y. Song, X. Zhou, and H. Wu. 2016b. ”Shall I Be Your Chat Companion?”: Towards an Online Human-Computer Conversation System. In CIKM ’16.
- Yan et al. (2017) R. Yan, D. Zhao, and W. E. 2017. Joint Learning of Response Ranking and Next Utterance Suggestion in Human-Computer Conversation System. In SIGIR ’17.
- Yang et al. (2016) L. Yang, Q. Ai, J. Guo, and W. B. Croft. 2016. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model. In CIKM ’16.
- Yang et al. (2018) L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, and H. Chen. 2018. Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. In SIGIR ’18.
- Yang et al. (2017) L. Yang, H. Zamani, Y. Zhang, J. Guo, and W. B. Croft. 2017. Neural Matching Models for Question Retrieval and Next Question Prediction in Conversation. CoRR (2017).
- Yu et al. (2018) J. Yu, M. Qiu, J. Jiang, J. Huang, S. Song, W. Chu, and H. Chen. 2018. Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. In WSDM ’18.
- Zhang et al. (2018b) R. Zhang, J. Guo, Y. Fan, Y. Lan, J. Xu, and X. Cheng. 2018b. Learning to Control the Specificity in Neural Response Generation. In ACL ’18.
- Zhang et al. (2018a) Y. Zhang, M. Galley, J. Gao, Z. Gan, X. Li, C. Brockett, and B. Dolan. 2018a. Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization. CoRR (2018).
- Zhou et al. (2018) L. Zhou, J. Gao, D. Li, and H. Shum. 2018. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. CoRR (2018).
- Zhou et al. (2016) X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan. 2016. Multi-view Response Selection for Human-Computer Conversation. In EMNLP ’16.