Improving Retrieval Modeling Using Cross Convolution Networks And Multi Frequency Word Embedding

02/15/2018 ∙ by Guozhen An, et al. ∙ CUNY ∙ Columbia University ∙ Oath Inc.

To build a satisfying chatbot that can manage a goal-oriented multi-turn dialogue, accurate modeling of human conversation is crucial. In this paper we concentrate on the task of response selection for multi-turn human-computer conversation with a given context. Previous approaches are weak at capturing information from rare keywords that appear in the context, the correct response, or both, and struggle with long input sequences. We propose the Cross Convolution Network (CCN) and Multi Frequency word embedding to address both problems. We train several models using the Ubuntu Dialogue dataset, the largest freely available multi-turn dialogue corpus. We further build an ensemble model by averaging the predictions of multiple models. We achieve a new state-of-the-art on this dataset with considerable improvements over the previous best results.







1 Introduction

One of the primary objectives in Artificial Intelligence (AI) is the task of building a conversational agent that can naturally and coherently communicate with humans. A solution could significantly change the interaction between clients and customers, and has appealing applications for service providers in many different areas. This task can be broken down into different problems whose solutions move us toward accurate understanding and modeling of human conversation. The two mainstream models can be distinguished as follows: a Generative model, which tries to generate responses in multi-turn conversations Sordoni et al. (2015); Wen et al. (2015b, a); Shang et al. (2015), and a Retrieval model, which retrieves potential responses from a massive repository and selects the best one as the output Yan et al. (2016); Ji et al. (2014). While the first is more flexible and powerful, it is considerably harder to implement. Given the current state of AI, we are far from a generative model capable of long, multi-domain conversations.

Until recently, proposed solutions for building dialogue systems required significant hand-engineering of features. This limits the number of responses and situations in which a system can be deployed. More recently, researchers have attempted to apply machine and deep learning methods to create models that can learn the essential information in conversational data. One vital aspect of human conversation is the contextual and semantic relevance among sentences. Sequence modeling approaches have been shown to be effective at capturing this information. More specifically, Recurrent Neural Networks (RNN) built from Long Short-Term Memory (LSTM) units Hochreiter and Schmidhuber (1997) have been effectively utilized for extracting contextual and semantic information in other language-related problems such as speech recognition, state tracking, and image captioning Sutskever et al. (2014); Cho et al. (2014); Henderson et al. (2013); Graves et al. (2013); Xian and Tian (2017).

In this paper, we consider the problem of next response ranking for multi-turn human-computer conversation with a given context. The model provides candidates’ ranking and selects the one with the highest rank as the next utterance. This problem is an important and challenging task for the retrieval-based dialogue model.

Previous RNN-based approaches to response selection take the context and the response candidate as two separate word sequences and feed them to the RNN to obtain two embedding vectors. The response is then selected based on the similarity of the candidate embedding with the context embedding Lowe et al. (2015); Kadlec et al. (2015); Baudiš et al. (2016); Xu et al. (2016). There are two main shortcomings of previous solutions. The first concerns the method of representing words in the context and the candidates. More specifically, to efficiently represent the words of either the context or the response and feed them to the RNN, we use word vector embeddings. A word embedding is a vector that represents a word's semantic and syntactic features. Intuitively, we map words to dense real-valued vectors, where vectors that are relatively close represent words with similar or related meanings Mikolov et al. (2013). However, in order to have sufficient semantic and syntactic information for a word, and to keep the method computationally efficient, we require the word to appear in the corpus at least a certain number of times. Hence, rare words, which are often technical terms carrying important information, are missed by such a word embedding method. The second shortcoming relates to the performance of the RNN. LSTM units are vulnerable to losing information when the input sequence is long, which is the case in multi-turn response selection (see Section 2). Furthermore, in RNN-based models, the inputs to the RNN are the sequence of word embeddings of the entire context (or response), and the output is a single vector that represents the contextual and semantic dependencies among the words of the entire context (or response) and does not carry word-level information. To address the first problem, we utilize two word embedding layers: one for frequent words and one for rare words. To address the second issue, we extract the similarity between individual words in the context and the response. We note that when an utterance shares a rare word with some context, it is more likely to be the correct response for that context. Therefore, we design a layer that extracts the rareness of words shared by the context and the response.

In this paper, we propose a model that integrates sequence and word level information. We train our model on the Ubuntu Dialogue dataset, which consists of roughly one million two-way conversations extracted from the Ubuntu chat logs. Moreover, this data set is considered to be unstructured dialogues where there is no a priori logical representation for the information exchanged during the conversation Lowe et al. (2017) which is a desirable property to test a retrieval model.

We summarize our contributions as follows:

  • We design Cross Convolution Network (CCN) that gets two inputs (matrix representations of two sentences, feature matrices of two images, etc.) and extracts similarity of the inputs.

  • We propose Multi Frequency Word Embedding that efficiently captures both frequent and rare words of the corpus.

  • Our experimental results show a considerable improvement over previous results on Ubuntu Dialogue Corpus Lowe et al. (2015).

The remainder of this paper is structured as follows. In Section 2, we review previous work. A description of the dataset can be found in Section 3. In Section 4, we present our methods for capturing sequence-level and word-level information in detail. Section 5 focuses on the experimental setup, while our results are presented in Section 6. Finally, we conclude and discuss future research directions in Section 7.

2 Related Work

The problem of next response selection in multi-turn conversation is more general than a traditional question answering (QA) problem Yih et al. (2015); Yu et al. (2014). The prediction is made based on the entire conversation context, which does not necessarily include a question. In single-turn response selection, the model ignores the rest of the context and only leverages the last utterance to select a response Lu and Li (2013); Ji et al. (2014); Wang et al. (2015). Since an utterance can change the topic or negate/affirm the previous utterances, it is of paramount importance that models for response selection in multi-turn conversation have a certain understanding of the entire context. Moreover, next response selection defines a supervised dialogue system, since it incorporates explicit signals specifying whether a provided response is correct Lowe et al. (2016). The task is of interest because it admits natural evaluation metrics, namely recall and precision measures (see Section 3 for a detailed explanation). We use the Ubuntu Dialogue Corpus Lowe et al. (2015) to evaluate our retrieval-based model, since it is the most relevant public dataset for supervised dialogue systems Lowe et al. (2016).

The original paper that introduced the Ubuntu Dialogue dataset implemented a TF-IDF model in addition to neural network models with vanilla RNN and LSTM Lowe et al. (2015). Later, Kadlec et al. (2015) evaluated the performance of various LSTMs, Bi-LSTMs, and CNNs (Convolutional Neural Networks Kalchbrenner et al. (2014)) on the dataset and created an ensemble by averaging the predictions of multiple models. An RNN-CNN model combined with attention vectors was implemented by Baudiš et al. (2016). Further, Multi-view Response Selection Zhou et al. (2016) proposed an RNN-CNN model that integrates information from both the word-sequence view and the utterance-sequence view. A deep learning model incorporating background knowledge to enhance the sequence semantic modeling ability of LSTM was implemented in Xu et al. (2016) and achieved the then state-of-the-art result.

In spite of these efforts, the study of Lowe et al. (2016) found that automated dialogue systems built using machine and deep learning methods perform worse than human experts on the Ubuntu task. This confirms that further investigation of retrieval dialogue systems using this dataset is worthwhile, and motivates us to conduct this research.

3 Data

The Ubuntu Dialogue Corpus Lowe et al. (2015) is the largest freely available multi-turn based dialogue corpus which consists of almost one million two-way conversations extracted from the Ubuntu chat logs. We use the second version of this dataset in this paper. The dataset was preprocessed as follows:

  • Named entities were replaced with corresponding tags (name, location, organization, url, path).

  • Two special symbols are used to denote the end of utterances and the end of turns, respectively.

  • The training set consists of tuples of the form (context, response, flag), where the flag indicates whether the provided response is the correct one for the context. For any instance of the form (context, response, 1), the set includes an instance of the form (context, response′, 0), where response′ is an utterance randomly sampled from the entire data, in order to create a balanced dataset.

  • The test set, approximately 20k instances, is held out from the whole dataset. Each test instance consists of a context followed by 10 candidate responses, with the first candidate being the correct one. The other responses are drawn randomly from the entire corpus Lowe et al. (2015). Furthermore, a validation set of the same size and structure is provided.

The system is required to rank the candidate responses and output the highest-ranking one. We note that some of the sampled candidates labeled as incorrect may in fact be relevant to the context, and hence could be considered correct. We may therefore count the system's ranking as correct if the true response is among the top k candidates; this quantity is denoted by R@k. Most previous papers report performance for several values of k Lowe et al. (2015); Zhou et al. (2016); Kadlec et al. (2015).
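As a concrete illustration of this metric, the sketch below (a hypothetical helper, not from the paper) computes 1 in n R@k from per-candidate model scores, assuming each instance's correct response sits at index 0 of its candidate list, as in the dataset layout described above:

```python
def recall_at_k(all_scores, k):
    """1 in n R@k: the fraction of instances whose correct response
    (index 0 of each score list, per the dataset layout) is ranked
    among the top k candidates by the model's scores."""
    hits = 0
    for scores in all_scores:
        # rank of candidate 0 = number of candidates scored strictly higher
        rank = sum(1 for s in scores if s > scores[0])
        if rank < k:
            hits += 1
    return hits / len(all_scores)
```

For example, with two instances scored `[0.9, 0.1, 0.3]` and `[0.2, 0.8, 0.1]`, R@1 is 0.5 (the second instance ranks a distractor first) while R@2 is 1.0.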

4 Method

In this section, we provide details on the networks and layers we use to build our models. We start by introducing the Cross Convolution Network, which captures the semantic similarity of the context and response. We then elaborate on Multi Frequency Word Embedding, followed by an explanation of the LSTM network and the Common Words Frequency layer.

4.1 Cross Convolution Network

In many instances, a handful of words reveal the purpose of a conversation; therefore, one may expect to see the exact same words from the context, or their derivations, in the correct response. Our experiments show that RNN models fall short of capturing all of this information, especially when the input sequence is long. Motivated by this, we design a Cross Convolution Network that can, in principle, be deployed in any problem of classifying a pair of objects. We note that CCN differs from the architecture proposed in Wan et al. (2016), which utilizes a Bi-LSTM and requires learning the parameters of an interaction tensor to capture the semantic matching of two sentences.

Figure 1: Cross Convolution Network (CCN). Two matrices, A and B, are the inputs to the network. CCN computes the convolution of A over B and passes the output to a k-Max Pooling layer, which extracts the k largest outputs of each column. A dense layer is then applied to calculate a number measuring the similarity of A and B.

At a high level, a Cross Convolution Network accepts two matrices, A and B, and computes the convolution of A over B. k-Max Pooling is then applied to the output matrix to take the k largest elements of each of its columns. The output is then fed to a dense layer that measures the similarity of A and B. As in a Convolutional Neural Network, we need to specify the window and stride sizes for computing the convolution of the inputs. Figure 1 shows the structure of the Cross Convolution Network.

For the task of response selection, we include the following layers to extract the word level information in the context and in the corresponding response:

Dot Product Layer. Given the sequences of embedded word vectors of a context and a response, for each word in the response we calculate its inner product with every word in the context. In other words, we calculate the convolution of the context with each of the response words (as convolutional filters), with window and stride sizes equal to one.

k-Max Pooling and Dense Layer. Given the output of the Dot Product Layer, we pick the k maximum values for each filter. We then use a dense layer (a DNN with some activation function) to calculate the probability that the corresponding label of the instance is one.

In the matrix representation of the context and response, in which the i-th column of the context (response) matrix is the embedding vector representing the i-th word in the context (response), we formulate the layer operation as

D = R^T C,  s = f(w^T K_k(D) + b),

where R is the response matrix, C is the context matrix, and D is the dot product output. K_k is the k-Max Pooling function, which picks the k maximum values of each column of D. Moreover, w and b are the trainable weight vector and bias of the dense layer, respectively, and f can be any activation function. k is a hyper-parameter of the model, and contexts and responses shorter than the fixed maximum length are padded with zero vectors.
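The three layers above can be sketched in a few lines of numpy. The matrix names and shapes here are illustrative assumptions, not the authors' implementation; columns of R and C are word embedding vectors:

```python
import numpy as np

def ccn_score(R, C, w, b, k):
    """Cross Convolution sketch: R (d x m) response matrix, C (d x n)
    context matrix, columns are word embeddings. w is a dense-layer
    weight vector of length k * n, b is a scalar bias."""
    D = R.T @ C                          # dot-product layer: D[i, j] = <r_i, c_j>
    topk = -np.sort(-D, axis=0)[:k, :]   # k-max pooling over each column of D
    feats = topk.flatten()
    # sigmoid dense layer producing a similarity score in (0, 1)
    return float(1.0 / (1.0 + np.exp(-(w @ feats + b))))
```

With zero dense weights the score is exactly 0.5 regardless of the inputs, which is a convenient sanity check before training.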

4.2 Multi Frequency Word Embedding

To obtain high quality representations of words that capture syntactic and semantic word relationships, we use two types of word embedding layers in our models. As noted in Lowe et al. (2017), failure to understand the semantic similarity of context and response is the largest source of error for the Dual-LSTM model (see Section 4.3). We observed that our Dual-LSTM model performed worse when rare words appeared in either or both the context and the response. One potential explanation is that, when training the word embeddings, rare words are removed for computational efficiency. However, this weakens the word embeddings due to the loss of information incurred by ignoring rare words. In order to capture these rare word relations, we use two word embedding layers instead of one, which we refer to as the low frequency and high frequency layers.

Given the word sequences of a context and a response, words are mapped into embedding vectors. While the embedding dimension is a hyper-parameter that needs to be specified, the word embeddings can be initialized with random vectors or with pre-trained word vectors. We use two independent word embedding layers inside a single model. First, we count the total number of appearances of each word in the training set of contexts and responses, then filter frequent words and rare words from each context and response during training. The high frequency word embedding layer is the same as the word embedding of Lowe et al. (2015); it captures word relations among frequent words and is fed to the LSTM at a later stage to obtain internal representations of the context and response. The low frequency word embedding layer is trained using only rare words from the training set of contexts and responses. Denoting the high frequency word filter, the low frequency word filter, and the embedding layer by F_h, F_l, and E, respectively, for an input word sequence x we have

e_h = E_h(F_h(x)),  e_l = E_l(F_l(x)),

where E_h (E_l) is the word embedding for the high (low) frequency embedding layer, trained on the high (low) frequency words of the entire training set.
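A minimal sketch of the frequency-based routing described above, with a hypothetical count threshold and toy embedding tables standing in for the trained layers:

```python
from collections import Counter

def build_freq_filters(corpus_tokens, threshold):
    """Split the vocabulary into high- and low-frequency sets by a raw
    count threshold (the paper tunes this threshold; the value here is
    an assumption)."""
    counts = Counter(corpus_tokens)
    high = {w for w, c in counts.items() if c >= threshold}
    return high, set(counts) - high

def embed_sequence(tokens, high, low, emb_high, emb_low):
    """Route each token to the embedding table that owns it; each layer's
    filter simply drops the tokens belonging to the other table."""
    hi = [emb_high[t] for t in tokens if t in high]
    lo = [emb_low[t] for t in tokens if t in low]
    return hi, lo
```

In the full model, the two routed embedding sequences would each feed their own LSTM; here plain lists of vectors stand in for that stage.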

4.3 LSTM

4.3.1 Context and Response Embedding

Long Short-Term Memory (LSTM) is well known for capturing information from long sequences. Inspired by Lowe et al. (2015), we use two LSTM networks with shared weights to produce the final representations of the context and response, feeding the word embeddings one at a time to the respective LSTM. Word embeddings are initialized using the pre-trained GloVe word vectors Pennington et al. (2014) and updated during the training phase. We use one hidden layer for each LSTM with a fixed output size. Denoting the LSTM layer by LSTM, we have

c = LSTM(context embeddings),  r = LSTM(response embeddings),

where c and r are the final hidden states of the LSTM layers. We refer to this model as Dual-LSTM. This is the baseline model proposed in Lowe et al. (2015), where the response is selected based on the similarity of c and r as measured by the inner product of the two embeddings.

Finally, we apply a dense layer to the hidden states of the LSTM layers and calculate the probability of the response being the correct one. More precisely, we compute

p = σ(c^T M r),

where M is a trainable weight matrix.
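The bilinear scoring used by the Dual-LSTM baseline (Lowe et al., 2015) is compact enough to write out directly; here c, r, and M are stand-ins for the learned context embedding, response embedding, and weight matrix:

```python
import numpy as np

def dual_lstm_prob(c, r, M):
    """sigma(c^T M r): probability that response embedding r is the
    correct continuation of context embedding c, under a trainable
    bilinear map M."""
    return float(1.0 / (1.0 + np.exp(-(c @ M @ r))))
```

With M set to the identity this reduces to a sigmoid of the plain inner product, which matches the similarity interpretation given above.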

4.3.2 Common Words Embedding

Another issue raised by Lowe et al. (2017) is that direct word copying between the context and the true response is not captured by the Dual-LSTM model. To overcome this, we extract the list of common words of the context and response and feed the embeddings of these common words to an LSTM network, using the same word embedding layer as the other inputs of the Dual-LSTM model:

h_cw = LSTM(embeddings of common words).

Finally, we apply a dense layer to the hidden state of this LSTM and calculate a score for the corresponding response being correct:

s_cw = v^T h_cw,

where v is a trainable weight vector.

When both embedding layers are present, we combine the scores of the common word embedding layer and the Dual-LSTM layer and use a sigmoid function to calculate the probability of the response being the correct one:

p = σ(α s_cw + β s_dual),

where α and β are trainable weights.

4.4 Context-Response Common Words Frequency

It is a well-known fact that words that are more frequent (such as the, is, and that) contain less information than rarer words (such as technical words in the Ubuntu Corpus). We observe that when an utterance shares a rare word with some context, it is more likely to be the correct response for that context. To capture this, we first create a table that stores word occurrence counts. We use this table to compute a quantity that is the sum of the reciprocals of the occurrence counts of the words common to a context and a response. More precisely, denoting by n_w the number of times word w appears in the training set, for any context c and response r we define

CWF(c, r) = Σ_{w ∈ c ∩ r} 1 / n_w,

where c ∩ r denotes the set of words appearing in both the context and the response.
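The layer reduces to a few lines of code; the helper below is a sketch using a plain occurrence-count dictionary built from the training set:

```python
def cwf_score(context_tokens, response_tokens, train_counts):
    """Sum of reciprocal training-set occurrence counts over the words
    shared by the context and the response; rare shared words dominate
    the score, frequent ones contribute almost nothing."""
    common = set(context_tokens) & set(response_tokens)
    return sum(1.0 / train_counts[w]
               for w in common if train_counts.get(w, 0) > 0)
```

For example, a shared word seen 100 times contributes 0.01 while one seen twice contributes 0.5, so a single shared rare word outweighs many shared stop words.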

We note that, unlike the TF-IDF model, which computes TF-IDF vectors of both the context and the response and calculates the cosine similarity between the two vectors, our proposed layer only considers the common words of the context and the response, and is intended to reflect how informative a shared word is.

5 Experiment

In this section, we provide details on our experiments including data preparation, experimental settings, the models we build using the networks and layers introduced in Section 4, and training parameters and functions.

5.1 Data Preparation

We preprocessed the Ubuntu dataset by normalizing every context and response using TweetMotif O'Connor et al. (2010). We use the tokenization function of TweetMotif to treat hashtags, @-replies, abbreviations, strings of punctuation, emoticons, and unicode glyphs (e.g., musical notes) as tokens. In order to make our results comparable to the original ones, we kept the train, validation, and test sets the same as the original sets provided by Lowe et al. (2015).

5.2 Experimental Setting

The experiments were executed on an Amazon AWS p2.xlarge machine with an NVIDIA Tesla K80 GPU. We use Keras Chollet et al. (2015) to implement all our models. All models were trained using the Root Mean Square Propagation optimizer (RMSProp) Hinton et al. (2012) with a fixed learning rate and no decay, and a fixed batch size during training.

5.3 Model Training

We trained two different models, each combining several of the methods above:

  • Apply Multi Frequency Word Embedding and Common Words Embedding to the LSTM structure. We refer to this model as MFCW-LSTM. Figure 2 depicts the model.

Figure 2: Diagram of the MFCW-LSTM model. One input is the list of common words of the context and response pair. Every word of the context and response is filtered by the high frequency word filter and the low frequency word filter, then fed to the high frequency and low frequency word embedding layers, respectively. All the embedded results are fed into the corresponding LSTMs word by word. The LSTMs have tied weights for the output of each word embedding layer, respectively.
  • Implement the Cross Convolution Network together with the LSTM structure. The high frequency word embedding layer is the only embedding we use here. We refer to this model as CCN-LSTM. Figure 3 depicts the model.

Figure 3: Diagram of the CCN-LSTM model. The LSTMs have tied weights, and the two word embedding layers do not share weights. The dashed box is the Context-Response Words Relation network with the Dot Product, k-Max Pooling, and Dense layers.

All LSTM structures use the same number of hidden units. Contexts and responses are truncated or zero-padded to a fixed maximum number of words, and the word embedding size matches that of the GloVe embeddings.

For MFCW-LSTM, we tried different thresholds for the high/low frequency word boundary and selected the best-performing one. We also learned the LSTM parameters with both shared weights and separate weights. For CCN-LSTM, we used two separate word embedding layers to feed words to the Context-Response Words Relation and LSTM networks. We tried two values of the parameter k for different models, but did not see any improvement from the larger value over the smaller one.

To predict the label of a response for a context during the training phase, we take a weighted sum of the response scores calculated by each of the networks in the considered model and apply a sigmoid function, which yields a number between 0 and 1. We then penalize the predicted label using a squared error loss function. The Context-Response Common Words Frequency layer is not used in the models during training; we discuss this further in the Result section (Section 6).


5.4 Ensemble

Ensembles of multiple models can achieve better predictive performance than any of the constituent models Opitz and Maclin (1999); Polikar (2006); Sollich and Krogh (1996). Similar to Kadlec et al. (2015), we found that averaging the prediction results of multiple models gives a decent improvement. Our best classifier is an ensemble of 16 MFCW-LSTMs and 4 CCN-LSTMs.
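The averaging step itself is trivial; a sketch of how the ensemble ranks the candidates of one test instance, assuming each model outputs one score per candidate:

```python
def ensemble_rank(model_scores):
    """Average each candidate's score over all models and return the
    index of the highest-ranking candidate.

    model_scores: list of per-model score lists, one score per candidate.
    """
    n = len(model_scores[0])
    avg = [sum(m[i] for m in model_scores) / len(model_scores)
           for i in range(n)]
    return max(range(n), key=avg.__getitem__)
```

Two models that disagree individually can still agree after averaging, which is where the ensemble gain comes from.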

6 Result

In this section we present our experimental results in two subsections. First, we report our models' performance and discuss the results. We then compare our best model's performance with previous work.

6.1 Evaluation

We use the same evaluation metric as Lowe et al. (2015), namely 1 in 10 R@k. Among the 10 response candidates provided in the evaluation and test sets, 1 positive and 9 negative responses are used. The model ranks the responses, and a prediction is considered correct if the correct response is in the top k candidates. We report 1 in 10 R@k for k = 1, 2, and 5.

Figure 4: Loss change for the validation set.
Figure 5: Accuracy change for the validation set. We found that accuracy starts decreasing after the third epoch for various models.

We choose the best models using accuracy on the validation set (Figure 4 and Figure 5). The performance of these models on the test set is reported in Table 1. We reproduce the result of the Dual-LSTM model as our baseline Lowe et al. (2015) and compare our models to it. We use the same default hyper-parameter settings for Dual-LSTM as for our other models, and initialize word embeddings using GloVe word vectors Pennington et al. (2014). The reproduced performance of the Dual-LSTM model here is better than the original result Lowe et al. (2017), since we use the high frequency word embedding and the preprocessed dataset.

MFCW-LSTM and CCN-LSTM are the models described in Subsection 5.3. The CCN-LSTM model has two parallel dense layers (linear and sigmoid) in the Context-Response Words Relation network, as in Figure 3. Both of these models outperform the Dual-LSTM model by approximately 9%. We also combined MFCW-LSTM and CCN-LSTM into a single model, but the result was not better than either one alone.

We now investigate the effect of the Context-Response Common Words Frequency layer (see Subsection 4.4 for details) on response prediction. The scaled scores computed using Equation 8 are added to the probabilities output by the MFCW-LSTM and CCN-LSTM models during validation and test. In particular, to predict the correct response on the validation and test datasets, we first compute the score of each of the ten candidate responses using our model. We then calculate the CWF score of each response, scale it, and add it to the model output. We then rank the responses by this final score and choose the highest-ranking one as the output. The scaling factor is optimized on the validation set. The combined models' results can be found in the MFCW-LSTM-CWF and CCN-LSTM-CWF rows of Table 1, where CWF stands for common words frequency. As we can see, the CWF layer improves the performance of both base models, MFCW-LSTM and CCN-LSTM. We see a noticeable improvement for CCN-LSTM, which can be attributed to the fact that the CCN-LSTM model does not include rare words in its decision.
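The reranking step just described can be sketched as follows; the scaling factor alpha is a validation-tuned hyper-parameter, and the names are illustrative:

```python
def rerank_with_cwf(model_probs, cwf_scores, alpha):
    """Add each candidate's scaled CWF score to the model's probability
    and return the index of the top-ranked candidate.

    model_probs: one probability per candidate from the trained model.
    cwf_scores:  one common-words-frequency score per candidate.
    alpha:       scaling factor optimized on the validation set.
    """
    final = [p + alpha * c for p, c in zip(model_probs, cwf_scores)]
    return max(range(len(final)), key=final.__getitem__)
```

With alpha set to zero this degenerates to ranking by the model alone, which makes it easy to ablate the CWF contribution.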

Model 1 in 10 R@1 1 in 10 R@2 1 in 10 R@5
LSTM 57.6% 75.3% 94.5%
MFCW-LSTM 66.5% 80.4% 95.4%
CCN-LSTM 66.3% 80.8% 95.6%
MFCW-LSTM-CWF 67.3% 81.2% 95.6%
CCN-LSTM-CWF 69.0% 82.2% 96.0%
Ensemble   72.7% 85.8% 97.1%
Table 1: Results of our models on the Ubuntu Dialogue Corpus test set for different recall measures. Numbers in bold are our best model's performance.
Model 1 in 10 R@1 1 in 10 R@2 1 in 10 R@5
Dual-LSTM Lowe et al. (2017) 55.2% 72.1% 92.4%
RNN-CNN Baudiš et al. (2016) 67.2% 80.9% 95.6%
r-LSTM Xu et al. (2016) 64.9% 78.5% 93.2%
Ensemble Kadlec et al. (2015) 68.3% 81.8% 95.7%
SMN Wu et al. (2017) 72.6% 84.7% 96.2%
Our Best Model 72.7% 85.8% 97.1%
Table 2: Performance comparison of our best model and recent papers. Numbers in bold indicate a statistically significant improvement over the previous baselines.

6.2 Comparison

Table 2 compares the performance of our best model with recent papers. Since we used the Ubuntu Dialogue Corpus v2 dataset, we compare our results to other works based on the same version. Recently, Wu et al. (2017) achieved a decent improvement over the previous state-of-the-art. Our best model, the ensemble, outperforms SMN Wu et al. (2017) by 0.1%, 1.1%, and 0.9% on the 1 in 10 R@1, 1 in 10 R@2, and 1 in 10 R@5 metrics, respectively (Table 2). We thereby set a new state-of-the-art of 72.7%, 85.8%, and 97.1%.

7 Conclusion and Future Work

In this paper, we considered the problem of next response selection for multi-turn conversation. Motivated by the large gap between machine and expert performance on this task for the Ubuntu Dialogue Corpus, we presented new networks and layers and evaluated our models on this dataset. We proposed the Cross Convolution Network (CCN), which is potentially useful for the general task of classifying a pair of objects. We implemented CCN combined with LSTM as one of our single models. The CCN captures word-level information on word pairs of the context and response, while the LSTM captures information on the entire context and the entire response. We also investigated the effect of Multi Frequency Word Embedding and Common Words Embedding combined with LSTM as our other model. The Multi Frequency Word Embedding embeds both rare and frequent words efficiently, and is able to capture important low frequency keywords without adding much computational complexity. Our experimental results showed a promising improvement over previous models, especially when we ensemble our models to select the next response.

For future work, we will explore applying these findings to other multi-turn response selection datasets and related problems, and evaluate whether the gains achieved here are orthogonal to other methods for improving performance. We also see potential for extending our framework to generative models for dialogue systems.

8 Acknowledgments

We thank Morten Pedersen and David Guy Brizan for their contributions to this study. We gratefully acknowledge financial support for this work by AOL of OATH.