Utterance-to-Utterance Interactive Matching Network for Multi-Turn Response Selection in Retrieval-Based Chatbots

11/16/2019 · by Jia-Chen Gu, et al. · USTC · Anhui USTC iFLYTEK Co.

This paper proposes an utterance-to-utterance interactive matching network (U2U-IMN) for multi-turn response selection in retrieval-based chatbots. Different from previous methods following context-to-response matching or utterance-to-response matching frameworks, this model treats both contexts and responses as sequences of utterances when calculating the matching degrees between them. For a context-response pair, the U2U-IMN model first encodes each utterance separately using recurrent and self-attention layers. Then, a global and bidirectional interaction between the context and the response is conducted using the attention mechanism to collect the matching information between them. The distances between context and response utterances are employed as a prior component when calculating the attention weights. Finally, sentence-level aggregation and context-response-level aggregation are executed in turn to obtain the feature vector for matching degree prediction. Experiments on four public datasets showed that our proposed method outperformed baseline methods on all metrics, achieving a new state-of-the-art performance and demonstrating compatibility across domains for multi-turn response selection.

I Introduction

Building a chatbot that can converse naturally with humans on open-domain topics is a challenging yet intriguing problem in artificial intelligence. Recently, human-computer conversation has attracted increasing attention due to its promising potential and commercial value [2, 3, 4]. Existing approaches to building chatbots include generation-based methods [5, 6, 7] and retrieval-based methods [8, 9, 10, 11, 12, 13, 14]. Response selection, which aims to select the best-matched response from a set of candidates given the context of a conversation, is the key technique for building retrieval-based chatbots.

Conversation
Speaker A: How do I put myself in desktop in CUI? _eou_
Speaker A: I mean CLI. _eou_ _eot_
Speaker B: cd ~/ desktop. _eou_ _eot_
Speaker A: Is that the right code? cd / desktop? _eou_ _eot_
Response Candidates
Speaker B: No. read it again. _eou_ Are you root? _eou_ That’s why new Ubuntu man’s method will work for you. _eou_
Speaker B: sebdc is talking nonsense. _eou_ You do not need cpufreqd. _eou_
TABLE I: An example of a conversation in the Ubuntu V2 dataset whose response is composed of multiple utterances. “_eou_” denotes end-of-utterance, and “_eot_” denotes end-of-turn.

In recent years, neural networks have been adopted to calculate the matching degrees between a context and its response candidates for response selection. Existing studies on neural network-based multi-turn response selection follow either context-to-response matching or utterance-to-response matching frameworks. The former adopts a coarse granularity for both contexts and responses that concatenates all utterances in a context or in a response into a single word sequence for matching degree calculation [8, 9, 10]. The latter adopts a fine granularity for contexts that separates a context into utterances but still concatenates all utterances in a response [11, 12, 13]. However, both contexts and responses may contain multiple utterances in the response selection task, as illustrated in Table I. Both frameworks mentioned above neglect the relationships among the utterances in a response.

Therefore, this paper proposes a neural network model named the utterance-to-utterance interactive matching network (U2U-IMN) for multi-turn response selection in retrieval-based chatbots. This model follows a new utterance-to-utterance (U2U) matching framework in order to deal with the situation in which both contexts and responses may contain multiple utterances. Different from the context-to-response matching and utterance-to-response matching frameworks, the U2U matching framework treats both contexts and responses as sequences of utterances when calculating the matching degrees between them. Therefore, the U2U-IMN model first encodes each utterance separately for a context-response pair. A previous study on natural language inference (NLI) [15] found that performing interactions between sentence pairs can provide useful matching information. Inspired by this, an attention-based interaction between the context and the response is conducted to collect the matching information between them. Here, the interaction is global (i.e., crossing utterance boundaries) and bidirectional (i.e., considering both context-to-response and response-to-context directions) in order to enrich the relevance representations of contexts and responses. The distances between context and response utterances are employed as a prior component when calculating the attention weights in order to distinguish the semantic contributions of different utterances in a context. Finally, sentence-level aggregation and context-response-level aggregation are executed in turn to obtain the feature vector for matching degree prediction.

Our proposed methods were evaluated on two English datasets, the Ubuntu Dialogue Corpus V1 [8] and Ubuntu Dialogue Corpus V2 [10], along with two Chinese datasets, the Douban Conversation Corpus [11] and E-commerce Dialogue Corpus [13], which are all public datasets widely used in studies on multi-turn conversation. The results showed that our proposed method outperformed baseline methods on all metrics, achieved a new state-of-the-art performance, and demonstrated compatibility across domains for multi-turn response selection.

In summary, the main contributions of this paper are twofold. First, this paper proposes a neural network model named U2U-IMN to deal with the situation in which both contexts and responses may contain multiple utterances. In this model, a matching module with attention-based global and bidirectional interactions is designed to collect the matching information between context and response utterances. Second, experimental results demonstrate that our proposed method achieves a new state-of-the-art performance on four public datasets for multi-turn response selection.

II Related Work

Chatbots aim to engage users in human-computer conversations in the open domain and are currently receiving increasing attention because they can target unstructured dialogue without a priori logical representation of the information exchanged during the conversation. Existing work on building chatbots includes generation-based methods [5, 6, 7, 16, 17] and retrieval-based methods [8, 9, 10, 11, 12, 13]. Generation-based models maximize the probability of generating a response given the previous dialogue. This approach enables the incorporation of rich context when mapping between consecutive dialogue turns. Retrieval-based chatbots have the advantage of generating informative and fluent responses because they select a proper response for the current conversation from a repository by means of response selection algorithms.

Early studies on retrieval-based chatbots focused on single-turn conversation [18, 19]. Recently, researchers have extended their attention to multi-turn conversation, which is more practical for real applications. A straightforward approach to multi-turn conversation is to match a response with the literal concatenation of context utterances [8, 9, 10]. Later, a multi-view model [20], including an utterance view and a word view, was studied. Wu et al. [11] proposed the sequential matching network (SMN), which first matched the response with each context utterance and then accumulated the matching information using a recurrent neural network (RNN). Zhang et al. [13] employed self-matching attention to route the vital information in each utterance based on SMN. A method that constructs representations at different granularities with stacked self-attention [12] has also been presented.

Our proposed U2U-IMN model has three main differences from the studies mentioned above. (1) U2U-IMN adopts a more fine-grained utterance-to-utterance (U2U) matching framework, while previous studies followed the framework of either context-to-response matching or utterance-to-response matching. (2) U2U-IMN derives the matching information between contexts and responses through global and bidirectional interactions, while the interactions used in previous studies were usually local and unidirectional [11]. (3) U2U-IMN employs the distances between context and response utterances as a prior component for calculating the attention weights in the interactive matching module.

III Utterance-to-Utterance Interactive Matching Network

III-A Model Overview

Given a dialogue dataset $\mathcal{D}$, an example of the dataset can be represented as a triple $(c, r, y)$. Specifically, $c = \{u_1, u_2, \dots, u_{n_c}\}$ represents a context with $\{u_m\}_{m=1}^{n_c}$ as its utterances and $n_c$ as its utterance number. Similarly, $r = \{r_1, r_2, \dots, r_{n_r}\}$ represents a response candidate with $\{r_n\}_{n=1}^{n_r}$ as its utterances and $n_r$ as its utterance number. Here, both the context and the response may be composed of multiple utterances, and the utterances in $c$ and $r$ are both chronologically ordered. $y \in \{0, 1\}$ denotes a label. $y = 1$ indicates that $r$ is a proper response for $c$; otherwise, $y = 0$. Our goal is to learn a matching model $g(c, r)$ from $\mathcal{D}$. For any context-response pair $(c, r)$, $g(c, r)$ measures the matching degree between $c$ and $r$. We learn $g(c, r)$ by minimizing the sigmoid cross-entropy on $\mathcal{D}$. Let $\Theta$ denote the set of model parameters. Then, the objective function of learning can be formulated as

$\mathcal{L}(\mathcal{D}, \Theta) = -\sum_{(c, r, y) \in \mathcal{D}} \big[ y \log g(c, r) + (1 - y) \log (1 - g(c, r)) \big].$   (1)
Fig. 1: The overall architecture of our proposed U2U-IMN model.

The U2U-IMN model is designed to calculate the matching degree for a context-response pair. It is composed of a word representation module, a sentence encoding module, an interactive matching module, an aggregation module and a prediction module, as shown in Fig. 1. Details about each module are provided in the following subsections.
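To make Eq. (1) concrete, the following is a minimal NumPy sketch of the sigmoid cross-entropy objective over a batch of context-response pairs; the function and variable names are illustrative rather than taken from the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def matching_loss(logits, labels):
    """Sigmoid cross-entropy of Eq. (1) over a batch of context-response pairs.

    logits: unnormalized matching scores g(c, r), shape (batch,)
    labels: 1 if r is a proper response for c, else 0, shape (batch,)
    """
    probs = sigmoid(logits)
    eps = 1e-12  # numerical stability
    return -np.mean(labels * np.log(probs + eps) + (1 - labels) * np.log(1 - probs + eps))

# toy usage
print(matching_loss(np.array([2.0, -1.0]), np.array([1, 0])))
```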

III-B Word Representation Module

One challenge of word representation for dialogue is the large number of out-of-vocabulary (OOV) words. To address this issue, we combine general pretrained word embeddings with those estimated on a task-specific training set [21]. To further enhance the word embeddings, a convolutional neural network (CNN) is employed to model morphology information at the character level [22].

Formally, the word embeddings of the $m$-th utterance in a context and the $n$-th utterance in a response candidate are denoted as $\mathbf{U}_m = [\mathbf{u}_{m,1}, \dots, \mathbf{u}_{m,l_{u_m}}]$ and $\mathbf{R}_n = [\mathbf{r}_{n,1}, \dots, \mathbf{r}_{n,l_{r_n}}]$, respectively, where $l_{u_m}$ and $l_{r_n}$ are utterance lengths. Each $\mathbf{u}_{m,k}$ or $\mathbf{r}_{n,k}$ is an embedding vector of $d$ dimensions.
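As a rough illustration of this module, the sketch below concatenates pretrained, task-specific and character-level CNN embeddings for a single word. The lookup tables, the 16-dimensional character embeddings and the single convolution window are assumptions for brevity; the paper uses three window sizes with 50 filters each.

```python
import numpy as np

# Hypothetical lookup tables; dimensions follow Section IV-C (300-d GloVe,
# 100-d task-specific Word2Vec, 50 filters per character window).
glove = {"router": np.random.randn(300)}
task_w2v = {"router": np.random.randn(100)}
char_emb = {c: np.random.randn(16) for c in "router"}  # 16-d char embeddings (assumed)

def char_cnn(word, window=3, n_filters=50, rng=np.random.default_rng(0)):
    """Max-over-time pooling of one convolutional window over character embeddings."""
    chars = np.stack([char_emb[c] for c in word])            # (len, 16)
    filters = rng.standard_normal((window, 16, n_filters))   # one random filter bank
    convs = [np.einsum("wc,wcf->f", chars[i:i + window], filters)
             for i in range(len(word) - window + 1)]
    return np.max(convs, axis=0)                              # (n_filters,)

def word_representation(word):
    """Concatenate pretrained, task-specific and character-level embeddings."""
    return np.concatenate([glove[word], task_w2v[word], char_cnn(word)])

print(word_representation("router").shape)  # (450,) for this single-window sketch
```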

III-C Sentence Encoding Module

First, each utterance in a context or in a response candidate is encoded by a bidirectional long short-term memory network (BiLSTM) [23]. We denote the calculations as follows:

$\bar{\mathbf{u}}_{m,k} = \operatorname{BiLSTM}(\mathbf{U}_m, k), \quad k \in \{1, \dots, l_{u_m}\},$   (2)

$\bar{\mathbf{r}}_{n,k} = \operatorname{BiLSTM}(\mathbf{R}_n, k), \quad k \in \{1, \dots, l_{r_n}\}.$   (3)

The parameters in these two BiLSTMs are shared in our implementation.

To consider long-term dependency and highlight the semantic influences among adjacent words at the same time, a self-attention layer [24] with a Gaussian prior [25] is employed to enhance the performance of BiLSTM-based sentence encoding. For a word $\bar{\mathbf{u}}_{m,i}$ in a context utterance, its representation after self-attention is calculated as

$\hat{\mathbf{u}}_{m,i} = \sum_{j=1}^{l_{u_m}} \operatorname{softmax}_j \Big( \bar{\mathbf{u}}_{m,i}^\top \bar{\mathbf{u}}_{m,j} - \frac{(d_{ij} - \mu)^2}{2\sigma^2} \Big) \, \bar{\mathbf{u}}_{m,j},$   (4)

where $d_{ij}$ is the word-level distance between the $i$-th word and the $j$-th word, and $\mu$ and $\sigma$ are scalar parameters estimated by model training. Similarly, for each word $\bar{\mathbf{r}}_{n,i}$ in a response utterance, we have

$\hat{\mathbf{r}}_{n,i} = \sum_{j=1}^{l_{r_n}} \operatorname{softmax}_j \Big( \bar{\mathbf{r}}_{n,i}^\top \bar{\mathbf{r}}_{n,j} - \frac{(d_{ij} - \mu)^2}{2\sigma^2} \Big) \, \bar{\mathbf{r}}_{n,j}.$   (5)

Finally, the outputs of the sentence encoding module are $\hat{\mathbf{U}}_m = [\hat{\mathbf{u}}_{m,1}, \dots, \hat{\mathbf{u}}_{m,l_{u_m}}]$ for context utterances and $\hat{\mathbf{R}}_n = [\hat{\mathbf{r}}_{n,1}, \dots, \hat{\mathbf{r}}_{n,l_{r_n}}]$ for response utterances.
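The sketch below shows one plausible form of the Gaussian-prior self-attention in Eqs. (4)-(5), where the dot-product score between two positions is biased by a Gaussian function of their word-level distance; the exact parameterization of $\mu$ and $\sigma$ in the trained model may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_self_attention(H, mu=0.0, sigma=2.0):
    """Self-attention over BiLSTM outputs H (len, dim) with a distance-based
    Gaussian prior added to the dot-product scores. mu and sigma stand in for
    the trained scalar parameters; this is an assumed parameterization."""
    L = H.shape[0]
    dist = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])  # word-level distances
    scores = H @ H.T - (dist - mu) ** 2 / (2.0 * sigma ** 2)      # content + prior
    return softmax(scores, axis=-1) @ H                            # (len, dim)

H = np.random.randn(7, 8)
print(gaussian_self_attention(H).shape)  # (7, 8)
```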

III-D Interactive Matching Module

Interactions between the context and the response provide useful information for determining the matching degree between them. Unlike previous work [11, 12, 13], which matched the response to each utterance in the context separately, the U2U-IMN model matches the whole response with the whole context in a global and bidirectional way. Both the context and the response are treated as single word sequences, and attention weights are calculated between every word in the context and every word in the response. Then, the relevance representations are derived along both the context-to-response and response-to-context directions. This global and bidirectional strategy is expected to help the model ignore irrelevant utterances and enrich the relevance representations between the context and the response. Furthermore, considering that the context utterances adjacent to the response may contribute more to response selection than the distant ones, we propose to introduce an exponential prior based on the distance between context and response utterances when calculating the attention weights.

First, the context representation $\hat{\mathbf{C}} = [\hat{\mathbf{c}}_1, \dots, \hat{\mathbf{c}}_{l_c}]$ is formed by concatenating all context utterance representations $\hat{\mathbf{U}}_1, \dots, \hat{\mathbf{U}}_{n_c}$, where $l_c = \sum_{m=1}^{n_c} l_{u_m}$ is the total number of words in the context. Similarly, we obtain $\hat{\mathbf{R}} = [\hat{\mathbf{r}}_1, \dots, \hat{\mathbf{r}}_{l_r}]$ and $l_r$ for the response.

Then, an attention-based alignment is employed to collect relevance information between these two sequences by computing the attention weight between each pair of $\{\hat{\mathbf{c}}_i, \hat{\mathbf{r}}_j\}$ as¹

$e_{ij} = \hat{\mathbf{c}}_i^\top \hat{\mathbf{r}}_j + \alpha \, e^{-\gamma d_{ij}},$   (6)

where $d_{ij}$ is the sentence-level distance between these two words, and $\alpha e^{-\gamma d_{ij}}$ is an exponential prior with decay constant $\gamma$ and initial value $\alpha$. Here, $\alpha$ and $\gamma$ are model parameters that need to be estimated.

¹Actually, the attention weights for context-to-response alignment and response-to-context alignment are different because of different normalization terms. Here, we use the same symbol $e_{ij}$, and the normalization term is not shown in Eq. (6) for simplification.

Next, the attention weights computed above are used to bidirectionally obtain the local relevance between a context and a response. For a word $\hat{\mathbf{c}}_i$ in the context, its context-to-response relevance representation carried by the response is composed using $\{e_{ij}\}$ as

$\tilde{\mathbf{c}}_i = \sum_{j=1}^{l_r} \operatorname{softmax}_j(e_{ij}) \, \hat{\mathbf{r}}_j,$   (7)

where the contents in $\hat{\mathbf{R}}$ relevant to $\hat{\mathbf{c}}_i$ are selected to form $\tilde{\mathbf{c}}_i$. The same calculation is also performed for each word $\hat{\mathbf{r}}_j$ in the response to form the response-to-context representation as

$\tilde{\mathbf{r}}_j = \sum_{i=1}^{l_c} \operatorname{softmax}_i(e_{ij}) \, \hat{\mathbf{c}}_i.$   (8)

For the whole context and the whole response, we have $\tilde{\mathbf{C}} = [\tilde{\mathbf{c}}_1, \dots, \tilde{\mathbf{c}}_{l_c}]$ and $\tilde{\mathbf{R}} = [\tilde{\mathbf{r}}_1, \dots, \tilde{\mathbf{r}}_{l_r}]$. Following a previous study on interactive matching for NLI [15], we compute the differences and the element-wise products between $\{\hat{\mathbf{C}}, \tilde{\mathbf{C}}\}$ and between $\{\hat{\mathbf{R}}, \tilde{\mathbf{R}}\}$. The differences and the element-wise products are then concatenated with the original vectors to obtain the enhanced representations as follows:

$\mathbf{C}^{enh} = [\hat{\mathbf{C}}; \tilde{\mathbf{C}}; \hat{\mathbf{C}} - \tilde{\mathbf{C}}; \hat{\mathbf{C}} \odot \tilde{\mathbf{C}}],$   (9)

$\mathbf{R}^{enh} = [\hat{\mathbf{R}}; \tilde{\mathbf{R}}; \hat{\mathbf{R}} - \tilde{\mathbf{R}}; \hat{\mathbf{R}} \odot \tilde{\mathbf{R}}].$   (10)

Thus far, the relevant information between the context and the response has been collected, which is further converted back to the matching matrices of separated utterances as

$[\mathbf{U}_1^{mat}, \dots, \mathbf{U}_{n_c}^{mat}] = \operatorname{Separate}(\mathbf{C}^{enh}),$   (11)

$[\mathbf{R}_1^{mat}, \dots, \mathbf{R}_{n_r}^{mat}] = \operatorname{Separate}(\mathbf{R}^{enh}),$   (12)

where the Separate operation is conducted by segmenting the whole sequences of relevant information according to utterance length.
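The following NumPy sketch walks through Eqs. (6)-(12) under the assumed additive form of the exponential prior: global attention between the concatenated context and response, bidirectional relevance representations, ESIM-style enhancement, and the Separate operation. All names and toy shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_matching(C, R, dist, alpha=1.0, gamma=0.5):
    """Global bidirectional matching (Eqs. (6)-(10)), assumed parameterization.

    C: context words (Lc, d), R: response words (Lr, d),
    dist: sentence-level distance between the utterances each word pair belongs to (Lc, Lr).
    alpha, gamma stand in for the trained initial value and decay constant of the prior.
    """
    e = C @ R.T + alpha * np.exp(-gamma * dist)        # content score + exponential prior
    C_rel = softmax(e, axis=1) @ R                      # context-to-response relevance
    R_rel = softmax(e, axis=0).T @ C                    # response-to-context relevance
    C_enh = np.concatenate([C, C_rel, C - C_rel, C * C_rel], axis=-1)  # Eq. (9)
    R_enh = np.concatenate([R, R_rel, R - R_rel, R * R_rel], axis=-1)  # Eq. (10)
    return C_enh, R_enh

def separate(X_enh, lengths):
    """Split the concatenated sequence back into per-utterance matrices (Eqs. (11)-(12))."""
    return np.split(X_enh, np.cumsum(lengths)[:-1], axis=0)

C, R = np.random.randn(12, 8), np.random.randn(5, 8)
C_enh, R_enh = interactive_matching(C, R, dist=np.zeros((12, 5)))
print([m.shape for m in separate(C_enh, [4, 3, 5])])
```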

III-E Aggregation Module

The aggregation module converts the matching matrices of separated utterances into a final matching vector. Previous studies [11, 12, 13] adopted the utterance-to-response matching framework and only aggregated the matching matrices of utterances in a context. In contrast, the U2U-IMN model needs to conduct the aggregation operation for both the context and the response.

First, the matching matrix $\mathbf{U}_m^{mat}$ or $\mathbf{R}_n^{mat}$ for each utterance is processed by a BiLSTM and aggregated by max pooling and last-hidden-state pooling operations. For the matching matrix $\mathbf{U}_m^{mat}$ of each context utterance, the calculations are as follows:

$\mathbf{u}_m^{max} = \max_{1 \le k \le l_{u_m}} \operatorname{BiLSTM}(\mathbf{U}_m^{mat}, k),$   (13)

$\mathbf{u}_m^{last} = \operatorname{BiLSTM}(\mathbf{U}_m^{mat}, l_{u_m}),$   (14)

where $\mathbf{u}_m^{max}$ and $\mathbf{u}_m^{last}$ denote the results of max pooling and last-hidden-state pooling for the sequence of BiLSTM outputs over $\mathbf{U}_m^{mat}$. The same calculations are also performed for the matching matrix $\mathbf{R}_n^{mat}$ of each response utterance as follows:

$\mathbf{r}_n^{max} = \max_{1 \le k \le l_{r_n}} \operatorname{BiLSTM}(\mathbf{R}_n^{mat}, k),$   (15)

$\mathbf{r}_n^{last} = \operatorname{BiLSTM}(\mathbf{R}_n^{mat}, l_{r_n}).$   (16)

The weights for these two BiLSTMs are shared in our implementation. Thus far, we have obtained two sets of utterance embeddings, $\{\mathbf{u}_m^{utt} = [\mathbf{u}_m^{max}; \mathbf{u}_m^{last}]\}_{m=1}^{n_c}$ and $\{\mathbf{r}_n^{utt} = [\mathbf{r}_n^{max}; \mathbf{r}_n^{last}]\}_{n=1}^{n_r}$, for the context and the response, respectively. The next step is to convert them into aggregated context and response embeddings.
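A small sketch of the utterance-level aggregation just described, assuming the BiLSTM outputs of one matching matrix are already available; producing them with an actual BiLSTM is omitted.

```python
import numpy as np

def aggregate_utterance(H_fwd, H_bwd):
    """Utterance-level aggregation (Eqs. (13)-(16)): combine max pooling with
    last-hidden-state pooling over BiLSTM outputs. H_fwd/H_bwd are the forward
    and backward hidden-state sequences of one utterance, each (len, dim)."""
    H = np.concatenate([H_fwd, H_bwd], axis=-1)          # (len, 2*dim)
    max_pool = H.max(axis=0)                              # element-wise max over time
    last_state = np.concatenate([H_fwd[-1], H_bwd[0]])    # last state of each direction
    return np.concatenate([max_pool, last_state])          # utterance embedding

emb = aggregate_utterance(np.random.randn(6, 4), np.random.randn(6, 4))
print(emb.shape)  # (16,)
```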

The embedding vector of the context is derived in a way similar to the utterance-level aggregation method mentioned above. The utterance embeddings $[\mathbf{u}_1^{utt}, \dots, \mathbf{u}_{n_c}^{utt}]$ are sent into another BiLSTM following the chronological order of utterances in the context. Combined max pooling and last-hidden-state pooling operations are also performed to obtain the context embedding vector $\mathbf{c}^{agg} = [\mathbf{c}^{max}; \mathbf{c}^{last}]$ as

$\mathbf{c}^{max} = \max_{1 \le m \le n_c} \operatorname{BiLSTM}([\mathbf{u}_1^{utt}, \dots, \mathbf{u}_{n_c}^{utt}], m),$   (17)

$\mathbf{c}^{last} = \operatorname{BiLSTM}([\mathbf{u}_1^{utt}, \dots, \mathbf{u}_{n_c}^{utt}], n_c).$   (18)

For the response, two aggregation strategies are designed in this paper.

III-E1 RNN Aggregation

This is identical to the context aggregation, in which the chronological relationships among utterances in the response are modelled. The operations can be written as

$\mathbf{r}^{max} = \max_{1 \le n \le n_r} \operatorname{BiLSTM}([\mathbf{r}_1^{utt}, \dots, \mathbf{r}_{n_r}^{utt}], n),$   (19)

$\mathbf{r}^{last} = \operatorname{BiLSTM}([\mathbf{r}_1^{utt}, \dots, \mathbf{r}_{n_r}^{utt}], n_r),$   (20)

and the response embedding vector is $\mathbf{r}^{agg} = [\mathbf{r}^{max}; \mathbf{r}^{last}]$.

III-E2 Attention Aggregation

Different from contexts that usually contain approximately ten utterances, a response is composed of much fewer utterances (see Fig. 2 in the next section for detailed statistics). We suppose that chronological relationships in short sequences are not as important as those in long sequences. Therefore, attention aggregation is designed to replace the RNN aggregation for deriving the response embedding vector. Mathematically, we have

$\mathbf{r}^{agg} = \sum_{n=1}^{n_r} w_n \, \mathbf{r}_n^{utt},$   (21)

where $w_n$ denotes softmax-normalized position-dependent utterance weights. During model training, the maximum number of utterances in a response $n_r^{max}$ is set manually. For each $n_r \le n_r^{max}$, a group of weights $\{w_1, \dots, w_{n_r}\}$ is estimated with the constraint $\sum_{n=1}^{n_r} w_n = 1$.

The final matching feature vector is the concatenation of the context embedding vector and the response embedding vector:

$\mathbf{m} = [\mathbf{c}^{agg}; \mathbf{r}^{agg}].$   (22)
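Below is a sketch of the attention aggregation in Eq. (21) followed by the concatenation in Eq. (22). Slicing a single logit vector by response length is a simplification; the paper estimates a separate weight group for each possible number of response utterances.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(utt_embs, position_logits):
    """Attention aggregation (Eq. (21)): softmax-normalized, position-dependent
    weights over response utterance embeddings. position_logits stands in for
    the trained weight group; the values below are illustrative."""
    w = softmax(position_logits[: len(utt_embs)])
    return (w[:, None] * np.stack(utt_embs)).sum(axis=0)

# Hypothetical: three response utterance embeddings and one context embedding.
r_utts = [np.random.randn(16) for _ in range(3)]
r_emb = attention_aggregate(r_utts, position_logits=np.array([0.6, 0.2, 0.1]))
c_emb = np.random.randn(16)
m = np.concatenate([c_emb, r_emb])  # final matching feature vector, Eq. (22)
print(m.shape)  # (32,)
```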

III-F Prediction Module

The matching feature vector $\mathbf{m}$ is then sent into a multi-layer perceptron (MLP) classifier. An MLP is a feedforward neural network estimated in a supervised manner using examples of features together with known labels. Here, the MLP is designed to predict whether a context-response pair matches appropriately according to the matching feature vector $\mathbf{m}$. Finally, the MLP returns a score to denote the degree of matching.
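For completeness, a minimal sketch of such an MLP scorer with one ReLU hidden layer (256 units, as in Section IV-C) and a sigmoid output; the weights here are random placeholders.

```python
import numpy as np

def mlp_score(m, W1, b1, W2, b2):
    """Prediction module sketch: one ReLU hidden layer followed by a sigmoid
    output giving the matching degree in (0, 1)."""
    h = np.maximum(0.0, m @ W1 + b1)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # scalar matching score

m = np.random.randn(32)
W1, b1 = np.random.randn(32, 256) * 0.05, np.zeros(256)
W2, b2 = np.random.randn(256) * 0.05, 0.0
print(mlp_score(m, W1, b1, W2, b2))
```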

IV Experiments

IV-A Datasets

Dataset              Ubuntu V1              Ubuntu V2              Douban                 E-commerce
                     Train   Valid   Test   Train   Valid   Test   Train   Valid   Test   Train   Valid   Test
pairs                1M      0.5M    0.5M   1M      195k    189k   1M      50k     10k    1M      10k     10k
positive:negative    1:1     1:9     1:9    1:1     1:9     1:9    1:1     1:1     1:9    1:1     1:1     1:9
positive/context     1       1       1      1       1       1      1       1       1.18   1       1       1
turns/context        8.44    2.66    2.65   6.29    5.86    6.03   6.69    6.75    5.95   5.51    5.48    5.64
words/utterance      20.38   21.16   21.17  14.06   15.28   15.28  18.56   18.50   20.74  7.02    6.99    7.11
TABLE II: Statistics of the datasets for evaluating our proposed methods.

Two English public multi-turn response selection datasets, the Ubuntu Dialogue Corpus V1 [8] and Ubuntu Dialogue Corpus V2 [10], and two Chinese datasets, the Douban Conversation Corpus [11] and E-commerce Dialogue Corpus [13], were adopted to evaluate our proposed methods. In our experiments, we followed the splits of training, validation, and test sets provided by the original authors of the four datasets. The Ubuntu Dialogue Corpus V1 and V2 contain multi-turn dialogues about Ubuntu system troubleshooting in English. Here, we adopted the version of the Ubuntu Dialogue Corpus V1 shared by Xu et al. [26], in which numbers, paths and URLs were replaced by placeholders. Compared with the Ubuntu Dialogue Corpus V1, the training, validation and test dialogues in the V2 dataset were generated in different periods without overlap. Moreover, the V2 dataset discriminated between the end of an utterance (_eou_) and the end of a turn (_eot_). In both of the Ubuntu corpora, the positive responses are true responses from humans, and the negative responses are randomly sampled. The Douban Conversation Corpus was crawled from a Chinese social network on open-domain topics. It was constructed in a similar way to the Ubuntu corpus. The Douban Conversation Corpus collected responses via a small inverted-index system, and labels were manually annotated. The E-commerce Dialogue Corpus collected real-world conversations between customers and customer service staff from the largest e-commerce platform in China. Some statistics of these datasets are provided in Table II.

Fig. 2: Distribution of responses in the Ubuntu Dialogue Corpus V2 across the number of utterances in a response.

It is worth noting that the Ubuntu Dialogue Corpus V2 was the only dataset in our experiments that explicitly segmented utterances in responses. Specifically, approximately 30% of the responses in this dataset consisted of multiple utterances, as shown in Fig. 2, which made this dataset a very suitable one for evaluating our proposed U2U matching framework. The U2U-IMN model can also be applied to the other three datasets by considering a whole response as a single utterance.

IV-B Evaluation Metrics

The evaluation metrics used in previous work [8, 10, 11, 13] were adopted in our experiments. Each model was tasked with selecting the $k$ best-matched responses from $n$ available candidates for the given conversation context $c$. We calculated the recall of the true positive replies among the $k$ selected responses, denoted $R_n@k$, as the main evaluation metric. The mean average precision (MAP) [27] was also adopted for reference since previous work did not list results in terms of MAP on the Ubuntu V1, Ubuntu V2 and E-commerce datasets. In addition to $R_n@k$ and MAP, we also adopted the mean reciprocal rank (MRR) [28] and precision-at-one ($P@1$) metrics for the Douban corpus, following the settings of previous work [11]. The reason was that the Douban Conversation Corpus was different from the other three datasets in that it included multiple correct candidates for a context in the test set, which may lead to low $R_n@1$.
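The $R_n@k$ metric can be computed as in the sketch below, which ranks the $n$ candidates of one context by their predicted scores and measures the fraction of true positives among the top $k$; the toy scores and labels are illustrative.

```python
import numpy as np

def recall_at_k(scores, labels, k):
    """R_n@k for one context: n candidates scored by the model, recall of the
    true positives among the top-k ranked responses."""
    order = np.argsort(scores)[::-1]               # rank candidates by score, descending
    top_k = np.asarray(labels)[order[:k]]
    return top_k.sum() / max(1, np.sum(labels))    # fraction of positives retrieved

# toy usage: 10 candidates, 1 positive, the model ranks it second
scores = np.array([0.1, 0.9, 0.8, 0.3, 0.2, 0.05, 0.4, 0.15, 0.25, 0.35])
labels = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(recall_at_k(scores, labels, k=2))  # 1.0
```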

IV-C Training Details

                   Ubuntu Corpus V1                         Ubuntu Corpus V2
Model              MAP     R2@1    R10@1   R10@2   R10@5    MAP     R2@1    R10@1   R10@2   R10@5
TF-IDF [8, 10]     -       0.659   0.410   0.545   0.708    -       0.749   0.488   0.587   0.763
RNN [8, 10]        -       0.768   0.403   0.547   0.819    -       0.777   0.379   0.561   0.836
LSTM [8, 10]       -       0.878   0.604   0.745   0.926    -       0.869   0.552   0.721   0.924
DL2R [29]          -       0.899   0.626   0.783   0.944    -       -       -       -       -
Match-LSTM [30]    -       0.904   0.653   0.799   0.944    -       -       -       -       -
MV-LSTM [31]       -       0.906   0.653   0.804   0.946    -       -       -       -       -
Multi-View [20]    -       0.908   0.662   0.801   0.951    -       -       -       -       -
RNN-CNN [32]       -       -       -       -       -        -       0.911   0.672   0.809   0.956
CompAgg [33]       -       0.884   0.631   0.753   0.927    -       0.895   0.641   0.776   0.937
BiMPM [34]         -       0.897   0.665   0.786   0.938    -       0.877   0.611   0.747   0.921
HRDE-LTC [35]      -       0.916   0.684   0.822   0.960    -       0.915   0.652   0.815   0.966
SMN [11]           -       0.926   0.726   0.847   0.961    -       -       -       -       -
DUA [13]           -       -       0.752   0.868   0.962    -       -       -       -       -
DAM [12]           -       0.938   0.767   0.874   0.969    -       -       -       -       -
U2U-IMN            0.866   0.945   0.790   0.886   0.973    0.852   0.943   0.762   0.877   0.975
TABLE III: Evaluation results of U2U-IMN and previous methods on the Ubuntu Dialogue Corpus V1 and V2.
                   Douban Conversation Corpus                        E-commerce Corpus
Model              MAP     MRR     P@1     R10@1   R10@2   R10@5     MAP     R10@1   R10@2   R10@5
TF-IDF             0.331   0.359   0.180   0.096   0.172   0.405     -       0.159   0.256   0.477
RNN                0.390   0.422   0.208   0.118   0.223   0.589     -       0.325   0.463   0.775
LSTM               0.485   0.527   0.320   0.187   0.343   0.720     -       0.365   0.536   0.828
Multi-View         0.505   0.543   0.342   0.202   0.350   0.729     -       0.421   0.601   0.861
DL2R               0.488   0.527   0.330   0.193   0.342   0.705     -       0.399   0.571   0.842
MV-LSTM            0.498   0.538   0.348   0.202   0.351   0.710     -       0.412   0.591   0.857
Match-LSTM         0.500   0.537   0.345   0.202   0.348   0.720     -       0.410   0.590   0.858
SMN                0.529   0.569   0.397   0.233   0.396   0.724     -       0.453   0.654   0.886
DUA                0.551   0.599   0.421   0.243   0.421   0.780     -       0.501   0.700   0.921
DAM                0.550   0.601   0.427   0.254   0.410   0.757     -       -       -       -
U2U-IMN            0.564   0.611   0.429   0.259   0.430   0.791     0.759   0.616   0.806   0.966
TABLE IV: Evaluation results of U2U-IMN and previous methods on the Douban Conversation Corpus and the E-commerce Corpus. All results except ours are copied from [11, 13, 12].

The Adam method [36] was employed for optimization, with a batch size of 128. The initial learning rate was 0.001 and was exponentially decayed by 0.96 every 5000 steps. Dropout [37] with a rate of 0.2 was applied to the word embeddings and all hidden layers.

The word representations for the English datasets were concatenations of the 300-dimensional GloVe embeddings [38], the 100-dimensional embeddings estimated on the training set using the Word2Vec algorithm [39] and the 150-dimensional character-level embeddings with window sizes of {3, 4, 5}, each consisting of 50 filters. The word embeddings for the Chinese datasets were concatenations of the 200-dimensional embeddings from previous work [40] and the 200-dimensional embeddings estimated on the training set using the Word2Vec algorithm. Character-level embeddings were not employed for the two Chinese datasets due to the large number of Chinese characters. The word embeddings were not updated during training.

All hidden states of LSTMs had 200 dimensions. The MLP of the prediction module had a hidden unit size of 256 with ReLU [41] activation. The maximum word length, the maximum utterance length, the maximum number of utterances in a context, and the maximum number of utterances in a response were set as 18, 50, 10 and 3, respectively. We padded with zeros if the number of utterances in a context was less than 10 or the number of utterances in a response was less than 3. Otherwise, the last 10 utterances in the context or the last 3 utterances in the response were kept. The development set was used to select the best model for testing.
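The optimization setup above can be expressed roughly as follows; this sketch uses the current TF2/Keras API for illustration, whereas the released code targets the TensorFlow version of its time, and the hyperparameter dictionary simply restates the values listed in this subsection.

```python
import tensorflow as tf

# Adam with an exponentially decayed learning rate, as described in Section IV-C.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.001, decay_steps=5000, decay_rate=0.96)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

# Hyperparameters restated from this subsection.
HPARAMS = dict(
    batch_size=128, dropout_rate=0.2, lstm_units=200, mlp_hidden_units=256,
    max_word_len=18, max_utterance_len=50,
    max_context_utterances=10, max_response_utterances=3,
)
```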

All codes were implemented in the TensorFlow framework [42] and have been published to help replicate our results².

²https://github.com/JasonForJoy/U2U-IMN

IV-D Experimental Results

Table III and Table IV present the evaluation results of U2U-IMN and previous methods³. All the results except ours are copied from the existing literature. For each dataset, all results listed in Table III or Table IV are comparable with each other since they used the same training, validation and test data. Here, the U2U-IMN models adopted the attention aggregation strategy introduced in Section III-E. It can be observed from these two tables that U2U-IMN outperformed the other models on all metrics and datasets, which demonstrates its ability to select the correct response and its compatibility across domains (e.g., the domains of system troubleshooting, social networks and e-commerce covered by these datasets).

³In our previous conference paper, IMN employed an attentive hierarchical recurrent encoder (AHRE) as its sentence encoder, which aggregated multi-layer RNNs through attentive pooling. However, since the sentence encoder is not the key point of this paper, we replaced AHRE with a single-layer RNN for sentence encoding in U2U-IMN in order to simplify the model structure and focus on how to perform interactions between contexts and responses.

V Analysis

Model      Subset    R2@1    R10@1   R10@2   R10@5
U2R-IMN    1 utt.    0.936   0.733   0.863   0.974
U2U-IMN    1 utt.    0.937   0.737   0.863   0.972
U2R-IMN    2 utt.    0.952   0.823   0.904   0.979
U2U-IMN    2 utt.    0.956   0.831   0.911   0.984
U2R-IMN    3 utt.    0.965   0.873   0.923   0.982
U2U-IMN    3 utt.    0.976   0.904   0.955   0.994
TABLE V: Comparisons between U2R-IMN and U2U-IMN models on several subsets of the test set of the Ubuntu Dialogue Corpus V2. U2R-IMN denotes the model with concatenation of the utterances in a response. In each subset, the correct responses are composed of 1, 2, or 3 utterances.

V-A Effectiveness of U2U matching

To further verify the effectiveness of our proposed U2U matching framework, we split the test set of the Ubuntu Dialogue Corpus V2 according to the number of utterances in the correct responses. Then, the performance of the U2U-IMN model on these subsets was compared with that of a model (denoted U2R-IMN) that considered each response as a single utterance, as shown in Table V. As demonstrated, the U2U framework can help improve performance by exploiting the relationships among the utterances in a response. We can see that the advantage of the U2U-IMN model over the U2R-IMN model became larger when the correct responses were composed of more utterances. This was consistent with the motivation of the U2U matching framework. Considering that only 30% of responses in the Ubuntu Dialogue Corpus V2 dataset consisted of multiple utterances, a larger overall improvement may be achieved when applying our proposed U2U models to datasets containing more responses with multiple utterances.

V-B Response aggregation strategies

One key characteristic of the U2U matching framework is the response aggregation step, which generates a single embedding vector based on the embedding vectors of response utterances. Table VI shows the evaluation results of the two response aggregation strategies introduced in Section III-E, where the RNN suffix indicates the U2U-IMN model using the RNN aggregation strategy instead of the attention aggregation. We can see that the U2U-IMN model with the default attention strategy for response aggregation achieved slightly better performance than that with RNN aggregation, which supports our assumption that the chronological relationships among utterances in short sequences may not be essential in the aggregation module. Some further analysis of these two aggregation strategies is given in the following.

Model          R2@1    R10@1   R10@2   R10@5
U2U-IMN        0.943   0.762   0.877   0.975
U2U-IMN-RNN    0.942   0.758   0.875   0.974
TABLE VI: Evaluation results of our proposed U2U matching framework on the test set of the Ubuntu Dialogue Corpus V2. The RNN suffix of the U2U model denotes replacing the attention aggregation with the RNN aggregation in the aggregation module.

V-B1 RNN Aggregation

Fig. 3: The input gate values of the LSTM in Eq. (19) of the U2U-IMN model for a response example in the test set of the Ubuntu Dialogue Corpus V2. The darker units correspond to larger values.

To investigate how RNN aggregation identifies important utterances in a response, the input gate values of the LSTM in Eq. (19) for a response example were visualized, as shown in Fig. 3. The response was composed of three utterances, {$U_1$: not as vboxnet0 though, windows names them local area connection # 1,2,3 … _eou_; $U_2$: exactly! _eou_; $U_3$: i don’t know how to do it though :-lrb- _eou_}, and $U_1$ was the most informative one. From Fig. 3, we can see that the input gates had larger values for $U_1$ than for the other two utterances. This means that more information from this utterance was preserved when aggregating the three utterances to form the embedding vector of the whole response.

V-B2 Attention Aggregation

Fig. 4: Performance of U2U-IMN models with different $n_r^{max}$, tuned on the validation set of the Ubuntu Dialogue Corpus V2.

$n_r$   $w_1$     $w_2$     $w_3$
1       1.0       -         -
2       0.5986    0.4014    -
3       0.4495    0.3014    0.2491
TABLE VII: Attention weights of the U2U-IMN model with $n_r^{max} = 3$ estimated on the training set of the Ubuntu Dialogue Corpus V2.

The maximum number of utterances in a response, i.e., $n_r^{max}$ in Section III-E, was tuned on the validation set, and the optimal one for the U2U-IMN model was $n_r^{max} = 3$, as shown in Fig. 4. The estimated attention weights of the U2U-IMN model with $n_r^{max} = 3$ are shown in Table VII. We can see that when $n_r > 1$, each utterance in the response contributed to forming the final response embeddings, and the first utterance contributed more than the last one. As we can see from the first row of Table VII and Eq. (21), if there was only one utterance in a response, then the U2U-IMN model degenerated to follow the conventional utterance-to-response matching framework.

V-C Bidirectional and global interactive matching

The bidirectional and global interactive matching between the context and the response in the U2U-IMN model is expected to help collect matching information and make matching decisions. Ablation tests and visualizations of attention weights were performed to demonstrate the effectiveness of both the bidirectional matching and the global matching.

V-C1 Bidirectional Matching

Dataset       Model          R10@1   R10@2   R10@5
Ubuntu V1     U2U-IMN        0.790   0.886   0.973
              - C2R          0.774   0.876   0.968
              - R2C          0.780   0.880   0.971
              - C2R&R2C      0.650   0.806   0.954
Ubuntu V2     U2U-IMN        0.762   0.877   0.975
              - C2R          0.738   0.866   0.972
              - R2C          0.749   0.871   0.972
              - C2R&R2C      0.608   0.786   0.956
Douban        U2U-IMN        0.259   0.430   0.791
              - C2R          0.251   0.424   0.784
              - R2C          0.250   0.429   0.785
              - C2R&R2C      0.188   0.352   0.744
E-commerce    U2U-IMN        0.616   0.806   0.966
              - C2R          0.575   0.774   0.957
              - R2C          0.567   0.766   0.961
              - C2R&R2C      0.538   0.751   0.938
TABLE VIII: Ablation tests of the context-to-response (C2R) and response-to-context (R2C) representations in the U2U-IMN model on the four datasets.
(a) Context-to-response
(b) Response-to-context
Fig. 5: Visualizations of the (a) context-to-response and (b) response-to-context attention weights in the interactive matching module for a test sample of the Ubuntu Dialogue Corpus V2. The darker units correspond to larger values.

The bidirectional context-to-response and response-to-context representations in the U2U-IMN model were ablated. Specifically, when the context-to-response representation was ablated, the context representation given by the sentence encoding module was sent to the aggregation module directly, and only the response representation was enhanced by the interactive matching module to obtain $\mathbf{R}^{enh}$ before aggregation. Similar operations were conducted to ablate the response-to-context representation. The results are shown in Table VIII. We can see that ablation of either the context-to-response or response-to-context representations resulted in a performance degradation, which indicates the effectiveness of the bidirectional matching between contexts and responses in the interactive matching module. A serious performance degradation can be observed when ablating the matching representations of both directions.

A case study was further conducted by visualizing the bidirectional context-to-response and response-to-context attention weights for a test sample of the Ubuntu Dialogue Corpus V2. The context of the sample contained three utterances:

  • Have you tried using different channels ? _eou_ _eot_

  • No, how do I do that ? _eou_ _eot_

  • Can you connect to router via ethernet cable and check the settings ? _eou_ _eot_

The response was composed of two utterances:

  • I can connect to the router via ethernet, yes What settings should I check ? _eou_

  • I have to go now. I am grateful for your help _eou_

The results are shown in Fig. 5. We can see that some important words, such as “connect”, “router” and “ethernet”, in the context selected the relevant words in the response, and some unimportant words, such as “grateful”, “help” and “the”, in the response occupied small weights when forming the context-to-response representations. Similarly, some important words in the response also selected the relevant words in the context, and some unimportant words in the context were neglected when forming the response-to-context representations.

V-C2 Global Matching

Dataset       Model        R10@1   R10@2   R10@5
Ubuntu V1     U2U-IMN      0.790   0.886   0.973
              - Global     0.786   0.885   0.972
Ubuntu V2     U2U-IMN      0.762   0.877   0.975
              - Global     0.754   0.873   0.975
Douban        U2U-IMN      0.259   0.430   0.791
              - Global     0.254   0.424   0.785
E-commerce    U2U-IMN      0.616   0.806   0.966
              - Global     0.586   0.792   0.961
TABLE IX: Ablation tests of replacing the global context-response matching with local utterance-utterance matching on the four datasets.
(a) Context to first response utterance
(b) Context to second response utterance
Fig. 6: Visualizations of attention weights between the context and each response utterance in the interactive matching module for a test sample of the Ubuntu Dialogue Corpus V2. The darker units correspond to larger values.
(a) Response to 1st context utterance
(b) Response to 2nd context utterance
(c) Response to 3rd context utterance
Fig. 7: Visualizations of attention weights between the response and each context utterance in the interactive matching module for a test sample of the Ubuntu Dialogue Corpus V2. The darker units correspond to larger values.

To demonstrate the superiority of the global context-response matching used by the U2U-IMN model, an ablation test was conducted by replacing it with local utterance-utterance matching. In the ablated model, the interactions introduced in Section III-D were performed between each utterance in the context and each utterance in the response. Thus, we obtained a set of matching representations for each utterance in the context and each utterance in the response. Then, an additional pooling operation was performed over each set of representations to obtain the final matching representation for the corresponding utterance. The pooling outputs were sent into the aggregation module for the following procedures. The results of the ablation test are shown in Table IX, and the performance degradation demonstrates the superiority of our proposed global context-response matching over the local utterance-utterance matching in the interactive matching module.

Furthermore, a case study was conducted by visualizing the context-to-utterance and response-to-utterance attention weights. The sample was the same as that used in Fig. 5. The results are shown in Fig. 6 and Fig. 7, where the interactive matching was performed between the whole context and separated response utterances or between the whole response and separated context utterances. Comparing Fig. 5 (a) with Fig. 6 (b), we can see that the second response utterance “I have to go now. I am grateful for your help _eou_” was less informative and occupied small weights in our proposed global context-response matching but occupied large weights in the context-to-utterance matching. The small weights of less informative utterances can help filter out irrelevant information in responses when deriving context representations. Similarly, comparing Fig. 5 (b) with Fig. 7 (a), we can find the same phenomenon for the first context utterance “Have you tried using different channels ? _eou_ _eot_”. These results verified the effectiveness of the global context-response interactive matching in our proposed U2U-IMN model.

V-D Distance-based prior for interactive matching

Dataset       Model        R10@1   R10@2   R10@5
Ubuntu V1     U2U-IMN      0.790   0.886   0.973
              - Prior      0.787   0.884   0.973
Ubuntu V2     U2U-IMN      0.762   0.877   0.975
              - Prior      0.761   0.874   0.976
Douban        U2U-IMN      0.259   0.430   0.791
              - Prior      0.251   0.431   0.782
E-commerce    U2U-IMN      0.616   0.806   0.966
              - Prior      0.600   0.795   0.968
TABLE X: Ablation tests of the distance-based prior for interactive matching on the four datasets.
Fig. 8: The estimated prior function in Eq. (6) for the E-commerce Corpus.

The exponential prior based on sentence-level distances in Eq. (6) of the interactive matching module was ablated, and the results on the test sets of the four datasets are shown in Table X. We can see that the performance decreased on most metrics. Meanwhile, we can see that this distance-based prior provided larger improvements on the two Chinese datasets than on the two English datasets. The estimated prior function in Eq. (6) for the E-commerce Corpus is drawn in Fig. 8. We can see that larger weights were assigned to the utterances closer to the response.

VI Conclusion

In this paper, we propose an utterance-to-utterance interactive matching network (U2U-IMN) for the multi-turn response selection task. Our proposed model first attempts to simultaneously explore the relationships among utterances in a context and those in a response. Then, U2U-IMN explores the matching information between contexts and responses through global and bidirectional interactions between them. Meanwhile, distances are introduced into the interactions to distinguish the semantic contributions of utterances in a context according to their distances to the response. Experimental results show that our proposed model outperforms the baseline models on all metrics, achieving a new state-of-the-art performance and demonstrating compatibility across domains for multi-turn response selection in retrieval-based chatbots. Our future work includes (1) improving the proposed method to integrate more information, such as persona descriptions, for response selection, (2) applying the U2U framework to other matching scenarios to further verify its effectiveness, and (3) employing pretrained models as effective resources for multi-turn response selection.

Acknowledgment

This work was partially funded by the National Key R&D Program of China (Grant No. 2017YFB1002202), the National Natural Science Foundation of China (Grant Nos. 61871358 and U1636201) and the Key Science and Technology Project of Anhui Province (Grant No. 17030901005).

References