Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots

04/07/2020 ∙ by Jia-Chen Gu, et al. ∙ Queen's University USTC Anhui USTC iFLYTEK Co

In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots. A new model, named Speaker-Aware BERT (SA-BERT), is proposed in order to make the model aware of the speaker change information, which is an important and intrinsic property of multi-turn dialogues. Furthermore, a speaker-aware disentanglement strategy is proposed to tackle the entangled dialogues. This strategy selects a small number of most important utterances as the filtered context according to the speakers' information in them. Finally, domain adaptation is performed in order to incorporate the in-domain knowledge into pre-trained language models. Experiments on five public datasets show that our proposed model outperforms the present models on all metrics by large margins and achieves new state-of-the-art performances for multi-turn response selection.


1 Introduction

Chatbots aim to engage users in open-domain human-computer conversations and are currently receiving increasing attention. The existing work on building chatbots includes generation-based methods and retrieval-based methods. The first type of methods synthesize a response with a natural language generation model [Shang et al.2015, Serban et al.2016, Li et al.2017]. In this paper, we focus on the second type and study the problem of multi-turn response selection. This task aims to select the best-matched response from a set of candidates, given the context of a conversation which is composed of multiple utterances [Lowe et al.2015, Lowe et al.2017, Wu et al.2017]. An example of this task is illustrated in Table 1.

Conversation
Human How are you doing?
Chatbot I am going to hold a drum class in Shanghai. Anyone wants to join?
Human Interesting! Do you have coaches who can help me practice drum?
Chatbot Of course.
Human Can I have a free first lesson?
Response Candidates
Chatbot Sure. Have you ever played drum before?
Chatbot What lessons do you want?
Table 1: An example illustrating the task of multi-turn response selection. Here, we present only one negative candidate due to limited space.

Pre-trained language models have recently been shown to achieve state-of-the-art performances on a wide range of NLP tasks [Peters et al.2018, Devlin et al.2019, Yang et al.2019]. [Henderson et al.2019] made the first attempt to employ pre-trained models for multi-turn response selection. That work adopted a simple strategy of concatenating the context utterances and the response literally, and then sending them into the model for classification. However, this shallow concatenation has three main drawbacks. First, it neglects the fact that speakers change turn by turn as a conversation progresses. Second, it weakens the relationships between the context utterances as they are organized in chronological order. Third, due to the maximum sequence length limit (e.g., 512 for BERT-Base), pre-trained language models are unable to handle sequences that are composed of thousands of tokens, which, however, is a typical setup in multi-turn conversations.

In this paper, we attempt to employ the pre-trained language model and adjust it to fit the task of multi-turn response selection, in which BERT [Devlin et al.2019] is adopted as the basis of our work. We propose a new model, named Speaker-Aware BERT (SA-BERT). First, to make the pre-trained language model aware of the speaker change information during the conversation, the model is enhanced by adding speaker embeddings to the token representation and adding special segmentation tokens between the context utterances. These two strategies are designed to improve the conversation understanding capability of multi-turn dialogue systems. Furthermore, to tackle entangled dialogues, which mix multiple conversation topics and are composed of hundreds of utterances, we propose a heuristic speaker-aware disentanglement strategy, which selects a small number of the most important utterances according to the speaker information in them. Finally, domain adaptation is designed to incorporate specific in-domain knowledge into pre-trained language models. We perform the adaptation process on corpora from the same domain but different sets under the same setting. We conclude that adaptation on a domain-specific corpus helps to incorporate more domain-specific knowledge, and the more similar this adaptation corpus is to the task, the more improvement it helps to achieve.

We test our model on five datasets, Ubuntu Dialogue Corpus V1 [Lowe et al.2015], Ubuntu Dialogue Corpus V2 [Lowe et al.2017], Douban Conversation Corpus [Wu et al.2017], E-commerce Dialogue Corpus [Zhang et al.2018b], and DSTC 8-Track 2-Subtask 2 Corpus [Seokhwan Kim2019]. Experimental results show that the proposed model outperforms the existing models on all metrics by large margins. Specifically, it improves $R_{10}@1$ by 5.5% on Ubuntu Dialogue Corpus V1 and by 5.9% on Ubuntu Dialogue Corpus V2, MAP by 3.2% and MRR by 2.7% on Douban Conversation Corpus, $R_{10}@1$ by 8.3% on E-commerce Corpus, and $R_{100}@1$ by 15.5% on DSTC 8-Track 2-Subtask 2 Corpus, leading to new state-of-the-art performances for multi-turn response selection.

In summary, our contributions in this paper are three-fold:

  • A new model, named Speaker-Aware BERT (SA-BERT), is designed by employing speaker embeddings and speaker-aware disentanglement strategy, in order to make BERT aware of the speaker change information as the conversation progresses.

  • We further analyze the effect of domain adaptation on the performance of response selection.

  • Experimental results show that our model achieves new state-of-the-art performances on five datasets for multi-turn response selection.

2 Related Work

The existing methods used to build an open domain dialogue system can be generally categorized into generation-based methods and retrieval-based methods. The generation-based methods synthesize a response with a natural language generation model by maximizing its generation probability given the previous conversation context. This approach enables the incorporation of rich context when mapping between consecutive dialogue turns [Shang et al.2015, Serban et al.2016, Li et al.2017]. Recently, some extended work has been done to incorporate external knowledge into generation with specific personas or emotions [Li et al.2016, Zhang et al.2018a, Zhou et al.2018a].

Our work belongs to the retrieval-based methods, which learn a matching model for a pair of a conversational context and a response candidate. This approach has the advantage of providing informative and fluent responses because it selects a proper response for the current conversation from a repository by means of response selection algorithms [Lowe et al.2015, Lowe et al.2017, Wu et al.2017, Zhang et al.2018b]. Previous work on retrieval-based chatbots focused on single-turn response selection [Wang et al.2013, Ji et al.2014]. Recently, researchers have extended the focus to the multi-turn conversation, which is more practical for real applications. Some earlier work on multi-turn response selection concatenated the context utterances literally into a single long sequence and calculated its matching score with a response candidate [Lowe et al.2015, Kadlec et al.2015, Lowe et al.2017]. Recent work has kept utterances separated and performed matching within a representation-interaction-aggregation framework, which improved the performance on this task. For example, [Zhou et al.2016] proposed a multi-view model, including an utterance view and a word view. [Wu et al.2017] proposed the sequential matching network (SMN), which first matched the response with each utterance and then accumulated the matching information with a recurrent neural network. [Zhang et al.2018b] proposed the deep utterance aggregation network (DUA), which refined utterances and employed self-matching attention to route the vital information in each utterance. [Zhou et al.2018b] proposed the deep attention matching network (DAM), which constructed representations at different granularities with stacked self-attention and cross-attention. [Tao et al.2019a] proposed the multi-representation fusion network (MRFN) with multiple types of representations. [Gu et al.2019] proposed the interactive matching network (IMN), which performed global and bidirectional interactions between the context and response. [Gu et al.2020] proposed the utterance-to-utterance interactive matching network (U2U-IMN), which treated both contexts and responses as sequences of utterances when calculating the matching degrees between them. [Tao et al.2019b] proposed the interaction over interaction (IOI) model, which performed matching by stacking multiple interaction blocks. [Yuan et al.2019] proposed the multi-hop selector network (MSN), which utilized a multi-hop selector to select the relevant utterances as context. [Henderson et al.2019] made the first attempt to employ pre-trained language models for multi-turn response selection, concatenating the context utterances and the response literally and sending them into the model for classification.

3 Task Definition

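Given a dialogue dataset $\mathcal{D}$, an example of the dataset is denoted as $(c, r, y)$, where $c = \{u_1, u_2, \dots, u_n\}$ represents a conversation context with $\{u_i\}_{i=1}^{n}$ as the utterances, $r$ is a response candidate, and $y \in \{0, 1\}$ denotes a label. Specifically, $y = 1$ indicates that $r$ is a proper response for $c$; otherwise $y = 0$. Our goal is to learn a matching model $g(c, r)$ by minimizing a cross-entropy loss function from $\mathcal{D}$. For any context-response pair $(c, r)$, $g(c, r)$ measures the matching degree between $c$ and $r$. Let $\Theta$ denote the parameters of model $g$. Then, the loss function for learning $g$ can be formulated as

$$\mathcal{L}(\mathcal{D}, \Theta) = -\sum_{(c, r, y) \in \mathcal{D}} \left[ y \log g(c, r) + (1 - y) \log\big(1 - g(c, r)\big) \right] \qquad (1)$$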
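As a rough illustration of the cross-entropy objective defined in this section, the following sketch sums the per-example loss terms over a list of matching scores and binary labels. The function name `cross_entropy_loss` and the clamping constant `eps` are illustrative assumptions, not part of the paper.

```python
import math

def cross_entropy_loss(scores, labels, eps=1e-12):
    """Sum of binary cross-entropy terms over (matching score, label) pairs.

    scores: matching degrees g(c, r) in (0, 1), one per context-response pair.
    labels: 1 if the response is proper for the context, else 0.
    """
    total = 0.0
    for g, y in zip(scores, labels):
        g = min(max(g, eps), 1.0 - eps)  # clamp to avoid log(0); an assumption
        total -= y * math.log(g) + (1 - y) * math.log(1.0 - g)
    return total
```

A confident correct score (for example 0.99 for a positive pair) contributes almost nothing to the sum, while a confident wrong score is penalized heavily.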

4 Methodology

Figure 1: The input representation of SA-BERT. The final input embeddings are the sum of the token embeddings, the segmentation embeddings, the position embeddings and the speaker embeddings.

We present here our proposed model, named Speaker-Aware BERT (SA-BERT), and a visual architecture of our input representation is illustrated in Figure 1. Due to limited space, we omit an exhaustive background description of BERT. Readers can refer to [Devlin et al.2019] for details.

4.1 Speaker Embeddings & Segmentation Tokens

To represent a pair of sentence A and sentence B, the original BERT concatenates the two sentences with a [SEP] token. For a given token, the input representation in the original BERT is constructed by summing the corresponding token, segment and position embeddings.

In order to distinguish utterances in a context and model the speaker change in turn as the conversation progresses, we use two strategies to construct the input sequence for multi-turn response selection as follows.

First, in order to model the speaker change, we propose to add additional speaker embeddings to token representations. The embedding indicates the speaker's identity for each utterance. For conversations with two speakers, two speaker embedding vectors need to be estimated during the training process. The first vector is added to each token of the utterances in the first conversation turn. When the speaker changes, the second vector is employed. This is performed alternately and can be extended to conversations with more speakers.

Second, empirical results in [Dong and Huang2018] show that segmentation tokens play an important role in multi-turn response selection. To model a conversation, it is natural to extend that idea to model turns and utterances. In this work, we propose, and empirically show, that using an [EOU] token at the end of an utterance and an [EOT] token at the end of a turn implicitly models interactions between utterances in a context and improves the performance consistently.
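The two strategies above can be sketched as a small input-construction routine. This is only a minimal sketch under stated assumptions: the function name `build_input`, the data layout of `turns`, the speaker id given to [CLS] and [SEP], and the way the response is assigned a speaker are all hypothetical choices for illustration and are not taken from the paper.

```python
def build_input(turns, response, response_speaker):
    """Build tokens, segment ids and speaker ids for one context-response pair.

    turns: list of (speaker, utterances) in chronological order, where
           utterances is a list of token lists spoken consecutively.
    response: token list of the candidate response.
    """
    speaker_index = {}  # speaker name -> speaker embedding index
    tokens, segment_ids, speaker_ids = ["[CLS]"], [0], [0]
    sid = 0
    for speaker, utterances in turns:
        # A new speaker gets the next unused embedding index.
        sid = speaker_index.setdefault(speaker, len(speaker_index))
        turn_tokens = []
        for utterance in utterances:
            turn_tokens += list(utterance) + ["[EOU]"]  # end of utterance
        turn_tokens.append("[EOT]")  # end of turn
        tokens += turn_tokens
        segment_ids += [0] * len(turn_tokens)
        speaker_ids += [sid] * len(turn_tokens)
    # Separator after the context (assumed to carry the last speaker's id).
    tokens.append("[SEP]")
    segment_ids.append(0)
    speaker_ids.append(sid)
    # The response forms the second segment.
    rid = speaker_index.setdefault(response_speaker, len(speaker_index))
    response_tokens = list(response) + ["[SEP]"]
    tokens += response_tokens
    segment_ids += [1] * len(response_tokens)
    speaker_ids += [rid] * len(response_tokens)
    return tokens, segment_ids, speaker_ids
```

The returned `speaker_ids` would index a learned speaker embedding table whose rows are added to the token, segment and position embeddings at each position.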

4.2 Speaker-Aware Disentanglement Strategy

When a group of people communicate in a common channel there are often multiple conversation topics occurring concurrently. In terms of a specific conversation topic, utterances relevant to it are useful and other utterances could be considered as noise for them. Note that BERT is not good at dealing with sequences which are composed of more tokens than the limit (i.e., maximum length of time steps is set to be 512). In order to select a small number of most important utterances, in this paper, we propose a heuristic speaker-aware disentanglement strategy as follows.

First, we define the speaker who is uttering an utterance as the spoken-from speaker, and define the speaker who is receiving an utterance as the spoken-to speaker. Each utterance usually has the labels of both spoken-from and spoken-to speakers. But some utterances may have only the spoken-from speaker label while the spoken-to speaker is unknown which is set to None in our experiments. Second, given the spoken-from speaker of the response, we select the utterances which have the same spoken-from or spoken-to speaker as the spoken-from speaker of the response. Third, these selected utterances are then organized in their original chronological order and used to form the filtered context. Finally, the utterances selected according to their spoken-from or spoken-to speaker labels are assigned with the two speaker embedding vectors respectively.
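The selection steps above can be sketched as a simple filter. This is a minimal sketch, assuming each utterance is a dictionary with hypothetical keys `from`, `to` (possibly `None`) and `text`; the mapping of spoken-from matches to index 0 and spoken-to matches to index 1 is also an assumption made for illustration.

```python
def filter_context(utterances, response_speaker):
    """Keep utterances spoken by, or addressed to, the speaker of the response.

    utterances: chronological list of dicts with keys "from", "to", "text";
                "to" may be None when the addressee is unknown.
    Returns a list of (text, speaker_embedding_index) in the original order.
    """
    filtered = []
    for utterance in utterances:
        if utterance["from"] == response_speaker:
            filtered.append((utterance["text"], 0))  # matched on spoken-from
        elif utterance["to"] == response_speaker:
            filtered.append((utterance["text"], 1))  # matched on spoken-to
    return filtered
```

Because the loop walks the utterances in their given order, the filtered context keeps the original chronological order without an extra sorting step.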

4.3 Multi-Task Learning for Domain Adaptation

The original BERT is trained on a large text corpus to learn general language representations. To incorporate specific in-domain knowledge, adaptation on in-domain corpora is designed. In our experiments, we employ the training set of each dataset for domain adaptation without additional external knowledge. Furthermore, domain adaptation is done by multi-task learning that optimizes a combination of two loss functions: (i) a next sentence prediction (NSP) loss, and (ii) a masked language model (MLM) loss [Devlin et al.2019].

MLM

We follow the experimental settings in the original BERT by masking some percentage of the input tokens at random and then predicting only those masked tokens to train a deep bidirectional representation. In more detail, we replace the selected word with the [MASK] token 80% of the time, with a random word 10% of the time, and with the original word 10% of the time.
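The 80/10/10 replacement rule can be sketched as follows. This is a minimal sketch: the function name `mask_tokens`, the 15% default selection rate (the value used by the original BERT; the text here only says "some percentage"), and the decision to skip the special tokens are assumptions for illustration.

```python
import random

# Assumption: special tokens are never selected for masking.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[EOU]", "[EOT]"}

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, targets), where targets maps position -> original token."""
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue  # position not selected for prediction
        targets[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return masked, targets
```

Only the positions recorded in `targets` would contribute to the MLM loss; all other positions are left unchanged and are not predicted.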

NSP

Here, sentence A and sentence B are constructed with the same method as that used in the fine-tuning process. The positive responses are true responses that follow the context, and the negative responses are randomly sampled. The embedding of the [CLS] token is used as the aggregated representation for classification. Notably, the speaker embeddings can be pre-trained in the NSP task. If there is no adaptation process, the speaker embeddings have to be initialized randomly at the beginning of the fine-tuning process.

4.4 Output Representation

The first token of each concatenated sequence is the [CLS] token, with its embedding being used as the aggregated representation for a context-response pair classification. This embedding captures the matching information between a context-response pair, which is sent into a classifier with a sigmoid output layer. Parameters of this classifier need to be estimated during the fine-tuning process. Finally, the classifier returns a score to denote the matching degree of this context-response pair.
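The scoring step described above can be illustrated with a tiny sketch. It assumes the classifier is a single linear layer followed by a sigmoid; the text only states that a sigmoid output layer is used, so the layer shape and the names `matching_score`, `weights` and `bias` are hypothetical.

```python
import math

def matching_score(cls_embedding, weights, bias):
    """Map the [CLS] embedding to a matching degree in (0, 1).

    Assumes one linear layer (weights, bias) followed by a sigmoid.
    """
    logit = sum(w * x for w, x in zip(weights, cls_embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

At inference time, every candidate response would be scored against the same context and the candidates ranked by this score.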

5 Experiments

5.1 Datasets

        Ubuntu V1         Ubuntu V2         Douban              E-commerce        DSTC 8
        pairs  pos:neg    pairs  pos:neg    pairs  pos:neg      pairs  pos:neg    pairs  pos:neg
Train   1M     1:1        1M     1:1        1M     1:1          1M     1:1        11M    1:99 or 0:100
Valid   0.5M   1:9        195k   1:9        50k    1:1          10k    1:1        1M     1:99 or 0:100
Test    0.5M   1:9        189k   1:9        10k    1.2:8.8      10k    1:9        1M     1:99 or 0:100
Table 2: Statistics of the datasets that our model is tested on.

We tested SA-BERT on five public multi-turn response selection datasets, Ubuntu Dialogue Corpus V1 [Lowe et al.2015], Ubuntu Dialogue Corpus V2 [Lowe et al.2017], Douban Conversation Corpus [Wu et al.2017], E-commerce Dialogue Corpus [Zhang et al.2018b] and DSTC 8-Track 2-Subtask 2 Corpus [Seokhwan Kim2019]. The first four datasets have been disentangled in advance, so our proposed speaker-aware disentanglement strategy was applied only to the DSTC 8-Track 2-Subtask 2 Corpus. Some statistics of these datasets are provided in Table 2.

Ubuntu Dialogue Corpus V1, V2 and DSTC 8-Track 2-Subtask 2 Corpus contain multi-turn dialogues about Ubuntu system troubleshooting in English. Here, we adopted the version of Ubuntu Dialogue Corpus V1 shared in [Xu et al.2016], in which numbers, paths and URLs were replaced by placeholders. Compared with Ubuntu Dialogue Corpus V1, the training, validation and test dialogues in the V2 dataset were generated in different periods without overlap. In the DSTC 8-Track 2-Subtask 2 Corpus, the candidate pool may not contain the correct response, so we need to choose a threshold. When the probability of the positive label was smaller than the threshold, we predicted that the candidate pool did not contain the correct response. The threshold was selected among [0.6, 0.65, .., 0.95] based on the validation set. In all of the Ubuntu corpora, the positive responses are true responses from humans, and the negative responses are randomly sampled.

The Douban Conversation Corpus was crawled from a Chinese social network on open-domain topics. It was constructed in a similar way to the Ubuntu corpus. The Douban Conversation Corpus collected responses via a small inverted-index system, and labels were manually annotated. The Douban Conversation Corpus is different from the other three datasets in that it includes multiple correct candidates for a context in the test set, which leads to low $R_{10}@1$; e.g., if there are 3 correct responses, the maximum $R_{10}@1$ is 0.33. Hence, MAP and MRR are recommended for reference.

The E-commerce Dialogue Corpus collected real-world conversations between customers and customer service staff from the largest e-commerce platform in China.

The DSTC 8-Track 2-Subtask 2 Corpus does not release the labels of the test set. Participants submit their results on the test set to the organizers, who then evaluate them. Thus, we submitted only one result to the organizers, and we provide other results on the validation set for reference.

Footnote 1: There are multiple correct candidates for a context in the test set of the Douban Conversation Corpus. In Table 2 we list an average ratio over the whole set.
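The threshold-based rejection used for the DSTC 8 corpus in this section can be sketched as a small grid search. This is a hedged sketch only: the text says the threshold was chosen "based on the validation set" without naming the criterion, so selecting by validation accuracy, along with the names `predict` and `select_threshold`, is an assumption.

```python
def predict(scores, threshold):
    """Return the index of the top-scored candidate, or None when even the best
    candidate scores below the threshold (pool assumed to lack a correct response)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None

def select_threshold(validation, grid=None):
    """Pick the threshold with the highest validation accuracy (assumed criterion).

    validation: list of (scores, gold), where gold is the index of the correct
                candidate or None if the pool contains no correct response.
    """
    if grid is None:
        # 0.60, 0.65, ..., 0.95 as stated in the text.
        grid = [round(0.60 + 0.05 * i, 2) for i in range(8)]
    best_threshold, best_accuracy = None, -1.0
    for threshold in grid:
        correct = sum(predict(scores, threshold) == gold for scores, gold in validation)
        accuracy = correct / len(validation)
        if accuracy > best_accuracy:  # ties keep the smaller threshold
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold
```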

5.2 Evaluation Metrics

We used the same evaluation metrics as those used in previous work [Lowe et al.2015, Lowe et al.2017, Wu et al.2017, Zhang et al.2018b, Seokhwan Kim2019]. Each model was tasked with selecting the $k$ best-matched responses from $n$ available candidates for the given conversation context $c$, and we calculated the recall of the true positive replies among the $k$ selected responses, denoted as $R_n@k$, as the main evaluation metric. In addition to $R_n@k$, we considered the mean average precision (MAP) [Baeza-Yates and Ribeiro-Neto1999], mean reciprocal rank (MRR) [Voorhees1999] and precision-at-one ($P@1$), especially for the Douban corpus, following the settings of previous work.
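The ranking metrics named above can be computed per context as in the sketch below, and then averaged over all contexts to obtain the reported numbers. The function names are illustrative; each function takes the binary relevance labels of the candidates already sorted by descending model score.

```python
def recall_at_k(ranked_labels, k):
    """Fraction of the positive candidates that appear in the top k."""
    positives = sum(ranked_labels)
    return sum(ranked_labels[:k]) / positives if positives else 0.0

def precision_at_one(ranked_labels):
    """1.0 if the top-ranked candidate is positive, else 0.0."""
    return float(ranked_labels[0])

def reciprocal_rank(ranked_labels):
    """Inverse rank of the first positive candidate (0.0 if none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_labels):
    """Mean of precision values taken at the rank of each positive candidate."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

With three positives among ten candidates, `recall_at_k(labels, 1)` cannot exceed 1/3, which is why recall at one is capped on test sets with multiple correct candidates.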

5.3 Training Details

In our experiments, the base version of BERT was adopted. Most hyper-parameters of the original BERT were followed except the following configurations. The initial learning rate was set to 2e-5 and was linearly decayed by L2 weight decay. The maximum sequence length of the concatenation of a context-response pair was set to 512. The training batch size was set to 25. The maximum number of training epochs was set to 3. We used the validation set to set the stop condition in order to select the best model for testing. All code was implemented in the TensorFlow framework [Abadi et al.2016] and will be published to help replicate our results after paper acceptance.

Footnote 2: Code will be released at https://github.com/JasonForJoy

5.4 Experimental Results

Ubuntu Corpus V1
Model R_2@1 R_10@1 R_10@2 R_10@5
TF-IDF [Lowe et al.2015] 0.659 0.410 0.545 0.708
RNN [Lowe et al.2015] 0.768 0.403 0.547 0.819
LSTM [Lowe et al.2015] 0.878 0.604 0.745 0.926
DL2R [Yan et al.2016] 0.899 0.626 0.783 0.944
Match-LSTM [Wang and Jiang2016b] 0.904 0.653 0.799 0.944
MV-LSTM [Wan et al.2016] 0.906 0.653 0.804 0.946
Multi-View [Zhou et al.2016] 0.908 0.662 0.801 0.951
CompAgg [Wang and Jiang2016a] 0.884 0.631 0.753 0.927
BiMPM [Wang et al.2017] 0.897 0.665 0.786 0.938
HRDE-LTC [Yoon et al.2018] 0.916 0.684 0.822 0.960
SMN [Wu et al.2017] 0.926 0.726 0.847 0.961
DUA [Zhang et al.2018b] - 0.752 0.868 0.962
DAM [Zhou et al.2018b] 0.938 0.767 0.874 0.969
MRFN [Tao et al.2019a] 0.945 0.786 0.886 0.976
IMN [Gu et al.2019] 0.946 0.794 0.889 0.974
IOI [Tao et al.2019b] 0.947 0.796 0.894 0.974
MSN [Yuan et al.2019] - 0.800 0.899 0.978
BERT 0.950 0.808 0.897 0.975
SA-BERT 0.965 0.855 0.928 0.983
Table 3: Evaluation results of SA-BERT and previous methods on Ubuntu Dialogue Corpus V1.
Ubuntu Corpus V2
Model R_2@1 R_10@1 R_10@2 R_10@5
TF-IDF [Lowe et al.2017] 0.749 0.488 0.587 0.763
RNN [Lowe et al.2017] 0.777 0.379 0.561 0.836
LSTM [Lowe et al.2017] 0.869 0.552 0.721 0.924
RNN-CNN [Baudis and Sedivý2016] 0.911 0.672 0.809 0.956
CompAgg [Wang and Jiang2016a] 0.895 0.641 0.776 0.937
BiMPM [Wang et al.2017] 0.877 0.611 0.747 0.921
HRDE-LTC [Yoon et al.2018] 0.915 0.652 0.815 0.966
U2U-IMN [Gu et al.2020] 0.943 0.762 0.877 0.975
IMN [Gu et al.2019] 0.945 0.771 0.886 0.979
BERT 0.950 0.781 0.890 0.980
SA-BERT 0.963 0.830 0.919 0.985
Table 4: Evaluation results of SA-BERT and previous methods on Ubuntu Dialogue Corpus V2.
Douban Conversation Corpus
Model MAP MRR P@1 R_10@1 R_10@2 R_10@5
TF-IDF [Lowe et al.2015] 0.331 0.359 0.180 0.096 0.172 0.405
RNN [Lowe et al.2015] 0.390 0.422 0.208 0.118 0.223 0.589
LSTM [Lowe et al.2015] 0.485 0.527 0.320 0.187 0.343 0.720
Multi-View [Zhou et al.2016] 0.505 0.543 0.342 0.202 0.350 0.729
DL2R [Yan et al.2016] 0.488 0.527 0.330 0.193 0.342 0.705
MV-LSTM [Wan et al.2016] 0.498 0.538 0.348 0.202 0.351 0.710
Match-LSTM [Wang and Jiang2016b] 0.500 0.537 0.345 0.202 0.348 0.720
SMN [Wu et al.2017] 0.529 0.569 0.397 0.233 0.396 0.724
DUA [Zhang et al.2018b] 0.551 0.599 0.421 0.243 0.421 0.780
DAM [Zhou et al.2018b] 0.550 0.601 0.427 0.254 0.410 0.757
MRFN [Tao et al.2019a] 0.571 0.617 0.448 0.276 0.435 0.783
IMN [Gu et al.2019] 0.570 0.615 0.433 0.262 0.452 0.789
IOI [Tao et al.2019b] 0.573 0.621 0.444 0.269 0.451 0.786
MSN [Yuan et al.2019] 0.587 0.632 0.470 0.295 0.452 0.788
BERT 0.591 0.633 0.454 0.280 0.470 0.828
SA-BERT 0.619 0.659 0.496 0.313 0.481 0.847
Table 5: Evaluation results of SA-BERT and previous methods on the Douban Corpus.
E-commerce Corpus
Model R_10@1 R_10@2 R_10@5
TF-IDF [Lowe et al.2015] 0.159 0.256 0.477
RNN [Lowe et al.2015] 0.325 0.463 0.775
LSTM [Lowe et al.2015] 0.365 0.536 0.828
Multi-View [Zhou et al.2016] 0.421 0.601 0.861
DL2R [Yan et al.2016] 0.399 0.571 0.842
MV-LSTM [Wan et al.2016] 0.412 0.591 0.857
Match-LSTM [Wang and Jiang2016b] 0.410 0.590 0.858
SMN [Wu et al.2017] 0.453 0.654 0.886
DUA [Zhang et al.2018b] 0.501 0.700 0.921
DAM [Zhou et al.2018b] 0.526 0.727 0.933
IOI [Tao et al.2019b] 0.563 0.768 0.950
MSN [Yuan et al.2019] 0.606 0.770 0.937
IMN [Gu et al.2019] 0.621 0.797 0.964
BERT 0.610 0.814 0.973
SA-BERT 0.704 0.879 0.985
Table 6: Evaluation results of SA-BERT and previous methods on the E-commerce Corpus.
DSTC 8-Track 2-Subtask 2
Set Model MRR R_100@1
Valid IMN [Gu et al.2019] 0.443 0.322
IMN [Gu et al.2019] + SDS 0.504 0.375
BERT 0.335 0.258
BERT + SDS 0.560 0.440
SA-BERT - SDS 0.344 0.265
SA-BERT 0.594 0.477
SA-BERT (Ensemble) 0.611 0.496
Test SA-BERT (Ensemble) 0.621 0.506
Table 7: Evaluation results of SA-BERT and ablation tests of the speaker-aware disentanglement strategy (SDS) on the DSTC 8-Track 2-Subtask 2 Corpus.

Table 3, Table 4, Table 5, Table 6 and Table 7 present the evaluation results of SA-BERT and previous methods on the five datasets respectively. All the results except ours are from the existing literature. Because previous methods did not make use of pre-trained language models, we reproduced the results of a BERT baseline by fine-tuning on the training set for reference, denoted as BERT, for fair comparisons. As we can see, BERT has already outperformed the existing models on most metrics, except for $R_{10}@2$ and $R_{10}@5$ on Ubuntu Dialogue Corpus V1 and $R_{10}@1$ on E-commerce Corpus. Furthermore, our proposed SA-BERT outperforms the other models on all metrics and datasets, which demonstrates its ability to select the best-matched response and its compatibility across domains (system troubleshooting, social network and e-commerce). These results show that our proposed SA-BERT has achieved a new state-of-the-art performance for multi-turn response selection.

In more detail, SA-BERT outperformed the previous state-of-the-art performance by large margins of 5.5% $R_{10}@1$ on Ubuntu Dialogue Corpus V1, 5.9% $R_{10}@1$ on Ubuntu Dialogue Corpus V2, 3.2% MAP and 2.7% MRR on Douban Conversation Corpus, 8.3% $R_{10}@1$ on E-commerce Corpus, and 15.5% $R_{100}@1$ on DSTC 8-Track 2-Subtask 2 Corpus. Compared with BERT, SA-BERT improved by large margins of 4.7% $R_{10}@1$ on Ubuntu Dialogue Corpus V1, 4.9% $R_{10}@1$ on Ubuntu Dialogue Corpus V2, 2.8% MAP and 2.6% MRR on Douban Conversation Corpus, 9.4% $R_{10}@1$ on E-commerce Corpus, and 21.9% $R_{100}@1$ on DSTC 8-Track 2-Subtask 2 Corpus. These results show that our proposed SA-BERT has achieved a new state-of-the-art performance on all datasets for multi-turn response selection.

6 Analysis

6.1 Adaptation Corpus

We further analyze the effect of the adaptation corpus on the performance of multi-turn response selection. We performed the adaptation process on corpora from the same domain but different sets. Here, three different Ubuntu sets were employed: DSTC 8-Track 2, Ubuntu Dialogue Corpus V1, and Ubuntu Dialogue Corpus V2. The fine-tuning process was then performed on the training set of Ubuntu Dialogue Corpus V2 in all cases. The results on the test set of Ubuntu Dialogue Corpus V2 are shown in Table 8.

As we can see, the adaptation process helps to improve the performance no matter which adaptation corpus is used. Furthermore, adaptation and fine-tuning on the same corpus achieved the best performance. One explanation may be that although pre-trained language models are designed to provide general linguistic knowledge, some domain-specific knowledge is also necessary for a specific task. Thus, adaptation on a domain-specific corpus can help to incorporate more domain-specific knowledge, and the more similar this adaptation corpus is to the task, the more improvement it can help to achieve.

6.2 Speaker Embeddings

Adaptation corpus R_2@1 R_10@1 R_10@2 R_10@5
None 0.950 0.786 0.890 0.981
DSTC8 0.954 0.803 0.902 0.981
Ubuntu V1 0.961 0.824 0.914 0.985
Ubuntu V2 0.963 0.830 0.919 0.985
Table 8: Results on the test set of Ubuntu Dialogue Corpus V2, by domain adaptation with different corpora and fine-tuning all on the training set of Ubuntu Dialogue Corpus V2.
Pre-train Speaker embeddings R_2@1 R_10@1 R_10@2 R_10@5
No No 0.950 0.781 0.890 0.980
No Yes 0.950 0.786 0.890 0.981
Yes No 0.961 0.825 0.915 0.984
Yes Yes 0.963 0.830 0.919 0.985
Table 9: Results on the test set of Ubuntu Dialogue Corpus V2, by ablating the speaker embeddings.

The speaker embeddings were ablated and the results are reported in Table 9. The first two rows cover the setting in which the adaptation process was omitted, and the last two rows cover the setting in which it was included. The performance drop verifies the effectiveness of speaker embeddings. First, without the pre-training process, the speaker embeddings were initialized at random and updated during the fine-tuning process. It can be seen that adding the speaker embeddings only during the fine-tuning process provides an improvement of 0.5% in terms of $R_{10}@1$, which shows their effectiveness for modelling the speaker change during the conversation. Furthermore, similar results can be observed with the pre-training process included, which again verifies the effectiveness of our method.

6.3 Speaker-Aware Disentanglement Strategy

To show the effectiveness of the speaker-aware disentanglement strategy, we also applied it to an existing model, IMN [Gu et al.2019]. The original IMN did not employ any disentanglement strategy and selected the last 70 utterances as the context, which achieved a performance of 32.2% $R_{100}@1$. After employing the strategy, about 25 utterances were selected to form the context, which achieved a performance of 37.5% $R_{100}@1$. Similar results can also be observed by applying this strategy to BERT and ablating this strategy in SA-BERT, as shown in Table 7, which again verifies the effectiveness of the speaker-aware disentanglement strategy.

7 Conclusion

In this paper, we study the problem of employing pre-trained language models for multi-turn response selection in retrieval-based chatbots. A speaker-aware BERT model is proposed to improve BERT by adding speaker embeddings, introducing a speaker-aware disentanglement strategy and adapting to the specific domain. Experiments on five public datasets show that our proposed method achieves new state-of-the-art performances for multi-turn response selection. Adjusting pre-trained language models to fit multi-turn response selection and designing new disentanglement strategies will be a part of our future work.

References

  • [Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 265–283.
  • [Baeza-Yates and Ribeiro-Neto1999] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. 1999. Modern Information Retrieval. ACM Press / Addison-Wesley.
  • [Baudis and Sedivý2016] Petr Baudis and Jan Sedivý. 2016. Sentence pair scoring: Towards unified framework for text comprehension. CoRR, abs/1603.06127.
  • [Devlin et al.2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186.
  • [Dong and Huang2018] Jianxiong Dong and Jim Huang. 2018. Enhance word representation for out-of-vocabulary on ubuntu dialogue corpus. CoRR, abs/1802.02614.
  • [Gu et al.2019] Jia-Chen Gu, Zhen-Hua Ling, and Quan Liu. 2019. Interactive matching network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pages 2321–2324. ACM.
  • [Gu et al.2020] Jia-Chen Gu, Zhen-Hua Ling, and Quan Liu. 2020. Utterance-to-utterance interactive matching network for multi-turn response selection in retrieval-based chatbots. IEEE ACM Trans. Audio Speech Lang. Process., 28:369–379.
  • [Henderson et al.2019] Matthew Henderson, Ivan Vulic, Daniela Gerz, Iñigo Casanueva, Pawel Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrksic, and Pei-Hao Su. 2019. Training neural response selection for task-oriented dialogue systems. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 5392–5404.
  • [Ji et al.2014] Zongcheng Ji, Zhengdong Lu, and Hang Li. 2014. An information retrieval approach to short text conversation. CoRR, abs/1408.6988.
  • [Kadlec et al.2015] Rudolf Kadlec, Martin Schmid, and Jan Kleindienst. 2015. Improved deep learning baselines for Ubuntu corpus dialogs. CoRR, abs/1510.03753.
  • [Li et al.2016] Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • [Li et al.2017] Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2157–2169.
  • [Lowe et al.2015] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2-4 September 2015, Prague, Czech Republic, pages 285–294.
  • [Lowe et al.2017] Ryan Thomas Lowe, Nissan Pow, Iulian Vlad Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training end-to-end dialogue systems with the Ubuntu dialogue corpus. Dialogue & Discourse, 8(1):31–65.
  • [Peters et al.2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
  • [Seokhwan Kim2019] Seokhwan Kim, Michel Galley, Chulaka Gunasekara, Sungjin Lee, Adam Atkinson, Baolin Peng, Hannes Schulz, Jianfeng Gao, Jinchao Li, Mahmoud Adada, Minlie Huang, Luis Lastras, Jonathan K. Kummerfeld, Walter S. Lasecki, Chiori Hori, Anoop Cherian, Tim K. Marks, Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, and Raghav Gupta. 2019. The eighth dialog system technology challenge. arXiv preprint.
  • [Serban et al.2016] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 3776–3784.
  • [Shang et al.2015] Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1577–1586.
  • [Tao et al.2019a] Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019a. Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, pages 267–275.
  • [Tao et al.2019b] Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, and Rui Yan. 2019b. One time of interaction may not be enough: Go deep with an interaction-over-interaction network for response selection in dialogues. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, pages 1–11.
  • [Voorhees1999] Ellen M. Voorhees. 1999. The TREC-8 question answering track report. In Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999.
  • [Wan et al.2016] Shengxian Wan, Yanyan Lan, Jun Xu, Jiafeng Guo, Liang Pang, and Xueqi Cheng. 2016. Match-SRNN: Modeling the recursive matching structure with spatial RNN. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2922–2928.
  • [Wang and Jiang2016a] Shuohang Wang and Jing Jiang. 2016a. A compare-aggregate model for matching text sequences. CoRR, abs/1611.01747.
  • [Wang and Jiang2016b] Shuohang Wang and Jing Jiang. 2016b. Learning natural language inference with LSTM. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 1442–1451.
  • [Wang et al.2013] Hao Wang, Zhengdong Lu, Hang Li, and Enhong Chen. 2013. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 935–945.
  • [Wang et al.2017] Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 4144–4150.
  • [Wu et al.2017] Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 496–505.
  • [Xu et al.2016] Zhen Xu, Bingquan Liu, Baoxun Wang, Chengjie Sun, and Xiaolong Wang. 2016. Incorporating loose-structured knowledge into LSTM with recall gate for conversation modeling. CoRR, abs/1605.05110.
  • [Yan et al.2016] Rui Yan, Yiping Song, and Hua Wu. 2016. Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, July 17-21, 2016, pages 55–64.
  • [Yang et al.2019] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237.
  • [Yoon et al.2018] Seunghyun Yoon, Joongbo Shin, and Kyomin Jung. 2018. Learning to rank question-answer pairs using hierarchical recurrent encoder with latent topic clustering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1575–1584.
  • [Yuan et al.2019] Chunyuan Yuan, Wei Zhou, Mingming Li, Shangwen Lv, Fuqing Zhu, Jizhong Han, and Songlin Hu. 2019. Multi-hop selector network for multi-turn response selection in retrieval-based chatbots. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 111–120. Association for Computational Linguistics.
  • [Zhang et al.2018a] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018a. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2204–2213.
  • [Zhang et al.2018b] Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018b. Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3740–3752.
  • [Zhou et al.2016] Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. 2016. Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 372–381.
  • [Zhou et al.2018a] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018a. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 730–739.
  • [Zhou et al.2018b] Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018b. Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1118–1127.