
Do Response Selection Models Really Know What's Next? Utterance Manipulation Strategies for Multi-turn Response Selection

by   Taesun Whang, et al.

In this paper, we study the task of selecting the optimal response given user and system utterance history in retrieval-based multi-turn dialog systems. Recently, pre-trained language models (e.g., BERT, RoBERTa, and ELECTRA) have shown significant improvements in various natural language processing tasks. This and similar response selection tasks can also be solved using such language models by formulating them as dialog-response binary classification tasks. Although existing works using this approach successfully obtained state-of-the-art results, we observe that language models trained in this manner tend to make predictions based on the relatedness of history and candidates, ignoring the sequential nature of multi-turn dialog systems. This suggests that the response selection task alone is insufficient for learning temporal dependencies between utterances. To this end, we propose utterance manipulation strategies (UMS) to address this problem. Specifically, UMS consist of several strategies (i.e., insertion, deletion, and search), which help the response selection model maintain dialog coherence. Further, UMS are self-supervised methods that do not require additional annotation and thus can be easily incorporated into existing approaches. Extensive evaluation across multiple languages and models shows that UMS are highly effective in teaching dialog consistency, which leads to models pushing the state of the art by significant margins on multiple public benchmark datasets.


1 Introduction

Figure 1: An example of multi-turn response selection. BERT-based model tends to calculate the matching score of a dialog-response pair depending on its semantic relatedness ((a) (b)). More details are described in Section 5.2.

In recent years, building intelligent conversational agents has gained increased attention in the field of natural language processing (NLP). Among widely used dialog systems, retrieval-based dialog systems Lowe et al. (2015); Wu et al. (2017); Zhang et al. (2018) are implemented in a variety of industries since they provide accurate, informative, and promising responses. In this study, we focus on multi-turn response selection in retrieval-based dialog systems. This is the task of predicting the most likely response from a set of candidates, given a dialog history.

Existing works Wu et al. (2017); Zhou et al. (2018); Tao et al. (2019a); Yuan et al. (2019) have studied utterance-response matching based on attention mechanisms including self-attention Vaswani et al. (2017). Most recently, as pre-trained language models (e.g., BERT Devlin et al. (2019), RoBERTa Liu et al. (2019), and ELECTRA Clark et al. (2020)) have achieved substantial improvements in performance in diverse NLP tasks, multi-turn response selection also has been resolved by using such language models Whang et al. (2019); Lu et al. (2020); Gu et al. (2020); Humeau et al. (2020).

However, we tackle three crucial problems in applying language models to response selection. 1) Domain adaptation through additional training on the target corpus is not only extremely time-consuming but also computationally costly. 2) Formulating response selection as a dialog-response binary classification task is insufficient for representing intra- and inter-utterance interaction, since the dialog context is formed by concatenating all utterances. 3) The models tend to select the optimal response depending on how semantically similar it is to a given dialog. As shown in Figure 1, we experiment to verify whether a BERT-based response selection model is trained properly to select the next utterance rather than a merely dialog-related response. The result shows that the model tends to give a higher probability score to the response that is more semantically related to the dialog context than to the consistent response. Although it is obvious that the ground truth is suitable as the next utterance, the model's prediction depends heavily on semantic relatedness.

To address these issues, this paper proposes Utterance Manipulation Strategies (UMS) for multi-turn response selection. Specifically, UMS consist of three powerful strategies (i.e., insertion, deletion, and search), which effectively help the response selection model to learn temporal dependencies between utterances and maintain dialog coherence. In addition, these strategies are fully self-supervised methods that do not require additional annotation and can be easily adapted to existing studies. We briefly summarize the main contributions of this paper: 1) We show that existing response selection models are more likely to predict a semantically relevant response with its dialog rather than the next utterance. 2) We propose simple but novel utterance manipulation strategies, which are highly effective in predicting the next utterance. Our model has strengths in effectively performing in-domain classification. 3) Experimental results on three benchmarks (i.e., Ubuntu, Douban, and E-commerce) show that our proposed model outperforms state-of-the-art methods. We also obtain significant improvements in performance on a new Korean open-domain corpus compared to the baselines.

2 Related Work

Early approaches to response selection focused on single-turn response selection Wang et al. (2013); Hu et al. (2014); Wang et al. (2015). Recently, multi-turn response selection has received more attention from researchers. Lowe et al. (2015) proposed the dual encoder architecture, which uses RNN-based models to match the dialog and response. Zhou et al. (2016) proposed the multi-view model, which encodes dialog context and response at both the word level and the utterance level. However, these models are limited in how fully they reflect the relationship between the dialog and response. To alleviate this, Wu et al. (2017) proposed the sequential matching network, which utilizes matching matrices to match each utterance with the response. As the self-attention mechanism Vaswani et al. (2017) has proved its effectiveness, it has been applied in subsequent works Zhou et al. (2018); Tao et al. (2019a, b). Yuan et al. (2019) recently pointed out that previous approaches construct dialog representations from abundant but noisy information, which can deteriorate performance. They proposed an effective history filtering technique to avoid using excessive history information.

Most recently, many studies based on pre-trained language models, including BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019), have been proposed. Generally, most models formulate the response selection task as a dialog-response binary classification task. Whang et al. (2019) first applied BERT to multi-turn response selection and obtained state-of-the-art results by further training BERT on a domain-specific corpus. Subsequent studies Lu et al. (2020); Gu et al. (2020) focused on modeling speaker information and showed its effectiveness in response retrieval. Humeau et al. (2020) investigated the trade-off between model complexity and computational efficiency in language models. They proposed poly-encoders, which ensure fast inference speed, although their performance is slightly lower than that of the cross-encoder.

Figure 2: An overview of Utterance Manipulation Strategies. Input sequence for each manipulation strategy is dynamically constructed by extracting consecutive utterances from the original dialog context during the training period. Also, target utterance is randomly chosen from either the dialog context (Insertion, Search) or the random dialog (Deletion).

3 Proposed Method

3.1 Language Models for Response Selection

Pre-trained Language Models

Recently, pre-trained language models, such as BERT Devlin et al. (2019) and ELECTRA Clark et al. (2020), were successfully adapted to a wide range of NLP tasks, including multi-turn response selection, achieving state-of-the-art results. In this work, we build upon this success and evaluate our method by incorporating it into BERT and ELECTRA.

Domain-specific Post-training

Since contextual language models are pre-trained on general corpora, such as the Toronto Books Corpus and Wikipedia, it is less effective to directly fine-tune these models on downstream tasks if there is a domain shift. Hence, it is a common practice to further train such models with the language modeling objective using texts from the target domain to reduce the negative impact. This has shown to be effective in various tasks including review reading comprehension Xu et al. (2019) and SuperGLUE Wang et al. (2019). Existing works on multi-turn response selection Whang et al. (2019); Gu et al. (2020); Humeau et al. (2020) also adapted this post-training approach and obtained state-of-the-art results. We also employ this post-training method in this work and show its effectiveness in improving performance (Section 5.1).

Training Response Selection Models

Following several studies based on contextual language models for multi-turn response selection Whang et al. (2019); Lu et al. (2020); Gu et al. (2020), a pointwise approach is used to learn a cross-encoder that receives both dialog context and response simultaneously. Suppose that a dialog agent is given a dialog dataset $\mathcal{D} = \{(U_i, r_i, y_i)\}_{i=1}^{N}$. Each triplet consists of 1) a sequence of utterances $U_i = \{u_1, u_2, \dots, u_M\}$ representing the historical context, where $u_m$ is a single utterance, 2) a response $r_i$, and 3) a label $y_i \in \{0, 1\}$. Each utterance and response is composed of multiple tokens, including a special “End Of Turn” token [eot] at the end of each utterance, following the work of Whang et al. (2019). In general, the input sequence,

$x = [\text{cls}]\; u_1\; u_2\; \dots\; u_M\; [\text{sep}]\; r_i\; [\text{sep}]$,

is fed into pre-trained language models (i.e., BERT, ELECTRA), then the output representation of the [cls] token, $T_{[\text{cls}]}$, is used to classify whether the dialog-response pair is consistent. A relevance score of the dialog utterances and response is formulated as,

$g(U_i, r_i) = \sigma(W^{\top} T_{[\text{cls}]} + b)$,

where $W$ and $b$ are trainable parameters. We use binary cross-entropy loss to optimize the models.
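The pointwise scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`build_input`, `relevance_score`, `bce_loss`), whitespace tokenization, and the choice not to append [eot] to the response are assumptions made for illustration, and the [cls] vector is passed in as a plain list rather than produced by a real language model.

```python
import math

def build_input(utterances, response):
    """Concatenate dialog utterances and one candidate response into a token list.

    Assumption: whitespace tokenization, and [eot] is appended to each
    context utterance only.
    """
    tokens = ["[cls]"]
    for utt in utterances:
        tokens += utt.split() + ["[eot]"]
    tokens += ["[sep]"] + response.split() + ["[sep]"]
    return tokens

def relevance_score(cls_vector, weights, bias):
    """Sigmoid of a linear projection of the [cls] output representation."""
    logit = sum(w * t for w, t in zip(weights, cls_vector)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(score, label):
    """Binary cross-entropy for a single dialog-response pair."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(score + eps) + (1 - label) * math.log(1 - score + eps))
```

In a real system, `cls_vector` would come from BERT or ELECTRA, and `weights` and `bias` would be learned jointly with the encoder.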

3.2 Utterance Manipulation Strategies

Figure 2 describes the overview of our proposed method, utterance manipulation strategies. We propose a multi-task learning framework, which consists of three highly effective auxiliary tasks for multi-turn response selection, utterance 1) insertion, 2) deletion, and 3) search. These tasks are jointly trained with the response selection model during the fine-tuning period. To train the auxiliary tasks, we add new special tokens, [ins], [del], and [srch]  for the utterance insertion, deletion, and search tasks, respectively. We cover how we train the model with these special tokens in the following sections.

Utterance Insertion

Despite the huge success of BERT, it has limitations to understand discourse-level semantic structure since NSP, one of BERT’s objectives, only performs to distinguish whether the given sentence pairs are irrelevant. In multi-turn response selection, the model needs the ability not only to distinguish the utterances with different semantic meanings but also to discriminate whether the utterances are consecutive even if they are semantically related. We propose utterance insertion to resolve the issues above.

We first extract $k$ consecutive utterances from the original dialog context, then randomly select one of the utterances to be inserted. To train the model to find where the selected utterance should be inserted, [ins] tokens are positioned before and after each utterance. The [ins] tokens represent the possible positions of the target utterance. The input sequence for utterance insertion is denoted as,

$x_{\text{ins}} = [\text{cls}]\; [\text{ins}]_1\; u_1\; [\text{ins}]_2\; u_2\; \dots\; u_{t-1}\; [\text{ins}]_t\; u_{t+1}\; \dots\; u_k\; [\text{ins}]_k\; [\text{sep}]\; u_t\; [\text{sep}]$,

where $u_t$ is the target utterance and $[\text{ins}]_t$ is the target insertion token.
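As a rough illustration of how an utterance insertion example could be built from consecutive utterances, consider the sketch below. The function name `make_insertion_example` and the representation of each utterance as a single string are hypothetical; the paper does not specify these details.

```python
import random

def make_insertion_example(utterances, rng):
    """Build one insertion example from a list of consecutive utterances.

    One utterance is removed and becomes the target. An [ins] token is placed
    before and after each remaining utterance, and the slot where the target
    originally sat is labeled 1.
    """
    t = rng.randrange(len(utterances))
    target = utterances[t]
    remaining = utterances[:t] + utterances[t + 1:]
    tokens = ["[cls]"]
    for utt in remaining:
        tokens += ["[ins]", utt]
    tokens += ["[ins]", "[sep]", target, "[sep]"]
    # Slot i sits before remaining[i]; the final slot follows the last utterance.
    labels = [1 if i == t else 0 for i in range(len(remaining) + 1)]
    return tokens, labels
```

With five utterances this yields four context utterances and five candidate slots, exactly one of which is the target.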

Utterance Deletion

Recent BERT-based models for multi-turn response selection regard the task as dialog-response binary classification. Even though they are extended in a multi-turn manner by using separating tokens (e.g., [sep], [eot]), they lack utterance-level interaction between dialog context and response. To alleviate this, we propose a novel auxiliary task, utterance deletion, for enriching utterance-level interaction in multi-turn conversation.

As in utterance insertion, $k$ consecutive utterances are extracted from the original dialog context, then an utterance from a random dialog is inserted among the extracted utterances. In other words, the $k+1$ utterances are composed of $k$ utterances from the original conversation and one from a different dialog. To train the model to find the unrelated utterance, [del] tokens are positioned before each utterance. The objective of the utterance deletion task is to predict which utterance causes inconsistency. We denote the input sequence for utterance deletion as,

$x_{\text{del}} = [\text{cls}]\; [\text{del}]_1\; u_1\; \dots\; [\text{del}]_t\; u^{\text{rand}}\; [\text{del}]_{t+1}\; u_t\; \dots\; [\text{del}]_{k+1}\; u_k\; [\text{sep}]$,

where $u^{\text{rand}}$ is the utterance from the random dialog and $[\text{del}]_t$ is the target deletion token.
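A deletion example could be assembled along the following lines. This is a sketch under stated assumptions: `make_deletion_example` is a hypothetical helper, utterances are plain strings, and the distractor position is drawn uniformly, which the paper does not confirm.

```python
import random

def make_deletion_example(utterances, distractor, rng):
    """Insert an utterance from another dialog and mark it for deletion.

    A [del] token precedes every utterance; the one in front of the
    distractor is labeled 1, all others 0.
    """
    t = rng.randrange(len(utterances) + 1)  # position where the distractor lands
    mixed = utterances[:t] + [distractor] + utterances[t:]
    tokens = ["[cls]"]
    for utt in mixed:
        tokens += ["[del]", utt]
    tokens.append("[sep]")
    labels = [1 if i == t else 0 for i in range(len(mixed))]
    return tokens, labels
```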

Utterance Search

Whereas the two previous auxiliary tasks are performed on a properly ordered dialog, we design a novel task, utterance search, which aims to find an appropriate utterance among randomly shuffled utterances. The objective of this task is to learn temporal dependencies between semantically similar utterances.

Given $k$ consecutive utterances as in the previous tasks, we shuffle the utterances except the last one and insert [srch] tokens before each shuffled utterance. Utterance search aims to find, among the jumbled utterances, the utterance that immediately precedes the last utterance. The input sequence for utterance search is denoted as,

$x_{\text{srch}} = [\text{cls}]\; [\text{srch}]_1\; u'_1\; [\text{srch}]_2\; u'_2\; \dots\; [\text{srch}]_{k-1}\; u'_{k-1}\; [\text{sep}]\; u_k\; [\text{sep}]$,

where $\{u'_1, \dots, u'_{k-1}\}$ is a set of utterances which are randomly shuffled except the last utterance $u_k$. The previous utterance of $u_k$ is denoted as $u'_t$ (i.e., $u'_t = u_{k-1}$) and $[\text{srch}]_t$ is the target search token.
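The search task can be sketched in the same style. The helper name `make_search_example` is hypothetical, and shuffling by index (rather than by value) is a design choice made here so that duplicate utterances do not produce ambiguous labels.

```python
import random

def make_search_example(utterances, rng):
    """Shuffle all utterances except the last and mark its true predecessor.

    A [srch] token precedes each shuffled utterance; the token in front of
    the utterance that originally came right before the last one is labeled 1.
    """
    *history, last = utterances
    order = list(range(len(history)))
    rng.shuffle(order)
    tokens = ["[cls]"]
    for i in order:
        tokens += ["[srch]", history[i]]
    tokens += ["[sep]", last, "[sep]"]
    labels = [1 if i == len(history) - 1 else 0 for i in order]
    return tokens, labels
```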


Dataset     Split         # pairs  pos:neg  # avg turns
Ubuntu      Train         1M       1:1      10.13
Ubuntu      Val           500K     1:9      10.11
Ubuntu      Test          500K     1:9      10.11
Douban      Train         1M       1:1      6.69
Douban      Val           50K      1:1      6.75
Douban      Test          6670     1.2:8.8  6.45
E-Commerce  Train         1M       1:1      5.51
E-Commerce  Val           10K      1:1      5.48
E-Commerce  Test          10K      1:9      5.64
Kakao       Train         1M       1:1      3.00
Kakao       Val           50K      1:1      3.00
Kakao       Test (Web)    5139     1.6:7.4  3.49
Kakao       Test (Clean)  7164     2:7      3.25
Table 1: Corpus statistics of multi-turn response selection datasets.

3.3 Multi-Task Learning Setup

The input sequence of each task is fed into the language models. The output representations of special tokens (i.e., [ins], [del], and [srch]) are used to classify whether each token is in a correct position to be inserted, deleted, and searched. Target tokens for each task are labeled as 1, and all other special tokens as 0. We calculate the probability of the token being a target, denoted as follows.

$p_{\text{task}} = \sigma(W_{\text{task}}^{\top} T_{[\text{task}]} + b_{\text{task}})$,

where task $\in$ {ins, del, srch}, $T_{[\text{task}]}$ is the output representation of each special token, and $W_{\text{task}}$ and $b_{\text{task}}$ are trainable parameters. We use binary cross-entropy loss for all auxiliary tasks to optimize each model. The final loss is determined by summing up the response selection loss and the UMS losses with the same ratio (we obtained the best results by summing up all the losses equally).
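The equally weighted sum of the four losses can be written out as a small sketch. The function names are hypothetical, and averaging the per-token losses within each task before summing is an assumption; the text only states that the losses are summed with the same ratio.

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))

def mean_bce(probs, labels):
    """Average binary cross-entropy over the predictions of one task."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(probs)

def total_loss(selection, insertion, deletion, search):
    """Sum the response selection loss and the three UMS losses equally.

    Each argument is a (probs, labels) pair for that task.
    """
    tasks = (selection, insertion, deletion, search)
    return sum(mean_bce(probs, labels) for probs, labels in tasks)
```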

4 Experimental Setup

4.1 Datasets

We evaluate our model on three widely used response selection benchmarks, Ubuntu Corpus V1 Lowe et al. (2015), Douban Corpus Wu et al. (2017), and E-Commerce Corpus Zhang et al. (2018). Also, a new open-domain dialog corpus, Kakao Corpus, is utilized to evaluate our model. All datasets consist of dyadic multi-turn conversations and their statistics are summarized in Table 1.

Ubuntu Corpus V1

Ubuntu dataset is a large multi-turn conversation corpus, which is constructed from Ubuntu internet relay chat. It mainly consists of conversations of two participants who discuss how to troubleshoot Ubuntu operating system. We utilize the data released by Xu et al. (2017), where numbers, urls, paths are replaced with special placeholders following the previous works Wu et al. (2017); Zhou et al. (2018).

Douban Corpus

Douban dataset is a Chinese open-domain dialog corpus, while the Ubuntu Corpus is a domain-specific dataset. It is constructed by web-crawling from Douban group, which is a popular social networking service (SNS) in China.

E-commerce Corpus

E-Commerce dataset is another Chinese multi-turn conversation corpus. It is collected from real-world customer consultation dialogs from Taobao, which is the largest Chinese e-commerce platform. It consists of several types of conversations (e.g., commodity consultation, recommendation, negotiation) based on various commodities.

Kakao Corpus

Kakao dataset is a large Korean open-domain dialog corpus, which is constructed by Kakao Corporation. It is mainly web-crawled from Korean SNS such as Korean Twitter and Reddit. In a manner similar to how the Ubuntu dataset was constructed, we take the last utterance of the dialog as a positive response and the rest as dialog context. Negative responses are randomly sampled from the other conversations. We split the test set into two sets: 1) Web follows the same construction as the training set; 2) Clean consists of grammatically correct conversations which are constructed by human annotators and inspected by NLP experts.
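The construction just described (last utterance as the positive response, negatives drawn from other conversations) might look like the following sketch. The helper `build_examples`, the `num_negatives` parameter, and the choice to sample a negative as any utterance of another dialog are illustrative assumptions, not details given for the Kakao Corpus.

```python
import random

def build_examples(dialogs, num_negatives, rng):
    """Turn raw dialogs into (context, response, label) triplets.

    The last utterance of each dialog is the positive response; negatives
    are utterances sampled from other dialogs. Requires at least two dialogs.
    """
    examples = []
    for i, dialog in enumerate(dialogs):
        context, positive = dialog[:-1], dialog[-1]
        examples.append((context, positive, 1))
        others = [d for j, d in enumerate(dialogs) if j != i]
        for _ in range(num_negatives):
            negative = rng.choice(rng.choice(others))
            examples.append((context, negative, 0))
    return examples
```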

4.2 Evaluation Metrics

We evaluate our model using several retrieval metrics, following previous studies Lowe et al. (2015); Wu et al. (2017); Zhou et al. (2018); Yuan et al. (2019). First, we employ 1 in $n$ recall at $k$, denoted as $R_n@k$ ($n = 10$, $k \in \{1, 2, 5\}$), which is 1 when a ground truth is positioned in the selected top-$k$ list and 0 otherwise. Also, three other metrics, mean average precision (MAP), mean reciprocal rank (MRR), and precision at one (P@1), are used especially for Douban and Kakao since these two datasets may contain more than one positive response among the candidates.
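For concreteness, the per-dialog versions of these metrics can be sketched as below. The function names are illustrative; MAP and MRR are obtained by averaging `average_precision` and `reciprocal_rank` over all test dialogs, and recall follows the definition in the text (1 if any ground truth appears in the top k).

```python
def _ranked_labels(scores, labels):
    """Labels sorted by descending model score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]

def recall_at_k(scores, labels, k):
    """1.0 if a ground-truth response is among the top-k candidates, else 0.0."""
    return 1.0 if 1 in _ranked_labels(scores, labels)[:k] else 0.0

def precision_at_1(scores, labels):
    """1.0 if the top-ranked candidate is a ground-truth response."""
    return float(_ranked_labels(scores, labels)[0] == 1)

def reciprocal_rank(scores, labels):
    """Inverse rank of the first ground-truth response (0.0 if none)."""
    for rank, label in enumerate(_ranked_labels(scores, labels), start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0

def average_precision(scores, labels):
    """Mean of precision values taken at each ground-truth position."""
    hits, total = 0, 0.0
    for rank, label in enumerate(_ranked_labels(scores, labels), start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```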

                                  | Ubuntu              | Douban                                    | E-commerce
Models                            | R10@1  R10@2  R10@5 | MAP    MRR    P@1    R10@1  R10@2  R10@5  | R10@1  R10@2  R10@5
CNN Kadlec et al. (2015)          | 0.549  0.684  0.896 | 0.417  0.440  0.226  0.121  0.252  0.647  | 0.328  0.515  0.792
LSTM Kadlec et al. (2015)         | 0.638  0.784  0.949 | 0.485  0.537  0.320  0.187  0.343  0.720  | 0.365  0.536  0.828
BiLSTM Kadlec et al. (2015)       | 0.630  0.780  0.944 | 0.479  0.514  0.313  0.184  0.330  0.716  | 0.365  0.536  0.825
MV-LSTM Wan et al. (2016)         | 0.653  0.804  0.946 | 0.498  0.538  0.348  0.202  0.351  0.710  | 0.412  0.591  0.857
Match-LSTM Wang and Jiang (2016)  | 0.653  0.799  0.944 | 0.500  0.537  0.345  0.202  0.348  0.720  | 0.410  0.590  0.858
Multi-View Zhou et al. (2016)     | 0.662  0.801  0.951 | 0.505  0.543  0.342  0.202  0.350  0.729  | 0.421  0.601  0.861
DL2R Yan et al. (2016)            | 0.626  0.783  0.944 | 0.488  0.527  0.330  0.193  0.342  0.705  | 0.399  0.571  0.842
SMN Wu et al. (2017)              | 0.726  0.847  0.961 | 0.529  0.569  0.397  0.233  0.396  0.724  | 0.453  0.654  0.886
DUA Zhang et al. (2018)           | 0.752  0.868  0.962 | 0.551  0.599  0.421  0.243  0.421  0.780  | 0.501  0.700  0.921
DAM Zhou et al. (2018)            | 0.767  0.874  0.969 | 0.550  0.601  0.427  0.254  0.410  0.757  | 0.526  0.727  0.933
IoI Tao et al. (2019b)            | 0.796  0.894  0.974 | 0.573  0.621  0.444  0.269  0.451  0.786  | 0.563  0.768  0.950
MSN Yuan et al. (2019)            | 0.800  0.899  0.978 | 0.587  0.632  0.470  0.295  0.452  0.788  | 0.606  0.770  0.937
BERT Gu et al. (2020)             | 0.808  0.897  0.975 | 0.591  0.633  0.454  0.280  0.470  0.828  | 0.610  0.814  0.973
BERT-SS-DA Lu et al. (2020)       | 0.813  0.901  0.977 | 0.602  0.643  0.458  0.280  0.491  0.843  | 0.648  0.843  0.980
SA-BERT Gu et al. (2020)          | 0.855  0.928  0.983 | 0.619  0.659  0.496  0.313  0.481  0.847  | 0.704  0.879  0.985
BERT (ours)                       | 0.820  0.906  0.978 | 0.597  0.634  0.448  0.279  0.489  0.823  | 0.641  0.824  0.973
ELECTRA                           | 0.826  0.908  0.978 | 0.602  0.642  0.465  0.287  0.483  0.839  | 0.609  0.804  0.965
UMS_BERT                          | 0.843  0.920  0.982 | 0.597  0.639  0.466  0.285  0.471  0.829  | 0.674  0.861  0.980
UMS_ELECTRA                       | 0.854  0.929  0.984 | 0.608  0.650  0.472  0.291  0.488  0.845  | 0.648  0.831  0.974
BERT+                             | 0.862  0.935  0.987 | 0.609  0.645  0.463  0.290  0.505  0.838  | 0.725  0.890  0.984
ELECTRA+                          | 0.861  0.932  0.985 | 0.612  0.655  0.480  0.301  0.499  0.836  | 0.673  0.835  0.974
UMS_BERT+                         | 0.875  0.942  0.988 | 0.625  0.664  0.499  0.318  0.482  0.858  | 0.762  0.905  0.986
UMS_ELECTRA+                      | 0.875  0.941  0.988 | 0.623  0.663  0.492  0.307  0.501  0.851  | 0.707  0.853  0.974
Table 2: Results on Ubuntu, Douban, and E-Commerce datasets. All the evaluation results except ours are cited from published literature Tao et al. (2019b); Yuan et al. (2019); Lu et al. (2020); Gu et al. (2020). The underlined numbers mean the best performance for each block and the bold numbers mean state-of-the-art performance for each metric.

4.3 Training Details

We implement our model using the PyTorch deep learning framework Paszke et al. (2019), based on the open source code of Wolf et al. (2019). Since we experiment on three different languages (i.e., English, Chinese, Korean), initial checkpoints for BERT and ELECTRA are adapted from several works Devlin et al. (2019); Clark et al. (2020); Cui et al. (2020); Lee et al. (2020). Specifically, we employ base pre-trained models for all languages except for Chinese (the whole word masking (WWM) strategy is used for Chinese BERT). Since ELECTRA for Korean is not available, we do not conduct ELECTRA-based experiments on the Kakao Corpus. All our experiments, both post-training and fine-tuning, are run on 4 Tesla V100 GPUs. For fine-tuning, we trained the models with a batch size of 32 using the Adam optimizer with an initial learning rate of 3e-5. The maximum sequence length is set to 512 and $k$, the number of consecutive utterances extracted for UMS, is set to 5. Our code and post-trained checkpoints for all benchmarks are publicly available.

4.4 Baselines

Single-turn Matching Models

These baselines, including CNN, LSTM, BiLSTM Kadlec et al. (2015), MV-LSTM Wan et al. (2016), and Match-LSTM Wang and Jiang (2016), are based on matching between a dialog context and a response. They constructed the dialog context by concatenating utterances and regarded it as a long document.

Multi-turn Matching Models

Multi-View Zhou et al. (2016) utilizes both word- and utterance-level representations; DL2R Yan et al. (2016) reformulates the last utterance with previous utterances in the dialog context; SMN Wu et al. (2017) first constructs attention matrices based on word and sequential representations of each utterance and response, then obtains matching vectors by using CNN; DUA Zhang et al. (2018) utilizes deep utterance aggregation to form a fine-grained context representation; DAM Zhou et al. (2018) obtains matching representations of the utterances and response using self- and cross-attention based on Transformer architecture Vaswani et al. (2017); IoI Tao et al. (2019b) lets utterance-response interaction go deep in a matching model; MSN Yuan et al. (2019) filters only relevant utterances using a multi-hop selector network.

BERT-based Models

Recently, BERT Devlin et al. (2019) has also been applied to response selection, as in vanilla BERT Gu et al. (2020), BERT-SS-DA Lu et al. (2020), and SA-BERT Gu et al. (2020). In these models, the dialog context is represented as a long document, as in single-turn matching models. They mainly utilize speaker information of each utterance in the dialog context to extend BERT into a multi-turn fashion.

5 Results and Discussion

5.1 Quantitative Results

Table 2 reports the quantitative results on Ubuntu, Douban, and E-Commerce datasets. In our experiments, we set two conditions for pre-trained language models. 1) Two different pre-trained language models (i.e., BERT Devlin et al. (2019), ELECTRA Clark et al. (2020)) are utilized for fine-tuning. 2) We adapt domain-specific post-training approach Whang et al. (2019); Humeau et al. (2020); Gu et al. (2020) (each post-trained model is denoted as BERT+ and ELECTRA+). Based on these initial settings, we explore how effective UMS are for multi-turn response selection.

For all datasets, models with UMS significantly outperform the previous state-of-the-art methods. Specifically, UMS achieves absolute improvements of 2.0% and 5.8% in $R_{10}@1$ on the Ubuntu and E-Commerce datasets, respectively. For the Douban dataset, MAP and MRR are considered to be the main metrics rather than $R_{10}@1$ because the test set contains more than one ground truth among the candidates. UMS achieves an absolute improvement of 0.5% in these metrics.

To evaluate the effectiveness of UMS, we compare the models with UMS and those without them. Since existing BERT-based approaches Lu et al. (2020); Gu et al. (2020) reported different performance of BERT, we reimplement it for a fair comparison with our proposed UMS. The models with UMS consistently show performance improvement regardless of whether language models are post-trained on each corpus or not. For the models without post-training, different results are obtained depending on the dataset. ELECTRA mainly shows better results for Ubuntu and Douban datasets, while BERT shows better results for E-Commerce dataset. On the contrary, BERT+ achieves the best performance for all corpora in comparison among the models with post-training. We believe that post-training on domain-specific corpus gives the model more opportunities to learn whether given two dialogs are relevant through NSP, which has the effect of data augmentation.


Test Split  Approach  MAP    MRR    P@1    R10@1  R10@2  R10@5
Web         BERT      0.671  0.720  0.555  0.391  0.599  0.890
Web         UMS       0.699  0.751  0.606  0.428  0.623  0.911
Clean       BERT      0.726  0.792  0.648  0.395  0.612  0.888
Clean       UMS       0.761  0.834  0.716  0.431  0.663  0.903
Table 3: Evaluation Results on Kakao Corpus.

Results on Kakao Corpus

We report evaluation results on the Kakao Corpus in Table 3. Since ELECTRA for Korean is unavailable, we only compare BERT and UMS on the two test splits. Clean shows better results than Web with respect to all metrics regardless of whether UMS is used. This might be because Clean contains fewer grammatical errors and typos, which interfere with accurate understanding of the context. Also, UMS significantly improves performance over the baseline on both splits; specifically, it achieves absolute improvements of 5.1% and 6.8% in P@1 on Web and Clean, respectively.

5.2 Adversarial Experiment


                      Original        Adversarial
Approach   Model      R10@1  MRR      R10@1  MRR
Baselines  BERT       0.820  0.887    0.199  0.561
Baselines  BERT+      0.862  0.915    0.203  0.573
Baselines  ELECTRA    0.826  0.890    0.304  0.614
Baselines  ELECTRA+   0.861  0.914    0.329  0.636
Baselines  Avg        0.842  0.902    0.259  0.596
UMS        BERT       0.843  0.902    0.310  0.622
UMS        BERT+      0.875  0.923    0.363  0.656
UMS        ELECTRA    0.854  0.910    0.397  0.668
UMS        ELECTRA+   0.875  0.922    0.437  0.692
UMS        Avg        0.862  0.914    0.377  0.660
Table 4: Adversarial experimental results on Ubuntu Corpus. All models are evaluated using R10@1 and MRR metrics.

Even though BERT-based models have shown state-of-the-art performance on the response selection task, it is unclear whether these models are trained to predict the next utterance properly. Inspired by Jia and Liang (2017) and Yuan et al. (2019), we design an adversarial experiment to investigate this. First, we train the models using the original training set, then evaluate them on either the original or the adversarial test set. To construct the adversarial test set, we randomly extract an utterance from the dialog context and use it to replace one of the negative responses among the candidates (see Figure 1). In the adversarial test set, assuming there are $n$ candidates per conversation, a set of candidates consists of a ground truth, an extracted utterance from the dialog context, and $n-2$ negative responses. The extracted utterance is not deleted from the original dialog since it can be crucial for selecting the optimal response.
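The adversarial candidate construction can be sketched as follows. The helper name `make_adversarial_candidates` and the uniform sampling of both the replaced negative and the extracted utterance are assumptions made for illustration.

```python
import random

def make_adversarial_candidates(context, candidates, labels, rng):
    """Replace one negative candidate with an utterance copied from the context.

    The context itself is left untouched, and the copied utterance keeps the
    negative label of the candidate it replaces.
    """
    negative_positions = [i for i, y in enumerate(labels) if y == 0]
    swap_at = rng.choice(negative_positions)
    new_candidates = list(candidates)
    new_candidates[swap_at] = rng.choice(context)
    return new_candidates, swap_at
```

A model that ranks the candidate at `swap_at` first is choosing an utterance that is related to the dialog but cannot be its next turn.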

Table 4 reports the experimental results of the BERT(+) and ELECTRA(+) models. We compare the models without UMS and those with UMS, denoted as baselines and UMS, respectively. Even though performance drops significantly on the adversarial set regardless of whether UMS is used, we observe that UMS models decline less than the baselines. To be specific, the $R_{10}@1$ score decreases by 58 and 48 percentage points on average for baselines and UMS, respectively (0.842 to 0.259 and 0.862 to 0.377). It is also encouraging that UMS shows an absolute improvement of 12% in $R_{10}@1$ on the adversarial set, compared to the 2% improvement on the original set (see Table 4). In addition, while the performance of the baselines on the adversarial set tends to drop as training progresses, that of UMS tends to increase significantly. Hence, it is reasonable to assume that UMS is robust to adversarial examples and good at in-domain classification.

Figure 3: $R_{10}@1$ comparison of the adversarial example for each model. A lower $R_{10}@1$ means that the model is good at predicting the next utterance (ground truth).


#  Auxiliary Tasks     R10@1  R10@2  R10@5  MRR
1  None                0.826  0.908  0.978  0.890
2  INS                 0.836  0.917  0.980  0.897
3  DEL                 0.848  0.924  0.983  0.905
4  SRCH                0.834  0.915  0.981  0.896
5  INS + DEL           0.853  0.927  0.984  0.909
6  INS + SRCH          0.841  0.920  0.982  0.901
7  DEL + SRCH          0.852  0.927  0.983  0.908
8  INS + DEL + SRCH    0.854  0.929  0.984  0.910
Table 5: Ablation Study on Ubuntu Corpus. We choose ELECTRA as the baseline in this analysis. INS, DEL, and SRCH denote that the model is trained with utterance insertion, deletion, and search, respectively.

Figure 4: t-SNE embeddings of UMS output representations for each special token in UMS (i.e., [INS], [DEL], and [SRCH]). All embeddings are sampled from test sets of each dataset. Orange and blue denote the target and other remaining tokens for each auxiliary task.

Figure 3 describes how often each model ranks the adversarial example (i.e., a randomly sampled utterance from the conversation) as the most likely response. While BERT- and ELECTRA-based models show similar performance on the original set, ELECTRA-based models outperform BERT-based models by significant margins (a gap of 10%) on the adversarial set regardless of whether they are trained from post-trained checkpoints. For example, BERT+ and ELECTRA show opposite patterns depending on the test set (original: BERT+ > ELECTRA, adversarial: BERT+ < ELECTRA). We have two perspectives on these results. 1) Next sentence prediction in BERT overfits the model to predict a semantically relevant sentence rather than the next sentence. 2) Since ELECTRA is trained through replaced token detection, in which the model learns to discriminate between real input tokens and replacements generated by a small masked language model, it is more effective in representing contextual information from the sequence.

5.3 Ablation Study

We performed ablation studies on the Ubuntu Corpus to investigate which auxiliary tasks are more crucial for response selection. As shown in Table 5, we explore the impact of each auxiliary task by evaluating all possible subsets of the tasks. Based on the observations of using only one auxiliary task (i.e., rows 3 > 2 > 4) and two tasks (i.e., rows 5 > 7 > 6), we obtain the ordering DEL > INS > SRCH with respect to the importance of each manipulation strategy. Since the DEL input sequence contains an utterance that is irrelevant to the original dialog context, it may be more advantageous than INS and SRCH for learning to distinguish dialog consistency and coherence. We obtain the best results when all the auxiliary tasks are trained simultaneously with the response selection criterion.

5.4 Visualization

As shown in Figure 4, we visualize the output representations of special tokens learned by our proposed UMS through t-SNE embeddings. Scatter plots colored in orange represent target tokens (i.e., [ins], [del], and [srch] in Section 3.3) and those in blue represent the rest of the tokens. All representations are extracted from the test sets of three datasets (Ubuntu, Douban, and E-Commerce) in this analysis. Overall, the results show that UMS effectively learns dialog coherence for all datasets. In the case of the Ubuntu dataset, the insertion and search tasks tend to be less clearly clustered than on the other two datasets. Since many utterances in the Ubuntu dataset consist largely of technical terminology, which may cause structural ambiguity, the tasks constructed within the same dialog are difficult to perform. On the contrary, the model can easily learn discourse structure on open-domain datasets such as Douban and E-Commerce.

6 Conclusion

In this paper, we pointed out the limitations of existing works based on pre-trained language models such as BERT in retrieval-based multi-turn dialog systems. To address these, we proposed highly effective utterance manipulation strategies (UMS) for multi-turn response selection. The UMS are applied in a fully self-supervised manner and can be easily incorporated into existing models. We obtained new state-of-the-art results on multiple public benchmark datasets (i.e., Ubuntu, Douban, and E-Commerce) and significantly improved results on a Korean open-domain dialog corpus. For future work, we plan to develop a response selection model that is more robust to adversarial examples by designing various adversarial objectives.


References

  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.
  • Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu (2020) Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • J. Gu, T. Li, Q. Liu, Z. Ling, Z. Su, S. Wei, and X. Zhu (2020) Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2014) Convolutional neural network architectures for matching natural language sentences. In Advances in Neural Information Processing Systems, pp. 2042–2050.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2020) Poly-encoders: architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2021–2031.
  • R. Kadlec, M. Schmid, and J. Kleindienst (2015) Improved deep learning baselines for Ubuntu corpus dialogs. arXiv preprint arXiv:1510.03753.
  • D. Lee, M. Shin, T. Whang, S. Cho, B. Ko, D. Lee, E. Kim, and J. Jo (2020) Reference and document aware semantic evaluation methods for Korean language summarization. arXiv preprint arXiv:2005.03510.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294.
  • J. Lu, X. Ren, Y. Ren, A. Liu, and Z. Xu (2020) Improving contextual language models for response retrieval in multi-turn conversation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1805–1808.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019a) Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 267–275.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019b) One time of interaction may not be enough: go deep with an interaction-over-interaction network for response selection in dialogues. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1–11.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, pp. 5998–6008.
  • S. Wan, Y. Lan, J. Xu, J. Guo, L. Pang, and X. Cheng (2016) Match-SRNN: modeling the recursive matching structure with spatial RNN. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 2922–2928.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Proceedings of the Advances in Neural Information Processing Systems, pp. 3266–3280.
  • H. Wang, Z. Lu, H. Li, and E. Chen (2013) A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 935–945.
  • M. Wang, Z. Lu, H. Li, and Q. Liu (2015) Syntax-based deep matching of short texts. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pp. 1354–1361.
  • S. Wang and J. Jiang (2016) Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1442–1451.
  • T. Whang, D. Lee, C. Lee, K. Yang, D. Oh, and H. Lim (2019) An effective domain adaptive post-training method for BERT in response selection. arXiv preprint arXiv:1908.04812.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 496–505.
  • H. Xu, B. Liu, L. Shu, and P. S. Yu (2019) BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2324–2335.
  • Z. Xu, B. Liu, B. Wang, C. Sun, and X. Wang (2017) Incorporating loose-structured knowledge into conversation modeling via recall-gate LSTM. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3506–3513.
  • R. Yan, Y. Song, and H. Wu (2016) Learning to respond with deep neural networks for retrieval-based human-computer conversation system. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55–64.
  • C. Yuan, W. Zhou, M. Li, S. Lv, F. Zhu, J. Han, and S. Hu (2019) Multi-hop selector network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 111–120.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752.
  • X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan (2016) Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 372–381.
  • X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu (2018) Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1118–1127.