Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning

Retrieve-based dialogue response selection aims to find a proper response from a candidate set given a multi-turn context. Pre-trained language models (PLMs) based methods have yielded significant improvements on this task. The sequence representation plays a key role in the learning of matching degree between the dialogue context and the response. However, we observe that different context-response pairs sharing the same context always have a greater similarity in the sequence representations calculated by PLMs, which makes it hard to distinguish positive responses from negative ones. Motivated by this, we propose a novel Fine-Grained Contrastive (FGC) learning method for the response selection task based on PLMs. This FGC learning strategy helps PLMs to generate more distinguishable matching representations of each dialogue at fine grains, and further make better predictions on choosing positive responses. Empirical studies on two benchmark datasets demonstrate that the proposed FGC learning method can generally and significantly improve the model performance of existing PLM-based matching models.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

12/29/2020

Dialogue Response Selection with Hierarchical Curriculum Learning

We study the learning of a matching model for dialogue response selectio...
08/10/2018

Lingke: A Fine-grained Multi-turn Chatbot for Customer Service

Traditional chatbots usually need a mass of human dialogue data, especia...
09/22/2021

FCM: A Fine-grained Comparison Model for Multi-turn Dialogue Reasoning

Despite the success of neural dialogue systems in achieving high perform...
09/14/2021

Challenging Instances are Worth Learning: Generating Valuable Negative Samples for Response Selection Training

Retrieval-based chatbot selects the appropriate response from candidates...
04/06/2020

Grayscale Data Construction and Multi-Level Ranking Objective for Dialogue Response Selection

Response selection plays a vital role in building retrieval-based conver...
09/10/2021

An Evaluation Dataset and Strategy for Building Robust Multi-turn Response Selection Model

Multi-turn response selection models have recently shown comparable perf...
12/14/2019

Survivor: A Fine-Grained Intrusion Response and Recovery Approach for Commodity Operating Systems

Despite the deployment of preventive security mechanisms to protect the ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Building an intelligent conversational agent that can naturally and consistently converse with humans has drawn considerable attention in the field of natural language processing (NLP). So far, there are two kinds of approaches for implementing a dialogue system: generation-based

Serban et al. (2016); Vinyals and Le (2015) and retrieval-based methods Lowe et al. (2015); Wu et al. (2017); Zhou et al. (2018). In this work, we focus on the problem of multi-turn response selection for retrieval-based dialogue systems.

Multi-turn response selection is the task of predicting the most proper response using a retrieval model by measuring the matching degree between a multi-turn dialogue context and a set of response candidates. Most recently, pre-trained language models (PLMs) have achieved substantial performance improvements in multi-turn response selection Lu et al. (2020); Gu et al. (2020); Humeau et al. (2020). PLM-based models take a pair of context and response as a whole and represent them as a single sequence representation (i.e., context embedding at the position of the [CLS] token). This sequence representation is also regarded as a matching representation, which is further used to decide a score indicating the matching degree between the dialogue context and the response. By further post-training PLMs with in-domain data and auxiliary self-supervised tasks Whang et al. (2020, 2021a); Xu et al. (2021), PLMs are incorporated with in-domain knowledge and achieve state-of-the-art results on benchmarks.

Figure 1:

Although a standard contrastive learning can provide better uniformity on the vector space, it is still hard to distinguish between positive (yellow) and negative (blue) dialogues with the same context (in the same shape, e.g., star).

Despite the success of PLM-based matching models and their various variants, recent studies reveal that the contextualized word and sentence representations in PLM-based models are anisotropic, i.e., representations occupy a narrow cone in the vector space, which largely limits the expressiveness of representations Ethayarajh (2019); Li et al. (2020). Contrastive learning provides a way to solve this problem, which uniforms representations over a hypersphere space as pointed out by Wang and Isola (2020). Although employing the standard constrative learning method Chen et al. (2020); Fang and Xie (2020) with in-batch negative sampling enhances the uniformity of representations, we further observe by experiments (Section 6.3) that the matching representation of two dialogues with the same context but different responses are too similar, as is shown in Figure 1. This is mainly due to the following two reasons: (1) The overlap of the same context tokens in different context-response pairs makes the matching representation highly similar since the representation is aggregated from all tokens in the context-response pair. (2) The in-batch negative samples are highly likely to be different in both context and response. This phenomenon makes the representations less distinguishable and makes it hard to tell dialogues with positive responses from negative ones.

To address the aforementioned issues, we propose a Fine-Grained Contrastive learning (FGC) approach to fine-tune matching representations for the response selection task. FGC introduces the connections between each dialogue pair with the same context and different responses into the training process with contrastive learning. In contrast to the standard contrastive learning method, which takes every other dialogue as negative examples, our proposed FGC takes context and response as separate parts and focuses on distinguishing between positive and negative dialogues with the same context. During fine-grained contrastive learning, each dialogue is converted into an augmented dialogue via rule-based transformation on the response utterance. Each dialogue is asked to be close to its augmentation, while the one with a positive response should be far away from the one with a negative response. FGC works totally in a self-supervised way that no additional supervision is required besides the classification label used for response selection training.

We conduct experiments on two response selection benchmarks: the Ubuntu Dialogue Corpus Lowe et al. (2015) and the Douban Corpus Wu et al. (2017)

. These two datasets have a large variety of both topics and languages. Moreover, our proposed learning framework is independent of the choice of PLMs-based matching models. Therefore, for a comprehensive evaluation, we test FGC with five representative PLMs-based matching models, including the state-of-the-art one brought by self-supervised learning. Our empirical results demonstrate that FGC is able to consistently improve PLMs by up to 3.2% absolute improvement with an average of 1.7% absolute improvement in terms of R

@1. Besides, We also compare our method with standard-contrastive-learning-enhanced PLMs, which demonstrates the effectiveness of our proposed fine-grained contrastive objective.

In summary, our contributions in the paper are three-fold:

  • We propose FGC, a novel fine-grained contrastive learning method, which helps generate better representations of dialogues and improves the response selection task.

  • FGC shows good generality of effectiveness with various pre-trained language models for enhancing performance.

  • Experimental results on two benchmark datasets demonstrate that FGC can significantly improve the performance of various strong PLM-based matching models, including state-of-the-art ones.

2 Related Work

2.1 Multi-Turn Response Selection

Earlier studies on retrieval-based response selection focused on single-turn response selection Wang et al. (2013); Hu et al. (2015); Wang et al. (2015). Recently, researchers have paid more attention to the multi-turn response selection. The dual LSTM model Lowe et al. (2015) was proposed to match the dialog context and response using a dual RNN-based architecture. Zhou et al. (2016) proposed the multi-view model that encodes dialogue context and response both on the word-level and utterance-level. However, these methods ignore relationships among utterances by simply concatenating the utterances together or converting the whole context to a vector. To alleviate this, Wu et al. (2017) proposed the sequential matching network that each utterance in the context first interacts with the response candidate, and then the matching features are aggregated according to the sequential order of multi-turn context. With the rise of the self-attention mechanism Vaswani et al. (2017), some studies Zhou et al. (2018); Tao et al. (2019) explored how to enhance representations with it. Besides, Yuan et al. (2019) recently revealed that previous methods construct dialogue contextual representation using too much context, which damages the matching effect. They proposed a multi-hop selector to select the relevant utterances in the dialogue context for response matching.

Most recently, pre-trained language models (e.g., BERT Devlin et al. (2019) and RoBERTa Liu et al. (2020)) have shown an impressive performance in the response selection. The post-training method, which helps transfer the representations of BERT from the general domain to the dialogue domain, was proposed by Whang et al. (2020) and obtained state-of-the-art results. Subsequent researches Gu et al. (2020); Lu et al. (2020) focused on incorporating speaker information into BERT and showed its effectiveness in multi-turn response selection. Further, self-supervised learning contributed to the success of pre-trained language models was also applied in several NLP downstream tasks, such as summarization Wang et al. (2019) and the response generation Zhao et al. (2020). In the task of response selection, Whang et al. (2021a) and Xu et al. (2021) indicated that incorporating well-designed self-supervised tasks according to the characteristics of the dialogue data into BERT fine-tuning can help with the multi-turn response selection. Han et al. (2021) proposed a fine-grained post-training method for enhancing the pre-trained language model, while the post-training process is computationally expensive than fine-tuning a classification model. Su et al. (2020) proposed a hierarchical curriculum learning framework for improving response selection with PLMs.

2.2 Contrastive Learning for NLP

There have been several investigations for contrastive loss formulations for deep neural models, primarily in the computer vision domain.

Oord et al. (2018)

proposed a framework for contrastive learning to learn visual representations based on contrastive predictive coding, which predicts the features in latent space by using powerful autoregressive models.

Khosla et al. (2020) investigated supervised contrastive learning, allowing to leverage label information effectively. Following this trend, some researchers verified the effectiveness of constructive learning in specific NLP tasks. Fang and Xie (2020) proposed pre-training language representation models with a contrastive self-supervised learning objective at the sentence level, outperforming previous methods on a subset of GLUE tasks. Gunel et al. (2021) combined the cross-entropy with a supervised contrastive learning objective, showing improvements over fine-tuning RoBERTa-Large on multiple datasets of the GLUE benchmark. Our work differs from previous works in that we do not directly make contrast on one dialogue with all the other dialogues, as the granularity of negative samples constructed using this approach is too coarse to provide sufficient discrimination with the positive ones.

3 Background

3.1 Task Formalization

The response selection task is to select the best candidate to respond a given multi-turn dialogue context from a pool of candidate responses. Suppose that we have a dataset , where is a multi-turn dialogue context with turns, denotes a candidate response, and denotes a label with indicating a proper response for and otherwise

. Our goal is to estimate a matching model

from . For any given context-response pair , returns a score that reflects the matching degree between and .

3.2 Pre-trained Language Model for Response Selection

As a trend in these years, pre-trained language models, e.g., BERTDevlin et al. (2019), have been widely studied and adapted into numerous NLP tasks, showing several state-of-the-art results. Dialogue response selection is one of them.

Applying a pre-trained language model into response selection usually involves two steps. The first step is to make domain-adaptive post-training, which continues to train a standard pre-trained language model with a domain-specific corpus. This step helps to transfer the original pre-trained language model into the target domain.

The second step is to fine-tune the post-trained model with the response selection task. Given a context where is the t-th turn of the dialog context, and a response , the model is asked to predict a score to represent the matching degree between and . To achieve this, a special token [EOT] is added at the end of each turn to distinguish them in the context . Utterances from both the context and response are concatenated with separator [EOT] and [SEP] between them. Taking as input, BERT returns a sequence of vectors with the same length as . The output of the first place is an aggregated representation vector that holds the information of interaction between context and response . A relevance score is computed based on and optimized through a binary classification loss.

(1)

where and are parameters

4 Methodology

4.1 Overview

Figure 2: FGC contains two objectives, i.e., IVC and CVC. IVC pushes away dialogues with the same context but different responses (icons in the same shape), while dialogues that belong to different categories may still be similar. CVC further solves this problem by pulling all dialogues into two distinguishable clusters.

In this paper, we propose the Fine-Grained Contrastive Learning method (FGC) for learning PLMs-based matching models. It consists of two complementary contrastive objectives: (1) an instance-view contrastive objective (IVC); and (2) a category-view contrastive objective (CVC). Figure 2 demonstrates the joint effects of the two contrastive objectives on the space of matching representations. The IVC objective pushes away dialogue instances with the same context and different responses, making the model easier to distinguish between positive and negative responses. However, only pushing the examples with the same context away increases the risk of instances with different contexts getting closer in the representation space. As a remedy, the CVC objective further pulls all context-response pairs into two distinguishable clusters in the matching space according to whether the pair is positive or not. These two objectives are introduced in 4.3 and 4.4 respectively. For simplicity, we take BERT as an example in the following sections in the following sections.

4.2 Dialogue Data Augmentation

Data augmentation takes an important role in contrastive learning Zoph et al. (2020); Ho and Vasconcelos (2020). Similar to standard contrastive learning (e.g., CERT), the first step of FGC is to create augmentations for every context-response pair. Given a context-response pair, we take an augmentation method on the response to generate an augmented response. The context and augmented response pair form the augmentation of the original context-response pair. In order to fine-grained control the difference between a dialogue and its corresponding augmentation and easily perform augmentation on various languages, a fully unsupervised rule-based utterance augmentation method is adopted for utterance augmentation. Inspired by Wei and Zou (2019), we adopt three types of augmentation operations:

  • Random deletion

    : Each token in the utterance is randomly and independently deleted with a probability

    .

  • Random swaping: Each token in the utterance is randomly swapped with another token in the utterance with a probability .

  • Synonym replacing: Randomly replace a non-stop-word token to one of its synonyms with a probability .

Given a response utterance and an augmentation strength , we randomly pick out one of these three augmentation methods and then apply the augmentation on the utterance with the probability being . After augmentation, the response is converted into another augmented response . The augmentation strength is a hyper-parameter that controls how much difference is there between and .

4.3 Instance-View Contrastive Objective

Figure 3: An overview of IVC learning. The input is a dialogue context and a pair of positive and negative responses (, ). Both responses are augmented with a rule-based utterance augmentation model to form a new pair (, ). We concatenate the context with four responses and fed them into the BERT encoder, which outputs a projection vector for each dialogue. IVC aims to maximize the dissimilarity of between positive examples and negative examples, as well as maintains high cohesion within positive and negative cases.

The instance-view contrastive (IVC) objective aims at introducing more discrepancy between a pair of dialogues with the same context and positive/negative responses. Feeding a dialogue into BERT, BERT helps to make internal interactions by attention mechanism and generate latent vectors representing the dialogue. The output vector of the [CLS] position stands for a aggregated sequence representation of both context and response. We also take this vector as the dialogue matching representation used for contrastive learning. Moreover, we apply another projection layer to convert into a smaller vector . This projection is made through an MLP with one hidden layer. Through this projection, each coherent dialogue with positive responses is transformed into a projection vector , and each incoherent dialogue is transformed into . The augmentations of the positive and negative dialogues are also converted into two vectors, i.e., and . Here and indicates the item belongs to the positive class or the negative class, and the bar indicates this item comes from an augmented example.

As illustrated by Ethayarajh (2019) and Li et al. (2020), the embedding vectors of different utterances are distributed in a narrow cone of the vector space, showing less distinguishability. This phenomenon is even worse when two utterances are semantically similar, e.g., two dialogues sharing the same context. Thus, we leverage the IVC objective on these projection vectors to distinguish between positive and negative responses given the same context. IVC objective regards the projection vector as a representation of response given context . This loss is applied on the projection vector , which helps to maximize the similarity between a response with its augmentation given the same context, as well as minimize the similarity between each positive response and negative response pair. The maximum and minimum are achieved as a set of pair-wise comparisons, i.e.,

(2)

Here we use the NT-Xent Loss Chen et al. (2020)

to model the similarities of projection vectors. By writing this pair-wise comparison into a loss function, the IVC loss is formulated as

(3)

where is an adjustable scalar temperature parameter that controls the separation of positive and negative classes; ranges from ; and is the total number of dialogues.

Notice that the IVC objective aims to separate the representation of positive and negative responses given the same context, so that we do not take all other in-batch examples as negative examples in the same way as in standard contrastive learning.

4.4 Category-View Contrastive Objective

The IVC objective ensures a high difference between dialogues with the same context, while it cannot guarantee that the learned representations are suitable for classification. The representations of a positive dialogue may be close to the representation of another negative dialogue with a different context, as is shown in Figure 2. Thus, we introduce another category-view contrastive (CVC) objective into model training. The category-view contrastive objective aims at bunching dialogues that belong to the same category into a cluster and separate these two clusters.

The CVC objective is applied between dialogues from the two classes. It captures the similarity of projection vectors of the same class and contrasts them with projection vectors from the other class, i.e.,

(4)

This category-view contrastive loss works with a batch of representation vectors of size , where the number of both positive examples and negative examples is . Denote to be all representation vectors in a batch, where are representation vectors for positive dialogues and their augmentations, and are representation vectors for negative dialogues and their representations. The CVC objective works as an additional restriction to punish the high similarity between positive-negative pairs and low similarity within all positive and negative dialogues. The following formulas give this loss:

(5)

.

Finally, the BERT model is fine-tuned with the standard response selection loss and both IVC and CVC loss. A weighted summation is computed as

(6)

where is a hyper-parameter that controls the balance between response selection loss and contrastive loss. The model is optimized by minimizing the overall loss value.

5 Experiments

5.1 Dataset

  • Ubuntu Dialogue Corpus V1
    The Ubuntu Dialogue Corpus V1 Lowe et al. (2015) is a domain-specific multi-turn conversation dataset. Conversations in this dataset are dumped from the multi-party chat room whose topic is the Ubuntu operating system. We conducted experiments on a preprocessed data released by Xu et al. (2019), in which numbers, URLs, and system paths are masked by placeholders. Negative responses for each dialogue are randomly sampled from other dialogues.

  • Douban Corpus
    The Douban CorpusWu et al. (2017) is a Chinese dataset collected from an online social network website named Douban. Douban Corpus is an open-domain conversation corpus, whose topic is much wider than that of Ubuntu Corpus.

The statistics of these two datasets are shown in Table 1. These two datasets vary greatly in both language and topic. Following previous works, we take

s as evaluation metrics, which measures the probability of having the positive response in the top

ranked responses. We take for model evaluation.

Dataset Ubuntu Douban
Train Val Test Train Val Test
# dialogues 1M 500K 500K 1M 50K 6670
#pos:#neg 1:1 1:9 1:9 1:1 1:1 1.2:8.8
# avg turns 10.13 10.11 10.11 6.69 6.75 6.45
Table 1: Statistics of two datasets.

5.2 Baseline Methods

We introduce FGC into several open-sourced PLM-based models, including BERT and ELECTRA. We also test the effectiveness of FGC on variants of BERT model, including BERT-small (H=4, L=4, H=512), BERT with domain-adaptive post training named BERT-DPT 

Whang et al. (2020), and BERT with self-supervised tasks named BERT-UMS Whang et al. (2021b). Several non-PLM-based models are also compared with our proposed FGC. 222A pre-trained Chinese BERT-Small is not available, thus we do not conduct experiments on it.

Models Ubuntu Douban
R@1 R@2 R@5 MAP MRR P@1 R@1 R@2 R@5
non-PLM-based methods
Multi-View Zhou et al. (2016) 0.662 0.801 0.951 0.505 0.543 0.342 0.292 0.350 0.729
SMN Wu et al. (2017) 0.726 0.847 0.961 0.529 0.569 0.397 0.233 0.396 0.724
DUA Zhang et al. (2018) 0.752 0.868 0.961 0.551 0.599 0.421 0.243 0.421 0.780
DAM Zhou et al. (2018) 0.767 0.874 0.961 0.550 0.601 0.427 0.254 0.410 0.757
MRFN Tao et al. (2019) 0.786 0.886 0.976 0.571 0.617 0.448 0.276 0.435 0.783
IoI Tao et al. (2019) 0.796 0.894 0.974 0.573 0.621 0.444 0.269 0.451 0.786
IMN Gu et al. (2019) 0.794 0.889 0.974 0.576 0.618 0.441 0.268 0.458 0.796
MSN Yuan et al. (2019) 0.800 0.899 0.978 0.587 0.632 0.470 0.295 0.452 0.788
PLM-based Methods
BERT 0.820 0.906 0.978 0.597 0.634 0.448 0.279 0.489 0.823
BERT+FGC 0.829 0.910 0.980 0.614 0.653 0.495 0.312 0.495 0.850
BERT-DPT Whang et al. (2020) 0.862 0.935 0.987 0.609 0.645 0.463 0.290 0.505 0.838
BERT-DPT+FGC 0.881 0.945 0.990 0.620 0.660 0.495 0.322 0.495 0.850
BERT-UMS Whang et al. (2021b) 0.875 0.942 0.988 0.625 0.664 0.499 0.318 0.482 0.858
BERT-UMS+FGC 0.886 0.948 0.990 0.627 0.670 0.500 0.326 0.512 0.869
ELECTRA 0.826 0.908 0.978 0.602 0.642 0.465 0.287 0.483 0.839
ELECTRA+FGC 0.832 0.912 0.980 0.625 0.668 0.499 0.313 0.502 0.850
BERT-Small 0.792 0.888 0.972 N/A N/A N/A N/A N/A N/A
BERT-Small+FGC 0.800 0.890 0.974 N/A N/A N/A N/A N/A N/A
Table 2: Evaluation results on the two data sets. Numbers in bold indicate that the PLM-based models using FGC outperforms the original models with a significance level -value .

5.3 Implementation Details

All models are implemented based on Pytorch and Huggingface’s implementation. Each PLM model is trained for 5 epochs with a learning rate beginning from 3e-6 to 0 with a linear learning rate decay. Our model is trained with 8 Nvidia Tesla A100 GPUs, which have 40GB of memory for each of them. For more training details, please refer to Appendix

A.

5.4 Experimental Results

The comparison between PLMs and FGC-enhanced PLMs is shown in Table 2. It can be seen from the table that all PLM-based methods outperform non-PLM-based methods. By adding our proposed FGC into PLM-based models, the performance of all models is significantly improved. The maximum improvement of a standard-sized BERT for the two datasets are 1.9% and 3.2% respectively in terms of R@1. The average performance improvement also achieves 1.1% and 2.2%. Besides, our proposed method can also enhance the current state-of-the-art method BERT-UMS by 1.1% and 0.8% on two datasets in terms of R@1. In addition to a standard-sized BERT model, we also find an absolute gain of 0.9% by adding FGC on the BERT-Small model, which is about 10 smaller than a standard one. The success of these two datasets demonstrates the effectiveness of our proposed FGC across different models, languages, and dialogue topics on multi-turn response selection.

FGC separates representation vectors of dialogues into different latent spaces according to their type of relevance between contexts and responses. On the one hand, IVC helps distinguish between positive and negative responses given the same context. On the other hand, CVC separates representations of dialogues from two categories so that these representations can have better distinguishability. As a result, the matching representation of context-response pairs for positive and negative responses are forced to stay away from each other. This better representation ensures higher accuracy in selecting the positive response given a candidate set of responses.

6 Closer Analysis

We conduct closer analysis with BERT-DPT since combining post-training and fine-tuning is the most popular manner of applying BERT for down-streaming tasks. The Ubuntu Corpus is used in the following analysis.

6.1 Ablation Studies

Strategy R@1 R@2 R@5
BERT-DPT + FGC 0.881 0.944 0.990
 - IVC 0.866 0.935 0.986
 - CVC 0.877 0.941 0.988
BERT-DPT 0.862 0.935 0.987
Table 3: Ablation Analysis on the Ubuntu corpus.

As we add two contrastive learning objectives into training for response selection, we test the gain of each objective. The results are shown in Table 3. It can be observed from the table that both IVC and CVC can enhance the performance on response selection, with an absolute improvement of 1.4% and 0.4% respectively in terms of R@1. By applying these two contrastive objectives, we obtain an absolute improvement of 1.9% based on the post-trained BERT model. Both of the two contrastive objectives share the same purpose of separating the representation of dialogues with positive and negative responses, and thus there is a performance overlap by adding these two objectives.

6.2 Sensitive Analysis

Temperature

Temperature

works as a hyperparameter that controls the punishment on the degree of separation of positive and negative classes. A smaller

gives more power to pull away dialogues from different classes. We test how this hyperparameter can influence the response selection performance. We test in the range of on FGC and the results are shown in Table 4. FGC achieves the best performance when is set to be 0.5, while the performance drops given a smaller or a bigger . A suitable can provide a proper differentiation that is neither too strong nor too weak, keeping a balance between contrastive and response selection objectives.

Temperature R@1 R@2 R@5
BERT-DPT 0.862 0.935 0.987
 + FGC (=0.1) 0.872 0.939 0.990
 + FGC (=0.5) 0.881 0.944 0.990
 + FGC (=1.0) 0.876 0.938 0.990
Table 4: Influence of temperature in FGC.

Utterance Augmentation Strength

Utterance augmentation plays an important role in contrastive learning. A dialogue with a context and a positive response is drawn closer to its augmentation while pushed far away from the dialogue with the same context but a negative response. The strength of utterance augmentation decides the boundary of each cluster. We conduct experiments to test how augmentation strength can influence response selection accuracy. We range the augmentation strength from , and the testing results are shown in Table 5. It achieves the best performance when equals 0.2. Augmentation strength being either too large or too small may harm the clustering. On the one hand, a too-large brings too much noise into the clustering process, which blurs the boundary between positive and negative examples. On the other hand, a too-small cannot provide enough variation to the utterance, which harms the generalization of identifying positive and negative responses.

Augment Strength R@1 R@2 R@5
BERT-DPT 0.862 0.935 0.987
 + FGC (=0.1) 0.874 0.938 0.989
 + FGC (=0.2) 0.881 0.944 0.990
 + FGC (=0.5) 0.872 0.935 0.990
Table 5: Influence of utterance augmentation strength in FGC.

6.3 Discussion

Compare with Standard Contrastive Learning

The main difference between our proposed FGC and standard contrastive learning (e.g., CERT Fang and Xie (2020) and SimCSE Gao et al. (2021)) is that we only take dialogues with the same context but different responses as negative examples, instead of using in-batch examples as negative ones. We compare FGC with those methods, whose results are shown in Table 6. Standard contrastive learning can bring less gain (or even harm) on the response selection task, while contrastive learning with fine-grained negative examples leads to a significant gain on this task.

Contrastive Method R@1 R@2 R@5
BERT-DPT 0.862 0.935 0.987
BERT-DPT + CERT 0.855 0.931 0.985
BERT-DPT + SimCSE 0.864 0.936 0.987
BERT-DPT + FGC 0.881 0.944 0.990
Table 6: Influence of utterance augmentation strength in FGC.

Similarity between Dialogues

Strategy Ins Sim Cat Sim
BERT-DPT +0.074 +0.064
 + IVC -0.178 -0.015
 + CVC +0.052 -0.109
BERT-DPT + FGC -0.111 -0.131
Table 7: Similarity Analysis on the Ubuntu corpus.

The goal of FGC is to enlarge distances between dialogue examples with the same context and different responses. To estimate how effective this target is achieved, we compute two average cosine similarities: (1) instance-level similarity, which is the average similarity between dialogue pairs with the same context but different responses; and (2) category-level similarity, which is the average similarity between all positive dialogues and negative dialogues. As can be seen from Table

7, both similarities are lowered from a positive value indicating positive correlation into a negative value indicating negative correlation by adding FGC. By introducing better distinguishability into dialogue representations, our proposed FGC helps to make better response predictions effectively. Though these two similarities can also be lowered by adding IVC alone, the category similarity is not small enough to separate the two categories well. This shortcoming is compensated by further applying CVC as an additional training objective. Besides, CVC alone can neither provide a sufficiently low level of instance-level similarity that separates dialogues with the same context.

Effect of Data Augmentation Alone

Data augmentation, working as a kind of data noise, shows a positive effect on training models with robustness in natural language processing. One may concern that can data augmentation alone help with the response selection task. We conducted experiments with data augmentation alone, i.e., no contrastive learning strategy is included. The results are shown in Table 8. It can be observed from the table that data augmentation alone cannot enhance the model but even harm the accuracy significantly. Data augmentation methods should work with fine-grained contrastive learning to make positive effects for the multi-turn response selection task.

Ubuntu Douban
BERT-DPT 0.862 0.290
 +Aug 0.837 (-2.5%) 0.278 (-1.2%)
BERT-UMS 0.875 0.318
 +Aug 0.851 (-2.4%) 0.292 (-2.6%)
Table 8: Model performance with data augmentation alone.

7 Conclusion

In this paper, we propose FGC, a fine-grained contrastive learning method, which helps to improve the multi-turn response selection task with PLM-based models. FGC consists of an instance-view contrastive (IVC) objective that helps to differentiate positive response and negative response with the same context, and a category-view contrastive (CVC) objective that separate positive dialogues and negative dialogues into two distinguishable clusters. Experiments and analysis on two benchmark datasets and five PLM-based models demonstrates the effectiveness of FGC to significantly improve the performance of multi-turn dialogue response selection.

References

  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607. Cited by: §1, §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §2.1, §3.2.
  • K. Ethayarajh (2019) How contextual are contextualized word representations? comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 55–65. External Links: Link, Document Cited by: §1, §4.3.
  • H. Fang and P. Xie (2020) Cert: contrastive self-supervised learning for language understanding. CoRR. Cited by: §1, §2.2, §6.3.
  • T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821. Cited by: §6.3.
  • J. Gu, T. Li, Q. Liu, Z. Ling, Z. Su, S. Wei, and X. Zhu (2020) Speaker-aware bert for multi-turn response selection in retrieval-based chatbots. In CIKM, pp. 2041–2044. Cited by: §1, §2.1.
  • J. Gu, Z. Ling, and Q. Liu (2019) Interactive matching network for multi-turn response selection in retrieval-based chatbots. In CIKM, pp. 2321–2324. Cited by: Table 2.
  • B. Gunel, J. Du, A. Conneau, and V. Stoyanov (2021) Supervised contrastive learning for pre-trained language model fine-tuning. ICLR. Cited by: §2.2.
  • J. Han, T. Hong, B. Kim, Y. Ko, and J. Seo (2021) Fine-grained post-training for improving retrieval-based dialogue systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1549–1558. Cited by: §2.1.
  • C. Ho and N. Vasconcelos (2020) Contrastive learning with adversarial examples. NeurIPS. Cited by: §4.2.
  • B. Hu, Z. Lu, H. Li, and Q. Chen (2015) Convolutional neural network architectures for matching natural language sentences. NIPS. Cited by: §2.1.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2020) Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. ICLR. Cited by: §1.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. NeurIPS. Cited by: §2.2.
  • B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li (2020) On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 9119–9130. External Links: Link, Document Cited by: §1, §4.3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2020) Roberta: a robustly optimized bert pretraining approach. ICLR. Cited by: §2.1.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. ICLR. Cited by: Appendix A.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. Cited by: §1, §1, §2.1, 1st item.
  • J. Lu, X. Ren, Y. Ren, A. Liu, and Z. Xu (2020) Improving contextual language models for response retrieval in multi-turn conversation. In SigIR, pp. 1805–1808. Cited by: §1, §2.1.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.2.
  • I. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau (2016)

    Building end-to-end dialogue systems using generative hierarchical neural network models

    .
    In AAAI, Vol. 30. Cited by: §1.
  • Y. Su, D. Cai, Q. Zhou, Z. Lin, S. Baker, Y. Cao, S. Shi, N. Collier, and Y. Wang (2020) Dialogue response selection with hierarchical curriculum learning. arXiv preprint arXiv:2012.14756. Cited by: §2.1.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019) One time of interaction may not be enough: go deep with an interaction-over-interaction network for response selection in dialogues. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1–11. External Links: Link, Document Cited by: Table 2.
  • C. Tao, W. Wu, C. Xu, W. Hu, D. Zhao, and R. Yan (2019) Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In WSDM, pp. 267–275. Cited by: §2.1, Table 2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. NIPS. Cited by: §2.1.
  • O. Vinyals and Q. Le (2015) A neural conversational model. arXiv preprint arXiv:1506.05869. Cited by: §1.
  • H. Wang, Z. Lu, H. Li, and E. Chen (2013) A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 935–945. External Links: Link Cited by: §2.1.
  • H. Wang, X. Wang, W. Xiong, M. Yu, X. Guo, S. Chang, and W. Y. Wang (2019)

    Self-supervised learning for contextualized extractive summarization

    .
    In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2221–2227. External Links: Link, Document Cited by: §2.1.
  • M. Wang, Z. Lu, H. Li, and Q. Liu (2015) Syntax-based deep matching of short texts. IJCAI. Cited by: §2.1.
  • T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In

    International Conference on Machine Learning

    ,
    pp. 9929–9939. Cited by: §1.
  • J. Wei and K. Zou (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6382–6388. External Links: Link, Document Cited by: §4.2.
  • T. Whang, D. Lee, C. Lee, K. Yang, D. Oh, and H. Lim (2020) An effective domain adaptive post-training method for bert in response selection. In Proc. Interspeech 2020, Cited by: Appendix A, §1, §2.1, §5.2, Table 2.
  • T. Whang, D. Lee, D. Oh, C. Lee, K. Han, D. Lee, and S. Lee (2021a) Do response selection models really know what’s next? utterance manipulation strategies for multi-turn response selection. Cited by: §1, §2.1.
  • T. Whang, D. Lee, D. Oh, C. Lee, K. Han, D. Lee, and S. Lee (2021b) Do response selection models really know what’s next? utterance manipulation strategies for multi-turn response selection. In AAAI, Cited by: §5.2, Table 2.
  • Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 496–505. External Links: Link, Document Cited by: §1, §1, §2.1, 2nd item, Table 2.
  • H. Xu, B. Liu, L. Shu, and P. Yu (2019)

    BERT post-training for review reading comprehension and aspect-based sentiment analysis

    .
    In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2324–2335. External Links: Link, Document Cited by: 1st item.
  • R. Xu, C. Tao, D. Jiang, X. Zhao, D. Zhao, and R. Yan (2021) Learning an effective context-response matching model with self-supervised tasks for retrieval-based dialogues. Cited by: §1, §2.1.
  • C. Yuan, W. Zhou, M. Li, S. Lv, F. Zhu, J. Han, and S. Hu (2019) Multi-hop selector network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 111–120. External Links: Link, Document Cited by: §2.1, Table 2.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3740–3752. External Links: Link Cited by: Table 2.
  • Y. Zhao, C. Xu, and W. Wu (2020) Learning a simple and effective model for multi-turn response generation with auxiliary tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 3472–3483. External Links: Link, Document Cited by: §2.1.
  • X. Zhou, D. Dong, H. Wu, S. Zhao, D. Yu, H. Tian, X. Liu, and R. Yan (2016) Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 372–381. External Links: Link, Document Cited by: §2.1, Table 2.
  • X. Zhou, L. Li, D. Dong, Y. Liu, Y. Chen, W. X. Zhao, D. Yu, and H. Wu (2018) Multi-turn response selection for chatbots with deep attention matching network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1118–1127. External Links: Link, Document Cited by: §1, §2.1, Table 2.
  • B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le (2020) Rethinking pre-training and self-training. NeurIPS. Cited by: §4.2.

Appendix A More Implementation Details

Domain-adaptive Post-training

For domain-adaptive post-training, we take the same hyper-parameter settings as BERT-DPT Whang et al. (2020). Concretely, the maximum length of input dialogue is set to be 512. A full dialogue is randomly cut into a shorted token sequence with a probability of 10%. A masked language model loss and a next sentence prediction loss is optimized jointly during post-training. For the masked language model training, we masked each token with a probability of 15%. The post-training process traverses all the dialogues for 10 iterations, and the words that are masked during each iteration are independently sampled.

Fine-tuning for Response Selection

The model is fine-tuned with the response selection task. The projection layer for transforming [CLS] vectors into projection vectors is an MLP with one hidden layer with hidden size being 256. For dialogues longer than 512 (i.e. the maximum length supported by BERT), we discard the beginning of its context while keeps a complete response, as the latter part of the dialogue context may have stronger relevance with the response. We take an AdamW optimizerLoshchilov and Hutter (2019) with linear learning rate decay for fine-tuning. The initial learning rate is , and gradually decreases to 0 within 5 epochs. The for controlling the balance between response selection loss and contrastive loss is set to be 1.

All pre-trained language model checkpoints are downloaded from huggingface333https://huggingface.co/, with their names as the keys except for BERT-Small. For the BERT-Small model, the pre-trained model checkpoint is downloaded with model name “prajjwal1/bert-small”. Each model is trained by 3 times, and the best results among them are reported.