Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension

09/08/2021
by   Yiyang Li, et al.
Shanghai Jiao Tong University

Multi-party dialogue machine reading comprehension (MRC) poses a tremendous challenge because it involves multiple speakers in one dialogue, resulting in intricate speaker information flows and noisy dialogue contexts. To alleviate these difficulties, previous models focus on incorporating this information using complex graph-based modules and additional manually labeled data, which is usually rare in real scenarios. In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows and capture salient clues in a long dialogue. Experimental results on two benchmark datasets justify the effectiveness of our method over competitive baselines and current state-of-the-art models.

1 Introduction

Dialogue machine reading comprehension (MRC, Hermann et al., 2015) aims to teach machines to understand dialogue contexts so that they can solve multiple downstream tasks (Yang and Choi, 2019; Li et al., 2020; Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018). In this paper, we focus on question answering (QA) over dialogue, which tests the capability of a model to understand a dialogue by asking it questions about the dialogue context. QA over dialogue is more challenging than QA over plain text (Rajpurkar et al., 2016; Reddy et al., 2019; Yang and Choi, 2019) because conversations are full of informal, colloquial expressions and discontinuous semantics. Among these, multi-party dialogue poses an even greater challenge than two-party dialogue (Sun et al., 2019; Cui et al., 2020) since it involves multiple speakers in one dialogue, resulting in complicated discourse structure (Li et al., 2020) and intricate speaker information flows. Besides, Zhang et al. (2021) also pointed out that for long dialogue contexts, not all utterances contribute to the final answer prediction, since many of them are noisy and carry no useful information.

To illustrate the challenge of multi-party dialogue MRC, we extract a dialogue example from the FriendsQA dataset (Yang and Choi, 2019), which is shown in Figure 1.

Figure 1: Right part: A dialogue and its corresponding questions from FriendsQA, whose answers are marked with wavy lines. Left part: The speaker information flows of this dialogue.

This single dialogue involves four different speakers with intricate speaker information flows. The arrows here represent the direction of information flows, from senders to receivers. Let us consider the reasoning process of : a model should first notice that it is Rachel who had a dream and locate , then solve the coreference resolution problem that I refers to Rachel and you refers to Chandler. This coreference knowledge can only be obtained by considering the information flow from  to , which means Rachel speaks to Chandler.  follows a similar process: a model should be aware that  is a continuation of  and solve the above coreference resolution problem as well.

To tackle the aforementioned obstacles, we design a self-supervised speaker prediction task to implicitly model the speaker information flows, and a pseudo-self-supervised key-utterance prediction task to capture salient utterances in a long and noisy dialogue. In detail, the self-supervised speaker prediction task guides a carefully designed Speaker Information Decoupling Block (SIDB, introduced in Section 3.4) to decouple speaker-aware information, and the key-utterance prediction task guides a Key-utterance Information Decoupling Block (KIDB, introduced in Section 3.3) to decouple key-utterance-aware information. We finally fuse these two kinds of information and make a final span prediction to obtain the answer to a question.

To sum up, the main contributions of our method are threefold:


  • We design a novel self-supervised speaker prediction task to better capture the indispensable speaker information flows in multi-party dialogue. Compared to previous models, our method requires no additional manually labeled data which is usually rare in real scenarios.

  • We design a novel key-utterance prediction task to capture key-utterance information in a long dialogue context and filter noisy utterances.

  • Experimental results on two benchmark datasets show that our model outperforms strong baselines by a large margin and reaches results comparable to the current state-of-the-art models, even though they utilize additional labeled data.

2 Related work

2.1 Pre-trained Language Models

Recently, pre-trained language models (PrLMs) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), XLNet (Yang et al., 2019) and ELECTRA (Clark et al., 2020) have achieved remarkable results in learning universal natural language representations by pre-training large language models on massive general corpora and fine-tuning them on downstream tasks (Socher et al., 2013; Wang et al., 2018; Wang et al., 2019; Lai et al., 2017). We argue that the self-attention mechanism (Vaswani et al., 2017) in PrLMs is in essence a variant of the Graph Attention Network (GAT, Veličković et al., 2017), which has an intrinsic capability of exchanging information. Compared with vanilla GAT, a Transformer block consisting of residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) is more stable in training. Hence, it is chosen as the basic architecture of our SIDB (Section 3.4) and KIDB (Section 3.3) instead of vanilla GAT.
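As a concrete illustration of the block structure referred to above, the following is a minimal PyTorch sketch of a post-LN Transformer encoder layer: multi-head self-attention and a feed-forward sublayer, each wrapped in a residual connection followed by layer normalization. This is a generic sketch, not the authors' implementation; the class name and hyper-parameter defaults are our own.

    import torch
    import torch.nn as nn
    from typing import Optional

    class TransformerBlock(nn.Module):
        """A standard post-LN Transformer encoder layer: multi-head self-attention
        and a position-wise feed-forward sublayer, each wrapped in a residual
        connection followed by layer normalization."""

        def __init__(self, d_model: int = 768, n_heads: int = 12,
                     d_ff: int = 3072, dropout: float = 0.1):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads,
                                              dropout=dropout, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                     nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x: torch.Tensor,
                    attn_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
            # Self-attention sublayer with residual connection and layer norm.
            attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)
            x = self.norm1(x + self.dropout(attn_out))
            # Feed-forward sublayer, again with residual connection and layer norm.
            x = self.norm2(x + self.dropout(self.ffn(x)))
            return x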

2.2 Multi-party Dialogue Modeling

Several previous works study multi-party dialogue modeling on different downstream tasks such as response selection and dialogue emotion recognition. Hu et al. (2019) utilize the "response-to" (@) labels and a Graph Neural Network (GNN) to explicitly model the speaker information flows. Wang et al. (2020) design a pre-training task named Topic Prediction to equip PrLMs with the ability to track parallel topics in a multi-party dialogue. Jia et al. (2020) make use of an additional labeled dataset to train a dependency parser, then utilize the parser to disentangle parallel threads in multi-party dialogues. Ghosal et al. (2019) propose a window-based heterogeneous Graph Convolutional Network (GCN) to model the emotion flow in multi-party dialogues.

2.3 Speaker Information Incorporation

In dialogue MRC, speaker information plays a significant role in comprehending the dialogue context. Among the latest studies, Liu et al. (2021) propose a Mask-based Decoupling-Fusing Network (MDFN) to decouple speaker information from dialogue contexts by adding inter-speaker and intra-speaker masks to the self-attention blocks of Transformer layers. However, their approach is restricted to two-party dialogue since they have to specify the sender and receiver roles of each utterance. Gu et al. (2020) propose Speaker-Aware BERT (SA-BERT) to capture speaker information by adding a speaker embedding at the token representation stage of the Transformer architecture, then pre-train the model using next sentence prediction (NSP) and masked language model (MLM) losses. Nonetheless, their speaker embedding lacks a well-designed pre-training task to refine it, resulting in inadequate speaker-specific information. Different from previous models, our model is suitable for the more challenging multi-party dialogue and is equipped with carefully designed tasks to better capture the speaker information.

3 Methodology

In this part, we formulate our task and present our proposed model, as shown in Figure 2. There are four main parts in our model: a shared Transformer encoder, a key-utterance information decoupling block, a speaker information decoupling block, and a final fusion-prediction layer. In the following sections, we introduce these modules in detail.

Figure 2: The overview of our model, which contains a shared Transformer encoder, a key-utterance information decoupling block, a speaker information decoupling block and a fusion-prediction layer. In the speaker information decoupling block, bi-directional arrows mean that information flows in both directions, while unidirectional arrows mean that information flows only from start nodes to end nodes.

3.1 Task Formulation

Let  be a dialogue context with  utterances. Each utterance consists of a speaker, specified by a name, and the sequence of words the speaker utters, and can be denoted as a -length sequence . Let a question corresponding to the dialogue context be , where  is the length of the question and each  is a token of the question. Given  and , a dialogue MRC model is required to find an answer  to the question, which is restricted to be a continuous span of the dialogue context. In some datasets,  can be an empty string, indicating that there is no answer to the question according to the dialogue context.
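For readers who prefer code, here is a minimal sketch of the data structures implied by this formulation. The class and field names are ours, not from any released code, and storing the answer span as character offsets is our assumption.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class Utterance:
        speaker: str   # speaker name, e.g. "Rachel Green"
        text: str      # the words this speaker utters

    @dataclass
    class DialogueMRCExample:
        utterances: List[Utterance]   # the multi-party dialogue context
        question: str                 # a question about the dialogue
        # (start, end) character offsets of the answer span within the dialogue,
        # or None when the question is unanswerable (as in Molweni).
        answer_span: Optional[Tuple[int, int]] = None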

3.2 Shared Transformer Encoder

To fully utilize the powerful representational ability of PrLMs, we employ a pack-and-separate method following Zhang et al. (2021), which takes advantage of the deep Transformer blocks to let the context and question better interact with each other. We first pack the context and question into a joint input to feed into the Transformer blocks, then separate them according to their positions for further interaction.

Given the dialogue context and a corresponding question , we pack them to form a sequence  = {[CLS] [SEP] : [SEP] : [SEP]}, where [CLS] and [SEP] are two special tokens and each : pair is the name and utterance of a speaker separated by a colon. This sequence is then fed into  layers of Transformer blocks to obtain its contextualized representation , where  is the length of the sequence after tokenization by the Byte-Pair Encoding (BPE) tokenizer (Sennrich et al., 2016) and  is the hidden dimension of the Transformer block. Here  is the total number of Transformer layers specified by the type of the PrLM, and  is a hyper-parameter denoting the number of decoupling layers.
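A hedged sketch of the packing step using the Hugging Face Transformers library is shown below. The checkpoint name, the convention of placing the question before the dialogue, and the helper name are our assumptions; the paper does not prescribe them.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("google/electra-large-discriminator")
    encoder = AutoModel.from_pretrained("google/electra-large-discriminator")

    def pack_and_encode(utterances, question, max_length=384):
        """Pack question and dialogue into one sequence of the form
        [CLS] question [SEP] speaker_1: utterance_1 [SEP] ... speaker_n: utterance_n [SEP]
        and return the contextualized token representations."""
        sep = tokenizer.sep_token
        context = f" {sep} ".join(f"{speaker}: {text}" for speaker, text in utterances)
        inputs = tokenizer(question, context, truncation=True,
                           max_length=max_length, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
        # [SEP] positions are later gathered to initialize utterance/speaker nodes.
        sep_positions = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero(as_tuple=True)[0]
        return hidden, sep_positions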

3.3 Key-utterance Information Decoupling Block

Given the contextualized representation  from Section 3.2, following Zhang et al. (2021), we gather the representations of the [SEP] tokens from  as the representations of the utterances in the dialogue context. These representations are used to initialize utterance nodes and a question node, as illustrated in the middle-upper part of Figure 2. The representations of normal tokens are gathered as token nodes , where  is the number of normal tokens in the dialogue context. Then, another  layers of multi-head self-attention Transformer blocks are used to exchange information within and across the three types of nodes:

(1)

Here , , , are matrices with trainable weights, is the number of attention heads and denotes the concatenation operation.

After stacking  layers of multi-head self-attention to fully exchange information between these nodes, we get a question representation , the utterance representations , and the token representations .  is then paired with each  to conduct the key-utterance prediction task. In detail, we use the heuristic matching mechanism proposed by Mou et al. (2016) to calculate the matching score between the question representation and each utterance representation. Here we define a matching function , where , as follows:

(2)

Here  denotes element-wise multiplication and  is a vector with trainable weights. The activation function, chosen according to the downstream loss function, produces a probability distribution. In span-based dialogue MRC datasets, we set the pseudo-self-supervised key-utterance target based on the position of the answer span. We call it pseudo-self-supervised since it is generated from the original span labels but requires no additional labeled data. Specifically, we set , where  is the index of the utterance that contains the answer span. Then we calculate the key-utterance distribution by:

(3)

 is later expanded to the length of the token nodes to get , which will be forwarded to filter noisy utterances in the fusion-prediction layer (introduced in Section 3.5). We adopt cross-entropy loss to compute the loss of this task:

(4)

The gradient of will flow backwards to refine the representations of the utterance nodes so that they can decouple key-utterance-aware information from the original representations. After the interaction between token nodes and utterance nodes, the token nodes will gather key-utterance-aware information from the utterance nodes. Therefore, we denote the token representations as key-utterance-aware: , which will be forwarded to the fusion-prediction layer described in Section 3.5.
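To make the decoupling block's training signal concrete, here is a rough PyTorch sketch of the heuristic matching score and the pseudo-self-supervised key-utterance loss described above. It assumes the Mou et al. (2016) feature layout [q; u; q - u; q * u] and a softmax-plus-cross-entropy objective over utterances; module and function names are ours.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HeuristicMatcher(nn.Module):
        """Scores an utterance node against the question node with the heuristic
        matching features of Mou et al. (2016): [q; u; q - u; q * u]."""

        def __init__(self, hidden_dim: int):
            super().__init__()
            # A single trainable vector maps the 4*d matching features to a scalar.
            self.scorer = nn.Linear(4 * hidden_dim, 1)

        def forward(self, question: torch.Tensor, utterances: torch.Tensor) -> torch.Tensor:
            # question: (d,), utterances: (num_utt, d)
            q = question.unsqueeze(0).expand_as(utterances)
            feats = torch.cat([q, utterances, q - utterances, q * utterances], dim=-1)
            return self.scorer(feats).squeeze(-1)   # (num_utt,) unnormalized scores

    def key_utterance_loss(scores: torch.Tensor, answer_utt_index: int) -> torch.Tensor:
        """Pseudo-self-supervised target: the index of the utterance that contains
        the answer span; softmax over utterances plus cross-entropy."""
        target = torch.tensor([answer_utt_index])
        return F.cross_entropy(scores.unsqueeze(0), target)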

3.4 Speaker Information Decoupling Block

This part is the core of our model, which contributes to modeling the complex speaker information flows. In this section, we first introduce the proposed self-supervised speaker prediction task, then describe the decoupling process of speaker information.

3.4.1 Self-supervised Speaker Prediction

As defined in Section 3.1, we have a dialogue context  where each utterance consists of a speaker specified by a name. We randomly choose an utterance and mask its speaker name. Then, for every pair  where , the model should determine whether the two utterances are uttered by the same speaker, that is, whether .

We consider this task a relatively difficult one, since it requires the model to have a thorough understanding of the speaker information flows and to solve problems such as coreference resolution. Figure 3 is an example of the self-supervised speaker prediction task, where the speaker of the utterance in gray is masked. We humans can determine that the masked speaker should be Emily Waltham by considering that Ross and Monica are persuading Emily to attend the wedding by showing her the wedding place, and when Monica and Emily reach there, it should be Emily who is surprised to say "Oh My God". However, it is not that easy for machines to capture these information flows.

Figure 3: An example of the speaker prediction task, which involves three speakers. Scene here is a narrative description which introduces some additional information about the scene.
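Because the labels for this task come from the dialogue itself, constructing them is straightforward. The following sketch masks one randomly chosen utterance's speaker and derives binary same-speaker targets for the remaining utterances; the function name and sampling strategy are our assumptions.

    import random
    from typing import List, Tuple

    def build_speaker_prediction_labels(speakers: List[str]) -> Tuple[int, List[int]]:
        """Mask the speaker of one randomly chosen utterance and derive binary
        same-speaker targets for the remaining utterances. No manual annotation
        is needed: the labels come from the dialogue itself."""
        k = random.randrange(len(speakers))    # index of the utterance to mask
        masked_speaker = speakers[k]
        targets = [1 if j != k and speakers[j] == masked_speaker else 0
                   for j in range(len(speakers))]
        return k, targets

    # Illustration only (speaker order loosely following the Figure 3 scene):
    # k, y = build_speaker_prediction_labels(["Ross", "Monica", "Emily", "Monica", "Emily"])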

3.4.2 Speaker Information Decoupling

To fully utilize the interactive feature of self-attention mechanism (Vaswani et al., 2017) and the powerful representational ability of PrLMs, we also use Transformer blocks to capture the interactive speaker information flows and fulfill this difficult task.

We first detach  from the computational graph to get , then, as in Section 3.3, the representations of the [SEP] tokens are gathered from  to initialize unmasked speaker nodes and a masked speaker node . The representations of normal tokens are gathered as token nodes . Then, we add an attention mask to the token nodes corresponding to the selected speaker name before they are forwarded into the speaker information decoupling block, as illustrated in the middle-lower part of Figure 2. The reasons why we use this detach-mask strategy are as follows. First, we mask the selected speaker before the speaker information decoupling block instead of at the very beginning before the encoder, since it is better to let the utterance decoupling block see all the speaker names. Based on this point, we detach  from the computational graph and add an attention mask to avoid target leakage. If we used a normal forward pass instead, the encoder would simply attend to the speaker names, which would hurt performance (discussed in detail in Section 5.3). Besides, this strategy also helps the model better decouple the key-utterance-aware and speaker-aware information from the original representations.

In detail, the mask strategy is similar to that of Liu et al. (2021). We modify Eq. (1) to:

(5)

Let the start and end indices of the masked speaker tokens be  and . To make the selected speaker name unseen to other nodes, the attention mask is obtained as follows:

(6)

By adding this mask, other nodes will not attend to the masked token nodes, thus preventing target leakage. In the meantime, the speaker nodes have to collect clues from other nodes through deep interaction to make the prediction, which implicitly models the complex speaker information flows.
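A rough sketch of the detach-mask strategy follows: an additive attention mask hides the masked speaker's name tokens from every other node, and the encoder output is detached before entering the block. The exact mask shape and whether the masked tokens may still attend outward are our assumptions; the text above only requires that other nodes cannot attend to them.

    import torch

    def build_speaker_attention_mask(seq_len: int, mask_start: int, mask_end: int) -> torch.Tensor:
        """Additive attention mask: the masked speaker's name tokens receive -inf
        as keys, so no other node can attend to them, preventing target leakage."""
        mask = torch.zeros(seq_len, seq_len)
        mask[:, mask_start:mask_end + 1] = float("-inf")
        # Let each masked token still attend to itself (an assumption; the paper
        # only requires that other nodes cannot attend to the masked tokens).
        idx = torch.arange(mask_start, mask_end + 1)
        mask[idx, idx] = 0.0
        return mask

    # Usage sketch: detach the shared-encoder output first, so gradients from the
    # speaker task cannot reach the encoder and leak speaker-name surface forms.
    # hidden_detached = hidden.detach()
    # out = decoupling_block(hidden_detached,
    #                        attn_mask=build_speaker_attention_mask(L, start, end))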

After stacking  layers of masked multi-head self-attention, we get a masked speaker representation , the normal speaker representations , and the token representations .

is then paired with each to conduct the self-supervised speaker prediction task. We also adopt the matching function defined in Eq. (2):

(7)

For convenience and without loss of generality, in the following description we set , which means we mask the speaker of the  utterance. We construct the self-supervised target by:

(8)

Then, binary cross-entropy loss is applied to compute the loss of this task:

(9)

The gradient of  will flow backwards to refine the representations of the speaker nodes so that they can decouple speaker-aware information from the original representations. After the interaction between token nodes and speaker nodes, the token nodes will gather speaker-aware information from the speaker nodes. Therefore, we denote the token representations as speaker-aware: , which will be forwarded to the fusion-prediction layer described in the next section.

3.5 Fusion-Prediction Layer

Given the key-utterance-aware token representations  and the speaker-aware token representations , we first fuse these two kinds of decoupled representations using the following transformation:

(10)

where  is a linear transformation matrix with trainable weights and Tanh is a non-linear activation function.
Then we compute the start and end distributions over the tokens by:

(11)

where  and  are vectors of size  with trainable weights,  is defined in Section 3.3, and  denotes element-wise multiplication.
Given the ground-truth label of the answer span , cross-entropy loss is adopted to train our model:

(12)

If the dataset contains unanswerable questions, the representation at the [CLS] position is used to predict whether a question is answerable or not:

(13)

where and are vectors of size with trainable weights.
Given the ground truth of answerability , binary cross-entropy loss is applied to compute the answerable loss:

(14)

The final loss is the summation of the above losses:

(15)
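The following sketch summarizes our reading of the fusion-prediction layer and the final objective: concatenate the two decoupled token representations, apply a linear layer with Tanh, score start/end positions weighted element-wise by the expanded key-utterance distribution, and sum all task losses. Layer names, shapes, and the exact way the key-utterance weights enter the distributions are our assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FusionPredictionLayer(nn.Module):
        """Fuses key-utterance-aware and speaker-aware token representations with a
        linear layer and Tanh, then produces start/end distributions whose token
        probabilities are damped by the expanded key-utterance distribution."""

        def __init__(self, hidden_dim: int):
            super().__init__()
            self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
            self.start_head = nn.Linear(hidden_dim, 1)
            self.end_head = nn.Linear(hidden_dim, 1)

        def forward(self, h_key, h_spk, key_utt_probs):
            # h_key, h_spk: (seq_len, d); key_utt_probs: (seq_len,), the key-utterance
            # distribution expanded to token positions (filters noisy utterances).
            h = torch.tanh(self.fuse(torch.cat([h_key, h_spk], dim=-1)))
            p_start = F.softmax(self.start_head(h).squeeze(-1), dim=-1) * key_utt_probs
            p_end = F.softmax(self.end_head(h).squeeze(-1), dim=-1) * key_utt_probs
            return p_start, p_end

    def total_loss(span_loss, key_utt_loss, speaker_loss, answerable_loss=None):
        """Final objective: the plain sum of the task losses; the answerable loss is
        added only for datasets with unanswerable questions (e.g. Molweni)."""
        loss = span_loss + key_utt_loss + speaker_loss
        if answerable_loss is not None:
            loss = loss + answerable_loss
        return loss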

4 Experiments

4.1 Benchmark Datasets

We adopt FriendsQA (Yang and Choi, 2019) and Molweni (Li et al., 2020), two span-based extractive dialogue MRC benchmarks, to evaluate the effectiveness of our model.
Molweni is derived from the large-scale multi-party dialogue dataset, the Ubuntu Chat Corpus (Lowe et al., 2015), whose main theme is technical discussion about problems on the Ubuntu system. The dataset is characterized by its informal speaking style and domain-specific technical terms. In total, it contains 10,000 dialogues whose average and maximum numbers of speakers are 3.51 and 9, respectively. Each dialogue is short, with average and maximum lengths of 104.4 and 208 tokens, respectively. Unanswerable questions are asked in this dataset, hence the answerable loss in Eq. (14) is applied. Additionally, this dataset is equipped with discourse parsing annotations, which, however, are not used by our model.
To evaluate our model more comprehensively, another open-domain dialogue MRC dataset, FriendsQA, is also used in our experiments. FriendsQA excerpts 1,222 scenes and 10,610 open-domain questions from the first four seasons of the well-known American TV show Friends to tackle dialogue MRC on everyday conversations. Each dialogue is longer and involves more speakers, resulting in more complicated speaker information flows compared to Molweni. For each dialogue context, at least 4 out of 6 types (5W1H) of questions are generated. This dataset is characterized by its colloquial language style, filled with sarcasm, metaphor, humor, etc.

4.2 Implementation Details

We implement our model based on the Transformers library (Wolf et al., 2020). The number of information decoupling layers  is chosen from 3 to 5 according to the type of the PrLM in our experiments. For Molweni, we set the batch size to 8, the learning rate to 1.2e-5, and the maximum input sequence length of the Transformer blocks to 384. For FriendsQA, they are 4, 4e-6, and 512, respectively. Note that in FriendsQA there are dialogue contexts whose lengths (in tokens) exceed 512. We split those contexts into pieces and choose the answer with the highest span probability as the final prediction.
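For contexts longer than the input limit, one way to implement the splitting described above is the overlapping-window support built into Hugging Face fast tokenizers; the stride value and checkpoint name below are our assumptions, not values reported by the paper.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/electra-large-discriminator")

    def split_long_context(question: str, context: str, max_length: int = 512, stride: int = 128):
        """Split a dialogue that exceeds the model's input limit into overlapping
        windows; at inference time the answer with the highest span probability
        across windows is kept as the final prediction."""
        return tokenizer(
            question, context,
            truncation="only_second",        # truncate only the dialogue, never the question
            max_length=max_length,
            stride=stride,                   # overlap between consecutive windows
            return_overflowing_tokens=True,  # one encoded feature per window
            return_offsets_mapping=True,     # map tokens back to character spans
            padding="max_length",
        )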

4.3 Baseline Models

For FriendsQA, we adopt BERT as the baseline model, following Li and Choi (2020) and Liu et al. (2020). For Molweni, we follow Li et al. (2021), who also employ BERT as the baseline model. In addition, we adopt ELECTRA (Clark et al., 2020) as a strong baseline on both datasets to see whether our method still helps on top of stronger PrLMs.

4.4 Results

Table 1 shows our experimental results on FriendsQA. BERT (Li and Choi, 2020) follows a pre-train-then-fine-tune paradigm: they first pre-train BERT on FriendsQA and additional transcripts from Seasons 5-10 of Friends using the well-designed pre-training tasks Utterance-level Masked LM (ULM) and Utterance Order Prediction (UOP), then fine-tune it on the dialogue MRC task. BERT (Liu et al., 2020) is a graph-based model that integrates relation knowledge and coreference knowledge using Relational Graph Convolutional Networks (R-GCNs) (Schlichtkrull et al., 2018). Note that this model utilizes additional labeled data on coreference resolution (Chen et al., 2017) and character relations (Yu et al., 2020). We adopt the same evaluation metrics as Li et al. (2020): exact match (EM) and F1 score. Our model reaches a new state-of-the-art (SOTA) result on the EM metric and a comparable result on the F1 metric, even without any additional labeled data. Besides, our model still gains a substantial performance improvement in the ELECTRA-based setting, which demonstrates its effectiveness on top of strong PrLMs.

 

Model
BERT
BERT (Li and Choi)
BERT (Liu et al.)
BERT
ELECTRA
ELECTRA

 

Table 1: Results on FriendsQA

Table 2 presents our experimental results on Molweni. Public Baseline is directly taken from the original Molweni paper (Li et al., 2020). DADgraph (Li et al., 2021) is the current SOTA model, which utilizes a Graph Convolutional Network (GCN) and the additional discourse annotations in Molweni to explicitly model the discourse structure. We see from the table that our model outperforms strong baselines and the current SOTA model by a large margin, even though we do not make use of the additional discourse annotations.

 

Model
BERT (Li et al.)
BERT
BERT (Li et al.)
BERT
ELECTRA
ELECTRA

 

Table 2: Results on Molweni

5 Analysis

5.1 Performance Gain Analysis

To get more detailed insights into our proposed method, we analyze the results on different question types of FriendsQA with the ELECTRA-based model. We also compare our model with the baseline model on these types to see where the performance gains come from. Table 3 shows the results of our model on different question types. Dist. denotes the distribution of each question type, from which we see that the question types of FriendsQA are nearly uniformly distributed.

Performance gains mainly come from the question types Who, When and What. We argue that the speaker information decoupling block is the predominant contributor to the Who question type, since answering this type of question requires the model to have a deep understanding of speaker information flows and to solve problems like coreference resolution, which is exactly what our self-supervised speaker prediction task demands. For the question type When, the key-utterance information decoupling block contributes the most. The answer to a When question usually comes from a scene-description utterance, hence grabbing key-utterance information helps answer this kind of question. Among these improvements, the question type Who benefits the most from our model, demonstrating the strong capability of the self-supervised speaker prediction task.

 

Type Dist.
Who
When
What
Where
Why
How

 

Table 3: Results on different question types, where up arrows represent performance gain and down arrows represent performance drop compared to the baseline model. Significant gains (greater than 3%) are marked as bold.

5.2 Ablation Study

We conduct an ablation study to see the contribution of each module. Table 4 shows the results. Here KIDB and SIDB are the abbreviations of Key-utterance Information Decoupling Block and Speaker Information Decoupling Block, respectively. We see from the results that both modules contribute to the performance gains of our final model. For FriendsQA, SIDB contributes more, and the opposite holds for Molweni. This is because dialogue contexts in FriendsQA tend to be long, involve more speakers and carry more complex speaker information flows. On the contrary, dialogue contexts in Molweni are short with fewer turns, and most of the questions can be answered by considering only one key utterance.

To further investigate the effectiveness of our self-supervised speaker prediction task, we design a SpeakerEmb model in which we replace the speaker-aware token representations with speaker representations. The speaker representations are obtained by simply gathering embeddings from a trainable embedding look-up table according to the name of the speaker. Experimental results show that it achieves only a slight performance gain compared to SIDB, demonstrating that simply adding speaker information is sub-optimal compared to implicitly modeling speaker information flows with our self-supervised speaker prediction task.

 

Model FriendsQA Molweni
EM F1 EM F1
Our Model
 w/o KIDB
 w/o SIDB
SpeakerEmb

 

Table 4: Results of Ablation Study

5.3 Influence of Detaching Operation

We conduct experiments to investigate the influence of the detaching operation mentioned in Section 3.4. As shown in Table 5, if we do not detach  from the original computation graph when performing the speaker prediction task, the prediction accuracy reaches 96.8% on the test set of FriendsQA, indicating obvious label leakage. Meanwhile, the EM and F1 scores drop to 54.5% and 70.8%, respectively. In contrast, our model reaches a speaker prediction accuracy of 80.8%, which demonstrates that the detaching operation effectively prevents label leakage.

5.4 Influence of Speaker and Utterance Numbers

Figure 4 illustrates the model performance with respect to the number of speakers and utterances on FriendsQA. At the beginning, the baseline model performs similarly to our model. However, as the number of speakers and utterances increases, a growing performance gap emerges between the baseline model and our model. This observation demonstrates that our SIDB and KIDB are better able to deal with more complex dialogue contexts involving larger numbers of speakers and utterances.

(a) Scores vs. Number of Speakers
(b) Scores vs. Number of Utterances
Figure 4: Influence of Speaker and Utterance Numbers

 

Model
Our Model
 w/o Detaching 54.5 70.8 96.8

 

Table 5: Influence of Detaching Operation

5.5 Case Study

To get more intuitive explanations of our model, we select two cases from FriendsQA in which the baseline model fails to answer (F1 = 0, i.e., no overlap with the gold answer) while our model answers with an exact match.

Figure 5: Two cases from FriendsQA

Figure 5 illustrates two cases where the context of the first one is shown in Figure 1.

In the first case, the baseline model simply predicts that "you and I" were in Rachel's dream, while failing to notice that "you" here refers to Chandler. On the contrary, our model is able to capture this information since it helps the speaker prediction task. In fact, if we mask Rachel in , our model can tell that the masked speaker is Rachel, indicating that it knows it should be Rachel who had a dream and that  is a response to .

Similar observations can be made in the second case. The baseline model simply matches the semantic meaning of the question against the context and then makes a wrong prediction. Compared with the baseline model, our model is able to catch the information flow from Rachel to Monica and thus predicts the answer correctly.

6 Conclusion

In this paper, for multi-party dialogue MRC, we propose two novel self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to capture salient clues in a long and noisy dialogue. Experimental results on two multi-party dialogue MRC benchmarks, FriendsQA and Molweni, have justified the effectiveness of our model.

References

  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. External Links: 1607.06450 Cited by: §2.1.
  • H. Y. Chen, E. Zhou, and J. D. Choi (2017) Robust coreference resolution and entity linking on dialogues: character identification on tv show transcripts. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 216–225. External Links: Link Cited by: §4.4.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In ICLR, External Links: Link Cited by: §2.1, §4.3.
  • L. Cui, Y. Wu, S. Liu, Y. Zhang, and M. Zhou (2020) MuTual: a dataset for multi-turn dialogue reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1406–1416. External Links: Link Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. External Links: Link Cited by: §2.1.
  • D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh (2019) DialogueGCN: a graph convolutional neural network for emotion recognition in conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 154–164. External Links: Link Cited by: §2.2.
  • J. Gu, T. Li, Q. Liu, Z. Ling, Z. Su, S. Wei, and X. Zhu (2020) Speaker-aware bert for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 2041–2044. External Links: Link Cited by: §2.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), External Links: Link Cited by: §2.1.
  • K. M. Hermann, T. Kočiskỳ, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pp. 1693–1701. External Links: Link Cited by: §1.
  • W. Hu, Z. Chan, B. Liu, D. Zhao, J. Ma, and R. Yan (2019) GSN: a graph-structured network for multi-party dialogues. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 5010–5016. External Links: Link Cited by: §2.2.
  • Q. Jia, Y. Liu, S. Ren, K. Zhu, and H. Tang (2020) Multi-turn response selection using dialogue dependency relations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1911–1920. External Links: Link Cited by: §2.2.
  • G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017) RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794. External Links: Link Cited by: §2.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, External Links: Link Cited by: §2.1.
  • C. Li and J. D. Choi (2020) Transformers to learn hierarchical contexts in multiparty dialogue for span-based question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5709–5714. External Links: Link Cited by: §4.3, §4.4, Table 1.
  • J. Li, M. Liu, M. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and B. Qin (2020) Molweni: a challenge multiparty dialogues-based machine reading comprehension dataset with discourse structure. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2642–2652. External Links: Link Cited by: §1, §4.1, §4.4, §4.4, Table 2.
  • J. Li, M. Liu, Z. Zheng, H. Zhang, B. Qin, M. Kan, and T. Liu (2021) DADgraph: a discourse-aware dialogue graph neural network for multiparty dialogue machine reading comprehension. arXiv preprint arXiv:2104.12377. External Links: Link Cited by: §4.3, §4.4, Table 2.
  • J. Liu, D. Sui, K. Liu, and J. Zhao (2020) Graph-based knowledge integration for question answering over dialogue. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 2425–2435. External Links: Link Cited by: §4.3, §4.4, Table 1.
  • L. Liu, Z. Zhang, H. Zhao, X. Zhou, and X. Zhou (2021) Filling the gap of utterance-aware and speaker-aware representation for multi-turn dialogue. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), External Links: Link Cited by: §2.3, §3.4.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. External Links: Link Cited by: §2.1.
  • R. Lowe, N. Pow, I. V. Serban, and J. Pineau (2015) The ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 285–294. External Links: Link Cited by: §1, §4.1.
  • L. Mou, R. Men, G. Li, Y. Xu, L. Zhang, R. Yan, and Z. Jin (2016) Natural language inference by tree-based convolution and heuristic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 130–136. External Links: Link Cited by: §3.3.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. External Links: Link Cited by: §1.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: a conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266. External Links: Link Cited by: §1.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European semantic web conference, pp. 593–607. External Links: Link Cited by: §4.4.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. External Links: Link Cited by: §3.2.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. External Links: Link Cited by: §2.1.
  • K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, and C. Cardie (2019) DREAM: a challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7, pp. 217–231. External Links: Link Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010. External Links: Link Cited by: §2.1, §3.4.2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. External Links: Link Cited by: §2.1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32. External Links: Link Cited by: §2.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. External Links: Link Cited by: §2.1.
  • W. Wang, S. C. Hoi, and S. Joty (2020) Response selection for multi-party conversations with dynamic topic tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6581–6591. External Links: Link Cited by: §2.2.
  • T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer, et al. (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. External Links: Link Cited by: §4.2.
  • Y. Wu, W. Wu, C. Xing, M. Zhou, and Z. Li (2017) Sequential matching network: a new architecture for multi-turn response selection in retrieval-based chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 496–505. External Links: Link Cited by: §1.
  • Z. Yang and J. D. Choi (2019) FriendsQA: open-domain question answering on tv show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 188–197. External Links: Link Cited by: §1, §1, §4.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32, pp. 5753–5763. External Links: Link Cited by: §2.1.
  • D. Yu, K. Sun, C. Cardie, and D. Yu (2020) Dialogue-based relation extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4927–4940. External Links: Link Cited by: §4.4.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3740–3752. External Links: Link Cited by: §1.
  • Z. Zhang, J. Li, and H. Zhao (2021) Multi-turn dialogue reading comprehension with pivot turns and knowledge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1161–1173. External Links: Link Cited by: §1, §3.2, §3.3.