Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension

by Xinbei Ma, et al.
Shanghai Jiao Tong University

Multi-party multi-turn dialogue comprehension poses unprecedented challenges in handling complicated scenarios with multiple speakers and criss-crossed discourse relationships among speaker-aware utterances. Most existing methods treat dialogue contexts as plain text and pay insufficient attention to the crucial speaker-aware clues. In this work, we propose an enhanced speaker-aware model with masking attention and heterogeneous graph networks to comprehensively capture discourse clues from both the speaker property side and the speaker-aware relationship side. With such comprehensive speaker-aware modeling, experimental results show that our model achieves state-of-the-art performance on the benchmark dataset Molweni. Case analysis shows that our model enhances the connections between utterances and their own speakers and captures the speaker-aware discourse relations, which are critical for dialogue modeling.




1 Introduction

Training models to understand dialogue contexts and answer questions has been shown to be even more challenging than common machine reading comprehension (MRC) tasks on plain text Reddy et al. (2019); Choi et al. (2018). In this paper, we focus on the challenging multi-party multi-turn dialogue MRC, whose given passage consists of multiple utterances produced by three or more speaker roles Li et al. (2020). Compared to two-party dialogues Lowe et al. (2015); Wu et al. (2016); Zhang et al. (2018), multi-party multi-turn dialogues present much more complex scenarios. First, every speaker role has its own speaking manner and purposes, which leads to a unique speaking style for each speaker; the speaker property of each utterance therefore provides unique clues Liu et al. (2021); Gu et al. (2020). Second, instead of speaking in rotation as in two-party dialogues, speakers in multi-party dialogues take turns in a random order, which breaks the continuity found in common non-dialogue texts through the crossing dependencies that are commonplace in a multi-party chat. Third, multiple dialogue threads between any two or more speakers may be tangled within one dialogue history, so interrelations between utterances are far more flexible and are not limited to adjacent utterances. Consequently, multi-party dialogues exhibit discourse dependency relations between non-adjacent utterances, which leads to a graphical discourse structure Shi and Huang (2019); Li et al. (2020).

benkong2: also i did a sudo chown -r and also got permission denied
Dr_Willis: swapfile drive ? you mean a swap partition ?
benkong2: no a drive to share files with the rest of the network.
Dr_Willis: ok a ’share ’ EMOJI is what ya mean . lol.. for samba ?
NickGarvey: could you toss the commands and out put on pastebin ?
benkong2: error is : “ chown : changing ownership of FILEPATH operation not permitted ”
smo: for vfat filesystems , the permissions are dictated by the mount options , not chmod

Question 1: What is the permission dictated by ?
Answer: by the mount options
Start Position: 471
Question 2: What system does Nick use? Answer: (unanswerable)
Table 1: An example of multi-party multi-turn dialogue reading comprehension with one answerable question and one unanswerable question. Speaker names are underlined for highlight, and different colors indicate different speakers.

To demonstrate the challenge of the multi-party multi-turn dialogue MRC, we present an example in Table 1, which is from the multi-party multi-turn dialogue benchmark dataset Molweni Li et al. (2020). Figure 1 depicts the corresponding speaker-aware discourse structure of the example dialogue, with different colors indicating different speakers. In this dialogue, there are four speakers, whose conversation develops as Dr_Willis, NickGarvey and smo help benkong2 with a system error. Along with the context, there are two relevant questions expected to be answered. An extractive span is given as an answer of question 1, while question 2 is unanswerable only based on this dialogue.

Figure 1: Speaker-aware discourse structure of the example dialogue in Table 1.

The mainstream work on machine reading comprehension over multi-party multi-turn dialogues commonly adopts a pre-trained language model (PrLM) Devlin et al. (2019) as an encoder to represent the dialogue contexts coarsely, taking the paired dialogue passage and question as a whole Qu et al. (2019); Gu et al. (2020); Li et al. (2020). Recent research on modeling speaker-aware information for dialogue MRC has proved effective Gu et al. (2020); Liu et al. (2021). However, attention to key speaker role clues remains mismatched in two respects:

First, the attention paid to speaker role information is insufficient. The random and complicated speaker transitions of multi-party dialogues need to be disentangled and represented explicitly.

Second, complex speaker role transitions lead to a sophisticated discourse structure, yet little attention is paid to the structure and interrelations among utterances, even though the discourse relationships among utterances can effectively embody speaker-aware clues from different perspectives.

In this work, we propose an enhanced speaker-aware model to comprehensively capture speaker-aware clues. In detail, to explicitly model the speaker role information, we employ extended disentanglement modules: 1) To capture the overall speaker-aware information in the entire dialogue, we apply a masking-based multi-head attention method to each utterance on the basis of speaker roles. 2) We build two graph networks to model both annotated and unannotated discourse relations among utterances. These speaker-aware representations are fused and fed into a span-extraction layer to generate a reasonable answer to the question.

Experimental results show that the proposed strategy helps our model gain substantial performance improvements over strong baselines and achieve new state-of-the-art performance on the Molweni Li et al. (2020) benchmark.

2 Background and Related Work

2.1 Dialogue Reading Comprehension

Research on dialogue MRC aims to teach machines to read dialogue contexts and make responses Reddy et al. (2019); Choi et al. (2018); Sun et al. (2019); Cui et al. (2020), with the common application of building intelligent human-computer interactive systems Chen et al. (2017); Shum et al. (2018); Li et al. (2017); Zhu et al. (2018b). Training machines to understand dialogue has been shown to be much more challenging than common MRC, as every utterance in a dialogue carries the additional property of a speaker role, which breaks the continuity found in common non-dialogue texts through the complex discourse dependencies caused by speaker role transitions Afantenos et al. (2015); Shi and Huang (2019); Li et al. (2020).

Early studies mainly focus on the matching between the dialogue contexts and the questions Huang et al. (2018); Zhu et al. (2018a). As PrLMs have proved to be useful contextualized encoders with impressive performance, a general approach is to employ a PrLM to handle the whole input text of a dialogue context and a question as a linear sequence of successive tokens, where contextualized information is captured through self-attention Qu et al. (2019); Liu et al. (2020); Li et al. (2020). Such modeling is suboptimal for capturing the high-level relationships between utterances in the dialogue history.

To leverage speaker-aware information for better performance, Gu et al. (2020) proposed Speaker-aware BERT for two-party dialogue tasks, reorganizing utterances according to the spoken-from and spoken-to speakers and adding a speaker embedding at the token representation stage. Liu et al. (2021) went further with the speaker property, designing a decoupling and fusing network to enhance the turn order and speaker of each utterance. Both show that the speaker property is helpful for dialogue MRC. However, existing studies mostly work on the retrieval-based response selection task, and on two-party datasets or datasets without speaker annotations, which motivates us to extend to the QA task in the multi-party scenario.

In this work, we focus on the QA task of multi-party multi-turn dialogue MRC, which involves more than two speakers in a given dialogue passage Li et al. (2020) and expects an answer for each relevant question. Different from existing speaker-aware works, we regard discourse relations as a reflection of speaker transition information, and thus leverage these complex relations to model speaker-aware information comprehensively.

Figure 2: Overview of our model.

2.2 Discourse Structure Modeling

Discourse parsing focuses on the discourse structure and relationships of texts, whose aim is to predict the relations between discourse units and to discover the discourse structure between those units. Discourse structure has shown benefits to a wide range of NLP tasks, including MRC on multi-party multi-turn dialogue Asher et al. (2016); Xu et al. (2021); Ouyang et al. (2021); Takanobu et al. (2018); Gao et al. (2020); Jia et al. (2020).

In addition to the concerned discourse parsing on dialogue-related NLP tasks, most existing studies on linguistics-motivated discourse parsing are based on two annotated datasets, the Penn Discourse TreeBank (PDTB) Prasad et al. (2008) and the Rhetorical Structure Theory Discourse TreeBank (RST-DT) Mann and Thompson (1988). PDTB focuses on shallow discourse relations but ignores the overall discourse structure Qin et al. (2017); Cai and Zhao (2017); Bai and Zhao (2018); Yang and Li (2018). In contrast, RST is constituency-based, where related adjacent discourse units are merged to form larger units recursively Braud et al. (2017); Wang et al. (2017); Yu et al. (2018); Joty et al. (2015); Li et al. (2016); Liu and Lapata (2017). However, RST only discovers relations between neighboring discourse units, which is not suitable for our concerned multi-party dialogues.

In this work, we use discourse parsing as an application-motivated technique for the dialogue MRC task. Our task relies on the dependency-based structures where dependency relations may appear between any two adjacent or non-adjacent utterances which may be presented by the same speaker Shi and Huang (2019); Li et al. (2020).

Compared to the existing works mentioned above, ours is distinguished in that: 1) we leverage speaker-aware information comprehensively for better performance; 2) we are among the pioneers in modeling the speaker-aware discourse structure as graphs in dialogue MRC, tackling the discourse tangle caused by speaker role transitions; 3) we are the first to study the general MRC task in the multi-party multi-turn dialogue scenario with enhanced speaker-aware clues.

3 Methodology

Here we present our enhanced speaker-aware model, shown in Figure 2, which enhances speaker-aware information through three extended modules. Our model contains a PrLM for encoding; three modules for disentangling the complicated speaker-aware information, namely Speaker Masking, Speaker Graph, and Discourse Graph; and a span extraction layer for generating a final answer from the fused representations. In this section, we formulate the task and introduce every part of our model in detail.

3.1 Task Formulation

Suppose we conduct MRC on a multi-party multi-turn dialogue context C, which consists of n utterances and can be represented as C = {U_1, U_2, ..., U_n}. Each utterance U_i consists of the name identity S_i of the speaker and a sentence u_i spoken by that speaker, denoted by U_i = {S_i, u_i}, where the sentence u_i can be denoted as an l_i-length sequence of words, u_i = {w_1, w_2, ..., w_{l_i}}. According to this multi-party multi-turn dialogue context, a question Q is put forward, and for this question the model is expected to find a span from the dialogue context as a correct answer, or to decide that the question cannot be answered based on the provided dialogue context alone.
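To make the formulation concrete, here is a minimal sketch of the task's data structures in Python. The variable names and the span-lookup helper are illustrative, not taken from the authors' code:

```python
# A minimal sketch of the task's data structures (names are illustrative).
# A dialogue is a list of (speaker, sentence) pairs.
dialogue = [
    ("benkong2", "also i did a sudo chown -r and also got permission denied"),
    ("Dr_Willis", "swapfile drive ? you mean a swap partition ?"),
    ("smo", "for vfat filesystems , the permissions are dictated by "
            "the mount options , not chmod"),
]
question = "What is the permission dictated by ?"

# An extractive answer is a span of the concatenated context,
# or None when the question is unanswerable from this dialogue alone.
context = " ".join(f"{spk}: {utt}" for spk, utt in dialogue)
start = context.find("the mount options")
answer = context[start:start + len("the mount options")] if start >= 0 else None
```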

3.2 Encoding

In order to utilize a PrLM such as BERT as an encoder to obtain the contextualized representations, we first concatenate the dialogue context and a question in the form [CLS] question [SEP] context [SEP]. For the convenience of dividing utterances, we insert a [SEP] token between each pair of adjacent utterances. The concatenated sequence is fed into a PrLM, and the output of the PrLM is the initial contextualized representation of each token, denoted as H ∈ R^{L×d}, where L denotes the input sequence length in tokens and d denotes the dimension of the hidden states.
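The input construction described above can be sketched as follows. This is a plain-string illustration; an actual implementation would use the PrLM's tokenizer to produce token ids:

```python
def build_input(question, utterances):
    """Concatenate the question and the dialogue context in the form
    [CLS] question [SEP] u1 [SEP] u2 [SEP] ... [SEP], inserting a [SEP]
    between adjacent utterances so they can be divided again later."""
    parts = ["[CLS]", question, "[SEP]"]
    for utt in utterances:
        parts.extend([utt, "[SEP]"])
    return " ".join(parts)

seq = build_input("What is the permission dictated by ?",
                  ["benkong2: also i did a sudo chown -r",
                   "smo: the permissions are dictated by the mount options"])
```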

3.3 Speaker Masking

Having obtained the output contextualized representations from a PrLM, we design a decoupling module to capture the speaker property of each utterance and represent the speaker transition information of the dialogue passage.

We modify the mask-based multi-head self-attention (MHSA) mechanism proposed by Liu et al. (2021), adapting it to multi-party dialogues. The mask-based MHSA is formulated as follows:

    Attention(Q, K, V, M) = softmax(QK^T / sqrt(d_k) + M) V,
    Head_i = Attention(H W_i^Q, H W_i^K, H W_i^V, M),
    MHSA(H, M) = [Head_1; ...; Head_h] W^O,

where Attention, Head, Q, K, V, and M denote the attention, head, query, key, value, and mask respectively, H denotes the original representations from the PrLM, and W_i^Q, W_i^K, W_i^V, and W^O are parameters. The operator [ ; ] denotes concatenation. Since speakers do not simply take turns as in two-party dialogues, we have to explicitly identify the speaker of each utterance. In the implementation, we build a vector labeling the speaker identity of each utterance, according to which we mask utterances from the same speaker and utterances from different speakers. This step is denoted as:

    H_same = MHSA(H, M_same),
    H_diff = MHSA(H, M_diff),

where the speaker identity vector determines M_same and M_diff, the masks for the same speaker and for different speakers. H_same contains the decoupled information of the same speaker while H_diff contains the decoupled information of different speakers, as shown in Figure 3.

Figure 3: Speaker-aware Masking for the example shown in Table 1.
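The speaker masks can be illustrated with a small NumPy sketch (our own illustration, not the authors' code): given per-token speaker ids, positions sharing a speaker are left open in the same-speaker mask and blocked in the different-speaker mask, using a large negative additive bias in place of negative infinity.

```python
import numpy as np

def speaker_masks(token_speaker_ids):
    """Build additive attention masks from per-token speaker ids.
    Positions sharing a speaker get 0 in m_same (attention allowed) and a
    large negative bias in m_diff, and vice versa; -1 marks non-utterance
    tokens (question, [SEP]), blocked in both channels in this sketch."""
    ids = np.asarray(token_speaker_ids)
    valid = (ids[:, None] >= 0) & (ids[None, :] >= 0)
    same = (ids[:, None] == ids[None, :]) & valid
    neg = np.float32(-1e9)  # stands in for -inf in the softmax bias
    m_same = np.where(same, 0.0, neg)
    m_diff = np.where(~same & valid, 0.0, neg)
    return m_same, m_diff

# Tokens of utterances by speaker 0, speaker 1, speaker 0, then a [SEP].
m_same, m_diff = speaker_masks([0, 0, 1, 1, 0, -1])
```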

Finally, we fuse the information from H_same, H_diff, and the original contextualized representation H, using the gate-based fusing method of Liu et al. (2021). The fusion can be written as:

    g = sigmoid(FC([H_same; H_diff; H])),
    H_s = g * H_same + (1 - g) * H_diff,

where "same" and "diff" are shorthand for the two channels, and FC is shorthand for a fully-connected layer. Finally, we get the speaker-aware representations H_s, which are of the same size as the original contextualized representation H.
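As an illustration of gate-based channel fusion in general, the following NumPy sketch shows a simplified stand-in (not the exact parameterization of Liu et al. (2021)): a learned gate decides, per position and dimension, how much of each decoupled channel to keep.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_fuse(h_same, h_diff, w):
    """Generic gate-based fusion of two decoupled channels: the gate is
    computed from both channels and blends them convexly per element."""
    gate = sigmoid(np.concatenate([h_same, h_diff], axis=-1) @ w)  # (L, d)
    return gate * h_same + (1.0 - gate) * h_diff

L, d = 4, 8
h_same = rng.normal(size=(L, d))
h_diff = rng.normal(size=(L, d))
w = rng.normal(size=(2 * d, d))
fused = gate_fuse(h_same, h_diff, w)
```

Because the gate lies in (0, 1), the fused value at every position stays between the two channel values, which keeps the fusion numerically stable.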

3.4 Graph Modeling

Complicated transitions of speaker roles segment the text into separate utterances and break the continuity of the passage, resulting in intricate interrelations among utterances. We assume that these relations are a reflection of speaker properties and provide passage-level clues for MRC.

We utilize graph neural networks to construct two heterogeneous graphs, called the speaker graph and the discourse graph, both in the form of a relational graph convolutional network (R-GCN) following Schlichtkrull et al. (2018). The speaker graph models relations based on the speaker property of each utterance. The discourse graph is built from the speaker-aware discourse parsing relations, which result from the complex non-adjacent dependencies caused by speaker transitions, and thus captures the latent speaker-aware information.

Speaker Graph

Since the speaker property of each utterance hugely impacts the dialogue development, we build the speaker graph to model relations between utterances based on the speaker property. Specifically, we build an R-GCN that connects utterances from the same speaker, letting information be exchanged among the statements of one speaker, in the hope of capturing speaker manner. We denote the graph as G = (V, E), where V denotes the set of vertices and E denotes the set of edges. First, we add vertices to represent every single utterance and a special global vertex for context-level information, denoted as:

    V = {v_1, v_2, ..., v_n, v_global},

where n is the number of utterances. For each pair of utterances sharing the same speaker, we construct one edge and a reverse edge, denoted as (v_i, v_j) and (v_j, v_i). Finally, we construct a self-directed edge (v_i, v_i) for each vertex, and we connect the global vertex to every other vertex, denoted as (v_global, v_i).

Figure 4 illustrates the graph structure of the example dialogue in Table 1, with different colors for different kinds of edges.

Figure 4: Speaker graph of the example dialogue in Table 1.
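The edge construction for the speaker graph can be sketched as follows. The relation names and the convention of indexing the global vertex as n are our own illustrative choices:

```python
def speaker_graph_edges(speakers):
    """Edge list for the speaker graph: an edge and its reverse between
    every pair of utterances sharing a speaker, a self-loop on every vertex
    (including the global vertex, index n), and edges from the global
    vertex to every utterance vertex."""
    n = len(speakers)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if speakers[i] == speakers[j]:
                edges.append((i, j, "same_speaker"))
                edges.append((j, i, "same_speaker_rev"))
    edges += [(v, v, "self") for v in range(n + 1)]
    edges += [(n, v, "global") for v in range(n)]
    return edges

edges = speaker_graph_edges(["benkong2", "Dr_Willis", "benkong2"])
```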

The original representations of the utterance vertices are the contextualized representations of the [SEP] tokens extracted from H, and the original representation of the global vertex is formed by embedding. The information exchange process can be formulated as:

    h_i^(l+1) = ReLU( sum_{r in R} sum_{j in N_i^r} (1 / |N_i^r|) W_r^(l) h_j^(l) + W_0^(l) h_i^(l) ),

where R denotes the set of relations with other vertices, N_i^r denotes the set of neighbours of vertex v_i connected to v_i through relation r, and |N_i^r| is the number of elements of N_i^r, used for normalization. W_r^(l) and W_0^(l) are the parameter matrices of layer l, and the activation function is ReLU Glorot et al. (2011); Agarap (2018). After information exchange with neighbour nodes, we obtain a vector for each utterance containing speaker-aware interrelation information, and after all layers we take the last-layer output of the graph. Based on the intuition that every token inside the same utterance shares the same speaker information, we expand the utterance vectors to the same dimension as H for later fusion. The extension is illustrated in Figure 5.

Figure 5: Extension of output of speaker graph.
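A plain NumPy sketch of one R-GCN layer and the token-level extension of Figure 5 follows. It is a simplified single-relation illustration under our own naming, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def rgcn_layer(h, edges, rel_weights, w_self):
    """One R-GCN layer (after Schlichtkrull et al. 2018) in plain NumPy:
    each vertex sums relation-specific transforms of its neighbours,
    normalized by the neighbour count per relation, plus a self term,
    followed by a ReLU."""
    out = h @ w_self
    for rel, w_rel in rel_weights.items():
        nbrs = {}  # target vertex -> source vertices under this relation
        for src, dst, r in edges:
            if r == rel:
                nbrs.setdefault(dst, []).append(src)
        for dst, srcs in nbrs.items():
            out[dst] += sum(h[s] @ w_rel for s in srcs) / len(srcs)
    return np.maximum(out, 0.0)

def expand_to_tokens(h_utt, token_utt_ids):
    """Extension of Figure 5: every token receives the vector of the
    utterance it belongs to, so graph output matches the token dimension."""
    return h_utt[np.asarray(token_utt_ids)]

d = 4
h = rng.normal(size=(3, d))                     # 3 utterance vertices
rel_weights = {"same_speaker": rng.normal(size=(d, d))}
h1 = rgcn_layer(h, [(0, 2, "same_speaker")], rel_weights, np.eye(d))
tok = expand_to_tokens(h1, [0, 0, 1, 2, 2])     # 5 tokens over 3 utterances
```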

Discourse Graph

Discourse relations contain latent speaker-aware information. In parallel to the speaker graph, we build a graph according to the annotated discourse relations to connect relevant utterance pairs. The preprocessing includes two steps: first, we assign a label to every considered relation; second, we simplify each relation into the form (first utterance, second utterance, relation label).

Then the graph is constructed according to the simplified representations of the relations. We denote the graph as G' = (V', E'), where V' denotes the set of vertices and E' denotes the set of edges. The following kinds of vertices are constructed in the graph: utterance vertices for each utterance, relation vertices for each annotated relation, and a global vertex to represent the dialogue-level information, denoted as:

    V' = {v_1, ..., v_n, r_1, ..., r_m, v_global},

where n is the number of utterances and m is the number of corresponding relations. In terms of E', for each relation (v_i, v_j, r_k), we construct oriented edges (v_i, r_k) and (r_k, v_j), as well as reverse oriented edges (r_k, v_i) and (v_j, r_k). As in the speaker graph, we add a self-directed edge to every vertex, and for each vertex except the global one, an edge from the global vertex is added. An example is shown in Figure 6.

Figure 6: Discourse graph of the example dialogue in Table 1.
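The discourse graph construction can be sketched analogously. The vertex indexing (relation vertices after utterance vertices, global vertex last) and the relation names are our own illustrative choices:

```python
def discourse_graph(n_utts, relations):
    """Vertices and edges for the discourse graph: one vertex per utterance,
    one per annotated relation triple (u_i, u_j, label), plus a global
    vertex. Each relation vertex is wired to both endpoint utterances in
    both directions; self-loops and global edges mirror the speaker graph."""
    n_rel = len(relations)
    glob = n_utts + n_rel  # index of the global vertex
    edges = []
    for k, (i, j, label) in enumerate(relations):
        r = n_utts + k     # index of the k-th relation vertex
        edges += [(i, r, label), (r, j, label),
                  (r, i, label + "_rev"), (j, r, label + "_rev")]
    edges += [(v, v, "self") for v in range(glob + 1)]
    edges += [(glob, v, "global") for v in range(glob)]
    return glob + 1, edges

n_vertices, edges = discourse_graph(3, [(0, 1, "QAP"), (1, 2, "Clarification")])
```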
Model                                 | Molweni EM | Molweni F1 | FriendsQA EM | FriendsQA F1
--------------------------------------|------------|------------|--------------|-------------
Public Baselines Li et al. (2020)     | 45.3       | 58.0       | 45.2         | -
Our Baseline                          | 45.7       | 58.8       | 45.6         | 61.0
 +Speaker Embedding Gu et al. (2020)  | 47.9       | 61.5       | 45.0         | 61.6
 +MDFN Liu et al. (2021)              | 48.4       | 62.4       | 46.1         | 62.9
 +Our architecture                    | 49.7       | 64.4       | 47.0         | 63.0
--------------------------------------|------------|------------|--------------|-------------
Public Baselines Li et al. (2020)     | 51.8       | 65.5       | -            | -
Our Baseline                          | 52.0       | 65.6       | 47.3         | 63.3
 +Speaker Embedding Gu et al. (2020)  | 52.4       | 65.7       | 46.8         | 63.3
 +MDFN Liu et al. (2021)              | 51.7       | 65.6       | 48.0         | 63.0
 +Our architecture                    | 52.9       | 66.9       | 49.0         | 64.0
--------------------------------------|------------|------------|--------------|-------------
Public Baselines Li et al. (2020)     | 54.7       | 67.6       | -            | -
Our Baseline                          | 53.9       | 67.5       | 50.1         | 66.2
 +Speaker Embedding Gu et al. (2020)  | 56.0       | 68.3       | 49.2         | 65.9
 +MDFN Liu et al. (2021)              | 55.8       | 68.7       | 50.4         | 66.2
 +Our architecture                    | 56.0       | 69.1       | 52.1         | 68.0
--------------------------------------|------------|------------|--------------|-------------
Public Baselines Li et al. (2020)     | -          | -          | -            | -
Our Baseline                          | 57.3       | 70.4       | 56.8         | 74.0
 +Speaker Embedding Gu et al. (2020)  | 57.9       | 57.9       | 56.7         | 74.0
 +MDFN Liu et al. (2021)              | 57.9       | 71.1       | 57.8         | 75.2
 +Our architecture                    | 58.6       | 72.2       | 58.7         | 75.4
Table 2: Experimental results on the test sets of Molweni and FriendsQA. All results are from our implementations except the public baselines.
Model                 | EM   | F1
----------------------|------|-----
BERT                  | 45.3 | 58.0
 +Speaker Masking     | 49.6 | 63.4
 +Speaker Graph       | 49.0 | 63.3
 +Discourse Graph     | 49.0 | 63.0
 +Our architecture    | 49.7 | 64.4
BERT                  | 51.8 | 65.5
 +Speaker Masking     | 52.7 | 65.8
 +Speaker Graph       | 52.7 | 66.0
 +Discourse Graph     | 52.1 | 65.5
 +Our architecture    | 52.9 | 66.9
BERT                  | 53.9 | 67.5
 +Speaker Masking     | 55.8 | 68.7
 +Speaker Graph       | 54.9 | 68.9
 +Discourse Graph     | 55.2 | 68.3
 +Our architecture    | 56.0 | 69.1
ELECTRA               | 57.3 | 70.4
 +Speaker Masking     | 57.9 | 71.0
 +Speaker Graph       | 57.6 | 72.1
 +Discourse Graph     | 58.4 | 71.8
 +Our architecture    | 58.6 | 72.2
Ablation on BERT      |      |
Our Model             | 49.7 | 64.4
 w/o Speaker Masking  | 48.6 | 63.0
 w/o Speaker Graph    | 49.1 | 63.2
 w/o Discourse Graph  | 49.2 | 63.5
Table 3: Ablation study.

Similar to the speaker graph, the original representations of the utterance vertices are the contextualized representations of the [SEP] tokens, while the original representations of the relation vertices and the global vertex are formed by embedding. After fusing information from related vertices, we obtain a vector for each utterance containing speaker-aware discourse structure information. The formulation of message passing is the same as for the speaker graph, except that the set of relations contains more kinds of relations, as shown in Figure 6. From the last-layer output of the discourse graph, we keep the vectors for the utterances and conduct the same extension as shown in Figure 5.

3.5 Fusing

Decoupled information from the aforementioned three modules is fused to predict the answer. We concatenate the representations from speaker masking, the speaker graph, and the discourse graph to obtain the final speaker-enhanced contextualized representations.

Following the standard process for span-based MRC Devlin et al. (2019); Glass et al. (2020); Zhang et al. (2021), the representations are fed to a fully connected layer to calculate the probability distributions of the start and end positions of answer spans, and the cross-entropy function is used as the training objective to be minimized.
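The span-extraction step can be sketched as follows. This is a minimal illustration; the handling of unanswerable questions via the [CLS] position and the constraint that the end follows the start are omitted:

```python
import numpy as np

def span_head(h, w_start, w_end):
    """Project each token representation to a start and an end logit and
    pick the argmax positions. A full system would also constrain
    end >= start and score the [CLS] position for unanswerable questions."""
    start_logits = h @ w_start  # (L,)
    end_logits = h @ w_end      # (L,)
    return int(np.argmax(start_logits)), int(np.argmax(end_logits))

rng = np.random.default_rng(2)
h = rng.normal(size=(10, 8))    # 10 tokens, hidden size 8
start, end = span_head(h, rng.normal(size=8), rng.normal(size=8))
```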

4 Experiments

Our method is evaluated on two multi-party multi-turn dialogue MRC benchmarks, Molweni Li et al. (2020) and FriendsQA Yang and Choi (2019).

4.1 Datasets


Molweni Li et al. (2020) is a multi-party multi-turn dialogue dataset derived from the Ubuntu Chat Corpus Lowe et al. (2015), consisting of 10,000 multi-party multi-turn dialogue contexts. On average, each dialogue context contains 8.82 utterances from 3.51 speaker roles. The following annotations were made on the raw dataset, making Molweni an ideal evaluation dataset for our research: 1) answerable and unanswerable extractive questions on the dialogues; 2) elementary discourse units (EDUs) at the utterance level, each including an utterance and a speaker name; 3) discourse relations for each dialogue passage, reflecting the interrelations between utterances.


To verify generality, we also evaluate our model on FriendsQA Yang and Choi (2019), a challenging multi-party multi-turn dialogue dataset comprising 1,222 human-to-human conversations from the TV show Friends, with 10,610 answerable extractive questions annotated. Discourse relations are annotated using the tool of Shi and Huang (2019).

4.2 Baseline

Following Li et al. (2020), we use BERT as a naive baseline, where the contextualized output is used for span extraction directly. In addition, we compare our model with existing speaker-aware work Liu et al. (2021); Gu et al. (2020). Since they work on the response selection task in the two-party scenario, or on datasets without explicit speaker annotations, we adapt and implement their ideas for the QA task in the multi-party scenario. We also apply stronger BERT variants and ELECTRA Clark et al. (2020) as baselines, to see whether the advantage of our method still holds on top of stronger PrLMs.

4.3 Setup

Our implementations are based on the Transformers library Wolf et al. (2020). Exact match (EM) and F1 score are the two metrics used to measure performance. We fine-tune our model with AdamW Loshchilov and Hutter (2019) as the optimizer. The learning rate is set to 3e-5, 5e-5, or 4e-6. In addition, the input sequence length is set to 348, to which all inputs are truncated or padded.

4.4 Results

Table 2 shows the results of our experiments. The experimental results show that our model outperforms all baselines and achieves state-of-the-art results on the Molweni benchmark. They also show that our model effectively captures speaker role information and speaker-aware discourse structure information, strengthening multi-party multi-turn MRC.

5 Analysis

Figure 7: Selected cases where baseline model fails (Prediction1) but our model gives gold answers (Prediction2). Related segments of dialogues are presented for illustration.

5.1 Ablation study

Since our speaker-aware enhancement method includes three separate modules, we perform an ablation study to verify their contributions. We ablate each of the aforementioned modules in turn and train under the same hyper-parameters. As shown in Table 3, the experimental results indicate that each module plays an effective part in the whole model, with the Speaker Masking module contributing the most.

5.2 Case Analysis

To intuitively show how our model improves MRC on multi-party multi-turn dialogues, we compare predictions from the baseline (BERT) with predictions from our model, showing how our speaker-aware enhancement strategies fix the baseline's errors. We select examples of different question types and compare the predictions, as shown in Figure 7.

In the first, Who-type case, the answer given by the baseline model is gnomefreak, the speaker name nearest to "opened the repositories". In contrast, lightbright, the answer given by our model, is the gold answer: the speaker of the utterance containing the phrase "opened the repositories". Our model is able to fix this because we regard each utterance as an EDU and effectively model the speaker information.

For the Why-type question in case 2, the baseline model fails to find a plausible answer. However, the Clarification-question relation and the QAP relation among the utterances (from fyrestrtr), (from alexbOrsova), and (from alexbOrsova) are evident, and our model captures them.

In the third, What-type case, the answer ubuntu given by the baseline model is already reasonable, as its supporting utterance contains the keyword use. But our model gives the gold answer linux, a more precise span from the utterance by noone.

As these cases show, our model enhances the connections between utterances and their own speakers and captures the speaker-aware discourse relations, which helps to fix some wrong cases.

6 Conclusion

In this work, we study machine reading comprehension on multi-party multi-turn dialogues and propose an enhanced speaker-aware model that models speaker information comprehensively and takes an early step in leveraging discourse relations for dialogue MRC. Our model is evaluated on two multi-party multi-turn dialogue benchmarks, Molweni and FriendsQA. Experimental results show the superiority of our method compared to previous work. In addition, we analyze the contribution of each module through an ablation study and present examples for intuitive illustration. Our work verifies that speaker roles and their interrelations are significant characteristics of dialogue contexts. Our model benefits from enhancing the connections between utterances and their speakers and from capturing the speaker-aware discourse relations.


  • S. Afantenos, E. Kow, N. Asher, and J. Perret (2015) Discourse parsing for multi-party chat dialogues. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 928–937. Cited by: §2.1.
  • A. F. Agarap (2018) Deep learning using rectified linear units (relu). External Links: 1803.08375 Cited by: §3.4.
  • N. Asher, J. Hunter, M. Morey, F. Benamara, and S. Afantenos (2016) Discourse structure and dialogue acts in multiparty dialogue: the STAC corpus. In 10th International Conference on Language Resources and Evaluation (LREC), pp. 2721–2727. Cited by: §2.2.
  • H. Bai and H. Zhao (2018) Deep enhanced representation for implicit discourse relation recognition. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pp. 571–583. Cited by: §2.2.
  • C. Braud, M. Coavoux, and A. Søgaard (2017) Cross-lingual RST discourse parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (ACL): Volume 1, Long Papers, pp. 292–304. Cited by: §2.2.
  • D. Cai and H. Zhao (2017) Pair-aware neural sentence modeling for implicit discourse relation classification. In International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 458–466. Cited by: §2.2.
  • H. Chen, X. Liu, D. Yin, and J. Tang (2017) A survey on dialogue systems: recent advances and new frontiers. In ACM SIGKDD Explorations Newsletter, Cited by: §2.1.
  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2174–2184. External Links: Link Cited by: §1, §2.1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §4.2.
  • L. Cui, Y. Wu, S. Liu, Y. Zhang, and M. Zhou (2020) MuTual: A Dataset for Multi-Turn Dialogue Reasoning. In Proceedings of the 58th Conference of the Association for Computational Linguistics (ACL), Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (ACL): Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. External Links: Link, Document Cited by: §1, §3.5.
  • Y. Gao, C. Wu, J. Li, S. Joty, S. C.H. Hoi, C. Xiong, I. King, and M. Lyu (2020) Discern: discourse-aware Entailment Reasoning Network for Conversational Machine Reading. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2439–2449. External Links: Link, Document Cited by: §2.2.
  • M. Glass, A. Gliozzo, R. Chakravarti, A. Ferritto, L. Pan, G. S. Bhargav, D. Garg, and A. Sil (2020) Span selection pre-training for question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2773–2782. Cited by: §3.5.
  • X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §3.4.
  • J. Gu, T. Li, Q. Liu, Z. Ling, Z. Su, S. Wei, and X. Zhu (2020) Speaker-Aware BERT for Multi-Turn Response Selection in Retrieval-Based Chatbots. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management (CIKM), Virtual Event, Ireland, October 19-23, 2020, M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and P. Cudré-Mauroux (Eds.), pp. 2041–2044. External Links: Link, Document Cited by: §1, §1, §2.1, Table 2, §4.2.
  • H. Huang, E. Choi, and W. Yih (2018) FlowQA: grasping flow in history for conversational machine comprehension. In International Conference on Learning Representations (ICLR), Cited by: §2.1.
  • Q. Jia, Y. Liu, S. Ren, K. Zhu, and H. Tang (2020) Multi-turn response selection using dialogue dependency relations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1911–1920. Cited by: §2.2.
  • S. Joty, G. Carenini, and R. T. Ng (2015) CODRA: A Novel Discriminative Framework for Rhetorical Analysis. Computational Linguistics 41 (3), pp. 385–435. External Links: Link, Document Cited by: §2.2.
  • F. Li, M. Qiu, H. Chen, X. Wang, X. Gao, J. Huang, J. Ren, Z. Zhao, W. Zhao, L. Wang, et al. (2017) AliMe Assist: an intelligent assistant for creating an innovative e-commerce experience. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), pp. 2495–2498. Cited by: §2.1.
  • J. Li, M. Liu, M. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and B. Qin (2020) Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure. In Proceedings of the 28th International Conference on Computational Linguistics (COLING), pp. 2642–2652. External Links: Link, Document Cited by: §1, §1, §1, §1, §2.1, §2.1, §2.1, §2.2, Table 2, §4.1, §4.2, §4.
  • Q. Li, T. Li, and B. Chang (2016) Discourse parsing with attention-based hierarchical neural networks. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 362–371. Cited by: §2.2.
  • C. Liu, D. Xiong, Y. Jia, H. Zan, and C. Hu (2020) HisBERT for conversational reading comprehension. In 2020 International Conference on Asian Language Processing (IALP), pp. 147–152. Cited by: §2.1.
  • L. Liu, Z. Zhang, H. Zhao, X. Zhou, and X. Zhou (2021) Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Cited by: §1, §1, §2.1, §3.3, §3.3, Table 2, §4.2.
  • Y. Liu and M. Lapata (2017) Learning contextually informed representations for linear-time discourse parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1289–1298. Cited by: §2.2.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §4.3.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the SIGDIAL 2015 Conference, The 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2-4 September 2015, Prague, Czech Republic, pp. 285–294. External Links: Link, Document Cited by: §1, §4.1.
  • W. C. Mann and S. A. Thompson (1988) Rhetorical structure theory: toward a functional theory of text organization. Text 8 (3), pp. 243–281. Cited by: §2.2.
  • S. Ouyang, Z. Zhang, and H. Zhao (2021) Dialogue graph modeling for conversational machine reading. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 3158–3169. External Links: Link, Document Cited by: §2.2.
  • R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, and B. Webber (2008) The penn discourse treebank 2.0.. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC), Cited by: §2.2.
  • L. Qin, Z. Zhang, H. Zhao, Z. Hu, and E. Xing (2017) Adversarial connective-exploiting networks for implicit discourse relation classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1006–1017. Cited by: §2.2.
  • C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, and M. Iyyer (2019) BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136. Cited by: §1, §2.1.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics (TACL) 7, pp. 249–266. External Links: Link Cited by: §1, §2.1.
  • M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Lecture Notes in Computer Science, Vol. 10843, pp. 593–607. External Links: Link, Document Cited by: §3.4.
  • Z. Shi and M. Huang (2019) A deep sequential model for discourse parsing on multi-party dialogues. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7007–7014. Cited by: §1, §2.1, §2.2, §4.1.
  • H. Shum, X. He, and D. Li (2018) From Eliza to XiaoIce: challenges and opportunities with social chatbots. Frontiers of Information Technology & Electronic Engineering 19 (1), pp. 10–26. Cited by: §2.1.
  • K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, and C. Cardie (2019) DREAM: a challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics (TACL) 7, pp. 217–231. External Links: Document Cited by: §2.1.
  • R. Takanobu, M. Huang, Z. Zhao, F. Li, H. Chen, X. Zhu, and L. Nie (2018) A weakly supervised method for topic segmentation and labeling in goal-oriented dialogues via reinforcement learning. In IJCAI, pp. 4403–4410. Cited by: §2.2.
  • Y. Wang, S. Li, and H. Wang (2017) A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 2: Short Papers), pp. 184–188. Cited by: §2.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 38–45. External Links: Link Cited by: §4.3.
  • Y. Wu, W. Wu, M. Zhou, and Z. Li (2016) Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-based Chatbots. CoRR abs/1612.01627. External Links: Link, 1612.01627 Cited by: §1.
  • J. Xu, Z. Lei, H. Wang, Z. Niu, H. Wu, and W. Che (2021) Discovering dialog structure graph for coherent dialog generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 1726–1739. External Links: Link, Document Cited by: §2.2.
  • A. Yang and S. Li (2018) SciDTB: discourse dependency treebank for scientific abstracts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 444–449. Cited by: §2.2.
  • Z. Yang and J. D. Choi (2019) FriendsQA: open-domain question answering on TV show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 188–197. External Links: Link, Document Cited by: §4.1, §4.
  • N. Yu, M. Zhang, and G. Fu (2018) Transition-based neural RST parsing with implicit syntax features. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), pp. 559–570. Cited by: §2.2.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling Multi-turn Conversation with Deep Utterance Aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), pp. 3740–3752. External Links: Link Cited by: §1.
  • Z. Zhang, J. Yang, and H. Zhao (2021) Retrospective reader for machine reading comprehension. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Cited by: §3.5.
  • C. Zhu, M. Zeng, and X. Huang (2018a) SDNet: contextualized attention-based deep network for conversational question answering. CoRR abs/1812.03593. External Links: 1812.03593 Cited by: §2.1.
  • P. Zhu, Z. Zhang, J. Li, Y. Huang, and H. Zhao (2018b) Lingke: a fine-grained multi-turn chatbot for customer service. In Proceedings of the 27th International Conference on Computational Linguistics (COLING): System Demonstrations, pp. 108–112. Cited by: §2.1.