Structural Modeling for Dialogue Disentanglement

10/15/2021
by Xinbei Ma, et al.
Shanghai Jiao Tong University

Tangled multi-party dialogue context leads to challenges for dialogue reading comprehension: multiple dialogue threads flow simultaneously within the same dialogue history, increasing the difficulty of understanding a dialogue history for both humans and machines. Dialogue disentanglement aims to clarify the conversation threads in a multi-party dialogue history, thus reducing the difficulty of comprehending the long, disordered dialogue passage. Existing studies commonly focus on utterance encoding with carefully designed feature engineering-based methods but pay inadequate attention to dialogue structure. This work designs a novel model to disentangle a multi-party history into threads by taking dialogue structure features into account. Specifically, based on the fact that dialogues are constructed through the successive participation of speakers and interactions between users of interest, we extract clues of speaker property and reference among users to model the structure of a long dialogue record. The novel method is evaluated on the Ubuntu IRC dataset and shows state-of-the-art experimental results in dialogue disentanglement.


1 Introduction

As the boom of social networking services rapidly facilitates communication among people, group chatting happens all the time and generates multi-party dialogues with long histories Lowe et al. (2015); Zhang et al. (2018); Choi et al. (2018); Reddy et al. (2019); Li et al. (2020a). Different from plain texts, which are formally written by their authors, multi-party dialogues are organized by distributed participants in a random way and thus exhibit disorder and confusion Kummerfeld et al. (2019); Elsner and Charniak (2010); Joty et al. (2019); Shen et al. (2006); Jiang et al. (2018, 2021). As shown in Figure 1, the development of a multi-party chatting dialogue has special characteristics: 1) Random users successively participate in the dialogue and follow certain topics that they are interested in, so the threads of those topics develop within the dialogue. 2) Users reply to related former utterances and mention the users involved, which brings dependencies among utterances. In a word, the behavior of speakers determines the structure of a dialogue history record.

Figure 1: An example of multi-party chatting dialogues.

Due to the aforementioned features of dialogue development, multiple conversation threads always develop simultaneously in a dialogue history, which causes trouble for both humans and machines when understanding dialogue context or dealing with various reading comprehension tasks Kummerfeld et al. (2019); Elsner and Charniak (2010); Joty et al. (2019); Shen et al. (2006); Jiang et al. (2018, 2021). Therefore, disentangling the context, i.e., clustering conversation threads, serves as an effective prerequisite for downstream tasks on dialogues Elsner and Charniak (2010); Liu et al. (2021a); Jia et al. (2020), as it helps screen the relevant parts for further machine reading comprehension (MRC) applications.

Nevertheless, existing works on dialogue disentanglement leave room for improvement Zhu et al. (2020); Yu and Joty (2020); Li et al. (2020c): they ignore or pay little attention to the special features of dialogues and show sub-optimal performance. Earlier works mainly depend on feature engineering Kummerfeld et al. (2019); Elsner and Charniak (2010); Yu and Joty (2020), using well-constructed handcrafted features to train a naive classifier Elsner and Charniak (2010) or a linear feed-forward network Kummerfeld et al. (2019). Recent works are mainly based on two strategies: 1) two-step Mehri and Carenini (2017); Zhu et al. (2020); Yu and Joty (2020); Li et al. (2020c); Liu et al. (2021a) and 2) end-to-end Tan et al. (2019); Liu et al. (2020). In the two-step strategy, the disentanglement task is divided into matching and clustering: utterance pairs are first matched to detect reply-to relations, and utterances are then divided into clusters according to the matching scores. In the end-to-end strategy, a dialogue state is maintained for each conversation thread; a new utterance is matched against these states, assigned to the best-matched thread, and the corresponding state is updated accordingly.

Recently, pre-trained language models (PrLMs) Devlin et al. (2019); Liu et al. (2019); Clark et al. (2020) have brought prosperity to downstream natural language processing tasks by providing contextualized backbones, and various works have combined strong PrLMs with dialogue features for performance gains Lowe et al. (2015); Li et al. (2020a); Liu et al. (2021b); Jia et al. (2020); Wang et al. (2020). At the same time, domain-adaptive pre-training has been proposed to better adapt language models to in-domain tasks Wang et al. (2020); Li et al. (2020b); Xu and Zhao (2021). Works on dialogue disentanglement also make use of PrLMs for performance improvement Li et al. (2020c); Zhu et al. (2020).

In this work, we design a new model that handles long, tangled multi-party dialogues by leveraging dialogue structure-aware information, aiming at a better solution to the dialogue disentanglement problem that can in turn contribute to downstream dialogue MRC tasks. The structure of a multi-party dialogue follows from the actions of speakers during dialogue development. We therefore extract 1) speaker property and 2) reference among users to characterize dependencies between utterances, which are taken into account to help detect reply-to relationships. Experiments are conducted on the DSTC-8 Ubuntu IRC dataset Kummerfeld et al. (2019), where the model improves over previous results and sets a new state of the art.

2 Background and Related Work

2.1 Pre-trained Language Models

Pre-trained language models (PrLMs) have brought remarkable achievements in a wide range of natural language processing (NLP) tasks. BERT Devlin et al. (2019) is one of the pioneers that achieved significant progress in language understanding tasks. It was pre-trained on a large corpus to learn basic language knowledge via two self-supervised training objectives, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) Devlin et al. (2019). Variants of BERT such as RoBERTa Liu et al. (2019) and ELECTRA Clark et al. (2020) have been proposed with stronger capacity. For downstream NLP tasks, PrLMs usually serve as contextualized encoders with task-oriented layers added on top. For dialogue disentanglement on multi-party dialogues, previous works concatenate utterances as the input and feed the encoded output into subsequent layers to detect relationships between utterances Zhu et al. (2020); Li et al. (2020c).

2.2 Dialogue-related Machine Reading Comprehension

Dialogue-related machine reading comprehension is challenging because of the complicated scenarios arising from multiple speakers and criss-crossed dependencies among utterances Lowe et al. (2015); Yang and Choi (2019); Sun et al. (2019); Li et al. (2020a). A dialogue is developed by all involved speakers in a distributed way, where an individual speaker focuses on one of the topics discussed in the conversation or replies to utterances from other speakers. Therefore, consistency and continuity are broken by tangled reply-to dependencies between non-adjacent utterances Li et al. (2020a); Jia et al. (2020); Ma et al. (2021); Li et al. (2021), leading to a graph structure that is quite different from the smooth presentation of plain texts. Recently, a number of dialogue-related MRC works have exploited dialogue structure-aware features to improve the adaptation of language models to dialogue passages Jia et al. (2020); Ouyang et al. (2021); Ma et al. (2021); Li et al. (2021), achieving progress over methods designed for plain texts. A wide range of dialogue-related MRC tasks such as response selection Gu et al. (2020); Liu et al. (2021b), question answering Ma et al. (2021); Li et al. (2021) and emotion detection Hu et al. (2021) have benefited from attention to dialogue structure-aware information.

2.3 Dialogue Disentanglement

Dialogue disentanglement Elsner and Charniak (2010), also referred to as conversation management Traum et al. (2004), thread detection Shen et al. (2006) or thread extraction Adams (2008), has been studied for decades because of its importance both for understanding long multi-party dialogues and for assisting downstream NLP tasks on dialogues.

With the goal of disentangling multi-party dialogue histories, a number of methods have been proposed over the years. Early works are based on feature encoding and clustering algorithms: well-designed handcrafted features are constructed to train simple networks that predict whether a pair of utterances are alike or different, and clustering methods are borrowed for partitioning Elsner and Charniak (2010); Jiang et al. (2018). Research has been facilitated by a large-scale, high-quality dataset, Ubuntu IRC, published by Kummerfeld et al. (2019). Subsequently, feed-forward networks and pointer networks Vinyals et al. (2015) were applied, leading to significant progress, although the improvement still partially relies on handcrafted features Kummerfeld et al. (2019); Yu and Joty (2020). The end-to-end strategy was then proposed to bridge the gap between the two steps Liu et al. (2020): the disentanglement problem is modeled as a dialogue state transition process, and utterances are clustered by matching them with the states of each dialogue thread.

Inspired by the achievements of pre-trained language models Devlin et al. (2019); Clark et al. (2020); Liu et al. (2019), approaches based on PrLMs have recently been proposed Zhu et al. (2020); Gu et al. (2020), where PrLMs serve as backbones producing contextualized encodings of utterances, and the encoded token representations are used for higher-level operations.

However, the attention paid to the special features of dialogues remains inadequate. Feature engineering-based works represent properties of individual utterances such as time, speakers and topics with naive handcrafted methods and ignore dialogue context Elsner and Charniak (2010); Kummerfeld et al. (2019). The Masked Hierarchical Transformer Zhu et al. (2020) uses the gold conversation structure to restrict attention to related utterances during training, which results in exposure bias. DialBERT Li et al. (2020c) is BERT Devlin et al. (2019) with an LSTM Hochreiter and Schmidhuber (1997) added on top to model contextual clues, and claims state-of-the-art performance. In this work, we propose a new method that considers structure-aware clues, based on the fact that dialogues develop according to the behavior of speakers, in order to disentangle a multi-party chatting dialogue history. We model dialogues with attention paid to two aspects: 1) the speaker property of each utterance, which helps with understanding utterances, and 2) interactions of speakers across utterances, which reflect the development of conversation threads. We evaluate the model on the Ubuntu IRC dataset Kummerfeld et al. (2019) and obtain state-of-the-art performance.

3 Methodology

The definition of the dialogue disentanglement task and the details of our model are presented in this section, showing how we handle the disentanglement task with dialogue structure-aware features.

3.1 Task Formulation

Suppose that we perform disentanglement on a long multi-party dialogue history $D = \{u_1, u_2, \dots, u_N\}$, which is composed of $N$ utterances. An utterance $u_i$ includes the identity of its speaker $s_i$ and the message $m_i$ sent by this user, and is thus denoted as $u_i = (s_i, m_i)$. As several threads flow simultaneously within $D$, we define a set of threads $T = \{T_1, T_2, \dots, T_K\}$ as a partition of $D$, where each $T_k \subseteq D$ denotes a thread of the conversation. In this task, we aim to disentangle $D$ into $T$. As indicated before, a multi-party dialogue is constructed by the successive participation of speakers, who often reply to former utterances of interest. Thus, a dialogue passage can be modeled as a graph whose vertices denote utterances and whose edges denote reply-to relationships between utterances. We therefore focus on finding a parent node for each utterance by inferring reply-to relationships, so as to discover edges and then determine the graph of a conversation thread.
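To make this formulation concrete, the following minimal sketch (ours, not from the paper) shows how a thread partition is recovered once a parent index has been predicted for every utterance; a self-link marks the root of a new thread, and threads correspond to connected components of the resulting reply-to forest.

```python
from collections import defaultdict

def threads_from_parents(parents):
    """Recover conversation threads from predicted reply-to links.

    parents[i] is the index of the predicted parent of utterance i;
    parents[i] == i (a self-link) marks the root of a new thread.
    Threads are the connected components of the resulting forest.
    """
    def find(i):
        # Follow parent links up to the root of the thread.
        while parents[i] != i:
            i = parents[i]
        return i

    clusters = defaultdict(list)
    for i in range(len(parents)):
        clusters[find(i)].append(i)
    return list(clusters.values())

# Toy example: utterances 0 and 2 start threads; 1 and 3 reply to 0; 4 replies to 2.
print(threads_from_parents([0, 0, 2, 0, 2]))  # [[0, 1, 3], [2, 4]]
```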

Figure 2: Overview of our model.

3.2 Model Architecture

Figure 2 shows the architecture of the proposed model, which consists of three modules: the text encoder, structural modeling, and context-aware prediction. The utterances from a dialogue history are encoded by a pre-trained language model, whose output is aggregated to context level in the encoder. The representation is then fed into the structural modeling module, which characterizes the context with speaker-aware and reference-aware features. Finally, the prediction module fuses these features and predicts the reply-to relationships.

3.2.1 Encoder

Pairwise Encoding. Following previous works Zhu et al. (2020); Li et al. (2020c), we utilize a pre-trained language model, e.g., BERT Devlin et al. (2019), as the encoder to obtain contextualized representations of tokens. Since chat records are long and continuous, it is inappropriate to concatenate the whole context as one input. Thus, we concatenate an utterance with each candidate separately at the encoding stage, at the cost of sacrificing some contextual information from the former history.

Assume that for an utterance $u_i$, we consider the former $C$ utterances (including $u_i$ itself) as candidates for the parent node of $u_i$. The input to the PrLM is of the form [CLS] $u_j$ [SEP] $u_i$ [SEP], where $i-C < j \le i$. The output is denoted as $H \in \mathbb{R}^{C \times L \times d}$, where $C$ denotes the window length within which former utterances are considered as parent candidates, $L$ denotes the input sequence length in tokens, and $d$ denotes the dimension of the hidden states of the PrLM. Note that the gold parent utterance may lie beyond the range of $C$. In this case we label a self-loop for $u_i$, meaning that $u_i$ begins a new dialogue thread because it is too far from its parent, making $u_i$ the root of the thread. This makes sense in the real world: when users enter a chatroom, they tend to check a limited number of recent messages and reply to them, instead of scanning the whole chat record.

Utterance Aggregation. $H$ contains the pairwise contextualized representations of the utterance $u_i$ and each candidate $u_j$, which are aggregated to context-level representations for further modeling. Since the next-sentence-prediction information is modeled at the position of [CLS], we simply keep the representations of [CLS]. After concatenating the pairwise representations over all candidates, we denote the aggregated representation as $E \in \mathbb{R}^{C \times d}$, where $C$ denotes the window length and $d$ denotes the dimension of the hidden states of the PrLM.
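As an illustration of the pairwise encoding and [CLS]-based aggregation described above, here is a hedged sketch using the Hugging Face Transformers API; `bert-base-uncased`, the speaker-prefixed message format, and the helper `encode_pairwise` are our own assumptions, not the released implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_pairwise(utterances, i, window=50, max_len=128):
    """Encode utterance i against each of its preceding candidates
    (including itself) and keep the [CLS] vector of every pair."""
    start = max(0, i - window + 1)
    candidates = utterances[start : i + 1]           # at most `window` candidates
    batch = tokenizer(
        candidates,                                   # candidate parent u_j
        [utterances[i]] * len(candidates),            # current utterance u_i
        padding="max_length", truncation=True,
        max_length=max_len, return_tensors="pt",
    )
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state      # (num_pairs, max_len, hidden)
    return out[:, 0, :]                               # [CLS] per pair: (num_pairs, hidden)

utterances = ["alice: how do I mount a usb drive?",
              "bob: try sudo mount /dev/sdb1 /mnt",
              "alice: thanks, that worked"]
E = encode_pairwise(utterances, i=2)
print(E.shape)  # torch.Size([3, 768])
```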

3.2.2 Structural Modeling

Speaker Property Modeling. To enhance the speaker property of each utterance, we follow the mask-based Multi-Head Self-Attention (MHSA) mechanism to emphasize correlations between utterances from the same speaker. The mask-based MHSA is formulated as follows:

$$\mathrm{Attn}(Q, K, V, M) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,$$
$$\mathrm{head}_t = \mathrm{Attn}(EW_t^{Q}, EW_t^{K}, EW_t^{V}, M),$$
$$\mathrm{MHSA}(E, M) = [\mathrm{head}_1; \dots; \mathrm{head}_h]\,W^{O},$$

where $\mathrm{Attn}$, $\mathrm{head}_t$, $Q$, $K$, $V$, $M$, $h$ denote the attention, head, query, key, value, mask, and the number of heads, $E$ denotes the input matrix, and $W_t^{Q}$, $W_t^{K}$, $W_t^{V}$, $W^{O}$ are parameters. The operator $[\cdot;\cdot]$ denotes concatenation. At this step, we input the aggregated representation $E$ with a speaker-aware mask:

$$M_{jk} = \begin{cases} 0, & s_j = s_k \\ -\infty, & s_j \ne s_k \end{cases}$$

where $s_j$ denotes the speaker identity of candidate $u_j$ and $M$ denotes the speaker-property mask. The output of MHSA, $E_{\mathrm{attn}}$, has the same dimension as $E$. We concatenate $E$ and $E_{\mathrm{attn}}$ and project them back to the original size with a linear layer, resulting in the final output of this module, denoted as $E_{\mathrm{spk}}$.
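A possible realization of this speaker-aware masked self-attention is sketched below in PyTorch; `SpeakerAttention`, `speaker_mask`, and the concrete layer sizes are illustrative assumptions rather than the authors' configuration.

```python
import torch
import torch.nn as nn

def speaker_mask(speakers):
    """Boolean mask: position (i, j) is masked (True) when the two
    candidate utterances come from different speakers."""
    s = list(speakers)
    n = len(s)
    return torch.tensor([[s[i] != s[j] for j in range(n)] for i in range(n)])

class SpeakerAttention(nn.Module):
    """Mask-based multi-head self-attention over the window of candidates,
    followed by the concatenate-and-project step described above."""
    def __init__(self, hidden=768, heads=8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, E, speakers):
        # E: (window, hidden) pairwise [CLS] representations for one utterance.
        mask = speaker_mask(speakers)                       # (window, window)
        attn, _ = self.mhsa(E.unsqueeze(0), E.unsqueeze(0), E.unsqueeze(0),
                            attn_mask=mask)
        E_spk = self.proj(torch.cat([E, attn.squeeze(0)], dim=-1))
        return E_spk                                        # (window, hidden)
```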

Model VI ARI 1-1 F1 P R
Test Set
FeedForward Kummerfeld et al. (2019) 91.3 - 75.6 36.2 34.6 38.0
union Kummerfeld et al. (2019) 86.2 - 62.5 33.4 40.4 28.5
vote Kummerfeld et al. (2019) 91.5 - 76.0 38.0 36.3 39.7
intersect Kummerfeld et al. (2019) 69.3 - 26.6 32.1 67.0 21.1
Elsner Elsner and Charniak (2008) 82.1 - 51.4 15.5 12.1 21.5
Lowe Lowe et al. (2017) 80.6 - 53.7 8.9 10.8 7.6
BERT Li et al. (2020c) 90.8 62.9 75.0 32.5 29.3 36.6
DialBERT Li et al. (2020c) 92.6 69.6 78.5 44.1 42.3 46.2
 +cov Li et al. (2020c) 93.2 72.8 79.7 44.8 42.1 47.9
 +feature Li et al. (2020c) 92.4 66.6 77.6 42.2 38.8 46.3
 +future context Li et al. (2020c) 92.3 66.3 79.1 42.6 40.0 45.6
Ptr-Net Yu and Joty (2020) 92.3 70.2 - 36.0 33.0 38.9
 + Joint train Yu and Joty (2020) 93.1 71.3 - 39.7 37.2 42.5
 + Self-link Yu and Joty (2020) 93.0 74.3 - 41.5 42.2 44.9
 + Joint train&Self-link Yu and Joty (2020) 94.2 80.1 - 44.5 44.9 44.2
BERT (Our baseline) 91.4 60.8 74.4 37.2 34.0 41.2
Our model 94.0 (+2.6) 74.4 (+13.6) 81.9 (+7.5) 46.0 (+8.9) 46.1 (+12.1) 47.6 (+6.4)
Dev Set
Decom. Atten. Parikh et al. (2016) 70.3 - 39.8 0.6 0.9 0.7
 +feature Parikh et al. (2016) 87.4 - 66.6 21.1 18.2 25.2
ESIM Chen et al. (2017) 72.1 - 44.0 1.4 2.2 1.8
 +feature Chen et al. (2017) 87.7 - 65.8 22.6 18.9 28.3
MHT Zhu et al. (2020) 82.1 - 59.6 8.7 12.6 10.3
 +feature Zhu et al. (2020) 89.8 - 75.4 35.8 32.7 34.2
DialBERT Li et al. (2020c) 94.1 81.1 85.6 48.0 49.5 46.6
BERT (Our baseline) 91.7 74.6 80.2 33.5 32.2 35.0
Our model 94.4 (+2.7) 81.8 (+7.2) 86.1 (+5.9) 52.6 (+19.1) 51.0 (+18.8) 54.3 (+9.3)
Table 1: Experimental results on the Ubuntu IRC dataset Kummerfeld et al. (2019).

Reference Dependency Modeling. As discussed above, the reference relation between speakers is the most important and straightforward dependency among utterances, since references indicate the interactions between users that drive the development of a dialogue record. To this end, we build a matrix that labels the references, regarded as the adjacency matrix of a graph representation. In the reference graph, a vertex denotes an utterance and an edge denotes a reference dependency. For example, if $u_j$ contains a reference to the speaker of $u_k$, there is an edge from $u_j$ to $u_k$. Inspired by the wide application of graph convolutional networks (GCN) Kipf and Welling (2017), we borrow the relation-modeling method of the relational graph convolutional network (r-GCN) Schlichtkrull et al. (2018); Shi and Huang (2019) to enhance the reference dependencies:

$$h_j^{(l+1)} = \sigma\left(\sum_{r \in \mathcal{R}} \sum_{k \in \mathcal{N}_j^{r}} \frac{1}{c_{j,r}} W_r^{(l)} h_k^{(l)} + W_0^{(l)} h_j^{(l)}\right),$$

where $\mathcal{R}$ is the set of relationships, which in our module contains only the reference dependency, $\mathcal{N}_j^{r}$ denotes the set of neighbours of vertex $j$ connected to $j$ through relationship $r$, and $c_{j,r}$ is a constant for normalization. $W_r^{(l)}$ and $W_0^{(l)}$ are the parameter matrices of layer $l$. $\sigma$ is the activation function, which in our implementation is ReLU Glorot et al. (2011); Agarap (2018). The input of the r-GCN is $E_{\mathrm{spk}}$, and after this dependency modeling, the representation is denoted as $E_{\mathrm{ref}}$.
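Since only one relation is modeled here, the r-GCN layer reduces to a single-relation graph convolution with a self-loop term. The sketch below reflects that reduction; `ReferenceGCNLayer` is an illustrative name and the normalization choice is an assumption.

```python
import torch
import torch.nn as nn

class ReferenceGCNLayer(nn.Module):
    """Single-relation graph convolution over the reference graph
    (a reduced r-GCN, since only the reference dependency is modeled).
    A[i, j] = 1 if utterance i mentions the speaker of utterance j."""
    def __init__(self, hidden=768):
        super().__init__()
        self.w_rel = nn.Linear(hidden, hidden, bias=False)   # relation-specific weight
        self.w_self = nn.Linear(hidden, hidden, bias=False)  # self-loop weight
        self.act = nn.ReLU()

    def forward(self, H, A):
        # H: (window, hidden) node features; A: (window, window) reference adjacency.
        deg = A.sum(dim=-1, keepdim=True).clamp(min=1)       # normalization constant per node
        neighbor_msg = (A @ self.w_rel(H)) / deg             # aggregate referenced utterances
        return self.act(neighbor_msg + self.w_self(H))       # (window, hidden)
```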

3.2.3 Context-aware Prediction

As noted above, the separate pairwise encoding sacrifices some contextual information. To compensate, we add a Bi-LSTM Hochreiter and Schmidhuber (1997) layer that recovers contextual clues within the whole window of candidate parents. At the same time, the dialogue structure-aware representation needs to be combined with the original representation of [CLS] for enhancement.

Considering both, we employ the Syn-LSTM proposed by Xu et al. (2021), originally designed for named entity recognition (NER). The Syn-LSTM models contextual information while the reference dependency is highlighted, enriching relations among parent candidates. It is distinguished by an additional input gate for an extra input source, whose parameters are learned during training, achieving a better fusion of the input sources. The process in a Syn-LSTM cell can be formulated as:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad m_t = \sigma(W_m g_t + U_m h_{t-1} + b_m),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) + m_t \odot \tanh(W_g g_t + U_g h_{t-1} + b_g),$$
$$h_t = o_t \odot \tanh(c_t),$$

where $x_t$ and $g_t$ are the two inputs, $f_t$ is the forget gate, $o_t$ is the output gate, $i_t$ and $m_t$ are the input gates, and $W$, $U$, $b$ represent parameters. We use the Syn-LSTM bi-directionally, and its output is $O \in \mathbb{R}^{C \times 2d_h}$, where $d_h$ is the hidden size of the Syn-LSTM.
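For intuition, here is a sketch of an LSTM cell with an extra input gate for a second input stream, in the spirit of the Syn-LSTM; the exact gating and parameterization in Xu et al. (2021) may differ, so this is an assumption-laden illustration rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class TwoInputLSTMCell(nn.Module):
    """An LSTM cell with a second input stream g_t (e.g. the graph-enhanced
    representation) controlled by its own input gate, sketching the
    Syn-LSTM idea of fusing two input sources inside the cell."""
    def __init__(self, d_x, d_g, hidden):
        super().__init__()
        self.gates_x = nn.Linear(d_x + hidden, 4 * hidden)  # i, f, o, candidate for x_t
        self.gates_g = nn.Linear(d_g + hidden, 2 * hidden)  # m (extra input gate), candidate for g_t

    def forward(self, x_t, g_t, h_prev, c_prev):
        zx = self.gates_x(torch.cat([x_t, h_prev], dim=-1))
        i, f, o, c_hat = zx.chunk(4, dim=-1)
        zg = self.gates_g(torch.cat([g_t, h_prev], dim=-1))
        m, g_hat = zg.chunk(2, dim=-1)
        i, f, o, m = map(torch.sigmoid, (i, f, o, m))
        c = f * c_prev + i * torch.tanh(c_hat) + m * torch.tanh(g_hat)
        h = o * torch.tanh(c)
        return h, c
```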

At this stage, $O$ contains the structure feature-enhanced representation of each pair of the utterance $u_i$ and a candidate parent utterance $u_j$. To measure the correlation of these pairs, we follow previous work Li et al. (2020c) and adopt a Siamese comparison between the representation of each pair $(u_j, u_i)$ and that of the self pair $(u_i, u_i)$, yielding a feature $F_j$ for each candidate. The collection $F$ over all candidates is fed into a classifier to predict the parent utterance among all parent candidates. We use cross-entropy loss for model training.
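The final prediction step can be pictured as a per-window classification trained with cross-entropy, as in the sketch below; `ParentClassifier` and its linear scoring head are our own illustrative simplification of the classifier described above.

```python
import torch
import torch.nn as nn

class ParentClassifier(nn.Module):
    """Scores every candidate parent in the window and trains with
    cross-entropy against the index of the annotated parent (a self-link
    index when the true parent falls outside the window)."""
    def __init__(self, feat_dim):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, F, gold_parent=None):
        # F: (window, feat_dim) fused pair representations for one utterance.
        logits = self.scorer(F).squeeze(-1).unsqueeze(0)   # (1, window)
        if gold_parent is None:
            return logits.argmax(dim=-1)                   # predicted parent index
        return self.loss_fn(logits, torch.tensor([gold_parent]))
```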

4 Experiments

The proposed model is evaluated on a large-scale multi-party dialogue record dataset Ubuntu IRC Kummerfeld et al. (2019), which is also used as a dataset of DSTC-8 Track2 Task4. The results show that our model surpasses the baseline significantly and achieves a new state-of-the-art.

4.1 Dataset

Ubuntu IRC (Internet Relay Chat) is a benchmark corpus published by Kummerfeld et al. (2019), collected from the #Ubuntu and #Linux IRC channels. Reply-to relations between utterances are manually annotated in the form of (parent utterance, child utterance) pairs. Ubuntu IRC consists of 77,653 messages and is well annotated; it is the largest and most influential dataset for dialogue disentanglement and has heavily contributed to related research.

4.2 Metrics

Reply-to relations

We calculate the accuracy of parent utterance prediction, which indicates the ability to infer reply-to relations.

Disentanglement

For dialogue disentanglement, the threads of a conversation are formed by clustering all related utterances bridged by reply-to relations, i.e., each connected subgraph forms a thread. At this stage, we follow DSTC-8 and evaluate with Variation of Information (VI) Kummerfeld et al. (2019), Adjusted Rand Index (ARI) Kim et al. (2019), One-to-One Overlap (1-1) Elsner and Charniak (2010), and the precision (P), recall (R), and F1 score of clustering. Note that in the result tables we present 1-VI instead of VI Kummerfeld et al. (2019), so that for all metrics a larger value indicates stronger performance.
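For the clustering metrics, one common reading of the cluster precision/recall/F1 is exact matching of predicted threads against gold threads; the sketch below follows that assumption and may differ in details (e.g., the treatment of singleton threads) from the official DSTC-8 scorer.

```python
def cluster_prf(gold_threads, pred_threads):
    """Exact-match precision/recall/F1 over threads: a predicted thread
    counts as correct only if it matches a gold thread exactly."""
    gold = {frozenset(t) for t in gold_threads}
    pred = {frozenset(t) for t in pred_threads}
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One gold thread recovered exactly, the other split into two pieces.
print(cluster_prf([[0, 1, 3], [2, 4]], [[0, 1], [3], [2, 4]]))  # (0.333..., 0.5, 0.4)
```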

4.3 Setup

Our implementation is based on the Transformers library Wolf et al. (2020). We fine-tune our model with AdamW Loshchilov and Hutter (2019) as the optimizer, starting from a learning rate of 4e-6. The input sequence length is set to 128 tokens, to which inputs are truncated or padded, and the window width of considered candidates is 50.
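A hedged sketch of this setup is shown below; the model class is a placeholder for the full architecture, and `torch.optim.AdamW` stands in for whatever optimizer implementation the released code actually uses.

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder for the full model
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-6)

config = {
    "max_seq_length": 128,   # pairwise inputs are truncated or padded to this length
    "candidate_window": 50,  # number of preceding utterances considered as parents
    "learning_rate": 4e-6,
}
```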

4.4 Experimental Results

Table 1 shows our experimental results. Our model outperforms the baseline and all previously proposed models by a large margin, as highlighted in the table, and achieves a new state of the art (SOTA).

Model VI ARI 1-1 F1 P R
BERT 91.7 74.6 80.2 33.5 32.16 35.0
 + speaker 94.0 81.2 84.9 45.0 44.7 45.3
 + reference 94.1 82.4 85.6 47.4 47.4 47.4
 + Both 94.4 81.8 86.1 52.6 51.0 54.3
Table 2: Ablation study.
Model VI ARI 1-1 F1 P R
BERT 91.7 74.6 80.2 33.5 32.16 35.0
 w/ max-pooling 94.1 80.0 85.3 50.8 52.5 49.2
 w/ [CLS] 94.4 81.8 86.1 52.6 51.0 54.3
Table 3: Comparison about aggregation methods.
Model VI ARI 1-1 F1 P R
BERT 91.7 74.6 80.2 33.5 32.16 35.0
 w/ 1 layer 94.4 81.8 86.1 52.6 51.0 54.3
 w/ 2 layers 94.0 78.2 84.6 50.4 50.9 50.0
 w/ 3 layers 94.3 79.6 85.3 52.2 51.9 52.6
Table 4: The influence of the layers of Syn-LSTM.

5 Analysis

5.1 Ablation Study

We study the effects of the speaker property and the reference dependency respectively to verify their contributions. We ablate each feature in turn and retrain the model. The results in Table 2 show that both the speaker property and the reference dependency make non-trivial contributions.

5.2 Methods of Aggregation

At the aggregation stage, we obtain context-level representations. We compare different aggregation methods, i.e., max-pooling and extraction of the [CLS] tokens, with models trained under the same hyper-parameters. The results in Table 3 show that [CLS] extraction performs best.

5.3 Layers of LSTM

To see the effect of the depth of the Syn-LSTM, we experiment with different numbers of Syn-LSTM layers, again with the same hyper-parameters. According to the results in Table 4, a one-layer Syn-LSTM gives the best performance.

6 Conclusion

In this paper, we study disentanglement of long multi-party dialogue records and propose a new model that pays close attention to dialogue structure, i.e., the speaker property and the reference dependency. Our model is evaluated on the largest and latest benchmark dataset, Ubuntu IRC, where the experimental results show the advantage of our method over previous work and reach state-of-the-art performance. In addition, we analyze the contribution of each structure-related feature through ablation studies and examine the effects of different model architectures. Our work shows that speaker property and reference dependency are significant characteristics of dialogue contexts and deserve further study in multi-turn dialogue modeling.

References

  • P. H. Adams (2008) Conversation thread extraction and topic detection in text-based chat. Naval Postgraduate School, Calhoun. External Links: Link Cited by: §2.3.
  • A. F. Agarap (2018) Deep learning using rectified linear units (relu). ArXiv preprint abs/1803.08375. External Links: Link Cited by: §3.2.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. ArXiv preprint abs/1907.11692. External Links: Link Cited by: §1, §2.1, §2.3.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1657–1668. External Links: Document, Link Cited by: Table 1.
  • E. Choi, H. He, M. Iyyer, M. Yatskar, W. Yih, Y. Choi, P. Liang, and L. Zettlemoyer (2018) QuAC: question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2174–2184. External Links: Document, Link Cited by: §1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §1, §2.1, §2.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Document, Link Cited by: §1, §2.1, §2.3, §2.3, §3.2.1.
  • M. Elsner and E. Charniak (2008) You talking to me? a corpus and algorithm for conversation disentanglement. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 834–842. External Links: Link Cited by: Table 1.
  • M. Elsner and E. Charniak (2010) Disentangling chat. Computational Linguistics 36 (3), pp. 389–409. External Links: Document, Link Cited by: §1, §1, §1, §2.3, §2.3, §2.3, §4.2.
  • X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §3.2.2.
  • J. Gu, T. Li, Q. Liu, Z. Ling, Z. Su, S. Wei, and X. Zhu (2020) Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and P. Cudré-Mauroux (Eds.), pp. 2041–2044. External Links: Document, Link Cited by: §2.2, §2.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.3, §3.2.3.
  • D. Hu, L. Wei, and X. Huai (2021) DialogueCRN: contextual reasoning networks for emotion recognition in conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 7042–7052. External Links: Document, Link Cited by: §2.2.
  • Q. Jia, Y. Liu, S. Ren, K. Zhu, and H. Tang (2020) Multi-turn response selection using dialogue dependency relations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 1911–1920. External Links: Document, Link Cited by: §1, §1, §2.2.
  • J. Jiang, F. Chen, Y. Chen, and W. Wang (2018) Learning to disentangle interleaved conversational threads with a Siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1812–1822. External Links: Document, Link Cited by: §1, §1, §2.3.
  • Z. Jiang, L. Shi, C. Chen, J. Hu, and Q. Wang (2021) Dialogue disentanglement in software engineering: how far are we?. ArXiv preprint abs/2105.08887. External Links: Link Cited by: §1, §1.
  • S. Joty, G. Carenini, R. Ng, and G. Murray (2019) Discourse analysis and its applications. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, Florence, Italy, pp. 12–17. External Links: Document, Link Cited by: §1, §1.
  • S. Kim, M. Galley, R. C. Gunasekara, S. Lee, A. Atkinson, B. Peng, H. Schulz, J. Gao, J. Li, M. Adada, M. Huang, L. A. Lastras, J. K. Kummerfeld, W. S. Lasecki, C. Hori, A. Cherian, T. K. Marks, A. Rastogi, X. Zang, S. Sunkara, and R. Gupta (2019) The eighth dialog system technology challenge. CoRR abs/1911.06394. External Links: Link, 1911.06394 Cited by: §4.2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §3.2.2.
  • J. K. Kummerfeld, S. R. Gouravajhala, J. J. Peper, V. Athreya, C. Gunasekara, J. Ganhotra, S. S. Patel, L. C. Polymenakos, and W. Lasecki (2019) A large-scale corpus for conversation disentanglement. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3846–3856. External Links: Document, Link Cited by: §1, §1, §1, §1, §2.3, §2.3, Table 1, §4.1, §4.2, §4.
  • J. Li, M. Liu, M. Kan, Z. Zheng, Z. Wang, W. Lei, T. Liu, and B. Qin (2020a) Molweni: a challenge multiparty dialogues-based machine reading comprehension dataset with discourse structure. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 2642–2652. External Links: Document, Link Cited by: §1, §1, §2.2.
  • J. Li, M. Liu, Z. Zheng, H. Zhang, B. Qin, M. Kan, and T. Liu (2021) DADgraph: a discourse-aware dialogue graph neural network for multiparty dialogue machine reading comprehension. ArXiv preprint abs/2104.12377. External Links: Link Cited by: §2.2.
  • J. Li, Z. Zhang, H. Zhao, X. Zhou, and X. Zhou (2020b) Task-specific objectives of pre-trained language models for dialogue adaptation. arXiv preprint arXiv:2009.04984. Cited by: §1.
  • T. Li, J. Gu, X. Zhu, Q. Liu, Z. Ling, Z. Su, and S. Wei (2020c) DialBERT: a hierarchical pre-trained model for conversation disentanglement. ArXiv preprint abs/2004.03760. External Links: Link Cited by: §1, §1, §2.1, §2.3, §3.2.1, §3.2.3, Table 1.
  • H. Liu, Z. Shi, J. Gu, Q. Liu, S. Wei, and X. Zhu (2020) End-to-end transition-based online dialogue disentanglement. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, C. Bessiere (Ed.), pp. 3868–3874. External Links: Document, Link Cited by: §1, §2.3.
  • H. Liu, Z. Shi, and X. Zhu (2021a) Unsupervised conversation disentanglement through co-training. ArXiv preprint abs/2109.03199. External Links: Link Cited by: §1, §1.
  • L. Liu, Z. Zhang, H. Zhao, X. Zhou, and X. Zhou (2021b) Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue. In The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Cited by: §1, §2.2.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §4.3.
  • R. Lowe, N. Pow, I. Serban, and J. Pineau (2015) The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czech Republic, pp. 285–294. External Links: Document, Link Cited by: §1, §1, §2.2.
  • R. Lowe, N. Pow, I. V. Serban, L. Charlin, C. Liu, and J. Pineau (2017) Training end-to-end dialogue systems with the ubuntu dialogue corpus. Dialogue & Discourse 8 (1), pp. 31–65. Cited by: Table 1.
  • X. Ma, Z. Zhang, and H. Zhao (2021) Enhanced speaker-aware multi-party multi-turn dialogue comprehension. ArXiv preprint abs/2109.04066. External Links: Link Cited by: §2.2.
  • S. Mehri and G. Carenini (2017) Chat disentanglement: identifying semantic reply relationships with random forests and recurrent neural networks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 615–623. External Links: Link Cited by: §1.
  • S. Ouyang, Z. Zhang, and H. Zhao (2021) Dialogue graph modeling for conversational machine reading. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 3158–3169. External Links: Document, Link Cited by: §2.2.
  • A. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2249–2255. External Links: Document, Link Cited by: Table 1.
  • S. Reddy, D. Chen, and C. D. Manning (2019) CoQA: a conversational question answering challenge. Transactions of the Association for Computational Linguistics 7, pp. 249–266. External Links: Document, Link Cited by: §1.
  • M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2018) Modeling Relational Data with Graph Convolutional Networks. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3-7, 2018, Proceedings, A. Gangemi, R. Navigli, M. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Lecture Notes in Computer Science, Vol. 10843, pp. 593–607. External Links: Document, Link Cited by: §3.2.2.
  • D. Shen, Q. Yang, J. Sun, and Z. Chen (2006) Thread detection in dynamic text message streams. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 35–42. Cited by: §1, §1, §2.3.
  • Z. Shi and M. Huang (2019) A deep sequential model for discourse parsing on multi-party dialogues. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pp. 7007–7014. External Links: Document, Link Cited by: §3.2.2.
  • K. Sun, D. Yu, J. Chen, D. Yu, Y. Choi, and C. Cardie (2019) DREAM: a challenge data set and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics 7, pp. 217–231. External Links: Document, Link Cited by: §2.2.
  • M. Tan, D. Wang, Y. Gao, H. Wang, S. Potdar, X. Guo, S. Chang, and M. Yu (2019) Context-aware conversation thread detection in multi-party chat. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6456–6461. External Links: Document, Link Cited by: §1.
  • D. R. Traum, S. Robinson, and J. Stephan (2004) Evaluation of multi-party virtual reality dialogue interaction. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. External Links: Link Cited by: §2.3.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2692–2700. External Links: Link Cited by: §2.3.
  • W. Wang, S. C.H. Hoi, and S. Joty (2020) Response selection for multi-party conversations with dynamic topic tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6581–6591. External Links: Document, Link Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Document, Link Cited by: §4.3.
  • L. Xu, Z. Jie, W. Lu, and L. Bing (2021) Better feature integration for named entity recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 3457–3469. External Links: Document, Link Cited by: §3.2.3.
  • Y. Xu and H. Zhao (2021) Dialogue-oriented pre-training. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 2663–2673. External Links: Document, Link Cited by: §1.
  • Z. Yang and J. D. Choi (2019) FriendsQA: open-domain question answering on TV show transcripts. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 188–197. External Links: Document, Link Cited by: §2.2.
  • T. Yu and S. Joty (2020) Online conversation disentanglement with pointer networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6321–6330. External Links: Document, Link Cited by: §1, §2.3, Table 1.
  • Z. Zhang, J. Li, P. Zhu, H. Zhao, and G. Liu (2018) Modeling multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3740–3752. External Links: Link Cited by: §1.
  • H. Zhu, F. Nan, Z. Wang, R. Nallapati, and B. Xiang (2020) Who did they respond to? conversation structure modeling using masked hierarchical transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 9741–9748. Cited by: §1, §1, §2.1, §2.3, §2.3, §3.2.1, Table 1.