Analyzing Assumptions in Conversation Disentanglement Research Through the Lens of a New Dataset and Model

by Jonathan K. Kummerfeld, et al.

Disentangling conversations mixed together in a single stream of messages is a difficult task with no large annotated datasets. We created a new dataset that is 25 times the size of any previous publicly available resource, has samples of conversation from 152 points in time across a decade, and is annotated with both threads and a within-thread reply-structure graph. We also developed a new neural network model, which extracts conversation threads substantially more accurately than prior work. Using our annotated data and our model we tested assumptions in prior work, revealing major issues in heuristically constructed resources, and identifying how small datasets have biased our understanding of multi-party multi-conversation chat.





1 Introduction

When a group of people communicate in a common channel there are often multiple conversations occurring concurrently. Some online multi-party conversations have explicit structure, such as separate threads in forums, and reply nesting in comments. This work is concerned with communication channels that do not have explicit structure, such as Internet Relay Chat (IRC), and group chats in Google Hangouts, WhatsApp, and Snapchat. Figure 1 shows an example of two entangled threads in our data. Disentangling these conversations would help users understand what is happening when they join a channel and search over conversation history. It would also give researchers access to enormous resources for studying synchronous dialog. More than a decade of research has considered the disentanglement problem (Shen et al., 2006), but using datasets that are either small (Elsner and Charniak, 2008) or not publicly released (Adams and Martell, 2008; Mayfield et al., 2012).

[03:05] <delire> hehe yes. does Kubuntu have ’KPackage’?

=== delire found that to be an excellent interface to the apt suite in another distribution.

=== E-bola [...@...] has joined #ubuntu

[03:06] <BurgerMann> does anyone know a consoleprog that scales jpegs fast and efficient?.. this digital camera age kills me when I have to scale photos :s

[03:06] <Seveas> delire, yes

[03:06] <Seveas> BurgerMann, convert

[03:06] <Seveas> part of imagemagick

=== E-bola [...@...] has left #ubuntu []

[03:06] <delire> BurgerMann: ImageMagick

[03:06] <Seveas> BurgerMann, i used that to convert 100’s of photos in one command

[03:06] <BurgerMann> Oh... I’ll have a look.. thx =)
Figure 1: #Ubuntu IRC log sample, earliest message first. Curved lines are our annotations of reply structure, which implicitly defines two clusters of messages (threads).

This paper makes three key contributions: (1) we release a new dataset of over 62,000 messages of IRC manually annotated with graphs of reply structure between messages, (2) we propose a simple neural network model for disentanglement, and (3) using our data and model, we investigate assumptions in prior work. Our dataset is 25 times larger than any other publicly available resource, and is the first we are aware of to include context: messages from the channel immediately prior to each annotated sample. (All code and data will be made available under a permissive use license by January 2019.) Across a range of metrics, our model has substantially higher performance than previous disentanglement approaches. Finally, our analysis shows that a range of assumptions do not generalise across IRC datasets, and identifies issues in the widely used set of disentangled threads from Lowe et al. (2015).

This resource will enable the development of data-driven systems for conversational disentanglement, opening up access to a section of online discourse that has previously been difficult to understand and study. Our analysis will help guide future work to avoid the assumptions of prior research.

2 Background

The most significant work on conversation disentanglement is a line of papers developing data and models for the #Linux IRC channel (Elsner and Charniak, 2008; Elsner and Schudy, 2009; Elsner and Charniak, 2010, 2011). Until now, their dataset was the only publicly available set of messages and annotations (later partially re-annotated by Mehri and Carenini (2017)), and has been used to train and evaluate a range of subsequent systems (Wang and Oard, 2009; Mehri and Carenini, 2017; Jiang et al., 2018). They explored multiple models using various feature sets and linear classifiers, combined with local and global inference methods. Their system is the only publicly released statistical model for disentanglement of chat conversation. We compare to their most effective approach and build on their work, providing new annotations of their data that include more detail, specifically the reply structures inside threads.

There has been additional work on disentanglement of IRC discussion, though with unreleased datasets. Adams and Martell (2008) annotated threads in multiple channels and explored disentanglement and topic identification. The French Ubuntu channel was studied by Riou et al. (2015), who annotated threads and relations between messages, and by Hernandez et al. (2016), who annotated dialogue acts and sentiment.

IRC is not the only form of synchronous group conversation online. Other platforms with similar communication formats have been studied in settings such as classes (Wang et al., 2008; Dulceanu, 2016), support communities (Mayfield et al., 2012), and customer service (Du et al., 2017). Unfortunately, only one of these resources, from Dulceanu (2016), is available, possibly due to privacy concerns.

In terms of models, most work is similar to Elsner and Charniak (2008), with linear feature-based pairwise models, followed by either greedy decoding or a constrained global search. Recent work has applied neural networks (Mehri and Carenini, 2017; Jiang et al., 2018), with slight gains in performance.

The data we consider, logs of the #Ubuntu IRC channel, was first identified as a useful source by Uthus and Aha (2013c). In follow up work, they identified messages relevant to the Unity desktop environment (Uthus and Aha, 2013b), and whether questions can be answered by the channel bot alone (Uthus and Aha, 2013a). Lowe et al. (2015, 2017) constructed a version of the data with heuristically extracted conversation threads. Their work opened up a new opportunity by providing 930,000 threads containing 100 million tokens, and has already been the basis of many papers (218 citations), particularly on developing dialogue agents (Hernandez et al., 2016; Zhou et al., 2016; Ouchi and Tsuboi, 2016; Serban et al., 2017; Wu et al., 2017; Zhang et al., 2018, inter alia). Using our data we provide the first empirical evaluation of their heuristic thread disentanglement approach, and using our model we extract a smaller, but higher quality dataset of 114,201 threads.

Another stream of research has used user-provided structure to get thread labels (Shen et al., 2006; Domeniconi et al., 2016) and reply-to relations (Wang and Rosé, 2010; Wang et al., 2011; Aumayr et al., 2011; Balali et al., 2013, 2014; Chen et al., 2017). By removing these labels and mixing threads they create a disentanglement problem. While convenient, this risks introducing a bias, as people write differently when explicit structure is defined. These resources may be useful for partial supervision, but only a few publications have released their data (Abbott et al., 2016; Zhang et al., 2017; Louis and Cohen, 2015).

Within a conversation thread, we define a graph that expresses the reply relations between messages. To our knowledge, almost all prior work with annotated graph structures has been for threaded web forums (Kim et al., 2010; Schuth et al., 2007), which do not exhibit the disentanglement problem we explore. Two exceptions are Mehri and Carenini (2017) and Dulceanu (2016), who labeled small samples of chat.

3 Data

| Dataset | Messages | Parts | Part Length | Authors / part | Context / msg | Anno. | Data Avail. |
| Shen et al. (2006)* | 1,645 | 16 | 35–381 msg | 6–68 | n/a | 1 | No |
| Elsner and Charniak (2008)* | 2,500 | 1 | 5 hr | 379 | 0 | 1–6 | Yes |
| Adams and Martell (2008)* | 19,925 | 38 | 67–831 msg | ? | 0 | 3 | No |
| Wang et al. (2008) | 337 | 28 | 2–70 msg | ? | n/a | 1–2 | No |
| Mayfield et al. (2012)* | ? | 45 | 1 hr | 3–7 | n/a | 1 | No |
| Riou et al. (2015) | 1,429 | 2 | 12 / 60 hr | 21/70 | 0 | 2/1 | Req |
| Dulceanu (2016) | 843 | 3 | ½–1½ hr | 8–9 | n/a | 1 | Req |
| Guo et al. (2017) | 1,500 | 1 | 48 hr | 5 | n/a | 2 | No |
| Mehri and Carenini (2017) | 530 | 1 | 1.5 hr | 54 | 0 | 3 | Yes |
| This work - Pilot | 1,250 | 9 | 100–332 msg | 19–48 | 0–100 | 1–5 | Yes |
| This work - Train | 32,000 | 64 | 500 msg | 35–90 | 1000 | 1 | Yes |
| | 1,000 | 10 | 100 msg | 20–43 | 100 | 3+a | Yes |
| | 18,924 | 48 | 1 hr | 22–142 | 100 | 1 | Yes |
| This work - Dev | 2,500 | 10 | 250 msg | 76–167 | 1000 | 2+a | Yes |
| This work - Test | 5,000 | 10 | 500 msg | 79–221 | 1000 | 3+a | Yes |
| This work - Elsner | 2,600 | 1 | 5 hr | 387 | 0 | 2+a | Yes |
Table 1: Properties of IRC data with annotated threads: our data is significantly larger than prior work, and one of the only resources that is publicly available. ‘*’s indicate data with only thread clusters, not internal reply structure. Context is how many messages of text above the portion being annotated were included (n/a indicates that no conversation occurs prior to the annotated portion). ‘+a’ indicates there was an adjudication step to resolve disagreements. ‘Req’ means a dataset is available by request. ‘?’s indicate values where the information is not in the paper and the authors no longer have access to the data.

This work is based on a new, manually annotated dataset of over 62,000 messages of Internet Relay Chat (IRC). We focus on IRC because there are enormous publicly available logs of potential interest for dialogue research. In this section, we (1) outline our methodology for data selection and annotation, (2) show several measures of data quality, (3) present general properties of our data, and (4) compare with prior datasets.

Our data is drawn from the #Ubuntu help channel, which has formed the basis of prior work on dialogue (Lowe et al., 2015). We annotated the data with a graph structure in which messages are nodes and edges indicate reply-to relations. This implicitly defines a thread structure in which a cluster of messages is one conversation. Figure 1 shows an example of two entangled threads of messages and their graph structure. It contains an example of multiple responses to a single message, when multiple people independently help BurgerMann, and the inverse, when the last message responds to multiple messages. We also see two of the users, delire and Seveas, simultaneously participating in two conversations. This multi-conversation participation is common in the data, creating a challenging ambiguity.

In our annotations, a message that initiates a new conversation is linked to itself. When messages appear to be a response, but we cannot find the earlier message, they are left unlinked. Connected components in the graph are threads of conversation.
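Under this scheme, recovering threads from the link annotations is a connected-components computation. Below is a minimal sketch using union-find; the message indices and link pairs are illustrative, not our file format:

```python
from collections import defaultdict

def extract_threads(links):
    """Group messages into threads, given reply links as (message, antecedent)
    pairs of indices. A message that starts a new conversation is linked to
    itself, matching the annotation scheme described above."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for msg, antecedent in links:
        union(msg, antecedent)

    threads = defaultdict(list)
    for msg in parent:
        threads[find(msg)].append(msg)
    return [sorted(t) for t in threads.values()]

# Two entangled threads, as in Figure 1: messages 0 and 1 each start a
# conversation, and later messages reply across the stream.
links = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 3), (5, 0)]
print(sorted(extract_threads(links)))  # [[0, 2, 5], [1, 3, 4]]
```

Messages left unlinked (responses whose antecedent cannot be found) would simply form singleton components under this sketch.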

The example also shows two aspects of IRC we will refer to throughout the paper:

Directed messages, an informal practice in which a participant is named in the message. These cues are crucial for understanding the discussion, but not every message has them.

System messages, which indicate users entering and leaving the channel or changing their name. These all start with ===, but not all messages starting with === are system messages, as Figure 1 shows.

IRC conversations exist in a continuous channel that we are sampling from and so messages in our sample may be a response to messages that occur before the start of our sample. This is the first work we are aware of to consider this and include earlier messages as context.

3.1 Data Selection

Table 1 presents general properties of our data and compares with prior work on thread disentanglement in real-time chat. Here we describe how we chose the spans of time to annotate.

Part of the Pilot data was chosen at random, then after developing an initial annotation guide we considered times of particularly light and heavy use of the channel.

Dev and Test were sampled by choosing a random point in the logs and keeping the 500 messages after that point to be annotated and the 1,000 messages before that point as context.

The training set was sampled in three ways. Part of it was sampled the same way as Dev and Test. Part was sampled the same way, but with only 100 messages annotated (this was used to check annotator agreement). Finally, part of the training data contains time spans of one hour, chosen to sample a diverse range of conditions in terms of (1) the number of messages, (2) the number of participants, and (3) what percentage of messages are directed. We sorted the data independently for each factor and divided it into four equally sized buckets, then selected four random samples from each bucket, giving 48 samples (4 samples from 4 buckets for 3 factors).
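The bucketed sampling of one-hour spans can be sketched as follows; the span representation and factor names here are our own simplification of the procedure described above:

```python
import random

def bucket_samples(spans, factors, buckets=4, per_bucket=4, seed=0):
    """Stratified sampling: for each factor, sort all spans by that factor,
    split them into equal-sized buckets, and draw a few spans at random from
    each bucket. With 3 factors, 4 buckets, and 4 draws each, this yields
    the 48 samples described in the text."""
    rng = random.Random(seed)
    chosen = []
    for factor in factors:
        ranked = sorted(spans, key=lambda s: s[factor])
        size = len(ranked) // buckets
        for b in range(buckets):
            bucket = ranked[b * size:(b + 1) * size]
            chosen.extend(rng.sample(bucket, per_bucket))
    return chosen

# Hypothetical hour-long spans described by the three factors in the text.
spans = [{"messages": (i * 3) % 50,
          "participants": i % 11,
          "directed": (i * 7) % 100} for i in range(80)]
sample = bucket_samples(spans, ["messages", "participants", "directed"])
print(len(sample))  # 48
```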

Finally, we annotated the complete message log from Elsner and Charniak (2010), including the 100 message gap in their annotations. This enables a more direct comparison of annotation quality and model performance.

3.2 Methodology

Our annotation was performed in several stages. Most stages included an adjudication phase in which disagreements between annotators were resolved. To avoid bias, during resolution there was no indication of who had given which annotation, and there was the option to choose a different annotation entirely.

Pilot: We went through three rounds of pilot annotations to develop clear guidelines. In each round, annotators labeled a set of messages and then discussed all disagreements. Annotators that joined after the guide development were trained by labeling data we had previously labeled, and discussing each disagreement in a meeting with the lead author. Preliminary experiments with crowdsourcing yielded poor quality annotations, possibly due to the technical nature of the discussions or the difficulty of clearly and briefly explaining the task.

Train: To maximise the volume of annotated data, most training files were annotated by only a single annotator. However, a small set was triple annotated in order to check agreement.

Dev: Two annotators labeled the development data and a third performed adjudication.

Test: Three annotators labeled the test data and one of them went back a month later to adjudicate. The time gap was intended to reduce bias towards the annotator’s own choices.

Elsner: Two annotators labeled all of the data and a third adjudicated.

Annotations took between 7 and 11 seconds per message depending on the complexity of the discussion, and adjudication took around 5 seconds per message (lower because many messages did not have a disagreement, but not extremely low because those that did were the harder cases to label). Overall, we spent approximately 200 hours on annotation and 15 hours on adjudication. All annotations were performed using SLATE, a custom-built tool with features designed specifically for this task. We store each message on a single line and use pairs of line numbers to specify links between messages. Messages that start a new thread of discussion are linked to themselves.

The annotators were all fluent English speakers with a background in computer science (necessary to understand some of the technical discussions). All adjudication was performed by the lead author.

3.3 Annotation Quality

Graph Structure: We measure agreement on the graph structure annotation using Cohen (1960)’s Kappa, chosen because it measures inter-rater reliability with a correction for chance agreement. The first column of Table 2 shows that agreement levels vary across datasets, but our annotations fall into the good agreement range proposed by Altman (1990), and are slightly better than the graph annotations by Mehri and Carenini (2017). Results are not shown for Elsner and Charniak (2008) because their annotations do not include internal thread structure. For all of our datasets with multiple annotations, the final annotations are the result of an additional adjudication step in which the lead author considered every disagreement and chose a final annotation (though we also provide the individual annotations in our data release).

Threads: We can also measure agreement on threads, defined as connected components in the graph structure. This removes a common source of ambiguity: when it is clear that a message is part of a conversation but is unclear which specific previous message(s) it is a response to. However, it also makes comparisons more difficult, because there is no clear mapping from one set of threads to another, ruling out common agreement measures such as Cohen’s κ and Krippendorff’s α. We consider three metrics:

Exact Match F: The standard F score, using the count of perfectly matching clusters and the overall count of clusters for each annotator.

One-to-One Overlap (1-1, Elsner and Charniak, 2010): Percentage overlap when clusters from two annotations are optimally paired up using the max-flow algorithm.

Variation of Information (VI, Meila, 2007): A measure of information gained or lost when going from one clustering to another. Specifically, it is the sum of conditional entropies H(A|B) + H(B|A), where A and B are clusterings of the same set of items. We consider a scaled version, using the bound that VI(A;B) ≤ log(n), where n is the number of items, and present 100(1 − VI/log(n)) so that larger values are better across all metrics.

We retain system messages across all metrics, as in Mehri and Carenini (2017), giving slightly higher values than Elsner and Charniak (2010). When there are more than two sets of annotations for a sample we average the pairwise scores.
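As an illustration, the scaled VI agreement measure can be computed as below (a sketch assuming natural logarithms and the log(n) bound; it returns values in [0, 1], with the tables reporting the equivalent 0–100 scale):

```python
import math
from collections import Counter

def scaled_vi(a, b):
    """Variation of Information between two clusterings of the same items,
    scaled by its log(n) bound and flipped so that larger is better.
    `a` and `b` map each item to a cluster label."""
    n = len(a)
    joint = Counter((a[i], b[i]) for i in a)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    # VI(A;B) = H(A|B) + H(B|A) = 2*H(A,B) - H(A) - H(B)
    vi = 2 * entropy(joint) - entropy(Counter(a.values())) - entropy(Counter(b.values()))
    return 1 - vi / math.log(n) if n > 1 else 1.0

a = {i: i // 3 for i in range(6)}   # items 0-2 in one cluster, 3-5 in another
b = {i: 0 for i in range(6)}        # everything in a single cluster
print(round(scaled_vi(a, a), 3))    # 1.0 (identical clusterings)
print(scaled_vi(a, b) < 1.0)        # True
```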

| Data | κ | F | 1-1 | VI |
| Train | 0.71 | 72.0 | 85.0 | 94.2 |
| Dev | 0.72 | 68.0 | 83.8 | 94.0 |
| Test | 0.74 | 74.7 | 83.8 | 95.0 |
Elsner and Charniak (2010) data:
| This work, all | 0.72 | 82.3 | 75.9 | 90.4 |
| This work, pilot | 0.68 | 81.6 | 82.4 | 90.9 |
| Elsner (2008), pilot | - | 88.9 | 73.5 | 85.6 |
| This work, dev | 0.74 | 87.0 | 81.7 | 92.2 |
| Mehri (2017), dev | 0.67 | 58.1 | 80.7 | 91.3 |
| This work, test | 0.73 | 77.4 | 66.6 | 84.3 |
| Elsner (2008), test | - | 80.0 | 62.4 | 80.8 |
Table 2: Agreement metrics for graph structure (κ) and disentangled threads (F, 1-1, VI). Our annotations are comparable to or better than prior work, and in the good agreement range proposed by Altman (1990).

Table 2 presents results across all of our sets and subsets of the data from Elsner and Charniak (2010). As we observed for Kappa, our agreement is higher than the data from Mehri and Carenini (2017). Comparison to the annotations from Elsner and Charniak (2010) shows a more mixed picture. On both the pilot and test sets, our exact match score is lower, but the other two metrics are higher. Manually inspecting comparisons of the annotations, we observed that ours have more clusters that differ by one or two messages.

Finally, note two interesting patterns in Table 2. First, there is substantial variation in scores across datasets labeled by the same annotators. Second, the test set from Elsner and Charniak (2010) has consistently lower levels of agreement. From these we conclude that there is substantial variation in the difficulty of thread disentanglement across datasets. This is supported by observations from Riou et al. (2015), who measured agreement on a 200 message sample of French Ubuntu IRC. They described their data as less entangled than Elsner’s data and observed that a baseline system linking all messages less than 4 minutes apart scores 71 on the 1-1 metric (compared with a maximum score of 35 on Elsner’s data when tuning the time gap parameter to its optimal value).

| Property | Value |
| Annotated Messages | 62,024 |
| Non-System Messages | 86% |
| Directed Messages | 43% |
| Messages With 2+ Responses | 17% |
| Messages With 2+ Antecedents | 1.2% |
| Average Response Distance | 6.6 msg |
| Threads with 2+ messages | 5,219 |
Table 3: Properties of our annotated data, calculated across all sets except the Pilot data.

3.4 Data Properties

Table 3 presents key properties of our data beyond the information in Table 1. From these values we can see that the option to annotate graph structures is used, as multiple responses and multiple antecedents do occur, though most structure could be represented by trees, as multiple antecedents are rare.

4 Models

Our new dataset makes it possible to train models that identify conversation structure. In this section we propose a new model, which we use in Section 7 as part of our analysis of assumptions in prior work. We considered a range of models and inference methods, and compare with several prior systems for IRC disentanglement.

No Links and Link to Previous

We include two methods that do not meaningfully separate threads at all, but provide a reference point for metrics. No Links does not link any pair of messages. Link to Previous links every message to the previous non-system message and leaves system messages unlinked. For graphs, since the most common case is to link to the previous message, and the second most common case is to link to nothing, these are similar to majority class baselines. For threads, these correspond to placing every message in a separate thread, or all messages in a single thread, which are common baselines in clustering.
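These two reference points can be sketched as follows, with links expressed as (message, antecedent) index pairs and a hypothetical per-message `system` flag:

```python
def no_links(messages):
    """No Links baseline: no message is linked to any other, so every
    message ends up in its own thread."""
    return []

def link_to_previous(messages):
    """Link to Previous baseline: link each non-system message to the most
    recent earlier non-system message; system messages stay unlinked, so
    all non-system messages fall into a single thread."""
    links, prev = [], None
    for i, msg in enumerate(messages):
        if msg["system"]:
            continue
        if prev is not None:
            links.append((i, prev))
        prev = i
    return links

msgs = [{"system": False}, {"system": True}, {"system": False}, {"system": False}]
print(no_links(msgs))          # []
print(link_to_previous(msgs))  # [(2, 0), (3, 2)]
```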

Elsner and Charniak (2008)

This system has two phases, one to score pairs of messages and one to extract a set of threads. The scoring phase uses a linear maximum-entropy model that predicts whether two messages are from the same thread using a range of features based on metadata (e.g. time of message), text content (e.g. use of words from a technical jargon list), and a model of unigram probabilities trained on additional unlabeled data. To avoid label bias and reduce computational cost, only messages within 129 seconds of each other are scored. The extraction phase uses a greedy single pass in which each message is compared with earlier messages and linked to the one with the highest score from phase one, or to no message if all scores are less than zero.

We retrained the unigram probability model and the max-ent model using our training set. We also ran the system with three settings of the cutoff parameter that determines the longest link to consider, since we suspected that the 129 second cutoff in the original work may be too short.

Lowe et al. (2017) (no code was provided with the paper, but it was later released)

This work focused on dialogue modeling, but used a rule-based approach to extract conversations from the same Ubuntu channel we consider. Their heuristic starts with a directed message and looks back over the last three minutes for a message from the target recipient. If found, these messages are linked to start a conversation thread, and subsequent directed messages between the two users are added. Undirected messages from either of the two participants are added if the speaker is not part of any other conversation thread at the same time. In the paper, conversations were filtered out if they had fewer than two participants or fewer than three messages, but we disabled this filter for our analysis to enable a fairer comparison.
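A rough sketch of the initial pairing rule in this heuristic, using our own simplified message representation (integer `time` in seconds and `user`/`target` fields are illustrative, not the released implementation's):

```python
WINDOW = 180  # the three-minute look-back window, in seconds

def start_threads(messages):
    """For each directed message, scan backwards for a message from the
    named target within the window; if found, link the pair as the start
    of a conversation thread. Later steps of the heuristic (adding
    follow-up and undirected messages) are omitted from this sketch."""
    threads = []
    for i, msg in enumerate(messages):
        if msg["target"] is None:  # undirected messages cannot start a thread
            continue
        for j in range(i - 1, -1, -1):
            earlier = messages[j]
            if msg["time"] - earlier["time"] > WINDOW:
                break  # outside the three-minute window
            if earlier["user"] == msg["target"]:
                threads.append((j, i))  # earlier message starts the thread
                break
    return threads

msgs = [
    {"time": 0,  "user": "Seveas",     "target": None},
    {"time": 60, "user": "BurgerMann", "target": None},
    {"time": 90, "user": "delire",     "target": "BurgerMann"},
]
print(start_threads(msgs))  # [(1, 2)]
```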

This heuristic was developed to be run on the entire Ubuntu dataset, not small samples, and so we provided it with extra context. Instead of providing 1,000 messages before and no messages after each test sample, we provided the entire day from which the sample came. If the 1,000 prior messages extended into the previous day, we provided that entire day as well.

Additional Systems

We could not run these systems on our data as the code is not available, but we compare with their results on the Elsner data.

Wang and Oard (2009) use a weighted combination of three estimates of the probability that two messages are in the same thread. The estimates are based on time differences and two measures of content similarity. For each message they find the highest scoring previous message and create a link if the score is above a tuned threshold.

Mehri and Carenini (2017) took a system combination approach with several models: (1) an RNN that encodes context messages and scores potential next messages, (2) a random forest classifier that uses the RNN’s score plus hand-crafted features to predict whether one message is a response to another, (3) a variant of the second model that is trained to predict whether two messages are in the same thread, and (4) another random forest classifier that takes the scores from the previous three models and additional features to predict whether a message belongs in a thread.

4.1 New Statistical Models

We explored a range of approaches, combining different statistical models and inference methods. We used the development set to identify the most effective methods and to tune all hyperparameters.

4.1.1 Inference

First, we treated the problem as a binary classification task, independently scoring every pair of messages. Second, we formulated it as a multi-class task, where a message is linked to itself or to one of the 100 preceding messages (in theory, a message could respond to any previous message; we impose the limit to save computation). Third, we applied greedy search so that previous decisions could inform the current choice. We leave further exploration of structured global search for future work. In all three cases, we experimented with both cross-entropy and hinge loss functions.

On the development set we found the first approach had poor performance due to class imbalance, while the third approach did not yield consistent improvements over the second. At test time when using the second approach, we select the single highest scoring option, leading to tree structured output. Experiments with methods to extract graphs did not improve performance.
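The resulting test-time inference can be sketched as below; `score` stands in for the trained pairwise model, and the toy scorer here is purely illustrative:

```python
def predict_links(messages, score, window=100):
    """For each message, score a link to itself and to each of the `window`
    preceding messages, and keep the single highest-scoring antecedent.
    Because every message gets exactly one antecedent, the output is
    tree-structured."""
    links = []
    for i in range(len(messages)):
        candidates = range(max(0, i - window), i + 1)  # includes self-link i
        best = max(candidates, key=lambda j: score(i, j))
        links.append((i, best))
    return links

# Toy scorer: prefer the nearest earlier message by the same user, else self.
msgs = ["a", "b", "a", "b"]
def score(i, j):
    if i == j:
        return 0.0
    return 1.0 - 0.01 * (i - j) if msgs[i] == msgs[j] else -1.0

print(predict_links(msgs, score))  # [(0, 0), (1, 1), (2, 0), (3, 1)]
```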

4.1.2 Model

For all of our models the input contains two versions of the messages, one that is just split on whitespace, and one that has (1) tokenisation, (2) usernames replaced with a special symbol, and (3) rare words (less than 100 occurrences in the complete logs) replaced with a word shape token. Using the messages and associated metadata, the model uses a range of manually defined features as input. We explored a range of models, all implemented using DyNet (Neubig et al., 2017).
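A sketch of producing the second message view, under an assumed tokenisation and word-shape scheme (the exact shapes are our own illustration; the paper does not specify them):

```python
import re

def normalise(message, vocab_counts, users, min_count=100):
    """Tokenise, replace usernames with a placeholder, and replace rare
    words (below min_count occurrences in the full logs) with a word-shape
    token: lowercase letters become 'a', digits become 'd'."""
    tokens = re.findall(r"\w+|[^\w\s]", message.lower())
    out = []
    for tok in tokens:
        if tok in users:
            out.append("<USER>")
        elif vocab_counts.get(tok, 0) < min_count:
            shape = re.sub(r"\d", "d", re.sub(r"[a-z]", "a", tok))
            out.append(shape)
        else:
            out.append(tok)
    return out

counts = {"does": 500, "have": 400, "kubuntu": 3}
print(normalise("delire: does Kubuntu have KPackage2?", counts, {"delire"}))
# ['<USER>', ':', 'does', 'aaaaaaa', 'have', 'aaaaaaaad', '?']
```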

We used a feedforward neural network, exploring a range of configurations using the development set. We varied the number of layers, nonlinearity type, hidden dimensions, input features, and sentence representations. We also explored alternative structures including adding features and sentence representations from messages before and after the current message.

For training we varied the gradient update method, the loss type, batch sizes, weights for the loss based on error types, gradient clipping, and dropout (both at input and the hidden layers).

The best model was relatively simple, with one fully-connected hidden layer, a 128-dimensional hidden vector, tanh non-linearities, and sentence representations that averaged 100-dimensional word embeddings trained using Word2Vec (Mikolov et al., 2013) on messages from the entire Ubuntu IRC channel. We trained the model with a cross-entropy loss, no dropout, and stochastic gradient descent using a learning rate of 0.1 for up to 100 iterations over the training data, with early stopping after no improvement in 10 iterations, saving the model that scored highest on the development set. It was surprising that dropout did not help, but this was consistent with another observation: the model was not overfitting the training set, generally scoring almost the same on the training and development data.
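For concreteness, a pure-Python sketch of this configuration with untrained random weights (the real model is trained in DyNet with cross-entropy and SGD; the feature values and embeddings here are placeholders):

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def avg_embedding(tokens, emb, dim=100):
    """Sentence representation: the average of the word embeddings."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

class PairScorer:
    """One tanh hidden layer of width 128 over pair features plus averaged
    100-dim embeddings of both messages, producing a scalar link score."""
    def __init__(self, n_features, emb_dim=100, hidden=128):
        self.W1 = rand_matrix(hidden, n_features + 2 * emb_dim)
        self.w2 = [random.gauss(0, 0.1) for _ in range(hidden)]

    def score(self, features, msg_a, msg_b, emb):
        x = features + avg_embedding(msg_a, emb) + avg_embedding(msg_b, emb)
        h = [math.tanh(z) for z in matvec(self.W1, x)]
        return sum(w * v for w, v in zip(self.w2, h))

emb = {"convert": [random.gauss(0, 1) for _ in range(100)]}
model = PairScorer(n_features=3)
print(model.score([1.0, 0.0, 0.5], ["convert"], ["imagemagick"], emb))
```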

We also considered a linear version of the model using the same input, output, and training configuration, just no hidden layer and nonlinearity.

Finally, we tried three simple forms of model combination. We trained the model ten times, varying just the random seed. To combine the outputs we considered: keeping the union of all predicted links (x10 union), keeping the most common prediction for each message (x10 vote), and forming threads, then keeping only threads on which all ten models agreed (x10 intersect).
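The union and vote combination schemes can be sketched as follows, with each run's output given as a message-to-antecedent map (the thread-intersection variant would additionally build threads and keep only those on which all runs agree):

```python
from collections import Counter

def combine_votes(predictions):
    """x10 vote: keep the most common predicted antecedent per message."""
    combined = {}
    for msg in predictions[0]:
        votes = Counter(p[msg] for p in predictions)
        combined[msg] = votes.most_common(1)[0][0]
    return combined

def combine_union(predictions):
    """x10 union: keep every link predicted by any run."""
    return {(m, a) for p in predictions for m, a in p.items()}

runs = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 0, 1: 0}]
print(combine_votes(runs))          # {0: 0, 1: 0}
print(sorted(combine_union(runs)))  # [(0, 0), (1, 0), (1, 1)]
```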

5 Metrics

| System | Graph P | Graph R | Graph F | VI | ARand | AMI | 1-1 | Thread P | Thread R | Thread F |
| No links | 18.3 | 17.7 | 18.0 | 59.7 | 12.6 | 33.1 | 19.9 | 0.0 | 0.0 | 0.0 |
| Link to previous | 35.7 | 34.4 | 35.0 | 66.1 | 14.7 | 38.7 | 27.6 | 0.0 | 0.0 | 0.0 |
| Lowe (2017) | [18.1] | [17.5] | [17.8] | 66.5 | 12.3 | 21.1 | 29.5 | 0.0 | 0.0 | 0.0 |
| Elsner (2010) | [53.8] | [51.9] | [52.9] | 82.1 | 35.5 | 60.9 | 51.4 | 12.1 | 21.5 | 15.5 |
| Elsner 5 minutes | [55.0] | [53.1] | [54.1] | 83.4 | 39.6 | 64.9 | 55.9 | 15.2 | 25.7 | 19.1 |
| Elsner 30 minutes | [53.0] | [51.1] | [52.1] | 82.5 | 35.2 | 61.3 | 52.9 | 13.9 | 25.7 | 18.0 |
| Linear | 63.0 | 60.8 | 61.9 | 87.2 | 55.3 | 73.3 | 66.5 | 16.7 | 22.3 | 19.1 |
| Feedforward | 64.7 | 62.4 | 63.5 | 88.7 | 57.1 | 77.5 | 69.8 | 20.4 | 26.2 | 22.9 |
| x10 union | 60.1 | 66.5 | 63.1 | 88.5 | 60.0 | 80.1 | 68.5 | 23.2 | 25.4 | 24.3 |
| x10 vote | 65.1 | 62.8 | 63.9 | 88.5 | 56.4 | 76.5 | 69.4 | 20.1 | 26.6 | 22.9 |
| x10 intersect | - | - | - | 69.3 | 3.4 | 7.2 | 27.6 | 38.6 | 18.1 | 24.6 |
Table 4: Test set results: our new model is substantially better than prior work. Thread level precision, recall, and F-score are only counted over threads with more than one message. Values in [] are for methods that produce threads only, which we converted to graphs by linking each message to the one immediately before it.

We consider evaluations of both the graph structure and thread structure. In both cases we compare system output with the adjudicated test labels. To make interpretation simple, we present all metrics on a scale between 0 and 100, with higher being better. For the feedforward model we present results averaged over ten models with different random seeds.

For graphs we calculate precision, recall, and F-score over links. To compare with systems that produce threads without internal graph structure, we convert their output by making the assumption that each message in a thread is a direct response to the message immediately before it. Clearly this is an approximation, but it allows us to apply the same metrics for evaluation and get some sense of relative performance. These results are marked with square brackets in our results. For our thread-level metrics we use their original output.

For thread-level evaluation, it is less clear what the best metric is. We consider three metrics from the literature on comparing clusterings, and two that are specific to our task:

Variation of Information (VI, Meila, 2007): The same as in Section 3.3.

Adjusted Rand Index (ARand, Hubert and Arabie, 1985): A variant of the Rand (1971) Index, which considers whether pairs of items are in the same or different clusters in the two clusterings. The variant corrects for random agreement by rescaling using the expected Rand index.

Adjusted Mutual Information (AMI, Vinh et al., 2010): A variation of mutual information that is corrected for random agreement by rescaling using the expected mutual information.

One-to-One (1-1, Elsner and Charniak, 2010): The same as in Section 3.3.

Exact Match: Almost as in Section 3.3, except we exclude threads with only one message. This makes the metric harder, but focuses it on what we are most interested in: extracting conversations.
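This thread-level exact match F-score, excluding single-message threads as described, can be sketched as:

```python
def exact_match_f(gold_threads, pred_threads):
    """F-score over threads with more than one message: a predicted thread
    counts as correct only if its set of messages matches a gold thread
    exactly."""
    gold = {frozenset(t) for t in gold_threads if len(t) > 1}
    pred = {frozenset(t) for t in pred_threads if len(t) > 1}
    if not gold or not pred:
        return 0.0
    matched = len(gold & pred)
    p = matched / len(pred)
    r = matched / len(gold)
    return 2 * p * r / (p + r) if matched else 0.0

gold = [[0, 2, 5], [1, 3, 4], [6]]
pred = [[0, 2, 5], [1, 3], [4], [6]]
print(round(exact_match_f(gold, pred), 2))  # 0.5
```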

6 Results

Table 4 presents the results on our test set. Looking first at the graph structure results we see three clear regions of performance. First, the three heuristic methods, which have very low scores across precision and recall. Second, Elsner and Charniak (2010)'s system, which appears to benefit from our increase in the maximum link length from 2.15 to 5 minutes, but then degrades when the length is further increased. Third, our methods, with a clear gap between the linear model and the neural network. As we would expect, the union of 10 models has the highest recall, while the voting approach has the highest precision.

For thread extraction there is a clear separation in performance between the heuristic methods and the learned approaches. In general, the first four metrics follow the same trends. One notable exception is that Lowe et al. (2017)'s method scores higher on VI and 1-1 than the rule-based baselines, but lower on ARand and AMI. For our challenging calculation of precision and recall over complete threads we see relatively low scores for all systems, though the x10 intersect approach boosts precision substantially. For the purpose of extracting a dataset of conversations, trading recall for precision is worthwhile: we end up with a slightly smaller set of higher quality conversations.

System              1-1   Local  Shen-F
Tested on original annotations
Wang (2009)         47.0  75.1   52.8
Elsner (2010)       50.7  83.4   52.7
Mehri (2017)        55.2  78.6   56.6
Our model, Elsner   53.5  75.8   54.4
Our model, Ubuntu   55.0  81.7   58.0
Our model, Both     53.5  79.4   55.9
Tested on our annotations
Elsner (2010)       54.1  81.2   56.3
Our model, Elsner   55.0  78.7   58.4
Our model, Ubuntu   59.7  83.2   62.8
Our model, Both     60.7  82.2   62.6
Table 5: Results on variations of the Elsner dataset. Our model is trained using our annotations, with either the Elsner data only (Elsner), our new Ubuntu data only (Ubuntu), or both.

6.1 Elsner data

Table 5 presents results on Elsner and Charniak (2010)'s data, with the original annotations and with our new annotations. For the original annotations we include results from prior work, except Jiang et al. (2018), as their results are on a substantially modified version of the dataset (discussed in the next section). Here we follow prior work, using metrics defined by Shen et al. (2006, Shen-F) and Elsner and Charniak (2008, Local). Following Wang and Oard (2009) and Mehri and Carenini (2017), we report results with system messages included in the evaluation, unlike Elsner and Charniak (2008); we did, however, confirm the accuracy of our metric implementations by removing system messages and calculating results for Elsner and Charniak (2008)'s output. Local is a constrained form of the Rand index that only considers pairs of messages within a range of three messages of each other. Shen-F considers each gold thread and finds the predicted thread with the highest F-score relative to it, then averages these scores weighted by the size of the gold thread (note that this allows a predicted thread to match zero, one, or multiple gold threads). We chose not to include these in our primary evaluation as they have been superseded by more rigorously studied metrics (VI and AMI for Shen-F) or make assumptions that do not fit our objective (Local).
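The Shen-F computation described above is straightforward to sketch (our own reimplementation, assuming gold and predicted threads are given as sets of message ids; note that nothing stops one predicted thread matching several gold threads):

```python
# Shen-F: for each gold thread, take the best F-score any predicted thread
# achieves against it, then average weighted by gold-thread size.
def shen_f(gold, pred):
    n = sum(len(t) for t in gold)
    total = 0.0
    for g in gold:
        best = 0.0
        for p in pred:
            overlap = len(g & p)
            if overlap:
                prec, rec = overlap / len(p), overlap / len(g)
                best = max(best, 2 * prec * rec / (prec + rec))
        total += (len(g) / n) * best
    return total
```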

We observe several interesting trends. First, training only on our new Ubuntu data, treating the Elsner data as an out-of-domain sample, yields high performance, similar to or better than prior work. Training our model only on our annotations of the Elsner data performs comparably to Elsner and Charniak (2010)'s model (higher on some metrics, lower on others). Training on both has mixed impact, hurting performance when measured against the original annotations, and performing similarly on our annotations.

Comparing to prior work, when not training on our additional data the best system depends on the metric. Mehri and Carenini (2017)’s model combination system is substantially ahead on 1-1 and Shen-F, but substantially behind Elsner and Charniak (2010)’s model on Local. Our system appears to fall somewhere between the two.

7 Analysis

Using our new annotated data and our trained model we are able to investigate a range of assumptions made in prior work on disentanglement. In particular, most prior work has been biased by considering a relatively small sample from a single point in time from a single channel. While we also focus on a single channel, we have considerably more data, including samples from many different points in time, with which to evaluate these assumptions.

Length of message links

Limiting how far back a message can link is a common assumption made to reduce computational complexity. Elsner and Charniak (2010) and Mehri and Carenini (2017) limit links to 129 seconds, Jiang et al. (2018) limit them to within 1 hour, Guo et al. (2017) limit them to within 8 messages, and we limit them to within 100 messages. We can test these limits by looking at the time difference between consecutive messages in the threads based on our annotations: 96% of links are within 2 minutes, and virtually all are within an hour; 82.5% of links are to one of the last 8 messages, and 99.5% are to one of the last 100 messages. This suggests that the lower limits in prior work are too low. However, in our annotation of the Elsner data, 98% of messages are within 2 minutes, suggesting that this property is channel and sample dependent.
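Statistics like these are simple to recompute from annotated links; a hedged sketch, assuming each link is a (child index, parent index, child time, parent time) tuple with times in seconds (our own toy representation, not the dataset's format):

```python
# Fraction of reply links that satisfy a time and/or message-distance limit.
def fraction_within(links, max_seconds=None, max_messages=None):
    ok = 0
    for child, parent, t_child, t_parent in links:
        if max_seconds is not None and t_child - t_parent > max_seconds:
            continue
        if max_messages is not None and child - parent > max_messages:
            continue
        ok += 1
    return ok / len(links)
```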

A related assumption by Lowe et al. (2017) is that the first response to a question is within 3 minutes. Filtering the threads in our annotations to only those that start with a message explicitly annotated as not linking to anything else, we find this is true in 96% of threads.

Figure 2: Time between consecutive messages in conversations. Minor markers on the x-axis are at intervals defined by the last major marker. The upper right point is the sum over all larger values.

Lowe et al. (2017) also allow arbitrarily long links between later messages in a conversation, but claim dialogues longer than an hour are rare. To test this we measured the time between consecutive messages in a thread and plot the frequency of each value in Figure 2. (Note that the data from Lowe et al. (2017) changes the timestamps, sometimes adding four hours, and not in a way that takes into consideration the 12-hour clock used in part of the data. Since the values we show are all relative, we did not attempt to correct this variation; when their change led to a negative time difference, we did not count it.) The figure indicates that the threads they extract often do extend over days, or even more than a month (note the point in the top-right corner). In contrast, our annotations rarely contain links beyond an hour (including context messages, the files we annotated contained 3.5 hours of conversation on average, enabling the annotation of longer links if they were present), and the output of our model rarely contains links longer than 2 hours. We manually inspected 40 of their threads, half 12 to 24 hours long, and half longer than 24 hours. All of the longer conversations and 17 of the shorter ones incorrectly merged multiple conversations. The exceptions were two cases where a user thanked another user for their help the previous day, and one case where a user asked if another user ended up resolving their issue.

Number of concurrent threads

Adams and Martell (2008) constrain their annotators to label at most 3 concurrent threads, while Jiang et al. (2018) remove conversations from their data to ensure there are no more than 10 at once. In our data we find there are at most 3 concurrent threads 53% of the time (where time is measured in messages, not minutes), and at most 10 threads 99.5% of the time. Presumably the annotators in Adams and Martell (2008) would have proposed changes if the 3-thread limit was problematic, suggesting that their data is less entangled than ours.
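Concurrency can be measured per message rather than per minute; a minimal sketch, assuming a thread is active from its first to its last message and that messages are indexed by position in the log:

```python
# Number of active threads at each message position.
def concurrent_counts(threads):
    """threads: sets of message indices; returns one count per position."""
    spans = [(min(t), max(t)) for t in threads]
    last = max(end for _, end in spans)
    return [sum(1 for start, end in spans if start <= i <= end)
            for i in range(last + 1)]
```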

Lengths of conversations

Adams and Martell (2008) break their data into 200 message blocks for annotation, which prevents threads extending beyond that range. In our data, assuming conversations start at any point through the 200 message block, 92% would finish before the cutoff point. This suggests that their conversations are typically shorter, which is consistent with the previous conclusion that their threads are less entangled.

Jiang et al. (2018) remove threads with fewer than 10 messages, describing them as outliers. Not counting threads with only system messages, 82.6% of our conversations have fewer than 10 messages, and 39% of those have multiple authors, suggesting that such conversations are not outliers.

Message length

Jiang et al. (2018) remove messages shorter than 5 words, intending to focus on real conversations. In our data, 88.1% of messages with fewer than 5 words occur in conversations with more than one author, compared with 89.7% of other messages, suggesting that short messages do occur in real conversations. One possible explanation is that Jiang et al. (2018) may have overfit their filtering process to their Reddit data.


Lowe et al. (2017) make several other assumptions in the construction of their heuristic. First, that if all directed messages from a user are in one conversation, all undirected messages from the user are in the same conversation. We find this is true 57.9% of the time. Second, that it is rare for two people to respond to an initial question. In our data, of the messages that are explicitly labeled as starting a thread and that receive a response, 36.8% receive multiple responses. Third, that a directed message can start a conversation. For messages that are explicitly labeled as starting a thread, this is true 9.2% of the time. Overall, these assumptions have mixed support from our data, with only the third being clearly supported.


This analysis indicates that working from a single sample or a small number of samples can lead to major bias in system design for disentanglement. There is substantial variation across channels, and across time within a single channel. Our dataset provides a larger number of samples in time, but future work should expand the range of channels considered.

7.1 Dialogue Modeling

As mentioned earlier, the threads extracted by Lowe et al. (2017) have formed the basis of a series of papers on dialogue. Based on our observations above, we are not confident in the accuracy of their threads. To construct a new dataset that is comparable in style, but with more accurately extracted conversations, we applied the x10 intersect model to all of the Ubuntu logs, relaxing the constraint to require only 7 models to agree, and then applied several filters: (1) the first message is not directed, (2) there are exactly two participants (a questioner and a helper), not counting the channel bot, (3) no more than 80% of the messages are by a single participant, and (4) there are at least three messages. This gave 114,201 conversations. Anecdotally, spot-checking 100 conversations, we found that 75% looked accurate, slightly higher than the results across all conversations, as shown in Table 4.
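The four filters can be sketched as a single predicate (a hypothetical helper: the (author, is_directed) message representation and the bot name are our own assumptions, not the released pipeline):

```python
from collections import Counter

def keep_conversation(conv, bot="ubottu"):
    """conv: list of (author, is_directed) tuples, in log order."""
    if not conv or conv[0][1]:                     # (1) first message not directed
        return False
    authors = [a for a, _ in conv if a != bot]     # ignore the channel bot
    if len(set(authors)) != 2:                     # (2) exactly two participants
        return False
    if max(Counter(authors).values()) > 0.8 * len(authors):
        return False                               # (3) no author over 80%
    if len(conv) < 3:                              # (4) at least three messages
        return False
    return True
```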

Using the new conversations, we constructed a next utterance selection task, similar to the one introduced by Lowe et al. (2017). The task is to predict the next utterance in a thread given the messages so far. We cut threads off before a message from the helper and select 9 random utterances from helpers in the full dataset as negative options. We measure performance with Mean Reciprocal Rank, and by counting the percentage of cases when the system places the correct next utterance among the first k options (Recall@k).
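Scoring for this task can be sketched as follows (a toy version, assuming each example is a list of candidate scores with the correct next utterance at index 0; ties count against the system):

```python
# Mean Reciprocal Rank and Recall@k over next-utterance candidates.
def mrr_and_recall(score_lists, k=5):
    rr_total, hits = 0.0, 0
    for scores in score_lists:
        # 1-based rank of the correct candidate (index 0).
        rank = 1 + sum(1 for s in scores[1:] if s >= scores[0])
        rr_total += 1.0 / rank
        hits += rank <= k
    return rr_total / len(score_lists), hits / len(score_lists)
```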

Model  Test  Train  MRR   R@1   R@5
DE     Lowe  Lowe   0.75  0.61  0.94
DE     Lowe  Ours   0.63  0.45  0.90
DE     Ours  Lowe   0.72  0.57  0.93
DE     Ours  Ours   0.76  0.63  0.94
ESIM   Lowe  Lowe   0.82  0.72  0.97
ESIM   Lowe  Ours   0.69  0.53  0.92
ESIM   Ours  Lowe   0.78  0.67  0.95
ESIM   Ours  Ours   0.83  0.74  0.97
Table 6: Next utterance prediction results (MRR and Recall@k) with various models and training data variations. The decrease in performance when training on one set and testing on the other suggests they differ in content.

We applied two standard models to this task: a dual-encoder model (DE, Lowe et al., 2017) and the Enhanced LSTM model (ESIM, Chen et al., 2017). We implemented both models in Tensorflow (Abadi et al., 2015). Words were represented as the concatenation of (1) word embeddings initialised with 300-dimensional GloVe vectors, and (2) the output of a bidirectional LSTM over characters with 40-dimensional hidden vectors. All hidden layers were 200-dimensional, except the ESIM prediction layer, which had 256 dimensions. Finally, we limited the input sequence length to 180 tokens. We trained with batches of size 128 and the Adam optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.001 and an exponential decay of 0.95 every 5000 steps.

Table 6 shows results when varying the training and test datasets. For both models, a model trained on the same dataset as the test set performs better than one trained on the other dataset. This is consistent with the two datasets being different, despite being constructed from the same underlying data with similar filtering of conversations. Also, looking at the cases where the training and test datasets match (comparing rows 1 and 4, and rows 5 and 8), we see that the results are similar, despite the fact that our data contains far fewer conversations than Lowe et al. (2017)'s.

8 Conclusion

This work provides an exciting new resource for understanding synchronous multi-party conversation online. Our data includes a diverse range of samples across a decade of chat, with high quality annotations of threads and reply-structure. We have demonstrated how our data can be used to investigate properties of communication, calling into question assumptions in prior work. Models based on our data are more accurate than prior approaches, but there is still great scope for improvement and many potential directions of future exploration.


We would like to thank Jacob Andreas, Will Radford, and Glen Pink for helpful feedback on earlier drafts of this paper. This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM.