When a group of people communicate in a common channel there are often multiple conversations occurring concurrently. Some online multi-party conversations have explicit structure, such as separate threads in forums, and reply nesting in comments. This work is concerned with communication channels that do not have explicit structure, such as Internet Relay Chat (IRC), and group chats in Google Hangout, WhatsApp, and Snapchat. Figure 1 shows an example of two entangled threads in our data. Disentangling these conversations would help users understand what is happening when they join a channel and search over conversation history. It would also give researchers access to enormous resources for studying synchronous dialog. More than a decade of research has considered the disentanglement problem (Shen et al., 2006), but using datasets that are either small (Elsner and Charniak, 2008) or not publicly released (Adams and Martell, 2008; Mayfield et al., 2012).
This paper makes three key contributions: (1) we release a new dataset of over 62,000 messages of IRC manually annotated with graphs of reply structure between messages, (2) we propose a simple neural network model for disentanglement, and (3) using our data and model, we investigate assumptions in prior work. Our dataset is 25 times larger than any other publicly available resource, and is the first we are aware of to include context: messages from the channel immediately prior to each annotated sample (all code and data will be available under a permissive use license at https://github.com/jkkummerfeld/irc-disentanglement by January 2019). Across a range of metrics, our model has substantially higher performance than previous disentanglement approaches. Finally, our analysis shows that a range of assumptions do not generalise across IRC datasets, and identifies issues in the widely used set of disentangled threads from Lowe et al. (2015).
This resource will enable the development of data-driven systems for conversational disentanglement, opening up access to a section of online discourse that has previously been difficult to understand and study. Our analysis will help guide future work to avoid the assumptions of prior research.
The most significant work on conversation disentanglement is a line of papers developing data and models for the #Linux IRC channel (Elsner and Charniak, 2008; Elsner and Schudy, 2009; Elsner and Charniak, 2010, 2011). Until now, their dataset was the only publicly available set of messages and annotations (later partially re-annotated by Mehri and Carenini (2017)), and has been used to train and evaluate a range of subsequent systems (Wang and Oard, 2009; Mehri and Carenini, 2017; Jiang et al., 2018). They explored multiple models using various feature sets and linear classifiers, combined with local and global inference methods. Their system is the only publicly released statistical model for disentanglement of chat conversation. We compare to their most effective approach and build on their work, providing new annotations of their data that include more detail, specifically the reply structures inside threads.
There has been additional work on disentanglement of IRC discussion, though with unreleased datasets. Adams and Martell (2008) annotated threads in multiple channels and explored disentanglement and topic identification. The French Ubuntu channel was studied by Riou et al. (2015), who annotated threads and relations between messages, and by Hernandez et al. (2016), who annotated dialogue acts and sentiment.
IRC is not the only form of synchronous group conversation online. Other platforms with similar communication formats have been studied in settings such as classes (Wang et al., 2008; Dulceanu, 2016), support communities (Mayfield et al., 2012), and customer service (Du et al., 2017). Unfortunately, only one of these resources, from Dulceanu (2016), is available, possibly due to privacy concerns.
In terms of models, most work is similar to Elsner and Charniak (2008), with linear feature-based pairwise models, followed by either greedy decoding or a constrained global search. Recent work has applied neural networks (Mehri and Carenini, 2017; Jiang et al., 2018), with slight gains in performance.
The data we consider, logs of the #Ubuntu IRC channel, was first identified as a useful source by Uthus and Aha (2013c). In follow-up work, they identified messages relevant to the Unity desktop environment (Uthus and Aha, 2013b), and whether questions can be answered by the channel bot alone (Uthus and Aha, 2013a). Lowe et al. (2015, 2017) constructed a version of the data with heuristically extracted conversation threads. Their work opened up a new opportunity by providing 930,000 threads containing 100 million tokens, and has already been the basis of many papers (218 citations), particularly on developing dialogue agents (Hernandez et al., 2016; Zhou et al., 2016; Ouchi and Tsuboi, 2016; Serban et al., 2017; Wu et al., 2017; Zhang et al., 2018, inter alia). Using our data we provide the first empirical evaluation of their heuristic thread disentanglement approach, and using our model we extract a smaller, but higher quality dataset of 114,201 threads.
Another stream of research has used user-provided structure to get thread labels (Shen et al., 2006; Domeniconi et al., 2016) and reply-to relations (Wang and Rosé, 2010; Wang et al., 2011; Aumayr et al., 2011; Balali et al., 2013, 2014; Chen et al., 2017). By removing these labels and mixing threads, they create a disentanglement problem. While convenient, this risks introducing a bias, as people write differently when explicit structure is defined. Such datasets may be useful for partial supervision, but only a few publications have released their data (Abbott et al., 2016; Zhang et al., 2017; Louis and Cohen, 2015).
Within a conversation thread, we define a graph that expresses the reply relations between messages. To our knowledge, almost all prior work with annotated graph structures has been for threaded web forums (Kim et al., 2010; Schuth et al., 2007), which do not exhibit the disentanglement problem we explore. Two exceptions are Mehri and Carenini (2017) and Dulceanu (2016), who labeled small samples of chat.
| Dataset | Messages | Parts | Part length | Threads / part | Context | Annotators / msg | Avail. |
|---|---|---|---|---|---|---|---|
| Shen et al. (2006)* | 1,645 | 16 | 35–381 msg | 6–68 | n/a | 1 | No |
| Elsner and Charniak (2008)* | 2,500 | 1 | 5 hr | 379 | 0 | 1–6 | Yes |
| Adams and Martell (2008)* | 19,925 | 38 | 67–831 msg | ? | 0 | 3 | No |
| Wang et al. (2008) | 337 | 28 | 2–70 msg | ? | n/a | 1–2 | No |
| Mayfield et al. (2012)* | ? | 45 | 1 hr | 3–7 | n/a | 1 | No |
| Riou et al. (2015) | 1,429 | 2 | 12 / 60 hr | 21 / 70 | 0 | 2 / 1 | Req |
| Dulceanu (2016) | 843 | 3 | ½–1½ hr | 8–9 | n/a | 1 | Req |
| Guo et al. (2017) | 1,500 | 1 | 48 hr | 5 | n/a | 2 | No |
| Mehri and Carenini (2017) | 530 | 1 | 1.5 hr | 54 | 0 | 3 | Yes |
| This work - Pilot | 1,250 | 9 | 100–332 msg | 19–48 | 0–100 | 1–5 | Yes |
| This work - Train | 1,000 | 10 | 100 msg | 20–43 | 100 | 3+a | Yes |
| This work - Dev | 2,500 | 10 | 250 msg | 76–167 | 1,000 | 2+a | Yes |
| This work - Test | 5,000 | 10 | 500 msg | 79–221 | 1,000 | 3+a | Yes |
| This work - Elsner | 2,600 | 1 | 5 hr | 387 | 0 | 2+a | Yes |
This work is based on a new, manually annotated dataset of over 62,000 messages of Internet Relay Chat (IRC). We focus on IRC because there are enormous publicly available logs of potential interest for dialogue research. In this section, we (1) outline our methodology for data selection and annotation, (2) show several measures of data quality, (3) present general properties of our data, and (4) compare with prior datasets.
Our data is drawn from the #Ubuntu help channel (https://irclogs.ubuntu.com/), which has formed the basis of prior work on dialogue (Lowe et al., 2015). We annotated the data with a graph structure in which messages are nodes and edges indicate reply-to relations. This implicitly defines a thread structure in which a cluster of messages is one conversation. Figure 1 shows an example of two entangled threads of messages and their graph structure. It contains an example of multiple responses to a single message, when multiple people independently help BurgerMann, and the inverse, when the last message responds to multiple messages. We also see two of the users, delire and Seveas, simultaneously participating in two conversations. This multi-conversation participation is common in the data, creating a challenging ambiguity.
In our annotations, a message that initiates a new conversation is linked to itself. When messages appear to be a response, but we cannot find the earlier message, they are left unlinked. Connected components in the graph are threads of conversation.
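The definitions above, in which self-links start a conversation and connected components form threads, can be sketched as follows. This is an illustrative Python implementation, not our released annotation code; the `(child, parent)` pair representation is an assumption for the example.

```python
from collections import defaultdict

def extract_threads(num_messages, links):
    """Group messages into threads: connected components of the reply
    graph. `links` is a list of (child, parent) index pairs; a message
    that starts a conversation links to itself. Illustrative sketch."""
    # Union-find over message indices.
    parent = list(range(num_messages))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    threads = defaultdict(list)
    for m in range(num_messages):
        threads[find(m)].append(m)
    return sorted(sorted(t) for t in threads.values())
```

Note that unlinked messages (responses whose antecedent is missing) simply become singleton components under this scheme.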
The example also shows two aspects of IRC we will refer to throughout the paper:
Directed messages, an informal practice in which a participant is named in the message. These cues are crucial for understanding the discussion, but not every message has them.
System messages, which indicate users entering and leaving the channel or changing their name. These all start with ===, but not all messages starting with === are system messages, as Figure 1 shows.
IRC conversations exist in a continuous channel that we are sampling from and so messages in our sample may be a response to messages that occur before the start of our sample. This is the first work we are aware of to consider this and include earlier messages as context.
3.1 Data Selection
Table 1 presents general properties of our data and compares with prior work on thread disentanglement in real-time chat. Here we describe how we chose the spans of time to annotate.
Part of the Pilot data was chosen at random; then, after developing an initial annotation guide, we considered times of particularly light and heavy use of the channel.
Dev and Test were sampled by choosing a random point in the logs and keeping 500 messages after that point to be annotated and 1000 messages before that point as context.
The training set was sampled in three ways. Part of it was sampled the same way as Dev and Test. Part was sampled the same way, but with only 100 messages annotated (this was used to check annotator agreement). Finally, part of the training data contains time spans of one hour, chosen to sample a diverse range of conditions in terms of (1) the number of messages, (2) the number of participants, and (3) what percentage of messages are directed. We sorted the data independently for each factor and divided it into four equally sized buckets, then selected four random samples from each bucket, giving 48 samples (4 samples from 4 buckets for 3 factors).
Finally, we annotated the complete message log from Elsner and Charniak (2010), including the 100 message gap in their annotations. This enables a more direct comparison of annotation quality and model performance.
Our annotation was performed in several stages. Most stages included an adjudication phase in which disagreements between annotators were resolved. To avoid bias, during resolution there was no indication of who had given which annotation, and there was the option to choose a different annotation entirely.
Pilot: We went through three rounds of pilot annotations to develop clear guidelines. In each round, annotators labeled a set of messages and then discussed all disagreements. Annotators that joined after the guide development were trained by labeling data we had previously labeled, and discussing each disagreement in a meeting with the lead author. Preliminary experiments with crowdsourcing yielded poor quality annotations, possibly due to the technical nature of the discussions or the difficulty of clearly and briefly explaining the task.
Train: To maximise the volume of annotated data, most training files were annotated by only a single annotator. However, a small set was triple annotated in order to check agreement.
Dev: Two annotators labeled the development data and a third performed adjudication.
Test: Three annotators labeled the test data and one of them went back a month later to adjudicate. The time gap was intended to reduce bias towards the annotator’s own choices.
Elsner: Two annotators labeled all of the data and a third adjudicated.
Annotations took between 7 and 11 seconds per message depending on the complexity of the discussion, and adjudication took around 5 seconds per message (lower because many messages did not have a disagreement, but not extremely low because those that did were the harder cases to label). Overall, we spent approximately 200 hours on annotation and 15 hours on adjudication. All annotations were performed using SLATE (https://github.com/jkkummerfeld/slate), a custom-built tool with features designed specifically for this task. We store each message on a single line and use pairs of line numbers to specify links between messages. Messages that start a new thread of discussion are linked to themselves.
The annotators were all fluent English speakers with a background in computer science (necessary to understand some of the technical discussions). All adjudication was performed by the lead author.
3.3 Annotation Quality
Graph Structure: We measure agreement on the graph structure annotation using Cohen's (1960) Kappa, chosen because it measures inter-rater reliability with a correction for chance agreement. The first column of Table 2 shows that agreement levels vary across datasets, but our annotations fall into the good agreement range proposed by Altman (1990), and are slightly better than the graph annotations by Mehri and Carenini (2017). Results are not shown for Elsner and Charniak (2008) because their annotations do not include internal thread structure. For all of our datasets with multiple annotations, the final annotations are the result of an additional adjudication step in which the lead author considered every disagreement and chose a final annotation (though we also provide the individual annotations in our data release).
Threads: We can also measure agreement on threads, defined as connected components in the graph structure. This removes a common source of ambiguity: cases where it is clear that a message is part of a conversation but unclear which specific previous message(s) it responds to. However, it also makes comparisons more difficult, because there is no clear mapping from one set of threads to another, ruling out common agreement measures such as Cohen's κ and Krippendorff's α. We consider three metrics:
Exact Match F: The standard F score, using the count of perfectly matching clusters and the overall count of clusters for each annotator.
One-to-One Overlap (1-1, Elsner and Charniak, 2010): Percentage overlap when clusters from two annotations are optimally paired up using the max-flow algorithm.
Variation of Information (VI, Meila, 2007): A measure of the information gained or lost when going from one clustering to another. Specifically, VI(X; Y) = H(X|Y) + H(Y|X), where X and Y are clusterings of the same set of items and H is conditional entropy. We consider a scaled version, using the bound VI ≤ log(n), where n is the number of items, and present 1 − VI/log(n) so that larger values are better across all metrics.
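The scaled VI measure can be implemented directly from Meila's definition. This is a sketch, not our released evaluation code; the dict-based clustering representation (item to cluster id, both clusterings over the same items) is assumed for illustration.

```python
import math
from collections import Counter

def scaled_vi(clustering_a, clustering_b):
    """1 - VI / log(n), where VI is Meila's variation of information
    between two clusterings (each a dict: item -> cluster id, over the
    same items). Higher is better; identical clusterings score 1."""
    items = clustering_a.keys()
    n = len(items)
    ab = Counter((clustering_a[i], clustering_b[i]) for i in items)
    a = Counter(clustering_a[i] for i in items)
    b = Counter(clustering_b[i] for i in items)

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    # VI = H(A|B) + H(B|A) = 2 H(A,B) - H(A) - H(B)
    vi = 2 * entropy(ab) - entropy(a) - entropy(b)
    return 1.0 - vi / math.log(n)
```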
We retain system messages across all metrics, as in Mehri and Carenini (2017), giving slightly higher values than Elsner and Charniak (2010). When there are more than two sets of annotations for a set we average the pairwise scores.
[Table 2: agreement statistics, including a section for the Elsner and Charniak (2010) data]
Table 2 presents results across all of our sets and subsets of the data from Elsner and Charniak (2010). As we observed for Kappa, our agreement is higher than the data from Mehri and Carenini (2017). Comparison to the annotations from Elsner and Charniak (2010) shows a more mixed picture. On both the pilot and test sets, our exact match score is lower, but the other two metrics are higher. Manually inspecting comparisons of the annotations, we observed that ours have more clusters that differ by one or two messages.
Finally, note two interesting patterns in Table 2. First, there is substantial variation in scores across datasets labeled by the same annotators. Second, the test set from Elsner and Charniak (2010) has consistently lower levels of agreement. From these we conclude that there is substantial variation in the difficulty of thread disentanglement across datasets. This is supported by observations by Riou et al. (2015), who had an agreement level of on a 200 message sample of French Ubuntu IRC. They described their data as less entangled than Elsner's data and observed that a baseline system linking all messages less than 4 minutes apart scores 71 on the 1-1 metric (compared with a maximum score of 35 on Elsner's data when tuning the time gap parameter to its optimal value).
| Property | Value |
|---|---|
| Messages with 2+ responses | 17% |
| Messages with 2+ antecedents | 1.2% |
| Average response distance | 6.6 msg |
| Threads with 2+ messages | 5,219 |
3.4 Data Properties
Table 3 presents key properties of our data beyond the information in Table 1. From these values we can see that the option to annotate graph structures is used, as multiple responses and multiple antecedents do occur, though most structure could be represented by trees, as multiple antecedents are rare.
Our new dataset makes it possible to train models that identify conversation structure. In this section we propose a new model, which we use in Section 7 as part of our analysis of assumptions in prior work. We considered a range of models and inference methods, and compare with several prior systems for IRC disentanglement.
No Links and Link to Previous
We include two methods that do not meaningfully separate threads at all, but provide a reference point for metrics. No Links does not link any pair of messages. Link to Previous links every message to the previous non-system message and leaves system messages unlinked. For graphs, since the most common case is to link to the previous message, and the second most common case is to link to nothing, these are similar to majority class baselines. For threads, these correspond to placing every message in a separate thread, or all messages in a single thread, which are common baselines in clustering.
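These two baselines can be sketched as follows. This is illustrative Python, not released code; representing a set of links as a dict from message index to antecedent index is an assumption of the sketch.

```python
def no_links(num_messages):
    """Baseline: every message is a self-link (its own thread)."""
    return {i: i for i in range(num_messages)}

def link_to_previous(is_system):
    """Baseline: link each non-system message to the most recent
    non-system message; system messages are left unlinked (no entry),
    and the first non-system message self-links. `is_system` is a
    list of booleans, one per message."""
    links = {}
    last = None
    for i, sys_msg in enumerate(is_system):
        if sys_msg:
            continue
        links[i] = i if last is None else last
        last = i
    return links
```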
Elsner and Charniak (2008)
This system has two phases, one to score pairs of messages and one to extract a set of threads. The scoring phase uses a linear maximum-entropy model that predicts whether two messages are from the same thread using a range of features based on metadata (e.g. time of message), text content (e.g. use of words from a technical jargon list), and a model of unigram probabilities trained on additional unlabeled data. To avoid label bias and reduce computational cost, only messages within 129 seconds of each other are scored. The extraction phase uses a greedy single pass in which each message is compared with earlier messages and linked to the one with the highest score from phase one, or to no message if all scores are less than zero.
We retrained the unigram probability model and the max-ent model using our training set. We also ran the system with three settings of the cutoff parameter that determines the longest link to consider, since we suspected that the 129 second cutoff in the original work may be too short.
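The extraction phase of this approach can be sketched as follows. This is illustrative Python; the pairwise scorer `score(i, j)` stands in for the trained max-ent model and is an assumption of the sketch, which shows only the greedy decoding step.

```python
def greedy_decode(times, score, max_gap=129):
    """Greedy thread extraction in the style described above: each
    message links to the highest-scoring earlier message within
    `max_gap` seconds, or to nothing if all scores are <= 0.
    `times` holds per-message timestamps in seconds; `score(i, j)`
    is a pairwise same-thread scorer (phase one, assumed given)."""
    links = {}
    for i in range(len(times)):
        best, best_score = None, 0.0
        for j in range(i - 1, -1, -1):
            if times[i] - times[j] > max_gap:
                break  # earlier messages are even further away
            s = score(i, j)
            if s > best_score:
                best, best_score = j, s
        if best is not None:
            links[i] = best
    return links
```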
Lowe et al. (2017) (no code was provided with the paper, but it was released at https://github.com/npow/ubuntu-corpus)
This work focused on dialogue modeling, but used a rule-based approach to extract conversations from the same Ubuntu channel we consider. Their heuristic starts with a directed message and looks back over the last three minutes for a message from the target recipient. If found, these messages are linked to start a conversation thread, and follow-up messages directed between the two users are added. Undirected messages from one of the two participants are added if the speaker is not part of any other conversation thread at the same time. In the paper, conversations were filtered out if they had fewer than two participants or fewer than three messages, but we disabled this filter in our analysis to enable a fairer comparison.
This heuristic was developed to be run on the entire Ubuntu dataset, not small samples, so we provided it with extra context. Instead of providing 1,000 messages before and no messages after each test sample, we provided the entire day from which the sample came. If the 1,000 prior messages extended into the previous day, we provided that entire day as well.
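The core rule of the heuristic can be sketched as follows. This is a simplified, illustrative Python version: the message tuple format is an assumption, and the bookkeeping for undirected messages and the exclusivity check are omitted for brevity, so it is not the released implementation.

```python
def lowe_heuristic(messages, window=180):
    """Simplified sketch of the conversation-extraction rule described
    above. Each message is (time, sender, target_or_None, text); a
    directed message whose target spoke within `window` seconds starts
    a two-person conversation, and later messages directed between the
    same pair extend it."""
    last_spoke = {}    # user -> time of their last message
    conversations = [] # list of {"participants": set, "messages": [idx]}
    for idx, (time, sender, target, text) in enumerate(messages):
        if target is not None:
            for conv in conversations:
                if conv["participants"] == {sender, target}:
                    conv["messages"].append(idx)  # extend existing pair
                    break
            else:
                # Start a new conversation if the target spoke recently.
                if time - last_spoke.get(target, float("-inf")) <= window:
                    conversations.append(
                        {"participants": {sender, target},
                         "messages": [idx]})
        last_spoke[sender] = time
    return conversations
```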
We could not run the following two systems on our data as their code is not available, but we compare with their results on the Elsner data.
Wang and Oard (2009)
This system uses a weighted combination of three estimates of the probability that two messages are in the same thread. The estimates are based on time differences and two measures of content similarity. For each message, it finds the highest scoring previous message and creates a link if the score is above a tuned threshold.
Mehri and Carenini (2017)
This system combines several models: (1) an RNN that encodes context messages and scores potential next messages, (2) a random forest classifier that uses the RNN's score plus hand-crafted features to predict whether one message is a response to another, (3) a variant of the second model trained to predict whether two messages are in the same thread, and (4) another random forest classifier that takes the scores from the previous three models and additional features to predict whether a message belongs in a thread.
4.1 New Statistical Models
We explored a range of approaches, combining different statistical models and inference methods. We used the development set to identify the most effective methods and to tune all hyperparameters.
First, we treated the problem as a binary classification task, independently scoring every pair of messages. Second, we formulated it as a multi-class task, where a message is linked either to itself or to one of the 100 preceding messages (in theory, a message could respond to any previous message; we impose the limit to save computation). Third, we applied greedy search so that previous decisions could inform the current choice. We leave further exploration of structured global search for future work. In all three cases, we experimented with both cross-entropy and hinge loss functions.
On the development set we found the first approach had poor performance due to class imbalance, while the third approach did not yield consistent improvements over the second. At test time when using the second approach, we select the single highest scoring option, leading to tree structured output. Experiments with methods to extract graphs did not improve performance.
For all of our models the input contains two versions of the messages, one that is just split on whitespace, and one that has (1) tokenisation, (2) usernames replaced with a special symbol, and (3) rare words (less than 100 occurrences in the complete logs) replaced with a word shape token. Using the messages and associated metadata, the model uses a range of manually defined features as input. We explored a range of models, all implemented using DyNet (Neubig et al., 2017).
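The second input view can be sketched as follows. This is illustrative Python, not our released preprocessing; the placeholder token and the particular word-shape scheme shown here are assumptions.

```python
import re
from collections import Counter

def preprocess(message, usernames, word_counts, min_count=100):
    """Sketch of the processed input view: tokenise, replace usernames
    with a placeholder, and replace rare words (fewer than `min_count`
    occurrences in the full logs) with a word-shape token. The
    "<user>" symbol and shape scheme are illustrative choices."""
    tokens = re.findall(r"\w+|[^\w\s]", message.lower())
    out = []
    for tok in tokens:
        if tok in usernames:
            out.append("<user>")
        elif word_counts[tok] < min_count:
            # Word shape: digit runs -> 0, letter runs -> a.
            shape = re.sub(r"[a-z]+", "a", re.sub(r"[0-9]+", "0", tok))
            out.append(shape)
        else:
            out.append(tok)
    return out
```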
We used a feedforward neural network, exploring a range of configurations using the development set. We varied the number of layers, nonlinearity type, hidden dimensions, input features, and sentence representations. We also explored alternative structures including adding features and sentence representations from messages before and after the current message.
For training we varied the gradient update method, the loss type, batch sizes, weights for the loss based on error types, gradient clipping, and dropout (both at input and the hidden layers).
The best model was relatively simple, with one fully-connected hidden layer, a 128-dimensional hidden vector, tanh non-linearities, and sentence representations that average 100-dimensional word embeddings trained using Word2Vec (Mikolov et al., 2013) on messages from the entire Ubuntu IRC channel. We trained the model with a cross-entropy loss, no dropout, and stochastic gradient descent with a learning rate of 0.1 for up to 100 iterations over the training data, with early stopping after no improvement for 10 iterations, saving the model that scored highest on the development set. It was surprising that dropout did not help, but this was consistent with another observation: the model was not overfitting the training set, generally scoring almost the same on the training and development data.
We also considered a linear version of the model, using the same input, output, and training configuration, but with no hidden layer or nonlinearity.
Finally, we tried three simple forms of model combination. We trained the model ten times, varying just the random seed. To combine the outputs we considered: keeping the union of all predicted links (x10 union), keeping the most common prediction for each message (x10 vote), and forming threads, then keeping only threads on which all ten models agreed (x10 intersect).
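The first two combination schemes can be sketched as follows. This is illustrative Python; the dict-of-links representation is an assumption, and the thread-intersection scheme additionally requires thread extraction, which is omitted here.

```python
from collections import Counter

def combine_vote(predictions):
    """x10 vote: for each message keep the antecedent predicted most
    often across runs. `predictions` is a list of dicts
    (message -> antecedent), one per random seed."""
    combined = {}
    for msg in predictions[0]:
        votes = Counter(p[msg] for p in predictions)
        combined[msg] = votes.most_common(1)[0][0]
    return combined

def combine_union(predictions):
    """x10 union: keep every link predicted by any run, as a set of
    (message, antecedent) pairs."""
    return set().union(*({(m, a) for m, a in p.items()}
                         for p in predictions))
```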
| Method | Graph structure (P / R / F) | Thread extraction metrics |
|---|---|---|
| Link to previous | 35.7 / 34.4 / 35.0 | 66.1, 14.7, 38.7, 27.6, 0.0, 0.0, 0.0 |
| Elsner (5 minutes) | [55.0] / [53.1] / [54.1] | 83.4, 39.6, 64.9, 55.9, 15.2, 25.7, 19.1 |
| Elsner (30 minutes) | [53.0] / [51.1] / [52.1] | 82.5, 35.2, 61.3, 52.9, 13.9, 25.7, 18.0 |

Table 4: Test set results: our new model is substantially better than prior work. Bold indicates the best result in each column. Thread-level precision, recall, and F-score are only counted over threads with more than one message. Values in [square brackets] are for methods that produce threads only, which we converted to graphs by linking each message to the one immediately before it.
We consider evaluations of both the graph structure and thread structure. In both cases we compare system output with the adjudicated test labels. To make interpretation simple, we present all metrics on a scale between 0 and 100, with higher being better. For the feedforward model we present results averaged over ten models with different random seeds.
For graphs we calculate precision, recall, and F-score over links. To compare with systems that produce threads without internal graph structure, we convert their output by making the assumption that each message in a thread is a direct response to the message immediately before it. Clearly this is an approximation, but it allows us to apply the same metrics for evaluation and get some sense of relative performance. These results are marked with square brackets in our results. For our thread-level metrics we use their original output.
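The thread-to-graph conversion and the link-level scores can be sketched as follows (illustrative Python, not our released evaluation code; links are represented as sets of `(child, parent)` pairs).

```python
def threads_to_graph(threads):
    """Convert thread-only output to graph links under the stated
    approximation: each message replies to the message immediately
    before it in the same thread, and the first message self-links.
    `threads` is a list of lists of message indices."""
    links = set()
    for thread in threads:
        ordered = sorted(thread)
        links.add((ordered[0], ordered[0]))
        for prev, cur in zip(ordered, ordered[1:]):
            links.add((cur, prev))
    return links

def link_f1(gold, predicted):
    """Precision, recall, and F-score over link sets."""
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```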
For thread-level evaluation, it is less clear what the best metric is. We consider three metrics from the literature on comparing clusterings, and two that are specific to our task:
Adjusted Rand Index (ARand, Hubert and Arabie, 1985): A variant of the Rand (1971) Index, which considers whether pairs of items are in the same or different clusters in the two clusterings. The variant corrects for random agreement by rescaling using the expected Rand index.
Adjusted Mutual Information (AMI, Vinh et al., 2010): A variation of mutual information that is corrected for random agreement by rescaling using the expected mutual information.
Exact Match: As in Section 3.3, except that we exclude threads with only one message. This makes the metric harder, but focuses it on what we are most interested in: extracting conversations.
Table 4 presents the results on our test set. Looking first at the graph structure results, we see three clear regions of performance. First, the three heuristic methods, which have very low scores for both precision and recall. Second, Elsner and Charniak (2010)'s system, which appears to benefit from our increase of the maximum link length from 2.15 to 5 minutes, but then degrades when the length is further increased. Third, our methods, with a clear gap between the linear model and the neural network. As we would expect, the union of 10 models has the highest recall, while the voting approach has the highest precision.
For thread extraction there is a clear separation in performance between the heuristic methods and the learned approaches. In general, the first four metrics follow the same trends. One notable exception is Lowe et al. (2017)'s method, which scores higher on VI and 1-1 than the rule-based baselines, but lower on ARand and AMI. For our challenging calculation of precision and recall over complete threads we see relatively low scores for all systems, though the x10 intersect approach is able to boost precision substantially. For the purpose of extracting a dataset of conversations, trading recall for precision is worthwhile, as we end up with a slightly smaller number of higher quality conversations.
| Model | 1-1 | Local | Shen-F |
|---|---|---|---|
| *Tested on original annotations* | | | |
| Our model, Elsner | 53.5 | 75.8 | 54.4 |
| Our model, Ubuntu | 55.0 | 81.7 | 58.0 |
| Our model, Both | 53.5 | 79.4 | 55.9 |
| *Tested on our annotations* | | | |
| Our model, Elsner | 55.0 | 78.7 | 58.4 |
| Our model, Ubuntu | 59.7 | 83.2 | 62.8 |
| Our model, Both | 60.7 | 82.2 | 62.6 |
6.1 Elsner data
Table 5 presents results on Elsner and Charniak (2010)'s data, with the original annotations and with our new annotations. For the original annotations we include results from prior work, except Jiang et al. (2018), as their results are on a substantially modified version of the dataset (discussed in the next section). Here we follow prior work, using metrics defined by Shen et al. (2006, Shen-F) and Elsner and Charniak (2008, Local). Following Wang and Oard (2009) and Mehri and Carenini (2017), we report results with system messages included in the evaluation, unlike Elsner and Charniak (2008); we did, however, confirm the accuracy of our metric implementations by removing system messages and calculating results for Elsner and Charniak (2008)'s output. Local is a constrained form of the Rand index that only considers pairs of messages within a range of three messages of each other. Shen-F considers each gold thread and finds the predicted thread with the highest F-score relative to it, then averages these scores weighted by the size of the gold thread (note that this allows a predicted thread to match zero, one, or multiple gold threads). We chose not to include these in our primary evaluation as they have either been superseded by more rigorously studied metrics (VI and AMI for Shen-F) or make assumptions that do not fit our objective (Local).
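The Shen-F metric as described can be sketched as follows (illustrative Python derived from the definition above, not the original implementation; threads are represented as sets of message ids).

```python
def shen_f(gold_threads, pred_threads):
    """Shen-F: for each gold thread take the best F-score against any
    predicted thread, then average weighted by gold-thread size. Note
    that a predicted thread may match zero, one, or multiple gold
    threads, as described in the text."""
    def f1(g, p):
        tp = len(g & p)
        if tp == 0:
            return 0.0
        prec, rec = tp / len(p), tp / len(g)
        return 2 * prec * rec / (prec + rec)

    n = sum(len(g) for g in gold_threads)
    return sum(len(g) / n * max(f1(g, p) for p in pred_threads)
               for g in gold_threads)
```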
We observe several interesting trends. First, training on only our new data, treating this data as an out-of-domain sample, yields high performance, similar to or better than prior work. Training our model on only our annotations of the Elsner data performs comparably to Elsner and Charniak (2010)’s model (higher on some metrics, lower on others). Training on both has mixed impact, hurting performance when measured against the original annotations, and performing similarly on our annotations.
Comparing to prior work, when not training on our additional data the best system depends on the metric. Mehri and Carenini (2017)’s model combination system is substantially ahead on 1-1 and Shen-F, but substantially behind Elsner and Charniak (2010)’s model on Local. Our system appears to fall somewhere between the two.
Using our new annotated data and our trained model we are able to investigate a range of assumptions made in prior work on disentanglement. In particular, most prior work has been biased by considering a relatively small sample from a single point in time from a single channel. While we also focus on a single channel, we have considerably more data including samples from many different points in time with which to evaluate assumptions.
Length of message links
This is a common assumption made to reduce computational complexity. Elsner and Charniak (2010) and Mehri and Carenini (2017) limit links to 129 seconds, Jiang et al. (2018) limit to within 1 hour, Guo et al. (2017) limit to within 8 messages, and we limit to within 100 messages. We can test these limits by looking at the time difference between consecutive messages in the threads based on our annotations. 96% of links are within 2 minutes, and virtually all are within an hour. 82.5% of links are to one of the last 8 messages, and 99.5% are to one of the last 100 messages. This suggests that the lower limits in prior work are too low. However, in our annotation of the Elsner data, 98% of messages are within 2 minutes, suggesting that this property is channel and sample dependent.
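These statistics can be computed directly from the annotated links. The following is an illustrative sketch; the dict-of-links and timestamp-list representations are assumptions, not our released analysis code.

```python
def link_statistics(links, times, time_limit=120, msg_limit=8):
    """Fraction of annotated links within a time window and within a
    message-distance window. `links` maps child message index to
    antecedent index; `times` holds per-message timestamps in seconds.
    Self-links (conversation starts) are skipped."""
    time_ok = msg_ok = total = 0
    for child, parent in links.items():
        if child == parent:
            continue  # self-link: starts a conversation
        total += 1
        if times[child] - times[parent] <= time_limit:
            time_ok += 1
        if child - parent <= msg_limit:
            msg_ok += 1
    return time_ok / total, msg_ok / total
```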
A related assumption by Lowe et al. (2017) is that the first response to a question is within 3 minutes. Filtering the threads in our annotations to only those that start with a message explicitly annotated as not linking to anything else, we find this is true in 96% of threads.
Lowe et al. (2017) also allow arbitrarily long links between later messages in a conversation, but claim dialogues longer than an hour are rare. To test this we measured the time between consecutive messages in a thread and plot the frequency of each value in Figure 2. (Note: the data from Lowe et al. (2017) alters the timestamps, sometimes adding four hours, without accounting for the 12-hour clock used in part of the data. Since the values we show are all relative, we did not attempt to correct this variation; when their change led to a negative time difference, we did not count it.) The figure indicates that the threads they extract often extend over days, or even span more than a month (note the point in the top-right corner). In contrast, our annotations rarely contain links beyond an hour (including context messages, the files we annotated contained 3.5 hours of conversation on average, enabling the annotation of longer links had they been present), and the output of our model rarely contains links longer than 2 hours. We manually inspected 40 of their threads, half 12 to 24 hours long, and half longer than 24 hours. All of the longer conversations and 17 of the shorter ones incorrectly merged multiple conversations. The exceptions were two cases where a user thanked another user for their help the previous day, and one case where a user asked if another user had resolved their issue.
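The gap measurement could be sketched as below, assuming each thread is given simply as a list of unix timestamps in channel order; negative gaps (arising from the timestamp inconsistencies in the Lowe et al. data) are skipped rather than counted, as in our analysis.

```python
def consecutive_gaps(threads):
    """Time gaps between consecutive messages within each thread.

    threads: list of threads, each a list of unix timestamps in
    channel order. Negative gaps (bad timestamps) are dropped.
    """
    gaps = []
    for timestamps in threads:
        for prev, cur in zip(timestamps, timestamps[1:]):
            delta = cur - prev
            if delta >= 0:
                gaps.append(delta)
    return gaps
```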
Number of concurrent threads
Adams and Martell (2008) constrain their annotators to label at most 3 concurrent threads, while Jiang et al. (2018) remove conversations from their data to ensure there are no more than 10 at once. In our data, we find there are at most 3 concurrent threads 53% of the time (where time is measured in messages, not minutes), and at most 10 threads 99.5% of the time. Presumably the annotators in Adams and Martell (2008) would have proposed changes if the 3-thread limit was problematic, suggesting that their data is less entangled than ours.
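One simple way to compute these concurrency figures is sketched below, under the assumption that each channel message carries a thread id, and that a thread counts as active at a message if it has messages both at-or-before and at-or-after that point; this is an illustrative definition, not necessarily the one used in any prior annotation effort.

```python
def concurrent_thread_counts(thread_ids):
    """Number of active threads at each message position.

    thread_ids: the thread label of each message, in channel order.
    A thread is active at position i if its first message is <= i
    and its last message is >= i.
    """
    first, last = {}, {}
    for i, t in enumerate(thread_ids):
        first.setdefault(t, i)
        last[t] = i
    return [sum(1 for t in first if first[t] <= i <= last[t])
            for i in range(len(thread_ids))]
```

The fraction of positions with at most k active threads then follows directly from the returned counts.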
Lengths of conversations
Adams and Martell (2008) break their data into 200 message blocks for annotation, which prevents threads extending beyond that range. In our data, assuming conversations start at any point through the 200 message block, 92% would finish before the cutoff point. This suggests that their conversations are typically shorter, which is consistent with the previous conclusion that their threads are less entangled.
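The cutoff estimate can be simulated as follows; this is a rough sketch under the stated assumption that a conversation starts at a uniformly random offset within the 200-message block, with each conversation summarised only by its span in channel messages.

```python
def fraction_finishing(spans, block=200):
    """Expected fraction of conversations that fit inside a block.

    spans: conversation lengths measured as the number of channel
    messages from first to last message. A conversation starting at
    offset o (0-based) fits iff o + span <= block; we average over
    all block offsets 0..block-1.
    """
    total = 0.0
    for span in spans:
        fitting = min(block, max(0, block - span + 1))
        total += fitting / block
    return total / len(spans)
```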
Jiang et al. (2018) remove messages shorter than 5 words, intending to focus on real conversations. In our data, 88.1% of messages with fewer than 5 words occur in conversations with more than one author, compared with 89.7% of other messages, suggesting that short messages do occur in real conversations. One possible explanation is that Jiang et al. (2018) may have overfit their filtering process to their Reddit data.
Lowe et al. (2017) make several other assumptions in the construction of their heuristic. First, that if all directed messages from a user are in one conversation, all undirected messages from the user are in the same conversation. We find this is true 57.9% of the time. Second, that it is rare for two people to respond to an initial question. In our data, of the messages that are explicitly labeled as starting a thread and that receive a response, 36.8% receive multiple responses. Third, that a directed message can start a conversation. For messages that are explicitly labeled as starting a thread, this is true 9.2% of the time. Overall, these assumptions have mixed support from our data, with only the third being clearly supported.
This analysis indicates that working from a single sample or a small number of samples can lead to major bias in system design for disentanglement. There is substantial variation across channels, and across time within a single channel. Our dataset provides a larger number of samples in time, but future work should expand the range of channels considered.
7.1 Dialogue Modeling
As mentioned earlier, the threads extracted by Lowe et al. (2017) have formed the basis of a series of papers on dialogue. Based on our observations above, we are not confident in the accuracy of their threads. To construct a new dataset that is comparable in style, but with more accurately extracted conversations, we applied the x10 intersect model to all of the Ubuntu logs, relaxing the constraint to require only 7 models to agree, and then applied several filters: (1) the first message is not directed, (2) there are exactly two participants (a questioner and a helper), not counting the channel bot, (3) no more than 80% of the messages are by a single participant, and (4) there are at least three messages. This gave 114,201 conversations. Anecdotally, spot-checking 100 conversations, we found that 75% looked accurate, slightly higher than the results across all conversations, as shown in Table 4.
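The four filters can be sketched as a single predicate; this is an illustrative version only, in which the `is_directed` heuristic (a message is directed if its first token contains a colon, e.g. "user: ...") and the `bot_name` default are assumptions, not our actual implementation.

```python
def keep_conversation(messages, bot_name="ubottu", is_directed=None):
    """Apply the four conversation filters described in the text.

    messages: list of (author, text) pairs for one extracted thread.
    Returns True only if all four conditions hold.
    """
    if is_directed is None:
        # crude placeholder heuristic for directed messages
        is_directed = lambda text: ":" in text.split(" ", 1)[0]
    humans = [(a, t) for a, t in messages if a != bot_name]
    # (1) the first message is not directed
    if not humans or is_directed(humans[0][1]):
        return False
    # (2) exactly two participants, not counting the channel bot
    authors = {a for a, _ in humans}
    if len(authors) != 2:
        return False
    # (3) no more than 80% of messages by a single participant
    top_share = max(sum(1 for a, _ in humans if a == x) for x in authors)
    if top_share / len(humans) > 0.8:
        return False
    # (4) at least three messages
    return len(humans) >= 3
```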
Using the new conversations, we constructed a next utterance selection task, similar to the one introduced by Lowe et al. (2017). The task is to predict the next utterance in a thread given the messages so far. We cut threads off before a message from the helper and select 9 random utterances from helpers in the full dataset as negative options. We measure performance with Mean Reciprocal Rank, and by counting the percentage of cases when the system places the correct next utterance among the first k options (Recall@k).
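The two metrics can be computed as sketched below, assuming each evaluation example is represented as a list of candidate scores in which index 0 holds the correct next utterance and the remaining entries hold the 9 negative options.

```python
def mrr_and_recall(score_lists, k=1):
    """Mean Reciprocal Rank and Recall@k for next utterance selection.

    score_lists: one list of candidate scores per example; index 0 is
    the correct next utterance, higher scores rank earlier.
    """
    rr_total, hits = 0.0, 0
    for scores in score_lists:
        # 1-based rank of the correct candidate among all candidates
        rank = 1 + sum(1 for s in scores[1:] if s > scores[0])
        rr_total += 1.0 / rank
        hits += rank <= k
    n = len(score_lists)
    return rr_total / n, hits / n
```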
We implemented both models in TensorFlow (Abadi et al., 2015). Words were represented as the concatenation of (1) word embeddings initialised with 300-dimensional GloVe vectors, and (2) the output of a bidirectional LSTM over characters with 40-dimensional hidden vectors. All hidden layers were 200-dimensional, except the ESIM prediction layer, which had 256 dimensions. Finally, we limited the input sequence length to 180 tokens. We trained with batches of size 128 and the Adam optimiser (Kingma and Ba, 2014), with an initial learning rate of 0.001 and an exponential decay of 0.95 every 5000 steps.
Table 6 shows results when varying the training and test datasets. For both models, training on the same dataset as the test set performs better than training on a different dataset. This is consistent with the two datasets being different, despite being constructed from the same underlying data with similar filtering of conversations. Also, looking at the cases where the training and test datasets match (comparing rows 1 and 4, and rows 5 and 8), we see that the results are similar, despite the fact that our data contains far fewer conversations than Lowe et al. (2017)’s.
This work provides an exciting new resource for understanding synchronous multi-party conversation online. Our data includes a diverse range of samples across a decade of chat, with high quality annotations of threads and reply-structure. We have demonstrated how our data can be used to investigate properties of communication, calling into question assumptions in prior work. Models based on our data are more accurate than prior approaches, but there is still great scope for improvement and many potential directions of future exploration.
We would like to thank Jacob Andreas, Will Radford, and Glen Pink for helpful feedback on earlier drafts of this paper. This material is based in part upon work supported by IBM under contract 4915012629. Any opinions, findings, conclusions or recommendations expressed above are those of the authors and do not necessarily reflect the views of IBM.
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Abbott et al. (2016) Rob Abbott, Brian Ecker, Pranav Anand, and Marilyn Walker. 2016. Internet Argument Corpus 2.0: An SQL schema for Dialogic Social Media and the Corpora to go with it. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
- Adams and Martell (2008) Page H. Adams and Craig H. Martell. 2008. Topic Detection and Extraction in Chat. In 2008 IEEE International Conference on Semantic Computing.
- Altman (1990) Douglas G Altman. 1990. Practical statistics for medical research. CRC press.
- Aumayr et al. (2011) Erik Aumayr, Jeffrey Chan, and Conor Hayes. 2011. Reconstruction of threaded conversations in online discussion forums. In International AAAI Conference on Web and Social Media.
- Balali et al. (2014) Ali Balali, Hesham Faili, and Masoud Asadpour. 2014. A Supervised Approach to Predict the Hierarchical Structure of Conversation Threads for Comments. The Scientific World Journal.
- Balali et al. (2013) Ali Balali, Hesham Faili, Masoud Asadpour, and Mostafa Dehghani. 2013. A Supervised Approach for Reconstructing Thread Structure in Comments on Blogs and Online News Agencies. Computacion y Sistemas, 17(2):207–217.
- Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1.
- Chen et al. (2017) Jun Chen, Chaokun Wang, Heran Lin, Weiping Wang, Zhipeng Cai, and Jianmin Wang. 2017. Learning the Structures of Online Asynchronous Conversations, volume 10177 of Lecture Notes in Computer Science. Springer.
- Cohen (1960) Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46.
- Domeniconi et al. (2016) Giacomo Domeniconi, Konstantinos Semertzidis, Vanessa Lopez, Elizabeth M. Daly, Spyros Kotoulas, and Gianluca Moro. 2016. A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection. In Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA,.
- Du et al. (2017) Wenchao Du, Pascal Poupart, and Wei Xu. 2017. Discovering Conversational Dependencies between Messages in Dialogs. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
- Dulceanu (2016) Andrei Dulceanu. 2016. Recovering implicit thread structure in chat conversations. Revista Romana de Interactiune Om-Calculator, 9:217–232.
- Elsner and Charniak (2008) Micha Elsner and Eugene Charniak. 2008. You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement. In Proceedings of ACL-08: HLT.
- Elsner and Charniak (2010) Micha Elsner and Eugene Charniak. 2010. Disentangling Chat. Computational Linguistics, 36(3):389–409.
- Elsner and Charniak (2011) Micha Elsner and Eugene Charniak. 2011. Disentangling chat with local coherence models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.
- Elsner and Schudy (2009) Micha Elsner and Warren Schudy. 2009. Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing.
- Guo et al. (2017) Gaoyang Guo, Chaokun Wang, Jun Chen, and Pengcheng Ge. 2017. Who Is Answering to Whom? Finding "Reply-To" Relations in Group Chats with Long Short-Term Memory Networks. In International Conference on Emerging Databases (EDB’17).
- Hernandez et al. (2016) Nicolas Hernandez, Soufian Salim, and Elizaveta Loginova Clouet. 2016. Ubuntu-fr: A Large and Open Corpus for Multi-modal Analysis of Online Written Conversations. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
- Hubert and Arabie (1985) Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.
- Jiang et al. (2018) Jyun-Yu Jiang, Francine Chen, Yan-Ying Chen, and Wei Wang. 2018. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
- Kim et al. (2010) Su Nam Kim, Li Wang, and Timothy Baldwin. 2010. Tagging and Linking Web Forum Posts. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning.
- Kingma and Ba (2014) D. P. Kingma and J. Ba. 2014. Adam: A Method for Stochastic Optimization. ArXiv e-prints.
- Louis and Cohen (2015) Annie Louis and Shay B. Cohen. 2015. Conversation Trees: A Grammar Model for Topic Structure in Forums. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
- Lowe et al. (2017) Ryan Lowe, Nissan Pow, Iulian Vlad Serban, Laurent Charlin, Chia-Wei Liu, and Joelle Pineau. 2017. Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse, 8(1).
- Mayfield et al. (2012) Elijah Mayfield, David Adamson, and Carolyn Penstein Rosé. 2012. Hierarchical Conversation Structure Prediction in Multi-Party Chat. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue.
- Mehri and Carenini (2017) Shikib Mehri and Giuseppe Carenini. 2017. Chat disentanglement: Identifying semantic reply relationships with random forests and recurrent neural networks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
- Meila (2007) Marina Meila. 2007. Comparing clusterings–an information based distance. Journal of Multivariate Analysis, 98(5):873–895.
- Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013. Efficient Estimation of Word Representations in Vector Space. ArXiv e-prints.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. arXiv preprint arXiv:1701.03980.
- Ouchi and Tsuboi (2016) Hiroki Ouchi and Yuta Tsuboi. 2016. Addressee and Response Selection for Multi-Party Conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
- Rand (1971) William M. Rand. 1971. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association, 66(336):846–850.
- Riou et al. (2015) Matthieu Riou, Soufian Salim, and Nicolas Hernandez. 2015. Using discursive information to disentangle French language chat. In NLP4CMC 2nd Workshop on Natural Language Processing for Computer-Mediated Communication / Social Media at GSCL Conference.
- Schuth et al. (2007) Anna Schuth, Maarten Marx, and Maarten de Rijke. 2007. Extracting the discussion structure in comments on news-articles. In Proceedings of the 9th annual ACM international workshop on Web information and data management.
- Serban et al. (2017) Iulian Vlad Serban, Alexander G. Ororbia, Joelle Pineau, and Aaron Courville. 2017. Piecewise Latent Variables for Neural Variational Text Processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
- Shen et al. (2006) Dou Shen, Qiang Yang, Jian-Tao Sun, and Zheng Chen. 2006. Thread Detection in Dynamic Text Message Streams. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Uthus and Aha (2013a) David Uthus and David Aha. 2013a. Detecting Bot-Answerable Questions in Ubuntu Chat. In Proceedings of the Sixth International Joint Conference on Natural Language Processing.
- Uthus and Aha (2013b) David Uthus and David Aha. 2013b. Extending word highlighting in multiparticipant chat. In Florida Artificial Intelligence Research Society Conference.
- Uthus and Aha (2013c) David C. Uthus and David W. Aha. 2013c. The Ubuntu Chat Corpus for Multiparticipant Chat Analysis. In Analyzing Microtext: Papers from the 2013 AAAI Spring Symposium.
- Vinh et al. (2010) Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2010. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research (JMLR), 11:2837–2854.
- Wang et al. (2011) Hongning Wang, Chi Wang, ChengXiang Zhai, and Jiawei Han. 2011. Learning Online Discussion Structures by Conditional Random Fields. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Wang and Oard (2009) Lidan Wang and Douglas W. Oard. 2009. Context-based Message Expansion for Disentanglement of Interleaved Text Conversations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
- Wang et al. (2008) Yi-Chia Wang, Mahesh Joshi, William Cohen, and Carolyn Rosé. 2008. Recovering Implicit Thread Structure in Newsgroup Style Conversations. In Proceedings of the International Conference on Weblogs and Social Media.
- Wang and Rosé (2010) Yi-Chia Wang and Carolyn P. Rosé. 2010. Making Conversational Structure Explicit: Identification of Initiation-response Pairs within Online Discussions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
- Wu et al. (2017) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential Matching Network: A New Architecture for Multi-turn Response Selection in Retrieval-Based Chatbots. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Zhang et al. (2017) Amy Zhang, Bryan Culbertson, and Praveen Paritosh. 2017. Characterizing Online Discussion Using Coarse Discourse Sequences. In 11th AAAI International Conference on Web and Social Media (ICWSM).
- Zhang et al. (2018) Rui Zhang, Honglak Lee, Lazaros Polymenakos, and Dragomir R. Radev. 2018. Addressee and Response Selection in Multi-Party Conversations with Speaker Interaction RNNs. In Proceedings of AAAI-2018.
- Zhou et al. (2016) Xiangyang Zhou, Daxiang Dong, Hua Wu, Shiqi Zhao, Dianhai Yu, Hao Tian, Xuan Liu, and Rui Yan. 2016. Multi-view Response Selection for Human-Computer Conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.