
Find the Conversation Killers: a Predictive Study of Thread-ending Posts

How to improve the quality of conversations in online communities has recently attracted considerable attention. Keeping online conversations engaged, urbane, and reactive has a critical effect on the social life of Internet users. In this study, we are particularly interested in identifying a post in a multi-party conversation that is unlikely to be further replied to, and which therefore kills that thread of the conversation. For this purpose, we propose a deep learning model called ConverNet. ConverNet is attractive due to its capability of modeling the internal structure of a long conversation and its appropriate encoding of the contextual information of the conversation, through effective integration of attention mechanisms. Empirical experiments on real-world datasets demonstrate the effectiveness of the proposed model. On this widely discussed topic, our analysis also offers implications for improving the quality and user experience of online conversations.





1. Introduction

¹This work was done when the first author was visiting the University of Michigan.

More and more people are relying on online communities to access the latest information, exchange ideas, express opinions, and participate in discussions. Facilitating these natural conversations in online communities has become increasingly important. On one hand, decision makers utilize these conversations to optimize their online marketing strategies; social scientists study how opinions are shaped and diffused through discussions; politicians analyze how users respond to certain governmental policies. On the other hand, effective and healthy conversations lead to increasing satisfaction and engagement of users, while low-quality and unhealthy conversations hurt users' social experience, turn them away, and may even convert them into trolls. How to engage people in online conversations has aroused the interest of researchers from various domains.

Facilitating user conversations in online communities has also become an active line of research in the field of data mining. Many studies focus on predicting the number of retweets accumulated by a particular Tweet (Suh et al., 2010; Hong et al., 2011; Tan et al., 2014) in Twitter, identifying various factors that could help users get a higher response rate from the audience. Similar studies have also been conducted on online forums (Wang et al., 2011; Balali et al., [n. d.]), e.g., to parse and predict thread structures, which can potentially enhance information access and support sharing.

While most existing work on online conversations focuses on posts that are positively received (e.g., highly retweeted Tweets), there are far fewer studies on posts that are negatively received. For example, there is no existing work on identifying conversation killers, posts that result in no further replies in a multi-party conversation. Analyzing the ineffectiveness of conversations is as important as analyzing the effectiveness. Indeed, a conversation killer not only prevents new information from being introduced and new opinions from being expressed, but also projects a negative experience onto the author themselves: the lack of response often leads to disappointment and lower self-evaluation, and further decreases their interest and engagement. Developing a system that identifies potential conversation-killing posts and provides suggestions accordingly could greatly improve the engagement of users in the conversations. For example, if a user intends to expand a discussion, the system could send out a notice before they submit their post if it is instead likely to end the discussion. Such a notice, together with possible suggestions, is also plausible in a two-way conversation when one intends to keep the other engaged.

In this work, we study the novel task of predicting thread-ending posts, which we use as a practical surrogate for “conversation killers.” Although not all thread-ending posts kill a conversation, and not all do so unintentionally, knowing whether or not there will be further replies does help one avoid becoming a potential conversation killer. We analyze various properties that are potentially predictive of ending a conversation, including text content, conversation background, conversation structure, and sentiments. We find that a standard SVM model is able to distinguish predictive signals from others. To make the best use of these predictive signals, a specially designed recurrent neural network (RNN), named ConverNet, is employed to model conversations as sequences of posts, as RNNs are known to be good at learning high-level representations from text. A particular challenge of modeling conversations is the large variance of thread lengths, which makes standard RNNs ineffective due to their weakness in handling long-term dependencies, and standard attention mechanisms ineffective due to their inability to handle variable-length sequences. To address this challenge, we propose a simple yet powerful attention mechanism specifically designed for this task. The attention mechanism not only resolves the issue of lengthy threads, but also provides an effective way to model important context information (e.g., timestamps and authorship) in the conversation.

We conduct large-scale experiments with conversations in two representative domains – online forums (e.g., Reddit) and movie dialogs. The results demonstrate the effectiveness and generality of ConverNet, which outperforms a portfolio of strong baselines, including feature-based SVMs and deep learning methods equipped with standard attention mechanisms. By comparing the results of ConverNet and the SVM, we present interesting implications about how to engage the participants in a conversation.

2. Related Work

The large quantity of information on social media platforms holds great potential for research in data mining and natural language processing. In this paper, we specifically focus on the task of predicting thread-ending posts. There are several lines of related work.

2.1. Prediction of Replies or Retweets for Microblogs

One line of work concerning response prediction on social networks aims to predict the number of replies. This helps content generators, especially advertisers and celebrities, to increase their exposure and maintain their images. Rowe et al. (2011) targeted the prediction of seed posts and their potential number of replies in the future. Suh et al. (2010) and Hong et al. (2011) focused on predicting the number of retweets and analyzed what kinds of tweets attract more retweets.

Besides predicting the number of replies, there is also work on the binary task of predicting whether a Tweet can get a response (e.g., replies or retweets). Artzi et al. (2012) tackled this task mainly based on users' social networks and historical influence. Rowe and Alani (2014) further investigated more features that might affect engagement across various social media platforms.

The above-mentioned work mainly focuses on a single post (e.g., a Tweet) that is not part of a longer conversation. In contrast, we identify thread-ending posts in the context of a multi-party conversation, which has a complex internal structure and richer contextual information beyond the text content.

2.2. Prediction Tasks in Online Forums

Another line of work comes from the domain of online forums. Some studies aim at predicting thread structures. Wang et al. (2011) approach this task by detecting initiation-response pairs, pairs of utterances in which the first part sets up an expectation for the second. Balali et al. ([n. d.]) followed this work by reconstructing the thread structure, formulating it as a supervised learning task.

There are other types of tasks in online forums that are relevant to our work, including assessing the quality of posts (Siersdorfer et al., 2010; Tausczik and Pennebaker, 2011; Ghose and Ipeirotis, 2011) and categorizing post types (e.g., question, solution, spam) (Perumal and Hirst, 2016).

Although the context and features are relevant, these tasks do not address the problem we are solving: identifying thread-ending posts in a conversation. Further investigation is needed to identify what types of information could be predictive in our task.

2.3. Modeling Conversational Interactions

There has been extensive work on modeling conversational interactions on social media platforms. Honeycutt and Herring (2009) analyzed how to make Twitter more usable as a tool for collaboration. Boyd et al. (2010) studied how retweeting can be used as a way to converse with others. Ritter et al. (2010) used unsupervised conversation models to cluster utterances with similar conversational roles. These tasks focus more on the analysis of conversations than on prediction.

Recently, researchers have begun to study how to automatically generate responses given a conversation history (Shang et al., 2015; Sordoni et al., 2015; Serban et al., 2016). In this work, we do not aim at generating responses. Rather, we attempt to help users understand whether the post they are about to submit would terminate the thread of the conversation.

In summary, our work uses data across different social media platforms and complements the studies mentioned above, but with a unique focus on the effect of a single post on the entire conversation. We propose a deep learning model that takes into consideration the content, structure, and context of the entire conversation and predicts the outcome of a single post. This model and our findings could help the aforementioned tasks in the literature and help increase user engagement in online conversations.

3. ConverNet for Thread-ending prediction

In this section, we propose a specifically designed neural network model that uses information of an entire conversation to predict thread-ending posts.

We start with a few definitions. A post is a message submitted by a single user, while a conversation is a set of posts concerning a focused topic, posted by a group of people. A thread of a conversation is a subset of posts that are organized into a tree structure through reply-to relationships. We only focus on threads with at least two posts, as more than one person has to be involved for there to be a conversation. The thread-ending post is a post in a given thread that will not receive any further replies. In this prediction task, we care more about these thread enders than their counterparts, so we label a thread-ending post as positive and the others as negative. In this way, the problem is formulated as a binary classification task. Note that we use thread-ending posts as surrogates for conversation-killing posts because they are widely available and have explicit labels.

Our model builds on the insight that recurrent neural networks (RNNs) with a specifically designed attention mechanism have great advantages in dealing with the internal structure of posts in a thread, and that additional context information can further boost the classification performance. The reasons are listed as follows.

  • Posts in a given thread have strong connections with each other. Given the explicit tree structure and the latent connections among these posts, RNN models are more suitable than standard classification methods, e.g., SVMs. Similar to using RNNs to encode sentences in machine translation tasks (Cho et al., 2014), we can also use them to model posts in a thread, which can then be used for the classification task.

  • Compared with traditional models, deep learning models have an advantage in dealing with large-scale data sets, by training on one batch at a time. This is desirable when we are working with a large amount of user-generated content.

  • We empirically found that context information, e.g., post time and authorship, could greatly complement text information. Therefore we incorporate them into a unified model.

  • Some conversation threads can be very long, containing tens of posts. Attention mechanisms are usually added over RNNs to solve the problem of long-term dependencies (Yang et al., 2016). However, because different conversations can vary greatly in length, we find that the standard attention mechanism over posts falls short of modeling the longer threads. Therefore, we specifically design an attention mechanism to handle this situation.

We propose a recurrent neural network model, called ConverNet, which implements the above design objectives and ideas. In the rest of this section, we will give a brief introduction to standard RNNs, followed by the description of our model.

3.1. Background

To facilitate readers with different levels of knowledge, we introduce the standard building blocks used by our model here.

3.1.1. LSTM And BiLSTM

In ConverNet, we use BiLSTM as the basic building block of its architecture. LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber, 1997) units are widely used to build RNN models, and BiLSTM (Graves et al., 2013) is one of its extensions.

Below we briefly introduce the basic formulation of an LSTM layer. Given $c_0$ as the cell state's initial value and $x_t$ as the input at time step $t$, LSTM can be formulated as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $h_t$ is the output of the LSTM layer at time step $t$.

Bidirectional LSTM is an extension of LSTM. It takes information not only from the forward pass but also from the backward pass. There are two identical, independent LSTM kernels in the BiLSTM. At time step $t$, one takes $x_t$ and the other takes $x_{T-t+1}$ as its input, where $T$ is the total number of time steps. The outputs of the two kernels are then aligned according to the time step and concatenated as the final output of the BiLSTM block.

Bidirectional LSTM helps overcome the problem that the LSTM kernel at time step $t$ knows nothing about the subsequent inputs in the sequence. However, it still cannot resolve the issue that the longer the input sequence, the more the LSTM layer forgets, since the size of the hidden unit inside the LSTM is constant. Therefore, we further explore an attention mechanism.
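For concreteness, a single LSTM step from the formulation above can be sketched in a few lines of NumPy. Shapes, gate ordering, and initialization here are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,).
    Gate pre-activations are stacked as [input, forget, output, cell]."""
    H = h_prev.shape[0]
    a = W @ x_t + U @ h_prev + b
    i = sigmoid(a[0:H])          # input gate
    f = sigmoid(a[H:2*H])        # forget gate
    o = sigmoid(a[2*H:3*H])      # output gate
    g = np.tanh(a[3*H:4*H])      # candidate cell state
    c_t = f * c_prev + i * g     # new cell state
    h_t = o * np.tanh(c_t)       # hidden state / layer output
    return h_t, c_t

# toy usage: D=3 input dimensions, H=2 hidden units, 5 time steps
rng = np.random.default_rng(0)
D, H = 3, 2
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(x, h, c, W, U, b)
```

A BiLSTM would run a second, independent copy of this step over the reversed sequence and concatenate the two outputs per time step.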

3.1.2. Layer Normalization

The layer normalization technique introduced by Ba et al. (Ba et al., 2016) has been shown to have a significant influence on both training speed and task performance when training LSTMs. In a standard RNN, the summed inputs in the recurrent layer are computed from the current input $x_t$ and the previous vector of hidden states $h_{t-1}$ as $a_t = W_{hh} h_{t-1} + W_{xh} x_t$. The layer-normalized recurrent layer re-centers and re-scales $a_t$ using extra normalization terms:

$$h_t = f\!\left[\frac{g}{\sigma_t} \odot (a_t - \mu_t) + b\right], \quad \mu_t = \frac{1}{H}\sum_{i=1}^{H} a_{t,i}, \quad \sigma_t = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(a_{t,i} - \mu_t)^2}$$

where $W_{hh}$ are the recurrent hidden-to-hidden weights and $W_{xh}$ are the bottom-up input-to-hidden weights. In a layer-normalized RNN, the normalization terms make it invariant to re-scaling of all the summed inputs to a layer, thus resulting in much more stable hidden-to-hidden dynamics.
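A minimal NumPy sketch of the normalization itself, with learned gain g and bias b:

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Re-center and re-scale the summed inputs a (shape (H,)),
    as in Ba et al. (2016). g and b are learned per-unit parameters."""
    mu = a.mean()
    sigma = a.std()
    return g * (a - mu) / (sigma + eps) + b

a = np.array([1.0, 2.0, 3.0, 4.0])
normed = layer_norm(a, g=np.ones(4), b=np.zeros(4))
```

With unit gain and zero bias, the output has mean zero and (up to eps) unit standard deviation regardless of the scale of the inputs.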

3.2. Context Information for Prediction of Thread-ending Posts

As mentioned above, in addition to the text content, we investigate a set of context information of a conversation thread that could contribute to the prediction task. To incorporate this information into a unified model, it is implemented as features, listed as follows. Generally speaking, there are four types: length information, sentiment information, background information, and reply information.

Length information

Post length: the number of words in a given post.

Thread length: the number of posts in a given thread.

Sentiment information

Sentiment: the intensity scores of neutral, positive, and negative sentiments of a given post. In this work we simply adopt the scores implemented by nltk's VADER lexicon (Hutto and Gilbert, 2014).

Background information

Conversation background: the context where the conversation happens. For example, in the movie-dialog data set, this is the information of the movie in which the conversation happens.

Author features: background information of the post author. For example, the number of times the author ends a conversation thread in the past.

Reply information

Replying structure: basic replying information of a thread, consisting of every post’s parent post in this thread.

Post time: the posting-time interval between each post and its previous one, classified into the categories within an hour, within a day, within a week, and no later than a month.
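As an illustration, the post-time feature could be bucketed as follows. This is a sketch with our own function name and threshold boundaries; the paper does not spell out its exact implementation:

```python
def post_time_category(interval_seconds):
    """Map the interval between a post and its previous post to one of
    the four categories used as a context feature (illustrative cutoffs)."""
    if interval_seconds <= 3600:            # one hour
        return "within_an_hour"
    if interval_seconds <= 86400:           # one day
        return "within_a_day"
    if interval_seconds <= 7 * 86400:       # one week
        return "within_a_week"
    return "no_later_than_a_month"
```

In the model, a categorical feature like this would typically be one-hot encoded or embedded before concatenation with the post embedding.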

3.3. ConverNet

Figure 1. A version of the ConverNet designed for the prediction task. Some submodules are numbered and correspondingly detailed in the text.

We now introduce the framework of our proposed neural network model, shown in Figure  1. For simplicity, we will focus on illustrating three main components in our network.

(1) Input Processing Component

The input of ConverNet is a flattened sequence of posts in a thread sorted by their post time, regardless of whether one is replying to the previous one. The replying tree structure is handled along with other context information.

For the content information, we first use an embedding layer to obtain a word embedding vector $e_{i,j}$ (the embedding of the $j$-th word of the $i$-th post) for each word in each of the $N$ posts of a given thread ($N$ being the total number of posts in the thread). We then generate a post embedding vector for each post based on all of its words, through an average pooling layer on top of the word embedding layer in ConverNet:

$$x_i = [p_i; c_i], \qquad p_i = \frac{1}{|w_i|}\sum_{j} e_{i,j}$$

where $c_i$ is the corresponding context information for the $i$-th post in the given thread. All types of context information are merged with the pooling result by simple concatenation, composing the input $x_i$ for the LNBiLSTM layer.
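The pooling-and-concatenation step can be sketched as follows (names are ours):

```python
import numpy as np

def post_input(word_embeddings, context_features):
    """Average-pool word embeddings into a post embedding p_i,
    then concatenate the context features c_i to form the input x_i."""
    p = np.mean(word_embeddings, axis=0)          # (D,)
    return np.concatenate([p, context_features])  # (D + C,)

words = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 words, D=2
ctx = np.array([0.5])                                    # C=1 context feature
x = post_input(words, ctx)                               # -> [3.0, 4.0, 0.5]
```

The sequence of such x_i vectors, one per post in temporal order, is what the LNBiLSTM consumes.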

(2) Encoding Component

The encoding component consists of a BiLSTM with the layer normalization technique and our proposed Dwdl attention layer, which is described in detail in the next subsection. First, the LNBiLSTM encodes the input sequence $x_1, \dots, x_T$ in the following way:

$$i_t = \sigma(\mathrm{LN}(W_i x_t + U_i h_{t-1} + b_i))$$
$$f_t = \sigma(\mathrm{LN}(W_f x_t + U_f h_{t-1} + b_f))$$
$$\tilde{c}_t = \tanh(\mathrm{LN}(W_c x_t + U_c h_{t-1} + b_c))$$
$$o_t = \sigma(\mathrm{LN}(W_o x_t + U_o h_{t-1} + b_o))$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$

The parameters are as follows. For the input gate: $W_i$, $U_i$, and $b_i$. For the forget gate: $W_f$, $U_f$, and $b_f$. For the cell computation: $W_c$, $U_c$, and $b_c$. For the output gate: $W_o$, $U_o$, and $b_o$.

The post to be predicted is positioned at the end of the input sequence. However, instead of using only the final output $h_T$ of the LNBiLSTM layer, we make full use of the outputs from all time steps. The Dwdl attention layer further encodes the output sequence of the LNBiLSTM into a vector that has the same dimension as the hidden units in the LNBiLSTM, taking as input the matrix $H$ vertically stacked from the LNBiLSTM outputs at every time step.

Finally, a merge operation combines the result of the attention layer with the LNBiLSTM's last output vector. In this way, theoretically, the performance after adding the attention layer will be no worse than that of a single LNBiLSTM kernel. As for the merge operation, we implement it as concatenation.

(3) Decoding Component

After getting the result from the encoding component, the decoding component performs the final classification. It consists of several MLP layers, followed by a final layer with a single output unit that predicts whether the given post is a conversation ender or not. All MLP layers are followed by batch normalization layers (Ioffe and Szegedy, 2015). All layers are activated with ReLU (Glorot et al., 2011), with the exception of the last one, which is activated with the sigmoid function. This guarantees that the final output is a probability between 0 and 1.

3.4. Dwdl Attention Layer

The motivation for using an attention layer is to make full use of the information generated by the LSTM kernel. However, one difficulty of our task is that the length of threads varies over a considerable range. The standard attention mechanism learns attention weights uniformly over posts; that is, the $k$-th post in different threads always receives the same attention. Unfortunately, this assumption does not hold in reality, as the learned weights do not fit universally to posts in threads of various lengths.

As a solution, one may want to apply different attention weights for each thread. This leads to another problem by introducing a large number of parameters to be learned.

To resolve both issues, we propose an attention mechanism that applies different weights for different lengths of input (Dwdl), while weights are shared among threads of identical length. The attention mechanism thus outputs the result

$$v = a_L^{\top} H_L$$

where $H_L$ is the output sequence from the LNBiLSTM layer with length $L$, and $a_L$ is the attention weight vector for length $L$ that will be learned.

Despite its simplicity, we found that this not only solves the problem that thread representations are hard to learn when threads vary greatly in length, but also avoids introducing too many parameters to be learned. The design of the Dwdl attention layer is a major innovation of the proposed ConverNet model in the context of deep learning architectures.
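The length-dependent weighting can be sketched as follows: one weight vector per thread length L, shared across all threads of that length. The class name and the softmax normalization are our own illustrative choices, not specified by the paper:

```python
import numpy as np

class DwdlAttention:
    """Different weights for different lengths: all threads of length L
    share one attention weight vector a_L over the L time steps."""
    def __init__(self, max_len, rng=None):
        rng = rng or np.random.default_rng(0)
        # one weight vector per possible input length (learned in practice)
        self.weights = {L: rng.normal(size=L) for L in range(1, max_len + 1)}

    def __call__(self, H):
        """H: (L, hidden) stacked LNBiLSTM outputs -> (hidden,) summary."""
        L = H.shape[0]
        a = np.exp(self.weights[L])
        a /= a.sum()                 # normalize to a distribution over steps
        return a @ H                 # weighted sum of the L step outputs

attn = DwdlAttention(max_len=10)
short = attn(np.ones((3, 4)))   # 3-post thread, hidden size 4
long = attn(np.ones((8, 4)))    # an 8-post thread uses a different a_L
```

Parameter count grows only linearly in the maximum thread length, rather than per-thread.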

3.5. Loss function

We use the binary cross-entropy loss to train our model. The objective is to minimize the loss function

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where $\hat{y}_i$ is the predicted probability for the $i$-th conversation, and $y_i$ is the ground-truth label.
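A minimal NumPy version of this loss (the clipping epsilon is a standard numerical guard, not from the paper):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy over N conversations."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.1, 0.8])
loss = bce_loss(y, p)
```

The loss approaches zero as the predicted probabilities approach the ground-truth labels.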

4. Experiment Setup

We present empirical experiments that compare our model with various alternative approaches on two public datasets. The following experiments aim to demonstrate the effectiveness of ConverNet in general, as well as the effectiveness of different kinds of content and context information in predicting thread-ending posts.

4.1. Data Sets

We accomplish this task on two representative datasets. One contains threads of Reddit posts and comments, extracted from Reddit, one of the largest online forums, which covers a variety of topics. This dataset is representative of online conversations. The other is a collection of conversations extracted from movie scripts. We include this dataset because movie dialogs are closer to offline, everyday conversations, which makes them a good reference for understanding the properties of online conversations. The statistics of these datasets are listed in Table 1.

Properties Reddit-Threads Movie-Dialogs
Threads 83,097 100,000
Vocabulary 29,729 107,354
Max post len. 673 2689
Avg. post len. 13.02 words 43.83 words
# train threads 63,097 80,000
# val threads 10,000 10,000
# test threads 10,000 10,000
Table 1. Statistics of each data set.

Reddit-Threads Data Set

Source. This data set is generated based on the public Reddit-Comments data set provided by Reddit user Stuck_In_the_Matrix (Red, [n. d.]). The original data set consists of all of the posts and comments available on Reddit since early 2006. In our experiment, we focus on the threads in the political domain (from August 2007 to August 2009), a major topic of interest on Reddit.

Processing. By utilizing the parent post information provided by Reddit, we recover the tree structure of each thread, where leaf nodes are considered thread-ending posts. As mentioned before, we only focus on threads with more than one post – a thread with only one post is not a conversation. The length distribution of threads is shown in Figure 2, which generally follows a power-law distribution.
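Recovering the labels from the parent-post information amounts to marking the posts that no other post replies to. A sketch, with our own field names:

```python
def label_thread_enders(parent_of):
    """parent_of maps each post id to its parent post id (None for the root).
    A post is a thread-ending (leaf) post iff no other post replies to it."""
    parents = {p for p in parent_of.values() if p is not None}
    return {post: post not in parents for post in parent_of}

# toy thread: a <- b <- c, and a <- d  (c and d are leaves)
labels = label_thread_enders({"a": None, "b": "a", "c": "b", "d": "a"})
```

The same pass also recovers the replying-structure feature (each post's parent) used as context information.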

Prediction Task. With these tree-structured threads, our prediction task is equivalent to predicting whether a given node is a leaf node or not. Note that in a Reddit Thread, there might be more than two actors (authors) engaged in a conversation.

Movie-Dialogs Corpus


We use the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011), which is widely used for text generation tasks. It contains more everyday words and involves in total 617 movies with 10,292 movie characters. Compared with the Reddit-Threads data set, the posts (or sentences) are simpler, shorter, and more formal. A major difference is that every dialog happens between two speakers, so the number of users per thread is constant. The length distribution of threads is shown in Figure 3.

Prediction Task. We treat every movie dialog as a chat thread with two participants. These chat threads are organized as a sequence of “posts” (sentences) instead of a tree structure. As a result, only the last post (sentence) ends the conversation.

Sampling and Other Processing

In both datasets, we randomly sample one post (sentence) from each thread (dialog) and predict whether it is a thread-ending post. We call these posts target posts. Since the information after the target post would reveal the ground truth for the prediction task, all posts after the target post are omitted before the thread is fed into the model. The Reddit-Threads data set is split into training, validation, and test sets according to the submission time of the first post in each thread: the first 63,097 threads are assigned to the training set and the remaining 20,000 are split equally into validation and test sets. For the movie data set, we randomly permute all the threads and assign them to train/val/test sets as Table 1 shows.
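For the sequential Movie-Dialogs case, the sampling-and-truncation step can be sketched as follows (names are ours; for Reddit threads the label instead comes from whether the sampled node is a leaf in the reply tree):

```python
import random

def make_example(thread_posts, rng=None):
    """Sample one target post from a thread and drop everything after it,
    so no future information leaks into the model input. In a sequential
    dialog, the target is a thread ender iff it is the last post."""
    rng = rng or random.Random(0)
    idx = rng.randrange(len(thread_posts))
    is_ender = idx == len(thread_posts) - 1
    return thread_posts[: idx + 1], is_ender

posts = ["p0", "p1", "p2", "p3"]
visible, label = make_example(posts)
```

Every post up to and including the target remains visible to the model; everything after it is discarded.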

Figure 2. The length distribution of threads in Reddit-Threads data set.
Figure 3. The length distribution of threads in Movie-Dialogs data set.

4.2. Metrics

Since the label distribution of our binary classification task is skewed, we adopt metrics in addition to the commonly used accuracy, including AUC and MAP (mean average precision), because in reality it is more important to ensure that the top-ranked posts are truly thread-ending posts (so that notices can be sent). To achieve high scores on these additional metrics, both precision and recall are important.
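Both metrics can be computed directly from predicted scores. A small sketch (ties between scores are ignored for brevity; names are ours):

```python
import numpy as np

def auc(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) statistic."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # ascending ranks
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(y_true, scores):
    """AP: precision averaged at the rank of each true thread ender."""
    order = np.argsort(-scores)                   # descending by score
    hits = y_true[order] == 1
    precisions = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precisions[hits].mean()

y = np.array([1, 0, 1, 0, 0])
s = np.array([0.9, 0.8, 0.7, 0.3, 0.1])
```

For this toy ranking both metrics evaluate to 5/6: one positive is ranked above all negatives, the other above all but one.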

4.3. Competing methods

We compare ConverNet with a handful of baseline methods in two categories: conventional machine learning methods and alternative deep learning methods.

4.3.1. Conventional Baselines

SVM+/-[features]. As a conventional method that has demonstrated superior performance in many kinds of classification tasks, SVM can sometimes achieve performance comparable to deep learning methods. Besides, it plays a key role in deciding which kinds of hand-crafted features are helpful in our prediction task. Therefore, we include all kinds of features potentially related to this problem across domains, and use a linear-kernel SVM implemented by sklearn for classification.

In addition to all the features mentioned in the method section, we include additional features extracted from the text content: word unigrams, bigrams, trigrams, and post embeddings. Specifically, post embeddings are the averages of the word vectors in a post. The word vectors are generated by the word2vec (Mikolov et al., 2013) skip-gram and CBOW models. We concatenate all the features from all the posts in a thread. When a method is named SVM+[features], it refers to an SVM model trained using only the corresponding set of features. When a method is named SVM-[features], it refers to an SVM model that uses all but the corresponding set of features.

4.3.2. Deep learning baselines

BiLSTM, LNBiLSTM, Stacked LNBiLSTM. We use the bi-directional LSTM (BiLSTM) model that is widely used for classification tasks. Considering the recent success of layer normalization, we include the bi-directional LSTM with layer normalization (LNBiLSTM) as a baseline. We also stack multiple LNBiLSTM layers to learn deeper representations (Stacked LNBiLSTM).

LNBiLSTM+features and LNBiLSTM+features+SA. One major innovation of ConverNet is the newly designed Dwdl attention mechanism (see Section 3.4) to handle context information in a thread. For comparison, we also add context information to the LNBiLSTM in several ways. LNBiLSTM+features simply concatenates the representation output by the LNBiLSTM with features extracted from context information. LNBiLSTM+features+SA applies a standard attention over the hidden-state outputs of the LNBiLSTM. For all LNBiLSTM-related models, LSTMs can be stacked horizontally or vertically, and the number of stacked layers can vary from model to model.

4.4. Training Details

Reddit-Threads Data Set Movie-Dialogs Data Set
Method Accuracy AUC MAP Accuracy AUC MAP

SVM-Text content(Embedding, N-grams)

SVM-Lengths info
SVM-Background info
SVM-Post time
SVM-Replying structures
SVM+All features
BiLSTM+Text content (only the target post)
BiLSTM+Text content
LNBiLSTM+Text content
Stacked LNBiLSTM+Text content
LNBiLSTM+All features
LNBiLSTM+All features+Standard attention
Table 2. Performance of competing methods: LNBiLSTM+All features+Dwdl attention achieves top performance.

All hyper-parameters are tuned to obtain the best performance of AUC score on the validation set. For LSTM-related methods, the candidate word embedding sizes are set as and the candidate numbers of hidden/cell units in the LSTM-related layer are . The vertically and horizontally LSTM stacking candidate number is chosen from . The embedding size for context information is selected from . The initial learning rate is selected from . For SVM, the candidate embedding size is from and the relaxing parameter C for SVM model is chosen from .

We initialize parameters in neural networks using a zero-mean Gaussian with standard deviation selected from

. For parallel network models, the number of stacking layers is selected from

. All deep learning models are optimized by RmsProp 

(Tieleman and Hinton, 2012). We stop training when the performance converges on the validation set.

5. Experiment Results

5.1. Overall Performance

The overall performance of all competing methods is shown in Table 2. The proposed method, ConverNet, outperforms all competing methods in all three metrics: AUC, Accuracy, and MAP. The improvements are all statistically significant except for one case (accuracy on Reddit-Threads, where LNBiLSTM+All features already performs very well). This empirically confirms that a well-designed deep learning model can achieve the best result in predicting thread-ending posts in online conversations.

Comparing different versions of SVM models, the content of the thread and the target post appears to be the most important. When content features are included, there is a 5% improvement in MAP on the Reddit dataset (0.688 → 0.726) and a 7% improvement on the Movie-Dialogs dataset (0.650 → 0.696). Certain context information is also useful on top of textual features, especially the time of the posts in Reddit threads (0.699 → 0.726). A more detailed comparison of the features is deferred to Section 6. A consistent conclusion can be drawn by comparing ConverNet with the best deep learning baseline that is purely based on content information. Overall, comparing ConverNet to the best-performing SVM baseline, there is another 8% improvement on Reddit-Threads (0.726 → 0.782) and a 6% improvement on Movie-Dialogs (0.696 → 0.737).

When the standard attention layer is used to handle context information, the deep learning model does not perform better than the content-only model, and is sometimes even less effective. This is possibly due to the large variation of thread length in both data sets: threads with fewer posts could have a totally different attention distribution from those with more posts. With the newly designed Dwrl attention layer, ConverNet is able to significantly outperform the content-based models.
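For context, a standard attention layer of the kind discussed above scores each post representation against a query, normalizes the scores with a softmax, and pools the posts by their weights. The shapes and the query vector here are illustrative assumptions; the paper's Dwrl variant is not reproduced:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_pool(H, q):
    """Dot-product attention: score each post representation (row of H,
    shape n_posts x d) against a query q (shape d,), softmax the scores
    into weights, and return the weighted sum of the rows."""
    scores = H @ q             # (n_posts,)
    weights = softmax(scores)  # sums to 1 over the posts
    return weights @ H, weights

# Toy usage: three identical post vectors receive uniform attention.
H = np.ones((3, 4))
q = np.ones(4)
pooled, weights = attention_pool(H, q)
```

A fixed weighting like this is exactly what struggles when thread lengths vary widely, which motivates a length-aware design.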

The experiment results of using different LSTM-related kernels also show that BiLSTM with layer normalization yields a significant improvement, while repeatedly stacking this layer can further slightly enhance the prediction result.

It is interesting to note that a deep learning model trained only on the content of the target post (instead of the whole thread), BiLSTM+Text content (only the target post), performs significantly worse than the same model that considers the content of both the target post and the other posts in the thread (with no context information). This confirms that information in the whole thread is important for predicting whether a post will be replied to, which again distinguishes our problem setting from existing work that predicts retweets. Indeed, whether a post will carry on a conversation highly depends on whether what it says is relevant to the topic of the discussion.

5.2. Training Time Analysis

In order to measure the training speed of each model, we train all deep learning methods on a server with a single TITAN X GPU (12 GB GDDR5X, graphics card power of 250 W).

For all deep learning methods, the number of epochs required for training is quite close, around 15 epochs on average across all data sets.

Comparing the training time per epoch, stacking more LNBiLSTM layers decreases time efficiency. The basic BiLSTM model takes around 55 seconds per epoch. Layer normalization slightly reduces the training time (by 2-3 seconds), while adding the attention layer increases it. Our ConverNet model takes around 60 seconds per epoch; its total training time is around 900 seconds over about 18 training epochs.

6. Discussion

The overall performance of the competing algorithms has demonstrated that thread-ending posts are predictable by integrating content and rich context information through a carefully designed deep learning architecture. Beyond the numbers, we are also interested in what the experiments imply about how to avoid being a conversation killer. We approach this through a more detailed analysis of the features and an interpretation of the model results.

6.1. Feature Analysis

Based on the features extracted for SVM, we first conduct a simple correlation analysis to understand which features of content and context are positively or negatively correlated with the outcome, i.e., whether a post is a thread-ending post or not. Correlations of some selected features (measured by Pearson's coefficient) with the outcome label are shown in Table 3.

Features                   Reddit-Threads    Movie-Dialogs
Word Embeddings (-)        'Mr.', 'Mrs.', 'like', 'talked', 'heard', 'seen', 'care'
Word Embeddings (+)        'ass', 'but', 'YOU'
Length of Thread           +                 +
Length of Post             -                 +
Post Time Difference       +                 x
Positive Sentiment Score   +                 -
Negative Sentiment Score   -                 +
Table 3. Correlation of features to ending a thread (+: positive correlation, -: negative correlation, x: not available).
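The correlations in Table 3 are plain Pearson coefficients between a feature column and the binary thread-ending label. A minimal sketch, with toy data standing in for the real features and labels:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between a feature column x and the binary
    thread-ending label y (1 = post ends the thread)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Toy illustration: post length vs. a hypothetical thread-ending label.
post_length = [5, 40, 12, 3, 60, 8]
ends_thread = [1, 0, 1, 1, 0, 1]
r = pearson(post_length, ends_thread)  # negative here: longer posts, fewer endings
```

The sign of the coefficient is what the + / - entries in Table 3 report.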

Embeddings of certain words are identified as significantly correlated with thread-ending posts, either positively or negatively (positive words indicate a higher probability of ending the thread, and negative words the opposite). We see that the most correlated words tend to be sentimental words or particular expressions rather than topical words. For example, polite addresses like 'Mr.' and 'Mrs.' are more likely to lead to further communication. Keywords indicating an inclination to share experiences, like 'seen' and 'heard', also draw the attention of other users. However, insulting words like 'ass', or words carrying an intense sentiment like 'YOU', are more likely to end a conversation.

The length of a thread is positively correlated with thread-ending posts, indicating that the first few posts in a conversation are more likely to be replied to, and that once the conversation is already lengthy, it is less likely to be prolonged. Post time difference (time since the previous post) is also positively correlated with the outcome, indicating that the longer one waits to reply to a thread, the more likely that reply will never receive a response.

These findings are quite intuitive and largely consistent across the two datasets. Other features are more intriguing. For example, in online conversations (Reddit threads), the more words a post has, the more likely it is to bring in a reply. In everyday conversations (movie dialogs), a long speech does not necessarily bring a response; saying too much might result in silence. Perhaps reading (a forum post) is indeed more efficient than listening? In movie dialogs, a post that is more positive is less likely to end a conversation, and negative sentiment has the opposite effect, which is consistent with our general intuition (be more polite to get people to respond to you). Interestingly, these correlations are reversed in Reddit threads. Considering the unique nature of online forums and the topic (politics), this is perhaps not too surprising. On one hand, political discussions in online forums are known to be intense and controversial, where an attacking or uncivil (usually extremely negative) post is likely to provoke another negative reply (Cheng et al., 2017). On the other hand, many threads in online forums ask questions, and when a satisfactory answer is provided, these threads usually end with a short post of simple appreciation; therefore, positive sentiment is linked to thread-ending. Apparently, in these two cases, ending a thread politely is not a bad thing, and prolonging an uncivil discussion is much more undesirable. This prompts us to rethink the difference between thread-ending posts and conversation killers, and between conversation killers and killers of a "good conversation."

Given the correlation analysis of individual features, we are also interested in how the features work together (see Table 2). Because different types of content and context features can be highly correlated, when they are fed into the SVM, the signs and coefficients of the features may or may not be consistent with the correlations.

On the Movie-Dialogs data set, apart from the most predictive content features, movie background features (such as the theme of the movie) also contribute significantly. In contrast, sentiment and length features are less useful on top of content features, which may be because such information has already been captured by the content features. The background features represent the circumstances under which conversations happen. One may also utilize such background information for analyzing online forums if it is available (e.g., politics vs. entertainment).

On the Reddit-Threads data set, post time brings a significant improvement, while replying structures and sentiment scores only slightly improve on top of the other features. This is because sentiment scores and replying structures may become redundant when other kinds of features are already used; for example, the SVM might have already learned the sentiment signal from the text content. However, context information like post time is relatively orthogonal to the other types of features, thus bringing a more noticeable improvement.

6.2. What ConverNet Learns

To gain a better understanding of the behavior of ConverNet, we manually analyzed cases where ConverNet performs better than SVMs. These cases can be put into the following categories.

Posts with an intense tone.

Empirically speaking, a post with an intense tone tends to create a serious chatting atmosphere, in which other conversation participants may be too nervous or shocked to say anything, thus increasing the possibility of ending the conversation. It is hard for SVMs to identify these cases even with the help of sentiment analysis, while ConverNet has a significantly higher chance of giving the right prediction. An example is shown in Figure 4. 'Sit' and 'down' are both everyday words: sentiment lexicons might fail to detect any sentiment, but we can sense a commanding tone in the expression, especially in the last utterance. Such subtlety can be detected by the neural network's recurrent and attention mechanisms.

Figure 4. ConverNet performs better than SVM on thread-ending posts with an intense tone.

Posts with an inquiry tone.

If someone is asking for other people's opinions or posing further questions, the conversation has a much higher chance of carrying on. When these posts are the targets, ConverNet is more likely to detect the questions and give a correct prediction. Some examples are shown in Figure 5.

Figure 5. ConverNet performs better than SVM on thread-ending posts with an inquiry tone.

Posts with a vague tone.

We find that a large number of thread-ending posts carry a vague tone, with examples listed in Figure 6. These posts convey ambiguous meanings, giving no direct response to the questions raised earlier by others. Such a vague tone can make other participants think that the speaker is not interested, or is not paying enough attention, which in turn makes the conversation stop. ConverNet's recurrent and attention mechanisms can better pick up these ambiguous words from the given comments, thus yielding higher precision.

Figure 6. ConverNet performs better than SVM on thread-ending posts with a vague tone.

Based on the analysis of features and the interpretation of the models, if advice has to be given on how to avoid being a conversation killer, some general implications may be: keep to the point (content), act fast (post time and thread length), be elaborative (post length), be positive (sentiment), and pay attention to your tone (deep patterns in language).

7. Conclusion

How to improve the quality of conversations and engage user participation in online communities is a critical problem relevant to virtually every Internet user. Our work focuses on a novel data mining problem: identifying what types of posts are likely to end a thread of online conversation. We find that while a standard SVM can effectively identify useful signals from both content and context information that are predictive of thread-ending posts, a carefully designed recurrent neural network model, ConverNet, is able to maximize the predictive power of these signals. ConverNet outperforms all competing baselines on data sets from two representative domains. The results of ConverNet also provide practical implications for improving the quality of online conversations. Our work opens up interesting directions towards understanding the quality of online conversations and increasing user engagement, and towards a deeper understanding of the functionality of language in a conversation.


This work was supported in part by the 973 program (2015CB352302), NSFC (U1611461) and key program of Zhejiang Province (2015C01027), and was partially supported by the National Science Foundation under grant numbers IIS-1054199, IIS-1633370, and SES-1131500.


  • Red (2015) Reddit Comments Dataset. 2015.
  • Artzi et al. (2012) Yoav Artzi, Patrick Pantel, and Michael Gamon. 2012. Predicting Responses to Microblog Posts. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 3-8, 2012, Montréal, Canada. The Association for Computational Linguistics, 602–606.
  • Ba et al. (2016) Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016).
  • Balali et al. ([n. d.]) A. Balali, H. Faili, and M. Asadpour. [n. d.]. A Supervised Approach to Predict the Hierarchical Structure of Conversation Threads for Comments.
  • Boyd et al. (2010) Danah Boyd, Scott Golder, and Gilad Lotan. 2010. Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter. In 43rd Hawaii International International Conference on Systems Science (HICSS-43 2010), Proceedings, 5-8 January 2010, Koloa, Kauai, HI, USA. IEEE Computer Society, 1–10.
  • Cheng et al. (2017) Justin Cheng, Michael Bernstein, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2017. Anyone can become a troll: Causes of trolling behavior in online discussions. arXiv preprint arXiv:1702.01119 (2017).
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 1724–1734.
  • Danescu-Niculescu-Mizil and Lee (2011) Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs.. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL 2011.
  • Fisher (1935) RA Fisher. 1935. The design of experiments. (1935).
  • Ghose and Ipeirotis (2011) Anindya Ghose and Panagiotis G. Ipeirotis. 2011. Estimating the Helpfulness and Economic Impact of Product Reviews: Mining Text and Reviewer Characteristics. IEEE Trans. Knowl. Data Eng. 23, 10 (2011), 1498–1512.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep Sparse Rectifier Neural Networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011 (JMLR Proceedings), Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudík (Eds.), Vol. 15, 315–323.
  • Graves et al. (2013) Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 273–278.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780.
  • Honeycutt and Herring (2009) Courtenay Honeycutt and Susan C. Herring. 2009. Beyond Microblogging: Conversation and Collaboration via Twitter. In 42nd Hawaii International Conference on Systems Science (HICSS-42 2009), Proceedings (CD-ROM and online), 5-8 January 2009, Waikoloa, Big Island, HI, USA. IEEE Computer Society, 1–10.
  • Hong et al. (2011) Liangjie Hong, Ovidiu Dan, and Brian D Davison. 2011. Predicting popular messages in twitter. In Proceedings of the 20th international conference companion on World wide web. ACM, 57–58.
  • Hutto and Gilbert (2014) Clayton J. Hutto and Eric Gilbert. 2014. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International Conference on Weblogs and Social Media, ICWSM 2014, Ann Arbor, Michigan, USA, June 1-4, 2014., Eytan Adar, Paul Resnick, Munmun De Choudhury, Bernie Hogan, and Alice H. Oh (Eds.). The AAAI Press.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 (JMLR Workshop and Conference Proceedings), Francis R. Bach and David M. Blei (Eds.), Vol. 37., 448–456.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  • Perumal and Hirst (2016) Krish Perumal and Graeme Hirst. 2016. Semi-supervised and unsupervised categorization of posts in Web discussion forums using part-of-speech information and minimal features. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@NAACL-HLT 2016, June 16, 2016, San Diego, California, USA, Alexandra Balahur, Erik Van der Goot, Piek Vossen, and Andrés Montoyo (Eds.). The Association for Computer Linguistics, 100–108.
  • Ritter et al. (2010) Alan Ritter, Colin Cherry, and Bill Dolan. 2010. Unsupervised Modeling of Twitter Conversations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA. The Association for Computational Linguistics, 172–180.
  • Rowe and Alani (2014) Matthew Rowe and Harith Alani. 2014. Mining and comparing engagement dynamics across multiple social media platforms. In ACM Web Science Conference, WebSci ’14, Bloomington, IN, USA, June 23-26, 2014, Filippo Menczer, Jim Hendler, William H. Dutton, Markus Strohmaier, Ciro Cattuto, and Eric T. Meyer (Eds.). ACM, 229–238.
  • Rowe et al. (2011) Matthew Rowe, Sofia Angeletou, and Harith Alani. 2011. Predicting discussions on the social semantic web. The Semanic Web: Research and Applications (2011), 405–420.
  • Serban et al. (2016) Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 3776–3784.
  • Shang et al. (2015) Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 1577–1586.
  • Siersdorfer et al. (2010) Stefan Siersdorfer, Sergiu Chelaru, Wolfgang Nejdl, and José San Pedro. 2010. How useful are your comments?: analyzing and predicting youtube comments and comment ratings. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti (Eds.). ACM, 891–900.
  • Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A Hierarchical Recurrent Encoder-Decoder For Generative Context-Aware Query Suggestion. CoRR abs/1507.02221 (2015).
  • Suh et al. (2010) Bongwon Suh, Lichan Hong, Peter Pirolli, and Ed H. Chi. 2010. Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network. In Proceedings of the 2010 IEEE Second International Conference on Social Computing, SocialCom / IEEE International Conference on Privacy, Security, Risk and Trust, PASSAT 2010, Minneapolis, Minnesota, USA, August 20-22, 2010, Ahmed K. Elmagarmid and Divyakant Agrawal (Eds.). IEEE Computer Society, 177–184.
  • Tan et al. (2014) Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic-and author-controlled natural experiments on Twitter. ACL (2014).
  • Tausczik and Pennebaker (2011) Yla R. Tausczik and James W. Pennebaker. 2011. Predicting the perceived quality of online mathematics contributions from users’ reputations. In Proceedings of the International Conference on Human Factors in Computing Systems, CHI 2011, Vancouver, BC, Canada, May 7-12, 2011, Desney S. Tan, Saleema Amershi, Bo Begole, Wendy A. Kellogg, and Manas Tungare (Eds.). ACM, 1885–1888.
  • Tieleman and Hinton (2012) T. Tieleman and G. Hinton. 2012. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. (2012).
  • Wang et al. (2011) Li Wang, Marco Lui, Su Nam Kim, Joakim Nivre, and Timothy Baldwin. 2011. Predicting Thread Discourse Structure over Technical Web Forums. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 13–25.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J Smola, and Eduard H Hovy. 2016. Hierarchical Attention Networks for Document Classification.. In HLT-NAACL. 1480–1489.