STEM: Unsupervised STructural EMbedding for Stance Detection

Stance detection is an important task, supporting many downstream tasks such as discourse parsing and modeling the propagation of fake news, rumors, and science denial. In this paper, we propose a novel framework for stance detection. Our framework is unsupervised and domain-independent. Given a claim and a multi-participant discussion, we construct the interaction network from which we derive topological embeddings for each speaker. These speaker embeddings enjoy the following property: speakers with the same stance tend to be represented by similar vectors, while antipodal vectors represent speakers with opposing stances. These embeddings are then used to divide the speakers into stance-partitions. We evaluate our method on three different datasets from different platforms. Our method outperforms or is comparable with supervised models while providing confidence levels for its output. Furthermore, we demonstrate how the structural embeddings relate to the valence expressed by the speakers. Finally, we discuss some limitations inherent to the framework.



1 Introduction

Stance detection is the task of classifying the approval level expressed by an individual toward a claim or an entity. Stance detection differs from sentiment analysis in its opaqueness. A favorable stance toward a target opinion or entity can be expressed using a negative sentiment, without any explicit mention of the target. For example, the utterance “I did not like the movie because of its stereotypical portrayal of the heroine as a helpless damsel in distress” bears a negative sentiment (“I did not like…”), while one can conjecture that the speaker’s stance toward feminism and women’s rights is favorable.

Understanding the stance of participants in a conversation is a crucial aspect of discourse parsing. Stance detection is often used in studying the social mechanisms that facilitate the propagation of fake news Thorne et al. (2017); Tsang (2020), unfounded rumors Zubiaga et al. (2016); Derczynski et al. (2017), and unsubstantiated science related to, e.g., global warming Luo et al. (2021) and the COVID-19 vaccine Tyagi and Carley (2020).

Recent models for stance detection rely on the textual content provided by the speaker, sometimes within some social or conversational context (see Section 2). These models are supervised, requiring a significant annotation effort. The dependence on language (text) as the primary, if not the sole, input, and the need for domain (topic)-specific annotation, severely impair the applicability of the models to broader domains and other languages Hanselowski et al. (2018); Xu et al. (2019).

Online discussions tend to unfold in a tree structure. Assuming a claim is laid at the root of the tree, each further node is a direct response to a previous node (utterance). This tree structure can be converted into an interaction network $G$, where the nodes of $G$ are speakers, and edges correspond to interactions. The edges may be weighted, reflecting the intensity of the interaction between the specific pair of speakers (see Section 3.1).

In this paper, we propose a novel approach for stance detection. Our method is unsupervised, domain-independent, and computationally efficient. The premise of our approach is that the conversation structure, emerging naturally in many online discussion boards and social platforms, can be used for stance detection. In fact, we postulate that the structure of a conversation, often ignored in NLP tasks, should be studied and leveraged within the language processing framework.


The main contribution of this paper is threefold: (i) we introduce an efficient unsupervised and domain-independent algorithm for stance classification, based on structural speaker embeddings; (ii) we show how multi-agent conversational structure corresponds to speakers’ stance and correlates with the valence expressed in the discussion; and (iii) the speaker embedding induces a soft classification of speakers’ stances, which can be rounded to a discrete output, e.g., “pro”, “con”, and “neutral”, but can also be used to derive other interesting parameters such as the confidence level of the result, which we discuss in Section 7.

We evaluate our model on three annotated datasets: 4Forums, ConvinceMe, and CreateDebate. These datasets differ in various aspects, from the number of speakers and discussions to the variety of the topics discussed and the culture and norms shaping the conversational dynamics. Further details about the datasets are provided in Section 5. Despite these differences, our method consistently outperforms or is comparable with supervised models that were studied in other papers and benchmarked on these datasets.

2 Related Work

Stance detection has gained significant interest in recent years, e.g., Somasundaran and Wiebe (2010); Walker et al. (2012); Sridhar et al. (2015); Mohammad et al. (2016); Derczynski et al. (2017); Sobhani et al. (2017); Joseph et al. (2017); Li et al. (2018); Porco and Goldwasser (2020); Conforti et al. (2020), among many others. A comprehensive survey of the various settings, datasets, and computational approaches is provided in Küçük and Can (2020). In this section, we focus on some perspectives closely related to our work.

Works on stance detection differ in task specification and methodology. Broadly, stance can be assigned to an utterance or a user, and the methodology can take into account text, context or both.

Stance at the user level, sometimes referred to as ‘aggregate’ or ‘collective’ stance, is addressed by Murakami and Raymond (2010); Walker et al. (2012); Yin et al. (2012). A more nuanced relationship between the post and the user level is addressed by Sridhar et al. (2015); Li et al. (2018); Benton and Dredze (2018); Conforti et al. (2020); Porco and Goldwasser (2020). We follow this observation and report results on both post and user levels.

Modal verbs, opinion and sentiment lexicons were used in early works by Somasundaran and Wiebe (2010); Murakami and Raymond (2010); Yin et al. (2012); Wang and Cardie (2014); Bar-Haim et al. (2017). Recent text-based works use graphical models Joseph et al. (2017), CRFs Hasan and Ng (2013) and various neural architectures Hiray and Duppada (2017); Sun et al. (2018); Chen et al. (2018); Kobbe et al. (2020), among others. These methods are language, and often domain, dependent. Unsupervised methods were also explored in the past, although to a much lesser extent than supervised ones, and using a different methodology than ours, mainly relying on topic modeling Kobbe et al. (2020); Wei et al. (2019).

The conversation structure was recently leveraged by Li et al. (2018) to create relational embeddings in which authors, stances, and posts are combined to form a global representation. Stance-based rumor detection is explored by Wei et al. (2019), considering the structure of the conversation, along with content. While these works leverage the conversational structure, it is done in an opaque way and is filtered through different neural architectures that combine textual cues. It is therefore hard to assess the contribution of the conversation structure to the classification task. Our framework relies solely on the structure, promoting the notion that the conversational structure is as important as the word tokens in processing conversational data.

The intuitive assumption that consecutive utterances express antipodal stances is already explored by Murakami and Raymond (2010), using the solution to the max-cut problem to find a graph partition that reflects the stance taken by users debating policy issues in Japanese. Similarly, a solution to the max-cut problem on the conversation tree was used by Walker et al. (2012).

These works are the most similar to ours, as they use the solution to the max-cut problem as the primary computational tool. Our work differs from these works in several fundamental aspects. First, Murakami and Raymond (2010) explicitly introduce dis/agreement markers into the network representation: agreement is coded as a positive edge weight and disagreement as a negative weight. These weights are derived from an assortment of simple heuristics and hand-crafted rules. Our approach does not require this noisy, language-dependent, and labor-intensive labeling of the network edges. Second, Walker et al. (2012) derive a binary output by applying a max-cut solver to the conversation tree. In contrast, we obtain a soft classification via a speaker embedding extracted from the interaction network.

While most work on stance detection uses supervised models, a number of works are unsupervised. Early works such as Somasundaran and Wiebe (2010) use generic opinion and sentiment lexicons. Kobbe et al. (2020) classify stance based on frequently used argumentation structures. Other unsupervised approaches include the use of syntactic rules for extraction of topic and aspect pairs Ghosh et al. (2018) or the extraction of aspect-polarity-target information Konjengbam et al. (2018). These approaches are language dependent, often use external resources, and are not easily adapted to different domains and communities that present a variety of discussion norms. Our approach, however, is fully unsupervised and relies solely on the conversation structure.

Our unsupervised approach proved superior or comparable to other techniques. Moreover, the speakers’ embeddings allow us to derive deeper insights about the relationship between text and structure beyond the naive hypothesis that edges represent opposing stances. These insights are discussed in Section 7.

3 A Greedy Approach

Figure 1: The workflow of STEM. First, parse the discussion thread (tree structure) into a weighted user-interaction graph. Then compute the 2-core of the graph. Next, run the max-cut SDP on the 2-core graph, generating the speaker embedding. A random hyperplane partitions the core speakers into two stance groups (red and green groups). Finally, propagate the labels to speakers not in the core using a simple interchanging rule.

A naive view of the structure of an argumentative dialogue between two speakers $u$ and $v$ is that they hold different stances. While it is tempting to assume that a simple tree structure, reflecting the turn-taking nature of a discussion, lends itself to accurate classification, this intuition does not hold for multi-participant discussions, as we demonstrate in Section 3.2 and in the results in Section 6. The reason is that engaging discussions tend to induce complex user interaction graphs, which are far from being bipartite. Therefore a more subtle approach is needed. We present two algorithms that build upon the same intuition. The first is a simple greedy approach; in Section 4 we discuss the more sophisticated method, which is based on a speaker embedding technique.

3.1 From Conversation Trees to Networks

A discussion can be naturally represented as a tree, where nodes correspond to posts (comments, utterances) and several nodes are children of the same parent node if each was posted, independently, as a direct response to it. Discussion trees capture an array of conversational patterns: turn-taking (direct replies), the volume of direct interaction between pairs of users, and of course, the textual signal, including content and style. However, converting the conversation tree into an interaction network may better capture the conversational dynamics.

In the interaction network, a node corresponds to a speaker, rather than to an utterance, and an edge between two nodes (speakers) $u$ and $v$ indicates a direct interaction between the two. The edges can be weighted to signify the intensity of the interaction. We use the following edge weighting $w_{uv}$:

$$w_{uv} = \alpha \cdot r_{uv} + \beta \cdot q_{uv} \qquad (1)$$

where: $r_{uv}$ denotes the number of times user $u$ replied to user $v$; $q_{uv}$ denotes the number of times user $u$ quotes user $v$; and $\alpha$ and $\beta$ are constants denoting the significance assigned to a single interaction (a direct response or a quote, respectively). These significance weights could be adjusted to reflect specific conversational norms; for example, quoting other speakers and posts that do not directly precede an utterance is common in one of the datasets while scarcer in the others. (A short excerpt of a long discussion is presented in Table 8 in the Appendix, along with the corresponding conversation tree (Fig. 4a) and the interaction network (Fig. 4b).)
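The conversion from reply and quote events to a weighted interaction network can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(source, target)` event format, and the parameter names `alpha` (reply significance) and `beta` (quote significance) are assumptions, as is the choice to accumulate interactions in both directions onto a single undirected edge.

```python
from collections import defaultdict


def build_interaction_network(replies, quotes, alpha=1.0, beta=1.0):
    """Return {frozenset({u, v}): weight} for an undirected interaction network.

    replies, quotes: iterables of (source_speaker, target_speaker) pairs,
    one pair per reply or quote event (hypothetical input format).
    alpha, beta: significance of a single reply or quote (assumed names).
    """
    weights = defaultdict(float)
    for events, significance in ((replies, alpha), (quotes, beta)):
        for source, target in events:
            if source == target:
                continue  # a speaker answering themselves adds no edge
            # Assumption: both directions contribute to the same undirected edge.
            weights[frozenset((source, target))] += significance
    return dict(weights)


# Toy discussion: two replies between alice and bob, one reply and one quote by carol.
replies = [("bob", "alice"), ("alice", "bob"), ("carol", "alice")]
quotes = [("carol", "bob")]
network = build_interaction_network(replies, quotes, alpha=1.0, beta=0.5)
```

Setting one of the two constants to zero drops that interaction type from the network entirely, which is one way to adapt the weighting to the norms of a given platform.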

3.2 Algorithm 1: Greedy Speaker Labelling

Recall the intuitive assumption that two speakers $u$ and $v$ that intensively engage with each other, inducing a heavy edge in the interaction network, hold opposed stances. We, therefore, begin by proposing a simple greedy algorithm based on this naive assumption. The algorithm receives the interaction network $G$ with the OP, $v_0$, marked with an abstract stance label, say $+1$. In the first iteration it initializes the set of labelled speakers $S = \{v_0\}$. In each consecutive iteration, it finds the heaviest edge $(u, v)$ that connects a vertex $u \in S$ to a vertex $v \notin S$, and adds the speaker $v$ to $S$, labeling $u$ and $v$ with opposite stance labels. This algorithm is basically Prim’s algorithm for minimum spanning tree, run with the heaviest rather than the lightest edge at each step, and it runs in nearly linear time, $O(|E| \log |V|)$. We call this algorithm Greedy.
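The greedy procedure described above can be sketched with a binary heap, in the style of Prim's algorithm. This is an illustrative sketch under assumptions: the adjacency-dictionary input format, the function name, and the use of `+1`/`-1` as abstract stance labels are not taken from the paper.

```python
import heapq


def greedy_labels(adjacency, op, op_label=1):
    """Label speakers by repeatedly following the heaviest frontier edge.

    adjacency: {speaker: {neighbour: weight}}, symmetric (assumed format).
    op: the original poster, who receives op_label.
    Returns {speaker: +1 or -1} for every speaker reachable from op.
    """
    labels = {op: op_label}
    heap = []      # entries: (-weight, tie_breaker, labelled_speaker, candidate)
    counter = 0    # tie-breaker so speaker names are never compared

    def push_edges(u):
        nonlocal counter
        for v, weight in adjacency.get(u, {}).items():
            if v not in labels:
                heapq.heappush(heap, (-weight, counter, u, v))
                counter += 1

    push_edges(op)
    while heap:
        _, _, u, v = heapq.heappop(heap)
        if v in labels:
            continue  # v was labelled through a heavier edge already
        labels[v] = -labels[u]  # opposite stance to the speaker it engaged with
        push_edges(v)
    return labels


# Triangle: a-b is the heaviest edge, then b-c, then a-c.
adjacency = {
    "a": {"b": 3.0, "c": 1.0},
    "b": {"a": 3.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0},
}
greedy_result = greedy_labels(adjacency, op="a")
```

In the triangle, `c` receives the same label as `a` because it is reached through `b`; the light edge between `a` and `c` is ignored. Odd cycles like this one are exactly the non-bipartite structure that a purely greedy assignment cannot reconcile.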

4 Speaker Embedding

A more sophisticated approach still builds upon the same intuition. It creates a speaker embedding that allows a principled comparison rather than an iterative greedy assignment. A desired property of the speaker embedding, let’s call it $\theta$-separability, is that speakers with opposing stances are assigned vectors with an angle of at least $\theta$ between them (it’s instructive to think of $\theta$ as close to $\pi$). We say that an embedding $\theta$-respects the stance if it satisfies $\theta$-separability for every pair of speakers.

Suppose $x_u$ and $x_v$ are unit vectors. The separability property can be mathematically encoded by requiring that the expression in Eq. (2) takes a larger value on pairs of opposing speakers. We use $\langle x_u, x_v \rangle$ for the cosine similarity between the two vectors.

$$\frac{1 - \langle x_u, x_v \rangle}{2} \qquad (2)$$

The maximal value Eq. (2) takes is 1, which is attained if the two vectors are antipodal, namely, the angle between them is exactly $\pi$, and the cosine similarity is $-1$. Multiplying Eq. (2) by the corresponding edge weight ensures that the larger values are attained for relevant pairs.

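Given an interaction network $G = (V, E)$, with $|V| = n$, and edge weights $w_{uv}$ for every edge $(u, v) \in E$, our goal is to find a speaker embedding which respects the stance for as many speaker pairs as possible. The proposed candidate speaker embedding is the solution of the optimization problem given in Eq. (3). We use $S^{n-1}$ to denote the unit sphere in $\mathbb{R}^n$.

$$\max_{x_u \in S^{n-1},\; u \in V} \;\; \sum_{(u,v) \in E} w_{uv} \cdot \frac{1 - \langle x_u, x_v \rangle}{2} \qquad (3)$$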

The optimization problem in Eq. (3) is a semi-definite program (SDP), and it can be solved in polynomial time using the Ellipsoid algorithm Seese (1990). This SDP was suggested by Goemans and Williamson (1995) as a relaxation for the NP-hard max-cut problem, which is in line with our intuitive hypothesis about the nature of the interaction between speakers.
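For readers who want to hand this program to an off-the-shelf SDP solver, the standard equivalent formulation over the Gram matrix of the speaker vectors is sketched below. The matrix symbol $X$ and the weight notation $w_{uv}$ are introduced here for illustration; $X_{uv}$ plays the role of the inner product between the vectors of speakers $u$ and $v$.

```latex
\begin{aligned}
\max_{X \in \mathbb{R}^{n \times n}} \quad
    & \sum_{(u,v) \in E} w_{uv} \, \frac{1 - X_{uv}}{2} \\
\text{s.t.} \quad
    & X_{uu} = 1 \quad \text{for all } u \in V, \\
    & X \succeq 0 .
\end{aligned}
```

Any factorization $X = B^{\top} B$ of the optimal matrix (for example, via a Cholesky or eigenvalue decomposition) yields the unit vectors as the columns of $B$. As a sanity check, for a graph with a single edge of weight $w$ the optimum places the two speakers at antipodal points, so that $X_{uv} = -1$ and the objective equals $w$.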

4.1 From soft to discrete classification

The speaker embedding gives a continuous range of stance relationships, from “total disagreement” (antipodal vectors) to “total agreement” (aligned vectors). However, in some cases, we want to round the continuous solution to a discrete solution, say “pro” vs. “con”.

In addition, the separability property is relevant for pairs of speakers. Even if the embedding of every pair respects the stance, this still doesn’t lend itself immediately to a partition of the entire set of speakers into two sets, “pro” and “con”, that respects the stance. If the interaction graph is a tree, then pairwise separability immediately induces an overall consistent partition. But when cycles exist, things are messier.

We now describe how to round the speaker embedding into a partition of the speakers. To gain intuition into the rounding technique, let’s assume that the obtained embedding pairwise respects the stance, and further, that the embedding lies in a one-dimensional subspace of $\mathbb{R}^n$. Namely, there exists some vector $z$ s.t. for every $u \in V$, $x_u = z$ or $x_u = -z$. In such a case, the rounding is trivial: all vectors on “one side” are “pro”, and all vectors on the “other side” are “con” (or vice versa).

Building upon this intuition, a random hyper-plane rounding technique is commonly used Goemans and Williamson (1995). A random $(n-1)$-dimensional hyper-plane that goes through the origin is selected, and vectors are partitioned into two groups according to which side of the hyper-plane each vector lies on. In the one-dimensional example, every random hyperplane will round the vectors correctly into the two opposing stance classes. More generally, the more the vectors are clustered into two “tight” cones, the more accurate the rounding will be (by tight, we mean that the maximum pairwise angle is small).
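Random hyperplane rounding amounts to drawing a random normal vector and taking the sign of each speaker's projection onto it. The sketch below uses only the standard library; the function name and the dictionary-based embedding format are assumptions made for illustration.

```python
import random


def hyperplane_round(embedding, rng=None):
    """Split speakers by the side of a random hyperplane through the origin.

    embedding: {speaker: sequence of floats}, all of the same dimension
    (assumed format). Returns {speaker: +1 or -1}.
    """
    rng = rng or random.Random()
    dim = len(next(iter(embedding.values())))
    # A vector of independent Gaussians has a uniformly random direction,
    # so it serves as the normal of a uniformly random hyperplane.
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return {
        speaker: 1 if sum(a * b for a, b in zip(vec, normal)) >= 0 else -1
        for speaker, vec in embedding.items()
    }


# One-dimensional intuition: alice and carol share a vector, bob is antipodal.
embedding = {"alice": (1.0, 0.0), "bob": (-1.0, 0.0), "carol": (1.0, 0.0)}
rounded = hyperplane_round(embedding, random.Random(0))
```

Which side ends up as `+1` is arbitrary and depends on the random draw; only the partition itself is meaningful, which is why a separate step is needed to name the two sides.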

Figure 2: PCA projection of the 19-dimensional speaker embedding for the core of the interaction network (available in the Appendix). Colors correspond to the speakers’ labels. The black arrows to the left and right correspond to the average vector in each color class.

Figure 3: PCA projection of the 35-dimensional speaker embedding of the core of an interaction network also from 4Forums. Shorter vectors have a larger component perpendicular to PC1 and PC2. The induced cones have a large diameter, and therefore the confidence of having a correct prediction on authors within this conversation significantly decreases. Black arrows are cone centers (again shorter).

Figure 2 illustrates this point: two tight cones are observed, as well as some “straying” vectors that are liable to wrong classification. The accuracy of the hyperplane rounding on that conversation was 75%. On the other hand, Figure 3 demonstrates wider cones, and accordingly, the accuracy this time was only 64%. Further illustration about how the diameter of the cones corresponds to an accurate solution is given in Table 1.

Diameter   Accuracy   Authors
2.0        0.79       2440
1.0        0.80       2403
0.75       0.80       2341
0.5        0.81       2258
0.25       0.82       2127
0.1        0.83       1921
0.05       0.84       1761
0.01       0.85       1332
0.001      0.85       917

Table 1: Accuracy of speaker classification for speakers whose vector falls inside the cone, for various cone diameters. Evidently, as the cones get tighter, the accuracy increases. The dataset used is the 4Forums conversations.

4.2 Tight cones of vectors respect the stance

It is important to note that the vectors that the SDP assigns to the speakers lie in $\mathbb{R}^n$. This dimension provides a lot of freedom in vector assignment (freedom which is necessary for the SDP to be solvable in polynomial time). Therefore, while the one-dimensional intuition just described is clear for a two-person dialogue, it is not a priori clear why the vectors in $\mathbb{R}^n$ should simultaneously respect the stance of all, or most, speakers in a multi-participant discussion.

We now explore the conditions that may lead to the desired phenomenon where the SDP solution is such that the vectors are clustered in two tight cones. These conditions are rooted both in the network structure and in the content of the conversation.

From the perspective of the network topology, it is easy to see that if the solution to Eq. (3) were restricted to rank one, the optimum would be the antipodal-vectors solution we described above, where the assignment of vectors corresponds to the max-cut partition of the graph. However, crucially, Eq. (3) does not contain a rank constraint on the solution, as this would make the optimization problem NP-hard. Now enters the assumption that edges represent antipodal stances. If this assumption is correct, and the structure of the network is rich enough to force a unique max-cut solution, then we expect a “tight-cones” solution which is aligned both with the max-cut partition and with the stances.

The assumption of a unique max-cut partition may be too strong to hold for the entire graph (think for example of isolated nodes, or very sparse structures). However, for a special subgraph, the 2-core of the graph, this uniqueness may hold. Indeed, we have found that most of the SDP vectors of the speakers that belong to the 2-core of the graph (the maximal subgraph of $G$ in which the minimal degree is at least 2) are arranged in a tight-cone structure. This phenomenon was also observed in other papers that studied related tasks such as community detection and other graph partitioning tasks Reichardt and Bornholdt (2006); Newman (2006); Leskovec et al. (2010); Coja-Oghlan et al. (2007).

But why should the graph contain a large 2-core in the first place? Here enters the content/linguistic aspect. We expect that captivating or stirring topics will lead to lively discussions that result in a complex conversation graph that induces a large 2-core. Together with the basic assumption that edges connect speakers with opposing stances, we arrive at the premise that in such discussions, the SDP will produce solutions that have the tight-cones structure, and this tight-cone structure will respect the stance. Thus, when rounding the solution using the random hyper-plane technique, we expect to detect the stance of 2-core users accurately. Section 7 elaborates on the relationship between the spirit, or valence, of the conversation and the accuracy of the algorithm.

4.3 Algorithm 2: STEM

We now formally describe our main contribution, STEM, an unsupervised structural embedding for stance detection. The steps below are also illustrated in Figure 1. Given a conversation tree $T$, STEM operates as follows:

  1. Convert the conversation tree $T$ to an interaction network $G$, as described in Section 3.1.

  2. Compute the 2-core $C$ of $G$, i.e. the maximal induced subgraph of $G$ where every node has degree at least 2 in $C$.

  3. Solve the SDP in Eq. (3) on $C$ to obtain a speaker embedding $\{x_u\}_{u \in C}$.

  4. Round the speaker embedding using a random hyper-plane.

  5. Propagate the labels to speakers outside the core, $V \setminus C$, using interchanging labels assignment.

In Step 2, we compute the core. To compute the 2-core, one iteratively removes vertices whose degree in the remaining graph is smaller than two, until no such vertex remains.

Step 5 does not lead to a contradiction since, by definition, the vertices outside the core do not induce a cycle. Therefore, the propagation of labels in the sub-graphs connected to the 2-core is consistent.
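Steps 2 and 5 can be sketched with plain breadth-first processing. This is a minimal illustration under assumptions: the neighbour-set input format and function names are hypothetical, and "interchanging labels assignment" is read here as giving each non-core speaker the label opposite to the already-labelled neighbour through which it is reached.

```python
from collections import deque


def two_core(adjacency):
    """Return the vertex set of the 2-core.

    adjacency: {speaker: set of neighbours}, symmetric (assumed format).
    Vertices of degree below two are peeled off until none remains.
    """
    degree = {u: len(neighbours) for u, neighbours in adjacency.items()}
    removed = set()
    queue = deque(u for u, d in degree.items() if d < 2)
    while queue:
        u = queue.popleft()
        if u in removed:
            continue
        removed.add(u)
        for v in adjacency[u]:
            if v not in removed:
                degree[v] -= 1
                if degree[v] < 2:
                    queue.append(v)
    return set(adjacency) - removed


def propagate_labels(adjacency, core_labels):
    """Extend +1/-1 labels from the core outward, flipping along each edge."""
    labels = dict(core_labels)
    queue = deque(labels)
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in labels:
                labels[v] = -labels[u]
                queue.append(v)
    return labels


# A triangle (the core) with a two-speaker tail hanging off "c".
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
core = two_core(graph)
# Core labels would come from the rounding step; these values are illustrative.
all_labels = propagate_labels(graph, {"a": 1, "b": -1, "c": 1})
```

Speakers in a connected component that contains no core vertex are never reached by this propagation and stay unlabelled in the sketch.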

Finally, note that our algorithm produces a partition of speakers, similarly to the problem of community detection, without a label for each part (pro or con). One simple heuristic to obtain the labeling is to label the set containing the OP as “pro”. Another option is to use an off-the-shelf algorithm, e.g. Allaway and McKeown (2020), and noisily label a few posts on each side before taking a majority vote.

To evaluate the performance of our algorithm without additional noise that this last step may incur, we checked the two possible ways of assigning the labels and took the one that resulted in higher accuracy.
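Scoring a two-way partition under both possible namings of its sides, and keeping the better one, can be sketched as below. The function name and the `+1`/`-1` label encoding are assumptions made for illustration.

```python
def accuracy_up_to_flip(predicted, gold):
    """Accuracy of a two-way partition under the better of the two labelings.

    predicted, gold: {speaker: +1 or -1} (assumed encoding). Only speakers
    present in both dictionaries are scored.
    """
    common = [speaker for speaker in predicted if speaker in gold]
    if not common:
        raise ValueError("no overlapping speakers to evaluate")
    agree = sum(predicted[s] == gold[s] for s in common)
    # Flipping every predicted label turns agreements into disagreements.
    return max(agree, len(common) - agree) / len(common)


predicted = {"a": 1, "b": -1, "c": 1, "d": 1}
gold = {"a": -1, "b": 1, "c": -1, "d": 1}
score = accuracy_up_to_flip(predicted, gold)
```

By construction this score never falls below 0.5 for a two-class partition, which is worth keeping in mind when reading accuracies obtained this way.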

5 Data

We evaluate our approach on three different datasets: ConvinceMe Anand et al. (2011), 4Forums Walker et al. (2012), and CreateDebate Hasan and Ng (2014). These datasets were used in previous work, e.g., Walker et al. (2012); Sridhar et al. (2015); Abbott et al. (2016); Li et al. (2018), among others. In this section we briefly describe each of the datasets and highlight some important aspects they differ in. A statistical description of datasets is provided in Table 2.

Statistic                4Forums   CD      CM
# Topics                 4         4       16
# Conversations          202       521     9,521
# Conversations (core)   202       149     500
# Authors                863       1,840   3,641
# Authors (core)         718       352     490
# Posts                  24,658    3,679   42,588
# Posts (core)           23,810    1,250   5,876

Table 2: Basic statistics of the three datasets: 4Forums, CreateDebate (CD), and ConvinceMe (CM). We also present the number of authors that belong to the 2-core of the interaction graph, and their posts.

ConvinceMe (CM)

ConvinceMe is a structured debate site. Speakers initiate debates by specifying a motion and stating the sides. Debaters argue for/against the motion, practically self-labeling their stance with respect to the original motion. The data was first used by Anand et al. (2011) and incorporated into the IAC2.0 by Abbott et al. (2016).


4Forums

4Forums (no longer maintained) was an online forum for political debates. It had a shallow hierarchy of topics (e.g., Economics/Tax), and discussion threads have a tree-like structure. The 4Forums stance dataset, introduced by Walker et al. (2012), provides agree/disagree annotations on comment-response pairs in 202 conversations on four topics (abortion, evolution, gay marriage, and gun control).

CreateDebate (CD)

Similarly to ConvinceMe, CreateDebate is a structured debate forum. Unlike ConvinceMe, the user initiating the debate does not put forward a specific assertion. Rather, she introduces an open question for the community, and speakers can respond by taking sides. Authors must label their posts with either a support, clarify or dispute label. A collection of debates on four topics (abortion, gay rights, legalization of marijuana, Obama) was introduced by Hasan and Ng (2014). This dataset contains many degenerate conversations – speakers responding to the prompt question without engaging in a conversation with other speakers. We filtered out these degenerate conversations, keeping 541 conversation trees (see Table 2). The root of each of the trees is an original response to the initial questions.

6 Evaluation

Model                                Abortion   Gay Rights   Marijuana   Obama   Average
PSL (Sridhar et al., 2015)           0.67       0.73         0.69        0.64    0.68 (macro)
Global Embedding (Li et al., 2018)   0.81       0.77         0.77        0.65    0.75 (macro)
Greedy (full)                        0.80       0.81         0.74        0.79    0.79
STEM (core)                          0.91       0.82         0.82        0.82    0.86
STEM (full)                          0.90       0.85         0.74        0.86    0.86

Table 3: Average accuracy on posts’ stance classification of CreateDebate discussions.

Model                                Abortion   Gay Rights   Marijuana   Obama   Average
PSL (Sridhar et al., 2015)           0.67       0.74         0.75        0.63    0.71 (macro)
Greedy (full)                        0.87       0.86         0.76        0.85    0.85
STEM (core)                          0.91       0.79         0.86        0.83    0.88
STEM (full)                          0.86       0.80         0.70        0.83    0.85

Table 4: Average accuracy for authors’ stance classification for CreateDebate discussions.

Model                                Abortion   Evolution   Gay Marriage   Gun Control   Average
PSL (Sridhar et al., 2015)           0.77       0.80        0.81           0.69          0.77 (macro)
Global Embedding (Li et al., 2018)   0.87       0.82        0.88           0.83          0.85 (macro)
Greedy (full)                        0.62       0.61        0.60           0.63          0.62
STEM (core)                          0.93       0.88        0.89           0.85          0.89
STEM (full)                          0.92       0.87        0.88           0.84          0.89

Table 5: Average accuracy on posts’ stance classification of 4Forums discussions.

Model                                Abortion   Evolution   Gay Marriage   Gun Control   Average
PSL (Sridhar et al., 2015)           0.66       0.79        0.77           0.68          0.73 (macro)
Greedy (full)                        0.61       0.59        0.59           0.62          0.60
STEM (core)                          0.84       0.78        0.79           0.74          0.79
STEM (full)                          0.79       0.75        0.77           0.71          0.76

Table 6: Average accuracy of authors’ stance classification for 4Forums discussions.

Topic                       # Posts   STEM   Walker
Gay Marriage                708       0.98   0.84
Evolution                   688       0.99   0.82
Communism Vs Capitalism     185       0.99   0.70
Marijuana Legalization      261       0.98   0.73
Gun Control                 314       0.95   0.63
Abortion                    834       0.96   0.82
Climate Change              255       1.00   0.64
Israel/Palestine            36        1.00   0.85
Existence Of God            842       0.98   0.75
Immigration                 166       0.87   0.67
Death Penalty               474       0.98   0.65
Legalized Prostitution      108       0.88   NA
Vegetarianism               43        1.00   NA
Women In The Military       22        1.00   NA
Minimum Wage                14        0.95   NA
Obamacare                   101       0.98   NA
Other                       37,537    0.95   NA

Table 7: Average accuracy of post-level stance achieved by STEM and the Max-Cut algorithm from Walker et al. (2012) on the ConvinceMe dataset.


Implementation. Our approach uses only two hyper-parameters, $\alpha$ (reply weight) and $\beta$ (quote weight), which are used to compute the weights of the edges in the interaction graph, see Eq. (1). The optimal values may differ between datasets, as the conversational norms may differ. We fixed the values manually; for 4Forum we used as participants tend to reply to the OP regardless of the content to which they are replying, and only quote the relevant content instead. For CreateDebate and ConvinceMe we used as quotes are rarely used.

To solve the SDP optimization in Eq. (3) we used the standard open-source PICOS and CVXOPT libraries. All source code required for conducting the experiments and reproducing the results is uploaded as supplementary material (including the random seed). The average running-time for computing the solution for a single conversation (including the SDP) was seconds. The average time was taken over conversations from 4Forums as this dataset contains the largest conversations, with an average of speakers in the core-graph ( speakers max). We ran the experiment on a machine equipped with a processor with 8 cores and 16GB RAM (we didn’t use a GPU for the computation).

Evaluation. We evaluated the two algorithms, Greedy (Section 3.2) and STEM (Section 4.3), on the three datasets described in Section 5, both at the speaker level and the post level. The 4Forums dataset had both post-level and speaker-level labels.

In cases where ground-truth labels were available only at the post level (CD and CM), we extended the post-level labeling to speaker-level by taking a majority vote over the posts of each user; in cases where the results were reported at the post level (CD and 4Forum), we labeled the posts according to the stance of that speaker.
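Lifting post-level labels to the speaker level by majority vote can be sketched as follows. The function name and the `(speaker, label)` input format are assumptions; ties are broken here in favour of the label seen first for that speaker, which is a choice of this sketch and not something the text specifies.

```python
from collections import Counter, defaultdict


def speaker_labels_from_posts(posts):
    """Assign each speaker the most frequent stance label among their posts.

    posts: iterable of (speaker, stance_label) pairs (assumed format).
    Ties go to the label encountered first for that speaker (assumption).
    """
    votes = defaultdict(Counter)
    for speaker, label in posts:
        votes[speaker][label] += 1
    return {
        speaker: counts.most_common(1)[0][0]
        for speaker, counts in votes.items()
    }


posts = [("ann", "pro"), ("ann", "con"), ("ann", "pro"), ("ben", "con")]
speaker_level = speaker_labels_from_posts(posts)
```

The reverse direction described in the same paragraph needs no voting: every post simply inherits the stance assigned to its author.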

Results, compared to previous work on the CreateDebate dataset, are presented in Tables 3 and 4. Two types of results are reported: the accuracy of each algorithm on the speakers that belong to the 2-core, and the accuracy over all speakers. The results are given at the post level (Table 3) and speaker level (Table 4). Similar results on the 4Forums dataset are presented in Tables 5 and 6.

As evident from the tables, STEM outperforms other approaches across all topics and datasets. Also evident from the tables is that the accuracy of STEM on the 2-core is always higher than its accuracy over all speakers. We note that even the Greedy algorithm significantly outperforms SOTA results reported in the literature.

We complete our evaluation with a direct comparison to the Max-Cut approach used by Walker et al. (2012). Walker et al. solve the Max-Cut problem on the conversation tree (where posts are also linked to authors), using some Max-Cut solver (not SDP). They report results at the post level for the ConvinceMe dataset. Table 7 presents results for each topic separately, demonstrating the usefulness of our more elaborate way of using the Max-Cut intuition.

7 Discussion


Our work suggests that a rich interaction graph structure leads to a useful speaker embedding. The latent link between the linguistic aspects of the conversation and the graph structure may relate to the valence of the conversation. To explore this, we computed the valence of the conversations in 4Forums using the Python package PySentiStr Thelwall et al. (2010). Each conversation was scored with the average valence of its posts. We found that the average accuracy of STEM on conversations whose valence is at the lower end (0–0.5) was 0.75, while the average accuracy on conversations with medium valence (0.5–0.8) was 0.8, and the average accuracy on conversations exhibiting high valence (0.8–1) rose to 0.92. These results support our hypothesis that stirred-up discussions lead to a richer interaction graph structure, resulting in a more accurate speaker embedding. Future work should further investigate this link between content, stance, and conversation structure.


The soft classification induced by the speaker embedding allows us to attribute confidence levels to our results. Specifically, Table 1 demonstrates how the accuracy of the algorithm improves as we perform the rounding of vectors on increasingly tighter cones. Therefore, along with the binary stance classification, we can report a score proportional to the cone diameter of the 2-core, which informs the user how certain we are about the accuracy of our results. This is illustrated in Figure 3, where a larger cone diameter resulted in lower accuracy (64%).
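One way to compute such a score is sketched below. It assumes that the "cone diameter" of a group is the largest pairwise angle between the embedding vectors assigned to the same side; the exact definition, the helper names, and the choice of hyperplane normal are assumptions made for illustration.

```python
import math
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    cosine = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cosine)))


def cone_diameter(vectors):
    """Largest pairwise angle in a group (0.0 for fewer than two vectors)."""
    return max((angle(u, v) for u, v in combinations(vectors, 2)), default=0.0)


def round_with_confidence(embeddings, normal):
    """Split speakers by the sign of <embedding, normal>.

    Returns both sides and the diameter of the wider cone; a smaller
    diameter means tighter cones and hence a more confident partition.
    """
    side_a = {s for s, v in embeddings.items() if dot(v, normal) >= 0}
    side_b = set(embeddings) - side_a
    widest = max(cone_diameter([embeddings[s] for s in side_a]),
                 cone_diameter([embeddings[s] for s in side_b]))
    return side_a, side_b, widest


# Hypothetical 2-D embeddings for three speakers.
toy = {"a": (1.0, 0.0), "b": (0.8, 0.6), "c": (-1.0, 0.0)}
pro, con, diameter = round_with_confidence(toy, normal=(1.0, 0.0))
```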

Rounding the embedding of the 2-core and propagating the results to the non-core speakers may be sub-optimal. As Table 1 suggests, it might be better to round a subgraph of the 2-core that corresponds to tighter cones, at the expense of labeling fewer speakers in the rounding step, and then propagate the labels to the remaining core and non-core vertices.
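A label-propagation step of this kind could look like the sketch below. The propagation rule is an assumption chosen for illustration: an unlabeled speaker receives the stance opposite to the majority of its already-labeled neighbors, on the premise that interactions mostly signal disagreement. It is not necessarily the rule used in our experiments.

```python
def propagate_labels(adjacency, seed_labels):
    """Extend +1/-1 labels from seed vertices to the rest of the graph.

    `adjacency` maps each vertex to its neighbours; `seed_labels` holds the
    labels produced by rounding. Vertices whose labelled neighbours tie, or
    that cannot be reached from a seed, stay unlabelled.
    """
    labels = dict(seed_labels)
    changed = True
    while changed:
        changed = False
        for v in sorted(adjacency):
            if v in labels:
                continue
            vote = sum(labels[u] for u in adjacency[v] if u in labels)
            if vote != 0:
                # Take the side opposite to the neighbours' majority.
                labels[v] = -1 if vote > 0 else 1
                changed = True
    return labels


# Hypothetical chain of replies a - b - c - d, with a and b already rounded.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
full = propagate_labels(graph, {"a": 1, "b": -1})
```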


Finally, let us mention the limitations of our approach. The task of stance classification is not limited to structured platforms like ConvinceMe or 4Forums. Indeed, debates take place on general-purpose platforms such as Twitter or Facebook, where a wider range of reactions is available. We have not tested our method on such data, and it may be the case that the conversational norms on these platforms differ radically from those in the three datasets we used.

Another limitation is the 2-core requirement. It might be that discussions on some platforms result in core-free graphs or graphs with several small 2-cores. We have tested our method on interaction graphs that are trees, which have an empty 2-core. Our approach worked well for some trees while it stumbled on others.
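Whether a given interaction graph has a usable 2-core can be checked by repeatedly removing vertices of degree below two. The sketch below is a standard peeling procedure; the function name and the toy graphs are illustrative.

```python
def two_core(adjacency):
    """Vertex set of the 2-core of an undirected graph.

    `adjacency` maps each vertex to an iterable of neighbours; duplicate
    entries and self-loops are ignored.
    """
    neighbours = {v: set(ns) - {v} for v, ns in adjacency.items()}
    degree = {v: len(ns) for v, ns in neighbours.items()}
    alive = set(neighbours)
    stack = [v for v in alive if degree[v] < 2]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue
        alive.discard(v)
        # Removing v lowers the degree of its surviving neighbours.
        for u in neighbours[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < 2:
                    stack.append(u)
    return alive


# A path (a tree) has an empty 2-core; a triangle with a pendant keeps the triangle.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle_plus = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
```

For any tree the procedure returns the empty set, which is why such graphs fall outside the setting where the rounding step has a core to work with.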

8 Conclusion

We proposed an unsupervised and domain-independent approach to stance detection. Our approach leverages the conversation structure to compute a useful speaker embedding. We demonstrated the benefits of this approach by evaluating it on three datasets and comparing its performance to the state-of-the-art results reported on them. Moreover, we demonstrated how the speaker embedding allows for soft classification, which can be viewed as a confidence measure for the classification results of specific instances. Finally, we explored the relations between the valence expressed in a discussion, the conversational structure, the interaction network, and the participants’ stance. We observed a correlation between stance classification accuracy and the valence levels, as well as a correlation between the accuracy and the size of the network core. These relations will be explored in future work.


References

  • R. Abbott, B. Ecker, P. Anand, and M. Walker (2016) Internet argument corpus 2.0: an sql schema for dialogic social media and the corpora to go with it. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 4445–4452.
  • E. Allaway and K. McKeown (2020) Zero-shot stance detection: a dataset and model using generalized topic representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8913–8931.
  • P. Anand, M. Walker, R. Abbott, J. E. F. Tree, R. Bowmani, and M. Minor (2011) Cats rule and dogs drool!: classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011), pp. 1–9.
  • R. Bar-Haim, L. Edelstein, C. Jochim, and N. Slonim (2017) Improving claim stance classification with lexical knowledge expansion and context utilization. In Proceedings of the 4th Workshop on Argument Mining, Copenhagen, Denmark, pp. 32–38.
  • A. Benton and M. Dredze (2018) Using author embeddings to improve tweet stance classification. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pp. 184–194.
  • D. Chen, J. Du, L. Bing, and R. Xu (2018) Hybrid neural attention for agreement/disagreement inference in online debates. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 665–670.
  • A. Coja-Oghlan, M. Krivelevich, and D. Vilenchik (2007) Why almost all k-colorable graphs are easy to color. Theory of Computing Systems 46, pp. 523–565.
  • C. Conforti, J. Berndt, M. T. Pilehvar, C. Giannitsarou, F. Toxvaerd, and N. Collier (2020) Will-they-won’t-they: a very large dataset for stance detection on twitter. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1724.
  • L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. W. S. Hoi, and A. Zubiaga (2017) SemEval-2017 task 8: rumoureval: determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 69–76.
  • S. Ghosh, K. Anand, S. Rajanala, A. B. Reddy, and M. Singh (2018) Unsupervised stance classification in online debates. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pp. 30–36.
  • M. X. Goemans and D. P. Williamson (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42 (6), pp. 1115–1145. ISSN 0004-5411.
  • A. Hanselowski, P. Avinesh, B. Schiller, F. Caspelherr, D. Chaudhuri, C. M. Meyer, and I. Gurevych (2018) A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1859–1874.
  • K. S. Hasan and V. Ng (2013) Stance classification of ideological debates: data, models, features, and constraints. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 1348–1356.
  • K. S. Hasan and V. Ng (2014) Why are you taking this stance? identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 751–762.
  • S. Hiray and V. Duppada (2017) Agree to disagree: improving disagreement detection with dual grus. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 147–152.
  • K. Joseph, L. Friedland, W. Hobbs, D. Lazer, and O. Tsur (2017) ConStance: modeling annotation contexts to improve stance classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1115–1124.
  • J. Kobbe, I. Hulpuș, and H. Stuckenschmidt (2020) Unsupervised stance detection for arguments from consequences. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 50–60.
  • A. Konjengbam, S. Ghosh, N. Kumar, and M. Singh (2018) Debate stance classification using word embeddings. In International conference on big data analytics and knowledge discovery, pp. 382–395.
  • D. Küçük and F. Can (2020) Stance detection: a survey. ACM Computing Surveys (CSUR) 53 (1), pp. 1–37.
  • J. Leskovec, K. J. Lang, and M. Mahoney (2010) Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, New York, NY, USA, pp. 631–640. ISBN 978-1-60558-799-8.
  • C. Li, A. Porco, and D. Goldwasser (2018) Structured representation learning for online debate stance prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3728–3739.
  • Y. Luo, D. Card, and D. Jurafsky (2021) Detecting stance in media on global warming. arXiv:2010.15149.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) Semeval-2016 task 6: detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 31–41.
  • A. Murakami and R. Raymond (2010) Support or oppose? classifying positions in online debates from reply activities and opinion expressions. In Coling 2010: Posters, Beijing, China, pp. 869–875.
  • M. E. J. Newman (2006) Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23), pp. 8577–8582.
  • A. Porco and D. Goldwasser (2020) Predicting stance change using modular architectures. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 396–406.
  • J. Reichardt and S. Bornholdt (2006) Statistical mechanics of community detection. Phys. Rev. E 74, pp. 016110.
  • D. Seese (1990) Grötschel, M., L. Lovász, A. Schrijver: Geometric Algorithms and Combinatorial Optimization. (Algorithms and Combinatorics. Eds.: R. L. Graham, B. Korte, L. Lovász. Vol. 2), Springer-Verlag 1988, XII, 362 pp., 23 figs., DM 148,-. ISBN 3-540-13624-X. Biometrical Journal 32 (8), pp. 930–930.
  • P. Sobhani, D. Inkpen, and X. Zhu (2017) A dataset for multi-target stance detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 551–557.
  • S. Somasundaran and J. Wiebe (2010) Recognizing stances in ideological on-line debates. In Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pp. 116–124.
  • D. Sridhar, J. Foulds, B. Huang, L. Getoor, and M. Walker (2015) Joint models of disagreement and stance in online debate. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 116–125.
  • Q. Sun, Z. Wang, Q. Zhu, and G. Zhou (2018) Stance detection with hierarchical attention network. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2399–2409.
  • M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas (2010) Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, pp. 2544–2558.
  • J. Thorne, M. Chen, G. Myrianthous, J. Pu, X. Wang, and A. Vlachos (2017) Fake news stance detection using stacked ensemble of classifiers. In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism, pp. 80–83.
  • S. J. Tsang (2020) Issue stance and perceived journalistic motives explain divergent audience perceptions of fake news. Journalism, pp. 1464884920926002.
  • A. Tyagi and K. M. Carley (2020) Divide in vaccine belief in covid-19 conversations: implications for immunization plans. medRxiv.
  • M. A. Walker, J. E. F. Tree, P. Anand, R. Abbott, and J. King (2012) A corpus for research on deliberation and debate. In LREC, Vol. 12, pp. 812–817.
  • M. Walker, P. Anand, R. Abbott, and R. Grant (2012) Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Montréal, Canada, pp. 592–596.
  • L. Wang and C. Cardie (2014) Improving agreement and disagreement identification in online discussions with a socially-tuned sentiment lexicon. In Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Baltimore, Maryland, pp. 97–106.
  • P. Wei, W. Mao, and G. Chen (2019) A topic-aware reinforced model for weakly supervised stance detection. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 7249–7256.
  • P. Wei, N. Xu, and W. Mao (2019) Modeling conversation structure and temporal dynamics for jointly predicting rumor stance and veracity. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4787–4798.
  • B. Xu, M. Mohtarami, and J. Glass (2019) Adversarial domain adaptation for stance detection. arXiv preprint arXiv:1902.02401.
  • J. Yin, N. Narang, P. Thomas, and C. Paris (2012) Unifying local and global agreement and disagreement classification in online debates. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, Jeju, Korea, pp. 61–69.
  • A. Zubiaga, E. Kochkina, M. Liakata, R. Procter, and M. Lukasik (2016) Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2438–2448.