Argumentation is a discussion in which reasons are provided for and against some proposition or proposal (Toulmin1958-TOUTUO-2). It is a crucial activity in decision-making since it encourages critical thinking and motivates people to make fair and informed decisions. The emergence of social media and online argumentation platforms has made it easier for people to express their opinions and debate with other individuals on controversial topics. The reliance on social media and online argumentation platforms as key venues for opinionated discussions (elisa_2020) has motivated increased research in computational argumentation. One area of focus in computational argumentation has been to explore techniques to automatically analyze the characteristics of arguments, such as their persuasiveness (wachsmuth-etal-2017-computational; habernal-gurevych-2016-makes; 10.1145/2872427.2883081; DBLP:conf/argmining/HideyMHMM17). Furthermore, researchers have started building systems to automatically generate arguments to present people with diverse viewpoints in order to help them make more informed decisions (wang-ling-2016-neural; hua-wang-2018-neural; hua-etal-2019-argument; hidey-mckeown-2019-fixed).
This thesis focuses on understanding the factors of persuasion in computational argumentation. Persuasion is the act of presenting arguments to change people’s opinions, values, and behaviors on a controversial topic or event (edsjsr.208777219540601). Theories of persuasive communication are applied in fields such as marketing, advertising, social psychology, and politics (Shrum2012PersuasionIT), and researchers in these areas are interested in understanding the factors that influence the success of persuasive communication. The emergence of social media and online argumentation platforms has made it easier for people to engage in argumentative discussions with others who may hold differing views. Understanding the underlying dynamics of argumentative communication online can help us develop methods to improve the effectiveness of arguments; for example, it could be used to provide feedback that helps users improve the structure of their arguments. Moreover, analyzing these interactions can reveal the factors that influence people’s behavior in the argumentative process. We specifically study persuasion on online debating platforms to gain insight into the factors that govern people’s decision-making in persuasion.
Language is the primary tool used to convey the content of an argument. Therefore, it is a crucial component of persuasion (1984-28616-00119840101; edsovi.00005205.198111000.0000219811101; perelman1971new; van2009examining). It is only natural, then, that the majority of work in computational studies of persuasion has focused on understanding the characteristics of persuasive text, e.g., what distinguishes persuasive from non-persuasive text (10.1145/2872427.2883081; zhang2016conversational; wachsmuth2016using; habernal-gurevych-2016-makes; habernal-gurevych-2016-argument; fang2016learning; DBLP:conf/argmining/HideyMHMM17). However, language is not the only factor in persuading people. Prior research in the Social Sciences and Psychology has shown that the recipients of an argument may form their opinion on an issue based on non-content cues such as the characteristics of the speaker (i.e., the source) and their own predispositions (cialdini2001influence; 1984-28616-00119840101; edsovi.00005205.198111000.0000219811101). For example, the credibility and trustworthiness of a speaker (edsgcl.1760738119951001; edsovi.00005205.198011000.0000119801101) and the prior beliefs of the audience (edsbds.17414237419960101) have been shown to have a substantial effect on persuasive communication. Furthermore, people with strong prior beliefs on controversial issues have been shown to hold biased stances even when presented with empirical evidence: i.e., they tend to find empirical evidence that confirms their prior beliefs more convincing (edsovi.00005205.197911000.0001619791101). Given this evidence from the Social Sciences and Psychology, we believe that accounting for the impact of these factors, in addition to language, is crucial in computational studies of persuasion. This thesis introduces several contributions toward this goal by exploring the following questions:
What is the role of people’s prior beliefs and initial stance on persuasion?
How can we disentangle the effect of source and audience factors in order to understand the characteristics of persuasive language?
What is the effect of social interaction on people’s success at persuasion over time?
Does pragmatic context play an important role in predicting the impact of arguments?
In order to understand the role of speaker and audience effects in persuasion, we primarily look at the following factors of interaction on online argumentation:
Prior beliefs and initial stances of the speaker and the audience.
Language and pragmatic context.
Specifically, we make the following contributions:
A Dataset to Model Source and Audience Factors in Persuasion. One of the main bottlenecks in studying the effect of source and audience factors is the lack of large-scale datasets that contain information about the characteristics of the users. In order to bridge this gap and enable further studies in this area, we present a large-scale dataset (DDO) with a wide range of user information collected from an online debating website (https://www.debate.org). This dataset contains debates on a wide range of controversial topics. Each debate consists of two debaters with opposing views on a controversial topic, who take turns providing their arguments. Figure 1.2 shows an example debate on “Evolution” contained in this dataset. Along with the debate, the dataset also contains votes from the audience evaluating various aspects of the debaters, such as the persuasiveness of their arguments and their overall debating skills. Besides the debates, the dataset also includes information about the debaters and the audience, such as their stance on controversial topics, political and religious ideologies, education level, etc. We obtain this from the self-identified information that the users provide on their profiles.
The Role of Prior Beliefs in Persuasion. The majority of work in computational persuasion has focused on understanding the characteristics of persuasive language. In this thesis, we mainly focus on understanding the effect of user factors on persuasion. We use the debates, votes, and user information available on the DDO dataset to study the effect of prior beliefs in predicting which debater an individual voter will find more persuasive for a given debate. We find that user factors play a critical role in this prediction task. Furthermore, controlling for the effect of user-level factors allows us to investigate characteristics of persuasive language without any influence from these potentially confounding factors.
Effect of Social Interaction on Persuasion Success. Inspired by prior work that shows a strong relationship between a user’s social interaction and their influence on social media (cha2010measuring; 10.1145/1963192.1963250), we study whether success in persuasion might also depend on an individual’s social interaction and engagement. In particular, we study whether users can improve the persuasiveness of their arguments as they gain more experience using the debating platform. We show that a user’s social interaction is an essential factor in predicting their overall success at debating.
Representing Pragmatic Context in Modeling Argument Impact. We present a new dataset to study the effect of the pragmatic and discourse context when determining an argument’s impact. We further propose predictive models that can incorporate pragmatic and discourse context. We find that these models outperform models that rely only on claim-specific linguistic features for predicting the perceived impact of individual claims within a particular line of argument.
1.3 Organization of Thesis
In Chapter 2, we first give an overview of recent developments in computational argumentation, describing recent work on argument analysis and argument generation. In Chapter 3, we discuss the details and statistics of the DDO dataset. With the dataset in hand, in Chapter 4, we present methods that can account for the effects of the speaker and the audience in predicting the persuasiveness of an argument. In Chapter 5, we describe our contributions on understanding the impact of social interaction on persuasion on online debating platforms. In Chapter 6, we propose a new dataset that allows us to study the effect of pragmatic context (i.e., kairos) on assessing the impact of an argument. Finally, in Chapter 7, we summarize our contributions and provide directions for future work.
2.1 Computational Argument Mining
Computational argumentation mining aims to extract argument components and the relationships between them from unstructured text, building on theoretical models of argument (Toulmin1958-TOUTUO-2; walton_reed_macagno_2008). The main goal is to understand the points in an argument and get insights into how these points support or oppose each other. Having a deeper understanding of the structure of the arguments is important for various applications such as debating technologies (PMID:33731946_debater), legal decision-making (10.1145/1276318.1276362), automated essay scoring (ong-etal-2014-ontology), and computer-assisted writing (stab-gurevych-2017-parsing). The identification of argument structure involves several sub-tasks:
Determining the “argumentative” vs. “non-argumentative” parts of the text (10.1145/1276318.1276362).
Classifying argumentative components into categories such as “Claim” or “Premise” (article_moens; stab-gurevych-2017-parsing; chakrabarty-etal-2019-imho).
Identifying relations between argument components (carstens-toni-2015-towards; feng2011classifying; 10.1145/1568234.1568246; Cabrio2013ANL; stab-gurevych-2017-parsing; niculae-etal-2017-argument; park-cardie-2014-identifying; hua-wang-2017-understanding).
Most research in computational argumentation mining has proposed methods for a subset of the subtasks mentioned above. persing-ng-2016-end was among the first to present an end-to-end pipeline approach to determine argumentative components and their relationships using an Integer Linear Programming (ILP) framework. Similarly, stab-gurevych-2017-parsing has proposed a joint model that globally optimizes argument component types and relations using ILP. eger-etal-2017-neural has presented the first end-to-end neural argumentation mining model, obviating the need for hand-crafted features and constraints.
Argumentation mining has been applied to various domains such as persuasive essays, legal documents, political debates, and social media data (dusmanu-etal-2017-argument). For instance, stab-gurevych-2017-parsing has built an annotated dataset of persuasive essays with corresponding argument components and relations. Using this corpus, eger-etal-2017-neural developed an end-to-end neural method for argument structure identification. DBLP:conf/aaai/NguyenL18 has further proposed an end-to-end method to parse argument structure and used argument structure features to improve automated persuasive essay scoring. Furthermore, levy-etal-2014-context has studied context-dependent claim detection by collecting annotations for Wikipedia articles. Using this corpus, rinott-etal-2015-show has investigated the task of automatically identifying the corresponding pieces of evidence given a claim. bar-haim-etal-2017-stance has further proposed the task of claim-stance detection (i.e., given a topic and a set of claims, identifying for each claim whether it supports or opposes the topic) by further annotating Wikipedia articles with stance information. walker-etal-2012-corpus has collected posts from 4forums.com, a debating forum, and has further annotated part of this corpus for various aspects of arguments such as topic, stance, agreement, and sarcasm. park-cardie-2018-corpus has proposed an argument mining corpus built from comments on the Consumer Debt Collection Practices (CDCP) rule by the Consumer Financial Protection Bureau (CFPB), posted on regulationroom.org. Using this corpus, niculae-etal-2017-argument proposed a structured prediction model for argumentation mining.
Although most research in argumentation mining has focused on English monologues, Peldszus2015AnAC has collected a corpus of microtexts in German and used this corpus for argument component detection. Furthermore, basile:hal-01414698 has studied the relation prediction task in Italian news blogs. Similarly, there has been some recent work investigating argumentation mining beyond monologues, i.e., looking at the process of argumentation in dialogues. For example, chakrabarty-etal-2019-ampersand has proposed a method to identify the argument structure in persuasive dialogues that can model both the micro-level (i.e., the structure of a single argument) and the macro-level (i.e., the interplay between the arguments) characteristics of arguments.
Stance detection and argumentation mining are closely related tasks, given that both aim to identify standpoints on a controversial topic from text. In contrast to stance detection, argumentation mining also aims to extract a more fine-grained structure of arguments, identifying claims, premises, and the relationships between them. There has been a great deal of research on identifying the stance of arguments on a controversial topic (sobhani-etal-2015-argumentation; hasan-ng-2013-stance; bar-haim-etal-2017-improving; sun-etal-2018-stance). For example, sobhani-etal-2015-argumentation has shown that using argument structure features improves the performance of stance detection models. wachsmuth-etal-2018-retrieval has further studied retrieval of the best counter-arguments, using arguments opposing the same aspect of the controversial topic. In our work (durmus-etal-2019-determining), we have found that encoding contextual information using the argument structure tree is crucial to achieving state-of-the-art performance for argument stance detection. kobbe-etal-2020-unsupervised has proposed an unsupervised method to assess the stance of arguments by inferring whether the outcome they argue for is good or bad.
For a more detailed discussion of the argumentation mining literature, refer to comprehensive surveys by 10.4018/jcini.2013010101, ijcai2018-766, and lawrence-reed-2019-argument.
2.2 Computational Studies of Persuasion
Understanding the characteristics of persuasive language has been of great interest in computational studies of persuasion. Most of the work in this domain has focused solely on language (DBLP:conf/argmining/HideyMHMM17; habernal-gurevych-2016-makes; guerini-etal-2015-echoes; li-etal-2020-exploring-role; el-baff-etal-2020-analyzing; 10.1145/2872427.2883081; atkinson-etal-2019-gets; morio-etal-2019-revealing; 10.5555/3171837.3171856). For instance, habernal-gurevych-2016-argument has collected a new corpus to study the task of predicting which argument from an argument pair is more convincing. zhang2016conversational has studied the role of conversational flow and the interplay between debaters on persuasion in Oxford-style debates. DBLP:conf/aaai/HideyM18 has further investigated the role of larger context on persuasion, modeling the sequence of arguments in a discussion thread on “Change My View” (CMV), a discussion forum on Reddit. wang-etal-2019-persuasion has investigated which types of persuasion strategies have a greater impact in convincing people to donate to a specific charity; this work is a first step toward building personalized persuasive dialogue systems. Furthermore, to study whether particular types of people find particular argument styles more convincing, lukin2017argument has collected a new corpus of personality information and belief change in socio-political arguments. They have shown that belief change is affected by personality factors: for example, conscientious people are more convinced by dialogic, emotional arguments, while agreeable people are more likely to be persuaded by dialogic, factual arguments. Inspired by this line of research, this dissertation further investigates the effect of source and audience factors in persuasion by asking the following questions:
How do the prior beliefs of the audience affect the process of persuasion?
Do social interactions play an essential role in people’s success in online argumentation?
How can we measure the relative impact of source and audience factors, language, and the pragmatic context in computational studies of persuasion and argument impact prediction?
Prior work has also investigated the related tasks of argument quality assessment and argument impact prediction (persing-ng-2015-modeling; el-baff-etal-2018-challenge; wachsmuth-etal-2017-argumentation). For example, persing-ng-2015-modeling has introduced a corpus of argumentative student essays annotated with argument strength scores. They have further proposed a supervised, feature-based model to automatically score the essays based on argument strength. wachsmuth-etal-2017-computational has studied logical, rhetorical, and dialectical quality dimensions and proposed a taxonomy of argumentation quality along these dimensions. el-baff-etal-2018-challenge has explored argument quality in news editorials, collecting annotations for the perceived effect of editorials from the New York Times. We have further explored the role of pragmatic context in predicting the perceived impact of arguments on online argumentation platforms (durmus-etal-2019-role).
2.3 Argument Generation
Argumentation is a significant part of a wide range of human activities: humans are constantly confronted with situations in which they are trying to persuade or are being persuaded. A major goal of computational argumentation is to build systems that can have meaningful debates and argumentative interactions with humans. Recent work in the area has made progress toward this goal through the automated generation of argumentative text (zukerman-etal-2000-using; sato-etal-2015-end; hua-wang-2018-neural; bar-haim-etal-2020-arguments). zukerman-etal-2000-using and alshomary-etal-2020-target have proposed Bayesian argument generation systems that generate arguments given the corresponding argumentation strategies. sato-etal-2015-end has presented a sentence-retrieval-based end-to-end argument generation system that can participate in English debating games. hua-etal-2019-argument has explored a neural counter-argument generation method that consists of a text planning decoder, which selects the main talking points, and a content realization decoder, which generates an argument from those talking points. hidey-mckeown-2019-fixed has further proposed a neural model that semantically edits the original claim to produce a claim with an opposing stance. Similarly, hua-wang-2018-neural has studied the task of generating arguments of a different stance for a given argument. They have further incorporated external knowledge into the encoder-decoder architecture and have shown that their model can generate arguments that are more likely to be on topic. wang-ling-2016-neural and bar-haim-etal-2020-arguments have investigated the problem of summarizing the key points of an argument. Most recently, PMID:33731946_debater has proposed an autonomous debating system (Project Debater) that can engage in a competitive debate with humans, built as a pipeline of four main modules: argument mining, an argument knowledge base (AKB), argument rebuttal, and debate construction.
They have shown that their debating system can hold its own against human debaters. However, they highlight the difficulty of fully achieving this end goal for the following reasons:
The outcome of a debate (i.e., the selection of the winner) is highly subjective and open to interpretation, since it may depend on the characteristics of the audience.
Unlike other games such as chess (CAMPBELL200257) or backgammon (Tesauro1995), humans would expect to be able to interpret every move of the system since they vote to decide the winner of the debate.
There are a limited number of structured debate datasets to train such systems.
3.1 Related Work and Datasets
There has been a tremendous amount of research effort to understand the important linguistic features for identifying argument structure and determining effective argumentation strategies in a monologic text (article_moens; feng2011classifying; Stab2014IdentifyingAD; guerini-etal-2015-echoes). For example, habernal-gurevych-2016-makes has experimented with different machine learning models to predict which of the two given arguments is more convincing. To understand what kind of persuasive strategies are effective, DBLP:conf/argmining/HideyMHMM17 has further annotated different modes of persuasion (i.e., ethos, logos, pathos) and looked at which combinations appear most often in more persuasive arguments.
Understanding argumentation strategies in conversations and the effect of the interplay between the participants’ language has also been an important avenue of research. 10.1145/2872427.2883081, for example, has examined the effectiveness of arguments on ChangeMyView (https://www.reddit.com/r/changemyview/), a debate forum in which people invite others to challenge their opinions. They found that the interplay between the language of the opinion holder and that of the counterargument provides highly predictive cues of persuasiveness. zhang2016conversational has examined the effect of conversational style in Oxford-style debates and found that the side that best adapts in response to the opponent’s discussion points throughout the debate is more likely to be persuasive.
Although research on computational argumentation has mainly focused on identifying important linguistic features of the argument, there is also evidence that it is important to model and account for information about the debaters and the people who judge the quality of the arguments: multiple studies in the Social Sciences and Psychology show that people perceive arguments from different perspectives depending on their backgrounds and experiences (correll2004affirmed; hullett2005impact; petty1981personal; lord1979biased; vallone1985hostile). lukin2017argument is one of the first to computationally study the impact of the audience, looking at the effect of their OCEAN personality traits (doi:10.1177/0146167202289008; norman_1963) on how they judge the persuasiveness of monologic arguments. This work is the most similar to ours in that it explores the effect of users’ personalities on the persuasion process. Our dataset does not have explicit information about users’ personality traits; however, we have extensive information about their demographics, social interactions, beliefs, and language use. durmus-cardie-2019-corpus describes the details of this dataset.
3.2 DDO Dataset
We collected debates from debate.org (DDO; https://www.debate.org) across different topic categories, including Politics, Religion, Health, Science, and Music. (The dataset is publicly available at http://www.cs.cornell.edu/~esindurmus/.) DDO is an online argumentation platform where people can engage in debates, participate in forums and polls, and post their opinions on controversial topics. Participating in debates gives users an opportunity to challenge other users and try to change their opinions. After participating in debates, they receive feedback from the audience on the platform. This feedback mechanism helps users develop strategies to improve their debating skills over time. In addition to the text of the debates, we collected votes from the readers of these debates. Votes evaluate different dimensions of the debate, and they are important for determining which debaters are more successful at persuading other users.
Each user creates a profile on this platform to share information about their background and preferences. To study the effect of user characteristics on persuasion, we collected this profile information for each user. In the next section, we provide more details about the debates and the user information in this dataset.
Debate rounds. Each debate consists of a sequence of rounds in which two debaters from opposing sides (one is supportive of the claim (i.e., pro) and the other is against the claim (i.e., con)) provide their arguments. Each debater has a single chance in a round to make their points. Figure 3.2.1 shows an example round 1 for the debate claim “Preschool Is A Waste Of Time”. The number of rounds in a debate ranges from to , and most debates contain or more rounds. The goal of the debaters in each round is to provide arguments that would refute the opponent’s points and convince readers to side with their stance.
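As a rough illustration, the round-based debate structure described above could be represented as follows. This is a hypothetical sketch: the class and field names are illustrative assumptions, not the dataset’s actual schema.

```python
# A minimal sketch of the round-based debate structure described above.
# All class and field names are hypothetical, not the dataset's schema.
from dataclasses import dataclass, field

@dataclass
class Round:
    pro_argument: str  # the pro debater's single turn in this round
    con_argument: str  # the con debater's single turn in this round

@dataclass
class Debate:
    claim: str            # e.g., "Preschool Is A Waste Of Time"
    pro_debater: str      # username of the debater supporting the claim
    con_debater: str      # username of the debater opposing the claim
    rounds: list = field(default_factory=list)  # ordered Round objects
```

Under this representation, a debate is simply the claim, the two opposing debaters, and an ordered sequence of rounds, each containing exactly one turn per side.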
All users in the debate.org community can vote on debates. As shown in Figure 3.2.1, voters share their stances on the debate topic before and after the debate and evaluate the debaters’ conduct, spelling and grammar, persuasiveness, and reliability of the sources they refer to. For each such dimension, voters can choose one of the debaters as better or indicate a tie.
The audience scores the debaters on these different aspects, and a winner is declared accordingly (having better conduct: 1 point; having better spelling and grammar: 1 point; making more convincing arguments: 3 points; using the most reliable sources: 2 points). This fine-grained voting system gives a glimpse into the reasoning behind the voters’ decisions.
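The point system above can be sketched as a small winner computation. This is an illustrative reconstruction of the scoring rule, not the platform’s actual code; the function and field names are assumptions.

```python
# Hypothetical sketch of how a debate winner could be computed from the
# fine-grained votes described above. Only the point weights come from
# the platform's documentation; names here are illustrative.

# Points awarded per voting dimension.
POINTS = {
    "conduct": 1,
    "spelling_grammar": 1,
    "convincing_arguments": 3,
    "reliable_sources": 2,
}

def tally(votes):
    """Sum points for 'pro' and 'con' across all votes.

    Each vote maps a dimension to the debater chosen as better
    ('pro' or 'con') or to 'tie', in which case neither side scores.
    """
    scores = {"pro": 0, "con": 0}
    for vote in votes:
        for dimension, choice in vote.items():
            if choice in scores:
                scores[choice] += POINTS[dimension]
    return scores

def winner(votes):
    scores = tally(votes)
    if scores["pro"] == scores["con"]:
        return "tie"
    return max(scores, key=scores.get)
```

For example, a vote that awards conduct to pro but convincing arguments to con contributes 1 point to pro and 3 points to con, reflecting that argument quality is weighted most heavily.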
3.2.2 User information
The dataset includes extensive information about the users’ demographics and private state, their activity on this platform, and their stance on various controversial topics. In this section, we describe the user information that is available in this dataset.
Demographic and Private State Information
On debate.org, each user has the option to share demographic and private state information such as their age, gender, ethnicity, political ideology, religious ideology, income level, education level, and the political party they support. Figure 3.2.1 provides an example of the demographic and private state information included in a user profile. We can see that these users share their political ideology, ethnicity, education, religious ideology, etc. However, because sharing demographic and private state information is optional, they choose not to share some information about themselves, such as their birthday, email, and income level.
User Activity Information
Beyond the demographic and private state information, we have access to information about users’ activities on the website, such as their debating success rate, their participation both as debaters and as voters, their votes, their forum posts, opinion arguments, opinion questions, poll votes, and the poll topics they created. The activities of an example user are shown in Figure 3.2.2. The availability of this information provides an opportunity to study users’ interactions and success on this platform over time.
User Opinions on the Big Issues
The editors of the platform curate a list of the most controversial debate topics, referred to as big issues (http://www.debate.org/big-issues/). Each user has the option to share their stance on each big issue on their profile (see Figure 3.2.2): either pro (in favor), con (against), n/o (no opinion), n/s (not saying), or und (undecided). This gives a glimpse into users’ prior stances on a wide range of controversial topics. Moreover, this information can be used to determine the opinion similarity between a pair of users.
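For instance, one simple (hypothetical) way to quantify opinion similarity from these stances is the fraction of shared big issues, among those where both users take a pro or con side, on which the two users agree:

```python
# Hypothetical sketch of an opinion-similarity measure over big-issue
# stances. The measure and all names here are illustrative assumptions.

OPINIONATED = {"pro", "con"}  # 'n/o', 'n/s', and 'und' carry no stance

def opinion_similarity(stances_a, stances_b):
    """Fraction of shared opinionated big issues on which two users agree.

    Each argument maps issue -> stance ('pro', 'con', 'n/o', 'n/s', 'und').
    Returns None if the users share no issue on which both take a side.
    """
    shared = [
        issue
        for issue in stances_a.keys() & stances_b.keys()
        if stances_a[issue] in OPINIONATED and stances_b[issue] in OPINIONATED
    ]
    if not shared:
        return None
    agree = sum(1 for issue in shared if stances_a[issue] == stances_b[issue])
    return agree / len(shared)
```

Restricting the comparison to issues where both users express a pro or con stance avoids treating “no opinion” or “not saying” as either agreement or disagreement.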
3.3 Data Statistics
The dataset consists of 78,376 debates from October of 2007 until November of 2017 with comprehensive user profile information for 45,348 users. Statistics on the number of debates with their corresponding number of rounds and votes are shown in Figure 3.3 and Figure 3.3, respectively. The majority of debates have 3 to 5 rounds. There are some debates with only one round; however, most debates have two or more rounds since the debates are highly interactive.
Although there are many debates with no votes, around 21k debates have three or more votes. In our studies, we disregard debates with fewer than three votes in order to have enough audience feedback to model the factors of success in persuasion.
Figure 3.3 shows the number of debates that users participated in. The majority have participated in only a single debate. However, some users actively participate in many debates. For example, around 2k debaters have participated in more than ten debates during the period included in the dataset. We study these debaters to understand the factors of debating success over time.
The dataset includes comprehensive information about users on the platform, which allows us to model user factors in persuasion. However, we acknowledge that we are unable to represent all demographics due to a lack of data. Participation on the platform tends to be highly skewed towards an American audience. Moreover, even within this group, the distribution of user characteristics may not be representative enough. Therefore, some valid opinions may be under-represented, and this should be accounted for while employing models derived from this data. Furthermore, we assume that the information users share on their profiles is accurate, and we use this information to model their characteristics. However, there is no mechanism on this platform to ensure that users provide accurate information.
3.5 Chapter Summary
In this chapter, we present a novel dataset, DDO, of debates collected from debate.org. The dataset includes interactive debates along with votes from the audience to evaluate various aspects of each debater. Moreover, the dataset has comprehensive information about the users on the platform. This allows us to study the effect of source and audience factors in persuasion (Chapter 4). We further use this dataset to model the impact of social interactions on long-term success in online debating (Chapter 5).
Most of the recent work in computational persuasion has focused on identifying the characteristics of persuasive language (habernal-gurevych-2016-makes; DBLP:conf/argmining/HideyMHMM17). However, there is evidence from the Social Sciences and Psychology that non-content cues, such as characteristics of the speaker and the audience, play an essential role in persuasion and opinion formation. Instead of carefully processing the content of the arguments, people may rely on simple non-content heuristics in decision making (edsovi.00005205.197911000.0001619791101). Understanding the effect of persuasion strategies on people, the biases people have, and the impact of people’s prior beliefs on their opinion change has been an active area of research (correll2004affirmed; hullett2005impact; petty1981personal).
Prior work has shown that the speaker’s credibility is an essential factor in people’s perception of arguments (edsgcl.1760738119951001; edsovi.00005205.198011000.0000119801101). For example, there is a significant correlation between communication speed and the persuasive effect of arguments: audiences perceive a communicator with a faster communication rate as more credible, without really focusing on the content of the arguments (mcguire1985attitudes; edsgcl.1760738119951001). Furthermore, edsovi.00005205.198011000.0000119801101 has studied the effect of a communicator’s perceived likability on opinion formation and found that low-involvement subjects perceive the arguments of likable communicators as more persuasive. High-involvement subjects (i.e., subjects who feel that their opinion judgments have important consequences for themselves) have been shown to use a more systematic strategy that assigns a higher weight to message content in opinion formation. A communicator’s perceived attractiveness is also positively correlated with their persuasiveness, since the audience perceives more attractive communicators as more effective (1980-32482-00119790801; eagly1975attribution).
There is further evidence that people's prior beliefs significantly affect their opinion formation (edsbds.17414237419960101). People with strong prior beliefs on controversial issues have been shown to exhibit biased stances even when presented with empirical evidence: they tend to find empirical evidence that confirms their prior beliefs more convincing (edsovi.00005205.197911000.0001619791101). Similarly, people judge the fairness and reliability of source content in a biased way; i.e., they accept evidence that supports their stance at face value while scrutinizing evidence that threatens their initial position (vallone1985hostile). Inspired by these findings, we study the impact of prior beliefs in computational persuasion in this chapter. lukin2017argument is the most relevant work to ours, since they investigated the effect of an individual's personality features (open, agreeable, extrovert, neurotic, etc.) on the type of argument (factual vs. emotional) they find more persuasive. Our work differs in that we study debates and look at different types of user profile information, such as a user's religious and ideological beliefs and the audience's prior beliefs and opinions on various topics (durmus-etal-2019-role; longpre-etal-2019-persuasion).
4.2 Role of Prior Beliefs in Computational Persuasion
Using the DDO dataset (described in Chapter 3), we first analyze which dimensions of argument quality are the most important for determining the successful debater. Then, we investigate whether there is any connection between selected user-level factors and users’ opinions on the big issues to see if we can infer their opinions from these factors. Finally, using our findings from these analyses, we perform the task of predicting which debater will be perceived as more successful by a given voter. In this study, we particularly aim to understand the role of users’ prior beliefs (i.e., their self-identified political and religious ideology) in predicting the more successful debater.
4.2.1 Relationships between argument quality dimensions
In Section 3.2.1, we describe the aspects the voters evaluate in order to determine which debater is more successful. There are two alternative criteria for determining the successful debater. We consider both in our experiments.
Criterion 1: Argument quality. Debaters get points for each dimension of the debate. The most important dimension — in that it contributes most to the point total — is making convincing arguments. The debater with the highest point total is declared the winner. debate.org uses Criterion 1 to determine the winner of a debate.
Criterion 2: Convinced voters. Alternatively, since voters share their stances before and after the debate, the debater who convinces more voters to change their stance can be considered the winner.
Figure 4.2.1 shows the correlation between pairs of voting dimensions (in the first eight rows/columns), along with the correlation of each dimension with (1) getting the highest point total (row/column 9) and (2) convincing more voters to change their stance (final row/column). The abbreviations in Figure 4.2.1 stand for (on the con side): has better conduct (cbc), makes more convincing arguments (cca), uses more reliable sources (crs), has better spelling and grammar (cbsg), gets more total points (cmtp), and convinces more voters (ccmv). For the pro side we use pbc, pca, and so on.
From Figure 4.2.1, we can see that making more convincing arguments (cca) correlates the most with total points (cmtp) and convincing more voters (ccmv). This suggests that the language of the argument is important in persuading the audience, and it motivates us to identify the linguistic features that are indicative of convincing arguments while taking into account speaker and audience factors.
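A correlation matrix of this kind can be computed directly with `numpy.corrcoef`. The sketch below uses toy 0/1 vote indicators (the variable names echo the abbreviations above, but the data is synthetic, not from the DDO dataset):

```python
import numpy as np

# Toy vote indicators for three voting dimensions across 50 debates (0/1).
rng = np.random.default_rng(0)
cca = rng.integers(0, 2, 50)           # makes more convincing arguments
cmtp = cca ^ (rng.random(50) < 0.1)    # total points: mostly agrees with cca
cbc = rng.integers(0, 2, 50)           # better conduct: independent here

# Pairwise Pearson correlations, one row/column per dimension.
corr = np.corrcoef(np.vstack([cca, cmtp, cbc]))
```

Because `cmtp` was constructed to agree with `cca` on roughly 90% of debates, their correlation entry is high, while the independent `cbc` dimension correlates with neither.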
4.2.2 The relationship between a user’s opinions on the big issues and their prior beliefs
As described in Section 3.2.2, users share their self-identified political and religious ideologies along with their opinions on various controversial issues (i.e., big issues). Note that many people prefer not to share their political and religious ideologies. Figures 4.2.2 and 4.2.2 show the number of users who self-identify with the given political or religious ideology.
We disentangle different aspects of a person’s prior beliefs in order to understand how they correlate with their opinions on the big issues. We focus on prior beliefs in the form of self-identified political and religious ideology.
Representing the big issues. To represent a user's opinion on a particular big issue, we use a four-dimensional one-hot vector whose dimensions correspond to pro, con, n/o (no opinion), and und (undecided), consecutively. Note that we do not have a representation for n/s (not saying), since we eliminate users who indicate n/s for any of the big issues. We then concatenate the vectors for all the big issues to represent a user's stance on all of them, as shown in Figure 4.2.2. We denote this vector by BigIssues.
We test the correlation between an individual’s opinion on the Big Issues and the selected user-level factors in this study using two approaches: clustering and classification.
Clustering the users’ decisions on the big issues. We apply PCA on the BigIssues vector of users who identified themselves as conservative vs. liberal ( users). We do the same for the users who identified themselves as atheist vs. christian ( users). In Figure 4.2.2, we see distinct clusters of conservative vs. liberal users in the two-dimensional representation, while for atheist vs. christian, the separation is not as distinct. This suggests that people’s opinions on the big issues identified by debate.org correlate more with their political ideology than their religious ideology.
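The two-dimensional projection step can be sketched with a small PCA implemented via SVD; random toy data stands in for the real BigIssues vectors here:

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto their top-2 principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # coordinates in the top-2 PC basis

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 48)).astype(float)  # toy 0/1 opinion vectors
X_2d = pca_2d(X)                                 # shape (100, 2); plot to inspect clusters
```

Plotting `X_2d` colored by self-identified ideology is what reveals (or fails to reveal) the separation discussed above.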
Classification approach. We can also treat this as a classification task, using the BigIssues vector for each user as the input feature and the user's religious and political ideology as the labels to be predicted. (For all the classification tasks described in this chapter, we experiment with logistic regression, optimizing the regularizer (ℓ1 or ℓ2) and the regularization parameter C.) Table 4.2.2 shows the prediction accuracy for religious and political ideology. We see that using the BigIssues vector as a feature performs significantly better (by the McNemar significance test) than the majority baseline, which predicts conservative for political ideology and christian for religious ideology for each example.
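With scikit-learn, this classification setup might look like the following sketch. The data is synthetic (a toy label correlated with one opinion dimension); the real inputs are the BigIssues vectors and the labels are self-identified ideologies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 48)).astype(float)  # toy BigIssues-like vectors
y = X[:, 0].astype(int)                               # toy ideology label

# Tune the regularizer (l1 or l2) and C by cross-validated grid search.
grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10, 100]}
clf = GridSearchCV(LogisticRegression(solver="liblinear"), grid, cv=3)
clf.fit(X, y)
```

The `liblinear` solver is chosen here because it supports both ℓ1 and ℓ2 penalties.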
| Prior belief type | Majority | BigIssues |

Accuracy using the majority baseline vs. BigIssues vectors as features.
This analysis shows a clear relationship between people’s opinions on the big issues and the selected user-level factors. It raises the question of whether it is even possible to persuade someone to change their stance on a given issue. It may be the case that people prefer to agree with the individuals with the same (or similar) beliefs regardless of the quality of opposing arguments. Therefore, it is crucial to understand the relative effect of prior beliefs vs. argument strength on persuasion.
4.2.3 Task formulation
Some previous work in NLP on persuasion focuses on predicting the winner of a debate as determined by the change in the number of people supporting each stance before and after the debate (zhang2016conversational; potash2017towards). However, we believe that studies of the effect of language on persuasion should consider extra-linguistic factors that can affect opinion change. In particular, we propose an experimental framework for studying the effect of language on persuasion by controlling for the prior beliefs of the audience. In order to do this, we formulate a more fine-grained prediction task: for a given voter, predict which side/debater/argument the voter will declare as the winner.
Task 1: Controlling for religious ideology. In the first task, we control for religious ideology by selecting debates where the debaters have differing religious ideologies (e.g., debater 1 is atheist, debater 2 is christian). Also, we only consider voters that (a) self-identify with one of these religious ideologies (e.g., the voter is either atheist or christian) and (b) changed their stance on the topic after the debate. For each such voter, we want to predict which debater did the convincing. Thus, in this task, we use Criterion 2 to determine the winner of the debate from the voter’s point of view. We hypothesize that a voter will be convinced by the debater that espouses the religious ideology of the voter. Given this setting, we can study the factors that govern whether a debater can convince any given voter. It also provides an opportunity to understand how voters who change their minds perceive arguments from a debater with the same vs. opposing prior beliefs.
To study the effect of the debate topic, we perform this study for two cases — debates belonging to the Religion category only vs. all categories. The Religion category contains debates like “Is the Bible against women’s rights?” and “Religious theories should not be taught in school”. We expect to see a stronger effect due to prior beliefs for debates on Religion.
Task 2: Controlling for political ideology. Similar to the setting described above, Task 2 controls for political ideology. In particular, we only select debates where the debaters have differing political ideologies (conservative vs. liberal). In contrast to Task 1, we consider all voters that self-identify with any of the debater’s ideologies (regardless of whether the voter’s stance changed post-debate). For this task, we predict which debater will get assigned more points from a given voter. Thus, Task 2 uses Criterion 1 to determine the winner of the debate from the point of view of a voter. We hypothesize that a voter will assign more points to a debater who shares the same political ideology.
Similar to task 1, we perform the study for two cases — debates from the Politics category only and debates from all categories. We expect to see a stronger effect due to prior beliefs for debates on Politics.
User-based features:
- Opinion similarity. For a voter v and a debater d, the cosine similarity of BigIssues(v) and BigIssues(d).
- Matching features. For a voter v and a debater d, 1 if the voter's ideology equals the debater's and 0 otherwise, for ideology ∈ {political ideology, religious ideology}. We denote these features as matching political ideology and matching religious ideology.

Linguistic features:
- Length. Number of tokens.
- Tf-idf. Unigram, bigram, and trigram features.
- Referring to the opponent. Whether the debater refers to their opponent using words or phrases like "opponent" or "my opponent".
- Politeness cues. Whether the text includes any signs of politeness such as "thank" and "welcome".
- Showing evidence. Whether the text has any signs of citing other sources (e.g., phrases like "according to") or quotations.
- Sentiment. Average sentiment polarity.
- Subjectivity (wilson2005recognizing). Number of words with negative strong, negative weak, positive strong, and positive weak subjectivity.
- Swear words. Number of swear words.
- Connotation score (feng2011classifying). Average number of words with positive, negative, and neutral connotation.
- Personal pronouns. Usage of first, second, and third person pronouns.
- Modal verbs. Usage of modal verbs.
- Argument lexicon features. Number of phrases corresponding to different argumentation styles.
- Spelling. Number of spelling errors.
- Links. Number of links.
- Numbers. Number of numbers.
- Exclamation marks. Number of exclamation marks.
- Questions. Number of questions.
The features we use in our model are shown in Table 4.2.3. They can be divided into two groups — features that describe the prior beliefs of the users and linguistic features of the arguments.
We use cosine similarity between a voter and a debater’s big issue vectors. This feature gives an approximation of the overall similarity of two users’ opinions. We also use indicator features to encode whether the religious and political beliefs of a voter match that of a debater.
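A minimal sketch of these user-based pair features (the dictionary field names are illustrative, not the dataset's actual schema):

```python
import numpy as np

def user_pair_features(voter, debater):
    """Prior-belief features for a (voter, debater) pair."""
    v = np.asarray(voter["big_issues"], dtype=float)
    d = np.asarray(debater["big_issues"], dtype=float)
    # Cosine similarity of the two BigIssues vectors.
    opinion_sim = float(v @ d / (np.linalg.norm(v) * np.linalg.norm(d)))
    return {
        "opinion_similarity": opinion_sim,
        "matching_political": int(voter["political"] == debater["political"]),
        "matching_religious": int(voter["religious"] == debater["religious"]),
    }
```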
We extract linguistic features separately for both the pro and con side of the debate (combining all the utterances of each side across the different turns). Table 4.2.3 contains a list of these features. It includes features that carry information about the style of the language (e.g., usage of modal verbs, length, punctuation), features that represent different semantic aspects of the argument (e.g., showing evidence, connotation (feng2011classifying), subjectivity (wilson2005recognizing), sentiment, swear word features), as well as features that convey different argumentation styles (argument lexicon features (somasundaran2010recognizing)). Argument lexicon features include the counts for phrases that match various argumentation styles such as assessment, authority, conditioning, contrasting, emphasizing, generalizing, empathy, inconsistency, necessity, possibility, priority, rhetorical questions, desire, and difficulty. We then concatenate these features to get a single feature representation for the entire debate.
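A few of the simpler style features can be sketched as follows. The lexicons here are toy stand-ins: the thesis relies on established resources (e.g., the subjectivity and connotation lexicons cited above), so treat this purely as an illustration of the feature-extraction pattern:

```python
import re

SWEAR = {"damn", "hell"}              # toy swear lexicon (assumed)
EVIDENCE = ("according to", '"')      # citation / quotation cues

def style_features(text):
    """A few style features for one side's concatenated utterances."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        "length": len(tokens),
        "refers_to_opponent": int("opponent" in tokens),
        "politeness": int(any(w in tokens for w in ("thank", "welcome"))),
        "shows_evidence": int(any(cue in text.lower() for cue in EVIDENCE)),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "swear_words": sum(t in SWEAR for t in tokens),
        "modals": sum(t in {"can", "should", "will", "may", "must"} for t in tokens),
    }

style_features("Thank you! According to my sources, my opponent is wrong.")
```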
4.3 Results and Analysis
For each of the tasks, prediction accuracy is evaluated using 5-fold cross-validation. We pick the model parameters for each split with 3-fold cross-validation on the training set. We perform ablation over the user-based and linguistic features and report results for the feature sets that perform better than the baseline.
We perform analysis by training logistic regression models using only user-based features, only linguistic features, and finally combining user-based and linguistic features for both the tasks.
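This evaluation protocol, an outer 5-fold cross-validation with hyperparameters tuned by inner 3-fold cross-validation on each training split, can be sketched with scikit-learn on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = (X[:, 0] > 0).astype(int)   # toy labels determined by one feature

# Inner loop: pick the regularizer and C by 3-fold CV on each training split.
inner = GridSearchCV(LogisticRegression(solver="liblinear"),
                     {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]}, cv=3)
# Outer loop: 5-fold CV reports the accuracy of the tuned model.
scores = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=0))
```

Nesting the search inside the outer folds keeps the reported accuracy honest: hyperparameters are never chosen using the held-out test fold.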
| Matching religious ideology | % |
| The two features above | % |
| User-based + Linguistic features | |
| user* + Personal pronouns | % |
| user* + Connotation | % |
| user* + language* | % |

Results for Task 1 for debates in category Religion. user* represents the best-performing combination of user-based features; language* represents the best-performing combination of linguistic features. Since using linguistic features only would give the same prediction for all voters in a debate, the maximum accuracy that can be achieved using language features only is %.
Task 1 for debates in category Religion. As shown in Table 4.3, the majority baseline (predicting the winning side of the majority of training examples, pro or con) gets % accuracy. User features alone perform significantly better than the majority baseline. The most important user-based feature is matching religious ideology, meaning it is very likely that people change their views in favor of a debater with the same religious ideology. In a linguistic-only feature analysis, the combination of the personal pronoun and connotation features emerges as most important and performs significantly better than the majority baseline, with % accuracy. When we use both user-based and linguistic features, the accuracy improves to % with connotation features. Interestingly, combining the user-based features with the linguistic features changes the set of linguistic features that are important for persuasion, removing personal pronouns from that set. This shows the importance of studying potentially confounding user-level factors.
| Matching religious ideology | % |
| Length (the best-performing linguistic feature) | % |
| User-based + Linguistic features | |
| user* + Length | % |

Results for Task 1 for debates in all categories. The maximum accuracy that can be achieved using language features only is %.
Task 1 for debates in all categories. As shown in Table 4.3, for experiments with user-based features only, the matching religious ideology and opinion similarity features are the most important. For this task, length is the most predictive linguistic feature and significantly improves over the baseline (%). When we combine the language features with user-based features, we see that with the exclamation mark feature, the accuracy improves further (%).
| Matching political ideology | % |
| Linguistic feature set | % |
| User-based + Linguistic features | |
| user* + linguistic feature set | % |

Results for Task 2 for debates in category Politics. The maximum accuracy that can be achieved using linguistic features only is %. The linguistic feature set includes rhetorical questions, emphasizing, approval, exclamation marks, questions, politeness, referring to the opponent, showing evidence, modals, links, and numbers as features.
|User-based + Linguistic features|
Results for Task 2 for debates in all categories. The maximum accuracy that can be achieved using linguistic features only is %.
Task 2 for debates in category Politics. As shown in Table 4.3, using user-based features only, the matching political ideology feature performs best (%). Linguistic features (refer to Table 4.3 for the full list) alone still obtain significantly better accuracy than the baseline (%). The most important linguistic features include approval, politeness, modal verbs, punctuation, and argument lexicon features such as rhetorical questions and emphasizing. When combining this linguistic feature set with the matching political ideology feature, accuracy improves further (%). The length feature does not improve performance when combined with the user features.
Task 2 for debates in all categories. As shown in Table 4.3, when we include all categories, we see that the best performing user-based feature is the opinion similarity feature (%). When using language features only, the length feature (%) is the most important. For this setting, the best accuracy is achieved when we combine user features with length and Tf-idf features. We see that the set of language features that improves the performance of user-based features does not include some of the features that performed significantly better than the baseline when used alone (modal verbs and politeness features).
4.4 Persuasion of the Undecided
Research in psychology and political science suggests that there are critical differences in the persuasion of undecided versus decided voters and audience members. For example, petty96 found that prior experiences and beliefs can lead a person to re-frame a perceived message so as to maintain consistency between their prior beliefs and their attitudes towards the topic of the message. In particular, studies show that a priori decided voters simply ignore certain information to maintain this consistency (sween; vecc; kos14). In contrast, an undecided voter must decide on an issue for which previously received information was somehow unconvincing; prior work has shown that, as a result, these voters are likely to rely heavily on information conveyed in a new message (kos10; kos14; schill).
Furthermore, the undecided voter group holds the highest potential for persuasion (kos10; sheh17). Public support for social and political causes often critically depends on the undecided decision-makers. Therefore, in our work, we explicitly study the factors that govern persuasion for a priori undecided versus decided members of the audience (longpre-etal-2019-persuasion).
4.4.1 Task Formulation
We aim to study the most important factors influencing audience members to be persuaded to one side or the other for each persuasion case (a priori undecided or decided). Encoding audience-level and linguistic factors as features, we structure the prediction task as follows: given an individual voter, predict which debater/side (PRO or CON) the voter will be convinced by after the debate. We experiment with the features described in Section 5.2.3.
We consider only samples from the data where (1) a voter was undecided before the debate and then adopted a stance, i.e., voted for one of the debaters as the winner (from-middle); or (2) a voter was (seemingly) decided beforehand and then flipped their stance (from-opposing). We do not consider samples where (1) a voter declared a "tie" between the debaters after the debate, or (2) a voter was decided beforehand and voted for the debater with the stance that they agreed with beforehand. To study the effect of the debaters' linguistic and user-based features on persuasion, we specifically look at which side (PRO vs. CON) did the convincing for a particular voter. Figure 4.4.1 illustrates example user votes for each of the two cases. Distinguishing instances of voters being persuaded into these case groupings allows us to examine what makes an argument persuasive to undecided versus decided audience members. Table 4.4.1 summarizes the dataset statistics relevant to the voter cases.
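The sample selection above amounts to a small routine. The vote encodings below ('pro', 'con', 'und', 'tie') are assumed labels for illustration, not necessarily the dataset's exact field values:

```python
def persuasion_case(pre_vote, post_vote):
    """Classify a voter into a persuasion case, or None if excluded.
    pre_vote / post_vote: the voter's stance before / after the debate."""
    if post_vote == "tie":
        return None                 # excluded: declared a tie
    if pre_vote == "und":
        return "from-middle"        # undecided before, picked a side after
    if pre_vote != post_vote:
        return "from-opposing"      # decided before, flipped after
    return None                     # excluded: kept the same stance

persuasion_case("und", "pro")  # 'from-middle'
```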
| Persuasion case | # instances | # debates |

Number of voters in the from-middle and from-opposing categories.
4.4.2 Differences Between Persuasion Groups
We find distinct differences in the important features for predicting the outcome for the from-middle and from-opposing voter groups. The best-performing set of linguistic features for from-middle includes all features except the use of citations, referring to the opponent, and swear words, while the best-performing set for from-opposing includes all features except subjectivity, modals (i.e., the usage of modal verbs such as can, should, will, and may), and bi-/tri-gram TF-IDF (calculated with a maximum of 30 terms).
The sets of linguistic features that are important for each of the two groups differ in subtle but interpretable ways. One possible interpretation is that the two groups respond to different rhetorical strategies. The use of modals, subjectivity, and general word choice are semantic features of an argument that can affect the perception of an argument's content. Based on our results, these content-based features are more important for undecided voters than for decided voters. In comparison, the use of swear words, citing sources, and referring to the opponent are stylistic features of an argument that can affect the perception of the debater. Our results indicate that these style-based features are less important for undecided voters than for decided voters. This account is consistent with the findings of schill that undecided voters respond most to content-rich rhetorical strategies, and the findings of vecc and sween that decided voters tend to selectively attend to information in a message based on prior attitudes. It is also in line with experiments by adams, which found that affiliated voters do not adjust their positions in response to a party's actual policy statements but instead adjust them based on their subjective perceptions of the party. We further found that audience-level aspects are comparatively more predictive of outcomes for undecided voters.
In this study, we develop a framework to account for users’ prior beliefs in their opinion formation. We mainly focus on users’ political and religious ideologies and whether they are undecided vs. decided a priori. However, there are many user aspects such as debating experience, prior interactions, education level, etc., which can impact their opinion formation. We do not propose a method to account for all these factors simultaneously. Moreover, we do not suggest any causal implications since our findings are correlational.
4.6 Chapter Summary
In this chapter, we study the effect of the users’ prior beliefs (i.e., political and religious ideology) and their initial stance on persuasion. We formulate the prediction task of determining which debater an individual voter finds persuasive in order to study the effect of these factors. We show that prior beliefs play a crucial role in this task. Furthermore, we explore the factors that govern persuasion for an a priori undecided vs. decided audience and find differences in the most predictive features for persuasion.
There has been a tremendous amount of research on understanding user interactions and behavior on social media (backstrom2011center; nagarajan2010qualitative; macskassy2011people; Maia:2008:IUB:1435497.1435498; Benevenuto:2009:CUB:1644893.1644900; burke2009feed; golder2007rhythms; wilson2009user; lim2015mytweet; kumar2011understanding). For example, wilson2009user analyze the interaction graphs of Facebook user traces and show that interaction activity on Facebook is significantly skewed towards a small portion of each user's social links. lim2015mytweet investigate how people interact in multiple online social networks. It has further been shown that there is a strong relationship between a user's social interaction and their influence on social media. For example, 10.1145/1963192.1963250 and cha2010measuring have shown that individuals with more activity and personal engagement are more influential on Twitter. Although there is a lot of work on understanding user behavior on social media sites such as Facebook and Twitter, work on understanding the influence of user behavior on persuasion success on debating platforms has been limited.
10.1145/1963192.1963250 is the most similar to our work, in that the authors study the effect of interaction dynamics, such as participant entry order and degree of back-and-forth exchange in the discussion, on success in changing an opinion holder’s stance in a thread. Note that, unlike our study, this work does not consider the effect of social interaction features (such as friendship network or voter network) on users’ success. Moreover, we study the overall success of users over their lifetime, rather than a single debate or discussion thread.
We hypothesize that it is essential to account for the effect of social interactions in computational persuasion. Success in persuasion might also depend on an individual's social interaction and engagement with other users (on the debate platform) over time. For example, being more engaged with others over time may expose an individual to more diverse ideas and people, which could foster argumentation skills that are more applicable to convincing a more diverse audience. Because it focuses only on individual debates and discussion threads, prior work has not investigated the relative effect of an individual's social interaction, personal traits, and language use on their success in persuasion. In this chapter, we study success over a user's lifetime on an online debate platform, looking at interaction and engagement with the community over time rather than at individual debates, to understand the relative impact of these factors on a user's success in persuasion.
Our study employs the DDO (debate.org) dataset described in Chapter 3. Its extensive user information and multiple well-structured debates/interactions per user provide a unique opportunity to study users' success over time while accounting for the effect of individuals' social interactions, personal traits, and language use. Users provide demographic information as well as their stance on controversial topics. They interact with one another in many ways: 1) debating, 2) evaluating the performance of other debaters, 3) commenting on debates, 4) asking/answering opinion questions, 5) voting in polls, 6) creating polls, and 7) becoming friends.
5.2.1 Task Description
This section describes the methods used to investigate the underlying dynamics of success in online debate. First, we explain how we measure the users’ success, and then we explore the role of personal traits, social interactions, and language in predicting success.
5.2.2 User Success
We compute the overall success in debating for a user u as the percentage of debates won:

success(u) = 100 × (# debates u wins) / (# debates u participates in).

We treat users with success(u) ≥ % as successful, success(u) ≤ % as unsuccessful, and those in between (%–%) as mediocre.
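The labeling rule can be sketched with the two percentage cutoffs as parameters (the parameter names and the values in the example are illustrative, not the thesis's actual thresholds):

```python
def success_label(wins, debates, hi, lo):
    """Label a user by win percentage; hi/lo are the percentage cutoffs."""
    rate = 100.0 * wins / debates
    if rate >= hi:
        return "successful"
    if rate <= lo:
        return "unsuccessful"
    return "mediocre"

success_label(wins=8, debates=10, hi=70, lo=30)  # 'successful'
```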
5.2.3 Prediction Task
To understand the relative effect of a user’s personal traits, social interaction, and language on their success, we study the following prediction task: given a pair of debaters where one of them is successful, and the other is unsuccessful over the second and third stage of their lifetime, predict the successful one. Note that while determining our label for success, we consider only the debates in the second and third stage of a user’s lifetime to be able to study the relative effect of success in their first life stage (success prior) vs. other factors in a controlled way. We experiment with two settings where we control for the effect of debate experience and success prior respectively.
Setting 1. To control for the effect of debate experience on success, we create the pairs by matching users according to the number of debates in which they participated (i.e., users within a pair have the same number of debates); there are 2,154 such pairs in our dataset.
Setting 2. Given that we are interested in understanding the factors that correlate with success, we control for the success prior in a specific way: we only consider users that were unsuccessful in their initial life stage (there are 957 such pairs in our dataset). This allows us to directly study the factors correlated with users that were initially unsuccessful but later went on to become successful debaters.
In the following subsections, we describe each of the factors (i.e., personal traits, social interactions, and language) that we study in our experiments.
In Chapter 4, we describe our findings on the role of prior beliefs on users’ persuasion success in online argumentation, looking at the individual debates. We further investigate this effect in a debater’s success over their lifetime. We also extend this study by considering additional personal traits, such as the degree to which a debater’s demographic (e.g., gender and ethnicity) matches those of their friends and the voters participating in the debates.
We extract features to encode the similarity of a user's opinions, political ideology, religious ideology, gender, and ethnicity with those of their friends and voters. To compute opinion similarity, we use the information about users' opinions on the big issues, considering issues where users identified their side as either pro or con and measuring the similarity of their opinions on these issues with those of their friends and voters.
Figure 5.2.3 shows the similarity of successful and unsuccessful users’ personal traits with that of their friends and voters respectively. We find that successful users have significantly higher opinion similarity with their friends than unsuccessful users. Moreover, they have significantly higher opinion similarity, religious ideology match, gender match, and ethnicity match with voters than unsuccessful users. This implies that having voters with a similar background may be an important factor for success, since an audience’s decision about the performance of debaters may be influenced by the extent to which their prior beliefs match (durmus-cardie-2018-exploring).
The users interact with each other on the platform in the following ways: 1) debating 2) evaluating the performance of other debaters, 3) commenting on debates, 4) asking/answering opinion questions, 5) voting in polls, 6) creating polls, 7) becoming friends. We present examples for an opinion question, an opinion argument, and a poll topic below:
Example Opinion Question. "Does God exist?" (The full discussion on the topic can be found at https://www.debate.org/opinions/does-god-exist.)
Example Opinion Argument.
"He probably does not exist. I don't think that it's possible to say yes or no either way. We can only conclude that there is more logical evidence to say that a God probably does not exist, …"
Example Poll Topic. Do you believe in Evolution or Creationism?
We hypothesize that modeling these interactions is important to understand the differences between how successful and unsuccessful users interact on this platform and whether or not these are important factors for success. The ability to interact with others in a myriad of different ways provides users with ample opportunity to learn interesting new strategies and improve their skills over time, as they are exposed to a diverse set of perspectives.
Figure 5.2.3 shows the interaction statistics for successful and unsuccessful users. (We controlled for the number of debates to remove the effect of "being a new user" by pairing successful and unsuccessful users according to the number of debates they participated in.) We see that, overall, successful users have significantly higher participation on the platform.
Friendship network. We represent the friendship network as an undirected graph G = (V, E), where V is the set of users and E is the set of edges, with (u, v) ∈ E if users u and v are friends.
Voter network. We represent the voter network as a weighted directed graph G = (V, E), where V is the set of users and E is the set of edges, with (u, v) ∈ E if user u voted in a debate in which user v participated as a debater. The edge weight w(u, v) represents how many times u voted in debates where v was a debater. Note that having the edge (u, v) in the graph does not imply that u voted for v in a debate.
Hubs and authorities in voter network. Using the HITS algorithm (kleinberg1999authoritative), we compute hub and authority scores for each node (user) in the voter network graph. We expect users who participate in debates as debaters to be the authoritative sources of information on the controversial topics on this platform; therefore, they should have higher authority scores. Users with higher hub scores, on the other hand, may not themselves be authoritative sources on a topic, but they are interested in it and, by providing feedback, lead other users to these debates. We find that successful users have, on average, a significantly higher hub score than unsuccessful users. As shown in Figure 5.2.3, we further observe that successful users have, on average, significantly higher in-degree and out-degree centrality than unsuccessful users in the voter network. Similarly, successful users have higher degree centrality and page rank than unsuccessful users in their friendship network.
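To make the hub/authority intuition concrete, here is a minimal pure-Python sketch of the HITS power iteration on a toy voter network; the user names and edge weights are invented for illustration, and a library implementation (e.g., networkx) would normally be used instead.

```python
# Toy voter network (hypothetical users): edge (u, v) with weight w means
# user u voted w times in debates where v participated as a debater.
votes = {("alice", "bob"): 3, ("alice", "carol"): 1,
         ("dan", "bob"): 2, ("dan", "carol"): 2, ("bob", "carol"): 1}
users = {u for edge in votes for u in edge}

# HITS power iteration (kleinberg1999authoritative):
# authority(v) <- sum of hub(u) over voters u of v,
# hub(u)      <- sum of authority(v) over debaters v that u voted for,
# normalizing after each step.
hub = {u: 1.0 for u in users}
auth = {u: 1.0 for u in users}
for _ in range(50):
    auth = {v: sum(w * hub[u] for (u, t), w in votes.items() if t == v)
            for v in users}
    norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
    auth = {v: x / norm for v, x in auth.items()}
    hub = {u: sum(w * auth[v] for (s, v), w in votes.items() if s == u)
           for u in users}
    norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
    hub = {u: x / norm for u, x in hub.items()}
```

In this toy graph, users who receive many (weighted) votes accumulate authority, while active voters who point at authoritative debaters accumulate hub score, mirroring the debater/voter roles described above.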
To capture the linguistic style of the debaters’ language and its relationship to their success, we use textual features that encode 1) users’ own language and 2) the interplay between users’ and their opponents’ language.
Table 5.2.3: Personal Traits, Social Interactions, and Language Features.

Personal Traits:
1) match of the personal traits (e.g., gender, political ideology, religious ideology, and ethnicity) with friends and voters.
2) opinion similarity with friends and voters.

Social Interactions:
1) participation features: # of comments, # of votes, # of friends, # of opinion questions and arguments, # of voted debates, # of poll votes and topics.
2) friendship network features: degree, degree centrality, page rank scores.
3) voter network features: in-degree, out-degree, in-degree centrality, out-degree centrality, page rank, hub and authority scores.

Language:
1) features of debaters’ own language: # of words, # of definite articles, # of indefinite articles, # of personal pronouns, # of positive words, # of negative words, # of hedges, # of swear words, # of punctuation, # of links, average sentiment, type-token ratio, # of quotes, distribution of POS tags, distribution of named entities, BOW.
2) features to encode the interplay: exact content word match, exact stop word match, content word match with synonyms.
Modeling users’ own language. We extract features from the text of users’ debates, opinion questions, opinion arguments, poll votes, and poll topics. These features include # of words, word category features (e.g., # of personal pronouns, # of positive and negative words), structural features (e.g., distribution of POS tags and named entities), and features that characterize the language as a whole (e.g., type-token ratio).
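A rough sketch of a few of these surface features (word count, type-token ratio, simple word-category counts) might look as follows; the word lists here are illustrative stand-ins, not the exact lexicons used in the study.

```python
def language_features(text):
    """Toy surface features over whitespace-tokenized, lowercased text."""
    # Illustrative mini-lexicons; the study would use full word lists.
    first_person = {"i", "me", "my", "mine", "we", "our", "us"}
    hedges = {"probably", "perhaps", "maybe", "possibly", "might"}

    words = text.lower().split()
    n = len(words)
    return {
        "n_words": n,
        # Type-token ratio: distinct words over total words (lexical diversity).
        "type_token_ratio": len(set(words)) / n if n else 0.0,
        "n_first_person": sum(w in first_person for w in words),
        "n_hedges": sum(w in hedges for w in words),
    }
```

Counts such as POS-tag and named-entity distributions would additionally require a tagger (e.g., from an NLP toolkit), which is omitted here.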
Modeling interplay between a debater and their opponent. We measure the interplay between debaters and their opponents as the similarity of a debater’s language to the previous statement made by the opponent. To measure the similarity of a debater’s language (D) to that of the opponent’s (O) in a round, we count the content words that appear in both D and O, the stop words that appear in both D and O, and the content words in D that have synonyms in O.
The content word match with synonyms feature aims to capture the cases where the opponent refers to similar concepts but does not necessarily use the same words as the debater.
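A minimal sketch of these three interplay counts is shown below, assuming whitespace tokenization, a toy stop-word list, and a hypothetical synonym mapping (in the study this role would be played by a lexical resource such as WordNet):

```python
# Toy stop-word list; a real implementation would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "that", "it"}

def interplay_features(debater_text, opponent_text, synonyms=None):
    """Counts of shared content words, shared stop words, and debater content
    words with a synonym in the opponent's text.

    `synonyms` maps a word to a set of synonyms; defaults to empty.
    """
    synonyms = synonyms or {}
    d_words = set(debater_text.lower().split())
    o_words = set(opponent_text.lower().split())
    d_content, o_content = d_words - STOP_WORDS, o_words - STOP_WORDS
    return {
        "content_match": len(d_content & o_content),
        "stopword_match": len((d_words & o_words) & STOP_WORDS),
        "synonym_match": sum(1 for w in d_content
                             if synonyms.get(w, set()) & o_content),
    }
```

The synonym count fires when the debater’s wording differs from the opponent’s but refers to the same concept, which is exactly the case the feature is meant to capture.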
The complete list of features modeling the aspects of personal traits, social interactions, and language features is shown in Table 5.2.3.
5.2.4 Prediction Results
We use weighted logistic regression and choose the amount and type of regularization (L1 or L2) by grid search over five cross-validation folds. We compute weighted precision, recall, and F1 scores.
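Assuming a setup along the lines of scikit-learn’s standard API, the model selection step could be sketched as follows; the synthetic features and labels are placeholders for the actual feature matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the user-pair feature matrix and labels.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Class-weighted logistic regression; grid search over penalty type
# (L1 vs L2) and regularization strength C with 5-fold cross-validation,
# selecting by weighted F1.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1_weighted",
    cv=5,
)
grid.fit(X, y)
```

The `liblinear` solver is used here because it supports both L1 and L2 penalties; the actual solver and grid in the study are not specified.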
In setting 1, we create user pairs (u1, u2) where:
u1 and u2 have an equal number of debates they participated in as debaters.
One of u1 and u2 is successful and the other is unsuccessful over the second and third stages of their lifetime. (We consider success only over the second and third stages of users’ lifetime in our prediction task, in order to study the effect of the success prior vs. the other aspects; we use the success in the first life stage as the success prior.)
In setting 2, in addition to the requirements of setting 1, we also require u1 and u2 to both have a success prior of at most 0.3.
Task. For both setting 1 and setting 2, we aim to predict whether u1 or u2 is successful over the second and third stages of her lifetime.
In setting 2, by only studying user pairs with low success priors, we aim to understand the factors that are important for a user to improve as a debater over time.
Table 5.2.4, setting 1: (2) Debating experience; (3) Success prior; (4) Overall similarity with voters; (5) Overall similarity with friends; (6) Participation features; (7) Friendship network features; (8) Voter network features; (6) + (7) + (8); (9) # of words; (10) Features of debaters’ interplay; (11) Features of debaters’ own language; (6) + (7) + (8) + (11); (6) + (7) + (8) + (10) + (11).

Prediction Task Results for setting 1. Voter network features are the most predictive social interaction features. Combining interaction and language features achieves the best predictive performance.
Results for setting 1
Table 5.2.4 shows the results for setting 1. We compare our model with three simple baselines: majority, debating experience, and success prior. For the majority baseline, we predict the most common label in the training data for each test example. For the debating experience baseline, we use # of debates as the only feature to predict the successful debater. For the success prior baseline, we pick the user with the higher success prior as successful.
In setting 1, since we do not control for success in the first life stage, we see that the success prior information alone can achieve a relatively high F1 score. This implies that there is a correlation between users’ success in their early life stage and later life stages; this factor may be related to users’ prior debating skills. We observe that the features encoding debaters’ overall similarity with voters and friends achieve significantly better F1 scores than the majority and debating experience baselines. However, these features do not have as high a predictive power as the success prior. We perform an ablation study for participation features, friendship network features, and voter network features, and find that voter network features are significantly more predictive than the baselines, the personal trait features, and the other social interaction features. We also perform an ablation study for the language features and find that # of words is a very predictive feature of success. When we combine the language features with the interaction features, we get the best predictive performance (81.61% F1 score) for this task, which is significantly better than the baselines. This indicates that it is important to account for both social interaction and language factors to determine the successful debater, since these two components encode different kinds of information about the users.
Table 5.2.4, setting 2: (2) Debating experience; (3) Success prior; (4) Overall similarity with voters; (5) Overall similarity with friends; (6) Participation features; (7) Friendship network features; (8) Voter network features; (6) + (7) + (8); (9) # of words; (10) Features of debaters’ interplay; (11) Features of debaters’ own language; (6) + (7) + (8) + (11); (6) + (7) + (8) + (10) + (11).

Prediction Task Results for setting 2. Similar to setting 1, voter network features are the most predictive social interaction features, and combining interaction and language features achieves the best predictive performance.
Results for setting 2
In this task, by controlling for prior success, we aim to understand the factors correlated with success while reducing the effect of users’ prior debating skills. As shown in Table 5.2.4, the F1 score for the success prior baseline is not quite as high as in setting 1, since we control for this aspect by ensuring both users in the pair are unsuccessful in their initial life stage. However, this does not necessarily mean that the two paired users have the same success prior, which explains why success prior still performs better than the other baselines. We do not observe any significant difference between the performance of the features encoding personal traits, participation, and the baseline. However, consistent with setting 1, we see that features of the voter network are significantly better (69.65%) at predicting success. Although language features achieve a significantly better F1 score than the baseline, they perform significantly worse than the voter network features. Similar to setting 1, combining these language features with the social interaction features improves the performance significantly (78.05% F1 score).
To understand the important social interaction and language features, we 1) compute the correlation coefficients for the feature values and the labels, 2) analyze the coefficients of the logistic regression classifier, and 3) apply the recursive feature elimination method (guyon2002gene) to rank features according to their importance. In this section, we present the consistently important features for each of these methods.
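As a sketch of method (3), recursive feature elimination repeatedly refits the classifier and discards the weakest feature until the desired number remains; the snippet below uses synthetic data in place of the study’s feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 10 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

# Recursive feature elimination (guyon2002gene): fit, drop the feature with
# the smallest coefficient magnitude, and repeat.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
ranking = rfe.ranking_  # rank 1 = selected (most important) features
```

The resulting ranking can then be cross-checked against the correlation coefficients and the logistic regression weights, as done in the analysis above.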
Analysis of Social Interaction Features. We find that the most important social interaction features for setting 1 are the authority score, hub score, in- and out-degree centrality, and page rank of the voter network. All of these important features are positively correlated with success. Although participation and friendship network features (e.g., # of voted debates, degree of the user node in the friendship network) are also positively correlated with success, their correlation values are not as high as those of the voter network features. We also find high correlations between some of the user activities: for example, users who post more comments are also more active in making friends, voting, and providing poll votes, and have higher centrality values in the friendship network. Perhaps surprisingly, we do not observe any correlation between # of voted debates and hub/authority scores in the voter network. However, we see highly positive correlations between the hub scores, authority scores, in-degree centrality, out-degree centrality, and page rank values of the voter network. This implies that success is not only about the quantity of voted debates but also about the characteristics of the debaters involved in these debates, since the hub score of a user is influenced by the authority scores of the debaters they vote for. Similarly, the authority score of a user is influenced by the hub scores of the voters that participate in her debates. Therefore, besides the frequency of interaction, the type of interaction and the characteristics of the users involved are important to consider. Consistent with setting 1, in setting 2 the most important features (positively correlated with success) are the authority score, hub score, in- and out-degree centrality, and page rank of the voter network. We observe the same patterns of user activities and authority and hub scores as in setting 1.
Analysis of Language Features. We find that the number of words is positively correlated with success; longer texts may convey more information and explain the points more explicitly (doi:10.1080/00028533.1997.11978023; doi:10.1080/00028533.1998.11951621). The bag of words feature is not as predictive as the # of words feature. For both setting 1 and setting 2, we observe that the value of average sentiment is negatively correlated with success. The reason for this may be that negative information is more attention-grabbing than positive information (article_ditto; doi:10.1080/00913367.1992.10673357; article_pratto), since people are more used to seeing arguments that are phrased in a more positive way (meyerowitz1987effect). We also find that the type-token ratio (diversity of language) is negatively correlated with success for both settings. It may be the case that people who talk about a smaller set of topics gain expertise on these topics over time and are therefore more successful. Other textual features are positively correlated with success for both settings, but the degree of correlation is not as high as for type-token ratio and sentiment.
Table 5.2.4, setting 3: (2) Debating experience; (3) Success prior; (4) Overall similarity with voters; (5) Overall similarity with friends; (6) Participation features; (7) Friendship network features; (8) Voter network features; (6) + (7) + (8); (9) # of words; (10) Features of debaters’ interplay; (11) Features of debaters’ own language; (6) + (7) + (8) + (11); (6) + (7) + (8) + (10) + (11).

Prediction Task Results for loss of success. Participation features are the most important social interaction features. Combining the social interaction features with the language features gives the best prediction performance.
5.3 Understanding the loss of success
In the previous section, we show that social interaction and language features are important to predict successful debaters. Our findings are consistent for the case when 1) we only control for users’ debating experience and 2) we also control for users’ success prior. Users’ participation, the types of interactions they have on the platform, and the characteristics of the users they interact with are predictive of their success, regardless of their prior expertise in debating (encoded by the success prior).
In setting 1, since we did not control for the success prior, we studied the factors that are important for a user to become successful in their second and third life stages, regardless of their success in the beginning. In setting 2, we studied the factors that are important for unsuccessful users to improve their performance and become successful over time. As a natural follow-up, we would also like to understand which factors are correlated with users who are initially successful but later become unsuccessful. To do that, in setting 3, in addition to the requirements of setting 1, we impose an additional criterion on all user pairs (u1, u2):
u1 and u2 both have a success prior of at least 0.7.
As shown in Table 5.2.4, features of personal traits, social interactions, and language perform significantly better than the baselines. For this task, the success prior baseline performs relatively worse than in the previous two settings. Upon closer examination, we observed that the variance of success priors for this task is an order of magnitude smaller than in setting 2. Therefore, as a possible explanation, the success prior may not be as predictive for this task.
Among the personal trait features, similarity with friends is the most predictive. However, participation features perform significantly better than the personal trait features. For this task, contrary to settings 1 and 2, we see that participation features are the most predictive of the social interaction features. This implies that a user’s continued participation is important for them to remain successful, and lower participation could be a contributing factor for these users eventually becoming unsuccessful. Although friendship and voter network features are still significantly more predictive than the baselines, they are not as predictive as the participation features. For users with high success priors, continued participation may be the most important aspect of their social interaction. We observe that language features alone achieve similar performance to the social interaction features. Consistent with settings 1 and 2, combining social interaction and language features gives the best predictive performance (73.43% F1 score).
Analysis of Social Interaction Features. The most important social interaction features include # of voted debates and the degree of the user node in the friendship network. These features indicate higher participation on the platform, and they are positively correlated with staying successful. Although the other social interaction features, such as the hub scores, authority scores, in-degree centrality, out-degree centrality, and page rank values of the voter network, are also positively correlated with success, their correlation values are not as high as those of the previously mentioned features. For users who are initially unsuccessful, participation alone may not be enough to become successful debaters: the types of interactions and the characteristics of the people with whom they interact are crucially important for their success. On the other hand, users who are initially successful may already be experienced debaters, and staying active and participating may be sufficient for them to remain successful.
Analysis of Language Features. As in setting 1 and setting 2, # of words is positively correlated with staying successful. We find that the # of first person pronouns is the language feature with the highest positive correlation with staying successful. We observe that users who refer to their personal experiences and opinions use first person pronouns more often. It may be the case that debaters may try to appeal to logos by citing personal experience (cooper1992power). Consistent with setting 1 and setting 2, the value of average sentiment is negatively correlated with staying successful.
In this study, we investigate the impact of social interaction on debating success. We find that higher participation and engagement improves the success of debaters over time. One potential reason is that users develop strategies to improve their debating skills. Another factor could be that users only participate in the topics they are comfortable with and do not improve their debating skills overall. Moreover, they may be debating with users that they are confident about defeating to increase their chances of winning. Therefore, in this setup, becoming more successful over time may not necessarily imply developing better argumentative skills. In future work, we would like to explore the effect of debate topics on users’ success. Moreover, we aim to understand what characteristics of a user’s language change over time and how it affects debating success.
5.5 Chapter Summary
This chapter explores the effect of a user’s social interaction on their success in debating over time. We investigate the impact of language, personal traits, and social interaction simultaneously for predicting the successful debater given a pair of debaters where one of them is successful and the other is unsuccessful. We observe that successful debaters are significantly more engaged with others and more active on the platform. We find that a user’s social interaction characteristics play a crucial role in determining their success in debates. We achieve the best predictive performance by combining social interaction features with features that encode information on language use.
Previous work in the social sciences and psychology has shown that the impact and persuasive power of an argument depend not only on the language employed but also on the credibility and character of the communicator (i.e., ethos) (fb566a52435647fcbb369ed48db6fbec; 1980-32482-00119790801; source-effect), the traits and prior beliefs of the audience (lord1979biased; edsfra.236698719980101; correll2004affirmed; hullett2005impact), and the pragmatic context in which the argument is presented (i.e., kairos) (10.1086/209393; context_joyce).
Research in Natural Language Processing (NLP) has only partially corroborated these findings. One very influential line of work, for example, develops computational methods to automatically determine the linguistic characteristics of persuasive arguments (habernal-gurevych-2016-makes; 10.1145/2872427.2883081; zhang2016conversational), but it does so without controlling for the audience, the communicator, or the pragmatic context. Very recent work, on the other hand, shows that attributes of both the audience and the communicator constitute important cues for determining argument strength (lukin2017argument; durmus-cardie-2018-exploring). They further show that audience and communicator attributes can influence the relative importance of linguistic features for predicting the persuasiveness of an argument. These results confirm previous findings in the social sciences that show a person’s perception of an argument can be influenced by their background and personality traits. To the best of our knowledge, however, no NLP studies explicitly investigate the role of kairos — a component of pragmatic context that refers to the context-dependent “timeliness” and “appropriateness” of an argument and its claims within an argumentative discourse — in argument quality prediction.
Among the many social science studies of attitude change, the order in which argumentative claims are shared with the audience has been studied extensively: 10.1086/209393, for example, summarize studies showing that the argument-related claims a person is exposed to beforehand can affect his perception of an alternative argument in complex ways. context_joyce similarly finds that changes in an argument’s context can have a big impact on the audience’s perception of the argument.
Some recent studies in NLP have investigated the effect of interactions on the overall persuasive power of posts in social media (10.1145/2872427.2883081; DBLP:conf/aaai/HideyM18). However, in social media, not all posts have to express arguments or stay on topic (DBLP:journals/corr/abs-1709-03167), and qualitative evaluation of the posts can be influenced by many other factors such as interaction between the individuals (Durmus:2019:MFU:3308558.3313676). Therefore, it is difficult to measure the effect of argumentative pragmatic context alone in argument quality prediction without these confounding factors using the datasets and models presented in prior work.
In this chapter, we study the role of kairos on argument quality prediction by examining the individual claims of an argument for their timeliness and appropriateness in the context of a particular line of argument. We define kairos as the sequence of argumentative text (e.g., claims) along a particular line of argumentative reasoning. We first present a dataset extracted from kialo.com of over 47,000 claims that are part of a diverse collection of arguments on 741 controversial topics. The website’s structure dictates that each argument must present a supporting or opposing claim for its parent claim, and stay within the topic of the main thesis. Rather than being posts on a social media platform, these are community-curated claims. Furthermore, for each presented claim, the audience votes on its impact within the given line of reasoning. Critically then, the dataset includes the argument context for each claim, allowing us to investigate the characteristics associated with impactful arguments.
With the dataset in hand, we then propose the task of studying the characteristics of impactful claims by (1) taking the argument context into account, (2) studying the extent to which this context is important, and (3) determining the representation of context that is more effective. To the best of our knowledge, ours is the first dataset that includes claims with both impact votes and the corresponding context of the argument.
Claims and impact votes. We collected claims and their corresponding impact votes from kialo.com for 741 controversial topics. (The data is collected from this website in accordance with its terms and conditions. There is prior work by durmus-etal-2019-determining that created a dataset of argument trees from kialo.com; that dataset, however, does not include any impact labels.) The users of the platform provide impact votes to evaluate how impactful a particular claim is. Users can pick one of five possible impact labels for a particular claim: no impact, low impact, medium impact, high impact, and very high impact. While evaluating the impact of a claim, users have access to the full argument context and can therefore assess how impactful a claim is in the given context of an argument. Interestingly, in this dataset, the same claim can have different impact labels depending on the context in which it occurs.
Figure 6.2 shows a partial argument tree for the argument thesis "Physical torture of prisoners is an acceptable interrogation tool." Each node in the argument tree corresponds to a claim, and these argument trees are constructed and edited collaboratively by the users of the platform.
Except for the thesis, every claim in the argument tree either opposes or supports its parent claim. Each path from the root to a leaf node corresponds to an argument path which represents a particular line of reasoning on the given controversial topic.
The distributions of argument trees by number of claims and by depth are shown in Figures 6.2 and 6.2, respectively. We see that the majority of the trees are several levels deep and contain a sizable number of claims.
Figure 6.2 shows the total number of claims at a given depth. We see that only a small fraction of the claims directly support or oppose the theses of the controversial topics; the majority of the claims lie deeper in the tree. This shows that the dataset has a rich set of supporting and opposing claims not only for the theses but also for claims at different depths of the tree.
Moreover, around 47,000 claims in this dataset have impact votes assigned by the users of the platform. The impact vote evaluates how impactful a claim is within its context, which consists of its predecessor claims from the thesis of the tree. For example, claim O1, "It is morally wrong to harm a defenseless person", is an opposing claim for the thesis, and it is an impactful claim since most of its impact votes belong to the category very high impact. However, claim S3, "It is illegitimate for state actors to harm someone without due process", is a supporting claim for its parent O1, and it is a less impactful claim since most of its impact votes belong to the no impact and low impact categories.
|# impact votes||# claims|

Number of claims for the given range of number of votes. There are 19,512 claims in the dataset with five or more votes, and most of these have considerably more than the minimum number of votes.
|Agreement score||# claims (3-class case)||# claims (5-class case)|

Number of claims, with at least five votes, above the given threshold of agreement percentage for the 3-class and 5-class cases. There are more claims with a high agreement score when the five classes are collapsed into three.
Impact label statistics. Table 6.2 shows the distribution of the number of votes for each of the impact categories. The claims have 241,884 votes in total, and the majority of the impact votes belong to the medium impact category. We observe that users assign more high impact and very high impact votes than low impact and no impact votes, respectively. When we restrict the claims to those with at least five impact votes, we have 213,277 votes in total (26,998 no impact, 33,789 low impact, 55,616 medium impact, 47,494 high impact, and 49,380 very high impact).
Agreement for the impact votes. To determine the agreement in assigning the impact label for a particular claim, we compute, for each claim, the percentage of its votes that are the same as the majority impact vote for that claim, over the class labels C = [no impact, low impact, medium impact, high impact, very high impact].
For example, for claim S1 in Figure 6.2, the agreement score is the fraction of its impact votes that belong to the majority class (no impact). We compute the agreement score for the cases where (1) we treat each impact label separately (5-class case) and (2) we combine the classes high impact and very high impact into one class, impactful, and no impact and low impact into one class, not impactful (3-class case).
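The agreement computation for both the 5-class and the collapsed 3-class case can be sketched as follows (the label strings are assumptions about the encoding):

```python
from collections import Counter

# Collapse mapping from the 5-class labels to the 3-class labels.
THREE_CLASS = {
    "no impact": "not impactful", "low impact": "not impactful",
    "medium impact": "medium impact",
    "high impact": "impactful", "very high impact": "impactful",
}

def agreement_score(votes, collapse=False):
    """Fraction of a claim's votes matching its majority label.

    With `collapse=True`, votes are first mapped to the 3-class labels.
    """
    labels = [THREE_CLASS[v] for v in votes] if collapse else list(votes)
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)
```

Collapsing to three classes can only merge vote counts, so the 3-class agreement score is always at least as high as the 5-class one, consistent with Table 6.2.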
Table 6.2 shows the number of claims at the given agreement score thresholds when we include the claims with at least five votes. There are more claims with high agreement scores when we collapse to the 3-class case. This may imply that distinguishing between the no impact vs. low impact and the high impact vs. very high impact classes is difficult. In our experiments, we use the 3-class representation for the impact labels to decrease sparsity. Moreover, to have a more reliable assignment of impact labels, we consider only the claims with more than 60% agreement.
Context. In an argument tree, the claims from the thesis node (root) to each leaf node form an argument path. This argument path represents a particular line of reasoning for the given thesis. Similarly, for each claim, all the claims along the path from the thesis to the claim, represent the context for the claim. For example, in Figure 6.2, the context for O1 consists of only the thesis, whereas the context for S3 consists of both the thesis and O1 since S3 is provided to support the claim O1 which is an opposing claim for the thesis.
|Impact label||# votes - all claims|
|Very high impact||58,846|
|Total # votes||241,884|

Number of votes for the given impact label. There are 241,884 votes in total, and the majority of them belong to the category medium impact.
Distribution of impact votes. The distribution of claims with the given range of number of impact votes is shown in Table 6.2. There are 19,512 claims in total with five or more votes. We limit our study to the claims with at least five votes to have a more reliable assignment of the accumulated impact label for each claim.
|Context length||# claims|

Number of claims for the given range of context length, for claims with at least five votes and an agreement score greater than 60%.
The claims are not constructed independently of their context, since they are written with the line of reasoning so far in mind. In most cases, each claim elaborates on the point made by its parent and presents cases to support or oppose the parent claim’s points. Similarly, when users evaluate the impact of a claim, they consider whether the claim is timely and appropriate given its context. There are cases in the dataset where the same claim has different impact labels when presented within a different context. Therefore, we claim that it is not sufficient to study only the linguistic characteristics of a claim to determine its impact; it is also necessary to consider its context.
Context length for a particular claim C is defined as the number of claims included in the argument path starting from the thesis until (but not including) the claim C. For example, in Figure 6.2, the context lengths for O1 and S3 are 1 and 2, respectively. Table 6.2 shows the number of claims with the given range of context length, for claims with at least five votes and an agreement score above 60%. We observe that more than half of these claims have long contexts.
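Given a parent pointer per claim, both the context and the context length follow directly. The claim ids below mirror the running example, but the data structure itself is an assumption about how the trees might be stored.

```python
# Hypothetical parent pointers: every claim except the thesis has a parent.
parents = {"O1": "thesis", "S1": "thesis", "S3": "O1"}

def context(claim):
    """Claims on the path from the thesis down to (excluding) `claim`."""
    path = []
    node = parents.get(claim)
    while node is not None:
        path.append(node)
        node = parents.get(node)
    return list(reversed(path))  # thesis first

def context_length(claim):
    return len(context(claim))
```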
6.3.1 Hypothesis and Task Description
Similar to prior work, we aim to understand the characteristics of impactful claims in argumentation. However, we hypothesize that the qualitative characteristics of arguments are not independent of the context in which they are presented. To understand the relationship between argument context and the impact of a claim, we aim to incorporate the context along with the claim itself in our predictive models.
Prediction task. Given a claim, we want to predict the impact label that is assigned to it by the users: not impactful, medium impact, or impactful.
Preprocessing. We restrict our study to claims with at least the minimum number of votes and agreement above the threshold, to have a reliable impact label assignment. There are 7,386 claims in the dataset satisfying these constraints: 1,633 not impactful, 1,445 medium impact, and 4,308 impactful claims. The impact class impactful is the majority class, since around 58% of the claims belong to this category.
For our experiments, we split our data into train (70%), validation (15%), and test (15%) sets.
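One way to produce such a 70/15/15 split (a sketch, not the thesis code) is to apply scikit-learn's `train_test_split` twice:

```python
# Hedged sketch: a 70/15/15 train/validation/test split via two applications
# of scikit-learn's train_test_split. The claim ids are placeholders.
from sklearn.model_selection import train_test_split

claims = list(range(100))  # placeholder claim ids
train, rest = train_test_split(claims, test_size=0.30, random_state=0)
val, test = train_test_split(rest, test_size=0.50, random_state=0)
assert len(train) == 70 and len(val) == 15 and len(test) == 15
```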
6.3.2 Baseline Models
The majority baseline assigns the most common training example label (impactful) to every test example.
SVM with RBF kernel
Similar to habernal-gurevych-2016-makes, we experiment with an SVM with an RBF kernel, using features that represent (1) the simple characteristics of the argument tree and (2) the linguistic characteristics of the claim.
The features that represent the simple characteristics of the claim’s argument tree include the distance and similarity of the claim to the thesis, the similarity of a claim with its parent, and the impact votes of the claim’s parent claim. We encode the similarity of a claim to its parent and the thesis claim with the cosine similarity of their tf-idf vectors. The distance and similarity metrics aim to model whether claims which are more similar (i.e., potentially more topically relevant) to their parent claim or the thesis claim are more impactful.
We encode the quality of the parent claim as the number of votes for each impact class and incorporate it as a feature to understand if it is more likely for a claim to be impactful given an impactful parent claim.
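The tree-based features above can be sketched as follows. This is a minimal illustration under our own assumptions (toy claim texts, hypothetical feature names), not the thesis code:

```python
# Hedged sketch of the tree-based SVM features: tf-idf cosine similarity of a
# claim to its parent and to the thesis, plus the parent's impact vote counts.
# All claim texts and feature names here are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

thesis = "Governments should tax sugary drinks."
parent = "Sugar taxes reduce consumption of unhealthy drinks."
claim = "Taxes on sugary drinks have lowered sales in several cities."

vec = TfidfVectorizer().fit([thesis, parent, claim])
X = vec.transform([thesis, parent, claim])
sim = cosine_similarity(X)  # pairwise cosine similarities of tf-idf vectors

features = {
    "sim_to_thesis": sim[2, 0],      # claim vs. thesis
    "sim_to_parent": sim[2, 1],      # claim vs. parent
    "distance_from_thesis": 2,       # depth of the claim in the tree
    "parent_votes": [3, 10, 25],     # (not impactful, medium, impactful)
}
assert 0.0 <= features["sim_to_thesis"] <= 1.0
```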
Linguistic features. To represent each claim, we extract the linguistic features proposed by habernal-gurevych-2016-makes, such as tf-idf scores for unigrams and bigrams; the ratios of quotation marks, exclamation marks, modal verbs, and stop words; type-token ratio; hedging (:/content/books/9789027282583); named entity types; POS n-grams; sentiment (ICWSM148109) and subjectivity scores (wilson2005recognizing); spell-checking; readability features such as Coleman-Liau (1975-22007-00119750401) and Flesch (1949-01274-00119480601); argument lexicon features (somasundaran2007detecting); and surface features such as word lengths, sentence lengths, word types, and the number of complex words. We pick the parameters for the SVM model according to performance on the validation split and report results on the test split.
FastText
joulin-etal-2017-bag introduced a simple yet effective baseline for text classification, which they show to be competitive with deep learning classifiers in terms of accuracy. Their method represents a sequence of text as a bag of n-grams, and each n-gram is passed through a look-up table to get its dense vector representation. The overall sequence representation is simply an average over the dense representations of the bag of n-grams, and is fed into a linear classifier to predict the label. We use the code released by joulin-etal-2017-bag to train a classifier for argument impact prediction, based on the claim text. We use a maximum n-gram length of 2, a learning rate of 0.8, 15 epochs, and a vector dimension of 300, and initialize with the pre-trained 300-dimensional wiki-news vectors made available on the fastText website.
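The fastText supervised mode expects one example per line, with the label as a `__label__` prefix. A sketch of the data formatting, with the training call shown (commented out) under the hyperparameters reported above; file names are placeholders, not the thesis's actual files:

```python
# Hedged sketch: formatting examples for fastText supervised classification.
# The label encoding and file paths are illustrative assumptions.
def to_fasttext_line(label, claim):
    """Format one training example for fastText supervised mode."""
    return "__label__{} {}".format(label.replace(" ", "_"), claim)

line = to_fasttext_line("medium impact", "Taxes on sugary drinks lower consumption.")
assert line.startswith("__label__medium_impact ")

# Training (requires the `fasttext` package and data files on disk):
# import fasttext
# model = fasttext.train_supervised(
#     input="train.txt", wordNgrams=2, lr=0.8, epoch=15, dim=300,
#     pretrainedVectors="wiki-news-300d-1M.vec")
```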
BiLSTM with Attention
Another effective baseline for text classification (zhou-etal-2016-attention; yang-etal-2016-hierarchical) consists of encoding the text sequence using a bidirectional Long Short-Term Memory (LSTM) network (Hochreiter:1997:LSM:1246443.1246450) to get the token representations in context, and then attending (luong-etal-2015-effective) over the tokens to get the sequence representation. As the query vector for attention, we use a learned context vector, similar to yang-etal-2016-hierarchical. We pick our hyperparameters based on performance on the validation set and report results for the best set of hyperparameters: 100-dimensional word embeddings, a 100-dimensional context vector, and a 1-layer BiLSTM with 64 units, trained for 40 epochs with early stopping based on validation performance. We initialize our word embeddings with GloVe vectors (pennington-etal-2014-glove) pre-trained on Wikipedia + Gigaword, and use the Adam optimizer (DBLP:journals/corr/KingmaB14) with its default settings.
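The architecture above can be sketched in PyTorch. This is a minimal illustration under our own assumptions (toy dimensions, random inputs), not the thesis code:

```python
# Hedged PyTorch sketch of the BiLSTM-with-attention baseline: encode tokens
# with a BiLSTM, attend over the token states with a learned context vector,
# and classify the attention-pooled representation.
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=64, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.query = nn.Parameter(torch.randn(2 * hidden))  # learned context vector
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))            # (B, T, 2H) token states
        weights = torch.softmax(h @ self.query, dim=1)   # (B, T) attention weights
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)  # weighted sum over tokens
        return self.out(pooled)                          # (B, n_classes) logits

logits = BiLSTMAttention(vocab_size=1000)(torch.randint(0, 1000, (2, 12)))
assert logits.shape == (2, 3)
```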
Results for the baselines (majority, SVM with RBF kernel, FastText, BiLSTM with attention) and the BERT models with and without context. The best-performing model is BERT with the representation of the previous claims on the path along with the representation of the claim itself. We run the models multiple times and report the mean and standard deviation.
6.3.3 Fine-tuned BERT model
devlin2018bert fine-tuned a pre-trained deep bidirectional transformer language model (which they call BERT) by adding a simple classification layer on top, and achieved state-of-the-art results across a variety of NLP tasks. We employ their pre-trained language models for our task and compare them to our baseline models. For all the architectures described below, we fine-tune for 10 epochs with a learning rate of 2e-5. We employ an early stopping procedure based on the model performance on a validation set.
Claim with no context
In this setting, we attempt to classify the impact of the claim based on the text of the claim only. We follow the fine-tuning procedure for sequence classification detailed in devlin2018bert, and input the claim text as a sequence of tokens preceded by the special [CLS] token and followed by the special [SEP] token. We add a classification layer on top of the BERT encoder, to which we pass the representation of the [CLS] token and fine-tune this for argument impact prediction.
Claim with parent representation
In this setting, we use the parent claim’s text, in addition to the target claim text, in order to classify the impact of the target claim. We treat this as a sequence pair classification task and combine both the target claim and parent claim as a single sequence of tokens, separated by the special separator [SEP]. We then follow the same procedure above for fine-tuning.
Incorporating larger context
In this setting, we consider incorporating a larger context from the discourse in order to assess the impact of a claim. In particular, we consider up to four previous claims in the discourse (for a total context length of 5). We attempt to incorporate larger context into the BERT model in three different ways.
Flat representation of the path. The first, simple approach is to represent the entire path (claim + context) as a single sequence, where each of the claims is separated by the [SEP] token. BERT was trained on sequence pairs, and therefore the pre-trained encoders only have two segment embeddings (devlin2018bert). To fit multiple sequences into this framework, we indicate all tokens of the target claim as belonging to segment A and the tokens for all the claims in the discourse context as belonging to segment B. This way of representing the input requires no additional changes to the architecture or retraining, and we can fine-tune in a similar manner as above. We refer to this representation of the context as a flat representation, and denote the model as $\text{Context}_{f}(i)$, where $i$ indicates the length of the context that is incorporated into the model.
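A schematic sketch of this input layout (the exact tokenization details are our assumption, not the thesis code):

```python
# Hedged sketch of the flat context representation: the target claim is
# segment A, all context claims are segment B, with [SEP] tokens between
# claims. Token lists here are toy placeholders for WordPiece tokens.
def flat_input(claim_tokens, context_claims):
    tokens = ["[CLS]"] + claim_tokens + ["[SEP]"]
    segments = [0] * len(tokens)              # target claim -> segment A
    for ctx in context_claims:
        tokens += ctx + ["[SEP]"]
        segments += [1] * (len(ctx) + 1)      # context claims -> segment B
    return tokens, segments

toks, segs = flat_input(["taxes", "work"], [["thesis", "claim"], ["parent"]])
assert toks == ["[CLS]", "taxes", "work", "[SEP]",
                "thesis", "claim", "[SEP]", "parent", "[SEP]"]
assert segs == [0, 0, 0, 0, 1, 1, 1, 1, 1]
```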
F1 scores of each model for claims with various context length values.
Attention over context. Recent work in incorporating argument sequence in predicting persuasiveness (DBLP:conf/aaai/HideyM18) has shown that hierarchical representations are effective in representing context. Similarly, we consider hierarchical representations for representing the discourse. We first encode each claim using the pre-trained BERT model as the claim encoder and use the representation of the [CLS] token as claim representation. We then employ dot-product attention (luong-etal-2015-effective), to get a weighted representation for the context. We use a learned context vector as the query for computing attention scores, similar to yang-etal-2016-hierarchical. The attention score is computed as shown below:
$$a_i = \frac{\exp(h_i^{\top} v_c)}{\sum_{j \in D} \exp(h_j^{\top} v_c)}$$
where $h_i$ is the representation of claim $i$ computed with the BERT encoder as described above, $v_c$ is the learned context vector used for computing the attention scores, and $D$ is the set of claims in the discourse. After computing the attention scores, the final context representation $c$ is computed as follows:
$$c = \sum_{i \in D} a_i h_i$$
We then concatenate the context representation with the target claim representation and pass it to the classification layer to predict the impact. We denote this model as $\text{Context}_{a}(i)$.
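The attention step can be sketched in PyTorch. The dimensions and inputs below are illustrative assumptions; the claim vectors would come from the BERT [CLS] encoder:

```python
# Hedged sketch of dot-product attention over context claim representations
# with a learned query (context) vector. Dimensions are illustrative.
import torch

def attend(context_reps, query):
    """context_reps: (n_claims, d) claim vectors; query: (d,) learned vector.
    Returns the attention-weighted context representation of shape (d,)."""
    weights = torch.softmax(context_reps @ query, dim=0)  # attention weights
    return weights @ context_reps                         # weighted sum

ctx = torch.randn(4, 8)  # 4 context claims, 8-dim representations
q = torch.randn(8)       # learned context vector
c = attend(ctx, q)
assert c.shape == (8,)
```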
GRU to encode context
Similar to the approach above, we consider a hierarchical representation of the context. We compute the claim representations as detailed above, and we then feed the discourse claims' representations (in sequence) into a bidirectional Gated Recurrent Unit (GRU) (cho-al-emnlp14) to compute the context representation. We concatenate this with the target claim representation and use it to predict the claim impact. We denote this model as $\text{Context}_{gru}(i)$.
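A minimal PyTorch sketch of the GRU context encoder, with illustrative dimensions (an assumption, not the thesis code):

```python
# Hedged sketch: encode the sequence of context claim representations with a
# bidirectional GRU and take the outputs at the last step as the context
# representation. Dimensions and inputs are illustrative.
import torch
import torch.nn as nn

d, hidden = 8, 16
gru = nn.GRU(d, hidden, bidirectional=True, batch_first=True)
claim_reps = torch.randn(1, 4, d)   # batch of 1, 4 context claims
out, h_n = gru(claim_reps)          # out: (1, 4, 2*hidden)
context_rep = out[:, -1, :]         # BiGRU outputs at the final step
assert context_rep.shape == (1, 2 * hidden)
```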
6.4 Results and Analysis
Table 6.3.2 shows the macro precision, recall, and F1 scores for the baselines as well as the BERT models with and without context representations. For the models whose scores vary with the random seed, we run them multiple times and report the mean and standard deviation.
We see that parent quality is a simple yet effective feature: the SVM model with this feature achieves a significantly higher F1 score than the models using distance from the thesis and linguistic features (we perform a two-sided t-test for significance analysis). Claims with higher-impact parents are more likely to have a higher impact themselves. Similarity with the parent and thesis is not significantly better than the majority baseline. Although the BiLSTM-with-attention and FastText baselines perform better than the SVM with distance from the thesis and linguistic features, they have similar performance to the parent quality baseline.
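The significance analysis can be illustrated with SciPy's two-sided independent-samples t-test; the F1 scores below are made-up placeholders, not our actual results:

```python
# Illustrative sketch of the significance test: a two-sided t-test comparing
# the F1 scores of two models across repeated runs. Scores are fabricated
# placeholders for illustration only.
from scipy.stats import ttest_ind

model_a_f1 = [0.55, 0.56, 0.54, 0.57, 0.55]
model_b_f1 = [0.50, 0.51, 0.49, 0.52, 0.50]
t_stat, p_value = ttest_ind(model_a_f1, model_b_f1)  # two-sided by default
assert p_value < 0.05  # the (toy) difference is significant at the 0.05 level
```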
We find that the BERT model with the claim-only representation performs significantly better than the baseline models. Incorporating the parent representation along with the claim representation does not give a significant improvement over representing the claim only. However, incorporating the flat representation of the larger context along with the claim representation consistently achieves significantly better performance than the claim representation alone. Similarly, the attention representation over the context with the learned query vector achieves significantly better performance than the claim representation only.
We find that the flat representation of the context achieves the highest F1 score. It may be difficult for the models with a larger number of parameters to outperform the flat representation since the dataset is small. We also observe that modeling the claims on the argument path before the target claim achieves the best F1 score.
To understand for which kinds of claims the best-performing contextual model is more effective, we evaluate the BERT model with the flat context representation separately for claims with different context lengths. Table 6.3.3 shows the F1 scores of the BERT model without context and with flat context representations of different lengths. For claims with short contexts, adding context representations along with the claim achieves a significantly better F1 score than modeling the claim only, and the same holds for claims with longer contexts. We see that models with larger context are helpful even for claims with limited context. This may suggest that when we train the models with larger context, they learn to represent the claims and their context better.
In this study, we find that incorporating pragmatic context is crucial in impact prediction. First, we present a new dataset for this task. We assume that the impact labels in this dataset are provided in good faith by the users. However, we note that the user demographics on the platform may not be fairly representative, and prior beliefs and background could affect which arguments are perceived as more impactful. We should account for this potential bias when using systems built from this dataset. We further observe that BERT-based models achieve the best predictive performance. However, it is difficult to interpret these systems to understand which aspects of the context play an important role. In future work, we aim to employ methods such as local surrogate models (10.1145/2939672.2939778) or input saliency methods (li-etal-2016-visualizing) to interpret these systems.
6.6 Chapter Summary
This chapter proposes a new dataset of arguments along with their impact label and the argument path, representing a particular line of reasoning on the given controversial topic. We further propose predictive models that incorporate the pragmatic and discourse context of argumentative claims to predict argument impact. We show that the models representing the pragmatic context outperform models that rely on only claim-specific linguistic features for predicting the perceived impact of individual claims within a particular line of argument.
7.1 Summary of Contributions
In Chapter 3, we propose a new dataset of debates with extensive user information extracted from an online argumentation platform (i.e., debate.org). This is the largest available dataset with such extensive user information, including political ideology, religious ideology, and stance on various controversial topics. Availability of this information has motivated further research in exploring the effect of user factors in persuasion (10.1162/tacl_a_00281; Durmus:2019:MFU:3308558.3313676).
With the dataset in hand, we study the role of prior beliefs, of both speakers and audience members, on the perceived persuasiveness of arguments. We do this by formulating a new task to determine which debater will be able to persuade a given voter to change their stance. We find that features associated with a user’s initial stance are very predictive for this task. This is especially true for debates on political and religious issues, where these features are even more predictive than linguistic features of the arguments.
In Chapter 5, we further explore whether a user's social interactions impact their debating success over time on online argumentation platforms. We extract features from a user's friendship and voter networks. We then use these features to explore the role of social interactions as compared to personality traits and language in predicting debating success over time. We find that social interaction features (i.e., primarily features extracted from the voter network) are the most predictive of success. We observe that the best predictive performance is achieved when combining social interaction features with linguistic features. This implies that the characteristics of interactions on online debating platforms are essential to becoming more experienced and successful in persuasion.
Finally, we propose a dataset to study the role of kairos (i.e., pragmatic context) in determining argument impact. As described in Chapter 6, the dataset includes the argument context for each claim, along with the impact score within the given line of reasoning. We further explore whether a flat vs. a hierarchical representation of context is more effective for this task. We find that a flat representation of the context achieves the best performance since the dataset may not be large enough to learn the additional parameters needed for a hierarchical model. We observe that models that incorporate context perform significantly better than those that use the claim only. This implies that the context in which an argument is presented is crucial in assessing its impact.
7.2 Ethical Considerations
All the data in our research has been collected and used in accordance with the terms of service of the source. For user studies, we take the utmost care in making sure that the anonymity of the users is preserved. Finally, we make sure that our work does not take a stance on any of the controversial topics, but rather just analyzes the viewpoints of the participants in the datasets we use. One shortcoming we acknowledge is that we are unable to represent all demographics due to a lack of data. The sources we used tend to be highly skewed towards an American audience, and even within this audience, the distribution may not be representative enough.
Given that argumentation is a fundamental part of human communication, the work in this area could be used in both beneficial and ethically questionable ways. The driving motivator of this dissertation has always been that argumentation can be used for social good, such as exposing people to diverse viewpoints to help them make more informed decisions, or using persuasion to encourage people to contribute to the environment and society. However, even for such use cases, it is vital to be transparent and inform users about the nature of these systems. Moreover, user consent should be required to employ such methods in real-world scenarios.
7.3 Future Directions
Modeling Users in Computational Persuasion. In our study, we explore the role of prior beliefs in persuasion, focusing on political and religious ideologies. However, there are many aspects of the source and the audience (e.g., education level, prior argumentation skills, credibility, personality traits) that may influence the persuasion process. It is challenging to control for all potential confounding factors to isolate the effect of the linguistic features due to data sparsity. We think it is vital to explore better representations for users to disentangle the impact of user aspects. We further want to explore the following research questions: 1) How do different aspects of users influence their perceptions of the arguments? 2) How do these aspects affect people’s language choice while interacting with more similar vs. different people? 3) How does the language use change for different groups of speakers?
Personalized Argument Generation. Understanding the effect of user factors in persuasion could be the first step towards designing personalized argument generation systems capable of conveying relevant and interesting information for a more effective persuasion process (edseee.928406920201001; dijkstra_2000). Personalization is important in increasing engagement and attachment in social interactions on online platforms (edselc.2-52.0-8497816854120160801; edsemr.10.1108.AJIM.03.2018.006720180907). Therefore, personalized systems may increase the quality of persuasive communication and the outcome of this process. For example, wang-etal-2019-persuasion recently proposed a personalized dialogue system that tries to persuade people to donate to a specific charity. They show that personalized argument generation systems can be used for social good. Moreover, such systems could be used to present people with a diverse set of viewpoints to help them make more informed decisions.
Interpretation of Neural Models. Neural networks can learn complex representations that help achieve state-of-the-art performance on various syntactic and semantic tasks in Natural Language Processing. However, unlike feature-based linear models, neural models are more challenging to interpret, which makes it harder to explain what they learn and to improve them. For example, in Chapter 6, we found that incorporating context with the argument helps predict its impact. However, it is not straightforward to interpret which aspects of the context help improve the overall performance. Similarly, although neural methods achieve state-of-the-art performance in persuasion prediction tasks, it is difficult to identify the characteristics of persuasive language and effective persuasion strategies. We believe that improved interpretation of neural networks is crucial to drawing valuable conclusions in computational persuasion studies and building better models for these tasks.