Deliberation — “an extended conversation among two or more people in order to come to a better understanding of some issue” [beauchampmodeling] — greases the machinery of societal decision-making, enabling consensus among participants via fair and informed debate. The process of opinion exchange in deliberation alleviates polarization, minority under-representation and several other drawbacks of consensus formation that arise from non-deliberative processes (such as majority-voting without discussion) by educating potentially uninformed participants and broadening their awareness of alternative perspectives [list2013deliberation, thompson2008deliberative].
An increasing amount of deliberation takes place online [davies2009online], both on social media and on specialized platforms developed for participatory democracy (e.g., the Stanford Online Deliberation Platform: https://stanforddeliberate.org/), knowledge curation (e.g., Wikipedia Talk pages, used to discuss Wikipedia edits: https://en.wikipedia.org/wiki/Help:Talk_pages) and software planning (e.g., GitHub Issues, used to plan open-source projects: https://guides.github.com/features/issues/), among others. While going virtual broadens participation, the increased visibility of “reputation” indicators online could distort the equitability of the deliberation process. For example, [marlow2013impression] find that project managers on GitHub (used for open-source software development by Google, Facebook and Microsoft, among several other technology firms) use visible reputation indicators when evaluating users’ feature requests and critiquing developers’ code contributions. At the same time, this creates opportunities for firms to exploit reputation indicators when directly interacting with consumers online to promote sales and mitigate churn (as gaming giant Electronic Arts does on Reddit, for example). These opportunities also extend to philanthropic organizations engaged in curbing the spread of misinformation online.
Whether reputation indeed has persuasive power in online deliberation is thus an important concern, but one that is difficult to quantify due to the difficulty of recognizing opinion-change and persuasion even on the rare occasions when they occur. We overcome this challenge by assembling a dataset of deliberation from the ChangeMyView (http://reddit.com/r/changemyview/) online argumentation platform, containing over a million debates spanning 7 years from 2013 to 2019. Strict curation by a team of over 20 moderators ensures that debates on ChangeMyView are well-informed, balanced and civil, thus satisfying the key tenets of authentic deliberation [fishkin2005experimenting]. The debates in our dataset cover a variety of topics, from politics and religion to comparisons of products and brands, reflecting the diverse interests of over 800,000 ChangeMyView users. ChangeMyView users initiate debates by sharing opinions, engage in dyadic deliberation with other users that challenge their opinion, and (uniquely) provide explicit indicators of successful persuasion for each challenger that persuaded them to change their opinion. For every user persuaded, challengers earn reputation points that are prominently displayed with their username on the platform. The screenshot in Figure 1 illustrates the deliberation process and the nature of reputation indicators on ChangeMyView.
We use this dataset to analyze whether an individual’s reputation impacts their persuasiveness in deliberation online, beyond the content of their arguments. Our identification strategy to answer this question draws on four key components, enabled by several unique characteristics of our dataset:
Within-opinion variation: We exploit the availability of multiple challengers of each opinion to analyze within-opinion variation (via opinion fixed-effects). This controls for unobserved characteristics of the opinion (such as the topic) and the poster (such as their agreeability) that may introduce biases arising from users endogenously selecting which opinions to challenge.
Approximating persuasive ability: We exploit the availability of multiple persuasion attempts for each user over time to measure and control for their past (lagged) persuasion rate, as a proxy for their unobserved persuasive ability (or skill) in each debate.
Instrumenting for reputation: We derive an instrument for the reputation of the challenger in each debate to address potential confounding due to unobserved challenger characteristics that vary over time, and are hence not controlled for by their past persuasion rate.
Controlling for the response text: Each challenger’s response text is the primary medium through which their persuasive ability, linguistic fluency and other major determinants of persuasion are observed by the poster. By controlling for the response text nonparametrically, we control for and address potential confounding arising from all such determinants.
We instrument for the challenger’s reputation in each debate with their average position in the sequence of responses to opinions they challenged previously (their mean past position). For a given opinion, challengers responding earlier (at lower positions) exhaust the limited space of good arguments, making it harder for challengers responding later (at higher positions) to persuade the poster; this resembles the mechanism of the cable news channel position instrument used to quantify the persuasive power of Fox News on Republican voting [martin2017bias]. Hence, we expect challengers with higher (worse) mean past positions to have lower reputations in the present, motivating our instrument’s relevance. While users can strategically select opinions to challenge that have fewer earlier challengers, our instrument remains exogenous after controlling for the user’s present position in each debate. To further alleviate concerns about instrument validity, we derive conservative bounds on our estimates under relaxed instrument validity assumptions using the plausibly-exogenous instrumental variable framework of [conley2012plausibly].
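To make the construction of the instrument concrete, the following sketch (with a hypothetical record format, not the paper’s actual pipeline) computes each challenger’s mean past position from a chronologically ordered debate log:

```python
from collections import defaultdict

def mean_past_positions(debates):
    """For each debate, compute the challenger's mean position across the
    opinions they challenged *previously* (the instrument). Challengers with
    no prior debates get None."""
    # debates: chronologically ordered (opinion_id, challenger, position) tuples
    past = defaultdict(list)  # challenger -> positions held in earlier debates
    z = []
    for opinion_id, challenger, position in debates:
        history = past[challenger]
        z.append(sum(history) / len(history) if history else None)
        history.append(position)
    return z

debates = [("o1", "a", 1), ("o1", "b", 2), ("o2", "b", 1), ("o3", "b", 3)]
print(mean_past_positions(debates))  # [None, None, 2.0, 1.5]
```

Note that only positions from strictly earlier debates enter the average, mirroring the requirement that the instrument be predetermined at the time of the current debate.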
Text plays a key role in ensuring instrument validity. All confounders of the instrument must affect both the instrument and the debate outcome (whether the poster was persuaded). To affect the debate outcome, such confounders must operate through channels observable by the poster, the most prominent of which is the text of the challenger’s response. Hence, “controlling for” the challenger’s response text blocks the causal pathways between such confounders and the debate outcome, ensuring that they do not violate instrument validity.
To operationalize this intuition in an ideal world, we would manually annotate, measure and control for every possible characteristic of the response text that could affect the debate outcome, which is infeasible at scale. An alternative is to control for a bag-of-words [harris1954distributional] vector of the response text (a high-dimensional vector of the frequencies of all the words it contains, whose dimensionality is the size of the vocabulary of the document corpus, typically of the order of millions), assuming that functions of this vector capture all text characteristics that determine the debate outcome. However, the high dimensionality of bag-of-words representations introduces statistical difficulties that prevent consistent estimation and valid inference.
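To illustrate why dimensionality is the concern, a minimal bag-of-words construction (a generic sketch, not the paper’s code) maps each response to a count vector whose length equals the corpus vocabulary:

```python
def bag_of_words(docs):
    """Map each document to a vector of word counts over the corpus vocabulary.
    The vector length equals the vocabulary size, which grows with the corpus."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: k for k, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

vocab, vectors = bag_of_words(["good point", "good good argument"])
print(vocab)    # ['argument', 'good', 'point']
print(vectors)  # [[0, 1, 1], [1, 2, 0]]
```

With a realistic corpus the vocabulary runs into the millions, so the number of regressors can exceed the number of observations, which is precisely the regime where ordinary regression adjustment breaks down.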
Dimensionality-reduction techniques are commonly employed to alleviate these difficulties, whether manually via hand-selected features, or automatically via inverse-regression [taddy2013multinomial], topic-modeling [blei2003latent, roberts2016model, roberts2018adjusting] and neural text embeddings [mikolov2013distributed]. However, these techniques provide no guarantees that the confounders present in the original text are retained in the low-dimensional text representation, which raises concerns of omitted variable bias. In addition, there is often little substantive theory to guide the manual feature selection process. Automated dimensionality-reduction techniques, including supervised ones such as feature selection via the LASSO [tibshirani1996regression], could result in inconsistent estimates due to model misspecification, and invalid confidence intervals due to feature selection uncertainty that is not accounted for in the inference procedure [belloni2014high].
We depart from the focus on dimensionality-reduction and instead incorporate the response text as a control nonparametrically, using recent advances in semiparametric inference with machine learning models. Specifically, we estimate “nuisance functions” of the response text via machine learning to predict the debate outcome, challenger reputation and instrument, and partial out their effects in the manner of Frisch-Waugh-Lovell [frisch1933partial, lovell1963seasonal]. This procedure was introduced as early as [robinson1988root] for parametric nuisance functions and recently extended to nonparametric nuisance functions estimated via machine learning [van2003unified, chernozhukov2018double]. The recent extensions show that the partialling-out procedure guarantees $\sqrt{n}$-consistent and asymptotically normal estimates, as long as each estimated nuisance function converges to the true nuisance function at the rate of $n^{-1/4}$ or better.
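The partialling-out logic can be sketched on synthetic data. In the sketch below, ordinary least squares stands in for the machine-learned nuisance functions, and cross-fitting (used in practice to avoid overfitting bias) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)            # stand-in for (a feature of) the text control
d = 0.8 * x + rng.normal(size=n)  # "treatment" (e.g., reputation), confounded by x
y = 0.5 * d + 1.5 * x + rng.normal(size=n)  # outcome; true treatment effect is 0.5

def fit_nuisance(x, t):
    # Stand-in nuisance estimator: OLS of t on x. In the paper's setting this
    # would be a machine-learning model of the response text.
    b = np.polyfit(x, t, 1)
    return np.polyval(b, x)

# Partial out x from both y and d, then regress residual on residual
# (Frisch-Waugh-Lovell).
y_res = y - fit_nuisance(x, y)
d_res = d - fit_nuisance(x, d)
theta = (d_res @ y_res) / (d_res @ d_res)
print(round(theta, 2))  # close to the true effect 0.5
```

The point of the construction is that the treatment coefficient is recovered from the residual-on-residual regression, so only the *prediction* quality of the nuisance fits matters, not their functional form.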
In particular, we use a recent econometric extension of the partialling-out procedure called double machine-learning [chernozhukov2018double] to estimate a partially-linear instrumental variable specification with text as a control. For our nuisance functions, we use neural networks with rectified linear unit (ReLU) activation functions [nair2010rectified]. These neural networks pass the input text through a series of intermediate layers, each of which learns a latent “representation” that captures textual semantics at a different granularity. The networks are trained via backpropagation [rumelhart1986learning] with first-order gradient-based techniques [kingma2015adam] to minimize classification or regression loss functions. Though recurrent [hochreiter1997long] and convolutional [kim2014convolutional] neural networks are more commonly used for textual prediction tasks, neural networks with ReLU activation functions come with guaranteed convergence rates [farrell2018deep] that enable consistent estimation and valid inference in the double machine-learning framework.
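A minimal forward pass of such a network is sketched below (illustrative only; the paper’s actual architectures, layer sizes and training details are not shown here):

```python
import numpy as np

def relu_network(x, weights, biases):
    """Forward pass of a fully-connected ReLU network: each hidden layer
    computes a latent representation h = max(0, W h_prev + b); the final
    layer is linear (suitable for a regression-type nuisance function)."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, w @ h + b)
    return weights[-1] @ h + biases[-1]

# Two-layer example on a 3-dimensional "text embedding" input (illustrative
# weights; in practice these are learned by backpropagation).
w1, b1 = np.ones((2, 3)), np.zeros(2)
w2, b2 = np.ones((1, 2)), np.zeros(1)
print(relu_network(np.array([1.0, -2.0, 3.0]), [w1, w2], [b1, b2]))  # [4.]
```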
Results. We find a significant positive effect of reputation on persuasion. Our instrumental variable estimates indicate that having 10 additional units of reputation increases the probability of persuading a poster by 1.09 percentage points. This corresponds to a 31% increase over the platform average persuasion rate of 3.5%. Since each poster successfully persuaded increases a challenger’s reputation, the long-run effect of reputation on persuasion is compounded over time. The effect remains statistically significant across a range of specifications, including ones where the instrument exclusion restriction is relaxed. Our findings counter the prevailing notion on the ChangeMyView platform that the persuasive power of reputation can be ignored.
The estimated effect of reputation on persuasion is the local average treatment effect (LATE) [imbens1994identification] in the population of compliers, which comprises debates where the challenger’s reputation (the treatment) is affected by their mean past position (the instrument). Such challengers are less persuasive at higher (later, worse) response positions and more persuasive at lower (earlier, better) response positions. Hence, we expect debates in the complier population to involve challengers with moderate to high persuasive ability, since challengers with low persuasive ability are unlikely to be any more persuasive at any response position.
To investigate possible mechanisms for this effect, we test the predictions of a theoretical model of persuasion with information-processing shortcuts called reference cues [bilancini2018rational]. We examine how the proportional effects of a challenger’s reputation and skill vary with characteristics of the opinion and response content. Using the challenger’s response text length as a proxy for the cognitive complexity of their arguments, we find that the reputation effect share (of the total effect magnitude of reputation and skill) increases from 82% to 89% from the first to the fourth response length quantile. This suggests that posters rely more on reputation when the challenger’s arguments are cognitively complex. This is consistent with the theoretical prediction that individuals will rely more on low-effort heuristic processing (using reputation as a proxy for the quality of the challenger’s response) instead of high-effort systematic processing (directly evaluating the challenger’s response) when subject to greater cognitive overload.
The theoretical model also predicts that individuals will rely less on low-effort heuristic processing when they are more involved in the issue being debated. We test this prediction using the opinion text length as a proxy for the issue-involvement of the poster and find that the reputation effect share decreases from 90% to 83% from the second to the fourth opinion length quantile. This is consistent with the prediction that more issue-involved posters will rely less on reputation. We find similar patterns using text complexity measures (such as the Flesch-Kincaid Reading Ease) as proxies for cognitive complexity and issue-involvement, instead of the response and opinion text length. Overall, our findings are consistent with reputation serving as a reference cue and used by posters as an information-processing shortcut under cognitive overload.
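For reference, the Flesch Reading Ease score mentioned above is computed as 206.835 − 1.015·(words/sentences) − 84.6·(syllables/words). The sketch below uses a crude vowel-group heuristic for syllable counts, whereas production implementations typically use dictionary-based counts:

```python
import re

def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores indicate easier-to-read text.
    Syllables are approximated by counting contiguous vowel groups."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the mat."), 1))  # 116.1
```

Short, monosyllabic sentences like the example score above 100, while dense academic prose typically scores far lower, which is what makes the measure a usable proxy for cognitive complexity.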
We also examine how the effect of reputation on persuasion is moderated by the total number of opinion challengers. While we expect that having more challengers will increase the cognitive burden placed on the poster (and hence push them to rely more on heuristic information-processing), we find no evidence that posters rely more on reputation as the number of opinion challengers increases. We do find evidence that challengers with higher reputation have longer conversations with posters, which could be an important mediator of the effect of reputation on persuasion. We also find evidence that challengers with higher reputation are more likely to attract collaboration from other (non-poster) users, although reputation continues to have a significant (positive) direct effect on persuasion after excluding the potential effect of such collaboration.
Contributions and related work. Our research contributes to the economics of persuasion, whose practitioners account for over a quarter of the United States’ GDP [mccloskey1995one], including lawyers, judges, lobbyists, religious workers and salespeople. [antioch2013persuasion] revises this number to 30 percent after including marketing, advertising and political campaigning professionals. An extensive body of past work on persuasion spans the economics, marketing and political science literature (among others), and comprises both theoretical models [kamenica2011bayesian, kamenica2018bayesian] and empirical analyses of the efficacy of persuasive communication via field or natural experiments (see [dellavigna2010persuasion] for a survey).
Our work differs from previous empirical studies on the economics of persuasion in three ways. First, previous work focused on identifying the existence of persuasion by quantifying the causal effect of persuasive communication on some observable behavior, without the ability to observe individual-level opinion-change. Content and persuader-based moderators of persuasion were then analyzed conditional on non-zero persuasive effects having been identified [landry2006toward, bertrand2010s]. In our work, the explicit indicators of persuasion provided by posters allow us to sidestep the task of identifying persuasion and directly analyze its determinants. Second, in contrast with previous work, we observe attempts at persuasion made by thousands of unique individuals. This enables a broader investigation of the impact of persuader and content characteristics, which are predicted to play an important role by belief-based persuasion models [stigler1961economics, mullainathan2008coarse, kamenica2011bayesian]. Finally, we observe repeated attempts at persuasion made by each individual, which enables approximating and disentangling the impact of their persuasive ability from other factors.
More specifically, our work informs persuasive information design [kamenica2018bayesian] in interactive settings by quantifying the impact of extraneous signals that could serve as low-effort information-processing heuristics [petty1986elaboration, chaiken1989heuristic, todorov2002heuristic]. Such heuristics play an increasingly important role in this era of information overload [jones2004information], as emphasized by Cialdini in his seminal book on the principles of influence [cialdini2007influence]:
"Finally, each principle is examined as to its ability to produce a distinct kind of automatic, mindless compliance from people, that is, a willingness to say yes without thinking first. The evidence suggests that the ever-accelerating pace and informational crush of modern life will make this particular form of unthinking compliance more and more prevalent in the future. It will be increasingly important for the society, therefore, to understand the how and why of automatic influence."
Interactive persuasion channels are common today, with firms adopting online channels such as live-chat to triangulate consumers’ beliefs and influence them via dialogue. Interactive channels are often preferred for defensive marketing tasks [hauser1983defensive] such as addressing complaints and mitigating churn. Some firms invest further in interaction and embed themselves as bonafide members of influential enthusiast-run online forums (a notable example is gaming giant Electronic Arts: https://www.reddit.com/user/EACommunityTeam/). Marketing communication designed to persuade in such channels closely resembles the dyadic deliberation we examine in our work.
Our work is also related to research on the impact of certification and reputation systems [dranove2010quality] in markets for labor [moreno2014doing, kokkodis2016reputation], knowledge [dev2019quantifying], and other goods and services [tadelis2016reputation, hui2016reputation, lu2018can]. Consumers studied in this line of research engage in costly information-processing to evaluate item quality under cognitive, temporal or financial constraints. Hence, the findings therein are interpreted using the same underlying psychological mechanisms as we employ in our work [petty1986elaboration, chaiken1989heuristic]. The distinguishing feature of our work is the focus on explicitly stated opinion-change as the outcome, as a consequence of interpersonal deliberation. Importantly, there are no monetary transactions involved, and the reputation in our setting cannot be purchased at any cost; it is a truthful proxy for past persuasive ability. Thus, persuasion as exhibited in our setting is sufficiently different from the purchasing or hiring decisions analyzed in the literature on certification and reputation systems to warrant separate investigation.
Our work complements studies on deliberation in online settings, such as on political forums and social media [beauchampmodeling, shugars2019keep]. Specifically, our findings contribute to the understanding of opinion-change and polarization online [quattrociocchi2016echo] (see also https://www.wsj.com/articles/to-get-along-better-we-need-better-arguments-1531411024). By quantifying how an individual’s reliance on heuristic and systematic information-processing varies with the cognitive complexity of the persuasive message content, our findings could inform online campaigns that involve persuasive information design aimed at reducing polarization by affecting opinion-change.
Finally, our work contributes an application to the nascent study of causal inference from text, and more broadly to the literature on text as data [gentzkow2019text, netzer2019words, toubia2019extracting]. Our setting involves text as a control (see [keith2020text] for a recent survey of work in this setting). Previous approaches to accommodate text as a control (though with treatments assumed to be exogenous) include [sridhar2019estimating], which controls for topics in the text; [roberts2018adjusting], which assumes a structural topic model [roberts2016model] of text and controls for its sufficient reduction [taddy2013multinomial]; and [veitch2019using, shi2019adapting], which incorporate neural language models of text in the targeted learning inference framework [van2011targeted].
Our work also links the social science literature on persuasion with the computational natural language processing literature on argument-mining [lippi2016argumentation], where online argumentation platforms have been extensively studied [tan2016winning, jo2018attentive, luu2019measuring, atkinson2019gets, srinivasan2019content].
Outline. We begin in Section 2 by introducing background, formalizing our conceptual framework and motivating our hypotheses. We then describe our dataset in Section 3 and detail our empirical strategy in Section 4, including a description of our estimation procedure and evidence supporting the validity of our instrument. We discuss our results in Section 5 and interpret them through the lens of a theoretical model of persuasion. We conclude by summarizing our findings, discussing managerial implications for platforms facilitating online deliberation for public and private organizations, and noting the limitations of our research in Section 6.
2 Background and Conceptual Framework
The ChangeMyView online argumentation platform was created in January 2013 to foster good-faith discussions on polarizing issues, and has received praise for helping combat the proliferation of echo chambers online (“Civil discourse exists in this small corner of the internet” — The Atlantic, December 30, 2018). In this section, we formalize the process of deliberation on ChangeMyView and describe important platform features to motivate our empirical analyses in Section 4.
Opinion posters, opinion challengers and debates. Our unit of analysis is a debate. Each debate is associated with an opinion shared by an opinion poster, which is titled with the poster’s primary claim and contains at least 500 characters of supporting arguments. A response to the opinion by a challenger initiates a debate between the poster and challenger. Other users can (but rarely do) join the ongoing discussion between a poster and a challenger with their own comments; we term such debates multi-party. Debates must follow several rules (detailed in Appendix A) enforced by over 20 moderators. Notable rules are: (i) the poster must personally hold a non-neutral opinion, (ii) the poster must engage with all challengers for at least 3 hours after sharing their opinion, and (iii) a challenger’s response must counter at least one claim made by the poster. Responses to an opinion are ordered chronologically, and popularity votes on responses are hidden for the first 24 hours after an opinion is shared. These rules mitigate popularity biases, irrelevant digressions and hostility.
Opinion selection by users. The titles of posted opinions and the identities of the posters who shared them are displayed in a paginated list on the platform’s homepage, ordered by a combination of recency and popularity votes. A tab on the homepage also allows users to order opinions by recency only. Clicking on an opinion title opens a new page displaying the opinion text and any ongoing or concluded debates between the poster and other challengers. Users may select opinions to challenge based on various factors, such as the opinion text, their own topical preferences, the poster’s identity, and the number and status of the debates between the poster and other challengers.
The Δ-system. In mid-February 2013, ChangeMyView introduced a reputation system called the Δ-system to incentivize challenging opinions on the platform. At any point in a debate, the poster may reply to the challenger indicating that their opinion has changed, using the Δ symbol or equivalent alternatives. We term debates where the poster awarded a Δ to the challenger successful, and opinions that led to at least one successful debate conceded. Due to the platform rules requiring active engagement, 98% of the Δs from posters in our dataset were awarded within 24 hours of the opinion being posted, with over 50% awarded within just 90 minutes. This short delay reduces concerns of opinion-change occurring due to channels external to the debate. Each awarded Δ grants the challenger a reputation point. Other non-poster users can (but rarely do) also award Δs to any challenger and contribute to their reputation. The total reputation points earned previously, if non-zero, are displayed next to the challenger’s username with all of their responses on the platform.
The poster’s decision. Consider an opinion $i$ that is challenged by user $j$. The poster observes $j$’s username, reputation and the text of their immediate response to the opinion. Based on this information, the poster may initiate a discussion with the challenger, elicit additional responses (which we do not model) and eventually award a Δ if persuaded to change their opinion. We model the poster’s decision to award a Δ to challenger $j$ as a function of an opinion-specific threshold $\tau_i$ and the perceived quality $q_{ij}$ of $j$’s response:

$$Y_{ij} = \mathbb{1}\{q_{ij} \geq \tau_i\}, \qquad q_{ij} = \alpha R_{ij} + \beta Q_{ij}. \tag{1}$$

Here, $Y_{ij}$ is the observed debate outcome: $Y_{ij} = 1$ if the poster awarded a Δ to $j$ and $Y_{ij} = 0$ otherwise. The unobserved threshold $\tau_i$ encodes opinion-specific characteristics such as the opinion topic and the poster’s openness to persuasion. Based on [dewatripont2005modes, bilancini2018rational], we model the perceived quality $q_{ij}$ as a weighted linear combination of the challenger’s reputation $R_{ij}$ and the “true” response quality $Q_{ij}$, which the poster can determine by evaluating the challenger’s response at some cognitive cost. Posters choose the weights $\alpha$ and $\beta$ endogenously, based on this cognitive cost and their reliance on heuristic and systematic information-processing [petty1986elaboration, chaiken1989heuristic]. If $\alpha > 0$, reputation in this model serves as a reference cue: a proxy for the true response quality that can be processed with less effort than evaluating $Q_{ij}$ directly.
“True” response quality. We model the true response quality $Q_{ij}$ as a function of user $j$’s “skill” $S_{ij}$ at the time they challenged opinion $i$, and their position $P_{ij}$ in the sequence of challengers of opinion $i$. The position $P_{ij}$ captures the overall impact of previous challengers’ responses. For example, challengers responding earlier could exhaust the limited space of good arguments, making it harder for later challengers to respond with arguments of similar quality. We formalize this as follows:

$$Q_{ij} = S_{ij} - \gamma P_{ij}, \tag{2}$$

where $\gamma$ captures the decline in attainable response quality at later positions.
We approximate $j$’s skill by the Laplace-smoothed [manning2008introduction] fraction of posters persuaded before opinion $i$, where the opinion index $i$ is chronologically-ordered and $C_{i'j} = 1$ if $j$ challenged opinion $i'$:

$$S_{ij} = \frac{\sum_{i' < i} C_{i'j} Y_{i'j} + \lambda S_0}{\sum_{i' < i} C_{i'j} + \lambda}. \tag{3}$$

Here, $S_0$ is a “prior” set to the empirical persuasion probability of users in their first debate, and $\lambda > 0$ is a smoothing weight. Smoothing ensures that the skill of users measured when they have challenged few opinions tends to $S_0$ instead of to 0. A user’s skill is thus their (smoothed) lagged persuasion rate, which captures all user characteristics that affect persuasion and do not change with their tenure on the platform.
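The smoothed skill measure can be sketched as follows (the default smoothing weight of 1 is an illustrative assumption, and the prior value shown is only an example):

```python
def smoothed_skill(past_outcomes, prior, weight=1.0):
    """Laplace-smoothed lagged persuasion rate (the challenger's 'skill').
    past_outcomes: list of 0/1 outcomes of debates before the current opinion.
    With no history the estimate equals the prior; with many debates it
    approaches the raw past persuasion rate."""
    return (sum(past_outcomes) + weight * prior) / (len(past_outcomes) + weight)

print(smoothed_skill([], prior=0.035))                       # 0.035 (no history -> prior)
print(round(smoothed_skill([1, 0, 0, 0], prior=0.035), 3))   # 0.207
```

The smoothing matters precisely for new users: an unsmoothed rate would assign them a skill of exactly 0 after a few failed attempts, overstating how uninformative their short history is.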
Hypotheses. Based on prior analytical work [bilancini2018rational], we test three complementary hypotheses on the weights $\alpha$ and $\beta$, which reflect the poster’s endogenously-determined reliance on heuristic and systematic information-processing respectively:

H1. Reputation has persuasive power: $\alpha > 0$.

H2. The relative persuasive power of reputation, $\alpha/(\alpha + \beta)$, increases as the cognitive cost of processing the challenger’s response increases.

H3. The relative persuasive power of reputation, $\alpha/(\alpha + \beta)$, decreases as the involvement of the poster in the debated issue increases.
Confirming (H1) indicates that reputation has persuasive power, and confirming (H2) and (H3) lends support to the mechanism proposed by the model of [bilancini2018rational].
3 Data

We collect all the discussions on the ChangeMyView platform between January 2013 and October 2019 using a combination of the official Reddit API (https://www.reddit.com/dev/api/) and the third-party PushShift API [baumgartner2020pushshift], in full compliance with their terms of service. We exclude submissions to ChangeMyView that are not opinions using the fact that opinion titles are required to be prefixed with “CMV:”. The excluded submissions encompass discussions about the platform, announcements of platform changes and celebrations of milestones. We also exclude the opinions and responses posted to ChangeMyView before the reputation system became fully functional on March 1, 2013.
We extract indicators of successful persuasion from the debate text using the same extraction rules employed by ChangeMyView to programmatically parse Δs and other alternative symbols (code obtained from: https://github.com/alexames/DeltaBot). We use the extracted indicators to label debate success, to reconstruct each challenger’s reputation, and to measure each challenger’s skill in each debate. Figure 2 shows the empirical variation of skill with reputation in our dataset, with each point indicating the reputation and skill of a challenger measured in a single debate, colored by the number of debates they participated in previously. At values of skill outside the low and high extremes, there is wide variation in reputation. This variation is essential to disentangle the effects of reputation and skill on persuasion.
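Reconstructing each challenger’s pre-debate reputation from the extracted Δ indicators amounts to a running count over the chronological debate stream (a sketch with a hypothetical record format; Δs awarded by non-poster users would enter the count the same way):

```python
from collections import defaultdict

def running_reputation(records):
    """Reconstruct each challenger's reputation *before* each debate from a
    chronological stream of (challenger, delta_awarded) debate records."""
    total = defaultdict(int)
    reputation_before = []
    for challenger, delta_awarded in records:
        reputation_before.append(total[challenger])
        total[challenger] += int(delta_awarded)
    return reputation_before

records = [("a", True), ("a", False), ("a", True), ("a", True)]
print(running_reputation(records))  # [0, 1, 1, 2]
```

Recording the count *before* each debate matters: the Δ earned in the current debate must not enter that debate’s reputation measure, or the regressor would mechanically encode the outcome.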
Debates by challengers who had deleted their ChangeMyView accounts before data collection appear in our dataset with the “[deleted]” placeholder username. The inability to link the debates of such challengers over time makes it impossible to measure their true reputation and skill. Assuming that such challengers have zero reputation and skill (based on equation 3) is likely to attenuate our estimates due to measurement error. Hence, we exclude all 118,277 such debates from our dataset (for completeness, we report our main results including debates with deleted challengers in Appendix C). Our final dataset contains 91,730 opinions (23.5% of them conceded) shared by 60,573 unique posters, which led to 1,026,201 debates (3.5% of them successful) with 143,891 unique challengers. Table 1 reports descriptive statistics of our dataset, and Figure 3 reports user-level distributions of participation and debate success. Table 2 summarizes the notation that we will use in all subsequent sections.
| Statistic | Mean | SD | Median |
|---|---|---|---|
| **Statistics of challengers in each debate** | | | |
| Mean past position | 10.4 | 13.0 | 7.5 |
| Number of past debates | 244.4 | 591.7 | 24.0 |
| **Statistics of overall dataset** | | | |
| Number of opinions | 91,730 | | |
| Opinions leading to more than 1 debate (number of clusters with opinion fixed-effects) | 84,998 | | |
| Number of debates | 1,026,201 | | |
| Number of debates per opinion | 11.2 | 12.7 | 9 |
| Successful debates per opinion | 0.4 | 0.9 | 0 |
| Number of unique posters | 60,573 | | |
| Opinions per poster | 1.5 | 2.4 | 1 |
| Number of unique challengers | 143,891 | | |
| Challengers with more than 1 debate (number of clusters with user fixed-effects) | 64,871 | | |
| Number of debates per challenger | 7.1 | 58.5 | 1 |
| Successful debates per challenger | 0.3 | 3.2 | 0 |
| Symbol | Description |
|---|---|
| $i$ | Chronological opinion index |
| $j$ | Chronological response index |
| $(i, j)$ | Tuple representing a debate: the $j$-th response to opinion $i$ |
| $\delta_i$ | Opinion fixed-effect; captures unobserved opinion characteristics |
| $\theta_j$ | Challenger fixed-effect; captures unobserved challenger characteristics |
| $R_{ij}$ | Reputation of the challenger in debate $(i, j)$; sum of the past Δs earned |
| $S_{ij}$ | Skill of the challenger in debate $(i, j)$; smoothed lagged persuasion rate |
| $P_{ij}$ | Position of the challenger in debate $(i, j)$ |
| $T_{ij}$ | Calendar month-year fixed-effect for debate $(i, j)$ |
| $X_{ij}$ | Vector representation of the text of the challenger’s immediate response in debate $(i, j)$ |
| $Y_{ij}$ | Binary outcome of debate $(i, j)$; $Y_{ij} = 1$ for successful debates, $0$ otherwise |
| $C_{ij}$ | Binary opinion selection indicator; $C_{ij} = 1$ if $j$ challenged opinion $i$, $0$ otherwise |
| $Z_{ij}$ | Instrument (mean past position) for the challenger’s reputation in debate $(i, j)$ |
4 Empirical Strategy
4.1 Baseline Specifications
Our baseline specification models the outcome of debate $(i,j)$ as a function of the challenger’s reputation, skill and position:

$$Y_{ij} = \beta_R R_{ij} + \beta_S S_{ij} + \beta_P P_{ij} + \alpha_i + \epsilon_{ij}$$

Here, $\alpha_i$ is an opinion fixed-effect and $\epsilon_{ij}$ is an error term with zero conditional mean. Since the fixed-effects are at the opinion level and skill (a function of lagged dependent variables) is at the user level, these are not dynamic panel specifications, and are hence unaffected by Nickell bias nickell1981biases. Including the opinion fixed-effects excludes 6,732 debates from the sample, which were the only responses to their respective opinions. If distributional assumptions (such as Gumbel or Gaussian) on $\epsilon_{ij}$ hold and there are no unobserved confounders, the estimate of $\beta_R$ quantifies the change in the probability of persuading the poster of opinion $i$ upon increasing the challenger’s reputation by one unit, with all else equal. In Section 5, we report estimates from logistic and linear probability models.
While the assumption of no unobserved confounding is restrictive (and relaxed in Section 4.2), the baseline specifications address two important sources of confounding. First, controlling for the challenger’s skill $S_{ij}$ controls for all challenger characteristics that affect persuasion (such as their rhetorical ability and linguistic fluency) and that do not vary with their tenure on ChangeMyView. To see why such characteristics confound the effect of reputation on persuasion, note that a user’s reputation largely depends on the number of posters persuaded previously: $R_{ij} \approx \sum_{k<i} Q_{uk} Y_{uk}$, where $Y_{uk}$ denotes the outcome of user $u$’s debate on opinion $k$ (since users who are not posters rarely award $\Delta$s). Hence, any unobserved challenger characteristic that affects the outcome of every debate will also affect their reputation $R_{ij}$, and thus confound the effect of reputation on the debate outcome $Y_{ij}$.
| Dependent Variable: Debate Success ($Y_{ij}$) | |
|---|---|
| No. of opinions challenged previously | () |
| Position (std. deviations) | () |
| User fixed-effects ($\gamma_u$) | ✓ |
| Month-year fixed-effects ($\tau_{ij}$) | ✓ |
| No. of debates | |
Note: Standard errors displayed in parentheses.
However, skill $S_{ij}$ does not capture challenger characteristics that vary with their tenure on ChangeMyView. By assuming the absence of such characteristics, the baseline specifications implicitly assume that users do not learn to become more persuasive with experience on the platform. We provide empirical evidence to support this assumption by estimating the following linear probability model:
$$Y_{ij} = \beta_E E_{ij} + \beta_P P_{ij} + \gamma_u + \tau_{ij} + \epsilon_{ij}$$

where $\gamma_u$ is a user fixed-effect capturing all unobserved time-invariant user characteristics, $\tau_{ij}$ is a calendar month-year fixed-effect capturing unobserved temporal factors, $P_{ij}$ is the (standardized) user’s position in the sequence of challengers of opinion $i$ and $\epsilon_{ij}$ is a Gaussian error term. $E_{ij}$ is the number of opinions that user $u$ challenged previously, serving as a measure of their past experience. $\beta_E$ is the within-user correlation between past experience and the debate outcome. If users improve with experience, we expect $\beta_E$ to be positive. However, the estimates of $\beta_E$ reported in Table 3 are small and statistically insignificant. We attribute this to users having already acquired argumentation experience outside the platform, with little to gain from additional experience on the platform.
Second, controlling for the opinion fixed-effect $\alpha_i$ addresses confounding due to users endogenously selecting which opinions to challenge. To see why opinion selection is a concern, recall the opinion selection indicator $Q_{ui}$ that equals 1 when user $u$ challenges opinion $i$. Since we estimate our specifications on observed debates, our specifications implicitly condition on $Q_{ui} = 1$. If the opinion selection probability is correlated with (i) reputation, and (ii) debate success (for example, if users prefer to challenge opinions on topics that are easier to persuade in), the effect of reputation on debate success will be confounded due to endogenous sample selection james1979sample.
We characterize this confounding using the causal graph in Figure 4, based on the analyses in hernan2004structural. In causal graphs pearl2009causality, a directed edge $A \rightarrow B$ implies that $A$ may or may not cause $B$, while the absence of an edge implies the stronger assumption that $A$ does not cause $B$. An undirected edge $A - B$ implies potential causality in either direction. Observed variables are shaded and unobserved variables are unshaded.
The shaded nodes $R$, $Y$ and $Q$ correspond to the reputation, debate outcome and opinion selection indicator respectively. The unshaded node $U$ is any unobserved opinion characteristic that could directly affect both opinion selection and debate success, such as the opinion topic.
In the causal graph in Figure 4, $Q$ is a collider. A collider is any node $C$ that is a common outcome in causal substructures of the form $A \rightarrow C \leftarrow B$. Conditioning on $C$ opens a causal pathway between $A$ and $B$ that would otherwise be blocked. If reputation is correlated with opinion selection (depicted by the undirected edge $R - Q$), conditioning on the collider $Q$ (which our specifications do implicitly) opens the confounding causal pathway $R - Q \leftarrow U \rightarrow Y$. This confounds the effect of reputation on the debate outcome, since $U$ now affects both $Y$ and $R$ (via $Q$).
We test for correlation between $R_{ij}$ and $Q_{ui}$ by estimating the following linear probability model of a user challenging more than one opinion after opinion $i$:

$$\mathbb{1}\Big\{\textstyle\sum_{k>i} Q_{uk} > 0\Big\} = \beta_R R_{ij} + \gamma_u + \tau_{ij} + \epsilon_{ij}$$
where $\gamma_u$ is a user fixed-effect, $\tau_{ij}$ is a calendar month-year fixed-effect and $\epsilon_{ij}$ is a Gaussian error term. The estimate of $\beta_R$ in Table 4 suggests a significant positive correlation between $R_{ij}$ and $Q_{ui}$. This correlation may arise either because users that were successful in the past (and hence have higher reputation) are more likely to challenge opinions in the future, or because more active users are likely to have higher reputation (a mechanical relationship). Fortunately, the opinion fixed-effect $\alpha_i$ controls for all opinion characteristics, including the unobserved $U$, thus addressing potential confounding.
In summary, our baseline specifications address potential confounding due to (i) time-invariant challenger characteristics that affect persuasion, and (ii) users endogenously selecting which opinions to challenge. In the next section, we introduce specifications that instrument for the challenger’s reputation in each debate. The instrumental variable specifications inherit the robustness of the baseline specifications to confounding from time-invariant challenger characteristics and endogenous opinion selection, while further addressing potential confounding due to time-varying challenger characteristics that affect debate success.
4.2 Instrumental Variable Specifications
Our instrumental variable specifications address confounding due to unobserved user characteristics that affect persuasion and vary with their experience on the platform. Estimates from this specification quantify the local average treatment effect (LATE) of reputation on debate success if instrument relevance, exogeneity, exclusion and monotonicity hold imbens1994identification. In this section, we derive our instrument and provide empirical evidence to support its validity.
Our instrument is motivated by the fact that a user’s reputation largely depends on the number of posters persuaded previously, since other users who are not the poster rarely award $\Delta$s:

$$R_{ij} \approx \sum_{k<i} Q_{uk}\, Y_{uk}$$
From equation (2), we also know that a user’s position $P_{ij}$ in the sequence of challengers of opinion $i$ is correlated with debate success $Y_{ij}$. Hence, we define our instrument for the challenger’s reputation as the mean past position of user $u$ before challenging opinion $i$:

$$Z_{ij} = \frac{\sum_{k<i} Q_{uk}\, P_{uk}}{\sum_{k<i} Q_{uk}}$$
where $k$ is a chronologically-ordered opinion index, $P_{uk}$ is user $u$’s position among the challengers of opinion $k$, and the opinion selection indicator $Q_{uk} = 1$ if user $u$ challenged opinion $k$ and $Q_{uk} = 0$ otherwise. We expect users who were consistently late challengers of opinions in the past (and thus have larger mean past positions) to have persuaded fewer posters on average than users who were consistently early, and hence to have lower reputation in the present. Thus, we expect $Z_{ij}$ to be negatively correlated with $R_{ij}$.
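To make the definition concrete, the instrument can be computed in a single pass over a chronologically-sorted debate log. The following minimal Python sketch is illustrative only (the record layout and the `mean_past_positions` helper are our assumptions, not the paper’s code):

```python
from collections import defaultdict

def mean_past_positions(debates):
    """For each debate (user, opinion, position) in chronological order,
    return the user's mean position over previously challenged opinions
    (None when the user has no past debates)."""
    past_sum = defaultdict(float)   # sum of past positions per user
    past_cnt = defaultdict(int)     # number of past opinions challenged per user
    z = []
    for user, opinion, position in debates:
        z.append(past_sum[user] / past_cnt[user] if past_cnt[user] else None)
        past_sum[user] += position
        past_cnt[user] += 1
    return z

# A user who consistently responds late has a larger mean past position:
log = [("a", 1, 9), ("b", 1, 2), ("a", 2, 11), ("a", 3, 10)]
print(mean_past_positions(log))  # [None, None, 9.0, 10.0]
```

Note that the instrument for each debate uses only strictly earlier opinions, mirroring the $k < i$ summation above.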
We confirm this relationship with the following first-stage regression:

$$R_{ij} = \pi_Z Z_{ij} + \pi_S S_{ij} + \pi_P P_{ij} + \alpha_i + \nu_{ij}$$

where $\alpha_i$ is an opinion fixed-effect, $S_{ij}$ is user $u$’s skill, $P_{ij}$ is user $u$’s position in the sequence of challengers of opinion $i$ and $\nu_{ij}$ is a zero-mean Gaussian error term.
Our first-stage estimates in Table 5 indicate that a one-unit increase in the mean past position of a user predicts a 0.18 unit decrease in their present reputation. The F-statistic on the instrument greatly exceeds the rule-of-thumb threshold stock2005testing, alleviating concerns about instrument strength. Skill has a positive first-stage correlation with reputation, which is expected since higher-skilled users are likely to have persuaded more posters previously. The response position has a negative first-stage correlation with reputation, which we expect if users are consistent in their preference to respond early or late.
| Dependent Variable: Reputation ($R_{ij}$) | |
|---|---|
| Mean past position ($Z_{ij}$) | () |
| Position (std. deviations) | () |
| Opinion fixed-effects ($\alpha_i$) | ✓ |
| No. of debates | |
Note: Standard errors displayed in parentheses.
An immediate concern is users selecting opinions to challenge based on their anticipated position in the sequence of challengers, since users can observe the number of ongoing and concluded debates with the poster before deciding to challenge an opinion. We characterize this scenario using the causal graph in Figure 5, which extends the causal graph in Figure 4 with shaded nodes $Z$ (for the instrument) and $P$ (for the challenger’s present position). $P$ affects the debate outcome $Y$, based on equation (2). Recall from Section 4.1 that our specifications implicitly condition on the collider $Q$. If the instrument is correlated with opinion selection (depicted by the undirected edge $Z - Q$) and users select opinions to challenge based on their anticipated position (depicted by the edge $P \rightarrow Q$), conditioning on $Q$ will open the confounding causal pathway $Z - Q \leftarrow P \rightarrow Y$. Hence, it is essential to control for the challenger’s present position $P$, which could otherwise confound the instrument.
The causal graph in Figure 5 reveals a second source of instrument confounding that has received recent attention hughes2019selection,swanson2019practical. If the instrument is correlated with opinion selection (depicted by the undirected edge $Z - Q$) and some unobserved opinion characteristic $U$ (such as the opinion topic) affects both opinion selection and debate success, conditioning on $Q$ opens the confounding causal pathway $Z - Q \leftarrow U \rightarrow Y$ that violates instrument exogeneity.
We can test for correlation between the instrument and opinion selection by estimating the following linear probability model of a user challenging more than one opinion after opinion $i$, where $\gamma_u$ is a user fixed-effect, $\tau_{ij}$ is a calendar month-year fixed-effect and $\epsilon_{ij}$ is a Gaussian error term:

$$\mathbb{1}\Big\{\textstyle\sum_{k>i} Q_{uk} > 0\Big\} = \beta_Z Z_{ij} + \gamma_u + \tau_{ij} + \epsilon_{ij}$$
The estimates of $\beta_Z$ in Table 6 suggest a small but significant negative correlation between $Z_{ij}$ and $Q_{ui}$, justifying our concerns of endogenous opinion selection violating instrument exogeneity. Fortunately (as discussed in Section 4.1), the opinion fixed-effect $\alpha_i$ controls for all opinion characteristics, including the unobserved $U$. This alleviates concerns of instrument exogeneity being violated due to endogenous opinion selection.
Another plausible concern is that the instrument affects the debate outcome via channels that do not include the user’s reputation, violating the instrument exclusion restriction. For example, if users learn to be more persuasive from the earlier challengers of an opinion, a user with a high mean past position could be more persuasive in the present than one with a low mean past position.
We address this concern in two ways. First, note that any user characteristic correlated with successful persuasion is likely to affect the debate outcome through the text of their responses. Hence, controlling for the response text will block direct channels of influence between the instrument and the debate outcome. This is formalized by the causal graph on the right. Here, the reputation $R$, debate outcome $Y$, response text $T$ and instrument $Z$ are observed. $U$ contains all unobserved confounders of the instrument or reputation (or both) that affect the outcome through the text $T$. Since every causal pathway from $U$ to $Y$ passes through the text, controlling for $T$ suffices to block the confounding pathways between the instrument and the outcome.
We operationalize this idea by estimating the following partially-linear instrumental variable specification with endogenous $R_{ij}$, as formulated by chernozhukov2018double:

$$Y_{ij} = \theta R_{ij} + \beta_S S_{ij} + \beta_P P_{ij} + g(\alpha_i, T_{ij}) + \epsilon_{ij}$$
$$Z_{ij} = m(\alpha_i, T_{ij}) + \nu_{ij}$$
In this specification, the high-dimensional covariates $\alpha_i$ (the opinion fixed-effects) and $T_{ij}$ (a vector representation of the challenger’s response text) have been moved into the arguments of the “nuisance functions” $g$ and $m$. As earlier, $R_{ij}$ is the challenger’s reputation, $S_{ij}$ their skill, $P_{ij}$ their position and $Z_{ij}$ (the instrument) the mean past position of user $u$ before opinion $i$. $\epsilon_{ij}$ and $\nu_{ij}$ are error terms with zero conditional mean. $\theta$ is the parameter of interest, quantifying the causal effect of reputation on persuasion.
No distributional assumptions are placed on $\epsilon_{ij}$ and $\nu_{ij}$, and hence this specification does not assume any functional form (in contrast with logit, probit and linear probability models). $g$ and $m$ can be flexible nonparametric functions. We discuss estimation and inference in Section 4.3.
Second, we use the “plausibly exogenous” instrumental variable framework conley2012plausibly to relax the instrument exclusion restriction and include $Z_{ij}$ directly in the debate outcome model (conley2012plausibly propose four inference strategies that incorporate plausibly exogenous instruments; the strategy we use relies on the fewest assumptions and provides the most conservative estimates of $\theta$):

$$Y_{ij} = \theta R_{ij} + \beta_S S_{ij} + \beta_P P_{ij} + \gamma Z_{ij} + \alpha_i + \epsilon_{ij}$$
where $\gamma$ encodes by how much the exclusion restriction is violated. For a fixed $\gamma$ and a conditionally exogenous instrument, the effect of reputation on debate success can be quantified via two-stage least-squares estimation of the following regression, using $Z_{ij}$ as an instrument for $R_{ij}$:

$$Y_{ij} - \gamma Z_{ij} = \theta R_{ij} + \beta_S S_{ij} + \beta_P P_{ij} + \alpha_i + \epsilon_{ij}$$
If users indeed learn to be more persuasive from earlier challengers, we would expect $\gamma > 0$. We report estimates of $\theta$ from the specification above for a range of $\gamma$ values in Section 5.
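The mechanics of the plausibly-exogenous adjustment can be illustrated on synthetic data. The sketch below (our illustration, not the paper’s estimation code) subtracts a fixed $\gamma Z$ from the outcome and applies a just-identified IV estimator, omitting the fixed-effects and controls for brevity:

```python
import numpy as np

def theta_given_gamma(y, r, z, gamma):
    """Just-identified IV estimate of theta after subtracting gamma*z from y,
    per the plausibly-exogenous approach (controls and fixed-effects omitted
    for brevity; variables are demeaned in place of an intercept)."""
    y_adj = y - gamma * z
    y_adj, r_c, z_c = (v - v.mean() for v in (y_adj, r, z))
    return (z_c @ y_adj) / (z_c @ r_c)  # cov(z, y - gamma*z) / cov(z, r)

rng = np.random.default_rng(0)
n, theta, gamma = 50_000, 0.5, 0.1
z = rng.normal(size=n)
r = -0.2 * z + rng.normal(size=n)               # first stage
y = theta * r + gamma * z + rng.normal(size=n)  # direct effect of z violates exclusion
print(theta_given_gamma(y, r, z, 0.0))   # biased when the violation is ignored
print(theta_given_gamma(y, r, z, 0.1))   # close to 0.5 at the true gamma
```

Sweeping `gamma` over a grid of values, as in Section 5, then traces out how sensitive the estimate is to exclusion-restriction violations.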
While instrument relevance, exogeneity and exclusion are sufficient to guarantee identification of the effect of reputation on debate success, we also require instrument monotonicity to interpret our estimate as a local average treatment effect (LATE) imbens1994identification. Instrument monotonicity will be violated if there exists a subpopulation of debates where increasing the mean past position of the challenger would increase their present reputation, and decreasing their mean past position would decrease their present reputation (members of this subpopulation are called defiers). Such challengers are more likely to persuade a poster when they respond later. The large and precisely-estimated negative within-user correlation between the number of earlier challengers and debate success reported in Table 3 suggests that the existence of such challengers is unlikely.
The LATE is the effect of reputation on debate success for compliers: debates where the challenger’s reputation is indeed affected by their mean past position. The challengers in these debate subpopulations are more persuasive at earlier (lower) positions, and less persuasive at later (higher) positions. Hence, we expect the compliers to exclude debates with challengers having low persuasive ability, who are unlikely to be more or less persuasive in any position. We also expect challengers with moderate to high persuasive ability to benefit more from an increase in their reputation than challengers with low persuasive ability, since a high reputation is unlikely to substitute for low persuasive ability. Hence, we expect the LATE to be larger than the average treatment effect of reputation on debate success.
4.3 Estimation and Inference
Our baseline linear probability model and linear instrumental variable specifications can be estimated using ordinary least-squares and two-stage least squares respectively, adapted to accommodate high-dimensional fixed-effects correia2017hdfe. In this section, we describe how the double machine-learning framework chernozhukov2018double can be used to consistently estimate the effects of reputation, skill and position in the partially-linear instrumental variable specification.
Double machine-learning extends the partialling-out procedure of Frisch-Waugh-Lovell frisch1933partial,lovell1963seasonal to use flexible nonparametric functions estimated via machine learning. We first describe the basic setup assuming reputation is conditionally exogenous (given the response text), ignoring the opinion fixed-effects, and ignoring the effects of skill and position. Consider the following partially-linear probability model:
$$Y_{ij} = \theta R_{ij} + g(T_{ij}) + \epsilon_{ij}$$
$$R_{ij} = m(T_{ij}) + \nu_{ij}$$

where $Y_{ij}$ is the debate outcome, $R_{ij}$ is the challenger’s reputation, $T_{ij}$ is their response text, and $\epsilon_{ij}$ and $\nu_{ij}$ are Gaussian error terms with zero conditional mean. $g$ and $m$ are unknown nonparametric functions. We are interested in consistently estimating and performing valid inference on $\theta$.
If $g$ and $m$ were fixed and known, consistent estimation would be possible by solving for $\theta$ in an empirical version of the following moment condition (equivalent to ordinary least-squares estimation):

$$\mathbb{E}\big[\big(Y_{ij} - \theta R_{ij} - g(T_{ij})\big)\, R_{ij}\big] = 0$$
However, $g$ is unknown and needs to be jointly estimated with $\theta$. A solution is to first estimate $\hat g$ on a separate subsample $\mathcal{D}_{est}$ of the data, and then estimate $\theta$ by solving an empirical version of the moment condition above on the remaining subsample $\mathcal{D}_{inf}$. This procedure, called sample-splitting, eliminates the “overfitting bias” introduced in the process of estimating $\hat g$.
If $\hat g$ is estimated via machine learning, the procedure above results in inconsistent estimates $\hat\theta$. chernozhukov2018double decompose the scaled bias of $\hat\theta$ into the following two terms:

$$\sqrt{n}\,(\hat\theta - \theta) = \underbrace{\Big(\tfrac{1}{n}\sum_{(i,j)} R_{ij}^2\Big)^{-1}\tfrac{1}{\sqrt{n}}\sum_{(i,j)} R_{ij}\,\epsilon_{ij}}_{(a)} \;+\; \underbrace{\Big(\tfrac{1}{n}\sum_{(i,j)} R_{ij}^2\Big)^{-1}\tfrac{1}{\sqrt{n}}\sum_{(i,j)} R_{ij}\,\big(g(T_{ij}) - \hat g(T_{ij})\big)}_{(b)}$$
Term $(a)$ converges at a $\sqrt{n}$ rate to a zero-mean Gaussian. However, by virtue of $\hat g$ being estimated via machine learning, term $(b)$ will typically converge to zero at a rate slower than $\sqrt{n}$ due to the slow convergence of the estimation error $\hat g - g$. This is called the “regularization bias” of $\hat\theta$.
Double machine-learning eliminates regularization bias via a procedure called orthogonalization. $\theta$ is estimated by solving an empirical version of the following “Neyman-orthogonal” moment condition, where $\ell(T_{ij}) = \mathbb{E}[Y_{ij} \mid T_{ij}]$ and $m(T_{ij}) = \mathbb{E}[R_{ij} \mid T_{ij}]$:

$$\mathbb{E}\Big[\big((Y_{ij} - \ell(T_{ij})) - \theta\,(R_{ij} - m(T_{ij}))\big)\,\big(R_{ij} - m(T_{ij})\big)\Big] = 0$$
The empirical version of this moment condition can be solved via a procedure similar to the residual-on-residuals regression of robinson1988root. The procedure is as follows (where $\mathcal{D}_{est}$ and $\mathcal{D}_{inf}$ are disjoint subsamples of the data, and $\ell$ and $m$ are nonparametric functions):
1. Estimate the conditional expectation function $\ell(T_{ij}) = \mathbb{E}[Y_{ij} \mid T_{ij}]$ on $\mathcal{D}_{est}$ to get $\hat\ell$.
2. Estimate the conditional expectation function $m(T_{ij}) = \mathbb{E}[R_{ij} \mid T_{ij}]$ on $\mathcal{D}_{est}$ to get $\hat m$.
3. Estimate the outcome residual $\hat\epsilon^{Y}_{ij} = Y_{ij} - \hat\ell(T_{ij})$ on $\mathcal{D}_{inf}$.
4. Estimate the treatment residual $\hat\epsilon^{R}_{ij} = R_{ij} - \hat m(T_{ij})$ on $\mathcal{D}_{inf}$.
5. Regress $\hat\epsilon^{Y}_{ij}$ on $\hat\epsilon^{R}_{ij}$ to obtain $\hat\theta$.
Note that we no longer need to estimate $g$, and instead need to estimate the conditional expectations $\mathbb{E}[Y_{ij} \mid T_{ij}]$ and $\mathbb{E}[R_{ij} \mid T_{ij}]$, which can be arbitrary nonparametric functions of $T_{ij}$ (such as neural networks). This procedure can be extended to include skill and position as controls by estimating additional conditional expectation functions to predict the challenger’s skill and position from their response text on $\mathcal{D}_{est}$, estimating the residuals $\hat\epsilon^{S}_{ij}$ and $\hat\epsilon^{P}_{ij}$ on $\mathcal{D}_{inf}$, and then regressing $\hat\epsilon^{Y}_{ij}$ on $\hat\epsilon^{R}_{ij}$, $\hat\epsilon^{S}_{ij}$ and $\hat\epsilon^{P}_{ij}$.
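The residual-on-residuals procedure can be illustrated on synthetic data. In this sketch (our illustration, with polynomial fits standing in for the machine-learning nuisance estimators purely for brevity), the true effect is recovered despite the nonlinear nuisance functions:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.3                                  # true effect we want to recover
n = 40_000
t = rng.uniform(-2, 2, size=n)               # scalar stand-in for the text vector
g, m = np.sin(3 * t), t ** 2                 # nonlinear nuisance functions
r = m + rng.normal(size=n)                   # "reputation": depends on the text
y = theta * r + g + rng.normal(size=n)       # outcome: partially-linear model

est, inf = np.arange(n) < n // 2, np.arange(n) >= n // 2   # sample split

# Steps 1-2: fit E[y|t] and E[r|t] on the estimation subsample
# (degree-9 polynomials stand in for the neural networks used later)
l_hat = np.polynomial.Polynomial.fit(t[est], y[est], deg=9)
m_hat = np.polynomial.Polynomial.fit(t[est], r[est], deg=9)

# Steps 3-4: residualize on the inference subsample
ey = y[inf] - l_hat(t[inf])
er = r[inf] - m_hat(t[inf])

# Step 5: residual-on-residuals regression recovers theta
theta_hat = (er @ ey) / (er @ er)
print(theta_hat)  # close to the true value 0.3
```

Because the estimator depends only on the products of the nuisance estimation errors, moderate inaccuracy in the fitted nuisance functions does not bias the final estimate.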
The resulting estimate $\hat\theta$ is $\sqrt{n}$-consistent and asymptotically normal. chernozhukov2018double show that term $(b)$ of the scaled bias of $\hat\theta$ is now given by the following expression:

$$\Big(\tfrac{1}{n}\sum_{(i,j)} \big(R_{ij} - m(T_{ij})\big)^2\Big)^{-1}\tfrac{1}{\sqrt{n}}\sum_{(i,j)} \big(m(T_{ij}) - \hat m(T_{ij})\big)\big(\ell(T_{ij}) - \hat\ell(T_{ij})\big)$$
This contains the product of the nuisance function estimation errors. Hence, orthogonalization enables $\sqrt{n}$-consistent estimation of $\theta$ as long as the product of the convergence rates of $\hat\ell$ and $\hat m$ is faster than $n^{-1/2}$. This is more viable than requiring each nuisance function to converge at an $n^{-1/2}$ rate.
If $R_{ij}$ is endogenous and $Z_{ij}$ is a valid instrument for $R_{ij}$, chernozhukov2018double propose the following Neyman-orthogonal moment condition to estimate $\theta$ in a partially-linear instrumental variable specification, where $q(T_{ij}) = \mathbb{E}[Z_{ij} \mid T_{ij}]$:

$$\mathbb{E}\Big[\big((Y_{ij} - \ell(T_{ij})) - \theta\,(R_{ij} - m(T_{ij}))\big)\,\big(Z_{ij} - q(T_{ij})\big)\Big] = 0$$
By a similar bias derivation, the estimated $\hat\theta$ is shown to be $\sqrt{n}$-consistent and asymptotically normal, as long as the instrument is valid and the product of the nuisance function convergence rates is faster than $n^{-1/2}$.
We now detail our overall estimation procedure for the partially-linear instrumental variable specification. We include the opinion fixed-effect $\alpha_i$, skill $S_{ij}$ and position $P_{ij}$ as controls. $\mathcal{D}_{est}$ and $\mathcal{D}_{inf}$ are disjoint subsamples of the data, and $\hat\ell_Y$, $\hat\ell_R$, $\hat\ell_S$, $\hat\ell_P$ and $\hat\ell_Z$ are nonparametric functions that we detail in the next subsection. The procedure is as follows:
1. Estimate the following conditional expectation functions on sample $\mathcal{D}_{est}$:
   - $\mathbb{E}[Y_{ij} \mid \alpha_i, T_{ij}]$ to get $\hat\ell_Y$.
   - $\mathbb{E}[R_{ij} \mid \alpha_i, T_{ij}]$ to get $\hat\ell_R$.
   - $\mathbb{E}[S_{ij} \mid \alpha_i, T_{ij}]$ to get $\hat\ell_S$.
   - $\mathbb{E}[P_{ij} \mid \alpha_i, T_{ij}]$ to get $\hat\ell_P$.
   - $\mathbb{E}[Z_{ij} \mid \alpha_i, T_{ij}]$ to get $\hat\ell_Z$.
2. Estimate the residuals $\hat\epsilon^{Y}_{ij}$, $\hat\epsilon^{R}_{ij}$, $\hat\epsilon^{S}_{ij}$, $\hat\epsilon^{P}_{ij}$ and $\hat\epsilon^{Z}_{ij}$ (for example, $\hat\epsilon^{Y}_{ij} = Y_{ij} - \hat\ell_Y(\alpha_i, T_{ij})$) on sample $\mathcal{D}_{inf}$.
3. Run a two-stage least-squares regression of $\hat\epsilon^{Y}_{ij}$ on $\hat\epsilon^{R}_{ij}$, $\hat\epsilon^{S}_{ij}$ and $\hat\epsilon^{P}_{ij}$, using $\hat\epsilon^{Z}_{ij}$ as an instrument for $\hat\epsilon^{R}_{ij}$, to obtain the estimated local average treatment effects of reputation, skill and position on debate success.
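The final two-stage least-squares step on residuals can be sketched as follows, on synthetic residuals (a generic projection-based 2SLS written for illustration, not our estimation code):

```python
import numpy as np

def tsls(ey, endog_and_controls, instruments_and_controls):
    """Two-stage least-squares: project the regressors onto the instrument
    set, then regress the outcome residuals on the projected regressors."""
    X, W = endog_and_controls, instruments_and_controls
    first = W @ np.linalg.lstsq(W, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.lstsq(first, ey, rcond=None)[0]   # second stage

rng = np.random.default_rng(2)
n = 100_000
ez = rng.normal(size=n)                  # instrument residual (mean past position)
es, ep = rng.normal(size=(2, n))         # skill and position residuals (exogenous)
u = rng.normal(size=n)                   # confounder making reputation endogenous
er = -0.2 * ez + u + rng.normal(size=n)  # reputation residual
ey = 0.3 * er + 0.1 * es - 0.2 * ep + u + rng.normal(size=n)

X = np.column_stack([er, es, ep])        # reputation instrumented; es, ep self-instrument
W = np.column_stack([ez, es, ep])
print(tsls(ey, X, W))  # approximately [0.3, 0.1, -0.2]
```

Only the reputation residual is instrumented; the exogenous skill and position residuals serve as their own instruments, as in standard 2SLS.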
We partition the debates for opinions with more than one response (mirroring the data used in the specifications with opinion fixed-effects) uniformly at random into an estimation subsample $\mathcal{D}_{est}$ containing 10% of the debates (101,946 debates) and an inference subsample $\mathcal{D}_{inf}$ containing 90% of the debates (917,523 debates), ensuring that every opinion is represented in both $\mathcal{D}_{est}$ and $\mathcal{D}_{inf}$. In the next section, we describe how we use neural networks with rectified linear unit (ReLU) activation functions for the nonparametric functions $\hat\ell_Y$, $\hat\ell_R$, $\hat\ell_S$, $\hat\ell_P$ and $\hat\ell_Z$, which have been shown to converge at rates farrell2018deep that enable consistent estimation and valid inference.
4.4 Neural Models of Text as Nuisance Functions
A fully-connected neural network with $L$ hidden layers is parameterized by matrices $W_1, \dots, W_{L+1}$ and activation functions (called activations) $\sigma_1, \dots, \sigma_{L+1}$. The hidden layer sizes $h_1, \dots, h_L$ are architectural hyperparameters that determine the sizes of the matrices $W_l$ as follows, where $d_{in}$ and $d_{out}$ are the dimensionalities of the neural network input and output, respectively:

$$W_1 \in \mathbb{R}^{h_1 \times d_{in}}, \qquad W_l \in \mathbb{R}^{h_l \times h_{l-1}} \;\;(1 < l \le L), \qquad W_{L+1} \in \mathbb{R}^{d_{out} \times h_L}$$

Each layer $l$ multiplies the intermediate vector $v_{l-1}$ produced by the previous layer with $W_l$, and applies the activation function $\sigma_l$ to produce $v_l = \sigma_l(W_l\, v_{l-1})$. Figure 6 illustrates a neural network with one hidden layer ($L = 1$). The neural network transforms the input $v_0$, a concatenation of the response text vector $T_{ij}$ and the fixed-effects indicator vector for $\alpha_i$, into the 1-dimensional predicted output.
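The forward pass just described amounts to a chain of matrix multiplications and activations. A minimal NumPy sketch (toy dimensionalities chosen for illustration, far smaller than the paper’s architecture):

```python
import numpy as np

def forward(v, weights, activations):
    """Forward pass of a fully-connected network: each layer multiplies the
    previous layer's output by W_l and applies its activation sigma_l."""
    for W, sigma in zip(weights, activations):
        v = sigma(W @ v)
    return v

relu = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

d_in, h1, d_out = 4, 3, 1      # toy dimensionalities
rng = np.random.default_rng(0)
weights = [rng.normal(size=(h1, d_in)),    # W_1: h1 x d_in
           rng.normal(size=(d_out, h1))]   # W_2: d_out x h1
y_hat = forward(rng.normal(size=d_in), weights, [relu, sigmoid])
print(y_hat.shape)  # (1,)
```

The matrix shapes follow the dimension constraints above: each $W_l$ maps the previous layer’s output dimension to the next layer’s size.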
We estimate five neural networks with rectified linear unit (ReLU) activations to predict (i) debate success $Y_{ij}$, (ii) reputation $R_{ij}$, (iii) skill $S_{ij}$ (as a percentage), (iv) position $P_{ij}$ (standardized to have zero mean and unit variance) and (v) the instrument $Z_{ij}$ from the response text $T_{ij}$ and opinion fixed-effects $\alpha_i$. Though recurrent hochreiter1997long and convolutional kim2014convolutional neural networks are more popular for textual prediction tasks, ReLU neural networks have guaranteed convergence rates farrell2018deep that we require for consistent estimation and valid inference. Hence, we set each of the hidden layer activations $\sigma_1, \dots, \sigma_L$ to the rectifier function $\sigma(x) = \max(0, x)$. Since the output of each neural network is one-dimensional, we set the size of the output layer matrix $W_{L+1}$ to $1 \times h_L$.
Output layer activations and loss functions. For the debate success prediction network with the binary target $Y_{ij} \in \{0, 1\}$, we set the output layer activation to the logistic sigmoid function $\sigma_{L+1}(x) = 1/(1 + e^{-x})$. For the skill prediction network with the bounded target $S_{ij} \in [0, 100]$, we set the output layer activation to the scaled logistic sigmoid function $\sigma_{L+1}(x) = 100/(1 + e^{-x})$. For the reputation and instrument prediction networks with nonnegative targets $R_{ij}$ and $Z_{ij}$, we set the output layer activation to the rectifier function $\sigma_{L+1}(x) = \max(0, x)$. For the position prediction network with unbounded target $P_{ij}$, we set the output layer activation to the identity function $\sigma_{L+1}(x) = x$. We estimate the parameters for the debate success prediction network by minimizing the binary cross-entropy loss (where $\hat Y_{ij}$ is the predicted output), and for the other networks by minimizing the mean squared error.
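The activation-target pairings and losses above can be written compactly (a sketch; the clipping constant in the cross-entropy is a numerical-safety detail we add, not part of the specification):

```python
import numpy as np

# Output activations matched to each prediction target
rectifier = lambda x: np.maximum(0.0, x)        # nonnegative: reputation, instrument
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))    # binary: debate success
scaled_sigmoid = lambda x: 100.0 * sigmoid(x)   # bounded [0, 100]: skill (percentage)
identity = lambda x: x                          # unbounded: position

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)        # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y, y_hat = np.array([1.0, 0.0]), sigmoid(np.array([2.0, -2.0]))
print(binary_cross_entropy(y, y_hat))  # low loss: predictions match targets
```

Matching the output activation to the target’s range ensures every prediction is feasible (e.g. a predicted skill always lies in $[0, 100]$).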
Neural network input. We follow recommendations from the text-as-data literature gentzkow2019text and construct a term-frequency inverse-document-frequency (TF-IDF) matrix from the text of the challengers’ responses. We preprocess the text to remove links, numbers, pronouns, punctuation and text formatting symbols, and replace each word with its lower-cased stem (for example, “economically” and “economics” will be replaced by the stem “economic”). We exclude very rare words (present in less than 0.1% of the responses) and very frequent words (present in more than 99.9% of the responses), since these words will contribute negligibly towards more accurate predictions. The final vocabulary contains 4,926 distinct words. Each row of the TF-IDF matrix corresponds to a response vector $T_{ij}$. We also construct an indicator vector for the opinion fixed-effects (since there are 84,998 unique opinion clusters), where only the element corresponding to opinion $i$ is set to 1 and the rest are set to zero. The concatenation of these two vectors is passed as input to the neural networks.
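A bare-bones sketch of the TF-IDF construction on a toy stemmed corpus (library implementations such as scikit-learn’s `TfidfVectorizer` additionally apply smoothing and L2 normalization; this simple variant is for illustration only):

```python
import math
from collections import Counter

def tfidf(docs):
    """Bare-bones TF-IDF: tf = count / doc length, idf = log(N / df)."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))          # document frequency
    idf = {w: math.log(len(docs) / df[w]) for w in vocab}
    return [[(Counter(d)[w] / len(d)) * idf[w] for w in vocab] for d in docs]

docs = [["tax", "polici", "tax"], ["polici", "debat"]]   # stemmed responses
m = tfidf(docs)
# "polici" appears in every document, so its idf (and tf-idf weight) is zero
print(m)
```

The frequency-based vocabulary cutoffs described above serve the same purpose as this idf down-weighting: words carrying little discriminative signal receive negligible weight.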
We train each network via backpropagation rumelhart1986learning with the Adam gradient-based optimization algorithm kingma2015adam, iterating over mini-batches of the training data. We begin the optimization process by initializing the parameters using the Kaiming uniform initialization scheme he2015delving, which has been shown to perform well both empirically and theoretically hanin2018start. We perform batch-normalization ioffe2015batch on each layer’s output after applying the activation function, to prevent internal covariate shift and accelerate convergence. To prevent overfitting to the training data, we apply weight-decay (a form of $\ell_2$-norm penalization) krogh1992simple to all the parameters, along with early-stopping (halting the training process once the out-of-sample predictive power starts decreasing with training iterations). We do not employ dropout regularization srivastava2014dropout, since it reduces out-of-sample predictive power when combined with batch-normalization li2019understanding.
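The training loop (gradient steps with a weight-decay penalty and early-stopping on validation loss) can be sketched with a logistic-regression stand-in for the networks; the hyperparameter values and helper below are illustrative, not the paper’s configuration:

```python
import numpy as np

def train(X, y, Xv, yv, lr=0.1, weight_decay=1e-3, patience=5, max_iter=2000):
    """Gradient descent on logistic loss with weight decay (L2 penalty) and
    early stopping on the validation loss; a stand-in for the mini-batch
    Adam training of the neural networks described above."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    best_loss, best_w, bad = np.inf, w, 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y) + weight_decay * w)  # penalized gradient
        pv = 1.0 / (1.0 + np.exp(-(Xv @ w)))
        val = -np.mean(yv * np.log(pv + 1e-12) + (1 - yv) * np.log(1 - pv + 1e-12))
        if val < best_loss:
            best_loss, best_w, bad = val, w.copy(), 0
        else:
            bad += 1
            if bad >= patience:      # validation loss stopped improving
                break
    return best_w

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 5)); w_true = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(2000) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)
w = train(X[:1500], y[:1500], X[1500:], y[1500:])
print(np.sign(w[:3]))  # recovers the signs of the true coefficients
```

Returning the parameters with the best validation loss (rather than the final iterate) is what makes early-stopping act as regularization.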
Architectural and optimization hyperparameters. The number of hidden layers $L$, hidden layer sizes $h_1, \dots, h_L$, weight-decay penalty, optimization learning rate and mini-batch size are architectural and optimization hyperparameters that need to be tuned empirically. Hence, we further partition the debates in the estimation subsample $\mathcal{D}_{est}$ uniformly at random into a training subsample $\mathcal{D}_{train}$ containing 75% of the debates (76,459 debates) and a validation subsample $\mathcal{D}_{val}$ containing 25% of the debates (25,487 debates). During the hyperparameter tuning process, we train the neural network on $\mathcal{D}_{train}$ and evaluate its loss at each training iteration on both $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$.
We fix the size of the hidden layers to the dimensionality of the response text vector ($h_l$ = 4,926) and tune the number of hidden layers $L$ for each neural network. Deep, fixed-width ReLU networks of this type have been shown to generalize well both empirically and theoretically safran2017depth,hanin2019universal. For each neural network, we evaluate the training loss (for at most 5,000 mini-batch iterations with early-stopping) with an increasing number of hidden layers, until the training loss no longer improves. Each neural network with the number of hidden layers thus found has enough representational capacity to capture patterns in the training data, but is likely to have overfit the training data and to suffer from poor out-of-sample predictive power.
| Prediction target | Number of hidden layers | Hidden layer activation | Output layer activation | Loss function |
|---|---|---|---|---|
| Debate success ($Y_{ij}$) | 5 | ReLU | Sigmoid | Binary cross-entropy |
| Reputation ($R_{ij}$) | 3 | ReLU | Rectifier | Mean squared error |
| Skill, percentage ($S_{ij}$) | 3 | ReLU | Sigmoid | Mean squared error |
| Position, standardized ($P_{ij}$) | 3 | ReLU | Identity | Mean squared error |
| Instrument ($Z_{ij}$) | 5 | ReLU | Rectifier | Mean squared error |
| Prediction target | Learning rate | Batch size | Weight-decay | Train loss | Validation loss | Inference loss |
|---|---|---|---|---|---|---|
Hence, after having selected the number of hidden layers for each neural network via the aforementioned procedure, we evaluate the validation loss of each neural network (for at most 5,000 mini-batch iterations with early-stopping) with an increasingly large weight-decay penalty (over a logarithmically-spaced range), until the validation loss no longer improves. The final neural network thus found has sufficient representational capacity and is sufficiently regularized to generalize well out-of-sample. While tuning the number of hidden layers and the weight-decay penalty, we also empirically evaluate and select the values of the learning rate and mini-batch size that deliver the minimum validation loss with fast and stable convergence.
Table 7 summarizes the selected architectural hyperparameters. Table 8 summarizes the selected optimization hyperparameters and the losses on each data subsample, which reflect the extent to which each target is correlated with potential confounders present in the response text. After fixing the selected hyperparameters, we re-estimate the neural networks on the full estimation subsample $\mathcal{D}_{est}$, estimate the prediction residuals on the inference subsample $\mathcal{D}_{inf}$, and run a two-stage least-squares regression with these residuals, as described in the double machine-learning procedure in Section 4.3.
| Dependent Variable: Debate Success ($Y_{ij}$) | | | | | |
|---|---|---|---|---|---|
| Reputation (10 units) | | | | | |
| Position (std. deviations) | | | | | |
| Response text ($T_{ij}$) | ✗ | ✗ | ✗ | ✗ | ✓ |
| Opinion fixed-effects ($\alpha_i$) | | | | | |