Rating Reliability and Bias in News Articles: Does AI Assistance Help Everyone?

04/02/2019 ∙ by Benjamin D. Horne, et al. ∙ Rensselaer Polytechnic Institute ∙ Virginia Polytechnic Institute and State University ∙ The Regents of the University of California

With the spread of false and misleading information in current news, many algorithmic tools have been introduced with the aim of assessing bias and reliability in written content. However, there has been little work exploring how effective these tools are at changing human perceptions of content. To this end, we conduct a study with 654 participants to understand if algorithmic assistance improves the accuracy of reliability and bias perceptions, and whether there is a difference in the effectiveness of the AI assistance for different types of news consumers. We find that AI assistance with feature-based explanations improves the accuracy of news perceptions. However, some consumers are helped more than others. Specifically, we find that participants who read and share news often on social media are worse at recognizing bias and reliability issues in news articles than those who do not, while frequent news readers and those familiar with politics perform much better. We discuss these differences and their implications to offer insights for future research.


1 Introduction

Today, false and misleading news are widespread and persistent [Lewandowsky et al.2012]. While both have long histories, they have become a recent focal point of researchers and practitioners due to their surge in political news, and ultimately their negative impact on voting and public opinion worldwide [Lazer et al.2018, Lewandowsky et al.2012]. This persistence of false information is aided by the structure of social media platforms, which boost passive news consumption and information overload [Shu et al.2017]. Information in the form of news articles is particularly prone to spreading misinformation (and disinformation) in this environment, as facts can be decontextualized in the headline to gain clicks, partial information can be reported in order to favor one side of an argument, or information can be completely fabricated to mimic news media content [Chakraborty et al.2016, Lazer et al.2018].

Due to the wide reach of false and misleading news, many automated methods have been introduced to counter their spread. These methods have focused on various aspects of information veracity, including bias [Baly et al.2018, Budak, Goel, and Rao2016, Gentzkow and Shapiro2010, Horne et al.2018], reliability [Baly et al.2018, Horne and Adali2017, Popat et al.2016, Potthast et al.2017, Singhania, Fernandez, and Rao2017], and source-level trustworthiness [Pennycook and Rand2018, Swire et al.2017]. In lab settings, many of these automatic methods have been shown to be highly accurate in detecting or approximating the integrity of information. For example, Popat et al. introduced a tool for assessing the credibility of individual claims using both content-based and source-based features  [Popat et al.2016]. This tool achieved over 71% accuracy (0.80 ROC AUC) on a large test set of true and false claims from Wikipedia. Similarly, Horne et al. built a tool for assessing the credibility of full news articles using content-based features [Horne et al.2018]. This tool achieved 0.89 ROC AUC on a large weakly-labeled test set of news articles from various sources. Baly et al. achieved similarly high performance predicting source-level veracity using a mixture of features derived from content, Twitter, Wikipedia, and the source’s web traffic [Baly et al.2018].

Despite these recent successes in automated news credibility, there has been little work exploring how these methods affect human subjects’ perception of the news. Focusing on human interactions with decision support algorithms, there have been numerous works on users’ general trust in algorithmic forecasts and on explaining algorithmic decisions to users [Dietvorst, Simmons, and Massey2015, Dietvorst, Simmons, and Massey2016, Kleinberg et al.2017, Kizilcec2016, Rader, Cotter, and Cho2018]. However, these studies did not focus on the news context. We argue that it is very different to recommend an employee for promotion to a manager, or a weather forecast to a meteorologist, than it is to tell a user that the news she reads is misleading, false, or biased. This difference arises because individuals often have strongly-held beliefs concerning the news [Lazer et al.2018], and those individuals often use mental shortcuts or heuristics when assessing the news rather than deeper information processing [Petty and Cacioppo1986]. Furthermore, human interactions with decision support systems are often “pull”-based, where users seek a recommendation, while news credibility support may not be. Thus, it is necessary to better understand how algorithmically generated advice will be accepted in the unique context of misleading news content.

In this paper, we present an experimental study to fill this gap. Specifically, in our experiment individuals rate the credibility of news articles with or without algorithmic assistance. Our main goal is to understand if algorithmic assistance improves human decisions and whether there is a difference in performance with varying levels of algorithmic explanation and among different types of participants. Our paper contributes to the literature by providing a first look at human responses to AI assistance in the context of news veracity. Using a real state-of-the-art AI assistant and real political news articles, we demonstrate that AI assistance can improve human credibility perceptions. However, we also illustrate that some level of explanation is needed in order for the AI assistant to be compelling across different types of news. In addition, our results show that individual news consumption patterns impact the effectiveness of these tools. We demonstrate a significant difference in news perceptions between social media news consumers and expert news readers, as well as in their adherence to AI assistance. This set of results provides practitioners with insight into effective news assistant design. Finally, our work adds to the already rich literature on human interaction with news of varying reliability and bias.

Unreliable article criteria
1. Does the article have a misleading title (clickbait or makes claims not supported by the article)?
2. Does the article have no supporting evidence (missing quotes from witnesses, experts, or other reputable sources)?
3. Does the article have logical fallacies (claims are not supported by the evidence presented in the article)?
4. Does the article use overly emotional tone?
5. Is the article factually incorrect (claims made that can be shown as false or misrepresented)?
6. Does the article reference other unreliable sources (other sources that are known to produce false information)?
Biased article criteria
1. Does the article use overly emotional tone?
2. Does the article create a “call to action” (telling consumers what to think or to do)?
3. Does the article have framing bias (only reporting one side of the story)?
4. Does the article use subjective statements or opinion?
5. Is the title of the article one-sided (a headline that favors one side over another)?
Table 1: Criteria for expert article labeling, loosely developed from the criteria laid out in [Zhang et al.2018]. Each criterion is answered yes or no by the expert labeler.

2 Related Work

In order to grasp a more fine-grained view of news veracity, we explore two different constructs: the reliability of an article and the bias of an article. While highly related and often intertwined, reliability and bias are not the same concept [Baly et al.2018, Horne et al.2018], and they can be perceived differently. Reliability concentrates on the factuality of reporting: whether an article contains true information, completely fabricated information, or misleading information. In contrast, bias is the imbalance of information from different sides of an issue or subjective opinions that can decontextualize truth [DellaVigna and Kaplan2007, Druckman and Parkin2005, Fico, Richardson, and Edwards2004]. Both concepts can play a significant role in the spread of misinformation despite the different mechanisms used in each. Further, news articles can have varying levels of both reliability and bias, contributing to the article’s overall veracity [Baly et al.2018, Horne et al.2018]. Hence, we argue that these concepts need to be studied separately to gain a more complete understanding of when AI intervention is accepted or rejected in news decisions.

It is clear there is a need for automated tools, as humans are prone to being misinformed by both unreliable information and biased information. Lewandowsky et al. [Lewandowsky et al.2012] examined cognitive factors that impact the spread of misinformation by humans. The authors identify a large array of factors that influence news decisions, including: (1) information that is compatible with what a person already believes can be seen as more credible, (2) stories that are coherent and compelling may be easier to believe, (3) if the information is from a source that is perceived to be credible, the information itself is perceived as credible, and (4) if others, particularly in a person’s social circle, believe the information, it is seen as more credible. Another factor that can leave humans prone to misinformation is their information consumption patterns [Del Vicario et al.2016, Bessi et al.2015b]. Specifically, if a person is part of a homogeneous and polarized group in which unverified information is abundant, they are more likely to share that unverified information (and hence cause information cascades) [Del Vicario et al.2016]. In other words, if a person is in a tight-knit, false news spreading community, they may be acclimated to the style and structure of unreliable or hyper-partisan information, viewing it as more credible than it really is. Related to this notion is the literature on persuasion and credibility [Petty and Cacioppo1986], which points out that news articles are interpreted at two separate levels: (1) an emotional or heuristic level, to test whether information is relevant and in sync with one’s opinions, and (2) a rational level, to test the veracity of the information. These levels are inseparable. Heuristic methods are often fast, require little cognitive effort, and deeply impact the “fact-checking” aspect of information processing. Readers often reach a conclusion quickly using these heuristics and stop processing the information deeply, leaving them prone to being misinformed by false and misleading news. These mental shortcuts are often supported by the information overload and limited individual attention on social media platforms [Mele et al.2017].

To this end, many automated methods have been introduced to counter the spread of false and misleading news. The majority of these methods use supervised, feature-based automatic detection, with features extracted from text content, the network in which the content spreads, and the user who spreads it. There are some more recent studies that explored deep learning and unsupervised methods [Singhania, Fernandez, and Rao2017, Shu et al.2017], and others that focused on crowd-based methods [Pennycook and Rand2018]; however, by far the most common and successful approach has been supervised content-based methods [Baly et al.2018, Potthast et al.2017, Popat et al.2016, Nakashole and Mitchell2014, Karduni et al.2018, Zhang et al.2018]. Many of these studies have shown high accuracy results on test sets of varying size, time-frame, and topic. While there are still some open questions about how well these methods work over longer periods of time and how well they generalize over changes in the news cycle, these works have shown promising results for AI assistance in news veracity tasks.

Despite much progress in both building these automated tools and understanding humans’ deficiencies in news consumption, there has been little to no work on understanding how the two work together. In general, there has been work on automated tools’ impact on human decisions outside of the news context. For example, Dietvorst et al. [Dietvorst, Simmons, and Massey2015] studied when humans choose a human forecaster over a statistical algorithm. The authors found that aversion to the automated tool increased as humans saw the algorithm perform, even if that algorithm had been shown to perform significantly better than the human. Dietvorst et al. explained that aversion occurs due to a quicker decrease in confidence in algorithmic forecasters than in human forecasters when seeing the same mistake occur [Dietvorst, Simmons, and Massey2015]. Hence, humans are more critical of algorithmic mistakes (mistakes by automated tools) than of human mistakes. In a later study, Dietvorst et al. illustrated a decrease in algorithm aversion if the humans could slightly modify the forecasts [Dietvorst, Simmons, and Massey2016]. This modification made the human participants feel more satisfied with the forecast and more likely to believe the algorithm was superior in predicting. Similar studies focused on the impact of generic algorithm explanations on humans’ trust in the automated tool. Rader et al. performed an experiment testing various ways of explaining the Facebook News Feed algorithm [Rader, Cotter, and Cho2018]. They found that all explanation types tested helped participants become more aware of how the system works, but these explanations were less effective for evaluating the correctness of the system’s output. Thus the explanations were not necessarily useful in promoting trust in the algorithm. Kizilcec ran a similar experiment using a Massive Open Online Course (MOOC), and found that too much explanation of the automated tool’s decisions can erode trust [Kizilcec2016].

Broadly speaking, these works begin to explore the impact of algorithmic assistance on human decisions, but they fail to address the context-specific nature of these tools and choices. As discussed, human perceptions about the veracity of news can be rigid, strongly-held, and influenced by social pressure. These decisions are very different from decisions about the correctness of weather predictions, hiring a recommended job applicant, or trusting an algorithmic grading of a homework assignment. Our work begins to fill this context-specific gap. Specifically, we ask the following research questions:

  1. (Q1) Does algorithmic assistance improve users’ perceptions of news reliability?

  2. (Q2) Does algorithmic assistance improve users’ perceptions of news bias?

  3. (Q3) Are there individual differences that impact the effectiveness of the algorithmic assistance in each case? (reliability perceptions and bias perceptions)

3 Experimental Design

The study’s objective is to understand how news consumers interact with algorithmic assistance. To this end, our experimental design includes three conditions. Under each condition, participants were asked to read news articles and rate these articles on their bias and reliability. The three experimental conditions were:

Condition 1: Article text only (text condition)
Condition 2: Article text with AI assistance (AI base condition)
Condition 3: Article text with AI assistance and explanations (AI explanation condition)

In this section, we first describe the construction of our article set. In line with our interest in understanding both bias and reliability, we constructed an article set that focuses on both constructs. Next, we explain in depth the AI assistant used in the study. The section concludes with a description of the study’s respondents.

Article Set Construction

We used a two-step approach to create our data set. First, we selected news sources that fall into three categories: (1) mainstream (typically assumed to be reliable and unbiased), (2) unreliable, and (3) biased. Specifically, unreliable sources are sources that have reported completely fabricated information in the past, and biased sources are sources that tend to report from a hyper-partisan point of view. We used two previously built lexicons to select these sources: the opensources lexicon (www.opensources.co/) and a hyper-partisan source lexicon from [Pennycook and Rand2018]. These definitions and source selections closely align with previous literature [Baly et al.2018, Horne et al.2018, Potthast et al.2017]. The final list of sources used can be found in the Appendix (Table A1). Next, we extracted articles at random from the above-mentioned sources, using a large 2017 news article data set [Horne, Khedr, and Adalı2018], and roughly balancing between the three categories of sources. We then followed with an expert rating approach, in which five experts (in this case, four authors of the paper and one external communications expert) independently read and rated each article using a set of criteria (similar to those proposed in [Zhang et al.2018]), listed in Table 1. Articles that were deemed unreliable (multiple criteria marked as “yes”) by all raters (100% agreement) were then defined as unreliable (UR, 13 articles) and maintained for this study. The same procedure was employed to define the remaining three types of articles: reliable (R, 11 articles), biased (B, 16 articles), and unbiased (UB, 9 articles).
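To make the unanimity rule concrete, the following is a minimal sketch of the filtering step, assuming a hypothetical `ratings` table of expert answers to the Table 1 criteria; the two-“yes” threshold for an individual “unreliable” vote is an illustrative assumption, not the raters’ exact rule.

```python
# Minimal sketch of the unanimous-agreement filter over expert labels.
import pandas as pd

ratings = pd.DataFrame({
    "article_id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "expert":       ["A", "B", "C", "D", "E"] * 2,
    # number of Table 1 criteria the expert answered "yes" for this article
    "criteria_yes": [3, 4, 2, 3, 5, 0, 1, 0, 0, 0],
})

# Illustrative assumption: an expert's vote is "unreliable" when multiple
# criteria are marked "yes" (a threshold of two is used here for the sketch).
ratings["unreliable_vote"] = ratings["criteria_yes"] >= 2

votes = ratings.groupby("article_id")["unreliable_vote"]
unanimous_unreliable = votes.all()    # every rater voted "unreliable"
unanimous_reliable = ~votes.any()     # every rater voted "reliable"

print(unanimous_unreliable[unanimous_unreliable].index.tolist())  # e.g. [1]
print(unanimous_reliable[unanimous_reliable].index.tolist())      # e.g. [2]
```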

AI Assistant

To create an AI assistant, we use two state-of-the-art Random Forest classifiers developed in previously published work (citation removed for blind review). Given a news article, these classifiers output the probability that the article is unreliable or biased, respectively. These classifiers are trained using a rich set of content-based features on a large set of political news articles. Example features include word usage, emotional tone, subjectivity, and sentence complexity. While there are numerous other recent methods for automatic news credibility classification, we chose these content-based classifiers for several reasons:

  1. The decisions made by content-based algorithms, specifically ones that use rich feature sets, can be explained.

  2. Content-based methods in general have been shown to be useful in news credibility tasks [Baly et al.2018, Horne and Adali2017, Potthast et al.2017, Popat et al.2016], and the vast majority of the literature on news credibility detection uses content-based methods in some form.

  3. These specific classifiers have been shown to be accurate in prior work, as well as on our newly created data set. Specifically, according to [Horne et al.2018], these classifiers perform with ROC AUC scores above 0.90 for both reliability classification and bias classification. Furthermore, when testing the classifiers on our set of news articles, we found that the predictions matched our ground truth well. For articles labeled as reliable, the classifier reported 86.15% reliability on average with a standard deviation of 10.62, while for articles labeled as unreliable the classifier reported 17.18% reliability on average with a standard deviation of 8.55. Similarly, for articles labeled as biased, the bias classifier reported 86.60% bias on average with a standard deviation of 8.90, while for articles labeled as unbiased the classifier reported 26% bias on average with a standard deviation of 8.10. No incorrect or uncertain probabilities (near 50%) were reported. This high performance on our data set should be expected, as the set was created with strong ground truth.

Regardless of the method used in building the AI assistant, the task in this paper only requires that our assistant be accurate on our data set and that its decisions can be clearly explained. Broader questions of generality or algorithm choice are left to another study.
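For intuition only, the sketch below shows a content-based Random Forest classifier of this general kind using scikit-learn; the feature extraction, training data, and names are illustrative assumptions and not the actual classifiers or feature set from the published tool.

```python
# Minimal sketch of a content-based Random Forest reliability classifier.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_content_features(article_text):
    """Hypothetical stand-in for a rich content feature set
    (word usage, emotional tone, subjectivity, sentence complexity, ...)."""
    words = article_text.split()
    n = max(len(words), 1)
    return np.array([
        len(words),                           # article length
        sum(w.isupper() for w in words) / n,  # all-caps word ratio
        article_text.count("!") / n,          # exclamation density
    ])

# Placeholder training data: rows of content features, 1 = unreliable, 0 = reliable.
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
y_train = rng.integers(0, 2, size=200)

reliability_clf = RandomForestClassifier(n_estimators=100, random_state=0)
reliability_clf.fit(X_train, y_train)

# The kind of probability shown to participants ("X% chance of being unreliable").
article = "BREAKING!!! You will NOT believe what happened next..."
p_unreliable = reliability_clf.predict_proba([extract_content_features(article)])[0, 1]
print(f"{p_unreliable:.0%} chance of being unreliable")
```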

Condition: Text Only | AI Base | AI Explanation
N: 217 | 211 | 226
Median age group: 25-34 | 25-34 | 25-34
Median education group: 4-year college | 4-year college | 4-year college
Gender (M / F / Other): 100 / 115 / 2 | 110 / 100 / 1 | 100 / 124 / 2
Political leaning (VL / L / M / C / VC): 28 / 66 / 77 / 40 / 6 | 30 / 57 / 85 / 28 / 11 | 25 / 73 / 83 / 34 / 11
Table 2: Demographics for each treatment in our study, where both age and education were answered on 7-point scales.
Article Type | Text Only | AI Base | AI Explanation
Reliable | N: 217, mean: 6.64, SD: 2.15 | N: 211, mean: 7.34, SD: 2.00 | N: 226, mean: 7.10, SD: 1.86
Unreliable | N: 217, mean: 5.01, SD: 2.26 | N: 211, mean: 4.79, SD: 2.52 | N: 226, mean: 3.84, SD: 2.27
Unbiased | N: 214, mean: 4.37, SD: 2.59 | N: 202, mean: 4.42, SD: 2.70 | N: 225, mean: 3.89, SD: 2.05
Biased | N: 217, mean: 6.58, SD: 2.28 | N: 211, mean: 6.97, SD: 2.34 | N: 226, mean: 7.15, SD: 1.86
Table 3: Summary statistics for each condition and article ground truth. Note that ratings closer to 10 are better for reliable and biased articles, while ratings closer to 1 are better for unreliable and unbiased articles. One-way ANOVA results for each condition and article type can be found in the text below.

Human Experiment on Amazon Mechanical Turk

Using this data set and AI assistant, we conducted a randomized between-subjects study on three conditions (text condition; AI base condition; and AI explanation condition). In each condition, participants (Turkers) rated articles on their bias and reliability. Participants were asked to rate articles’ reliability on a 10-point scale ranging from ‘completely unreliable’ (1) to ‘completely reliable’ (10). They were similarly asked to rate articles’ bias on a 10-point scale ranging from ‘completely unbiased’ (1) to ‘completely biased’ (10). After each rating, participants were also asked to comment on why they rated the article this way, in order to qualitatively capture any features or thoughts the participant used in their assessment.

In the text only condition, participants were given only the article text, including both the body text and the headline. No additional information, such as the journalist or source that wrote the article, was provided. Each participant evaluated between three and five randomly assigned articles; this range was chosen based on how long we wanted the user experience to be. If a participant was given more than one article of the same ground truth (R, UR, UB, B), we kept only the first rating to avoid repeated measures.

In the AI base condition, we introduced our AI assistant (discussed in Section 3.2), which provides the user with a predicted probability of each article being reliable or biased along with the article text. We displayed this prediction at the bottom of each article, stating “Our smart AI system says this article has a X% chance of being (reliable or biased)” in bold red font. All other parts were formatted exactly as in the text only condition.

In the AI explanation condition, participants were again shown an AI prediction with the article text, but in addition a feature-based explanation of the prediction was shown. Specifically, we showed the top four most important content features for each predicted class and highlighted four to eight examples of them in the article. These features were easily extracted from our AI assistant because the model is built from decision trees. Specifically, we computed the mean decrease in impurity of a feature averaged over all trees in the ensemble model [Breiman et al.1984, Pedregosa et al.2011]. This provides a feature importance ranking for each article. Further, since the features are based on content, they are easily interpretable. All other parts were formatted exactly as in the text only condition. Figure 1 presents screenshots of all three conditions.
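A rough sketch of this step, continuing the hypothetical classifier above with illustrative feature names: scikit-learn exposes the impurity-based importances directly, from which the top-ranked features can be selected.

```python
# Minimal sketch of deriving a feature-based explanation from the Random Forest.
# scikit-learn's feature_importances_ is the mean decrease in impurity averaged
# over all trees in the ensemble.
import numpy as np

feature_names = ["article_length", "all_caps_ratio", "exclamation_density"]

importances = reliability_clf.feature_importances_
ranked = np.argsort(importances)[::-1]   # most important features first
top = ranked[:4]                         # the top four were shown in the study

for idx in top:
    print(f"{feature_names[idx]}: importance {importances[idx]:.3f}")
# In the study, four to eight example passages matching the top features were
# then highlighted in the article text.
```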

While there are other methods of explaining the AI’s decisions, such as example-based explanations, we focus on one type of explanation method to simplify the analysis. Varying the presentation of information is left for future work.

Participants

Participants were workers on Amazon Mechanical Turk (AMT). Based on recommendations from the literature on using AMT for research, a filter was applied to ensure that participants had successfully completed at least 50 tasks in the past [Dennis, Goodson, and Pearson2018]. Responses were provided using a Qualtrics survey.

In addition to the experimental task described above, participants were also asked for demographic information, as well as about their news consumption habits. The specific questions on individual differences were:

  1. How familiar are you with US politics? (5-point scale ranging from “not at all” to “extremely”)

  2. How often do you read news? (4-point scale ranging from “never” to “multiple times a day”)

  3. What is the primary way you get news? (“social media”, “news websites”, “TV”, “newspaper”)

  4. When you use social media, how often do you share news? (4-point scale ranging from “never” to “always”)

  5. To what extent do you trust news coming from mainstream media? (“don’t trust”; “do trust”)

  6. To what extent do you trust news coming from your social contacts? (“don’t trust”; “do trust”)

  7. What is your political leaning? (5-point scale ranging from “very liberal” to “very conservative”)

We also asked three policy-based political questions to validate the self-reported political leaning question. No unexpected differences between policy and self-selected leaning were found.

This information was used in our analysis of the results to identify contingencies in acceptance of the AI assistance, as we explain in the next section. Finally, to ensure reliability in responses (for example, to avoid bots on AMT or users clicking through the survey with little effort), we included one check question in each survey and one simple reading comprehension question for each article. Data were cleaned to ensure all responses were of sufficient quality. Similar attention check filtering processes are used in Amazon Mechanical Turk studies throughout the literature [Kittur, Chi, and Suh2008, Knijnenburg and Willemsen2015] and have been recommended in meta-analysis [Dennis, Goodson, and Pearson2018].

Table 2 describes the final groups of respondents as well as their demographic information.

4 Results

We conducted four one-way ANOVA tests corresponding to the four ground truths of interest in this study (reliable, unreliable, biased, and unbiased articles). For each test we used respondents’ ratings as the dependent variable and the experimental condition as the independent variable. Table 3 provides descriptive statistics for each condition, and the following text summarizes the results of the ANOVA tests.

Article Type | Political Familiarity | Reading Frequency
Reliable | No significant interaction | No significant interaction
Unreliable | No significant interaction | No significant interaction
Unbiased | No significant interaction | No significant interaction
Biased | No significant interaction | No significant interaction
Article Type | Where News is Read | Sharing Frequency
Reliable | No significant interaction | No significant interaction
Unreliable | No significant interaction | No significant interaction
Unbiased | No significant interaction | No significant interaction
Biased | No significant interaction | No significant interaction
Article Type | Trust in Mainstream Media | Trust in Social Contacts
Reliable | Significant interaction (*) | No significant interaction
Unreliable | Significant interaction (*) | No significant interaction
Unbiased | No significant interaction | No significant interaction
Biased | No significant interaction | No significant interaction
Table 4: Two-way ANOVA results for each individual difference measure. For measures that had a significant interaction, refer to Figure 2. Note that a two-way ANOVA examines the influence of two independent variables (AI condition and individual difference) on a dependent variable (ratings). If there is an interaction, the effect of one factor depends on the other factor.
Article Type | Political Leaning
Reliable | No significant interaction
Unreliable | No significant interaction
Unbiased | No significant interaction
Biased | No significant interaction
Table 5: Two-way ANOVA results for the political leaning measure. Refer to Table 4 for the other individual difference measures.
Text Only | Liberal | Moderate | Conservative
Left Articles | 6.82 | 6.07 | 6.84
Right Articles | 6.49 | 6.19 | 6.30
AI Base | Liberal | Moderate | Conservative
Left Articles | 6.90 | 6.69 | 7.95
Right Articles | 6.97 | 6.75 | 7.37
AI Explanation | Liberal | Moderate | Conservative
Left Articles | 7.39 | 7.31 | 6.91
Right Articles | 6.72 | 6.96 | 6.54
Table 6: Average ratings of biased articles, broken down by article leaning and participant leaning. Since there were very few participants at the extreme ends, we combined “very liberal” with “liberal” and “very conservative” with “conservative”.
Figure 2: Interaction plots for Trust in Mainstream Media. (a) Reliable ground truth; (b) unreliable ground truth.

For the reliable (R) articles, the ANOVA showed a significant difference in group means (F=7.8129, sig. 0.0004), with the post hoc test indicating the different means were between the text only and AI base conditions as well as the text only and AI explanation conditions. No significant difference was found between the two AI conditions.

For the unreliable (UR) articles, the ANOVA showed a significant difference in group means (F=18.1541, sig. <0.001), with the post hoc test indicating the different means were between the text only and AI explanation conditions as well as between the two AI conditions. No significant difference was found between the text only and AI base conditions.

For the unbiased (UB) articles, the ANOVA and post hoc tests did not show a significant difference in group means (F=2.9945, sig. 0.051).

Finally, for the biased (B) articles, the ANOVA showed a significant difference in group means (F=4.1947, sig. 0.015), with the post hoc test indicating the different means were between the text only and AI explanation conditions. No significant difference was found between the text only and AI base conditions, nor between the two AI conditions.
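To make the analysis above concrete, the following is a minimal sketch of a one-way ANOVA over the three conditions with a Tukey HSD post hoc test, using scipy and statsmodels on hypothetical placeholder data; it is not the authors’ analysis code.

```python
# Minimal sketch: one-way ANOVA for one article type, plus Tukey HSD post hoc.
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical long-format data: one rating per (participant, article) row.
df = pd.DataFrame({
    "condition": ["text", "ai_base", "ai_explanation"] * 30,
    "rating": [(i % 10) + 1 for i in range(90)],   # placeholder 10-point ratings
})

groups = [g["rating"].values for _, g in df.groupby("condition")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")

# Pairwise comparisons between the three conditions.
tukey = pairwise_tukeyhsd(endog=df["rating"], groups=df["condition"], alpha=0.05)
print(tukey.summary())
```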

Analysis of individual differences

To delve deeper into the contingencies that may impact individuals’ perceptions of news articles, as well as their acceptance of AI assistance, we followed with a set of two-way ANOVA tests. Each test used the same dependent variable and experimental conditions as before (the individual ratings under the three conditions) but added a second factor capturing differences in news consumption. The questions concerning these individual differences were described in Section 3. We conducted a total of 24 two-way ANOVA tests (the four article categories (R, UR, UB, B) crossed with the six news consumption questions). Note that a two-way ANOVA examines the influence of two independent variables (AI condition and individual difference) on a dependent variable (ratings). If there is an interaction, the effect of one independent variable depends on the other independent variable. A significant interaction means we cannot clearly interpret the significance of the independent variables on the dependent variable. In other words, if there is no significant interaction but a significant difference in the independent variables, we can say something about the impact of those factors on the dependent variable. We present the results in Table 4 (results for the political leaning question are presented separately in Table 5).
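As a rough illustration of these tests, below is a minimal sketch of a single two-way ANOVA with an interaction term using statsmodels; the data frame, column names, and values are hypothetical placeholders, not the study’s data.

```python
# Minimal sketch of one of the 24 two-way ANOVA tests (condition x individual difference).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "rating":      [6, 7, 5, 8, 4, 7, 6, 9, 5, 6, 7, 8] * 5,
    "condition":   (["text", "ai_base", "ai_explanation"] * 4) * 5,
    "familiarity": (["low"] * 6 + ["high"] * 6) * 5,
})

# rating ~ condition + familiarity + condition:familiarity
model = ols("rating ~ C(condition) * C(familiarity)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
# A significant C(condition):C(familiarity) row would indicate that the effect
# of the AI condition depends on the individual-difference factor.
```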

5 Discussion

AI Assistance improves human perceptions about reliability and bias in news articles, but explanations are needed.

In general, we find that AI assistance does improve participants’ perceptions of reliability and bias in news articles. However, this improvement depends on the level of explanation used by the tool. Specifically, we find that both the AI Base condition and the AI Explanation condition significantly improve participants’ ratings on reliable articles, but only the AI Explanation condition improves participants’ ratings on unreliable articles and biased articles. Furthermore, in our two-way ANOVA analysis we find that the AI assistant significantly helps every individual difference group, with the exception of those who share news often on social media. We explore this exception further below.

Political familiarity and reading frequency help participants judge when information is reliable.

Our two-way ANOVA results show that both how familiar a participant is with politics and how often a participant reads news significantly improve the judgment of reliable articles (see Table 4). Specifically, those who said they are very familiar with politics had an average rating of 6.6 in the text only condition, while those who said they are not at all familiar with politics had an average rating of 6.3. This difference increases with AI assistance: in the AI Base condition, those very familiar with politics rated reliable articles 7.7 on average and those not familiar rated them 6.6 on average. These large differences in ratings also hold for the AI Explanation condition. Interestingly, these differences do not hold when participants rated unreliable articles.

Similarly, in the Text condition, those who read news multiple times a day had an average rating of 6.7, while those who never read the news had an average rating of 6.4. Again, this difference increases in the AI conditions: those who read news multiple times a day rated reliable articles 7.6 on average, while those who never read news rated reliable articles 5.6 on average, which is even worse than they did without the AI assistant.

Those who trust news from social contacts and share news often on social media perceive unreliable articles as more reliable.

Our two-way ANOVA results also show that trust in news from one’s social contacts and how often a participant shares news on social media significantly impact ratings of unreliable articles (see Table 4). In the Text condition, those who trust news from social contacts rated unreliable articles 5.25 on average, while those who do not trust such news rated them 4.50. While both groups improve significantly in the AI Explanation condition, the difference in performance remains, with those who trust rating unreliable articles 4.0 on average and those who do not trust rating them 3.50 on average. Recall that a lower rating for unreliable articles is more accurate with respect to our ground truth; therefore, those who do not trust news shared by their social contacts do better than those who trust it, both without and with AI. Even more strikingly, those who reported sharing news on social media often did much worse at rating unreliable articles (average rating without AI 6.25) than those who reported never sharing news on social media (average rating without AI 4.5). Again, AI with explanation significantly helps both groups, but differences in performance remain, with those who often share rating unreliable articles 4.25 on average and those who never share rating them 3.75 on average.

This set of results suggests that heavy social media users perceive unreliable news articles as more reliable than they actually are. Previous work shows that fake news has a very different style and structure than traditional news [Horne and Adali2017] and that repeated exposure to false news is correlated with believing and sharing false news in the future [Lewandowsky et al.2012, Bessi et al.2015a, Del Vicario et al.2016]. Furthermore, we know that during recent U.S. elections, fake news stories were more widely shared than mainstream news stories on Facebook [Allcott and Gentzkow2017], illustrating widespread exposure to the style and structure of unreliable news articles. While we do not know the specific online communities or platforms in which our study participants read and share, this finding could be due to widespread repeated exposure to the style and structure of unreliable news articles on social media. Further study is needed to explain these results conclusively.

Political familiarity and trust in mainstream media help perceptions of bias in news articles.

Similar to our findings with reliable news articles, we find that participants who are more familiar with politics and who trust mainstream media are better at recognizing bias in news articles. Without AI assistance, those who said they are very familiar with politics rated biased articles 6.75 on average, while those who said they are not at all familiar with politics rated biased articles 5.75 on average (keep in mind that higher ratings are better for the biased ground truth). In the AI Explanation condition, both groups improve (7.25 average for those very familiar and 6.50 average for those not familiar). Additionally, those who said they trust the mainstream media were better at recognizing bias without AI assistance, rating biased articles 6.8 on average, than those who do not trust mainstream media, who rated biased articles 6.25 on average. These differences effectively disappear in both AI conditions.

Feature-based AI assistance is helpful in pointing out bias, but not necessarily the lack of bias in a news article.

In our one-way ANOVA results, we see significant improvement in participants’ recognition of bias in news articles, but not in their recognition of the lack of bias. At a high level, this result makes sense, as our feature-based classifiers can highlight biased statements, such as subjective language or emotional tone, but may not be able to clearly highlight examples of being unbiased, particularly without specific knowledge of the issue being discussed. Despite there being no significant improvement in our one-way ANOVA results, we do see significant improvement in recognizing lack of bias in our two-way ANOVA results. All individual groups were significantly affected by the AI assistant conditions when rating unbiased articles, except for the social news sharing group. Looking closer at the means of each type of participant in this group, we see that those participants who share news on social media ‘some of the time’ (as opposed to ‘never’ or ‘most of the time’) are the only group that improves with AI assistance. However, in all three conditions, those who never share news on social media are better at recognizing lack of bias in news articles than those who share news on social media often.

Political leaning has little impact on rating reliability and bias.

In Table 5, we show the two-way ANOVA results for the political leaning of participants. Surprisingly, we find that the political leaning of participants does not have a strong impact on article ratings overall. We only find significant differences in ratings for reliable articles (**). When looking at a Tukey post hoc test, we see the only group that differs is the “very liberal” group, which seems to rate reliable articles as more reliable. This difference does not exist for the other types of articles.

In Table 6, we break down the average rating of participants in each political ideology by the political leaning of the biased articles rated. In each condition, the average difference between rating an article of the same ideological leaning and one of the differing ideological leaning is very small, and sometimes non-existent. For example, both liberal and conservative participants rated left-leaning articles as slightly more biased than right-leaning articles. Similarly, in the AI Base condition, conservatives rated both left- and right-leaning articles as more biased than liberals did. Thus, article ideology seems to have little to no impact on ratings. It should be noted that the distribution of political leanings is slightly skewed towards “liberal”, with most participants being either “moderate” or “liberal” (see Table 2). This slight skew is expected based on previous Mturk studies, but may influence the results we find concerning political ideology. Furthermore, this set of articles was not chosen based on strong ideological issues, but on general bias. It may be that if both right- and left-biased articles were selected to be on the same topic from the same time frame, ideology would come into play. However, this was not the goal of the study.

6 Conclusion and Limitations

In conclusion, this study is the first to explore the effectiveness of AI assistance in news credibility perceptions. We presented an experimental study in which humans rated the reliability and bias of real news articles with varying levels of assistance from a state-of-the-art decision support tool. We found that AI assistance with feature-based explanations significantly improved perceptions of reliability and bias. However, these improvements differ between different types of news consumers. Some participants tended to do well on their own, particularly if they reported high expertise through frequent news reading or familiarity with politics. In comparison, participants who used social media heavily performed worse, perceiving unreliable articles as more reliable. While both groups improved significantly with assistance, those who share and read news on social media never performed as well as those who do not. For further insight, we provide a qualitative analysis of the users’ explanations of their ratings in the Appendix.

Our results suggest that AI tools for news credibility can be most effective if they explain how their decisions are made rather than acting as a black box. Further, while our results suggest AI assistance will help everyone to some extent, tools may be even more effective if they are tailored to individual differences. For example, our study only covered one type of automated assistant and explanation type (namely feature-based explanation), but it may be more effective to leverage a frequent social media user’s friends to change their belief about the veracity of an article, similar to previously proposed crowd-sourced methods [Pennycook and Rand2018]. On the other hand, those who more actively read news may be helped most by feature-based explanations, as our study used. Similarly, our study shows that reliability and bias are judged differently, and that those judgments are influenced differently by the AI assistance. Hence, it may be useful to tailor an AI assistant to explain bias decisions differently than reliability decisions.

More research is certainly required to properly assess the dynamics of trust placed in the AI assistant, to assess levels of adherence to its predictions in cases of agreement and disagreement with prior belief, and to assess the knock-on effects of repeated interactions with the advice giver over time. Additionally, other news consumption environments should be tested. Specifically, participants in our study were forced to read and at least minimally comprehend each news article. This consumption is different from the often passive consumption of news on social media, where users may only read the title or skim a news article. Thus, while our study provides a first step in understanding the effectiveness of news assistance, it only examines active consumption interactions, not passive consumption interactions. Although news consumption can certainly also happen actively on social media, further study is needed to understand news assistance when consumption is passive. Another important limitation of this study is the explanation method used. Our study focused on one very simple explanation method, but there are many more ways to present or explain the results of an algorithm. It may be the case that participants’ adherence to algorithmic advice improves or worsens with other explanation methods. In the future, we want to explore and compare other explanation methods such as example-based explanations. Lastly, in our participant instructions, we did not explicitly define reliability or bias, but rather left the interpretation up to the participants. It is possible that not having a briefing on these concepts created some noise in our response data. In future work, we will address these concerns.

7 Acknowledgements

We would like to thank Tamar Gordon for being our external expert in the article labeling task. We would also like to thank the reviewers for their many helpful suggestions for this paper.

References

  • [Allcott and Gentzkow2017] Allcott, H., and Gentzkow, M. 2017. Social media and fake news in the 2016 election. J. of Economic Perspectives 31(2):211–36.
  • [Baly et al.2018] Baly, R.; Karadzhov, G.; Alexandrov, D.; Glass, J.; and Nakov, P. 2018. Predicting factuality of reporting and bias of news media sources. In Proceedings of EMNLP.
  • [Bessi et al.2015a] Bessi, A.; Coletto, M.; Davidescu, G. A.; Scala, A.; Caldarelli, G.; and Quattrociocchi, W. 2015a. Science vs Conspiracy: Collective Narratives in the Age of Misinformation. PLoS ONE 10(2):e0118093–17.
  • [Bessi et al.2015b] Bessi, A.; Zollo, F.; Del Vicario, M.; Scala, A.; Caldarelli, G.; and Quattrociocchi, W. 2015b. Trend of Narratives in the Age of Misinformation. PLoS ONE 10(8):e0134641–16.
  • [Breiman et al.1984] Breiman, L.; Friedman, J.; Stone, C. J.; and Olshen, R. A. 1984. Classification and regression trees. CRC press.
  • [Budak, Goel, and Rao2016] Budak, C.; Goel, S.; and Rao, J. M. 2016. Fair and balanced? quantifying media bias through crowdsourced content analysis. Public Opinion Quarterly 80(S1):250–271.
  • [Chakraborty et al.2016] Chakraborty, A.; Paranjape, B.; Kakarla, S.; and Ganguly, N. 2016. Stop clickbait: Detecting and preventing clickbaits in online news media. In ASONAM, 9–16. IEEE.
  • [Del Vicario et al.2016] Del Vicario, M.; Bessi, A.; Zollo, F.; Petroni, F.; Scala, A.; Caldarelli, G.; Stanley, H. E.; and Quattrociocchi, W. 2016. The spreading of misinformation online. Proceedings of the National Academy of Sciences 113(3):554–559.
  • [DellaVigna and Kaplan2007] DellaVigna, S., and Kaplan, E. 2007. The fox news effect: Media bias and voting. The Quarterly Journal of Economics 122(3):1187–1234.
  • [Dennis, Goodson, and Pearson2018] Dennis, S. A.; Goodson, B. M.; and Pearson, C. 2018. Mturk workers’ use of low-cost “virtual private servers” to circumvent screening methods: A research note.
  • [Dietvorst, Simmons, and Massey2015] Dietvorst, B. J.; Simmons, J. P.; and Massey, C. 2015. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General 144(1):114.
  • [Dietvorst, Simmons, and Massey2016] Dietvorst, B. J.; Simmons, J. P.; and Massey, C. 2016. Overcoming algorithm aversion: People will use imperfect algorithms if they can (even slightly) modify them. Management Science.
  • [Druckman and Parkin2005] Druckman, J. N., and Parkin, M. 2005. The impact of media bias: How editorial slant affects voters. The Journal of Politics 67(4):1030–1049.
  • [Fico, Richardson, and Edwards2004] Fico, F.; Richardson, J. D.; and Edwards, S. M. 2004. Influence of story structure on perceived story bias and news organization credibility. Mass Communication & Society 7(3):301–318.
  • [Gentzkow and Shapiro2010] Gentzkow, M., and Shapiro, J. M. 2010. What drives media slant? evidence from us daily newspapers. Econometrica 78(1):35–71.
  • [Horne and Adali2017] Horne, B. D., and Adali, S. 2017. This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. In NECO Workshop 2017.
  • [Horne et al.2018] Horne, B. D.; Dron, W.; Khedr, S.; and Adalı, S. 2018. Assessing the news landscape: A multi-module toolkit for evaluating the credibility of news. In WWW Companion.
  • [Horne, Khedr, and Adalı2018] Horne, B. D.; Khedr, S.; and Adalı, S. 2018. Sampling the news producers: A large news and feature data set for the study of the complex media landscape. In ICWSM.
  • [Karduni et al.2018] Karduni, A.; Wesslen, R.; Santhanam, S.; Cho, I.; Volkova, S.; Arendt, D.; Shaikh, S.; and Dou, W. 2018. Can you verifi this? studying uncertainty and decision-making about misinformation using visual analytics.
  • [Kittur, Chi, and Suh2008] Kittur, A.; Chi, E. H.; and Suh, B. 2008. Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, 453–456. ACM.
  • [Kizilcec2016] Kizilcec, R. F. 2016. How much information?: Effects of transparency on trust in an algorithmic interface. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2390–2395. ACM.
  • [Kleinberg et al.2017] Kleinberg, J.; Lakkaraju, H.; Leskovec, J.; Ludwig, J.; and Mullainathan, S. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics 133(1):237–293.
  • [Knijnenburg and Willemsen2015] Knijnenburg, B. P., and Willemsen, M. C. 2015. Evaluating recommender systems with user experiments. In Recommender Systems Handbook. Springer. 309–352.
  • [Lazer et al.2018] Lazer, D. M.; Baum, M. A.; Benkler, Y.; Berinsky, A. J.; Greenhill, K. M.; Menczer, F.; Metzger, M. J.; Nyhan, B.; Pennycook, G.; Rothschild, D.; et al. 2018. The science of fake news. Science 359(6380):1094–1096.
  • [Lewandowsky et al.2012] Lewandowsky, S.; Ecker, U. K.; Seifert, C. M.; Schwarz, N.; and Cook, J. 2012. Misinformation and its correction: Continued influence and successful debiasing. Psychological Science in the Public Interest 13(3).
  • [Mele et al.2017] Mele, N.; Lazer, D.; Baum, M.; Grinberg, N.; Friedland, L.; Joseph, K.; Hobbs, W.; and Mattsson, C. 2017. Combating fake news: An agenda for research and action.
  • [Nakashole and Mitchell2014] Nakashole, N., and Mitchell, T. M. 2014. Language-aware truth assessment of fact candidates. In ACL (1), 1009–1019.
  • [Pedregosa et al.2011] Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12(Oct):2825–2830.
  • [Pennycook and Rand2018] Pennycook, G., and Rand, D. G. 2018. Crowdsourcing judgments of news source quality.
  • [Petty and Cacioppo1986] Petty, R. E., and Cacioppo, J. T. 1986. The elaboration likelihood model of persuasion. In Communication and Persuasion. New York: Springer. 1–24.
  • [Popat et al.2016] Popat, K.; Mukherjee, S.; Strötgen, J.; and Weikum, G. 2016. Credibility assessment of textual claims on the web. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2173–2178.
  • [Potthast et al.2017] Potthast, M.; Kiesel, J.; Reinartz, K.; Bevendorff, J.; and Stein, B. 2017. A stylometric inquiry into hyperpartisan and fake news. arXiv preprint arXiv:1702.05638.
  • [Rader, Cotter, and Cho2018] Rader, E.; Cotter, K.; and Cho, J. 2018. Explanations as mechanisms for supporting algorithmic transparency. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 103. ACM.
  • [Shu et al.2017] Shu, K.; Sliva, A.; Wang, S.; Tang, J.; and Liu, H. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter 19(1):22–36.
  • [Singhania, Fernandez, and Rao2017] Singhania, S.; Fernandez, N.; and Rao, S. 2017. 3HAN: A deep neural network for fake news detection. In International Conference on Neural Information Processing, 572–581. Springer.
  • [Swire et al.2017] Swire, B.; Berinsky, A. J.; Lewandowsky, S.; and Ecker, U. K. 2017. Processing political misinformation: comprehending the trump phenomenon. Royal Society Open Science 4(3):160802.
  • [Zhang et al.2018] Zhang, A. X.; Ranganathan, A.; Metz, S. E.; Appling, S.; Sehat, C. M.; Gilmore, N.; Adams, N. B.; Vincent; et al. 2018. A structured response to misinformation: Defining and annotating credibility indicators in news articles. In WWW 2018 Companion.

8 Appendix

Mainstream sources: Associated Press, PBS, NPR, CBS, USA Today, BBC, The Guardian
Unreliable sources: Infowars, Liberty News, Natural News, Alt Media Syndicate, DC Clothesline, Newslo, Freedom Daily, Daily Buzz Live, Intellihub
Hyper-partisan sources: Breitbart, Young Cons, RedState, TheBlaze, Politicus USA, Bipartisan Report, Occupy Democrats, Daily Kos, Shareblue
Table A1: Sources used in stage one of article set construction. Note that only US stories were selected from BBC.
Writing Style
It uses a lot of opinion statements, and not a lot of evidence
Written in a convoluted style
The informal, accusatory, aggressive tone of the writing.
This is a made-up news article, because it doesn’t follow Associated Press style for capitalization.
The headline is totally unprofessional.
The story has a lot of grammatical errors
A lot of negative emotions, Some language seems sensational
It doesn’t have emotional flash points or inflammatory language.
Seems to be coherent and in order
AI Advice
I based it on the AI system since I know nothing about this.
The AI system rating lead me to think that this is unreliable article.
It provides updates to previously reported news, stating facts and the smart AI system gave a 95% chance
I’m going with the AI on this one
Strong AI rating.
Journalistic Features
Didn’t really have cites to back this up
Includes a non sequitur
Because it uses unnamed sources to make its statement
It addressed both sides of the question without seeming to take sides.
Trust
Can never be 100% sure if news is real or fake these days
The FBI can’t be trusted.
Other Heuristics
Clearly biased article written by an angry feminist
It seems logical that these events happened.
Table A2: Example comments from users on their article ratings.