An Exploration of Unreliable News Classification in Brazil and The U.S

06/07/2018 ∙ by Mauricio Gruppi, et al. ∙ Rensselaer Polytechnic Institute

The propagation of unreliable information is on the rise in many places around the world. This expansion is facilitated by the rapid spread of information and the anonymity granted by the Internet. The spread of unreliable information is a well-studied issue and is associated with negative social impacts. In previous work, we identified significant differences in the structure of news articles from reliable and unreliable sources in the US media. Our goal in this work was to explore such differences in the Brazilian media. We found significant features in two data sets: one with Brazilian news in Portuguese and another with US news in English. Our results show that features related to writing style were prominent in both data sets and that, despite the language difference, some features behave universally, being significant for both US and Brazilian news articles. Finally, we combined both data sets and used the universal features to build a machine learning classifier that predicts the source type of a news article as reliable or unreliable.




1 Introduction

There is increasing interest in developing automated tools to identify misinformation. In our past work [Horne and Adalı 2017] [Horne et al. 2018], we have shown that it is possible to distinguish between information coming from reliable sources and unreliable sources, i.e., sources that have published completely fabricated information. We have also shown that the writing styles of satire and unreliable sources have many similarities [Horne and Adalı 2017]; satire is another important class of articles to study, given the use of humor and irony in many extremist communities [Marwick and Lewis 2017]. This work, as well as many others of a similar nature [Nakashole and Mitchell 2014]; [Potthast et al. 2017]; [Popat et al. 2016]; [Guacho et al. 2018], concentrates on content-based prediction and analysis, illustrating the usefulness of content approaches to automatic news classification.

Despite this growing usefulness of content-based methods, there is little work exploring how well these methods generalize across language, time, and culture. A recent study by the European Union has pointed out the need to study misinformation identification in multiple languages and countries to gain a deeper understanding of commonalities as well as differences. In this paper, we provide a first attempt at classifying news sources with a unique study of news sources from the U.S. and Brazil. We ask the following questions:

Q1: Can we distinguish between news from reliable, unreliable and satire news sources based on writing style alone, both in U.S. and Brazilian news sources?

Our objective here is twofold. We would like to revisit our findings and check whether they remain valid. In the past, we were able to show reasonably high prediction accuracy (77% ROC AUC) with fairly simple features and a simple model. However, this was from a fairly small data set collected right after the U.S. elections. At the time, attention on news was high, but the study of misinformation was just starting to gain momentum. Due to the recent attention on the topic of misinformation in news, sources may have started using different tactics in presenting information. While we have shown that these types of content features can generalize in prediction tasks [Horne et al. 2018], we have not explored the changes in feature significance and direction in newer time slices.

The Brazilian scenario is fairly susceptible to misinformation as the collected data pertains to months prior to the country’s presidential elections. Though the two domains are somewhat different, the underlying motivation is the same: disseminate misinformation. We study a similar set of features in both languages and check the accuracy of prediction of these features across two different domains.

Q2: Is there a universality to some of the features? Are there significant features common to both countries?

Our second objective is to understand to what degree the prominent features are universal. Given two different countries with different political landscapes, cultural landscapes, and languages, one can expect significant differences in how misinformation is presented. However, one can also expect that the main motivation of the sources is similar: to engage users and present information that appears credible. Given the universal nature of some heuristics used to infer the credibility of information, we might expect consistent similarities. On top of this, some similarities may be expected due to the lack of editorial oversight in some of the unreliable media organizations. Hence, to understand these similarities, we find significant features in both countries and check how much they overlap. Then, we also look at whether the differences point in the same direction for these features.

First, in response to Q2 above, we find a high number of features in multiple categories that are significant in both countries. Furthermore, there is strong agreement in features that measure text complexity across all three categories: unreliable sources use simpler language, shorter texts but longer sentences than reliable sources. These comparisons also hold for relationships between satire sources. We also find consistent similarities for complexity and stylistic features between reliable and unreliable sources, and weaker overall similarity for features involving part-of-speech. In essence, these three categories of features have a certain degree of universality. We also test this assumption using a prediction experiment. We show that using our fairly simple set of features, we can distinguish reliable and unreliable sources with 85% accuracy in the Brazilian dataset and 72% in the U.S. dataset. Then we combined U.S. and Brazilian data sets for a joint classification task. We chose the subset of features that are significant in both data sets, a total of 18 features chosen from this universal group. The accuracy of classification between reliable and unreliable news was 70% using this small set of features, illustrating the universal nature of these features. These early promising results open many new research questions that we expect to investigate in detail in our future work.

2 Related Work

In addition to work that concentrates on classification of reliable and unreliable news based on its content, there have been several works examining news and news consumption across countries and cultures. Most of these works have focused on news coverage or consumption. De Vreese et al. study news coverage of the 2004 European Parliament elections in each of the EU countries, showing a difference in sentiment towards the EU between the new and old members [De Vreese et al. 2006]. An and Kwak examine what gets media attention across 196 countries using data from Unfiltered News [An and Kwak 2017]. They find that there are differences across regions and that media has a short attention span in general. Similarly, An et al. show that news coverage across various countries can be driven not only by geographical closeness but also by historical relationships, and that similarity between news coverage in various countries depends on time and topic [An, Aldarbesti, and Kwak 2017]. Kwak et al. examine both the attention of media and the attention of consumers through a unique study of news coverage and news searches in 193 countries. They show that in many countries media attention and public attention are dissimilar, but local attention patterns are similar. These differences in news coverage across countries may explain some of the differences in content we find in this study. Despite the interest in cross-country news coverage and consumption, there is little to no work exploring news content differences and similarities across languages or countries, especially in the context of unreliable news. In addition, we do not find many studies exploring the generalization of content-based features for prediction of reliable and unreliable content across cultures and contexts.

3 Data and Features

To study the problem of identifying articles from reliable (R), unreliable (U), and satire (S) sources, we construct two sets of political news articles, from US (United States) and BR (Brazilian) sources, in each category. Reliable sources in each country are well-established media companies. Unreliable sources are sources known to have published at least one maliciously incorrect news article (according to Snopes in the US and AosFatos in BR). To this group, we also add sources that self-identify as satire and clearly indicate this on their website. The sources in the US data come from the NELA2017 data set [Horne, Khedr, and Adalı 2018]. We construct the BR sources by looking for unreliable and satire media sources and well-established media companies. We collect all political articles from these sources for a period of one month, between February 15th and March 15th of 2018, and then sample articles from each source.

Our BR news dataset contains 5511 political news articles from 19 sources of which 4698 articles are from reliable sources, 755 are from unreliable sources and 58 from satire sources. The list of sources is shown in Table 1. Our US news dataset contains 2841 political news articles from 16 sources of which 1997 articles are from reliable sources, 794 are from unreliable sources and 50 are from satire sources. The list of US news sources is shown in Table 2. Both BR and US datasets contain all articles collected between February 15th and March 15th of 2018 from the aforementioned sources. Each article is a data point in our dataset. For each article, we compute every feature from our feature list, and assign a class Reliable (R), Unreliable (U) or Satire (S) based on the source from which the article was collected.

We construct roughly equivalent sets of features in both languages, as shown in Table 3. The features are classified into four categories: complexity, stylistic, linguistic (parts of speech), and psychological. Each feature is computed on title and body text separately. Some of these features are obtained using the Python NLTK [Bird 2006] and LIWC [Pennebaker, Francis, and Booth 2001].

Complexity features are used to assess the level of intricacy of the title and body text of news articles. We capture sentence-level complexity through the number of words per sentence. To capture the readability level of the text, we use the Gunning fog index, SMOG grade, Flesch-Kincaid grade level, and Flesch-Kincaid reading ease indexes. Such scores suggest the education level needed for the reader to understand the text. For the grade-level indexes (Gunning fog, SMOG, and Flesch-Kincaid grade level), higher scores indicate that more education is required; for the reading ease index, higher scores indicate text that is easier to read.
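
As an illustration of how such readability scores are computed, the sketch below implements the standard English-language formulas for three of the indexes. It takes pre-computed counts as input, so no syllable counter is included. The function names are hypothetical, and the paper does not state whether the same coefficients were used for the Portuguese texts, so this should be read as an assumption rather than the authors' implementation.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch reading ease (English formula): higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level (English formula): higher means harder."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning fog index; complex_words = words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical counts for one article body (not taken from the datasets).
counts = {"words": 100, "sentences": 5, "syllables": 150, "complex_words": 10}
fre = flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"])
fkgl = flesch_kincaid_grade(counts["words"], counts["sentences"], counts["syllables"])
fog = gunning_fog(counts["words"], counts["sentences"], counts["complex_words"])
```

Note that reading ease moves in the opposite direction from the grade-level scores, which matters when comparing orderings of classes across features.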

Stylistic features are related to the writer's style at the character level. These features include the frequency of commas and other punctuation marks and the number of words in all caps. It is common to see capitalization and exclamation points in sensationalist writing styles. Journalistic style, in contrast, is much more measured in its use of these stylistic features. One can also see stylistic differences due to the lack of clear editorial oversight and standards in alternative media sites.
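
A minimal sketch of how a few of these character-level features might be extracted is shown below. The function name is hypothetical, and normalizing each count by the number of whitespace-separated tokens is an assumption; the paper does not specify the normalization.

```python
import re


def stylistic_features(text):
    """Return a few stylistic frequencies for one title or body text."""
    tokens = text.split()
    n = max(len(tokens), 1)  # avoid division by zero on empty text
    # Runs of letters only, including accented letters used in Portuguese.
    words = re.findall(r"[^\W\d_]+", text)
    all_caps = sum(1 for w in words if len(w) > 1 and w.isupper())
    return {
        "Comma": text.count(",") / n,
        "Exclam": text.count("!") / n,
        "QMark": text.count("?") / n,
        "AllCaps": all_caps / n,  # single letters such as "A" are ignored
    }


features = stylistic_features("BREAKING: they lied, again!")
```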

Linguistic features are related to the frequency of different parts of speech used in the text, such as frequency of nouns, proper nouns, verbs, etc. These features often indicate how the text is framed, such as whether the article is about specific individuals or actions, or it is from the point of view of a specific person.

Psychological features are based on words correlated with psychological processes. These features are provided by the Linguistic Inquiry and Word Count (LIWC) dictionaries (in English and Portuguese). They are non-topic-related features that evoke cognitive processes such as positive and negative emotions, anxiety, certainty, etc.

| Reliable (R) | Unreliable (U) | Satire (S) |
|---|---|---|
| BBC Brasil | Correio do Poder | Joselitto Müller |
| El País Brasil | Diário do Brasil | Sensacionalista |
| Exame | Folha Política | Piauí Herald |
| Extra | Gazeta Social | |
| Folha de S. Paulo | Jornal do País | |
| G1 | Pensa Brasil | |
| Isto É | Saúde Vida e Família | |
| O Tempo | | |
| Reuters Brasil | | |

Table 1: Brazilian news sources

| Reliable (R) | Unreliable (U) | Satire (S) |
|---|---|---|
| CBS News | Activist Post | Glossy News |
| CNBC | Addicting Info | The Borowitz Report |
| NPR | Infowars | The Burrard Street Journal |
| Reuters | Intellihub | The Spoof |
| The NY Times | Natural News | |
| USA Today | Waking Times | |

Table 2: US news sources

| Abbr. | Description | Abbr. | Description | Abbr. | Description |
|---|---|---|---|---|---|
| Category 1: Complexity features | | Category 3: Parts of speech features | | Category 4: Psychological features | |
| GI | Gunning fog grade readability index | Pronoun | Frequency of pronouns | Insight | Frequency of insight related words |
| SMOG | SMOG readability index | PPronoun | Frequency of proper pronouns | Percept | Frequency of perceptual process words |
| FK-RE | Flesch-Kincaid reading ease index | IPron | Frequency of I pronoun | Posemo | Frequency of positive emotion words |
| FK-GL | Flesch-Kincaid grade level | You | Frequency of you pronoun | Tentat | Frequency of tentative words |
| TTR | Type-Token Ratio (lexical diversity) | SheHe | Frequency of pronouns she and he | Negemo | Frequency of negative emotion words |
| WC | Word count | We | Frequency of pronoun we | Certain | Frequency of certainty words |
| WPS | Words per sentence | Negate | Frequency of negation words | Sad | Frequency of words related to sadness |
| AVG_WLEN | Avg. length of words | Compare | Frequency of comparison words | Achieve | Frequency of achievement words |
| SixLtr | Frequency of six letter words | Preps | Frequency of prepositions | Anger | Frequency of anger words |
| Category 2: Stylistic features | | Article | Frequency of articles | AllPunc | Frequency of punctuation characters |
| Comma | Frequency of commas | Verb | Frequency of verbs | Anx | Frequency of anxiety words |
| Exclam | Frequency of exclamation marks | AuxVerb | Frequency of auxiliary verbs | Cause | Frequency of causal words (because, effect) |
| Quote | Frequency of quotations | Quant | Frequency of quantifying words | Discrep | Frequency of discrepancy words |
| Period | Frequency of period characters | Number | Frequency of numerals | Feel | Frequency of feeling words |
| QMark | Frequency of question marks | Adjective | Frequency of adjectives | | |
| Parenth | Frequency of parentheses | Conj | Frequency of conjunctions | | |
| AllCaps | Frequency of words in all capital letters | | | | |

Table 3: Features used in this study grouped by category

(a) Category 1

| Feature | Where | BR | US |
|---|---|---|---|
| SMOG | TXT | U > R > S | U > R > S |
| GF | TXT | U > R > S | U > R > S |
| FK-RE | TXT | S > R > U | S > R > U |
| FK-GL | TXT | U > R > S | U > R > S |
| WC | TXT | R > U > S | R > U > S |
| WPS | TXT | U > R > S | U > R > S |
| FK-RE | TTL | U > S = R | S > R > U |
| WC | TTL | U > S > R | U > R > S |
| WPS | TTL | U = S > R | U > R > S |
| TTR | TTL | S = R > U | R > S = U |

Overall agreement: 0.5. Unreliable vs Reliable agreement: 0.9.

(b) Category 2

| Feature | Where | BR | US |
|---|---|---|---|
| AllCaps | TXT | R = U = S | S > R > U |
| Colon | TXT | U > S = R | U > S = R |
| QMark | TXT | S > U > R | S > U > R |
| Exclam | TXT | U = S > R | S > U > R |
| Dash | TXT | R = U > S | S = R > U |
| Parenth | TXT | R > U > S | U > S > R |
| OtherP | TXT | U > R > S | U > R > S |
| AllCaps | TTL | U > R > S | S = U > R |
| SixLtr | TTL | S = R > U | U = R > S |
| Colon | TTL | U > R > S | U > R = S |
| SemiC | TTL | U = R > S | R > S = U |
| Exclam | TTL | U > R = S | S > U > R |

Overall agreement: -0.03. Unreliable vs Reliable agreement: 0.58.

(c) Category 3

| Feature | Where | BR | US |
|---|---|---|---|
| Funct | TXT | S > U > R | S > R > U |
| Pronoun | TXT | S > U > R | S > R > U |
| PPronoun | TXT | S > U > R | S > R > U |
| SheHe | TXT | S > U > R | S > R > U |
| IPron | TXT | S > U > R | S > R > U |
| Article | TXT | S > U > R | S > R > U |
| AuxVerb | TXT | S > U > R | R > U = S |
| Negate | TXT | U = S > R | U > R > S |
| Quant | TXT | S > U > R | U = S = R |

Overall agreement: 0.13. Unreliable vs Reliable agreement: 0.11.

Table 4: Agreement between datasets BR and US. Agreement is measured with Kendall-tau, which ranges from -1 to 1. TTL is title, TXT is body text.

4 Methodology

To reduce the dimensionality and find the most significant features, we performed hypothesis testing using one-way ANOVA (ensuring that our feature distributions are normal). First, for each dataset (BR and US), we separate our data into three classes: reliable, unreliable, and satire news articles. Then, for each feature, we apply a hypothesis test to the distributions of that feature over each pair of classes (reliable vs. unreliable; reliable vs. satire; unreliable vs. satire). If a test results in a low p-value, the difference between the distributions of the tested feature over the two classes is statistically significant. To measure the magnitude of that difference, we use Cohen's d effect size. The effect size quantifies the difference between the two distributions: a large effect size implies that the values of the feature are considerably different across the classes. If a feature has a small p-value and a large effect size, we say the feature is significant.
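
The effect-size step can be sketched as follows. This is a minimal illustration of Cohen's d with a pooled standard deviation, using only the Python standard library; the function names are hypothetical, and the magnitude cut-offs (0.2, 0.5, 0.8) are Cohen's conventional descriptors, assumed here rather than confirmed by the text. The ANOVA p-value itself is not computed in this sketch.

```python
import math
import statistics


def cohens_d(x, y):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(x) - statistics.mean(y)) / pooled


def magnitude(d):
    """Conventional descriptor for |d| (assumed cut-offs: 0.2, 0.5, 0.8)."""
    a = abs(d)
    if a < 0.2:
        return "negligible"
    if a < 0.5:
        return "small"
    if a < 0.8:
        return "medium"
    return "large"


# Hypothetical values of one feature in two classes (not real data).
unreliable = [2, 4, 6]
reliable = [1, 3, 5]
d = cohens_d(unreliable, reliable)
label = magnitude(d)
```

The sign of d indicates which class has the larger mean, which is what allows classes to be ordered per feature.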

We select the most significant features in the dataset to train a Support Vector Machine (SVM) classifier with a linear kernel. Features are selected according to Cohen's d magnitude descriptor: features whose effect size magnitude is at least medium are used in the SVM. The unbalanced number of samples in each class is handled by upsampling the data in the least populated class, which gives us a baseline accuracy of 50%. We use three binary classifiers to separate Reliable vs. Unreliable articles.
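
The upsampling step can be illustrated with the sketch below, which resamples the smaller class with replacement until both classes have the same size, so that always guessing one class scores 50%. The function name and the fixed random seed are assumptions made for illustration; the paper does not describe its exact resampling procedure.

```python
import random


def upsample_minority(samples, labels, seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra items only for classes below the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical toy data: 8 reliable and 2 unreliable articles.
X = list(range(10))
y = ["R"] * 8 + ["U"] * 2
X_bal, y_bal = upsample_minority(X, y)
baseline = max(y_bal.count(c) for c in set(y_bal)) / len(y_bal)
```

The balanced data could then be passed to any linear SVM implementation, for example scikit-learn's `SVC(kernel="linear")`. Applying the upsampling only to the training split avoids duplicated articles appearing in both training and test data.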

To assess the universality of features, we compared the ordering relations between the distributions of features in the three classes between US and BR using Kendall-tau: for each pair of classes from R, U, and S, we count agreements (+1) and disagreements (-1) in the orderings, add the numbers, and divide by the total number of comparisons. Kendall-tau ranges between +1 (complete agreement) and -1 (complete disagreement). To obtain rankings between classes R, U, and S for each feature, we first check the effect size. If its magnitude is below the threshold, the classes are expected to have similar values for that feature (i.e., equality). Otherwise, we order classes based on their expected value.
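
The agreement count described above can be sketched as follows, taking orderings written in the notation of Table 4 (for example `U>S=R`). The function names are hypothetical. How ties are scored is not specified in the text; this sketch assumes that a pair counts as an agreement only when both datasets show the same relation (greater, smaller, or equal) and as a disagreement otherwise, so it may not reproduce the reported values exactly.

```python
from itertools import combinations


def parse_ordering(ordering):
    """Map 'U>S=R' to ranks {'U': 0, 'S': 1, 'R': 1}; lower rank = larger value."""
    ranks = {}
    for level, group in enumerate(ordering.replace(" ", "").split(">")):
        for cls in group.split("="):
            ranks[cls] = level
    return ranks


def sign(x):
    return (x > 0) - (x < 0)


def agreement(orderings_a, orderings_b, classes=("R", "U", "S")):
    """Mean of +1 (same relation) and -1 (different relation) over class pairs."""
    total, count = 0, 0
    for a, b in zip(orderings_a, orderings_b):
        ra, rb = parse_ordering(a), parse_ordering(b)
        for c1, c2 in combinations(classes, 2):
            same = sign(ra[c1] - ra[c2]) == sign(rb[c1] - rb[c2])
            total += 1 if same else -1  # assumption: tie mismatch = disagreement
            count += 1
    return total / count


# Two features with hypothetical orderings in each dataset.
br = ["U>R>S", "U>R>S"]
us = ["U>R>S", "S>R>U"]
overall = agreement(br, us)
unreliable_vs_reliable = agreement(br, us, classes=("R", "U"))
```

Passing only two classes restricts the comparison to a single pair, which corresponds to the "Unreliable vs Reliable" agreement reported alongside the overall value.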

5 Results

We show the ordering of significant features for Categories 1, 2 and 3 in Table 4 (a), (b) and (c). First, we observe significant similarities in Category 1 for all of U, R and S. Unreliable articles use simpler language and are shorter overall, but have longer sentences than reliable articles. Unreliable article titles are longer than reliable article titles, which suggests that unreliable sources try to convey as much information as possible in the title in an attempt to draw the reader's attention. Satire, in contrast, uses more complex language but is shorter in general than all groups. It is clear that unreliable articles use simpler language to be easily understood and try to convey their message through longer sentences. In the stylistic features of Category 2, the agreement is low overall, but reasonably high with respect to the comparison between reliable and unreliable sources. Unreliable sources use more question marks, exclamation points, and all caps in both body and title, revealing the usage of more informal language. Ultimately, we believe the main purpose of these is to get the attention of readers. Category 3 shows good overall agreement, but lower agreement regarding reliable and unreliable sources. The main source of agreement in this case is with the satire articles compared to other articles. In essence, the part-of-speech features are more useful for categorizing satire than unreliable articles. Such agreements tell us which features are relevant for both datasets simultaneously, thus displaying universality across the languages. These are the features that can be used by our machine learning model to classify sources in a dataset that contains both BR and US news articles. Furthermore, these results indicate that, in both datasets, stylistic and complexity features play an important role in separating articles according to source type.

The classification of the BR and US datasets used the 60 and 49 most relevant features, respectively. The test accuracies for the BR and US datasets were 85% and 72%, respectively, with a baseline score of 50%. The combined dataset used a reduced set of features consisting of the intersection of the most relevant features observed in both BR and US, achieving a test score of 70% using only 18 features (see Table 5). The majority of significant features in the combined dataset were from categories 1 and 2 (complexity and stylistic). This result reinforces our previous findings, which show that writing style is substantially different between reliable and unreliable news sources. Furthermore, the results also suggest the existence of universality of complexity and stylistic features when separating reliable from unreliable articles in both Portuguese and English.
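
The feature intersection described above can be sketched as a simple set operation over per-dataset effect sizes. The function name, the input format (a mapping from feature name to Cohen's d for reliable vs. unreliable), and the 0.5 threshold are all assumptions for illustration, and the values shown are made up.

```python
def universal_features(effects_br, effects_us, threshold=0.5):
    """Features whose effect size magnitude meets the threshold in both datasets."""
    shared = effects_br.keys() & effects_us.keys()
    return sorted(
        f for f in shared
        if abs(effects_br[f]) >= threshold and abs(effects_us[f]) >= threshold
    )


# Hypothetical effect sizes, not values from the study.
effects_br = {"WPS": 0.9, "WC": -0.6, "Comma": 0.1}
effects_us = {"WPS": 0.7, "WC": 0.2, "Exclam": 0.8}
selected = universal_features(effects_br, effects_us)
```

A stricter variant could additionally require the effect to have the same sign in both datasets, so that only features pointing in the same direction are kept.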

| BR | US | BR + US |
|---|---|---|
| 85% | 72% | 70% |

Table 5: Classification test scores for classifying R vs U in the BR, US, and combined BR + US datasets. The baseline score is 50%.

6 Conclusion and Future Work

In this study, we strengthen the claim that there are noticeable differences between news articles from reliable and unreliable sources. Writing style and complexity are highly significant for distinguishing between articles of the two classes. We have shown that these features may be used to classify news articles in a language other than English. In addition, we have found evidence of universality of such features across Portuguese and English by using a single set of features to classify a combined dataset containing articles from the BR and US datasets and achieving fair classification accuracy. In future work, we intend to expand the exploration to other languages that may share commonalities in the separation of reliable and unreliable sources, and to carry out this experiment on different time frames, such as mid- and post-election periods, to evaluate the effects of temporal dynamics on the study. We hope these results contribute to the development of guidelines for identifying sources of unreliable information.