Characterizing Political Fake News in Twitter by its Meta-Data

12/16/2017
by Julio Amador, et al.
Imperial College London

This article presents a preliminary approach towards characterizing political fake news on Twitter through the analysis of their meta-data. In particular, we focus on more than 1.5M tweets collected on the day of the election of Donald Trump as 45th president of the United States of America. We use the meta-data embedded within those tweets in order to look for differences between tweets containing fake news and tweets not containing them. Specifically, we perform our analysis only on tweets that went viral, by studying proxies for users' exposure to the tweets, by characterizing accounts spreading fake news, and by looking at their polarization. We found significant differences in the distributions of followers, the number of URLs in tweets, and the verification of the users.

Introduction

While fake news, understood as deliberately misleading pieces of information, have existed for a long time (e.g. it is not unusual to receive news falsely claiming the death of a celebrity), the term reached the mainstream, particularly so in politics, during the 2016 presidential election in the United States [1]. Since then, governments and corporations alike (e.g. Google [2] and Facebook [3]) have begun efforts to tackle fake news, as they can affect political decisions [4]. Yet, the ability to define, identify and stop fake news from spreading is limited.

Since the Obama campaign in 2008, social media has been pervasive in the political arena in the United States. Studies report that up to 62% of American adults receive their news from social media [5]. The wide use of platforms such as Twitter and Facebook has facilitated the diffusion of fake news by simplifying the process of receiving content with no significant third-party filtering, fact-checking or editorial judgement. Such characteristics make these platforms suitable means for sharing news that, disguised as legitimate ones, try to confuse readers.

The prominent rise of fake news has been confirmed by Craig Silverman, a Canadian journalist and a leading figure on fake news [6]: "In the final three months of the US presidential campaign, the top-performing fake election news stories on Facebook generated more engagement than the top stories from major news outlets".

Our current research hence starts from the assumption that social media is a conduit for fake news and asks the question of whether fake news (as spam was some years ago) can be identified, modelled and eventually blocked. In order to do so, we use a sample of more than 1.5M tweets collected on November 8th 2016 (election day in the United States) with the goal of identifying features that tweets containing fake news are likely to have. As such, our paper aims to provide a preliminary characterization of fake news in Twitter by looking into meta-data embedded in tweets. Considering meta-data as a relevant factor of analysis is in line with findings reported by Morris et al. [7]. We argue that understanding differences between tweets containing fake news and regular tweets will allow researchers to design mechanisms to block fake news in Twitter.

Specifically, our goals are: 1) compare the characteristics of tweets labelled as containing fake news to tweets labelled as not containing them, 2) characterize, through their meta-data, viral tweets containing fake news and the accounts from which they originated, and 3) determine the extent to which tweets containing fake news expressed polarized political views.

For our study, we used the number of retweets to single out those tweets that went viral within our sample. Tweets within that subset (viral tweets hereafter) are varied and relate to different topics. We consider that a tweet contains fake news if its text falls within any of the following categories described by Rubin et al. [8] (see next section for the details of such categories): serious fabrication, large-scale hoaxes, jokes taken at face value, slanted reporting of real facts and stories where the truth is contentious. The dataset [9], manually labelled by an expert, has been publicly released and is available to researchers and interested parties.

From our results, the following main observations can be made:

  • The distributions of the number of retweets, favourites and hashtags in tweets containing fake news are not significantly different from their counterparts in tweets not containing fake news.

  • Accounts generating fake news are less likely to be verified than accounts not producing fake news.

  • There are significant differences in both the number of friends and followers of the accounts creating tweets with fake news when compared with accounts not generating them.

  • There are no significant differences in the number of media elements, but there are indications that the number of URLs is indeed different.

Our findings resonate with similar work done on fake news, such as that of Allcott and Gentzkow [10]. Therefore, even if our study is a preliminary attempt at characterizing fake news on Twitter using only their meta-data, our results provide external validity to previous research. Moreover, our work not only stresses the importance of using meta-data, but also underscores which parameters may be useful to identify fake news on Twitter.

The rest of the paper is organized as follows. The next section briefly discusses where this work is located within the literature on fake news and contextualizes the type of fake news we are studying. Then, we present our hypotheses, the data, and the methodology we follow. Finally, we present our findings, conclusions of this study, and future lines of work.

Defining Fake news

Our research is connected to different strands of academic knowledge related to the phenomenon of fake news. In relation to Computer Science, a recent survey by Conroy and colleagues [11] identifies two popular approaches to single out fake news. On the one hand, the authors point to linguistic approaches, which use the text, its linguistic characteristics and machine learning techniques to automatically flag fake news. On the other, they underscore network approaches, which make use of network characteristics and meta-data to identify fake news.

With respect to the social sciences, efforts in psychology, political science and sociology have been dedicated to understanding why people consume and/or believe misinformation [12, 13, 14, 15]. Most of these studies consistently report that psychological biases such as priming effects and confirmation bias play an important role in people's ability to discern misinformation.

In relation to the production and distribution of fake news, a recent paper in the field of Economics [10] found that most fake news sites use names that resemble those of legitimate organizations, and that sites supplying fake news tend to be short-lived. These authors also noticed that fake news items are more likely to be shared than legitimate articles coming from trusted sources, and that they tend to exhibit a larger level of polarization.

How to define fake news is a serious and unresolved conceptual issue. As our work does not attempt to shed light on this, we rely on work by other authors to describe what we consider as fake news. In particular, we use the categorization provided by Rubin et al. [8]. The five categories they described, together with illustrative examples from our dataset, are as follows:

  1. Serious fabrication. These are news stories created entirely to deceive readers. During the 2016 US presidential election there were plenty of examples of this (e.g. claiming a celebrity has endorsed Donald Trump when that was not the case). For instance: [@JebBush - Maybe Donald negotiated a deal with his buddy @HillaryClinton. Continuing this path will put her in the White House. https://t.co/AlvByiSrMn]

  2. Large-scale hoaxes. Deceptions that are then reported in good faith by reputable sources. A recent example would be the story that the founder of Corona beer made everyone in his home village a millionaire in his will. For instance: [@FullFrontalSamB - Unfortunately Melania copied HER ballot from Michelle so… Donald just voted for Hillary. #ElectionDay https://t.co/x2ZimtFxyl]

  3. Jokes taken at face value. Humour sites such as the Onion or Daily Mash present fake news stories in order to satirise the media. Issues can arise when readers see the story out of context and share it with others. For instance: [@BBCTaster - BREAKING NEWS: If you face-swap @realDonaldTrump with @MayorofLondon you get Owen Wilson. https://t.co/YY8a20wQVP]

  4. Slanted reporting of real facts. Selectively-chosen but truthful elements of a story put together to serve an agenda. One of the most prevalent examples of this was the coverage of well-known problems with voting machines. For instance: [@NeilTurner_ - @realDonaldTrump Trump predicted it. #BrusselsAttack https://t.co/BM3UxA7heR]

  5. Stories where the ‘truth’ is contentious. On issues where ideologies or opinions clash —for example, territorial conflicts— there is sometimes no established baseline for truth. Reporters may be unconsciously partisan, or perceived as such. For instance: [@FoxNews - Report: @HillaryClinton’s plan would raise taxes $1.3T/10 years. https://t.co/Dh1tWM4FAP]

Research Hypotheses

Previous work in the area (presented in the section above) suggests that there may be important determinants for the adoption and diffusion of fake news. Our hypotheses build on it and identify three important dimensions that may help distinguish fake news from legitimate information:

  1. Exposure. Given that psychological effects such as priming and confirmation bias are likely to increase the probability that an individual believes a certain piece of information, we believe exposure to misinformation is an important determinant of a fake news distribution strategy.

  2. Characterization. Given that distributors of fake news may want to simulate legitimate information outlets, we believe it is important to analyse specific features that may help a fake news outlet 'disguise' itself as a legitimate one.

  3. Polarization. Given that fake news outlets are more likely to attract attention with polarizing content (See [15]), we believe the level of polarization is an important determinant of a fake news distribution strategy.

Taking those three dimensions into account, we propose the following hypotheses about the features that we believe can help distinguish tweets containing fake news from those not containing them. They will later be tested on our collected dataset.

Exposure.

H1A: The average number of retweets of a viral tweet containing fake news is larger than that of viral tweets not containing them.

H1B: The average number of hashtags and user mentions in viral tweets with fake news is larger than that of viral tweets with no fake news in them.

Characterization.

H2A: Viral tweets containing fake news have a larger number of URLs.

H2B: The creation date of an account generating tweets with fake news is more recent than that of accounts tweeting non-fake news content.

H2C: The friends/followers ratio of accounts tweeting fake news is larger than that of accounts creating tweets without them.

Polarization.

H3: Viral tweets containing fake news are slanted towards one candidate.

Data and Methodology

For this study, we collected publicly available tweets using Twitter's public API. Given the nature of the data, it is important to emphasize that such tweets are subject to Twitter's terms and conditions, which indicate that users consent to the collection, transfer, manipulation, storage, and disclosure of data. Therefore, we do not expect ethical, legal, or social implications from the usage of the tweets. Our data was collected using search terms related to the presidential election held in the United States on November 8th 2016. Specifically, we queried the filter endpoint of Twitter's streaming API using the following hashtags and user handles: #MyVote2016, #ElectionDay, #electionnight, @realDonaldTrump and @HillaryClinton. The data collection ran for just one day (November 8th 2016).

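As an illustration only (the paper does not include its collection code), the sketch below shows how such a filtered stream could be captured with the Tweepy library, assuming v1.1 streaming access as it existed at the time and placeholder credentials; each matching status, with its full meta-data, is appended to a JSON-lines file for later analysis.

```python
import json
import tweepy

# Placeholder credentials; the paper does not describe its collection setup.
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_SECRET = "YOUR_ACCESS_SECRET"

TRACK_TERMS = ["#MyVote2016", "#ElectionDay", "#electionnight",
               "@realDonaldTrump", "@HillaryClinton"]

class ElectionStream(tweepy.Stream):
    """Append every matching status (with its full meta-data) to a JSON-lines file."""
    def on_status(self, status):
        with open("election_tweets.jsonl", "a") as fh:
            fh.write(json.dumps(status._json) + "\n")

stream = ElectionStream(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
stream.filter(track=TRACK_TERMS)  # blocks until interrupted
```
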
One straightforward way of sharing information on Twitter is the retweet functionality, which enables a user to share an exact copy of a tweet with their followers. Among the reasons for retweeting, boyd et al. [16] report the wish to: 1) spread tweets to a new audience, 2) show one's role as a listener, and 3) agree with someone or validate the thoughts of others. As indicated, our initial interest is to characterize tweets containing fake news that went viral (as they are the most harmful ones, since they reach a wider audience), and understand how they differ from other viral tweets (that do not contain fake news). For our study, we consider that a tweet went viral if it was retweeted more than 1000 times.

Once we had the dataset of viral tweets, we eliminated duplicates (some tweets were collected several times because they matched several of the query terms) and an expert manually inspected the text field within the tweets to label them as containing fake news, or not containing them (according to the characterization presented before). This annotated dataset [9] is publicly available and can be freely reused.

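A minimal sketch of this filtering step, assuming the statuses collected above sit in election_tweets.jsonl with the standard v1.1 fields; the 1000-retweet threshold and the deduplication follow the procedure described in the text, while the fake_news column stands in for the expert's manual labels.

```python
import json
import pandas as pd

# Load the raw collection (one status object per line, as written by the stream sketch).
with open("election_tweets.jsonl") as fh:
    df = pd.DataFrame([json.loads(line) for line in fh])

# The same status can be collected several times when it matches several query terms.
df = df.drop_duplicates(subset="id")

# Flatten a few user-level attributes used later in the analysis.
for field in ["followers_count", "friends_count", "favourites_count", "verified"]:
    df[field] = df["user"].apply(lambda user: user.get(field))

# Keep only tweets that went viral under the paper's definition (> 1000 retweets).
viral = df[df["retweet_count"] > 1000].copy()

# Stand-in for the manual annotation: an expert labels each viral tweet by hand.
viral["fake_news"] = None
```
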
Finally, we use the following fields within tweets (from the ones returned by Twitter’s API) to compare their distributions and look for differences between viral tweets containing fake news and viral tweets not containing fake news:

  • Exposure: created_at, retweet_count, favourites_count and hashtags.

  • Characterization: screen_name, verified, urls, followers_count, friends_count and media.

  • Polarization: text and hashtags.

In the following section, we provide graphical descriptions of the distribution of each of the identified attributes for the two sets of tweets (those labelled as containing fake news and those labelled as not containing them). Where appropriate, we normalized and/or took logarithms of the data for better representation. To gain a better understanding of the significance of those differences, we use the Kolmogorov-Smirnov test with the null hypothesis that both distributions are equal.

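For example, the two-sided Kolmogorov-Smirnov comparison for any numeric field can be run with SciPy as sketched below, assuming the annotated viral tweets live in the pandas DataFrame from the previous sketch with a boolean fake_news column (counts of URLs, hashtags, mentions and media would first have to be derived from the tweets' entities field):

```python
from scipy.stats import ks_2samp

def compare_feature(viral, feature):
    """Two-sided KS test between viral tweets with and without fake news."""
    fake = viral.loc[viral["fake_news"] == True, feature].dropna()
    other = viral.loc[viral["fake_news"] == False, feature].dropna()
    return ks_2samp(fake, other)

# Only fields directly available as columns are used here; entity-based counts
# (URLs, hashtags, mentions, media) would need to be computed beforehand.
for feature in ["followers_count", "friends_count", "favourites_count", "retweet_count"]:
    result = compare_feature(viral, feature)
    print(f"{feature}: difference={result.statistic:.4f}, p-value={result.pvalue:.4g}")
```
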
Results

The sample collected consisted of 1 785 855 tweets published by 848 196 different users. Within our sample, we identified 1327 tweets that went viral (retweeted more than 1000 times by the 8th of November 2016), produced by 643 users. This small subset of viral tweets was retweeted on 290 841 occasions in the observed time window.

The 1327 'viral' tweets were manually annotated as containing fake news or not. The annotation was carried out by a single person in order to obtain a consistent annotation throughout the dataset. Out of those 1327 tweets, we identified 136 as potentially containing fake news (according to the categories previously described), and the rest were classified as 'not containing fake news'. Note that the categorization is far from perfect given the ambiguity of fake news itself and the human judgement involved in the process of categorization. Because of this, we do not claim that this dataset can be considered a ground truth.

The following results detail characteristics of these tweets along the previously mentioned dimensions. Table 1 reports the actual differences (together with their associated p-values) of the distributions of viral tweets containing fake news and viral tweets not containing them for every variable considered.

Kolmogorov-Smirnov test

Feature      Difference   p-value
Followers    0.2357       2.6E-6
Friends      0.1747       0.0012
URLs         0.1285       0.0358
Favourites   0.1218       0.0535
Mentions     0.1135       0.0862
Media        0.0948       0.2231
Retweets     0.0609       0.7560
Hashtags     0.0350       0.9983

Table 1: For each of the selected features, the table shows the difference between the set of tweets containing fake news and the set not containing them, together with the associated p-value of a two-sided Kolmogorov-Smirnov test (null hypothesis: both distributions are equal). Results are ordered by increasing p-value.

Exposure

Figure 1 shows that, in contrast to other kinds of viral tweets, those containing fake news were created more recently. As such, Twitter users were exposed to fake news related to the election for a shorter period of time.

However, in terms of retweets, Figure 2 shows no apparent difference between tweets containing fake news and tweets not containing them. That is confirmed by the Kolmogorov-Smirnov test, which does not reject the null hypothesis that the associated distributions are equal.

Figure 1: Distribution of the date of creation of the tweets that were viral on November 8th. For clarity, the image only shows the year 2016, and no more than 150 tweets per day.
Figure 2: Density distributions of achieved retweets for tweets in our dataset 1) containing fake news and 2) not containing them. No differences are apparent.
Figure 3: Density distributions of the number of favourites that the user generating the tweet has. The differences are not statistically significant.
Figure 4: Distribution of the number of hashtags used in tweets labelled as containing fake news and those labelled as not containing them.

In relation to the number of favourites, users that generated at least one viral tweet containing fake news appear to have, on average, fewer favourites than users that did not generate them. Figure 3 shows the distribution of favourites. Despite the apparent visual differences, the differences are not statistically significant.

Finally, the number of hashtags used in viral fake news appears to be larger than that in other viral tweets. Figure 4 shows the density distribution of the number of hashtags used. However, once again, we were not able to find any statistically significant difference between the number of hashtags in viral tweets with fake news and in other viral tweets.

Characterization

We found that 82 users within our sample were spreading fake news (i.e. they produced at least one tweet which was labelled as fake news). Out of those, 34 had verified accounts, and the rest were unverified. Of the 48 unverified accounts, 6 had been suspended by Twitter at the time of writing, 3 tried to imitate legitimate accounts of others, and 4 had already been deleted. Figure 5 shows the proportion of verified to unverified accounts for viral tweets (containing fake news vs. not containing fake news). From the chart, it is clear that there is a higher chance of fake news coming from unverified accounts.

Figure 5: Tweets labelled as containing fake news mostly come from non-verified users. This contrasts with the opposite pattern for tweets not containing them (which mostly originate from verified accounts).

Turning to friends, accounts distributing fake news appear to have, on average, the same number of friends as those distributing tweets with no fake news. However, the density distribution of friends of the accounts (Figure 6) shows that there is indeed a statistically significant difference in their distributions.

Figure 6: Density distributions (for tweets labelled as containing fake news, and tweets labelled as not containing them) of the number of friends that the user generating the tweet has. Difference is statistically significant.
Figure 7: Density distributions of the number of followers that the accounts generating viral tweets (within our sample) have. Accounts producing fake news have a narrower window of followers.
Figure 8: Density distribution of the friends/followers ratio, showing quartiles. Accounts that generate fake news tend to have a higher ratio value.

Figure 9: Density distribution of the friends/followers ratio. Note that it does not follow a normal distribution. A higher friends/followers ratio exists for accounts that have produced at least one tweet labelled as containing fake news.

If we take into consideration the number of followers, accounts generating viral tweets with fake news do have a very different distribution on this dimension, compared to those accounts generating viral tweets with no fake news (see Figure 7). In fact, such differences are statistically significant.

A useful representation for friends and followers is the friends/followers ratio. Figures 8 and 9 show this representation. Notice that accounts spreading viral tweets with fake news have, on average, a larger friends/followers ratio, whereas the ratio of accounts not generating fake news is more evenly distributed.

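A short sketch of how this ratio could be computed and summarized per group, reusing the hypothetical columns from the earlier sketches (the +1 in the denominator is a pragmatic guard against division by zero, not something specified in the paper):

```python
# Friends/followers ratio for the author of each viral tweet.
viral["ff_ratio"] = viral["friends_count"] / (viral["followers_count"] + 1)

# Quartiles of the ratio for viral tweets with and without fake news.
print(viral.groupby("fake_news")["ff_ratio"].describe(percentiles=[0.25, 0.5, 0.75]))
```
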
With respect to the number of mentions, Figure 10 shows that viral tweets labelled as containing fake news appear to mention other users less frequently than viral tweets not containing fake news: tweets containing fake news mostly contain one mention, whereas other tweets tend to have two. However, according to Table 1, this difference is not statistically significant at the 0.05 level (p = 0.0862).

Figure 10: Number of mentions within tweets labelled as containing fake news and tweets not containing them. There is almost a similar distribution of 1 and 2 mentions for tweets containing fake news. This contrasts with tweets not containing fake news, in which 2 mentions is much more common.

The analysis (Figure 11) of the presence of media in the tweets in our dataset shows that tweets labelled as not containing fake news appear to present more media elements than those labelled as fake news. However, the difference is not statistically significant.

Figure 11: Number of media elements embedded within viral tweets (labelled as containing fake news vs. labelled as not containing them)

On the other hand, Figure 12 shows that viral tweets containing fake news appear to include more URLs to other sites than viral tweets that do not contain fake news. In fact, the difference between the two distributions is statistically significant (p = 0.0358, assuming a 0.05 significance level).

Figure 12: Number of URLs embedded within viral tweets (with fake news vs. without them). Differences are statistically significant (p = 0.0358).

Polarization

Finally, manual inspection of the text field of those viral tweets labelled as containing fake news shows that 117 of such tweets expressed support for Donald Trump, while only 8 supported Hillary Clinton. The remaining tweets contained fake news related to other topics, not expressing support for any of the candidates.

Discussion

As a summary, and constrained by our existing dataset, we made the following observations regarding differences between viral tweets labelled as containing fake news and viral tweets labelled as not containing them:

  • Less than 0.1% of the tweets went viral (1327 out of 1 785 855). Out of those, only about 10% (136) were labelled as containing fake news.

  • Tweets containing fake news that became viral during the day of the election were mostly created very shortly before that day or on election day itself. That contrasts with tweets not containing fake news (which were initially created well before election day).

  • Considering retweets, favourites and hashtags as proxies for exposure, we did not find any difference between viral tweets labelled as containing fake news and viral tweets labelled as not containing them.

  • The characterization of accounts spreading fake news has shown that the proportion of unverified accounts among those generating at least one tweet containing fake news is larger than among accounts spreading tweets not labelled as fake news.

  • Even if the accounts producing fake news follow, on average, the same number of other users as those producing tweets with no fake news in them, the distributions of followers are statistically different.

  • There is no significant difference between the number of media elements in viral tweets labelled as containing fake news and viral tweets labelled as not containing them.

  • Viral tweets labelled as containing fake news tend to have more URLs than viral tweets labelled as not containing fake news.

  • Regarding polarization, fake news were heavily supportive of the Trump campaign.

Hypothesis
H1A: The average number of retweets of a viral tweet containing fake news is larger than that of viral tweets not containing them. NOT CONFIRMED
H1B: The average number of hashtags and user mentions in viral tweets with fake news is larger than that of viral tweets with no fake news in them. NOT CONFIRMED
H2A: Viral tweets containing fake news have a larger number of URLs. CONFIRMED
H2B: The creation date of an account generating tweets with fake news is more recent than that of accounts tweeting non-fake news content. CONFIRMED
H2C: The friends/followers ratio of accounts tweeting fake news is larger than that of accounts creating tweets without them. CONFIRMED
H3: Viral tweets containing fake news are slanted towards one candidate. CONFIRMED
Table 2: Summary of the tested hypotheses and our conclusions.

These findings (related to our initial hypotheses in Table 2) clearly suggest that there are specific pieces of meta-data about tweets that may allow the identification of fake news. One such parameter is the time of exposure. Viral tweets containing fake news are shorter-lived than those containing other types of content. This notion seems to resonate with our findings showing that a number of accounts spreading fake news had already been deleted or suspended by Twitter at the time of writing. If one considers that researchers using different data have found similar results [10], it appears that the lifetime of accounts, together with the age of the questioned viral content, could be useful to identify fake news. In the light of this finding, newly created accounts should probably be put under higher scrutiny than older ones. This, in fact, would be a nice a priori bias for a Bayesian classifier.

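As a purely illustrative sketch of that idea (not part of the study), the function below applies Bayes' rule to show how a "recently created account" signal could shift a prior probability that a viral tweet contains fake news; the prior of roughly 10% echoes the share of fake-news tweets among viral tweets in our sample, while the likelihoods are placeholders to be estimated from labelled data.

```python
def posterior_fake(prior_fake, p_new_given_fake, p_new_given_legit, account_is_new):
    """Bayes update of P(fake) given whether the posting account is newly created.

    All probabilities are inputs to be estimated from labelled data; nothing here
    is taken from the paper beyond the general idea of an account-age prior.
    """
    if account_is_new:
        like_fake, like_legit = p_new_given_fake, p_new_given_legit
    else:
        like_fake, like_legit = 1 - p_new_given_fake, 1 - p_new_given_legit
    evidence = like_fake * prior_fake + like_legit * (1 - prior_fake)
    return like_fake * prior_fake / evidence

# Roughly 10% of viral tweets in our sample were labelled as fake news; the
# likelihoods below are purely illustrative placeholders.
print(posterior_fake(prior_fake=0.10, p_new_given_fake=0.6,
                     p_new_given_legit=0.2, account_is_new=True))
# -> 0.25: the "new account" signal more than doubles the prior in this example.
```
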
Accounts spreading fake news appear to have a larger friends/followers ratio (i.e. they have, on average, the same number of friends but a smaller number of followers) than those spreading viral content only. Together with the fact that, on average, tweets containing fake news have more URLs than other viral tweets, it is possible to hypothesize that both the friends/followers ratio of the account producing a viral tweet and the number of URLs contained in such a tweet could be useful to single out fake news on Twitter. Not only that, but our finding related to the number of URLs is in line with intuitions behind the incentives to create fake news commonly found in the literature [10] (in particular, that of obtaining revenue through click-through advertising).

Finally, it is interesting to notice that the content of viral fake news was highly polarized. This finding is also in line with those of Allcott and Gentzkow [10]. This feature suggests that textual sentiment analysis of the content of tweets (as most researchers do), together with the above-mentioned meta-data parameters, may prove useful for identifying fake news.

Conclusions

With the election of Donald Trump as President of the United States, the concept of fake news has become a broadly-known phenomenon that is getting tremendous attention from governments and media companies. We have presented a preliminary study on the meta-data of a publicly available dataset of tweets that became viral during the day of the 2016 US presidential election. Our aim is to advance the understanding of which features might be characteristic of viral tweets containing fake news in comparison with viral tweets without fake news.

We believe that the only way to automatically identify those deceitful tweets (i.e. containing fake news) is by actually understanding and modelling them. Only then can the processes of tagging and blocking these tweets be successfully automated. In the same way that spam was fought, we anticipate that fake news will undergo a similar evolution, with social platforms implementing tools to deal with them. With most works so far focusing on the actual content of the tweets, ours is a novel attempt from a different, but also complementary, angle.

Within the used dataset, we found that there are differences around exposure, characteristics of accounts spreading fake news and the tone of the content. Those findings suggest that it is indeed possible to model and automatically detect fake news. We plan to replicate and validate our experiments on an extended sample of tweets (covering up to four months after the US election), and to test the predictive power of the features we found relevant within our sample.

Author Disclosure Statement

No competing financial interests exist.

References