Introduction
While fake news, understood as deliberately misleading pieces of information, has existed for a long time (e.g. it is not unusual to receive news falsely claiming the death of a celebrity), the term entered the mainstream, particularly in politics, during the 2016 presidential election in the United States [1]. Since then, governments and corporations alike (e.g. Google [2] and Facebook [3]) have begun efforts to tackle fake news, as they can affect political decisions [4]. Yet, the ability to define, identify and stop fake news from spreading remains limited.
Since the Obama campaign in 2008, social media has been pervasive in the political arena in the United States. Studies report that up to 62% of American adults receive their news from social media [5]. The wide use of platforms such as Twitter and Facebook has facilitated the diffusion of fake news by simplifying the process of receiving content without significant third-party filtering, fact-checking or editorial judgement. Such characteristics make these platforms suitable means for sharing news that, disguised as legitimate, tries to confuse readers.
Such use, and its prominent rise, has been confirmed by Craig Silverman, a Canadian journalist and a prominent figure on fake news [6]: “In the final three months of the US presidential campaign, the top-performing fake election news stories on Facebook generated more engagement than the top stories from major news outlets”.
Our current research hence starts from the assumption that social media is a conduit for fake news and asks whether fake news (as spam was some years ago) can be identified, modelled and eventually blocked. To do so, we use a sample of more than 1.5M tweets collected on November 8th 2016 —election day in the United States— with the goal of identifying features that tweets containing fake news are likely to have. As such, our paper aims to provide a preliminary characterization of fake news on Twitter by looking into the meta-data embedded in tweets. Considering meta-data as a relevant factor of analysis is in line with findings reported by Morris et al. [7]. We argue that understanding the differences between tweets containing fake news and regular tweets will allow researchers to design mechanisms to block fake news on Twitter.
Specifically, our goals are: 1) compare the characteristics of tweets labelled as containing fake news to tweets labelled as not containing them, 2) characterize, through their meta-data, viral tweets containing fake news and the accounts from which they originated, and 3) determine the extent to which tweets containing fake news expressed polarized political views.
For our study, we used the number of retweets to single out those tweets that went viral within our sample. Tweets within that subset (viral tweets hereafter) are varied and relate to different topics. We consider that a tweet contains fake news if its text falls within any of the following categories described by Rubin et al. [8] (see the next section for the details of such categories): serious fabrication, large-scale hoaxes, jokes taken at face value, slanted reporting of real facts and stories where the truth is contentious. The dataset [9], manually labelled by an expert, has been publicly released and is available to researchers and interested parties.
From our results, the following main observations can be made:

- The distributions of the number of retweets, favourites and hashtags in tweets containing fake news are not significantly different from their counterparts in tweets not containing fake news.
- Accounts generating fake news are comparatively more likely to be unverified than accounts not producing fake news.
- There are significant differences in both the number of friends and the number of followers of the accounts creating tweets with fake news when compared with accounts not generating them.
- There are no significant differences in the number of media elements, but there are indications that the number of URLs is indeed different.
Our findings resonate with similar work on fake news such as that of Allcott and Gentzkow [10]. Therefore, even if our study is a preliminary attempt at characterizing fake news on Twitter using only their meta-data, our results provide external validity to previous research. Moreover, our work not only stresses the importance of using meta-data, but also underscores which parameters may be useful to identify fake news on Twitter.
The rest of the paper is organized as follows. The next section briefly discusses where this work is located within the literature on fake news and contextualizes the type of fake news we are studying. Then, we present our hypotheses, the data, and the methodology we follow. Finally, we present our findings, conclusions of this study, and future lines of work.
Defining Fake News
Our research is connected to different strands of academic knowledge related to the phenomenon of fake news. In relation to Computer Science, a recent survey by Conroy and colleagues [11] identifies two popular approaches to single out fake news. On the one hand, the authors pointed to linguistic approaches, which use the text, its linguistic characteristics and machine learning techniques to automatically flag fake news. On the other hand, they underscored network approaches, which make use of network characteristics and meta-data to identify fake news.
With respect to the social sciences, efforts from psychology, political science and sociology have been dedicated to understanding why people consume and/or believe misinformation [12, 13, 14, 15]. Most of these studies consistently report that psychological biases such as priming effects and confirmation bias play an important role in people's ability to discern misinformation.
In relation to the production and distribution of fake news, a recent paper in the field of Economics [10] found that most fake news sites use names that resemble those of legitimate organizations, and that sites supplying fake news tend to be short-lived. These authors also noticed that fake news items are more likely to be shared than legitimate articles coming from trusted sources, and that they tend to exhibit a larger level of polarization.
How to define fake news is a serious and unresolved conceptual issue. As the focus of our work is not to shed light on this matter, we rely on work by other authors to describe what we consider as fake news. In particular, we use the categorization provided by Rubin et al. [8]. The five categories they described, together with illustrative examples from our dataset, are as follows:
- Serious fabrication. These are news stories created entirely to deceive readers. During the 2016 US presidential election there were plenty of examples of this (e.g. claiming that a celebrity had endorsed Donald Trump when that was not the case). For instance: [@JebBush - Maybe Donald negotiated a deal with his buddy @HillaryClinton. Continuing this path will put her in the White House. https://t.co/AlvByiSrMn]
- Large-scale hoaxes. Deceptions that are then reported in good faith by reputable sources. A recent example would be the story that the founder of Corona beer made everyone in his home village a millionaire in his will. For instance: [@FullFrontalSamB - Unfortunately Melania copied HER ballot from Michelle so… Donald just voted for Hillary. #ElectionDay https://t.co/x2ZimtFxyl]
- Jokes taken at face value. Humour sites such as The Onion or The Daily Mash present fake news stories in order to satirise the media. Issues can arise when readers see the story out of context and share it with others. For instance: [@BBCTaster - BREAKING NEWS: If you face-swap @realDonaldTrump with @MayorofLondon you get Owen Wilson. https://t.co/YY8a20wQVP]
- Slanted reporting of real facts. Selectively-chosen but truthful elements of a story put together to serve an agenda. One of the most prevalent examples of this is the well-known problem of voting machine faults. For instance: [@NeilTurner_ - @realDonaldTrump Trump predicted it. #BrusselsAttack https://t.co/BM3UxA7heR]
- Stories where the ‘truth’ is contentious. On issues where ideologies or opinions clash —for example, territorial conflicts— there is sometimes no established baseline for truth. Reporters may be unconsciously partisan, or perceived as such. For instance: [@FoxNews - Report: @HillaryClinton’s plan would raise taxes $1.3T/10 years. https://t.co/Dh1tWM4FAP]
Research Hypotheses
Previous work in the area (presented in the section above) suggests that there may be important determinants for the adoption and diffusion of fake news. Our hypotheses build on them and identify three important dimensions that may help distinguish fake news from legitimate information:
- Exposure. Given that psychological effects such as priming and confirmation bias are likely to increase the probability that an individual believes a certain piece of information, we believe exposure to misinformation is an important determinant of a fake news distribution strategy.
- Characterization. Given that distributors of fake news may want to simulate legitimate information outlets, we believe it is important to analyse specific features that may help a fake news outlet ‘disguise’ itself as a legitimate one.
- Polarization. Given that fake news outlets are more likely to attract attention with polarizing content (see [15]), we believe the level of polarization is an important determinant of a fake news distribution strategy.
Taking those three dimensions into account, we propose the following hypotheses about the features that we believe can help to distinguish tweets containing fake news from those not containing them. They will later be tested on our collected dataset.
Exposure.

- H1A: The average number of retweets of a viral tweet containing fake news is larger than that of viral tweets not containing them.
- H1B: The average number of hashtags and user mentions in viral tweets with fake news is larger than that of viral tweets with no fake news in them.

Characterization.

- H2A: Viral tweets containing fake news have a larger number of URLs.
- H2B: The creation date of an account generating tweets with fake news is more recent than that of accounts tweeting non-fake news content.
- H2C: The friends/followers ratio of accounts tweeting fake news is larger than that of accounts creating tweets without them.

Polarization.

- H3: Viral tweets containing fake news are slanted towards one candidate.
Data and Methodology
For this study, we collected publicly available tweets using Twitter’s public API. Given the nature of the data, it is important to emphasize that such tweets are subject to Twitter’s terms and conditions, which indicate that users consent to the collection, transfer, manipulation, storage, and disclosure of data. Therefore, we do not expect ethical, legal, or social implications from the usage of the tweets. Our data was collected using search terms related to the presidential election held in the United States on November 8th 2016. In particular, we queried the filter endpoint of Twitter’s streaming API using the following hashtags and user handles: #MyVote2016, #ElectionDay, #electionnight, @realDonaldTrump and @HillaryClinton. The data collection ran for just one day (November 8th 2016).
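For illustration, a collection script of this kind could be written as sketched below. It uses the tweepy library’s streaming interface as it existed at the time (tweepy 3.x; Twitter has since retired the v1.1 streaming endpoint), and the output file name and credential placeholders are our own, not part of the study:

```python
import json
import tweepy  # tweepy 3.x era API; the v1.1 streaming endpoint is now retired

TRACK_TERMS = ["#MyVote2016", "#ElectionDay", "#electionnight",
               "@realDonaldTrump", "@HillaryClinton"]

class ElectionListener(tweepy.StreamListener):
    """Append every matching tweet, as raw JSON, to a local file."""
    def on_status(self, status):
        with open("election_tweets.jsonl", "a") as f:
            f.write(json.dumps(status._json) + "\n")

    def on_error(self, status_code):
        # Returning False on rate limiting (HTTP 420) disconnects the stream.
        return status_code != 420

# Placeholder credentials: replace with real application keys.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=ElectionListener())
stream.filter(track=TRACK_TERMS)  # filter endpoint of the streaming API
```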
One straightforward way of sharing information on Twitter is the retweet functionality, which enables a user to share an exact copy of a tweet with their followers. Among the reasons for retweeting, Boyd et al. [16] reported the will to: 1) spread tweets to a new audience, 2) show one’s role as a listener, and 3) agree with someone or validate the thoughts of others. As indicated, our initial interest is to characterize viral tweets containing fake news (as they are the most harmful, reaching the widest audience) and to understand how they differ from other viral tweets that do not contain fake news. For our study, we consider that a tweet went viral if it was retweeted more than 1000 times.
Once we had the dataset of viral tweets, we eliminated duplicates (some tweets were collected several times because they matched several handles or hashtags), and an expert manually inspected the text field of each tweet to label it as containing fake news or not (according to the characterization presented before). This annotated dataset [9] is publicly available and can be freely reused.
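The filtering and de-duplication steps can be illustrated with a minimal pandas sketch; the input file is hypothetical, while the column names mirror Twitter’s API fields:

```python
import pandas as pd

# Hypothetical dump of the collected tweets, one JSON object per line.
tweets = pd.read_json("election_tweets.jsonl", lines=True)

# Viral threshold used in this study: more than 1000 retweets.
viral = tweets[tweets["retweet_count"] > 1000]

# Tweets captured more than once (they matched several handles/hashtags)
# are collapsed using the unique tweet id.
viral = viral.drop_duplicates(subset="id").reset_index(drop=True)
```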
Finally, we use the following fields within tweets (from the ones returned by Twitter’s API) to compare their distributions and look for differences between viral tweets containing fake news and viral tweets not containing fake news:

- Exposure: created_at, retweet_count, favourites_count and hashtags.
- Characterization: screen_name, verified, urls, followers_count, friends_count and media.
- Polarization: text and hashtags.
In the following section, we provide graphical descriptions of the distribution of each of the identified attributes for the two sets of tweets (those labelled as containing fake news and those labelled as not containing them). Where appropriate, we normalized and/or took logarithms of the data for better representation. To assess the significance of the observed differences, we use the two-sample Kolmogorov-Smirnov test with the null hypothesis that both distributions are equal.
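As an illustration, such a test can be run per feature with SciPy’s two-sample implementation; the `fake_news` label column and the input file are hypothetical stand-ins for the annotated dataset [9]:

```python
import pandas as pd
from scipy.stats import ks_2samp

viral = pd.read_json("viral_tweets_labelled.jsonl", lines=True)  # hypothetical file
fake = viral[viral["fake_news"]]
not_fake = viral[~viral["fake_news"]]

for feature in ["followers_count", "friends_count",
                "favourites_count", "retweet_count"]:
    # The KS statistic is the maximum distance between the two
    # empirical cumulative distribution functions.
    stat, p = ks_2samp(fake[feature].dropna(), not_fake[feature].dropna())
    print(f"{feature}: difference={stat:.4f}, p-value={p:.4g}")
```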
Results
The collected sample consisted of 1 785 855 tweets published by 848 196 different users. Within our sample, we identified 1327 tweets that went viral (retweeted more than 1000 times by November 8th 2016), produced by 643 users. This small subset of viral tweets was retweeted on 290 841 occasions within the observed time-window.
The 1327 ‘viral’ tweets were manually annotated as containing fake news or not. The annotation was carried out by a single person in order to obtain a consistent annotation throughout the dataset. Out of those 1327 tweets, we identified 136 as potentially containing fake news (according to the categories previously described), and the rest were classified as not containing fake news. Note that the categorization is far from perfect given the ambiguity of fake news themselves and the human judgement involved in the process. Because of this, we do not claim that this dataset constitutes a ground truth.
The following results detail characteristics of these tweets along the previously mentioned dimensions. Table 1 reports the actual differences (together with their associated p-values) of the distributions of viral tweets containing fake news and viral tweets not containing them for every variable considered.
Table 1. Kolmogorov-Smirnov two-sample test: difference statistic and p-value for each feature, comparing viral tweets with and without fake news.

| Feature | Difference | p-value |
|---|---|---|
| Followers | 0.2357 | 2.6E-6 |
| Friends | 0.1747 | 0.0012 |
| URLs | 0.1285 | 0.0358 |
| Favourites | 0.1218 | 0.0535 |
| Mentions | 0.1135 | 0.0862 |
| Media | 0.0948 | 0.2231 |
| Retweets | 0.0609 | 0.7560 |
| Hashtags | 0.0350 | 0.9983 |
Exposure
Figure 1 shows that, in contrast to other kinds of viral tweets, those containing fake news were created more recently. As such, Twitter users were exposed to fake news related to the election for a shorter period of time.
However, in terms of retweets, Figure 2 shows no apparent difference between tweets containing fake news and those not containing them. This is confirmed by the Kolmogorov-Smirnov test, which does not reject the hypothesis that the associated distributions are equal.
[Figure 1: Distribution of creation dates of viral tweets, with and without fake news.]

[Figure 2: Density distribution of the number of retweets of viral tweets.]

[Figure 3: Density distribution of the number of favourites.]

[Figure 4: Density distribution of the number of hashtags.]
In relation to the number of favourites, users that generated at least one viral tweet containing fake news appear to have, on average, fewer favourites than users that did not. Figure 3 shows the distribution of favourites. Despite the apparent visual differences, they are not statistically significant.
Finally, the number of hashtags used in viral fake news appears to be larger than in other viral tweets. Figure 4 shows the density distribution of the number of hashtags used. However, once again, we were not able to find a statistically significant difference between the number of hashtags in viral tweets with and without fake news.
Characterization
We found that 82 users within our sample were spreading fake news (i.e. they produced at least one tweet which was labelled as containing fake news). Out of those, 34 had verified accounts and the remaining 48 were unverified. From the 48 unverified accounts, 6 had been suspended by Twitter at the time of writing, 3 tried to imitate legitimate accounts of others, and 4 had already been deleted. Figure 5 shows the proportion of verified to unverified accounts for viral tweets (containing fake news vs. not containing fake news). From the chart, it is clear that fake news are more likely to come from unverified accounts.
[Figure 5: Proportion of verified vs. unverified accounts for viral tweets containing and not containing fake news.]
Turning to friends, accounts distributing fake news appear to have, on average, the same number of friends as those distributing tweets with no fake news. However, the density distribution of friends across the accounts (Figure 6) shows that there is indeed a statistically significant difference in their distributions.
[Figure 6: Density distribution of the number of friends of the accounts.]

[Figure 7: Density distribution of the number of followers of the accounts.]

[Figure 8: Density distribution of the friends/followers ratio, showing quartiles. Accounts that generate fake news tend to have a higher ratio value.]

[Figure 9: Density distribution of the friends/followers ratio. Note that they do not follow a normal distribution. A higher friends/followers ratio exists for accounts that have produced at least one tweet labelled as containing fake news.]
If we take into consideration the number of followers, accounts generating viral tweets with fake news have a very different distribution along this dimension compared to those generating viral tweets with no fake news (see Figure 7). In fact, such differences are statistically significant.

A useful joint representation of friends and followers is the friends/followers ratio. Figures 8 and 9 show its distribution. Notice that accounts spreading viral tweets with fake news have, on average, a larger friends/followers ratio, while the values for accounts not generating fake news are spread more evenly.
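The ratio itself is a one-line derived feature; a sketch under the same hypothetical file and column names as before, where clipping the denominator is our own choice for handling accounts with zero followers:

```python
import pandas as pd
from scipy.stats import ks_2samp

viral = pd.read_json("viral_tweets_labelled.jsonl", lines=True)  # hypothetical file

# Derived friends/followers ratio. Clipping the denominator at 1 avoids
# division by zero for accounts with no followers (our own assumption).
viral["ff_ratio"] = viral["friends_count"] / viral["followers_count"].clip(lower=1)

stat, p = ks_2samp(viral.loc[viral["fake_news"], "ff_ratio"],
                   viral.loc[~viral["fake_news"], "ff_ratio"])
print(f"friends/followers ratio: difference={stat:.4f}, p-value={p:.4g}")
```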
With respect to the number of mentions, Figure 10 shows that viral tweets labelled as containing fake news appear to use mentions of other users less frequently than viral tweets not containing fake news. In other words, tweets containing fake news mostly contain one mention, whereas other tweets tend to have two. However, according to the Kolmogorov-Smirnov test (p = 0.0862, Table 1), such differences are not statistically significant at the conventional 0.05 level.
[Figure 10: Distribution of the number of user mentions in viral tweets.]
The analysis (Figure 11) of the presence of media in the tweets in our dataset shows that tweets labelled as not containing fake news appear to present more media elements than those labelled as fake news. However, the difference is not statistically significant.
[Figure 11: Distribution of the number of media elements in viral tweets.]
On the other hand, Figure 12 shows that viral tweets containing fake news appear to include more URLs to other sites than viral tweets that do not contain fake news. In fact, the difference between the two distributions is statistically significant (assuming a significance level of 0.05; p = 0.0358, Table 1).
[Figure 12: Distribution of the number of URLs in viral tweets.]
Polarization
Finally, manual inspection of the text field of those viral tweets labelled as containing fake news shows that 117 of such tweets expressed support for Donald Trump, while only 8 supported Hillary Clinton. The remaining tweets contained fake news related to other topics, not expressing support for any of the candidates.
Discussion
As a summary, and constrained by our existing dataset, we made the following observations regarding differences between viral tweets labelled as containing fake news and viral tweets labelled as not containing them:
- Less than 0.1% of the tweets went viral (1327 out of 1 785 855). Out of those, only around 10% (136) were labelled as containing fake news.
- Tweets containing fake news that became viral during the day of the election were mostly created shortly before, or on, election day itself. That contrasts with tweets not containing fake news, which were initially created well before election day.
- Considering retweets, favourites and hashtags as proxies for exposure, we did not find any difference between viral tweets labelled as containing fake news and viral tweets labelled as not containing them.
- The characterization of accounts spreading fake news has shown that the proportion of unverified accounts that generated at least one tweet containing fake news is larger than that of accounts spreading tweets not labelled as fake news.
- Even if the accounts producing fake news follow, on average, the same number of other users as those producing tweets with no fake news in them, the distributions of their followers are statistically different.
- There is no significant difference between the number of media elements in viral tweets labelled as containing fake news and viral tweets labelled as not containing them.
- Viral tweets labelled as containing fake news tend to have more URLs than viral tweets labelled as not containing fake news.
- Regarding polarization, fake news were heavily supportive of the Trump campaign.
Table 2. Summary of the research hypotheses and their outcomes.

| Hypothesis | Outcome |
|---|---|
| H1A: The average number of retweets of a viral tweet containing fake news is larger than that of viral tweets not containing them. | NOT CONFIRMED |
| H1B: The average number of hashtags and user mentions in viral tweets with fake news is larger than that of viral tweets with no fake news in them. | NOT CONFIRMED |
| H2A: Viral tweets containing fake news have a larger number of URLs. | CONFIRMED |
| H2B: The creation date of an account generating tweets with fake news is more recent than that of accounts tweeting non-fake news content. | CONFIRMED |
| H2C: The friends/followers ratio of accounts tweeting fake news is larger than that of accounts creating tweets without them. | CONFIRMED |
| H3: Viral tweets containing fake news are slanted towards one candidate. | CONFIRMED |
These findings (related to our initial hypotheses in Table 2) clearly suggest that there are specific pieces of meta-data about tweets that may allow the identification of fake news. One such parameter is the time of exposure: viral tweets containing fake news are shorter-lived than those containing other types of content. This notion resonates with our finding that a number of accounts spreading fake news had already been deleted or suspended by Twitter at the time of writing. Considering that researchers using different data have found similar results [10], it appears that the lifetime of accounts, together with the age of the questioned viral content, could be useful to identify fake news. In light of this finding, newly created accounts should probably be put under higher scrutiny than older ones. This, in fact, would be a natural a priori bias for a Bayesian classifier.
Accounts spreading fake news appear to have a larger friends/followers ratio (i.e. they have, on average, the same number of friends but a smaller number of followers) than those spreading viral content only. Together with the fact that, on average, tweets containing fake news contain more URLs than other viral tweets, it is possible to hypothesize that both the friends/followers ratio of the account producing a viral tweet and the number of URLs contained in such a tweet could be useful to single out fake news on Twitter. Moreover, our finding related to the number of URLs is in line with the intuitions behind the incentives to create fake news commonly found in the literature [10] (in particular, obtaining revenue through click-through advertising).
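To make the point concrete, the sketch below shows how these meta-data features could feed a simple naive Bayes classifier. This is purely illustrative and not an implementation of any model in this paper; the file and column names (`urls_count`, `account_created_at`, `fake_news`) are hypothetical stand-ins for the API fields:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

viral = pd.read_json("viral_tweets_labelled.jsonl", lines=True)  # hypothetical file

# Candidate meta-data features suggested by our findings.
viral["ff_ratio"] = viral["friends_count"] / viral["followers_count"].clip(lower=1)
viral["account_age_days"] = (
    pd.Timestamp("2016-11-08", tz="UTC")
    - pd.to_datetime(viral["account_created_at"], utc=True)
).dt.days

X = viral[["ff_ratio", "urls_count", "account_age_days", "verified"]].astype(float)
y = viral["fake_news"]

# 5-fold cross-validated accuracy of a Gaussian naive Bayes baseline.
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())
```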
Finally, it is interesting to note that the content of viral fake news was highly polarized. This finding is also in line with those of Allcott and Gentzkow [10]. It suggests that textual sentiment analysis of the content of tweets (as most researchers do), together with the above-mentioned meta-data parameters, may prove useful for identifying fake news.
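As a pointer towards that direction, an off-the-shelf sentiment scorer such as NLTK’s VADER could be applied to the text field; this is only one possible tool, not one used in this study:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-off lexicon download
sia = SentimentIntensityAnalyzer()

# Example tweet text from our dataset (see the categories section).
text = "Report: @HillaryClinton's plan would raise taxes $1.3T/10 years."
print(sia.polarity_scores(text))  # keys: neg, neu, pos, compound
```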
Conclusions
With the election of Donald Trump as President of the United States, the concept of fake news has become a widely known phenomenon that is receiving tremendous attention from governments and media companies. We have presented a preliminary study on the meta-data of a publicly available dataset of tweets that became viral during the day of the 2016 US presidential election. Our aim is to advance the understanding of which features might be characteristic of viral tweets containing fake news in comparison with viral tweets without fake news.
We believe that the only way to automatically identify such deceitful tweets (i.e. those containing fake news) is by actually understanding and modelling them. Only then can the processes of tagging and blocking these tweets be successfully automated. In the same way that spam was fought, we anticipate that fake news will undergo a similar evolution, with social platforms implementing tools to deal with them. With most works so far focusing on the actual content of the tweets, ours is a novel attempt from a different, but complementary, angle.
Within the dataset used, we found differences around exposure, the characteristics of accounts spreading fake news, and the tone of the content. Those findings suggest that it is indeed possible to model and automatically detect fake news. We plan to replicate and validate our experiments on an extended sample of tweets (covering up to 4 months after the US election), and to test the predictive power of the features found relevant within our sample.
Author Disclosure Statement
No competing financial interests exist.
References
- [1] Connolly K, Chrisafis A, McPherson P, Kirchgaessner S, Haas B, Phillips D, Hunt E, Safi M. Fake news: an insidious trend that’s fast becoming a global problem. The Guardian 02 Dec 2016; https://www.theguardian.com/media/2016/dec/02/fake-news-facebook-us-election-around-the-world Accessed: 2017-05-03.
- [2] Fact check now available in Google Search and News around the world. https://blog.google/products/search/fact-check-now-available-google-search-and-news-around-world/. Accessed: 2017-05-20.
- [3] News feed FYI: New test with related articles. https://newsroom.fb.com/news/2017/04/news-feed-fyi-new-test-with-related-articles/. Accessed: 2017-05-15.
- [4] Hillary Clinton blames the Russians, Facebook, and Fake News for her loss. http://fortune.com/2017/05/31/clinton-fake-news/. Accessed: 2017-05-31.
- [5] Gottfried J, Shearer E. News use across social media platforms 2016. Technical report, Pew Research Center 2016. URL http://www.journalism.org/2016/05/26/news-use-across-social-media-platforms-2016.
- [6] Silverman C. Lies, damn lies and viral content. Technical report, Tow Center for Digital Journalism 2015. doi:10.7916/D8Q81RHH.
- [7] Morris M, Counts S, Roseway A, Hoff A, Schwarz J. Tweeting is believing?: understanding microblog credibility perceptions. In Procs. ACM 2012 Conf. Computer Supported Cooperative Work. 2012, 441–450.
- [8] Rubin VL, Chen Y, Conroy NJ. Deception detection for news: three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community. 2015, 83:1–83:4.
- [9] Amador J, Oehmichen A, Molina-Solana M. Viral tweets with fakenews on 2016 US election day. Zenodo 2017. doi:10.5281/zenodo.1048820.
- [10] Allcott H, Gentzkow M. Social media and fake news in the 2016 election. Technical Report 23089, National Bureau of Economic Research 2017. doi:10.3386/w23089.
- [11] Conroy NJ, Chen Y, Rubin VL. Automatic deception detection: Methods for finding fake news. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community. 2015, 82:1–82:4.
- [12] Pennycook G, Rand DG. Who Falls for Fake News? The Roles of Analytic Thinking, Motivated Reasoning, Political Ideology, and Bullshit Receptivity 2017. URL https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3023545.
- [13] Flynn D, Nyhan B, Reifler J. The nature and origins of misperceptions: Understanding false and unsupported beliefs about politics. Advances in Political Psychology 2017; 38(S1):127–150. doi:10.1111/pops.12394.
- [14] Polage DC. Making up History: False Memories of Fake News Stories. Europe’s Journal of Psychology 2012; 8(2):245–250. ISSN 1841-0413. doi:10.5964/ejop.v8i2.456. URL http://ejop.psychopen.eu/article/view/456.
- [15] Swire B, Berinsky AJ, Lewandowsky S, Ecker UKH. Processing political misinformation: comprehending the Trump phenomenon. Royal Society Open Science 2017; 4(3):160802. ISSN 2054-5703. doi:10.1098/rsos.160802.
- [16] Boyd D, Golder S, Lotan G. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In Proc. 43rd Hawaii Int. Conf. on System Sciences. 2010. ISSN 1530-1605, 1–10. doi:10.1109/HICSS.2010.412.