The Coronavirus pandemic has been accompanied by an “infodemic” according to the World Health Organization (who:2020; Gallotti:2020). An overabundance of information—not all of it truthful—about the Coronavirus and vaccines prevented people from finding trustworthy public health guidance guidance and can influence peoples’ behavior (Cinelli:2020). Research has shown that areas with “greater exposure to media downplaying the threat of COVID-19 experienced a greater number of cases and deaths” (Bursztyn:2020). Misinformation kills, and the public health community must find a way to effectively combat misinformation.
Most studies of misinformation on social media focus on content organic to the social media platform (e.g. the text of the Twitter tweet or Facebook post). Less attention has been given to external content like shared news articles. Studies that do examine external content look at the source (website) of the content to identify reliability and even characterize networks of low-information websites, but do not typically examine the character of the content (Yang:2020; Singh:2020). There have, however, been some recent works linking human behavior to website sharing on social media. Both (Giglietto:2020) and (Chen:2018), find that looking at coordination in link sharing can help uncover coordinated disinformation campaigns on social media sites. Even more broadly with human behaviour and link sharing, previous work has demonstrated that external content can often have different patterns of usage on social media and so form an important basis to understand social media behaviour and discussions (Boberg:2020; Cruickshank:2020). The websites users share on social media is an important indicator of information consumption and a means of understanding information propagation.
Our study aims to apply insights and tools from the emerging field of social-cybersecurity (Carley:2020). In particular, we are interested in the following primary questions: (1) What external content was shared in the vaccine discussion on Twitter during the COVID-19 pandemic outbreak? (2) What were the characteristics of this content? By looking at what outside information is shared on Twitter we can forge an understanding of how social media activity relates to activity outside of social media. We believe this novel approach of examining the content and the source of external articles and merging this information with social media activity will aid future public health messaging and become mainstream in misinformation work that examines changing narratives over time, misinformation across multiple social media platforms, and coordinated messaging of similar external content.
The analysis encompassed tweets from two different data sets with COVID-19 related tweets. The first data set was based on a collection of tweet IDs gathered using general COVID-19 related keywords by (Chen:2020). Examples of keywords included “coronavirus” and “Wuhancoronavirus.”
We used ”hydration,” a process of gathering information about each tweet into JSON format file via the Twitter API, on all tweet IDs (Hydrator:2020) (Twitter:2021). This only populated data from tweets that were still available on Twitter. Each tweet was filtered using the sub-strings ”vax” and ”vaccin” and only these were analyzed. The resulting data set contained 1,649,940 English language tweets spanning the time period of January 21, 2020 to July 31, 2020.
The second data set included approximately 4.5 million COVID-19-related tweets collected from 29 January 2020 to 23 June 2020, using a list of keywords, including “coronavirus”, “wuhan virus”, “wuhanvirus”, “2019nCoV”, “NCoV”, “NCoV2019” (Huang:2020). For our analysis, we selected the subset of these tweets that also contained vaccination-related keywords that include ”vax” and ”vaccin” .
Both data sets were then filtered to include only the dates of overlap (i.e. 1 February 2020 to 23 June 2020) and combined to produce 841,896 unique tweets for analysis.
We filtered out tweets that did not share a URL. We expanded shortened URLs 111https://unshortenit.readthedocs.io/en/latest/, filtered out links to youtube.com and twitter.com because we intended to scrape articles not videos or tweets. We removed duplicate links, but summed the total number of retweets and favorites for each link as a measure of virality. We scraped article content by using Requests Python library 222https://docs.python-requests.org/en/master/ to make an HTTP request to the URL in each tweet, with a six-second timeout to prevent the request from stalling the overall scraping process or burdening the domains being scraped, using BeautifulSoup Python library 333https://www.crummy.com/software/BeautifulSoup/bs4/doc/ to parse the HTML content of each webpage. We pulled all text from within paragraph (¡p¿) tags. We would have liked to use article (¡article¿) tags, but these are not ubiquitously used in web articles. We filtered out any articles that were less than 500 characters in length. The use of paragraph tags sometimes captured additional information (headers, footers, advertisements), which were filtered out later.
We labeled domains into various categories based on the labels compiled in (Cruickshank:2020) and from our domain knowledge. We first labeled domains into four domain categories by the type of domain they are: “dubious,” “government”, “news”, and “science”. For example, a ”news” domain would be www.cnn.com and a ”dubious” domain would be "www.oann.com". We then labeled domains into five bias categories: “conspiracy,” “fake-news,” “leans-left,” “leans-center,” and “leans-right”, which matches the bias labels compiled in (Cruickshank:2020). Only 6062 of the unique articles could be categorized into a known domain, all others were removed from the data set.
We used the Gensim Python library for cleaning 444https://radimrehurek.com/gensim/. We made all words lowercase, removed english stopwords and accents, and created bigrams and trigrams.
We chose term frequency-inverse document frequency (TF-IDF) to find the most important terms in each document within the corpus (Jones:1972). Specifically, we used the Scikit Learn Python library’s implementation 555https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
. Words with a high TF-IDF score are more prevalent within a given document relative to other documents within the corpus. We used TF-IDF scores to cluster documents based on topical similarity using the KMedoids clustering algorithm. We chose KMedoids rather than KMeans because it is more robust to outliers and performs better in a non-euclidean space. We decided on six topics as the appropriate number of clusters by examining silhouette scores (a metric using mean intra-cluster distance and mean nearest-cluster distance to measure tightness of clusters)666https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html and examining top words within clusters to determine whether clusters were coherent (Rousseeuw:1987).
Using the date-time stamp of each tweet, we grouped tweets by week to examine the websites by when they were tweeted within a selected time period. We found the proportion of a topic or domain’s representation in a given time period by dividing the number of articles from a given topic or domain in a specific time period by the total number of articles across all topics or domains during that same time period. This allowed us to create Figure 2 and Figure 3.
In this section we detail the results from the topic clustering of the website content and then do a dynamic analysis of those topics. We also analyze how the different websites, and their associated topics and domains are shared on Twitter itself.
4.1. Topic Clustering
Topic clustering yielded six distinct topics of conversation: coronavirus (general), response/lockdown, scientific research, finance, politics, and conspiracy/doubts. We used the top 50 frequently-occurring terms of each cluster relative to the entire corpus of web articles
Topic 0, Coronavirus (general), contained general terms relating to the virus. Topic 1, Response / Lockdowns references school closings, facial coverings, travel shutdowns, and social distancing. Topic 2, Scientific Research, references vaccine makers, Dr. Fauci (Director of the United States National Institute of Allergy and Infectious Diseases), and vaccine trials. We also see “gates” (likely a reference to Bill Gates, a focus of many anti-vaccine theories), but believe in this context of scientific research that this refers to the Gates Foundation’s efforts to fund vaccine production, rather than conspiracy theories. Topic 3, Finance had terms clearly related to finance. Topic 4, Politics had references to federal and state governments, political parties, as well as political figures. Topic 5, Conspiracy/Doubts references the flu ( coronavirus doubters said coronavirus was no worse than the flu), “wuhan” and “chinese” appeared (where the virus originated, but also used pejoratively in “wuhan-virus”), “gates” (this time we believe as in the popular conspiracy theory that Bill Gates invented the virus in order to implant chips in people during vaccination), “natural” (common in ‘natural remedies’ in anti-vaccine literature), “lab” and “generated” (a theory exists that says the virus was created in a laboratory). Other top words outside the top 50 include “bat” and “bats” (often used in reference to the theory about virus origins), “severe” and “children” (much of the anti-vaccine literature focuses on negative reactions in children), “mike” and “adams” and “ranger” (Mike Adams also known as the ‘Health Ranger’ is an anti-vaccine supporter and HIV-AIDS denier), “parler” (a fake-news and conspiracy website), and “truth.”
4.2. Dynamic Analysis
Public discussion of each topic was not constant over time. By examining the total number of unique URLs shared from each topic group and normalizing by time period (week), we are able to see what proportion of the conversation each topic accounts for in a given week.
Topic 0 (Coronavirus general) dominates early public discussion, accounting for around 40% of total unique articles shared in February. This proportion gradually decreased, settling closer to 20% of total content shared by June. Topic 1, Response/Lockdown comprises less than 10% of the conversation in February, grows to 15% of the conversation in April, then 20-25% of the conversion by May and June. Topic 2, Scientific Research, consistently maintained a ¿15% rate, with occasional weekly spikes to around 25% of total unique articles shared. In May and June this topic maintained almost 25% of the discourse. Topic 3, Finance, usually remains below 6% of the public conversion and stays below 9% of the total unique articles shared. Topic 4, Politics, is less than 8% of the conversation through February, but grows to 20% of total articles shared by mid-march, around the time President Trump declared a national emergency and state-level lockdowns began. After mid-March, this topic maintained 20% of the discourse. Topic 5, Conspiracy/Doubts, was strongest in February, when the origins, transmissions, effects, and treatment of Coronavirus were unknown, accounting for between 22% and 35% of the weekly discourse. Over time this decreased proportional to total public discourse. In March, the Conspiracy topic made up 13% of the discourse, and maintained a 10% level through June.
Examining the domains from which external content was shared is also revealing. We examined this by taking the number of unique URLs shared from each domain group (dubious, government, news, and science) and normalizing by time period (week).
Dubious domains, account for 38% of all unique external content shared in the first week and over 25% of all unique websites shared in the public vaccine discussion on Twitter. Government domains, account for the smallest share of information each week, usually less than 4% of unique articles. News domains account for over 30% of articles shared in February, never drops below 40% after February, and usually is over 50% of articles shared. Science domains comprise 30% to 41% of articles shared each week in February, but decreases through subsequent months. Science domains never account for more than 20% of weekly articles shared after late April.
4.3. Website Sharing Analysis
Articles from different domain groups are retweeted and favorited at different rates. This gives some measure of virality for articles from each domain, to say nothing of the virality of the most-retweeted or most-favorited articles across other social media platforms such as Facebook. Dubious sources receive large amounts of retweets and favorites. Government-sourced articles are the least-viral of all source domains, and have the fewest number of unique articles shared on twitter.
Articles about different topics are also retweeted and favorited at different rates. Articles about politics and lockdowns/social distancing responses are by far the most retweeted, followed by articles about vaccine development.
By examining the proportion of articles from each topic group by the source domain group and normalizing on the article content topic, we can see where different types of stories originate.
News domains are the leading supplier of external content for every topic except for conspiracy/doubts (Topic 5). Government domains were by far the smallest originator of content for each topic, accounting for less than 6% of political discussion, less than 4% of general coronavirus discussion, less than 3% of response/lockdown, scientific research, and only 2% of the public discussion about conspiracies and doubts about coronavirus. Scientific domains accounted for a significant portion of general coronavirus (22%), scientific research (30%), and conspiracy/doubts (31%) discussions, but these numbers look less good when compared to the large share of these discussions originating from dubious sources. Dubious sources account for 25% of general coronavirus content, 20% of lockdown conversations, 27% of conversations about scientific research, 20% of political conversations, and 38% of conspiracy/doubts conversations.
This large proportional use of dubious domains to source information is particularly jarring considering that only 13% of the tweets in the collected dataset were labeled as vaccine-hesitant tweets. This means the remaining 87% of non-vaccine-hesitant tweets were still relying on dubious sources of information in public coronavirus discussions. The proportion of dubious-sourced content rivals legitimate scientific content in this “infodemic.” A chi-square test for independence between the domain labels and their text cluster labels had a p-value ¡ 0.0001, indicating there was a dependence between the domain groups and the cluster groups from the text.
We propose a new methodology to examine external content shared on social media by both its origin source (top-level domain) and its article content. External content has the potential to spread misinformation across social media platforms and our methods should be studied on large social media data sets consisting of topics other than vaccine conversations and on platforms other than Twitter to validate that the results and methods can be generalized.
Using external content shared on Twitter to study public conversations around vaccines and coronavirus, we find: (1) multiple distinct narratives about the pandemic, (2) narratives change in their importance over time, (3) different sources of information contribute to these narratives in different proportions, and (4) different originating sources of external content and different topics have different amounts of virality.
The “infodemic” of untrustworthy information shared on Twitter policitized measures like social distancing and mask-wearing and corrupted public health messaging, likely contributing to some of the more than 500,000 American deaths. Public health messaging from scientific and government sources were drowned out by dubious and news sources. More must be done by the government and scientific communities to create more “signal” of good, safe information in the “noise” of news coverage and misinformation.
External content was used effectively to not only differentiate narratives being shared on Twitter, but see the changes in these narratives over time. From a public health messaging perspective, this means that trustworthy sources of information (scientific and government) must track changes in public narrative and adjust positive public health messaging accordingly. The Twitter dataset used for this analysis also labeled tweets as either “normal” or “vaccine hesitant.” Though we did not use these labels in our study, over 85% of tweets in the aggregate data set were “normal.” This means that dubious sources—which accounted for around 25% of content shared each week—were being shared by both “normal” and “vaccine hesitant” Twitter users, showing it is hard to find trustworthy information even if we assume social media users want to use it. More work, such as sentiment analysis, should be done to understand how pro- and anti-vaccine Twitter users use the same external content from the same external sources in different ways and within different communities. Future research should also look to better characterize and classify the domains of websites so that more websites can be incorporated in future work.