ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research

First identified in Wuhan, China, in December 2019, the COVID-19 outbreak was declared a global health emergency in January 2020 and a pandemic in March 2020 by the World Health Organization (WHO). Along with this pandemic, we are also experiencing an "infodemic" of information with low credibility, such as fake news and conspiracy theories. In this work, we present ReCOVery, a repository designed and constructed to facilitate research on combating such information regarding COVID-19. We first broadly search and investigate 2,000 news publishers, from which 61 are identified as having extreme [high or low] levels of credibility. By inheriting the credibility of the media on which they were published, a total of 2,029 news articles on coronavirus, published from January to May 2020, are collected in the repository, along with 140,820 tweets that reveal how these articles spread on the social network. The repository provides multimodal information for each news article, including textual, visual, temporal, and network information. The way news credibility is obtained allows a trade-off between dataset scalability and label accuracy. Extensive experiments are conducted to present data statistics and distributions, as well as to provide baseline performances for predicting news credibility, so that future methods can be directly compared. Our repository is available at http://coronavirus-fakenews.com and will be updated regularly.


1. Introduction

As of June 4, 2020, the COVID-19 pandemic has resulted in over 6.4 million confirmed cases and over 380,000 deaths globally (https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200604-covid-19-sitrep-136.pdf). Governments have enforced border shutdowns, travel restrictions, and quarantines to “flatten the curve” (Burkert and Loeb, 2020). The COVID-19 outbreak has had a detrimental impact not only on the healthcare sector but on every aspect of human life, such as education and the economy (Nicola et al., 2020). For example, over 100 countries have imposed nationwide (even complete) closures of education facilities, which has led to over 900 million learners being affected (https://en.unesco.org/covid19/educationresponse). Statistics indicate that 3.3 million Americans applied for unemployment benefits in the week ending March 21, 2020, and that number doubled in the following week; before then, the highest number of unemployment applications ever received in one week was 695,000, in 1982 (FitzGerald et al., 2020).

Along with the COVID-19 pandemic, we are also experiencing an “infodemic” of information with low credibility regarding COVID-19 (https://www.un.org/en/un-coronavirus-communications-team/un-tackling-%E2%80%98infodemic%E2%80%99-misinformation-and-cybercrime-covid-19). Hundreds of news websites have contributed to publishing false coronavirus information (https://www.newsguardtech.com/coronavirus-misinformation-tracking-center/). Individuals who believe false news articles claiming, for example, that eating boiled garlic or drinking chlorine dioxide (an industrial bleach) can cure or prevent coronavirus might take ineffective or even extremely dangerous actions to protect themselves from the virus (https://www.factcheck.org/2020/02/fake-coronavirus-cures-part-1-mms-is-industrial-bleach/).

Given this background, research is motivated to combat this infodemic. Hence, we design and construct a multimodal repository, ReCOVery, to facilitate reliability assessment of news on COVID-19. We first broadly search and investigate 2,000 news publishers, from which 61, with various political polarizations and from different countries, are identified as having extreme [high or low] credibility. As past literature has indicated, there is a close relationship between the credibility of news articles and that of their publication sources (Zhou and Zafarani, 2020). In total, 2,029 news articles on coronavirus are collected in the repository, along with 140,820 tweets that reveal how these articles spread on the social network. The main contributions of this work are summarized as follows:

First, we construct a repository to support research that investigates (1) how news with low credibility is created and spreads during the COVID-19 pandemic and (2) ways to predict such “fake” news. The manner in which the ground truth of news credibility is obtained allows a scalable repository: annotators need not label each news article, which is time-consuming; instead, they can directly label the news sites.

Second, ReCOVery provides multimodal information on COVID-19 news articles. For each news article, we collect its news content and the social context information revealing how it spreads on social media, covering textual, visual, temporal, and network information.

Third, we conduct extensive experiments using ReCOVery, which include analyzing the data (statistics and distributions) and providing baseline performances for predicting news credibility using the data. These baselines allow future methods to be directly compared. Baselines are obtained using either news content alone or news content combined with social context information, within a supervised machine learning framework.

The rest of this paper is organized as follows. We first detail how the data is collected in Section 2. The statistics and distributions of the data are presented and analyzed in Section 3. Experiments that use the data to predict news credibility are designed and conducted in Section 4; their results can be used as benchmarks. Finally, we review related datasets in Section 5 and conclude in Section 6.

2. Data Collection

The overall process by which we collect the data, including news content and social media information, is presented in Figure 1. To facilitate scalability, news credibility is assessed based on the credibility of the medium (site) that publishes the news article. Following the process outlined in Figure 1, we detail how the data is collected by answering three questions: (1) how do we identify reliable (or unreliable) news sites that mainly release real (or fake) news? (Section 2.1); having determined such news sites, (2) how do we crawl COVID-19 news articles from these sites, and which news components are valuable to collect? (Section 2.2); and given COVID-19 news articles, (3) how do we track their spread on social networks? (Section 2.3)

2.1. Filtering News Sites

To determine a list of reliable and unreliable news sites, we primarily rely on two resources: NewsGuard and Media Bias/Fact Check.

NewsGuard (https://www.newsguardtech.com/)

NewsGuard reviews and rates news websites. Its reliability rating team is formed by trained journalists and experienced editors, whose credentials and backgrounds are all transparent and available on the site. The credibility of each news website is assessed based on the following nine journalistic criteria:

  1. Does not repeatedly publish false content (22 points);

  2. Gathers and presents information responsibly (18 points);

  3. Regularly corrects or clarifies errors (12.5 points);

  4. Handles the difference between news and opinion responsibly (12.5 points);

  5. Avoids deceptive headlines (10 points);

  6. Discloses ownership and financing (7.5 points);

  7. Clearly labels advertising (7.5 points);

  8. Reveals who’s in charge, including possible conflicts of interest (5 points); and

  9. Provides the names of content creators, along with either contact or biographical information (5 points);

where the overall score of a site ranges from 0 to 100: 0 indicates the lowest credibility and 100 the highest. A news website with a NewsGuard score higher than 60 is generally labeled reliable; otherwise, it is unreliable. NewsGuard has provided ground truth for the construction of news datasets such as NELA-GT-2018 (Nørregaard et al., 2019) for studying misinformation.
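For illustration, this weighted scoring amounts to summing the points of satisfied criteria. The following is a minimal sketch assuming boolean criterion outcomes; the weights come from the list above, but the aggregation code is ours, not NewsGuard's.

```python
# A minimal sketch of aggregating a NewsGuard-style score from the nine
# weighted criteria above; illustrative only, not NewsGuard's implementation.
CRITERIA_POINTS = {
    "no_false_content": 22.0,
    "responsible_gathering": 18.0,
    "corrects_errors": 12.5,
    "news_vs_opinion": 12.5,
    "no_deceptive_headlines": 10.0,
    "discloses_ownership": 7.5,
    "labels_advertising": 7.5,
    "reveals_who_is_in_charge": 5.0,
    "names_content_creators": 5.0,
}

def newsguard_style_score(criteria_met: set) -> float:
    """Sum the points of satisfied criteria; 0 = lowest, 100 = highest."""
    return sum(points for name, points in CRITERIA_POINTS.items()
               if name in criteria_met)

# A site satisfying all nine criteria receives the maximum score of 100.
assert newsguard_style_score(set(CRITERIA_POINTS)) == 100.0
```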

Media Bias/Fact Check (MBFC) (https://mediabiasfactcheck.com/)

MBFC is a website that rates the factual accuracy and political bias of news media. Its fact-checking team consists of Dave Van Zandt, the primary editor and website owner, together with journalists and researchers (more details can be found on its “About” page). MBFC labels each news medium with one of six factual-accuracy levels, based on fact-checking results for the news articles it has published (more details can be found on its “Methodology” page): (i) very high, (ii) high, (iii) mostly factual, (iv) mixed, (v) low, and (vi) very low. Such information has been used as ground truth for automatic fact-checking studies (Baly et al., 2018).

What Are Our Criteria?

Drawing on NewsGuard and MBFC, our criteria for determining reliable and unreliable news sites are as follows (a filtering sketch appears after the list):


  • A news site is reliable if its NewsGuard score is greater than 90 and its factual reporting level on MBFC is “very high” or “high”.

  • A news site is unreliable if its NewsGuard score is less than 30 and its factual reporting level on MBFC is below “mixed” (i.e., low or very low).
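The two rules above can be expressed as a small filter. Below is a hedged sketch; the ordinal encoding of MBFC levels is our assumed representation, not code from the paper.

```python
# A hedged sketch of the site-filtering rules above; the level encoding
# is an assumption made for illustration.
MBFC_ORDER = ["very low", "low", "mixed", "mostly factual", "high", "very high"]

def site_label(newsguard_score: float, mbfc_level: str):
    """Return 'reliable', 'unreliable', or None (site excluded)."""
    if newsguard_score > 90 and mbfc_level in ("high", "very high"):
        return "reliable"
    if newsguard_score < 30 and MBFC_ORDER.index(mbfc_level) < MBFC_ORDER.index("mixed"):
        return "unreliable"
    return None  # between thresholds: excluded to preserve label accuracy
```

Sites falling between the two thresholds receive no label and are excluded, which is precisely how the stricter cutoffs trade dataset size for label accuracy.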

Our search for news media with high credibility is conducted among the roughly 2,000 news sites listed on MBFC. To find news media with low credibility, we search both MBFC and NewsGuard’s newly released “Coronavirus Misinformation Tracking Center,” which provides a list of websites publishing false coronavirus information. Ultimately, we obtain a total of 61 news sites, of which 22 are sources of reliable news articles (e.g., National Public Radio, https://www.npr.org, and Reuters, https://www.reuters.com) and the remaining 39 are sources of unreliable news articles (e.g., Humans Are Free, http://humansarefree.com/, and Natural News, https://www.naturalnews.com). The full list of sites considered in our repository is available at http://coronavirus-fakenews.com. Note that several “fake” news media, such as 70 News, Conservative 101, and Denver Guardian, are not included, as they no longer exist or their domains are no longer available.

Figure 2. Credibility Distribution of Determined News Sites: (a) Reliable News Sites; (b) Unreliable News Sites

Also note that, to achieve a good trade-off between dataset scalability and label accuracy, we adopt more extreme threshold scores (30 and 90) than the default suggested by NewsGuard (60). In this way, the selected news sites are extremely reliable (or unreliable), which reduces false positives and false negatives among the news labels in our repository; ideally, every news article published on a reliable site is factual, and every article on an unreliable site is false. Figure 2 illustrates the credibility distributions of reliable and unreliable news sites. It can be observed that most reliable news sites receive a full score on NewsGuard and a “high” factual label from MBFC; “very high” is rare among all sites listed on MBFC. In contrast, unreliable news sites have an average NewsGuard score of 15 and a “low” factual label from MBFC; similarly, “very low” is rarely given on MBFC.

2.2. Collecting COVID-19 News Content

Figure 3. Examples of News Articles Collected: (a) Reliable News (https://www.npr.org/sections/coronavirus-live-updates/2020/05/17/857512288/obama-malala-jonas-brothers-send-off-class-of-2020-in-virtual-graduation); (b) Unreliable News (https://humansarefree.com/2020/05/researchers-100-covid-19-cure-rate-using-intravenous-chlorine-dioxide.html)

To crawl COVID-19 news articles from the selected news sites, we first determine whether a news article is about COVID-19; this process is detailed in Section 2.2.1. We then describe how the data is crawled and which news content components are included in our repository in Section 2.2.2.

2.2.1. News Topic Identification

To identify news articles on COVID-19, we use a list of keywords:

  • SARS-CoV-2,

  • COVID-19, and

  • Coronavirus.

News articles whose content contains any of these keywords (case-insensitive) are considered related to COVID-19. The three keywords are the official names announced by the WHO on February 11, 2020, where “SARS-CoV-2” (standing for Severe Acute Respiratory Syndrome CoronaVirus 2) is the name of the virus, and “coronavirus” and “COVID-19” are names for the disease the virus causes (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it). Before the WHO announcement, COVID-19 was known as the “2019 novel coronavirus,” which also contains the “coronavirus” keyword we consider. We consider only official names as keywords to avoid potential biases, or even discrimination, in the articles collected. Furthermore, a news medium (or article) that is credible, or pretends to be credible, often acts professionally and adopts the official name(s) of the disease/virus. Compared to articles that use biased and/or inaccurate terms, false news pretending to be professional is more detrimental and more challenging to detect, and has become the focus of current fake news studies (Zhou and Zafarani, 2020). Examples of such news articles are illustrated in Figure 3.
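For concreteness, the keyword filter amounts to a case-insensitive substring check. A minimal sketch follows; the function name is illustrative rather than taken from the paper.

```python
# A minimal sketch of the keyword-based topic filter described above.
KEYWORDS = ("sars-cov-2", "covid-19", "coronavirus")

def is_covid19_article(text: str) -> bool:
    """Return True if the article text mentions any official COVID-19
    keyword (case-insensitive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)
```

Note that a substring check also matches earlier phrasings such as “2019 novel coronavirus,” consistent with the discussion above.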

2.2.2. Crawling News Content

Our content crawler relies on the Python newspaper library (https://github.com/codelucas/newspaper). The content of each news article corresponds to twelve components (a crawling sketch follows the list):

  1. News ID: Each news article is assigned a unique id as the identity;

  2. News URL: The URL of the news article. The URL helps us verify the correctness of the collected data. It can also be used as the reference and source when repository users would like to extend the repository by fetching additional information;

  3. Publisher: The name of the news media (site) that publishes the news article;

  4. Publication Date: The date (in yyyy-mm-dd format) on which the news article was published on the site, which provides temporal information to support the investigation of, e.g., the relationship between the misinformation volume and the outbreak of COVID-19 over time;

  5. Author: The author(s) of the news article, of which there can be none, one, or more than one. Note that some news articles carry fictional author names. Author information is valuable for evaluating news credibility, either by investigating the collaboration network of authors (Sitaula et al., 2020) or by exploring its relationships with news publishers and content (Zhang et al., 2018);

  6. News Title and Bodytext as the main textual information;

  7. News Image as the main visual information, provided in the form of a link (URL). Note that most images within a news page are noise: advertisements, images belonging to other news articles surfaced by the recommender systems embedded in news sites, logos of news sites, and/or social media icons such as the Twitter and Facebook sharing logos. Hence, we fetch only the main/head/top image of each news article to reduce noise;

  8. Country: The name of the country in which the news is published;

  9. Political Bias: Each news article is labeled as one of ‘extreme left’, ‘left’, ‘left-center’, ‘center’, ‘right-center’, ‘right’, or ‘extreme right’, matching the political bias of its publisher. News political bias is verified by two resources, AllSides (https://www.allsides.com/unbiased-balanced-news) and MBFC, both of which rely on domain experts to label media bias; and

  10. NewsGuard score and MBFC factual reporting as the original ground truth of news credibility, which has been detailed in Section 2.1.
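A minimal crawling sketch with the newspaper library is shown below. The URL and record layout are illustrative, while the library calls (download, parse, and the extracted attributes) follow newspaper's public API.

```python
# A hedged sketch of crawling one article with the newspaper library;
# the URL and record structure are illustrative, not the paper's code.
from newspaper import Article

url = "https://example-news-site.com/some-covid-19-story"  # hypothetical URL

article = Article(url)
article.download()   # fetch the page
article.parse()      # extract title, text, authors, date, top image

record = {
    "news_url": url,                        # component 2
    "title": article.title,                 # component 6: news title
    "body_text": article.text,              # component 6: bodytext
    "authors": article.authors,             # component 5
    "publish_date": article.publish_date,   # component 4
    "image": article.top_image,             # component 7: main/head/top image URL
}
```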

2.3. Tracking News Spreading on Social Media

We use the Twitter Search API (https://developer.twitter.com/en/docs/tweets/search/overview) to track the spread of the collected news articles on Twitter. Specifically, our search is based on the URL of each news article and looks for tweets posted between the article’s publication date and the current date (for the current version of the dataset, May 2020). The Twitter Search API returns the corresponding tweets with detailed information such as their IDs, text, language, creation times, and retweet/reply/like counts. It also returns information about the users who posted these tweets, such as user IDs and their numbers of followers/friends. To comply with Twitter’s Terms of Service (https://developer.twitter.com/en/developer-terms/agreement-and-policy), we publicly release only the IDs of the collected data for non-commercial research use. More details can be found at http://coronavirus-fakenews.com.
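Below is a hedged sketch of such a URL-based search against the standard v1.1 search endpoint; the credential handling is an assumption, and pagination and date filtering are omitted for brevity.

```python
# A minimal sketch of searching for tweets that contain a news article's URL
# via the Twitter v1.1 Search API; BEARER_TOKEN is a placeholder credential.
import requests

SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"
BEARER_TOKEN = "..."  # hypothetical: obtained from a Twitter developer account

def search_tweets_for_url(news_url: str) -> list:
    """Return tweet objects (ID, text, creation time, user info, counts)
    whose text contains the given news URL."""
    resp = requests.get(
        SEARCH_ENDPOINT,
        params={"q": news_url, "count": 100, "result_type": "recent"},
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()["statuses"]
```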

3. Data Statistics and Distributions

                          Reliable   Unreliable     Total
News articles                1,364          665     2,029
 w/ images                   1,354          663     2,017
 w/ social information       1,219          528     1,747
Tweets                     114,402       26,418   140,820
Users                       78,659       17,323    93,761
Table 1. Data Statistics
Figure 4. Distribution of News Publishers

The general statistics of our dataset are presented in Table 1. The dataset contains 2,029 news articles, most of which have both textual and visual information for multimodal studies (2,017) (Zhou et al., 2020; Wang et al., 2018) and have been shared on social media (1,747). The proportion of reliable to unreliable news articles is around 2:1; hence, due to this class imbalance, AUC or F1 scores are better evaluation metrics than accuracy when using the collected data to predict news credibility. Note that the number of users spreading reliable news (78,659) plus the number of users spreading unreliable news (17,323) is greater than the total number of users in the dataset (93,761), which indicates that some users engage in spreading both reliable and unreliable news articles.

Next, we visualize the distributions of data features/attributes.

Distribution of News Publishers

Figure 4 shows the number of COVID-19 news articles published by each [extremely reliable or extremely unreliable] news site. Six unreliable publishers have no news on COVID-19 and hence are not presented in the figure. We keep these publishers in our repository because the data will be updated over time and they may publish news articles on COVID-19 in the future.

Figure 5. Publication Dates
Figure 6. Author Count
(a) Network
(b) Degree Distribution
Figure 7. Author Collaborations
News Publication Dates

The distribution of news publication dates is presented in Figure 5, where all articles are published in 2020. We point out that from January to May, the number of COVID-19 news articles published increased significantly (near-exponentially). A possible explanation for this phenomenon is threefold. First, from the time the outbreak was first identified in Wuhan, China (December 2019) (Huang et al., 2020) to May 2020, the numbers of confirmed cases and deaths caused by SARS-CoV-2 grew exponentially worldwide. Meanwhile, the virus became a global topic and triggered ever more discussion on a worldwide scale. Second, some older news articles are no longer available, which has motivated us to update the dataset regularly. Third, the keywords we use to identify COVID-19 news articles are the official ones provided by the WHO in February 2020 (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/naming-the-coronavirus-disease-(covid-2019)-and-the-virus-that-causes-it). Some news articles published in January are also collected, as before the WHO announcement COVID-19 was known as the “2019 novel coronavirus,” which contains one of our keywords: “coronavirus.” We have detailed the reasons behind our keyword selection in Section 2.2.1.

News Authors and Author Collaborations

Figure 6 presents the distribution of the number of authors contributing to each news article, which follows a long-tail distribution: most articles have fewer than five authors. Instead of providing the [real or fictional] names of authors, some articles list the publisher name as the author. Since publisher information is already available in the repository, we leave the author information of these articles blank, i.e., their number of authors is zero. Furthermore, we construct the coauthorship network, shown in Figure 7. Node degrees also follow a power-law-like distribution: among the 1,095 nodes (authors), over 90% have at most two collaborators.

News Content Statistics

Figures 8 and 9 reveal textual characteristics of news content (news title plus bodytext). Figure 8 shows that the number of words within news content follows a long-tail (power-law-like) distribution, with an average of 800 words and a median of 600. Figure 9 provides the word cloud for the entire repository. As the collected news articles share the same COVID-19 topic, certain related terms are naturally used frequently by news authors, such as “coronavirus” (6,465), “COVID” (5,413), “state” (4,432), “test” (4,274), “health” (3,714), “pandemic” (3,427), “virus” (2,903), “home” (2,871), “case” (2,676), and “Trump” (2,431), illustrated with word font size scaled to frequency.

Figure 8. Word Count
Figure 9. Word Cloud
(a) Publishers (b) News
Figure 10. Countries
(a) Publishers (b) News
Figure 11. Political Bias
Country Distribution

Figure 10 reveals the countries that the news articles and news publishers belong to. In total, six countries (USA, Russia, UK, Iran, Cyprus, and Canada) are covered, with US news and news publishers constituting the majority.

Distribution of News Political Bias

Figure 11 shows the distribution of political bias of news articles and news media (publishers). For both news and publishers, the distribution across right-leaning biases (extreme right, right, and right-center) is more balanced than that across left-leaning biases (extreme left, left, and left-center).

News Spreading Frequencies

Figure 12 shows the distribution of the number of tweets spreading each news article. The distribution exhibits a long tail: over 80% of news articles are spread fewer than 100 times, while a few have been shared by thousands of tweets.

Figure 12. Spreading Frequency
Figure 13. News Spreaders
Figure 14. Follower Distribution
Figure 15. Friend Distribution
News Spreaders

The distribution of the number of spreaders for each news article is shown in Figure 13. It differs from the distribution in Figure 12, as one user can spread a news article multiple times. As for the social connections of news spreaders, the distributions of their followers and friends are presented in Figures 14 and 15, respectively, where the most popular spreader has over 40 million followers (or 600,000 friends).

4. Forming Baselines: Using ReCOVery to Predict COVID-19 News Credibility

In this section, several methods that often act as baselines are utilized and developed to predict COVID-19 news credibility using ReCOVery data, with the aim of facilitating future studies. These methods (baselines) are specified in Section 4.1. Implementation details of the experiments are provided in Section 4.2. Finally, we present the performance results of these methods in Section 4.3.

4.1. Methods

Broadly speaking, all developed methods fall under a traditional supervised machine learning framework, where features are manually engineered to represent news articles (see Section 4.1.1) and then classified by a trained classifier such as a random forest (see Section 4.1.2).

4.1.1. Features

We design and extract the following three feature groups in our experiments:

LIWC Features

LIWC is a widely accepted psycholinguistic lexicon. Given a news story, LIWC counts the words in the text that fall into one or more of 93 linguistic (e.g., self-references), psychological (e.g., anger), and topical (e.g., leisure) categories (Pennebaker et al., 2015), based on which 93 features are extracted.
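LIWC itself is proprietary, but the feature-extraction idea can be sketched generically: compute, per category, the fraction of tokens that belong to that category's word list. The toy lexicon below is a stand-in for the 93-category LIWC2015 dictionary.

```python
# A generic sketch of lexicon-based feature extraction in the spirit of LIWC;
# the categories and word lists are toy stand-ins, not the LIWC2015 dictionary.
import re

TOY_LEXICON = {
    "self_reference": {"i", "me", "my", "mine", "we", "our"},
    "anger": {"hate", "angry", "furious", "outrage"},
    "leisure": {"movie", "game", "party", "vacation"},
}

def lexicon_features(text: str) -> dict:
    """Return, per category, the fraction of tokens matching that category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)  # guard against empty text
    return {category: sum(token in words for token in tokens) / n
            for category, words in TOY_LEXICON.items()}
```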

Content Features

Here, we consider a total of eight features for each news article: (1) the timestamp at which the news was published; (2) the number of news authors; (3-4) the mean and median numbers of collaborators of the news authors; (5-7) the numbers of words in the news title, bodytext, and the entire content; and (8) the number of news images. Compared to LIWC features, which focus solely on news textual information (title and bodytext), this feature group covers most of the news content components included in the repository.

Social Attributes

Six features are extracted from the available social attributes of each news article in the repository: (1) the frequency with which the news is spread, i.e., the number of corresponding tweets; (2) the number of news spreaders; and (3-6) the mean (and median) numbers of followers (and friends) of the news spreaders.
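These six attributes are straightforward to compute from the collected tweets. The sketch below assumes tweet objects in the Twitter API's response format; the helper itself is illustrative, not the paper's code.

```python
# An illustrative computation of the six social-attribute features above;
# field names follow Twitter API responses, the helper is an assumption.
import statistics

def social_features(tweets: list) -> dict:
    # Deduplicate spreaders: one user may share the same article many times.
    users = {t["user"]["id"]: t["user"] for t in tweets}
    followers = [u["followers_count"] for u in users.values()] or [0]
    friends = [u["friends_count"] for u in users.values()] or [0]
    return {
        "tweet_count": len(tweets),                        # (1)
        "spreader_count": len(users),                      # (2)
        "followers_mean": statistics.mean(followers),      # (3)
        "followers_median": statistics.median(followers),  # (4)
        "friends_mean": statistics.mean(friends),          # (5)
        "friends_median": statistics.median(friends),      # (6)
    }
```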

4.1.2. Classifiers

In current fake news research, a random classifier is often used as one of the baselines (Zhou and Zafarani, 2020); it randomly labels a news article as reliable or unreliable with equal probability. We further use multiple common supervised learners (classifiers) in our experiments: Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbors (k-NN), Random Forest (RF), Decision Tree (DT), Support Vector Machines (SVM), and XGBoost (XGB) (Chen and Guestrin, 2016).

4.2. Implementation Details

The overall dataset is randomly divided into training and testing sets in an 80%:20% proportion. As the dataset has an imbalanced distribution of reliable versus unreliable news articles (2:1), we evaluate prediction results in terms of Precision, Recall, and the F1 score. Each reported performance is obtained by averaging five experimental results repeated with different random seeds for the dataset division. All classifiers are trained with default hyperparameters.
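A minimal sketch of this protocol, using scikit-learn with a random forest as an example classifier, is given below; the feature matrix X and binary labels y are assumed to have been extracted as described in Section 4.1.1.

```python
# A hedged sketch of the evaluation protocol: 80/20 split, default
# hyperparameters, metrics averaged over five random seeds.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def evaluate_baseline(X, y, n_repeats=5):
    """Return mean (precision, recall, F1) over `n_repeats` random splits;
    y is assumed binary, with 1 as the positive class (an assumption)."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        clf = RandomForestClassifier(random_state=seed)  # default hyperparameters
        clf.fit(X_tr, y_tr)
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, clf.predict(X_te), average="binary")
        scores.append((p, r, f1))
    return np.mean(scores, axis=0)
```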

Features                 Classifier   Precision   Recall   F1 Score
—                        Random       0.338       0.506    0.434
LIWC Features            LR           0.652       0.476    0.550
                         NB           0.542       0.584    0.560
                         KNN          0.486       0.266    0.346
                         RF           0.858       0.482    0.618
                         DT           0.556       0.546    0.548
                         SVM          0.780       0.352    0.484
                         XGB          0.816       0.628    0.708
Content Features         LR           0.602       0.522    0.558
                         NB           0.600       0.526    0.560
                         KNN          0.574       0.468    0.512
                         RF           0.822       0.730    0.772
                         DT           0.704       0.736    0.716
                         SVM          0.616       0.520    0.562
                         XGB          0.860       0.686    0.760
Content + Social         LR           0.668       0.566    0.612
Features                 NB           0.596       0.618    0.606
                         KNN          0.694       0.618    0.654
                         RF           0.850       0.774    0.806
                         DT           0.754       0.790    0.768
                         SVM          0.680       0.638    0.658
                         XGB          0.854       0.764    0.806
Table 2. Baseline Performance in Predicting COVID-19 News Credibility Using ReCOVery Data

4.3. Experimental Results

Prediction results are presented in Table 2. When predicting news credibility using news content alone, content (attribute) features are more representative than LIWC features: content features perform best with an F1 score of 0.772 using a random forest classifier, while LIWC features perform best with an F1 score of 0.708 using XGBoost. Furthermore, using both news content and social information further improves performance, achieving an F1 score of 0.806.

5. Related Work

Related datasets can be generally grouped as (I) COVID-19 datasets and (II) “fake” news and rumor datasets.

COVID-19 Datasets

As a global emergency (Sohrabi et al., 2020), the outbreak of COVID-19 has been labeled a black swan event and likened to the economic scene of World War II (Nicola et al., 2020). Against this background, a group of datasets has emerged, whose contributions range from real-time tracking of COVID-19 to aid epidemiological forecasting (e.g., (Dong et al., 2020) and (Xu et al., 2020)) and collecting scholarly COVID-19 articles for literature-based discoveries (e.g., CORD-19, https://www.semanticscholar.org/cord19), to tracking the spread of COVID-19 information on Twitter (e.g., (Chen et al., 2020)).

Specifically, researchers at Johns Hopkins University (JHU) developed a Web-based dashboard (https://coronavirus.jhu.edu/map.html) to visualize and track reported cases of COVID-19 in real time. The dashboard, released on January 22, 2020, presents the location and number of confirmed COVID-19 cases, deaths, and recoveries for all affected countries (Dong et al., 2020). Another dataset, shared publicly in March 2020, was constructed to aid the analysis and tracking of the COVID-19 epidemic; it provides real-time individual-level data (e.g., symptoms; dates of onset, admission, and confirmation; and travel history) from national, provincial, and municipal health reports (Xu et al., 2020). To mobilize researchers to apply recent advances in Natural Language Processing (NLP) and generate new insights in support of the fight against COVID-19, the Allen Institute for AI has contributed CORD-19, a free and dynamic database of more than 128,000 scholarly articles about COVID-19, to the global research community. Finally, Chen et al. (Chen et al., 2020) released the first large-scale COVID-19 Twitter dataset. The dataset, updated regularly, collects COVID-19 tweets posted from January 2020 onward, across multiple languages.

“Fake” News and Rumor Datasets

Existing “fake” news and rumor datasets are collected with various focuses. These datasets may (i) contain only news content, which can be full articles (e.g., NELA-GT-2018 (Nørregaard et al., 2019)) or short claims (e.g., FEVER (Thorne et al., 2018)); (ii) contain only social media information (e.g., CREDBANK (Mitra and Gilbert, 2015)), where news refers to user posts; or (iii) contain both content and social media information (e.g., LIAR (Wang, 2017) and FakeNewsNet (Shu et al., 2018)).

Specifically, NELA-GT-2018 (Nørregaard et al., 2019) is a large-scale dataset of around 713,000 news articles published from February to November 2018. News articles are collected from 194 news media, with multiple labels obtained directly from NewsGuard, Pew Research Center, Wikipedia, OpenSources, MBFC, AllSides, BuzzFeed News, and PolitiFact. These labels refer to news credibility, transparency, political polarization, and authenticity. The FEVER dataset (Thorne et al., 2018) consists of 185,000 claims and is constructed in two steps: claim generation and annotation. First, the authors extract sentences from Wikipedia, and annotators manually generate a set of claims based on the extracted sentences. Then, the annotators label each claim as “supported”, “refuted”, or “not enough information” by comparing it with the original sentence from which it was developed. Some datasets instead focus on user posts on social media; for example, CREDBANK (Mitra and Gilbert, 2015) comprises more than 60 million tweets grouped into 1,049 real-world events, each annotated by 30 human annotators. Others contain both news content and social media information. For instance, collecting both claims and fact-checking results (labels, i.e., “true”, “mostly true”, “half-true”, “mostly false”, and “pants on fire”) directly from PolitiFact, Wang established the LIAR dataset (Wang, 2017), containing around 12,800 verified statements made in public speeches and social media. The aforementioned datasets contain mainly textual information valuable for NLP research, with limited information on how “fake” news and rumors spread on social networks, which motivated the construction of the FakeNewsNet dataset (Shu et al., 2018). FakeNewsNet collects verified (real or fake) full news articles from PolitiFact (#=1,056) and GossipCop (#=22,140), respectively, and also tracks news spreading on Twitter.

6. Conclusion

To fight the coronavirus infodemic, we construct a multimodal repository for COVID-19 news credibility research, which provides textual, visual, temporal, and network information on news content and on how news spreads on social media. The repository balances data scalability and label accuracy. To facilitate future studies, benchmarks are developed and their performances in predicting news credibility using the repository data are presented. We find that, using the news content and/or social attributes available in the repository, we can achieve an F1 score of 0.77 when news has not yet spread on social media (i.e., only news content is available) and an F1 score of 0.81 when it has been shared by social media users.

We point out that the data could be further enhanced (1) by including COVID-19 news articles in various languages, such as Chinese, Russian, Spanish, and Italian, as well as information on how these articles spread on the popular local social media for those languages, e.g., Sina Weibo in China. Countries speaking (but not limited to) these languages have all suffered heavy losses in this pandemic and have shown different spreading characteristics in the physical world (https://coronavirus.jhu.edu/data/new-cases), which would be invaluable when investigating the relationship between the spread of the virus in the physical world and the spread of its related misinformation on social networks. Furthermore, (2) extending the dataset with ground truth for, e.g., hate speech, clickbait, and social bots (Ferrara et al., 2016) would help study the bias and discrimination bred by the virus, as well as the correlations among all low-credibility information and accounts. Both (1) and (2) will be our future work.

References

  • Baly et al. (2018) Ramy Baly, Georgi Karadzhov, Dimitar Alexandrov, James Glass, and Preslav Nakov. 2018. Predicting factuality of reporting and bias of news media sources. arXiv preprint arXiv:1810.01765 (2018).
  • Burkert and Loeb (2020) Andi Burkert and Avi Loeb. 2020. Flattening the COVID-19 curves. Scientific American. Retrieved from https://blogs.scientificamerican.com/observations/flattening-the-covid-19-curves (2020).
  • Chen et al. (2020) Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and Surveillance 6, 2 (2020), e19273.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
  • Dong et al. (2020) Ensheng Dong, Hongru Du, and Lauren Gardner. 2020. An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases 20, 5 (2020), 533–534.
  • Ferrara et al. (2016) Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The rise of social bots. Commun. ACM 59, 7 (2016), 96–104.
  • FitzGerald et al. (2020) A FitzGerald, K Kwiatkowski, V Singer, and S Smit. 2020. An instant economic crisis: How deep and how long? www.mckinsey.com (2020).
  • Huang et al. (2020) Chaolin Huang, Yeming Wang, Xingwang Li, Lili Ren, Jianping Zhao, Yi Hu, Li Zhang, Guohui Fan, Jiuyang Xu, Xiaoying Gu, et al. 2020. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet 395, 10223 (2020), 497–506.
  • Mitra and Gilbert (2015) Tanushree Mitra and Eric Gilbert. 2015. CREDBANK: A Large-scale Social Media Corpus with Associated Credibility Annotations. In Ninth International AAAI Conference on Web and Social Media.
  • Nicola et al. (2020) Maria Nicola, Zaid Alsafi, Catrin Sohrabi, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, Maliha Agha, and Riaz Agha. 2020. The socio-economic implications of the coronavirus and COVID-19 pandemic: A review. International Journal of Surgery (2020).
  • Nørregaard et al. (2019) Jeppe Nørregaard, Benjamin D Horne, and Sibel Adalı. 2019. NELA-GT-2018: A Large Multi-Labelled News Dataset for the Study of Misinformation in News Articles. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. 630–638.
  • Pennebaker et al. (2015) James W Pennebaker, Ryan L Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical Report.
  • Shu et al. (2018) Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2018. FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media. arXiv preprint arXiv:1809.01286 (2018).
  • Sitaula et al. (2020) Niraj Sitaula, Chilukuri K Mohan, Jennifer Grygiel, Xinyi Zhou, and Reza Zafarani. 2020. Credibility-based Fake News Detection. In Disinformation, Misinformation and Fake News in Social Media: Emerging Research Challenges and Opportunities. Springer.
  • Sohrabi et al. (2020) Catrin Sohrabi, Zaid Alsafi, Niamh O’Neill, Mehdi Khan, Ahmed Kerwan, Ahmed Al-Jabir, Christos Iosifidis, and Riaz Agha. 2020. World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19). International Journal of Surgery (2020).
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018).
  • Wang (2017) William Yang Wang. 2017. “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. arXiv preprint arXiv:1705.00648 (2017).
  • Wang et al. (2018) Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su, and Jing Gao. 2018. EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 849–857.
  • Xu et al. (2020) Bo Xu, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Lauren Goodwin, Alyssa Loskill, Emily L Cohn, Yulin Hswen, Sarah C Hill, Maria M Cobo, et al. 2020. Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific data 7, 1 (2020), 1–6.
  • Zhang et al. (2018) Jiawei Zhang, Limeng Cui, Yanjie Fu, and Fisher B Gouza. 2018. Fake News Detection with Deep Diffusive Network Model. arXiv preprint arXiv:1805.08751 (2018).
  • Zhou et al. (2020) Xinyi Zhou, Jindi Wu, and Reza Zafarani. 2020. SAFE: Similarity-Aware Multi-Modal Fake News Detection. In The 24th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer.
  • Zhou and Zafarani (2020) Xinyi Zhou and Reza Zafarani. 2020. A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities. ACM Computing Surveys (CSUR) (2020).