Mega-COV: A Billion-Scale Dataset of 65 Languages For COVID-19

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 234 countries), longitudinal (goes back as far as 2007), multilingual (comes in 65 languages), and has a significant number of location-tagged tweets (32M tweets). We release tweet IDs from the dataset, hoping it will be useful for studying various phenomena related to the ongoing pandemic and accelerating viable solutions to associated problems.




1 Introduction

The seeds of the coronavirus disease 2019 (COVID-19) pandemic are reported to have started as a local outbreak in Wuhan (Hubei, China) in December 2019, but the disease soon spread around the world WHO (2020). As of April 25, 2020, the number of confirmed cases around the world is estimated at 2,877,487 (source: the dashboard of the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU), which provides real-time updates on the location and number of confirmed COVID-19 cases). The number of confirmed cases witnessed exponential growth in some countries. Overall, the growth has also been fast. For example, on April 7, when we first documented that number in the current manuscript, the number of cases was at 1,390,511.

In response to this ongoing public health emergency, researchers are mobilizing to track the pandemic and study its impact not only on human life, but possibly on all sorts of life on our planet. The different ways the pandemic leaves its footprint on our lives is a question that will probably be studied for years to come. Enabling such scholarship by providing relevant data is an important endeavor. Toward this end, we focus our efforts on collecting Mega-COV, a billion-scale multilingual Twitter dataset with geo-location information.

As several countries and regions around the world went into lockdown, the public health emergency has restricted physical aspects of human communication considerably. As hundreds of millions of people spend more time at home, communication over social media becomes more important than it has ever been. In particular, the content of social media communication promises to capture useful aspects of the lives of the millions of people involved. Mega-COV is intended as a repository of such content. In this version of our work, the largest part of the dataset is focused on North America. Our next release will, however, bring a significant update, with the size of the dataset doubling based on additional content from outside North America. While other early efforts to collect Twitter data are ongoing, our goal is to complement these existing resources in significant ways. More specifically, we designed our methods to harvest a dataset that is unique in the following ways:

  • Longitudinal Coverage: We collect multiple data points (up to 3,200) from the same users, with a goal to allow for comparisons between the present and the past across the same users, communities, and geographical regions (Section 4).

  • Topic Diversity: We do not restrict our collection to tweets carrying certain hashtags. This makes the data general enough to comprise content and topics directly related to COVID-19, regardless of existence of accompanying hashtags, as well as themes that may not be directly linked to the pandemic but where the pandemic may have some bearings which should be taken into account when investigating such themes. Section 6 and Section 7 provide a general overview of issues discussed in the dataset.

  • Language Diversity: Since our method of collection targets users, rather than hashtag-based content, Mega-COV is linguistically diverse. In theory, the dataset should comprise any language posted to Twitter by a user whose data we collect. Based on Twitter-assigned language codes, we identify a total of 65 languages (Section 5).

  • No Distribution Shift: Related to two previous points, but from a machine learning perspective, collecting the data without conditioning on existence of specific (or any) hashtags avoids introducing distribution bias. In other words, the data can be used to study various phenomena in-the-wild. This warrants more generalizable findings and models.

Even though our current release of Mega-COV has more focus on North America, the users in the data have posted widely from outside this specific region. In fact, based on our location assignment criteria, we identify a substantial set of users as belonging to other regions (see Section 3). As stated earlier, Mega-COV is continuously growing, and our next release will bring half a billion tweets, primarily from users outside North America. The next version of the current paper will describe our next release. We will refer to our current release as Mega-COV V0.1, which we now describe.

2 Data

To collect a sufficiently large dataset, we put crawlers using the Twitter streaming API on all world continents (i.e., Asia, Africa, North America, South America, Antarctica, Europe, and Australia) starting in early January 2020. Our goal was to initially acquire a diverse set of tweets from which we can extract user IDs. We then iteratively crawl the user timelines (up to 3,200 tweets) using all collected IDs. This gives us data from April backwards, depending on how prolific a poster a user is (see Table 3 for a breakdown). For our current release (Mega-COV V0.1), we describe a total of 482.2K users who contribute 566M tweets (Table 1). Our next release will add data from more users whose data we have already collected but have not yet analyzed and hence do not include here.

            #tweets   #retweets   #replies   Total
All-Time    218M      195M        154M       566M
2020        56.6M     78M         61M        196M
#users      474.4K    482.2K      480K       482.2K
Table 1: Distribution of tweets, retweets, and replies in Mega-COV V0.1 (numbers rounded).

Once the timeline tweets are collected, we put them through a pipeline that involves merging all user files into a single file. For most of our analyses in this paper, we remove re-tweets and replies (see the individual sections for specific details as to what parts of the dataset are analyzed). Table 1 offers a breakdown of the distribution of tweets, re-tweets, and replies in Mega-COV V0.1. Tweet IDs of the dataset are available at our GitHub repository and can be downloaded for research. The dataset repository will be updated semi-regularly.
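The filtering step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used for Mega-COV: it assumes tweets are stored as Twitter API v1.1 JSON objects, in which a re-tweet carries a `retweeted_status` field and a reply carries a non-null `in_reply_to_status_id`. The function names `tweet_kind` and `split_timeline` are hypothetical.

```python
from collections import Counter


def tweet_kind(tweet: dict) -> str:
    """Label a tweet object as 'retweet', 'reply', or 'tweet'.

    Assumes Twitter API v1.1 field names (an assumption, not the
    authors' documented procedure).
    """
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "tweet"


def split_timeline(tweets):
    """Return (original tweets only, counts per kind)."""
    counts = Counter(tweet_kind(t) for t in tweets)
    originals = [t for t in tweets if tweet_kind(t) == "tweet"]
    return originals, counts


# Toy timeline with one item of each kind.
sample = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "RT ...", "retweeted_status": {"id": 9}},
    {"id": 3, "text": "@a hi", "in_reply_to_status_id": 1},
]
originals, counts = split_timeline(sample)
```

Keeping the per-kind counts alongside the filtered list makes it straightforward to produce a breakdown in the style of Table 1 from the same pass.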

3 Geographic Diversity

Figure 1: World map coverage of Mega-COV V0.1. (a) Left: Geo-located cities. Each dot is a city. Contiguous cities of the same color belong to the same country. (b) Right: Geo-tagged point co-ordinates. Each dot is a point co-ordinate (longitude and latitude) from which at least one tweet was posted.

Tweet location can be associated with a specific ‘point’ location or a Twitter place with a ‘bounding box’ that describes a larger area such as a city, town, or country. We refer to tweets in this category as geo-located tweets. Additionally, a smaller fraction of tweets are geo-tagged with longitude and latitude. As Table 2 shows, Mega-COV V0.1 has 31,996,454 geo-located tweets from 207,902 users and 7,102,347 geo-tagged tweets from 76,647 users. Table 2 also shows the distribution of tweets and users over Canada, the U.S., and other locations. For the year 2020, Mega-COV V0.1 has 10,527,121 geo-located tweets from 150,194 users and 980,821 geo-tagged tweets from 28,973 users. (The dataset has some locations whose country could not be resolved.)

             Geolocated   Tweeted From                           Geotagged   Tweeted From
             (total)      Canada       U.S.         Other        (total)     Canada      U.S.        Other
All-Time     31,996,454   10,360,126   19,930,856   1,705,472    7,102,347   3,067,864   3,338,464   696,019
All-Users    207,902      83,246       147,547      49,078       76,647      39,361      46,589      20,400
2020         10,527,121   1,690,561    8,541,224    295,336      980,821     230,776     667,207     82,838
2020-Users   150,194      51,496       101,806      49,078       28,973      12,724      17,389      4,407
Table 2: Mega-COV V0.1 geolocated and geotagged users and their corresponding tweets over North America vs. other locations.
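A tally in the style of Table 2 could be produced along the following lines. This is a sketch under stated assumptions, not the authors' code: it assumes Twitter API v1.1 tweet objects, where `place` holds a Twitter place (with a `country_code`) and `coordinates` holds a longitude/latitude point. The helper names and the "Other" fallback for tweets without a resolvable country are hypothetical choices.

```python
from collections import defaultdict


def region_bucket(tweet: dict) -> str:
    """Map a tweet to 'Canada', 'U.S.', or 'Other' via place.country_code."""
    place = tweet.get("place") or {}
    return {"CA": "Canada", "US": "U.S."}.get(place.get("country_code"), "Other")


def tally_locations(tweets):
    """Count tweets and distinct users per (kind, region) pair.

    A tweet with a place counts as geo-located; a tweet with point
    coordinates also counts as geo-tagged (assumed definitions).
    """
    tweet_counts = defaultdict(int)
    users = defaultdict(set)
    for t in tweets:
        kinds = []
        if t.get("place"):
            kinds.append("geo-located")
        if t.get("coordinates"):
            kinds.append("geo-tagged")
        for kind in kinds:
            key = (kind, region_bucket(t))
            tweet_counts[key] += 1
            users[key].add(t["user"]["id"])
    user_counts = {k: len(v) for k, v in users.items()}
    return dict(tweet_counts), user_counts


sample = [
    {"user": {"id": 1}, "place": {"country_code": "CA"}},
    {"user": {"id": 1}, "place": {"country_code": "CA"},
     "coordinates": {"type": "Point", "coordinates": [-123.1, 49.3]}},
    {"user": {"id": 2}, "place": {"country_code": "US"}},
    {"user": {"id": 3}},  # no location information at all
]
tweet_counts, user_counts = tally_locations(sample)
```

Because a user can post from several regions, per-region user counts computed this way need not sum to the overall user count, which is consistent with the user rows in Table 2.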

Figure 1 shows where the tweets were posted (cities on the left and actual point co-ordinates on the right). Mega-COV V0.1 has data posted from cities in 234 countries. Figure 2 shows the distribution (in terms of numbers) of cities from which the tweets were posted (i.e., geo-located tweets) over the top countries in (a) the whole dataset as well as (b) data posted during 2020. As Figure 2 shows, Mega-COV V0.1 comprises data posted from several European countries (e.g., France, the U.K., Germany, and Italy), Latin America (e.g., Brazil, Mexico), and Asia (e.g., Indonesia, India).

Figure 2: Geographical diversity in Mega-COV V0.1 based on geo-located data. We show the distribution of the number of cities over countries for all-time data vs. data posted during 2020. Only top countries are shown.

4 Temporal Coverage

Figure 3: Distribution of Mega-COV V0.1 data points (including a breakdown of tweets, replies, and retweets) over time.

Mega-COV V0.1 aims at making it possible to compare user social content over time. Since we crawl user timelines, the dataset comprises content going back as early as 2007. Figure 3 shows the distribution of data over the period 2007-2020. Simple frequency of user posting shows a surge in Twitter use in the period of Jan-April 2020 compared to the same period in 2019 (as shown in Figure 4). (We started crawling the April data on different dates in April for different users. Hence, April is an approximation across these users. Still, for about 150K users, we have collected data before March but will update the dataset with more recent tweets from these users.) Indeed, we identify 40.53% more posting during the first 3 months of 2020 compared to the same period in 2019, with the same trend seeming to continue in April. This is expected, both due to physical distancing and a wide range of human activity moving online.

Figure 3 also shows a breakdown of tweets, re-tweets, and replies. A striking discovery is that, for 2020, users are engaged in conversations with one another (short as these typically are on Twitter) more than tweeting directly to the platform. Based on our dataset, this is the first year in which this happens. In addition, for 2020, we also see users re-tweeting more than tweeting. This is also happening for the first time.

Figure 4: Frequency of tweeting during Jan-April 2020 vs. Jan-April 2019.

Figure 5: Language diversity in Mega-COV V0.1. (a) User-level language distribution. (b) Tweet-level language distribution. (c) Tweet-level language distribution for non-English.

5 Linguistic Diversity

We perform the language analysis based only on tweets (n≈218M; see Table 1), excluding re-tweets and replies. (This is an arbitrary decision; re-tweets and replies could otherwise be counted as relevant for some tasks.) Based on Twitter-assigned language IDs, Mega-COV V0.1 comprises 65 languages. However, we suspect the dataset has other languages represented as well, which cannot be tagged using Twitter’s current language identification technology. For languages it cannot detect, Twitter assigns an “und” (for “undefined”) tag, and Mega-COV V0.1 has tweets tagged as “und”. (We also plan to run a language ID tool on the data and provide a comparison to Twitter-provided language tags.) As Figure 5 shows, English, French, and Spanish are (unsurprisingly) the top 3 languages in terms of the number of users who have posted in these languages in the dataset. These 3 languages are also the most frequent in terms of the number of actual tweets in the data, as shown in Table 3. Overall, non-English comprises 16.43% of the tweets.

Languages              #tweets       #users
English (en)           182,175,080   467,936
French (fr)            5,536,031     175,108
Spanish (es)           3,633,006     198,322
Portuguese (pt)        2,014,304     103,490
Tagalog (tl)           2,008,509     164,099
Japanese (ja)          1,349,226     12,151
Indonesian (in)        967,985       174,016
Arabic (ar)            928,318       5,991
Haitian Creole (ht)    453,234       137,097
Turkish (tr)           360,435       51,022
German (de)            320,468       102,190
Italian (it)           281,391       96,525
Estonian (et)          277,024       118,875
Polish (pl)            243,767       62,663
Dutch (nl)             212,414       92,086
Russian (ru)           184,676       2,072
Korean (ko)            171,994       4,072
Chinese (zh)           170,265       4,003
Farsi (fa)             155,884       1,413
Catalan (ca)           143,229       64,255
Table 3: Top 20 languages in Mega-COV V0.1 based on tweets, excluding re-tweets and replies.
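The per-language counts in Table 3 and the non-English share can be computed in one pass over the tweets. The sketch below is illustrative only: it assumes Twitter API v1.1 tweet objects with a Twitter-assigned `lang` code (with "und" for undetected languages), and the function name `language_stats` is hypothetical.

```python
from collections import Counter, defaultdict


def language_stats(tweets):
    """Return (tweets per language, users per language, % non-English)."""
    tweet_counts = Counter()
    users = defaultdict(set)
    for t in tweets:
        lang = t.get("lang", "und")  # treat a missing tag as undefined
        tweet_counts[lang] += 1
        users[lang].add(t["user"]["id"])
    total = sum(tweet_counts.values())
    non_english_pct = 100.0 * (total - tweet_counts["en"]) / total if total else 0.0
    user_counts = {lang: len(ids) for lang, ids in users.items()}
    return tweet_counts, user_counts, non_english_pct


sample = [
    {"user": {"id": 1}, "lang": "en"},
    {"user": {"id": 1}, "lang": "fr"},
    {"user": {"id": 2}, "lang": "en"},
    {"user": {"id": 2}, "lang": "es"},
]
tweet_counts, user_counts, non_english_pct = language_stats(sample)
```

Note that a single user can contribute to several languages, so the #users column is not expected to sum to the number of users in the dataset.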

6 Hashtag Content Analysis

Figure 6: Word clouds for hashtags in tweets from the top 10 languages in the data: (a) English (en), (b) Turkish (tr), (c) French (fr), (d) Spanish (es), (e) Tagalog (tl), (f) Portuguese (pt), (g) Japanese (ja), (h) Haitian Creole (ht), (i) Indonesian (in), (j) Arabic (ar). We note that non-English tweets can still carry English hashtags or employ Latin script.

2019 hashtag      Count    2020 hashtag        Count
BellLetsTalk      26928    COVID19             198512
GoHabsGo          20776    coronavirus         154175
NewProfilePic     19134    NowPlaying          134088
cdnpoli           13786    shopmycloset        121629
shopmycloset      13040    NintendoSwitch      73590
art               11413    NewProfilePic       71498
love              11271    AnimalCrossing      60818
PS4live           10864    ACNH                60232
yeg               10285    GoHabsGo            52285
realestate        9855     cdnpoli             42309
photography       9124     COVID-19            39982
Toronto           8506     PS4share            38130
travel            8480     Coronavirus         36083
music             8455     BellLetsTalk        35394
Repost            8117     covid19             33725
fashion           7967     fashion             32201
PS4share          7692     Covid_19            31668
Canada            7605     PS4live             29811
toronto           7370     music               28361
Vancouver         6235     nowplaying          28045
onpoli            6131     style               27148
canada            6097     love                26468
winter            5767     yeg                 25662
twitch            5719     poshmark            24781
Oscars            5643     art                 24050
style             5561     SocialDistancing    23742
yyj               5311     Canada              22091
NintendoSwitch    5164     SoundCloud          21896
vancouver         5163     Toronto             21293
yyc               5131     grambling_rys20     21026
Table 4: Top 30 hashtags in Mega-COV V0.1 for 2019 vs. 2020.

Hashtags usually correlate with the topics users post about. We provide the top 30 hashtags in the data in Table 4. As the table shows, users tweet heavily about the pandemic using hashtags such as COVID19, coronavirus, Coronavirus, COVID-19, Covid_19, covid19, and StayAtHome. Simple word clouds of hashtags from the various languages (Figure 6 shows clouds from the top 10 languages) also show coronavirus topics trending. Also observed in the various word clouds are gaming-related hashtags such as NowPlaying, PS4share, and NintendoSwitch, thus showing how users may be spending a share of their time while staying home. We also note frequent occurrence of political hashtags in languages such as Arabic, Farsi, Urdu, and Indian languages. This is in contrast to discussions in European languages, where politics is not as visible. For example, in Urdu, discussions involving the army and border issues show up. In Indian languages such as Tamil and Hindi, posts focused on movies such as Valimai, TV shows such as Big Boss, doctors, and even fake news are observed along with the pandemic-related hashtags.

An interesting observation from the Chinese language word cloud is the use of hashtags such as ChinaPneumonia and WuhanPneumonia to refer to the pandemic. We did not observe these same hashtags in any of the other languages. Additionally, the hashtags appledaily and appledailytw trend in Chinese-language tweets during the first 4 months of 2020; these likely refer to the Apple Daily newspaper rather than to Apple.

In some languages, such as Romanian and Vietnamese, bitcoin and crypto-currency are a hot topic of discussion. This is also seen in the Chinese language word cloud, but not as prominently. Another surprising observation comes from the Finnish language data, where users post about the coronavirus and gaming but also about kirtan and gurbani, which are religious terms in Sikhism.
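Hashtag rankings such as those in Table 4 can be obtained with a simple frequency count. The following sketch is an illustration under assumptions rather than the procedure used here: it assumes Twitter API v1.1 tweet objects, where hashtags appear under `entities.hashtags` with a `text` field, and the function name `top_hashtags` is hypothetical. Since Table 4 lists spelling variants separately (e.g., COVID19, covid19, and Covid_19), the sketch also offers an optional normalization that merges such variants.

```python
from collections import Counter


def top_hashtags(tweets, k=30, normalize=False):
    """Return the k most frequent hashtags as (hashtag, count) pairs.

    With normalize=True, case and the characters '-' and '_' are
    ignored so that spelling variants are counted together.
    """
    counts = Counter()
    for t in tweets:
        for h in t.get("entities", {}).get("hashtags", []):
            tag = h["text"]
            if normalize:
                tag = tag.lower().replace("-", "").replace("_", "")
            counts[tag] += 1
    return counts.most_common(k)


sample = [
    {"entities": {"hashtags": [{"text": "COVID19"}, {"text": "music"}]}},
    {"entities": {"hashtags": [{"text": "covid19"}]}},
    {"entities": {"hashtags": [{"text": "Covid_19"}]}},
    {"text": "no entities here"},
]
merged = top_hashtags(sample, k=2, normalize=True)
```

Whether to merge variants is a design choice: unmerged counts reflect exactly what users typed, while merged counts give a clearer picture of how much attention a topic receives overall.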

7 Domain Sharing Analysis

Domains in URLs shared by users also provide a window on what is share-worthy. We perform an analysis on domains shared in tweets. (We note that the same analysis could also be performed on re-tweets and replies, which we intend to carry out.) A comparison between the ranks of the top 40 domains in 2020 and their ranks in 2019 yields a number of observations, as follows:

News: We observe that URLs with news organization domains are higher in rank in 2020. This is true for Canada, where five Canadian news domains are higher, but also for international news, where two domains have jumped at least 10 positions and two others a whopping 26 and 252 positions, respectively. The U.S. twittersphere shows a similar trend, with four news domains showing in the top 40 domains, jumping 25, 15, 19, and 48 positions, respectively. It is striking that one of these has moved from a rank of 81 in 2019 to 33 in 2020 with the 48-position jump. We note a somewhat similar international trend, with some sites rising much higher in rank.

Other domains: Other noteworthy domain activities include those related to gaming, video and music, and social media tools, where the ranks of these domains have not necessarily shifted higher but remain prominent. This shows these themes are still relevant in 2020. In spite of the economic impact of the pandemic, some shopping domains have markedly risen in rank as people moved to shopping online in more significant ways.
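The rank comparison described in this section can be sketched as below. This is a minimal illustration, not the analysis code behind the reported numbers: it assumes the shared URLs have already been extracted as full URL strings (for instance from the `expanded_url` field of Twitter API v1.1 URL entities), and the function names and the placeholder `.example` domains are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_ranks(urls):
    """Rank domains by how often they are shared (1 = most shared)."""
    counts = Counter()
    for u in urls:
        host = urlparse(u).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:
            counts[host] += 1
    return {d: r for r, (d, _) in enumerate(counts.most_common(), start=1)}


def rank_shifts(urls_prev, urls_curr, top_n=40):
    """For the top_n domains of the current year, return positions gained.

    A positive value means the domain moved up; None means the domain
    was not seen at all in the previous year.
    """
    prev, curr = domain_ranks(urls_prev), domain_ranks(urls_curr)
    return {
        d: (prev[d] - r) if d in prev else None
        for d, r in curr.items()
        if r <= top_n
    }


urls_2019 = (["https://a.example/x"] * 3
             + ["https://b.example/y"] * 2
             + ["https://www.c.example/z"])
urls_2020 = (["https://c.example/z"] * 3
             + ["https://a.example/x"] * 2
             + ["https://b.example/y"])
shifts = rank_shifts(urls_2019, urls_2020)
```

Under this convention, a domain that moved from rank 81 to rank 33 would receive a shift of 48, matching the way jumps are reported above.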

8 Case Study: Mapping Human Mobility with Mega-COV V0.1

Geo-location information in Mega-COV V0.1 can be used to characterize and track human mobility in various ways. We investigate some of these next.

8.1 Inter-Region Mobility

Mega-COV V0.1 can be exploited to generate responsive maps where end users can check mobility patterns between different regions over time. In particular, geo-location information can show mobility patterns between regions. As an illustration of this use case, we provide Figure 7. The figure shows mobility between different Canadian cities (Figure 7(a)) and U.S. states (Figure 7(b)) during Jan.-April 2020.

Figure 7: Inter-region mobility in Mega-COV V0.1 Canada and U.S. data (Jan.-April 2020). (a) Tweets from Canada (based on cities). (b) Tweets from the U.S. (based on states).

Figure 8: User mobility between Canadian provinces during Jan.-April 2020. (a) January 2020. (b) February 2020. (c) March 2020. (d) April 2020.

Figure 9: User mobility between U.S. states during Jan.-April 2020. (a) January 2020. (b) February 2020. (c) March 2020. (d) April 2020.

8.2 User Home Location

We also use information in Mega-COV V0.1 to map each user to a single home region (i.e., city, state/province, and country). We follow the geo-location literature in setting a condition that a user must have posted at least 10 tweets from a given region. In addition, we require that a minimum share of all of a user's tweets be posted from that same region. (We will provide a table with the distribution of users over the global locations we could map them to in the next release.) For all the analyses in the sections to follow, we exclusively use data from users we successfully located using the method described above (henceforth, located users).
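The two conditions above can be expressed compactly in code. The sketch below is only an illustration: the minimum-share threshold is not stated in the text, so `min_share=0.6` is a placeholder assumption, as is the choice to compute the share over the user's located tweets. The function name `home_region` is hypothetical.

```python
from collections import Counter


def home_region(regions, min_tweets=10, min_share=0.6):
    """Return a user's home region, or None if no region qualifies.

    regions: one region label per located tweet of a single user.
    min_tweets follows the 10-tweet condition in the text; min_share
    is a placeholder value, not the threshold used for Mega-COV.
    """
    if not regions:
        return None
    region, n = Counter(regions).most_common(1)[0]
    if n >= min_tweets and n / len(regions) >= min_share:
        return region
    return None


mostly_vancouver = ["Vancouver"] * 12 + ["Toronto"] * 3
too_few_tweets = ["Vancouver"] * 9 + ["Toronto"]
evenly_split = ["Vancouver"] * 10 + ["Toronto"] * 10
```

Here `home_region(mostly_vancouver)` yields a home region, while the other two users remain unlocated: one fails the 10-tweet condition and the other fails the share condition.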

8.3 Inter-Region Mobility Over Time

We exploit Mega-COV V0.1 to show inter-state/province mobility during a given window of time. Here, due to increased posting in 2020, we normalize the number of visits between states by the total number of all tweets posted during 2020. Figure 8 shows user mobility between different Canadian provinces for each month from January to April 2020. As a general pattern, as the various provinces went into lockdown starting from early/mid-March, user mobility drops noticeably, leading to a much quieter April.

Figure 9 shows mobility between different U.S. states. The figure shows a clear change from higher mobility in Jan. and Feb. to much less activity in March and especially April. Clear differences can be seen in key states where the pandemic has hit hard, such as New York (NY) and California (CA), and to some extent Washington State (WA).
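One way to compute normalized inter-region visit counts is sketched below. The text does not define a "visit" precisely, so this sketch makes an explicit assumption: a visit from region A to region B is counted whenever two consecutive located tweets by the same user come from A and then B. The function name `normalized_mobility` and the input layout are hypothetical.

```python
from collections import Counter


def normalized_mobility(user_timelines, total_tweets):
    """Count region-to-region moves and divide by total_tweets.

    user_timelines: {user_id: [(timestamp, region), ...]}
    total_tweets: normalizer, e.g. all tweets posted in the same year.
    A move is two consecutive located tweets from different regions
    (an assumed definition, not one given in the text).
    """
    if total_tweets <= 0:
        raise ValueError("total_tweets must be positive")
    moves = Counter()
    for events in user_timelines.values():
        ordered = sorted(events)  # chronological order per user
        for (_, src), (_, dst) in zip(ordered, ordered[1:]):
            if src != dst:
                moves[(src, dst)] += 1
    return {pair: n / total_tweets for pair, n in moves.items()}


timelines = {
    1: [(1, "ON"), (2, "QC"), (3, "QC"), (4, "ON")],
    2: [(5, "QC"), (1, "ON")],  # unsorted on purpose
}
scores = normalized_mobility(timelines, total_tweets=6)
```

Running the same function on tweets restricted to a single month gives per-month matrices of the kind visualized in Figures 8 and 9.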

8.4 User Weekly Mobility

We can also visualize user mobility as a distance from an average mobility score on a weekly basis. Namely, we calculate an average weekly mobility score for the year 2019 using geo-tag information (longitude and latitude) and use it as a baseline against which we plot user mobility for each week of 2019 and 2020 up until April. In general, we observe a drop in user mobility in Canada starting from mid-March. For U.S. users, we notice a very high mobility surge starting around the end of Feb. and early March, waning only in the last week of March and continuing to wane in April. For both the U.S. and Canada, we hypothesize that the surge in early March (much more noticeable in the U.S.) is a result of people moving back to their hometowns, returning from travels, stocking up on basic needs, etc.

Figure 10: Canadian and American user weekly mobility during 2019-2020. Each point (a week) is modeled as a mobility distance from weekly average mobility in 2019. (a) Canada users. (b) U.S. users.
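The weekly comparison against a 2019 baseline can be sketched as follows. The text does not spell out how the mobility score is computed, so this sketch assumes one plausible definition: the great-circle (haversine) distance travelled between a user's consecutive geo-tagged tweets, summed per ISO week. All function names are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def weekly_distance(points):
    """Sum distance between consecutive points per (ISO year, ISO week).

    points: [(date, lat, lon), ...] for one user. Each hop is assigned
    to the week of its later point (an assumed convention).
    """
    weekly = defaultdict(float)
    ordered = sorted(points)
    for (_, la0, lo0), (d1, la1, lo1) in zip(ordered, ordered[1:]):
        year, week, _ = d1.isocalendar()
        weekly[(year, week)] += haversine_km(la0, lo0, la1, lo1)
    return dict(weekly)


def deviation_from_baseline(weekly, baseline_year=2019):
    """Subtract the baseline year's mean weekly score from every week."""
    base = [v for (y, _), v in weekly.items() if y == baseline_year]
    mean = sum(base) / len(base) if base else 0.0
    return {k: v - mean for k, v in weekly.items()}


trip = weekly_distance([(date(2019, 3, 4), 0.0, 0.0), (date(2019, 3, 5), 0.0, 1.0)])
deviations = deviation_from_baseline(
    {(2019, 10): 10.0, (2019, 11): 30.0, (2020, 12): 5.0}
)
```

Weeks with negative deviations fall below the 2019 average, which is how a drop such as the one observed from mid-March 2020 would show up.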

8.5 Intra-Region Mobility

We can exploit the data to plot user mobility between two or more points based on geo-tagged tweets within the same region, thus painting a more detailed picture. As an illustration, Figure 11 shows user monthly mobility within New York State during 2020. The Figure shows the surge in activity in March 2020 we discuss in the previous section.

Figure 11: User monthly mobility within New York State. (a) January (3,949 users). (b) February (4,500 users). (c) March (5,145 users). (d) April (1,870 users).

9 Related Works

9.1 Twitter in Emergency and Crisis

Social media can play a useful role in disasters and emergencies since they provide a mechanism for wide information dissemination and their content can be mined for prompt action Simon et al. (2015). For example, during Typhoon Haiyan in the Philippines, Twitter was used for dissemination of second-hand information, aiding relief efforts, and offering condolences to the victims Takahashi et al. (2015). Prior to an emergency, Twitter can also be useful for preparedness and early warning. carley2016crowd studied the potential value of Twitter for tsunami warning and response in Padang, Indonesia, showing it could be used to support pre-disaster management as it contained information about mobility, population, linguistic needs, and local opinion leaders in different regions, which could all contribute to the construction of an early response system. verma2019newswire also study the effectiveness of social media in disaster response and recovery in the context of the 2015 Nepal earthquakes, making a comparison with conventional newspapers and concluding that social media such as Twitter and news articles share complementary perspectives that form a holistic view. marx2020sense studied the different strategies media organisations followed during a disaster such as Hurricane Harvey. They identified three sense-giving strategies: retweeting of local in-house outlets, bounded amplification of messages from individual journalists associated with the organisation, and open message amplification.

A number of works have focused on developing systems for emergency response. For example,  mccreadie2019trec produce a series of curated feeds of social media posts where a particular type of information request is mapped to feeds. They also make use of a ‘criticality’ score which represents how important it is that a user be shown a given post. They use Twitter feeds to present 6 categories of event: wildfire, earthquake, flood, typhoon/hurricane, bombing, and shooting to tackle irrelevant or off-topic content.

9.2 Twitter Datasets for COVID-19

Several works have focused on creating datasets for enabling COVID-19 research. To the best of our knowledge, all these works depend on a list of hashtags related to COVID-19 and focus on a given period of time. For example, chen2020covid started collecting tweets in January and continued updating by actively tracking a list of 22 popular keywords such as #Coronavirus, #Corona, and #Wuhancoronavirus. They also crawled data from 8 related accounts such as PneumoniaWuhan, CoronaVirusInfo, and V2019N. As of April, the authors have released a total of 67M English tweets and 101M non-English tweets. singh2020first collect a dataset covering January to March using a list of hashtags such as #2019nCoV, #ChinaPneumonia, and #ChinesePneumonia, comprising tweets, re-tweets, and direct conversations. Using location information in the data, the authors report that tweets strongly correlated with newly identified cases in these locations. More precisely, they state that, for the located conversations, the pattern of volume changes led the COVID-19 cases by 2-5 days in the United States, Italy, and China. They suggest that this pattern would be helpful to predict the outbreak of cases.

Similarly, alqurashi2020large use a list of keywords and hashtags related to COVID-19 with Twitter’s streaming API to collect a dataset of Arabic tweets. The dataset covers a period in March. The authors' goal is to help researchers and policy makers study the various societal issues prevailing due to the pandemic. The authors note that the number of re-tweets increased significantly in late March. In the same vein, lopez2020understanding also collect a dataset in multiple languages, including English. The dataset covers January to March. Analyzing the data, the authors observe the level of re-tweets rising abruptly as the crisis ramped up in Europe in late February and early March.

9.3 Misinformation About COVID-19

Misinformation can spread fast during disasters, and especially during health outbreaks. Social data have been used to study rumors and various types of fake information related to the Zika Ghenai and Mejova (2017) and Ebola Kalyanam et al. (2015) viruses. In the context of COVID-19, a number of works have focused on investigating the effect of misinformation on mental health Rosenberg et al. (2020); the types, sources, claims, and responses of a number of pieces of misinformation about COVID-19 Brennen et al. (2020); the propagation pattern of rumours about COVID-19 on Twitter and Weibo do2019rumour; check-worthiness (i.e., whether or not a piece of textual information is critical enough to be checked for veracity) wright2020fact; modeling the spread of misinformation and related networks about the pandemic cinelli2020covid, osho2020information, pierri2020topology, koubaa2020understanding; estimating the rate of misinformation in COVID-19 associated tweets kouzy2020coronavirus; the use of bots Ferrara (2020); and predicting whether a user is COVID-19 positive or negative Karisani and Karisani (2020).

singh2020first examine the quality of shared links in tweets by identifying a set of ‘reputable’ domains, which come from top medical journals, hospitals, and official recommendations, and a set of ‘questionable’ domains, which are compiled by NewsGuard. They find that the number of useful shared links is about the same as the number of misleading links. They identify a list of 5 common ‘myths’ (origin of COVID-19, flu comparison, home remedies, heat kills disease, and vaccine development) from the search phrase “Coronavirus common myths”. By matching the phrases and words in tweets with the broad descriptions of myths from Google search, they discover over 16,000 tweets containing these myths, which is a small fraction of Twitter content. The authors also identify the top 10 most frequent words in their dataset, including words such as China, people, cases, Wuhan, and Coronavirus. By grouping frequent words in Twitter conversations, they also identify the top 8 most prevalent themes in their data as healthcare/illness, global nature, information providers, government response, individual concerns/strategies, emotion, and social.

sharma2020coronavirus collect a dataset of tweets from multiple countries, the majority of which are English speaking. The data cover a period in March. Analyzing their data, the authors observe a spike in new users during the period from November to March. The authors also perform some initial analyses on their dataset, including identifying fake stories, topical distribution, and sentiment analysis aimed at understanding the public's perception of the pandemic.

9.4 Racism and Hate Speech in COVID-19

Just as the coronavirus spread fast around the world, hate speech towards certain communities is also spreading fast. devakumar2020racism raise the concern that discrimination towards ethnic minority groups, such as people of color and immigrants, could lead to a higher risk of infection for these groups due to their limited access to medical resources and the lack of social protection. The rise of fake news has also worsened the problem of discrimination. A number of works have focused on related phenomena. For example, schild2020go identify an increase in Sinophobic behaviour on the web and find that its spread is a cross-platform phenomenon. Similarly, shimizu20202019 find that xenophobia towards Chinese people spread in Japan due to a piece of misinformation stating that “Chinese passengers from Wuhan with fever slipped through the quarantine at Kansai International Airport” and the hashtag #ChineseDon’tComeToJapan trending on Twitter. Despite WHO officially naming the disease COVID-19, controversial terms such as Chinese Virus and Wuhan Virus remain in use. lyu2020sense report work to predict Twitter users who are more likely to use controversial terms related to the COVID-19 crisis.

9.5 Emotional Response in COVID-19

To investigate emotional response to COVID-19, kleinberg2020measuring collected and analyzed the Real World Worry Dataset, a dataset comprising participants’ indications of worry level and emotion type, and their written long and short texts about their emotional states. Unlike other works, this dataset is not from Twitter but rather is collected via the crowd-sourcing platform Prolific. Participants express their level of worry and emotion (anger, anxiety, desire, disgust, fear, happiness, relaxation, and sadness) in written form (short and long) that is matched with a 9-point scale. Using a lexicon (LIWC2015), the authors find significantly high correlations of worrying thoughts in long texts with the categories “family” and “friends”. Topic models revealed that the prevalent topics for short texts are related to government ‘slogans’ and to urging social distancing on others, whereas common topics for long texts are lockdown and worries about employment and the economy. The authors also point out that participants tended to use short texts (tweet-sized) to call for solidarity and long texts to show their actual worries about family and friends.

10 Ethical Considerations

We collect Mega-COV from the public domain (Twitter). In compliance with Twitter policy, we do not publish hydrated tweet content. Rather, we only publish publicly available tweet IDs. All Twitter policies, including respect and protection of user privacy, apply. We encourage all researchers who decide to use Mega-COV to review Twitter policy before they start working with the data.

11 Conclusion

We presented Mega-COV, a billion-scale dataset of 65 languages for studying global response to the ongoing COVID-19 pandemic. In addition to being large and highly multilingual, our dataset comprises data long pre-dating the pandemic. This allows for comparisons over time. We have provided initial analyses of the data, with a focus on its potential use for investigating human mobility. We hope our dataset will be useful for accelerating research on the topic.


MAM acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Social Sciences and Humanities Research Council of Canada (SSHRC), and Compute Canada.


  • J. S. Brennen, F. M. Simon, P. N. Howard, and R. K. Nielsen (2020) Types, sources, and claims of covid-19 misinformation. Reuters Institute. Cited by: §9.3.
  • E. Ferrara (2020) # covid-19 on twitter: bots, conspiracies, and social media activism. arXiv preprint arXiv:2004.09531. Cited by: §9.3.
  • A. Ghenai and Y. Mejova (2017) Catching zika fever: application of crowdsourcing and machine learning for tracking health misinformation on twitter. arXiv preprint arXiv:1707.03778. Cited by: §9.3.
  • J. Kalyanam, S. Velupillai, S. Doan, M. Conway, and G. Lanckriet (2015) Facts and fabrications about ebola: a twitter based study. arXiv preprint arXiv:1508.02079. Cited by: §9.3.
  • N. Karisani and P. Karisani (2020) Mining coronavirus (covid-19) posts in social media. arXiv preprint arXiv:2004.06778. Cited by: §9.3.
  • H. Rosenberg, S. Syed, and S. Rezaie (2020) The twitter pandemic: the critical role of twitter in the dissemination of medical information and misinformation during the covid-19 pandemic. Canadian Journal of Emergency Medicine, pp. 1–7. Cited by: §9.3.
  • T. Simon, A. Goldberg, and B. Adini (2015) Socializing in emergencies—a review of the use of social media in emergency situations. International Journal of Information Management 35 (5), pp. 609–619. Cited by: §9.1.
  • B. Takahashi, E. C. Tandoc Jr, and C. Carmichael (2015) Communicating on twitter during a disaster: an analysis of tweets during typhoon haiyan in the philippines. Computers in Human Behavior 50, pp. 392–398. Cited by: §9.1.
  • WHO (2020) WHO statement regarding cluster of pneumonia cases in wuhan, china. Beijing: WHO 9. Cited by: §1.