The outbreak of the coronavirus disease 2019 (COVID-19) was observed at the end of 2019 in Wuhan, Hubei Province, China. Since January 2020, it has rapidly spread worldwide. On March 11, 2020, the World Health Organization (WHO) announced that COVID-19 can be characterized as a pandemic. The virus causing COVID-19, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), has infected more than 1.2 million people worldwide, and 60,000 people have lost their lives (https://google.com/covid19-map/). WHO highly recommends maintaining “social distancing” measures, and several countries with severe epidemics are further requesting citizens to stay home.
In this scenario, online social media, such as Twitter, Weibo, and Instagram, are playing an important role in sharing information and perceptions about COVID-19. Social media is recognized as a valuable source of data for predicting various phenomena related to an event. For example, Lampos and Cristianini (2010) showed that microblog data facilitated better public-health surveillance, such as the prediction of the number of patients suffering from influenza.
To encourage and support social media studies on COVID-19, it is crucial to make relevant datasets available to the public. Here, we publish a multilingual dataset that contains over 20 million microblogs related to COVID-19 in English, Japanese, and Chinese, collected from Twitter and Weibo from January 20 to March 24, 2020.
Chen et al. (2020) and Lopez et al. (2020) have already released multilingual datasets collected from Twitter. Given that China was the very first country to face a COVID-19 outbreak, we further collected microblogs about COVID-19 from Weibo, one of the most popular social media platforms in China and similar to Twitter.
The remainder of the paper is organized as follows. In Section 2, we elaborate on the method of data collection. In Section 3, we provide a quantitative analysis of the dataset, such as the character count per microblog and the microblog count per day. In Section 4, we present the daily word cloud images created from microblogs of each language as an example of text-mining analysis. Finally, in Section 5, we present the conclusion along with our future work.
2 Data Collection
Table 1: Query keywords for each phase.

|Language|Phase 1|Phase 2|Phase 3|
|---|---|---|---|
|English|Wuhan AND (pneumonia OR coronavirus)|Wuhan AND (pneumonia OR coronavirus OR (COVID AND 19))|(Wuhan AND pneumonia) OR (COVID AND 19)|
|Japanese|武漢 AND (肺炎 OR コロナ)|武漢 AND (肺炎 OR コロナ OR (COVID AND 19))|(武漢 AND 肺炎) OR (COVID AND 19)|
|Chinese|武汉 AND (肺炎 OR 冠状病毒)|武汉 AND (肺炎 OR 冠状病毒 OR 新冠肺炎)|(武汉 AND 肺炎) OR|
To collect the microblogs related to COVID-19, we adopted a keyword-based search. We collected English and Japanese microblogs from Twitter and Chinese microblogs from Weibo. We employed the Twitter Search API (https://developer.twitter.com/en/docs/tweets/search/overview/standard) for tweets and a web crawler to retrieve Weibo posts.
We developed three sets of query keywords as shown in Table 1 according to the stages of COVID-19 spread. Corresponding to these sets, our dataset can be divided into three phases:
- Phase 1 (January 20 to February 23, 2020): In combination with the term “Wuhan,” we used the keywords “pneumonia” and “coronavirus” in English and their translations in Japanese and Chinese. We included the Chinese city name “Wuhan” as the primary keyword because Wuhan (“武漢” in Japanese and “武汉” in Chinese) observed the earliest outbreak with the maximum number of confirmed cases. Note that in this period, the official disease name “COVID-19” was yet to be defined.
- Phase 2 (February 24 to 29, 2020): WHO assigned the official name “COVID-19” on February 11. We added it to the keywords in combination with “Wuhan,” although this retrieved relatively few microblogs because every matching microblog still had to include “Wuhan.”
- Phase 3 (March 1–24, 2020): To obtain more data, we relaxed the search conditions by querying each set of keywords separately.
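To make the difference between the phases concrete, the English queries in Table 1 can be modeled as a disjunction of AND-groups. The following is a minimal sketch and not the collection code used for the dataset: the names `ENGLISH_QUERIES`, `tokenize`, and `matches` are hypothetical, the Phase 3 entry follows the cell as it reads in Table 1, and the simple token matching only approximates how a search service evaluates such queries.

```python
import re

# Hypothetical encoding of the English row of Table 1.
# Each query is a list of AND-groups; a text matches if ANY group is fully present.
ENGLISH_QUERIES = {
    1: [{"wuhan", "pneumonia"}, {"wuhan", "coronavirus"}],
    2: [{"wuhan", "pneumonia"}, {"wuhan", "coronavirus"}, {"wuhan", "covid", "19"}],
    3: [{"wuhan", "pneumonia"}, {"covid", "19"}],  # as the Phase 3 cell reads in Table 1
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(text: str, phase: int) -> bool:
    """Return True if the text satisfies the (assumed) query of the given phase."""
    tokens = tokenize(text)
    return any(group <= tokens for group in ENGLISH_QUERIES[phase])


if __name__ == "__main__":
    post = "COVID-19 cases rise in Italy"
    print(matches(post, 2))  # Phase 2 still requires "Wuhan"
    print(matches(post, 3))  # Phase 3 accepts "COVID" AND "19" on its own
```

Under these assumptions, a post that mentions only “COVID-19” fails the Phase 2 query, which still requires “Wuhan,” but passes the relaxed Phase 3 query.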
[Table 2: number of collected microblogs, with columns Phase 1, Phase 2, and Phase 3]
2.2 Data Size
As shown in Table 2, we collected over 16 million microblogs in English, 9 million in Japanese, and 180 thousand in Chinese from January 20 to March 24, 2020. For both Twitter and Weibo, we adopted a uniform daily schedule that collects the microblogs posted from 0:00 to 23:59 (JST) of the previous day. To ensure the uniqueness of the data, for Twitter, we filtered out all retweets by adding the “-filter:retweets” operator; for Weibo, we searched for “original microblogs” only. Note that we collected far less data from Weibo than from Twitter because Weibo's anti-crawling mechanism limits our web crawler to the first 50 pages of the search results.
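The daily schedule described above can be illustrated with a short sketch. It computes the previous day's 0:00 to 23:59 window in JST and appends the “-filter:retweets” operator to a query string. The function names are hypothetical, and the sketch only prepares the window and the query text; it does not call any Twitter or Weibo endpoint.

```python
from datetime import date, datetime, time, timedelta, timezone

# JST is a fixed UTC+9 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")


def previous_day_window(today: date) -> tuple[datetime, datetime]:
    """Return the 0:00 and 23:59 (JST) boundaries of the day before `today`."""
    day = today - timedelta(days=1)
    start = datetime.combine(day, time(0, 0), tzinfo=JST)
    end = datetime.combine(day, time(23, 59), tzinfo=JST)
    return start, end


def with_retweet_filter(query: str) -> str:
    """Append the operator that excludes retweets from Twitter search results."""
    return f"{query} -filter:retweets"


if __name__ == "__main__":
    start, end = previous_day_window(date(2020, 3, 25))
    print(start.isoformat(), end.isoformat())
    # The same window expressed in UTC, which many APIs expect.
    print(start.astimezone(timezone.utc).isoformat())
    print(with_retweet_filter("Wuhan AND pneumonia"))
```

Using a fixed offset keeps the sketch independent of a time-zone database; a production collector might prefer a named zone instead.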
2.3 Dataset Accessibility
We released the first version of the dataset on GitHub at https://github.com/sociocom/covid19_dataset. Following the terms of service of Twitter and Weibo, we publish mainly microblog IDs instead of exposing the original text and metadata. The dataset consists of lists of microblog IDs with two fields of metadata: their timestamps and the query keywords, among our search queries, that appear in each microblog. This metadata helps users build subsets suitable for subsequent applications and tasks. Because a Weibo microblog is uniquely determined by the combination of user ID and microblog ID, we share both for each microblog in the form of “user ID/microblog ID.”
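A consumer of the Weibo lists needs to split each “user ID/microblog ID” entry back into its two parts. The sketch below shows one way to do so. The `WeiboRef` type and `parse_weibo_ref` function are hypothetical names, the IDs in the example are placeholders rather than real entries, and the exact file layout of the repository is not assumed here.

```python
from typing import NamedTuple


class WeiboRef(NamedTuple):
    """A Weibo microblog reference: the pair that uniquely identifies a post."""
    user_id: str
    microblog_id: str


def parse_weibo_ref(ref: str) -> WeiboRef:
    """Split a 'user ID/microblog ID' string into its two components."""
    user_id, sep, microblog_id = ref.strip().partition("/")
    if not sep or not user_id or not microblog_id:
        raise ValueError(f"expected 'user ID/microblog ID', got {ref!r}")
    return WeiboRef(user_id, microblog_id)


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    print(parse_weibo_ref("1234567890/AbCdEfGhI"))
```

Keeping both IDs as strings avoids losing leading zeros or overflowing integer types in downstream tools.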
3 Quantitative Analysis
We provide basic statistics of our dataset in terms of its volume. First, we show the number of characters in microblogs. Next, we plot the number of microblogs per day.
3.1 Character Count
While microblogs contain multimodal data (e.g., images and movies), their core content is text. We report the number of characters to quantify the total size of our dataset. Table 3 shows the sum, mean, and standard deviation of the number of characters for each language in our dataset. We removed URLs and punctuation from each microblog to count only the characters that constitute the essential content.
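The character statistics can be approximated with a few lines of standard-library code. This is a minimal sketch under stated assumptions, not the procedure used to produce Table 3: the URL pattern is a simple heuristic, punctuation is identified by Unicode category, whitespace is also dropped (our choice for the example), and the population standard deviation is used because the source does not say which variant was reported.

```python
import re
import statistics
import unicodedata

# Heuristic URL pattern; an assumption, not the pattern used for Table 3.
URL_RE = re.compile(r"https?://\S+")


def content_length(text: str) -> int:
    """Count characters after removing URLs, punctuation, and whitespace."""
    text = URL_RE.sub("", text)
    return sum(
        1
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def summarize(texts: list[str]) -> dict[str, float]:
    """Return the sum, mean, and (population) standard deviation of lengths."""
    lengths = [content_length(t) for t in texts]
    return {
        "sum": sum(lengths),
        "mean": statistics.mean(lengths),
        "std": statistics.pstdev(lengths),
    }


if __name__ == "__main__":
    print(summarize(["Stay home! https://example.com", "武漢、肺炎。"]))
```

Because the Unicode punctuation categories cover full-width marks such as “、” and “。”, the same function applies to English, Japanese, and Chinese text.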
3.2 Daily Microblog Count
Figure 1 portrays the daily count of microblogs in each language, together with the daily number of confirmed COVID-19 cases, which was obtained from DataHub.io (https://datahub.io/core/covid-19). Figure 1(a) plots the English microblogs and the confirmed cases in four major English-speaking countries (i.e., Australia, Canada, the United Kingdom, and the United States) during Phases 1 and 2; Figure 1(b) shows the same for Phase 3. Figures 1(c) and 1(e) are the Japanese and Chinese versions of the same plots for Phases 1 and 2, whereas Figures 1(d) and 1(f) display the plots for Phase 3.
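Daily counts of this kind can be derived from the timestamp field that accompanies each microblog ID. The sketch below assumes that timestamps are ISO 8601 strings with an explicit UTC offset and that days are delimited in JST, in line with the collection window; both are assumptions for illustration, and the actual timestamp format in the repository may differ.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9), "JST")


def daily_counts(timestamps: list[str]) -> dict[str, int]:
    """Count microblogs per calendar day, with days delimited in JST.

    Each timestamp must be an ISO 8601 string with an explicit UTC offset.
    """
    counts: Counter[str] = Counter()
    for ts in timestamps:
        posted_at = datetime.fromisoformat(ts).astimezone(JST)
        counts[posted_at.date().isoformat()] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    sample = [
        "2020-01-27T16:00:00+00:00",  # already January 28 in JST
        "2020-01-28T03:00:00+09:00",
        "2020-01-28T15:30:00+00:00",  # already January 29 in JST
    ]
    print(daily_counts(sample))
```

Converting to a single time zone before bucketing matters near midnight, where a post can fall on different calendar days depending on the zone.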
In Figure 1(a), a sudden and dramatic increase in the number of English microblogs can be observed on January 28, 2020. According to the news, that particular day saw a discussion on the death toll in mainland China reaching 100 (CNN, January 28, 2020; https://cnn.it/3a1FFm8). On the same day, Japan also observed a sharp rise in the relevant microblogs, as shown in Figure 1(c). This was a result of many users tweeting extensively about the three newly confirmed cases in Japan, which included people who had not been to Wuhan (Japan Times, January 28, 2020; https://bit.ly/3aFPqaE).
Subsequently, there was a substantial increase in the English microblogs on February 25, 2020, as shown in Figure 1(a). On that day, there were reports that “Trump privately vents over his team’s response to coronavirus – even though he says that the virus is under control” (CNN, February 25, 2020; https://cnn.it/39VVbjg), leading to many microblogs against Trump on Twitter.
In March, as Figure 1(b) shows, the number of microblogs in major English-speaking countries showed an upward trend as the number of confirmed cases increased, and the largest number of microblogs exceeded 9 million a day. Meanwhile, in Japan, the number of daily confirmed cases was relatively small, as shown in Figure 1(d). We therefore assume that Japanese Twitter users were less interested in COVID-19 than users in the major English-speaking countries. In particular, there was a decline in the number of microblogs from March 12 to March 15, 2020. March 12, 2020, was the day of the Olympic flame lighting ceremony and the start of the torch relay for the Tokyo 2020 Olympics (BBC, March 12, 2020; https://bbc.in/3emD6OK). We therefore speculate that this sudden decrease was caused by a shift in attention from COVID-19 to the torch relay among many Japanese users.
4 Qualitative Analysis
In addition to the quantitative analysis, we show an example of qualitative analysis based on our dataset. As an initial attempt, we adopted a word cloud, which is “an electronic image that shows words used in a particular piece of electronic texts or series of texts” (https://dictionary.cambridge.org/dictionary/english/word-cloud). In word clouds, the font size of each word is proportional to its term frequency in the corpus, which enables us to grasp the topics of the corpus visually. Daily word cloud images of our dataset for each language are available at https://aoi.naist.jp/2020-covid/wordcloud. Below, we provide brief interpretations of the word clouds in Figure 2 to demonstrate a possible text-mining approach that can be applied to our dataset.
Note that we removed stop words after tokenization when building our word clouds. For the Chinese and Japanese tokenization, we used Jieba (https://github.com/fxsjy/jieba) and MeCab (https://taku910.github.io/mecab), respectively. We also filtered out the search keywords from each microblog so that these keywords do not dominate the image.
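The preprocessing behind a word cloud can be sketched for the English case with the standard library alone. The stop-word list, the keyword list, and the function names below are illustrative assumptions, and the regular-expression tokenizer stands in for a proper tokenizer; Japanese and Chinese text would first need a morphological analyzer such as MeCab or Jieba, which this sketch does not call.

```python
import re
from collections import Counter

# Tiny illustrative lists; real stop-word lists are much longer.
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "is"}
SEARCH_KEYWORDS = {"wuhan", "pneumonia", "coronavirus", "covid", "19"}


def term_frequencies(texts: list[str]) -> Counter:
    """Tokenize, then drop stop words and search keywords, then count terms."""
    counts: Counter = Counter()
    for text in texts:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token not in STOP_WORDS and token not in SEARCH_KEYWORDS:
                counts[token] += 1
    return counts


def font_sizes(counts: Counter, max_size: float = 72.0) -> dict[str, float]:
    """Scale font sizes linearly so the most frequent term gets `max_size`."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: max_size * count / top for word, count in counts.items()}


if __name__ == "__main__":
    posts = [
        "Stay home and keep social distancing",
        "Social distancing in Wuhan",
        "COVID 19: stay home",
    ]
    print(font_sizes(term_frequencies(posts)))
```

This sketch counts single tokens only, so a phrase such as “social distancing” appears as two separate terms; rendering phrases would require counting n-grams as well.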
4.1 English Word Cloud
A US citizen who lived in Wuhan passed away there because of COVID-19 on February 8, 2020 (CNBC, February 8, 2020; https://cnb.cx/2R4uYZ1). This was the first death of a US citizen from the disease. The word cloud of this day, shown in Figure 2(a), contains the related words, e.g., “American,” “US,” “citizen,” and “die.”
Figure 2(b) is the word cloud on March 16, 2020, in which “social distancing,” an important phrase in the fight against the epidemic, appears prominently. We can also notice that another socially important phrase, “stay home,” increases in size in our word cloud series from March 20, 2020.
4.2 Japanese Word Cloud
The first local transmission of COVID-19 inside Japan was reported on January 28, 2020, as described in Section 3.2. Figure 2(c) shows the word cloud on that day. It reflects the fact that the infected patient lived in Nara prefecture and drove a sightseeing-tour bus that carried travelers from Wuhan. We can observe the relevant keywords, such as “奈良 (Nara),” “バス (bus),” and “運転 (drive).”
On March 24, 2020, Japan and the International Olympic Committee (IOC) officially agreed to postpone the planned 2020 Tokyo Olympics until 2021 (The Washington Post, March 24, 2020; https://wapo.st/2UYXEnG). A notable change in the Japanese word cloud series is the new appearance of the words “オリンピック (Olympics)” and “延期 (postponing)” in that day’s figure (i.e., Figure 2(d)).
By observing the corresponding word clouds, we can also notice that a YouTube video went viral on Japanese Twitter from around January 29 to February 6, 2020. The video, which describes the situation of Wuhan in lockdown, was originally made by a Wuhan citizen and later subtitled in Japanese by another YouTuber (YouTube, January 29, 2020; https://youtu.be/Mcfn5Eh5OVE). In addition to the word “YouTube,” the corresponding word clouds contain the tokens of the video title, i.e., “震源 (hypocenter),” “動画 (video),” and “和訳 (Japanese translation).”
4.3 Chinese Word Cloud
Figure 2(e) shows the word cloud on January 20, 2020, in which the term “钟南山 (Zhong Nanshan)” carries a large weight. It was on January 20 that Dr. Zhong indicated the existence of human-to-human transmission of COVID-19 (The New York Times, January 20, 2020; https://nyti.ms/3bT7r5m), which triggered extensive discussion on Weibo.
Figure 2(f) shows the word cloud on March 10, 2020, in which the word “方舱医院 (mobile cabin hospital)” is conspicuous. According to China’s National Health Commission, all of Wuhan’s mobile cabin hospitals were closed on March 10 (Xinhua News, March 10, 2020; https://bit.ly/2JG28u6). The mobile cabin hospitals, which were instrumental in preventing the spread of the epidemic, had also attracted much attention.
5 Conclusion
We published a multilingual dataset of microblogs related to COVID-19 collected by relevant query keywords at https://github.com/sociocom/covid19_dataset. The dataset covered English and Japanese tweets from Twitter and Chinese posts from Weibo. The present version of the dataset (April 21, 2020) encompassed microblogs from January 20 to March 24, 2020.
We then showed possible uses of our dataset through the daily microblog count analysis as an example of quantitative analysis and the word cloud-based analysis as an example of qualitative analysis. The results of the analyses are summarized as follows. For China, which was the first country to face a full-blown outbreak of COVID-19, we can observe from social media that people took the situation and prevention seriously. As the number of confirmed cases in China decreased, the trend in social media shifted toward concern for the global situation. In the UK and the US, the main English-speaking countries, there was initially less interest on social media owing to fewer confirmed cases. The subsequent outbreaks sparked discussion about COVID-19 on social media, including the promotion of precautionary measures and recommendations to keep “social distancing.” Meanwhile, Japan showed relatively sluggish growth. However, on March 24, 2020, the announcement of the postponement of the 2020 Olympic Games in Tokyo, along with a relatively rapid growth of confirmed cases, was reflected in increased social media activity. This was accompanied by microblogs expressing concerns about the epidemic and dissatisfaction with government measures.
We believe that this dataset can be analyzed further in many ways, such as sentiment-based analysis (https://usc-melady.github.io/COVID-19-Tweet-Analysis/) and comparison with web search queries or mobility logs (https://www.google.com/covid19/mobility/, https://dataforgood.fb.com/tools/disease-prevention-maps). Various combinations of data can enable deeper analyses of social media communication. Furthermore, our dataset could help extract useful clinical information from social media and offer hints about how to broadcast clinical information efficiently. We continue to collect microblog data and keep the repository up to date.
This study was supported in part by JSPS KAKENHI Grant Number JP19K20279 and Health and Labor Sciences Research Grant Number H30-shinkougyousei-shitei-004.
- Chen et al. (2020) Emily Chen, Kristina Lerman, and Emilio Ferrara. 2020. Covid-19: The first public coronavirus twitter dataset.
- Lampos and Cristianini (2010) V. Lampos and N. Cristianini. 2010. Tracking the flu pandemic by monitoring the social web. In 2010 2nd International Workshop on Cognitive Information Processing, pages 411–416.
- Lopez et al. (2020) Christian E. Lopez, Malolan Vasu, and Caleb Gallemore. 2020. Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset.