Detecting Topic and Sentiment Dynamics Due to COVID-19 Pandemic Using Social Media

07/05/2020 ∙ by Hui Yin, et al. ∙ Deakin University

The outbreak of the novel Coronavirus Disease (COVID-19) has greatly influenced people's daily lives across the globe. Emergency measures and policies (e.g., lockdown, social distancing) have been taken by governments to combat this highly infectious disease. However, people's mental health is also at risk due to prolonged and strict social isolation rules. Hence, monitoring people's mental health across various events and topics is necessary for policy makers to make appropriate decisions. Meanwhile, social media have been widely used as an outlet for people to publish and share their personal opinions and feelings. Large-scale social media posts (e.g., tweets) provide an ideal data source for inferring people's mental health during this pandemic period. In this work, we propose a novel framework to analyze the topic and sentiment dynamics due to COVID-19 from massive social media posts. Based on a collection of 13 million tweets related to COVID-19 over two weeks, we found that positive sentiment shows a higher ratio than negative sentiment during the study period. When zooming into the topic-level analysis, we find that different aspects of COVID-19 have been constantly discussed and show comparable sentiment polarities. Some topics, like “stay safe home”, are dominated by positive sentiment. Others, such as “people death”, consistently show negative sentiment. Overall, the proposed framework yields insightful findings based on the analysis of topic-level sentiment dynamics.


1 Introduction

The outbreak of the novel Coronavirus Disease 2019 (COVID-19) has influenced millions of people around the world [20]. People’s lives are at risk because COVID-19 is highly infectious from person to person. As of 29 June 2020, the number of globally confirmed cases of COVID-19 had surpassed 10 million and more than 500,000 people had lost their lives due to the infection (https://news.google.com/covid19/map?hl=en-AU&gl=AU&ceid=AU%3Aen). Many countries and jurisdictions have taken a series of specific measures to slow down the spread of COVID-19, such as travel bans, lockdowns, closure of public places (e.g., gyms, restaurants, schools), and requiring people to practise good personal hygiene, keep a physical distance of 1.5 meters, and work or study at home to reduce contact with others. These measures have changed people’s daily lives, and most people have to follow the policies to protect the safety of their communities. However, mental symptoms such as worry, fear, frustration, depression and anxiety could occur and cause serious mental health issues due to the prolonged restriction of social activity during the pandemic period [24]. Therefore, understanding when and where people experience mental health issues in this special period is important for governments to make the right decisions.

On the other hand, people could spend more time on social media platforms like Twitter to get the latest news, communicate with friends and post their feelings and thoughts during the lockdown period. Such massive personal posts from social media could become invaluable data sources for large-scale sentiment and topic mining to monitor people’s mental health across different events or topics [21]. Hence, much recent work focuses on detecting public sentiment related to COVID-19 [24, 10, 15]. For instance, Zhou et al. [24] used Twitter data for large-scale sentiment analysis of people living in New South Wales, Australia, during COVID-19. Han et al. [10] studied public opinion in the early stages of COVID-19 in China by analyzing text data from Sina Weibo. Rajesh et al. [15] used a topic model to generate topics from tweets related to coronavirus and calculated the presence of different emotions. However, little work analyzes topic and sentiment dynamics together, even though such dynamic analysis is important for authorities to understand when and where people experience mental health issues.

In this work, we aim to dynamically identify the popular topics and their associated sentiment polarities due to the COVID-19 pandemic. Our research questions are: (1) How does people’s sentiment change over time? (2) Which topics are most discussed by people? (3) How do different topics evolve? (4) How does the sentiment of each topic change over time? To answer these questions, we propose a novel dynamic topic discovery and sentiment analysis framework that contains multiple modules, including data crawling, data cleaning, topic discovery, sentiment analysis and result visualization. In the proposed framework, we employ the Dynamic Topic Model (DTM) [2] to generate accurate daily topics. To determine the sentiment polarity of each topic and tweet, we use a lexicon-based sentiment tool, VADER [12] (https://github.com/cjhutto/vaderSentiment). We collect 13,746,822 tweets related to COVID-19 from Twitter across the world, posted from 1 April 2020 to 14 April 2020, to test the effectiveness of the proposed framework. The experimental results show that the proposed framework can generate insightful findings such as the overall sentiment dynamics among people, topic evolutionary patterns, and the sentiment dynamics of different topics.

2 Related work

With the spreading of COVID-19 across the world, researchers have proposed to use sentiment analysis based on social media as a tool to monitor people’s mental health. In this section, we review the latest work related to COVID-19 analysis based on social media data.

Rajesh et al. [15] adopted the classic Latent Dirichlet Allocation (LDA) topic model to generate 10 topics from a random sample of 18,000 tweets about coronavirus, then used the NRC sentiment dictionary to calculate the presence of eight different emotions: “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise” and “trust”. Han et al. [10] explored public opinion in the early stages of COVID-19 in China by analyzing Sina Weibo texts. The COVID-19 related micro-blogs were generalized into 7 topics and 13 more detailed sub-topics. However, they judged the polarity of the topics according to the polarity of the topic words and did not consider the specific posts under each topic. Cinelli et al. [6] analyzed engagement and interest in the COVID-19 topic on different social media platforms such as Twitter, Instagram and Reddit, and provided a differential assessment of the global discourse evolution of each platform and its users. They found that reliable and questionable information sources have similar spreading patterns. Depoux et al. [8] confirmed that social media panic spreads faster than COVID-19 itself. Therefore, the public rumors, opinions, attitudes and behaviors surrounding COVID-19 need to be quickly detected and responded to. They suggested creating an interactive dashboard to provide real-time alerts of rumors and concerns about the spread of coronavirus worldwide, which would enable public health officials and relevant stakeholders to respond quickly with a proactive and engaging narrative that can mitigate misinformation. Sharma et al. [18] designed a dashboard to track misinformation in Twitter conversations. The dashboard visualizes the social media discussions around coronavirus and the quality of information shared on the platform, and is updated over time. They evaluated the sentiment polarity of each tweet based on its textual information and showed the distribution of sentiment in different countries over time.

More recently, Huang et al. [11] examined the public discussion concerning COVID-19 on Twitter and found that the most influential tweets are still written by regular users, such as news media, government officials, and individual news reporters. They also discovered that “fake news” sites are most likely to be retweeted within the source country and so are less likely to spread internationally. Zhou et al. [24] exploited tweets on Twitter to analyse the sentiment dynamics of people living in the state of New South Wales (NSW) in Australia during the pandemic period. They first summarized that the overall polarity of the community since the outbreak was positive, and then analyzed the sentiment dynamics of the NSW local government areas in terms of lockdown, social distance and JobKeeper.

Unlike the above work, which either performed static sentiment analysis or did not analyze detailed topic-level sentiment, we propose to analyze the dynamics of topic-level sentiment to monitor the evolution of people’s mental states across different topics or events.

3 Proposed Framework

In this section, we introduce the proposed framework and the adopted techniques for detecting topic and sentiment dynamics. Our framework contains three steps for the task. Firstly, we divide the tweets into different topics generated by a dynamic topic model, then we determine the sentiment polarity of each tweet, and finally we summarize the sentiment polarity distribution of each topic. The proposed framework is shown in figure 1.

Figure 1: Overall procedure of analysis.

3.1 Topic Extraction

Topic modelling is the process of learning, recognizing, and extracting high-level semantic topics across a corpus of unstructured text. A popular method is Latent Dirichlet Allocation (LDA), proposed by Blei et al. [3], which detects topics in a static corpus. Later, the Dynamic Topic Model (DTM) [2] was proposed on top of LDA to mine topic evolution over time, extending the idea of LDA to topic mining over fixed time intervals. Hence, DTM can handle sequential documents and generate topics for different time slices of a corpus. Specifically, the documents within each time slice are modeled with static LDA, and the topics associated with slice $t$ evolve from the topics associated with the previous slice $t-1$. The dynamics of the topic model are given by Eq. 1 and 2, and Table 1 explains the notation. DTM mines the topics of each time slice with LDA, and its parameters $\beta_{t,k}$ and $\alpha_t$ are chained together in a state space model that evolves with Gaussian noise, yielding a smooth evolution of topics from one time slice to the next.

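| Symbol | Description |
| --- | --- |
| $\alpha_t$ | Per-document topic distribution at time $t$. |
| $\beta_{t,k}$ | Word distribution of topic $k$ at time $t$. |
| $\eta_{t,d}$ | Topic distribution for document $d$ at time $t$. |
| $z_{t,d,n}$ | Topic for the $n$-th word in document $d$ at time $t$. |
| $w_{t,d,n}$ | The $n$-th word in document $d$ at time slice $t$. |

Table 1: Meaning of the notations.

$$\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}\left(\beta_{t-1,k},\ \sigma^{2} I\right) \tag{1}$$

$$\alpha_{t} \mid \alpha_{t-1} \sim \mathcal{N}\left(\alpha_{t-1},\ \delta^{2} I\right) \tag{2}$$

Here $\sigma^{2}$ and $\delta^{2}$ are the variances of the Gaussian noise that links consecutive time slices.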
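
To make the state space evolution concrete, the following minimal sketch simulates how one topic's parameters could drift from one time slice to the next under Gaussian noise. It is only an illustration, not the inference procedure of DTM: the function names, the vocabulary size, the noise scale `sigma` and the random starting point are hypothetical, and the softmax mapping from unconstrained parameters to a word distribution follows the construction in the DTM paper [2] rather than anything stated in this section.

```python
import math
import random


def softmax(values):
    """Map unconstrained parameters to a probability distribution."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evolve(previous, sigma, rng):
    """Draw the next slice's parameters from N(previous, sigma^2 I)."""
    return [rng.gauss(p, sigma) for p in previous]


def simulate_topic(vocab_size=5, num_slices=14, sigma=0.1, seed=0):
    """Simulate one topic's word distribution over consecutive time slices."""
    rng = random.Random(seed)
    # Hypothetical starting point; a fitted model would learn this from data.
    beta = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    trajectory = [softmax(beta)]
    for _ in range(num_slices - 1):
        beta = evolve(beta, sigma, rng)
        trajectory.append(softmax(beta))
    return trajectory


trajectory = simulate_topic()
for day, distribution in enumerate(trajectory[:3], start=1):
    print(day, [round(p, 3) for p in distribution])
```

A small `sigma` keeps consecutive distributions close to each other, which is the smoothness property the paragraph above describes; a larger value lets a topic change more abruptly between days.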

3.2 Tweets Sentiment Analysis

Sentiment Analysis (SA), also commonly referred to as Opinion Mining (OM), aims to find opinionated information and detect its sentiment polarity. SA techniques are now quite mature and many tools are openly available, such as Stanford’s CoreNLP [14], VADER [12], SentiStrength [19] and SentiCircles [17], several of which are designed for social media conversation. In this work, we employ VADER to identify the sentiment polarity of each tweet. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiment expressed in social media. It was introduced by Hutto et al. in 2014 and has been widely used in social media text sentiment analysis tasks [7, 5, 9, 23, 18, 24]. VADER classifies sentiment into negative, neutral, and positive categories using a compound score, which is computed by summing the valence scores of each word in the lexicon and normalizing the result to the range from -1 to 1, where “-1” represents the most extreme negative and “1” represents the most extreme positive. If the compound score is greater than 0.05, the text is considered positive; if the score is less than -0.05, it is considered negative; if the score is between -0.05 and 0.05, the text polarity is neutral. The biggest advantage of VADER is that it does not require data preprocessing or model training, and can be applied directly to the raw tweet to generate a sentiment polarity. Some examples of tweet sentiment results from VADER are shown in Figure 2. In this work, we use VADER to produce the sentiment polarity of each original tweet, namely positive, negative or neutral, and then combine it with the topic mining result from DTM to analyze topic-level sentiment.
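
The thresholding rule described above can be sketched as follows. The helper names `polarity_from_compound` and `label_tweets` and the two example tweets are hypothetical; `SentimentIntensityAnalyzer` and its `polarity_scores` method come from the open-source `vaderSentiment` package, which is assumed to be installed. The strict inequalities follow the wording of this section, so a score of exactly 0.05 or -0.05 is treated as neutral here.

```python
def polarity_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into one of three polarity labels."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


def label_tweets(tweets):
    """Score raw tweets with VADER and return one polarity label per tweet."""
    # Imported here so the thresholding helper works without the package.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [
        polarity_from_compound(analyzer.polarity_scores(tweet)["compound"])
        for tweet in tweets
    ]


# Hypothetical example tweets, not taken from the dataset.
examples = [
    "Stay home and stay safe everyone, we will get through this together!",
    "So many people are dying and nobody takes it seriously.",
]

try:
    print(label_tweets(examples))
except ImportError:
    print("vaderSentiment is not installed; run `pip install vaderSentiment` first.")
```

Keeping the threshold logic separate from the scoring call makes it easy to apply the same rule to scores that were computed and stored earlier.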

Figure 2: Example tweets with sentiment polarity inferred by VADER.

3.3 Measuring Topic Sentiment

After the above two steps, all tweets will be clustered in the corresponding topics, and marked with sentiment polarity. The sentiment polarity is aggregated over tweets to estimate the overall sentiment distribution. For each topic per day, we sum up the number of positive, negative and neutral tweets in the topic, thus, each topic is associated with three sentiment counts.

$$N_{k,t} = N_{k,t}^{pos} + N_{k,t}^{neg} + N_{k,t}^{neu} \tag{3}$$

$$N = \sum_{t=1}^{T} \sum_{k=1}^{K} N_{k,t} \tag{4}$$

For each topic in the studied days, we define a sentiment distribution in Eq. 3 and 4. $T$ is the total number of studied days and $K$ is the total number of topics. $N_{k,t}$ represents the total number of tweets under topic $k$ on day $t$, and $N_{k,t}^{pos}$, $N_{k,t}^{neg}$ and $N_{k,t}^{neu}$ respectively denote the positive, negative and neutral tweet counts. $N$ represents the total number of tweets in our dataset. We observe the daily hot topics about COVID-19 on Twitter (April 1 to April 14), and analyze the sentiment polarity distribution of each topic. The details are presented in Section 4.3.
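
The per-topic, per-day counting described in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the record layout `(day, topic, polarity)`, the function names and the sample records are hypothetical, and each tweet is assumed to carry exactly one topic index and one polarity label.

```python
from collections import Counter, defaultdict

LABELS = ("positive", "negative", "neutral")


def topic_sentiment_counts(records):
    """Count positive, negative and neutral tweets per (topic, day) pair."""
    counts = defaultdict(Counter)
    for day, topic, polarity in records:
        counts[(topic, day)][polarity] += 1
    return counts


def sentiment_shares(counter):
    """Convert the three counts of one (topic, day) pair into proportions."""
    total = sum(counter[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counter[label] / total for label in LABELS}


# Hypothetical labelled tweets: (day, topic index, polarity).
records = [
    ("2020-04-01", 11, "positive"),
    ("2020-04-01", 11, "positive"),
    ("2020-04-01", 11, "neutral"),
    ("2020-04-01", 64, "negative"),
    ("2020-04-02", 64, "negative"),
    ("2020-04-02", 64, "positive"),
]

counts = topic_sentiment_counts(records)
total_tweets = sum(sum(counter.values()) for counter in counts.values())

for (topic, day), counter in sorted(counts.items()):
    print(topic, day, sentiment_shares(counter))
```

Summing the three counts of one pair gives the number of tweets for that topic on that day, and summing over all pairs recovers the size of the dataset.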

4 Experimental Study

In this section, we will introduce the processes of data collection and data preprocessing, and how to determine the optimum topic number in DTM.

4.1 Data Collection and Preprocessing

The data collection process is as follows. First, we obtain tweet IDs from the publicly available coronavirus Twitter dataset (https://github.com/echen102/covid-19-tweetids) collected by Chen et al. [4] through the Twitter API (https://developer.twitter.com/en/docs/api-reference-index). That dataset actively tracks tweets based on keywords such as Coronavirus, Covid, Covid19 and Wuhanlockdown, and account names such as CDCemergency, CDCgov, WHO and HHSGov. Using the given tweet IDs, we collect a total of 13,746,822 tweets posted from April 1 to April 14 with Tweepy (a Python library for the Twitter API). After that, the non-English tweets are removed, leaving a total of 8,430,793 English tweets.

Data preprocessing, or cleaning, is the first step of a text mining task [22]. It includes converting all letters to lowercase and removing stop words, non-English letters, URLs, etc. Then, phrase extraction is used to ensure that phrases such as “human_rights” become one token instead of the separate tokens “human” and “rights”. Additionally, lemmatization is adopted to remove inflectional endings and return the base or dictionary form of a word. After preprocessing, we remove very short tweets (fewer than 6 words) and retain a total of 4,919,471 tweets with 269,391 unique tokens. Figure 3 shows the ratio of raw tweets, English tweets, and finally adopted tweets. Table 2 shows the number of daily tweets used for the experiment.
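
A minimal sketch of the cleaning steps above, using only the Python standard library, might look like this. The stop-word list is a tiny illustrative subset, the regular expressions and function names are hypothetical, and phrase extraction and lemmatization are deliberately left out because they need an external NLP library; underscores are kept so that phrase tokens such as “human_rights” would survive if they were already joined.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "are", "at", "be", "for", "in", "is",
    "it", "of", "on", "or", "the", "this", "to", "we", "with",
}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z_\s]")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs and non-letters, and drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = NON_LETTER_PATTERN.sub(" ", text)
    return [token for token in text.split() if token not in STOP_WORDS]


def preprocess(tweets, min_tokens=6):
    """Clean every tweet and keep only those with at least min_tokens tokens."""
    cleaned = (clean_tweet(tweet) for tweet in tweets)
    return [tokens for tokens in cleaned if len(tokens) >= min_tokens]


# Hypothetical raw tweets, not taken from the dataset.
raw_tweets = [
    "Stay HOME and stay safe! https://t.co/abc #COVID19",
    "Health workers need more masks and protective equipment now",
]

print(preprocess(raw_tweets))
```

In this example the first tweet shrinks to five tokens after cleaning and is therefore dropped by the length filter, while the second is kept.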

Figure 3: Statistics of tweets in dataset. The green rectangle represents the total tweets per day, the blue rectangle represents the English tweets in all tweets, and the yellow rectangle represents the pre-processed tweets used in the experiment.

| Date | Total |
| --- | --- |
| April 1 | 374,327 |
| April 2 | 355,504 |
| April 3 | 350,211 |
| April 4 | 354,884 |
| April 5 | 352,499 |
| April 6 | 355,478 |
| April 7 | 373,342 |
| April 8 | 377,615 |
| April 9 | 373,812 |
| April 10 | 334,297 |
| April 11 | 335,607 |
| April 12 | 307,422 |
| April 13 | 331,806 |
| April 14 | 342,667 |

Table 2: Size of daily tweets after preprocessing.

4.2 Topic Model Setup

The number of topics is a crucial parameter in topic modeling, and a suitable value helps make the resulting topics human interpretable. Following [2], to set up the first slice of the Dynamic Topic Model (DTM), we fit LDA on the first day of the dataset to learn the best topic number. We use the Gensim package in Python to train the LDA model for this selection. As the selection criterion, we use coherence [16], which measures the degree of semantic similarity between the high scoring words of a topic. The coherence score helps distinguish human understandable topics from artifacts of statistical inference.

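$$\text{Coherence} = \sum_{i<j} \text{score}(w_i, w_j) \tag{5}$$

where $w_i$ and $w_j$ are taken from the top most frequently occurring words in each topic, and the pairwise scores of these top words are aggregated into the coherence of the topic.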
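
As an illustration of aggregating pairwise scores over a topic's top words, the sketch below uses a UMass-style document co-occurrence score. That choice is an assumption made for the example: this section does not state which pairwise score was used, and Gensim offers several coherence measures. The function names and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def document_frequency(doc_sets, *words):
    """Number of documents that contain every one of the given words."""
    return sum(1 for doc in doc_sets if all(word in doc for word in words))


def pair_score(doc_sets, earlier, later):
    """UMass-style score: log((D(earlier, later) + 1) / D(earlier))."""
    base = document_frequency(doc_sets, earlier)
    if base == 0:
        raise ValueError(f"word {earlier!r} does not occur in the corpus")
    joint = document_frequency(doc_sets, earlier, later)
    return math.log((joint + 1) / base)


def topic_coherence(top_words, documents):
    """Sum the pairwise scores over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in documents]
    return sum(
        pair_score(doc_sets, w_i, w_j)
        for w_i, w_j in combinations(top_words, 2)
    )


# Hypothetical tokenized corpus and two candidate topics.
documents = [
    ["stay", "home", "safe"],
    ["stay", "home"],
    ["case", "report", "death"],
    ["new", "case", "report"],
]
coherent = topic_coherence(["stay", "home", "safe"], documents)
mixed = topic_coherence(["stay", "case", "death"], documents)
print(round(coherent, 3), round(mixed, 3))
```

Words that tend to appear in the same documents raise the score, so the first topic scores higher than the second. To choose a topic number, one could train a model for each candidate value, average this score over its topics, and keep the candidate with the highest average.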

Figure 4 shows the coherence scores of different topic numbers on the LDA model based on one day tweets. As we can see, when the topic number is 70, the coherence score gets the maximum value that is around 0.39. Therefore, we assume the total number of topics is stable and set the topic number for DTM as 70.

Figure 4: Coherence score of different topic numbers on LDA.

4.3 Results

Figure 5 shows an overview of the number of deaths and confirmed cases; these data were collected from WHO (https://covid19.who.int/). COVID-19 gradually expanded to all parts of the world and drew the attention of many countries, while the reported numbers of confirmed cases and deaths kept increasing.

Figure 5: Worldwide deaths and confirmed cases of COVID-19 during the study period.

4.4 Overall Sentiment Dynamics

Figure 6 presents the overall sentiment distribution on Twitter during the study period. The number of tweets about COVID-19 is around 350,000 per day, and the daily numbers of positive and negative tweets are similar, both greater than the number of neutral tweets. On most days, the number of positive tweets is slightly higher than the number of negative tweets. This shows that despite the spread of COVID-19, the community showed a predominantly positive sentiment during the study period. This observation is consistent with the findings of other researchers, who reported similar conclusions based on country-level sentiment analysis [1, 13, 24].

Figure 6: The overall sentiment dynamics on Twitter during the study period.

4.5 Daily Hot Topics

For the daily topics discussed by users, we use DTM to generate 70 topics and then observe the popularity of each topic. Table 3 shows the 10 highest-volume topics with their most relevant words. To find the hottest topics discussed in the study period, we rank the topics by their number of tweets. Figure 7 shows the indices of the top 10 topics for each day, sorted by volume from the top row to the bottom row. As we can see, topics 11, 49, and 64 are consistently ranked as the top 3 daily hot topics.
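
Ranking topics by daily volume can be sketched as follows. The sketch assumes that every tweet has already been assigned to a single topic index, for example its most probable topic under the fitted model; that assignment rule, the function name and the sample data are hypothetical.

```python
from collections import Counter


def top_topics_per_day(assignments, k=10):
    """Return, for each day, the k topic indices with the most tweets."""
    per_day = {}
    for day, topic in assignments:
        per_day.setdefault(day, Counter())[topic] += 1
    return {
        day: [topic for topic, _ in counter.most_common(k)]
        for day, counter in per_day.items()
    }


# Hypothetical (day, topic index) pairs, one per tweet.
assignments = [
    ("2020-04-01", 11), ("2020-04-01", 11), ("2020-04-01", 11),
    ("2020-04-01", 49), ("2020-04-01", 49),
    ("2020-04-01", 64),
    ("2020-04-02", 64), ("2020-04-02", 64),
    ("2020-04-02", 11),
]

ranking = top_topics_per_day(assignments, k=2)
print(ranking)
```

Each list is ordered from the highest to the lowest volume, matching the top-to-bottom layout described for Figure 7.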

| Topic 64 (disease) | Topic 49 (report) | Topic 11 (stay home) | Topic 26 (lockdown) | Topic 31 (life) |
| --- | --- | --- | --- | --- |
| people | case | stay | day | time |
| die | new | home | lockdown | good |
| ignore | death | safe | fight | first |
| seriously | report | go | covid | life |
| get | total | employee | road | talk |
| disease | important | toxic_relationship | go | hard |
| take | far | healthy | medical_supplie | hour |
| go | covid | request | deliver | go |
| say | bank | complete | benefit | failure |
| intelligence | stuff | panic | stayhom | pandemic |

| Topic 43 (work) | Topic 16 (social_distancing) | Topic 66 (stop spread) | Topic 56 (face mask) | Topic 40 (health care) |
| --- | --- | --- | --- | --- |
| know | think | virus | make | health |
| work | thing | spread | mask | care |
| medium | month | stop | face | right |
| create | next | woman | share | company |
| lead | social_distancing | leader | wear | risk |
| little | go | move | sell | worker |
| difficult | vaccine | citizen | concern | result |
| street | get | act | stayhome | would |
| need | article | deadly | expect | demand |
| tip | finally | slow | wonder | resource |

Table 3: Top 10 topics about COVID-19 on Twitter during the study period. Under each topic, the most contributing words related to the topic are displayed.

Figure 7: The top ten topics per day. The number in the cell represents the index of the topic, and daily topics are sorted in descending order according to volume.

Figure 8: The most significant words for the three hot topics: (a) Topic 11, (b) Topic 49, (c) Topic 64.

Figure 8 shows the most significant words associated with the top three topics. Topics 11, 49, and 64 reflect common concerns discussed by people: staying at home to ensure safety, the latest case reports, and deaths due to the disease.

To analyze these three topics further, we examine people’s opinions on them. We compute the proportion of each sentiment polarity among the tweets under these topics to observe how people’s sentiment changes as the pandemic spreads. The results are shown in Figures 9 to 11. Overall, sentiment polarity varies from topic to topic.

Topic 11 is related to staying at home, and our results show that the public kept a predominantly positive sentiment towards this measure to prevent infection. This may be because people worked or studied at home and enjoyed more time with their families. They also held a positive belief in the fight against COVID-19 by the government and society.

Figure 9: The sentiment dynamics of topic 11 during the study period.

Topic 49 is about the latest case reports. The dominant sentiment around this topic was mostly negative, although positive sentiment was also present. As the pandemic spread to more countries, the numbers of confirmed cases and deaths continued to rise, and people felt that the actual situation was worse than they had expected.

Figure 10: The sentiment dynamics of topic 49 during the study period.

Topic 64 is about deaths due to COVID-19. It shows a result diametrically opposite to topic 11: the number of tweets with negative sentiment is much higher than the others. Users appeared to believe that the outbreak could not be fully controlled because policy makers had ignored the seriousness of the pandemic, and they expressed strong dissatisfaction with this outcome.

Figure 11: The sentiment dynamics of topic 64 during the study period.

5 Conclusions and Future Work

This study conducted a comprehensive analysis of hot topics and their associated sentiment polarity distributions during the COVID-19 period. Instead of a country-level study, the sentiment in this work was analyzed at the global level on more than 13 million tweets collected from Twitter over two weeks from 1 April 2020. The results showed that people were concerned about the latest confirmed coronavirus cases, measures to prevent infection, and the attitudes and specific measures of governments towards the pandemic. The overall sentiment polarity was positive, but sentiment polarity varied from topic to topic. More topics can be explored in the future based on the current study. For example, more specific topics, such as food shortages, job losses, debt and studying at home, can be analyzed to help policy makers, governments and local communities formulate measures to prevent the spread of negative emotions on social networks.

References

  • [1] M. Bhat, M. Qadri, M. K. Noor-ul-Asrar Beg, N. Ahanger, and B. Agarwal (2020) Sentiment analysis of social media response on the covid19 outbreak. Brain, Behavior, and Immunity. Cited by: §4.4.
  • [2] D. M. Blei and J. D. Lafferty (2006) Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pp. 113–120. Cited by: §1, §3.1, §4.2.
  • [3] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. Journal of machine Learning research 3 (Jan), pp. 993–1022. Cited by: §3.1.
  • [4] E. Chen, K. Lerman, and E. Ferrara (2020) Covid-19: the first public coronavirus twitter dataset. arXiv preprint arXiv:2003.07372. Cited by: §4.1.
  • [5] J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec (2017) Anyone can become a troll: causes of trolling behavior in online discussions. In Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, pp. 1217–1230. Cited by: §3.2.
  • [6] M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, and A. Scala (2020) The covid-19 social media infodemic. arXiv preprint arXiv:2003.05004. Cited by: §2.
  • [7] T. Davidson, D. Warmsley, M. Macy, and I. Weber (2017) Automated hate speech detection and the problem of offensive language. In Eleventh international aaai conference on web and social media, Cited by: §3.2.
  • [8] A. Depoux, S. Martin, E. Karafillakis, R. Preet, A. Wilder-Smith, and H. Larson (2020) The pandemic of social media panic travels faster than the covid-19 outbreak. Oxford University Press. Cited by: §2.
  • [9] E. Ferrara and Z. Yang (2015) Measuring emotional contagion in social media. PloS one 10 (11), pp. e0142390. Cited by: §3.2.
  • [10] X. Han, J. Wang, M. Zhang, and X. Wang (2020) Using social media to mine and analyze public opinion related to covid-19 in china. International Journal of Environmental Research and Public Health 17 (8), pp. 2788. Cited by: §1, §2.
  • [11] B. Huang and K. M. Carley (2020) Disinformation and misinformation on twitter during the novel coronavirus outbreak. arXiv preprint arXiv:2006.04278. Cited by: §2.
  • [12] C. J. Hutto and E. Gilbert (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media, Cited by: §1, §3.2.
  • [13] K. Jaidka, S. Giorgi, H. A. Schwartz, M. L. Kern, L. H. Ungar, and J. C. Eichstaedt (2020) Estimating geographic subjective well-being from twitter: a comparison of dictionary and data-driven language methods. Proceedings of the National Academy of Sciences 117 (19), pp. 10165–10171. Cited by: §4.4.
  • [14] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky (2014) The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp. 55–60. Cited by: §3.2.
  • [15] D. Prabhakar Kaila, D. A. Prasad, et al. (2020) Informational flow on twitter–corona virus outbreak–topic modelling approach. International Journal of Advanced Research in Engineering and Technology (IJARET) 11 (3). Cited by: §1, §2.
  • [16] M. Röder, A. Both, and A. Hinneburg (2015) Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining, pp. 399–408. Cited by: §4.2.
  • [17] H. Saif, M. Fernandez, Y. He, and H. Alani (2014) Senticircles for contextual and conceptual semantic sentiment analysis of twitter. In European Semantic Web Conference, pp. 83–98. Cited by: §3.2.
  • [18] K. Sharma, S. Seo, C. Meng, S. Rambhatla, A. Dua, and Y. Liu (2020) Coronavirus on social media: analyzing misinformation in twitter conversations. arXiv preprint arXiv:2003.12309. Cited by: §2, §3.2.
  • [19] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas (2010) Sentiment strength detection in short informal text. Journal of the American society for information science and technology 61 (12), pp. 2544–2558. Cited by: §3.2.
  • [20] World Health Organization (2020) Coronavirus disease (covid-19) pandemic. Note: [Online; accessed 2020-05-15] External Links: Link Cited by: §1.
  • [21] S. Yang, G. Huang, and B. Cai (2019) Discovering Topic Representative Terms for Short Text Clustering. IEEE Access 7, pp. 92037–92047. External Links: Document, ISSN 2169-3536, Link Cited by: §1.
  • [22] S. Yang, G. Huang, B. Ofoghi, and J. Yearwood (2020) Short text similarity measurement using context-aware weighted biterms. In Concurrency Computation, External Links: Document, ISSN 15320634 Cited by: §4.1.
  • [23] Q. You, J. Luo, H. Jin, and J. Yang (2016) Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM international conference on Web search and data mining, pp. 13–22. Cited by: §3.2.
  • [24] J. Zhou, S. Yang, C. Xiao, and F. Chen (2020) Examination of community sentiment dynamics due to covid-19 pandemic: a case study from australia. CoRR abs/2006.12185. External Links: Link, 2006.12185 Cited by: §1, §1, §2, §3.2, §4.4.