Since late 2019, the COVID-19 pandemic has rapidly affected over 200 countries, areas and territories. As of April 19, 2020, according to the World Health Organization (WHO), 2,241,359 COVID-19 cases had been confirmed worldwide, with 152,551 deaths (https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200419-sitrep-90-covid-19.pdf?sfvrsn=551d47fd_2). The disease has had a tremendous impact on daily life around the world.
In light of the deteriorating situation in the United States, discussions of the pandemic on social media have drastically increased since March 2020. Within these discussions, an overwhelming trend is the use of controversial terms targeting the Asian and, specifically, the Chinese population, insinuating that the virus originated in China. On March 16, the President of the United States, Donald Trump, posted on Twitter calling COVID-19 the “Chinese Virus” (https://twitter.com/realdonaldtrump/status/1239685852093169664). Around March 18, media coverage of the term “Chinese Flu” also took off (https://blog.gdeltproject.org/is-it-coronavirus-or-covid-19-or-chinese-flu-the-naming-of-a-pandemic/). Although most public figures who used the controversial terms claimed them to be non-discriminatory, such terms have stimulated racism and discrimination against Asian-Americans in the US, as reported by the New York Times (https://www.nytimes.com/2020/03/23/us/chinese-coronavirus-racist-attacks.html), the Washington Post (https://www.washingtonpost.com/nation/2020/03/20/coronavirus-trump-chinese-virus/), the Guardian (https://www.theguardian.com/world/2020/mar/24/coronavirus-us-asian-americans-racism), and other mainstream news media.
A recent study used social media data to characterize users who used controversial or non-controversial terms associated with COVID-19 and found associations between the use of controversial terms and demographics, user-level features, political following status, and geo-location attributes [Lyu2020sense]. In this study, we analyse, from a language perspective, crawled tweets (Twitter posts) with and without controversial terms associated with COVID-19. To operationalize this idea, we perform two investigations. First, latent Dirichlet allocation (LDA) [blei2003latent] is applied to extract the topics in controversial and non-controversial posts. Next, LIWC2015 (Linguistic Inquiry and Word Count 2015) [pennebaker2015development] is applied to build multi-dimensional profiles of the posts. We then compare the topics and profiles of the controversial and non-controversial posts to investigate any association between the use of controversial terms and the underlying mindsets.
|Classification||Topics||Top 10 Topic Words|
|Controversial||Racism||call chinese virus stop racist people let kill pandemic infect|
|Controversial||Anecdote||virus chinese people say world covid would death call take|
|Controversial||Conspiracy||chinese spread virus be send full conspiracy exactly theory vid|
|Controversial||Work in hospital||get good chinese test fight virus hospital need work last|
|Controversial||Blame the lie||chinese virus must people lie blame tweet fact need die|
|Non-controversial||Test cases||case test covid death new day positive patient break number|
|Non-controversial||Anecdote||tell virus covid man story spread free say try government|
|Non-controversial||Trump||say people know get trump could make would|
|Non-controversial||Health workers||help need covid crisis health fight worker work government pandemic|
|Non-controversial||Stay home||people virus home stay corona take get die|
Note: All appearances of “Chinese virus” related keywords were removed prior to the LDA process. Bigrams and trigrams were included in the LDA model, but none of them appears in the top 10 topic words because they are too infrequent. Some topics contain fewer than 10 topic words because short words (fewer than 3 characters) were deleted.
II Related Work
Our work builds on previous studies that mine text from social media during influential events.
Studies have been conducted using topic modeling, a process of identifying topics in a collection of documents. The commonly used model, latent Dirichlet allocation (LDA), provides a way to automatically detect a given number of hidden topics [blei2003latent]. Previous research has inferred topics from social media. Kim et al. [kim2016topic] investigated the topic coverage and sentiment dynamics on Twitter and in the news press regarding the issue of Ebola. Chen et al. [Chen2019ecig] used LDA-generated topics from e-cigarette related posts on Reddit to identify potential associations between e-cigarette use and various self-reported health symptoms. Wang et al. [wang2016catching] applied negative binomial regression to abstract LDA topics to model the “likes” on Trump’s Twitter and infer topic preferences among followers.
A large number of studies have been performed with LIWC, a tool (https://liwc.wpengine.com/) for linguistic analysis of documents. Tumasjan et al. [tumasjan2010predicting] used LIWC to capture political sentiment and predict elections with Twitter. The tool was also used by Zhang et al. [zhang2020contributes] to provide insights into the sentiment of the descriptions of crowdfunding campaigns. Our motivation is to combine qualitative analysis with LDA and quantitative analysis with LIWC, and to comparatively investigate discrepancies between the tweets that use controversial terms associated with COVID-19 and the tweets that use non-controversial terms.
III Data and Methodology
The related tweets (Twitter posts) were crawled with the Tweepy API using keyword filtering, following [Lyu2020sense]. Simultaneous streams were collected to build the controversial dataset (CD) and the non-controversial dataset (ND) from March 23 to April 5, 2020. The controversial keywords consist of “Chinese virus” and “#ChineseVirus”, whereas the non-controversial keywords include “corona”, “covid-19”, “covid19”, “coronavirus”, “#Corona”, “#Covid 19”. In total, 2,607,753 tweets for CD and 69,627,062 tweets for ND were collected. We then randomly sampled 2 million tweets from each of the two datasets. For preprocessing, we removed all URLs, emails and newlines, as they are not informative for language or textual analysis.
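The preprocessing and sampling steps described above could look roughly like the following sketch. The regular expressions and the helper names `clean_tweet` and `sample_tweets` are assumptions made for illustration; the exact patterns used in the study are not given in the text.

```python
import random
import re

# Assumed patterns: the study does not state its exact expressions.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")


def clean_tweet(text):
    """Remove URLs and e-mail addresses, and replace newlines with spaces."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    # split() without arguments also collapses newlines and repeated spaces.
    return " ".join(text.split())


def sample_tweets(tweets, k, seed=0):
    """Draw up to k tweets uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(tweets, min(k, len(tweets)))
```

With the full datasets, `k` would be 2,000,000 for each of CD and ND; the fixed seed is a convenience for reproducibility, not something the study reports.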
Latent Dirichlet allocation (LDA) and LIWC2015 are the two methods we employed in the textual analysis. We first use LDA to extract topics from the tweets in CD and ND, respectively. The hyperparameters were tuned with an experimental dataset, with the objective of maximizing the Cv coherence score. (Cv is a performance measure based on a sliding window, one-set segmentation of the top words and an indirect confirmation measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity.) In the end, we set the number of topics to 5 and the number of words per topic to 10. Since the objective of topic modeling is to find what people talk about when using controversial or non-controversial terms, we also masked all appearances of the aforementioned streaming keywords by deleting them from the documents. Next, bi-grams and tri-grams were applied to the documents. We performed a qualitative, comparative analysis to find differences and similarities between the topics generated from the two datasets.
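To make the keyword masking and n-gram steps concrete, here is a minimal sketch of how documents might be prepared before being passed to an LDA implementation. The function names, the alphabetic tokenizer, and the simple frequency threshold for keeping n-grams are all assumptions; the study does not specify how its bi-grams and tri-grams were selected.

```python
import re
from collections import Counter


def mask_keywords(text, keywords):
    """Delete every case-insensitive occurrence of the given keywords."""
    # Longest first, so a keyword that contains another is removed whole.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return " ".join(pattern.sub(" ", text).split())


def tokenize(text, min_len=3):
    """Lowercase, keep alphabetic tokens, drop words shorter than min_len."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len]


def add_ngrams(docs, n=2, min_count=2):
    """Append to each document its n-grams seen at least min_count times overall."""
    def grams(tokens):
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    counts = Counter(g for tokens in docs for g in grams(tokens))
    return [tokens + [g for g in grams(tokens) if counts[g] >= min_count]
            for tokens in docs]
```

Calling `add_ngrams` once with `n=2` and once with `n=3` on the original token lists would give both bi-grams and tri-grams. The `min_len=3` default mirrors the note under Table I about deleting words shorter than 3 characters.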
Next, LIWC2015 was applied to extract the sentiment of the tweets of CD and ND. LIWC2015 is a dictionary-based linguistic analysis tool that counts the percentage of words reflecting different emotions, thinking styles, and social concerns, and captures people’s psychological states (https://liwc.wpengine.com/how-it-works/). We focused on 4 summary linguistic variables and 12 more detailed variables that reflect psychological states, cognition, drives, time orientation, and personal concerns of the Twitter users of CD and ND. We followed a methodology similar to that of Yu et al. [yu2008exploring], concatenating all tweets posted by users of CD and ND, respectively. One text sample was composed of all the tweets from the sampled CD dataset, and the other of all the tweets from the sampled ND dataset. We then applied LIWC2015 to these two text samples. In the end, there are 16 linguistic variables for each of CD and ND.
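The dictionary-based counting idea can be illustrated with a small sketch. The categories and word lists below are hypothetical stand-ins, since the actual LIWC2015 dictionary is proprietary and much larger, and the summary variables (such as clout or emotional tone) are derived by LIWC's own scoring rather than by a plain percentage.

```python
import re

# Hypothetical mini-dictionary for illustration only; NOT the LIWC2015 dictionary.
TOY_DICTIONARY = {
    "anger": {"hate", "kill", "angry"},
    "anxiety": {"worried", "afraid", "fear"},
    "they": {"they", "them", "their"},
}


def category_percentages(tweets, dictionary):
    """Concatenate tweets into one sample and return the % of words per category."""
    words = re.findall(r"[a-z']+", " ".join(tweets).lower())
    total = len(words)
    if total == 0:
        return {name: 0.0 for name in dictionary}
    return {
        name: 100.0 * sum(w in vocab for w in words) / total
        for name, vocab in dictionary.items()
    }
```

Running `category_percentages` once on the concatenated CD sample and once on the concatenated ND sample would give two comparable profiles, one value per category, which is the shape of comparison the analysis relies on.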
IV-A Topic Modeling Results
In Table I, 5 topics generated by LDA on CD and ND are reported, respectively, together with the top 10 topic words. We manually assigned each topic a topic name to generalize what would most likely be discussed under the topic.
Comparing across CD and ND, we observe that the topics in CD contain more opinions. Three of the 5 topics have very strong opinion-related topic words, which are highlighted in Table I. In the topic “blame the lie”, a very strong signal about lie and blame is present, as well as an indication of some correlation between “lie” and people “die”. In contrast, the generated topics in ND are more related to fighting the pandemic (n=3), stories (n=1) and government (n=1), which are all related to COVID-19 in the US. No strong opinion-related keywords could be found.
One finding is that all 5 topics in CD contain the topic word “chinese”, even though we removed all keywords related to “Chinese virus” from the documents before running LDA. This suggests that discussions in CD are closely related to China or the Chinese people or government. In addition, all 5 topics in CD contain the topic word “virus”, whereas only 2 of the 5 topics in ND contain this topic word. Such a difference suggests that in CD, discussions are more related to the virus, while in ND discussions are more related to the pandemic caused by this virus. In fact, only one topic in CD is about how people work towards containing the pandemic (“work in hospital”), whereas 3 topics in ND discuss measures to relieve this pandemic (“test cases,” “health workers” and “stay home”).
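The topic-word counts quoted above can be checked directly against the topic words listed in Table I. The short sketch below does this; the dictionary names and the helper `topics_containing` are introduced here only for illustration.

```python
# Topic words copied from Table I.
CD_TOPICS = {
    "Racism": "call chinese virus stop racist people let kill pandemic infect",
    "Anecdote": "virus chinese people say world covid would death call take",
    "Conspiracy": "chinese spread virus be send full conspiracy exactly theory vid",
    "Work in hospital": "get good chinese test fight virus hospital need work last",
    "Blame the lie": "chinese virus must people lie blame tweet fact need die",
}
ND_TOPICS = {
    "Test cases": "case test covid death new day positive patient break number",
    "Anecdote": "tell virus covid man story spread free say try government",
    "Trump": "say people know get trump could make would",
    "Health workers": "help need covid crisis health fight worker work government pandemic",
    "Stay home": "people virus home stay corona take get die",
}


def topics_containing(topics, word):
    """Count how many topics list the given word among their topic words."""
    return sum(word in words.split() for words in topics.values())


print(topics_containing(CD_TOPICS, "chinese"))  # 5
print(topics_containing(CD_TOPICS, "virus"))    # 5
print(topics_containing(ND_TOPICS, "virus"))    # 2
```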
These discrepancies in the topic modeling results contradict the claim of “only referring to the geo-locational origin of the pandemic” made by some public figures who used “Chinese virus” when referring to COVID-19. Such words have provoked, to a certain degree, racist or xenophobic opinions and hate speech towards China or people of Chinese ethnicity on social media. Furthermore, hate speech can spread extremely fast on online social media platforms and can stay online for a long time [gag2015counter]. Gagliardone et al. [gag2015counter] found that such speech is also itinerant, meaning that even after it is forcibly removed by the platforms, related expressions can still be found elsewhere on the Internet and even offline.
IV-B LIWC Sentiment Features
Fig. 1 shows the 4 summary variables for CD and ND. We observe that the clout scores for CD and ND are similar. A high clout score suggests that the author is speaking from the perspective of high expertise [pennebaker2015development]. At the same time, the analytical thinking, authentic and emotional tone scores for ND are higher than those for CD. The analytical thinking score reflects the degree of hierarchical thinking; higher numbers indicate more logical and formal thinking [pennebaker2015development]. A higher authentic score suggests that the content of the text is more honest, personal and disclosing [pennebaker2015development]. The emotional tone scores for CD and ND are both lower than 50, indicating that the overall emotions for CD and ND are negative. This is consistent with our expectation. However, the emotional tone score for ND is higher than that for CD, indicating that the Twitter users in ND express relatively more positive emotion.
Fig. 2 shows 12 more detailed linguistic variables of the tweets of CD and ND. The scores of “future-oriented” and “past-oriented” reflect the temporal focus of attention of the Twitter users, based on the verb tenses used in the tweets [tausczik2010psychological]. The tweets of ND are more future-oriented, while those of CD are more past-oriented. To better understand this difference, we conducted an analysis similar to that of Gunsch et al. [gunsch2000differential]. We extracted five more linguistic variables: four pronoun scores and one time-orientation score. The scores of “i”, “we”, “she/he”, “they”, and present orientation are shown in Table II. The tweets of CD show more other-references (“they”), whereas more self-references (“i”, “we”) are present in the tweets of ND. The scores for “she/he” of CD and ND are close. The present-orientation score of CD is higher than that of ND. From this observation, which is similar to the findings of Gunsch et al. [gunsch2000differential], we can infer that the tweets of CD focus on the past and present actions of others, and the tweets of ND focus more on the future acts of the authors themselves.
Research shows that LIWC can identify the emotion in language use [tausczik2010psychological]. As discussed above, the tweets of both CD and ND express negative emotion, and the emotion expressed by the Twitter users of ND is relatively more positive. This is consistent with the positive emotion and negative emotion scores. However, there are nuanced differences across the sadness, anxiety and anger scores. When referring to COVID-19, the tweets of ND express more sadness and anxiety than those of CD do, while more anger is expressed in the tweets of CD.
The certainty and tentativeness scores reveal the extent to which the event the author is going through may have been established or is still being formed [tausczik2010psychological]. A higher percentage of words like “always” or “never” results in a higher score for certainty, and a higher percentage of words like “maybe” or “perhaps” leads to a higher score for tentativeness [pennebaker2015development]. We observe a higher tentativeness score and a higher certainty score for the tweets of CD, while these two scores for the tweets of ND are both lower. We have an interesting hypothesis for this subtle difference. Since 1986, Pennebaker et al. [pennebaker2015development] have been collecting text samples from a variety of studies, including blogs, expressive writing, novels, natural speech, the New York Times, and Twitter, to get a sense of the degree to which language varies across settings. Of all these sources, the tentativeness and certainty scores for the text of the New York Times are the lowest, whereas these two scores for expressive writing, blogs, and natural speech are relatively higher. This observation leads to our hypothesis that the tweets of CD are more like blogs, expressive writing, or natural speech, which focus on expressing ideas, whereas the tweets of ND are more like newspaper articles, which focus more on describing facts.
As for the score of “achievement”, McClelland [mcclelland1979inhibited] found that the stories people told in response to drawings of people could provide important clues to their needs for achievement. We hypothesize that the higher value of the “achievement” score for the tweets of ND reflects the need of these Twitter users to succeed in fighting against COVID-19. As for personal concerns, the scores for “work” and “money” of ND are both higher than those of CD, which shows that the Twitter users of ND focus more on work and money issues (e.g. working from home, unemployment). According to the reports of the U.S. Department of Labor, the advance seasonally adjusted insured unemployment rate was 8.2% for the week ending April 4. The previous high was 7.0% in May of 1975 (https://www.dol.gov/ui/data.pdf).
V Conclusion and Future Work
We have presented a study of the topic preferences related to the use of controversial and non-controversial terms associated with COVID-19 on Twitter during the ongoing COVID-19 pandemic. We first use LDA to extract topics from the controversial and non-controversial posts crawled from Twitter, and then qualitatively compare them across the two sets of posts. We find that topics in the controversial posts are more related to China, even after the keywords related to “Chinese virus” were removed before the analysis, whereas discussions in non-controversial posts are more related to fighting the pandemic in the US. We also find differences in sentiment between the tweets posted by users of controversial terms and those posted by users of non-controversial terms. Both groups express negative emotion, yet the tweets of ND are relatively more positive. The tweets of ND also show more analytical thinking and are expressed in a more truthful manner. The tweets of CD focus more on the past and present actions of others, while the tweets of ND focus more on the future acts of the authors themselves. More anger is present in the tweets of CD, while more anxiety and sadness are observed in the tweets of ND. More tentativeness and certainty are observed in the tweets of CD, which is not contradictory, since these two scores are both higher in text samples from blogs and expressive writing that focus on expressing ideas and opinions. These two scores are both lower for the tweets of ND, which is similar to the case of newspaper articles such as those of the New York Times. Tweets of ND reflect a strong need for achievement. As for personal concerns, users of ND focus more on work and money issues.
It is reported that the widespread use of controversial terms associated with COVID-19 has induced hate speech and, to some degree, racism and xenophobia on social media. Therefore, such content on social media should be closely monitored to prevent a further escalation of the situation, which could potentially lead to social unrest. Our next step is to use textual, demographic and account-level features to detect hate speech on social media, in an effort to predict and analyze such behaviors for uses including but not limited to social media content management and policy-making.