Beyond a binary of (non)racist tweets: A four-dimensional categorical detection and analysis of racist and xenophobic opinions on Twitter in early Covid-19

07/18/2021 ∙ by Xin Pei, et al. ∙ 0

Moving beyond the binary categorization of racist and xenophobic texts, this research draws on social science theories to develop a four-dimensional categorization for racism and xenophobia detection: stigmatization, offensiveness, blame, and exclusion. With the aid of deep learning techniques, this categorical detection enables insights into the nuances of emergent topics reflected in racist and xenophobic expression on Twitter. Moreover, a stage-wise analysis is applied to capture the dynamic changes of the topics across the early development of Covid-19, from a domestic epidemic to an international public health emergency and later to a global pandemic. The contributions of this research are twofold. First, it offers a methodological advancement: by bridging state-of-the-art computational methods with a social science perspective, it provides a meaningful approach for future research to gain insight into the underlying subtlety of racist and xenophobic discussion on digital platforms. Second, by enabling a more accurate comprehension and even prediction of public opinions and actions, it paves the way for the enactment of effective intervention policies to combat racist crimes and social exclusion under Covid-19.

1 Introduction

The rise of racism and xenophobia has become a notable social phenomenon stemming from Covid-19 as a global pandemic. In particular, attention has increasingly been drawn to Covid-19 related racism and xenophobia, which has been described as more infectious and more harmful than the virus itself [28]. According to a BBC report, throughout 2020, anti-Asian hate crimes increased by nearly one hundred and fifty percent, and there were around three thousand eight hundred anti-Asian racist incidents. It has therefore become urgent to comprehend public opinions regarding racism and xenophobia in order to enact effective intervention policies that prevent the escalation of racist hate crimes and social exclusion under Covid-19. Social media, as a critical public sphere for opinion expression, provides a platform for big social data analytics to understand and capture the dynamics of racist and xenophobic discourse alongside the development of Covid-19.

This research agenda has drawn attention from an increasing body of studies which have regarded Covid-19 as a social media infodemic [5], [11], [26], [18], [12]. The work in [25] made an early and probably the first attempt to analyse the emergence of Sinophobic behaviour on the Twitter and Reddit platforms. Soon after, [30] studied the role of counter hate speech in facilitating the spread of hate and racism against the Chinese and Asian community. The authors in [27] attempted to study the effect of hate speech on Twitter targeted at specific groups such as the older community and the Asian community in general. The work in [23] demonstrated the dynamic changes in the sentiments along with the major racist and xenophobic hashtags discussed across the early time period of Covid-19. The authors in [20] explored the user behavior which triggers hate speech on Twitter and how it later diffuses via retweets across the network. All these methods have used highly advanced computational techniques and state-of-the-art language models for extracting insights from the data mined from Twitter and other platforms.

While focusing on technical advancement, many studies tend to neglect the foundation for accurate data detection and analysis, namely how to define racism and xenophobia. In particular, the computational techniques and models tend to apply a binary definition (either racist or non-racist) to categorise the linguistic features of the texts, with limited attention paid to the nuances of racist and xenophobic behaviours. However, understanding the nuances is critical for mapping a comprehensive picture of the development of racist and xenophobic discourse alongside the evolution of Covid-19, that is, whether and how the topics of racist and xenophobic expression change over time. More importantly, capturing these changes reflected in the online public sphere will enable a more accurate comprehension and even prediction of public opinions and actions regarding racism and xenophobia in the offline world.

Reaching this goal demands a combination of computational methods and social science perspectives, which is the focus of this research. With the aid of BERT (Bi-directional Encoder Representations from Transformers) [9] and topic modelling [4], the main contributions of this research lie in two aspects:

  1. Development of a four-dimensional categorization of racist and xenophobic texts into stigmatization, offensiveness, blame, and exclusion;

  2. A stage-wise analysis of the categorized racist and xenophobic data to capture the dynamic changes in the discussion across the development of Covid-19.

Specifically, this research situates the examination in Twitter, the most influential platform for online political discussion. We focus on the most turbulent early phase of Covid-19 (Jan to Apr 2020), in which the unexpected and constant global expansion of the virus kept changing people’s perception of this public health crisis and of how it relates to race and nationality. We divide the early phase into three stages based on the changing definitions of Covid-19 given by the World Health Organization (WHO): (1) 1st to 31st Jan 2020 as a domestic epidemic, referred to as stage 1 (S1); (2) 1st Feb to 11th Mar 2020 as an International Public Health Emergency (after the announcement made by WHO on 1st Feb), referred to as stage 2 (S2); (3) 12th Mar to 30th Apr 2020 as a global pandemic (based on the new definition given by WHO on 11th Mar), referred to as stage 3 (S3).
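The stage boundaries above can be expressed as a small lookup. The following is a minimal sketch rather than the code used in this research; the names `STAGES` and `assign_stage` are hypothetical, and the boundaries are taken directly from the dates stated in the text.

```python
from datetime import date

# Stage boundaries as stated in the text (inclusive on both ends).
STAGES = [
    ("S1", date(2020, 1, 1), date(2020, 1, 31)),   # domestic epidemic
    ("S2", date(2020, 2, 1), date(2020, 3, 11)),   # international emergency
    ("S3", date(2020, 3, 12), date(2020, 4, 30)),  # global pandemic
]


def assign_stage(tweet_date):
    """Return 'S1', 'S2' or 'S3' for a date in the study window, else None."""
    for name, start, end in STAGES:
        if start <= tweet_date <= end:
            return name
    return None
```

A tweet dated 11th Mar 2020 would fall in S2 under this scheme, and one dated 12th Mar 2020 in S3.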

The rest of the paper is organized as follows. In section 2, we outline the dataset mined from Twitter. Section 3 deals with two parts - firstly, it presents the data and method employed for category-based racism and xenophobia detection. Secondly, it details the topic modelling employed for extracting topics from the categorized data. In section 4, we discuss the findings of the overall process with the focus on topics emerging amongst the different racism and xenophobia categories across the early development of Covid-19. Finally, we conclude this paper in section 5.

2 Dataset

The dataset of this research comprises 247,153 tweets extracted through the Tweepy API (https://www.tweepy.org/) from the eighteen most circulated racist and xenophobic hashtags related to Covid-19 from 1st January to 30th April 2020. The list of selected hashtags is as follows: #chinavirus, #chinesevirus, #boycottchina, #ccpvirus, #chinaflu, #china_is_terrorist, #chinaliedandpeopledied, #chinaliedpeopledied, #chinalies, #chinamustpay, #chinapneumonia, #chinazi, #chinesebioterrorism, #chinesepneumonia, #chinesevirus19, #chinesewuhanvirus, #viruschina, and #wuflu. The tweets extracted from these hashtags are further divided into the three stages that define the early development of Covid-19, as described earlier.

| Category | Definition | Example |
|---|---|---|
| Stigmatization | Confirming negative stereotypes for conveying a devalued social identity within a particular context [22] | “For all the #ChinaVirus jumped from a bat at the wet market” |
| Offensiveness | Attacking a particular social group through aggressive and abusive language [17] | “Real misogyny in communist China. #chinazi #China_is_terrorist #China_is_terrorists #FuckTheCCP” |
| Blame | Attributing the responsibility for the negative consequences of the crisis to one social group [7] | “These Chinese are absolutely disgusting. They spread the #ChineseVirus. Their lies created a pandemic #ChinaMustPay” |
| Exclusion | The process of othering to draw a clear boundary between in-group and out-group members [1] | “China deserves to be isolated by all means forever. SARS was also initiated in China, 2003 by eating anything & everything #BoycottChina” |

Table 1: Definition and example of categorization of racist and xenophobic behaviors.

3 Method

3.1 Category-based racism and xenophobia detection

Beyond a binary categorization of racism and xenophobia, this research applies a social science perspective to categorize racism and xenophobia into four dimensions, as shown in Table 1. This translates into a five-class text classification problem, where four classes represent the racism and xenophobia categories and the fifth class corresponds to non-racist and non-xenophobic text.

3.1.1 Annotated dataset

For this purpose, we annotated a dataset of 6000 tweets. These tweets were randomly selected from all hashtags across the three development stages and annotated by four research assistants, with inter-coder reliability above 70%. The annotation followed a coding scheme based on the linguistic features of the tweets, with 0 representing stigmatization, 1 offensiveness, 2 blame, and 3 exclusion. Unmarked tweets were regarded as non-racist and non-xenophobic and assigned to class 4. We limited the annotation of each tweet to a single label, corresponding to its strongest category. The distribution of the 6000 tweets amongst the five classes is as follows: 1318 stigmatization, 1172 offensiveness, 1045 blame, 1136 exclusion, and 1329 non-racist and non-xenophobic.

We treat the classification of the above-mentioned categories as a supervised learning problem and develop machine learning and deep learning models for it. We first pre-process each text sample by removing punctuation and URLs and converting it to lower case before using it to train our models. We randomly split the data into train and test sets with a 90:10 ratio for training and evaluating our models respectively.
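The pre-processing and split described above can be sketched as follows. This is an illustrative sketch using only the Python standard library, not the implementation used in this research; the function names, the URL pattern, and the fixed random seed are assumptions.

```python
import random
import re
import string

# Assumed URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Translation table that deletes ASCII punctuation (including '#' and '@').
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Remove URLs and punctuation, lower-case, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.lower().split())


def train_test_split(samples, test_ratio=0.1, seed=0):
    """Shuffle the samples and return (train, test) with a 90:10 default."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```

With 6000 annotated tweets, a 90:10 split of this kind yields 5400 training and 600 test samples.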

3.1.2 BERT

Recently, language models such as Bi-directional Encoder Representations from Transformers (BERT) [9] have become extremely popular due to their state-of-the-art performance on natural language processing tasks. Because BERT is trained bi-directionally, it learns strong word representations from unlabelled text, which enables better performance than other machine learning and deep learning techniques [9]. The common approach for adapting BERT to a specific task on a smaller dataset is to fine-tune a pre-trained BERT model which has already learnt deep context-dependent representations. We select the “bert-base-uncased” model, which comprises 12 layers, 12 self-attention heads, and a hidden size of 768, totalling 110M parameters. We fine-tune the BERT model with a categorical cross-entropy loss for the five categories. The hyperparameters used for fine-tuning are selected as recommended in [9]. We use the AdamW optimizer with the standard learning rate of 2e-5 and a batch size of 16, and train for 5 epochs. To select the maximum sequence length, we tokenize the whole dataset using the BERT tokenizer and check the distribution of the token lengths. The minimum token length is 8, the maximum is 130, the median is 37 and the mean is 42. Based on the density distribution shown in Fig. 1, we experiment with two values of sequence length, 64 and 128, and find that a sequence length of 64 provides better performance.

Figure 1: Density distribution of token lengths of the tweets in our dataset.
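To illustrate how a token-length distribution can inform the choice of maximum sequence length, the sketch below computes summary statistics and the fraction of samples that fit within a candidate length without truncation. It is a hedged illustration only: the helper names are hypothetical, and the `lengths` list is made-up example data, not the token lengths of our dataset.

```python
from statistics import mean, median


def length_summary(token_lengths):
    """Return min, max, median and mean of a list of token counts."""
    return {
        "min": min(token_lengths),
        "max": max(token_lengths),
        "median": median(token_lengths),
        "mean": mean(token_lengths),
    }


def coverage(token_lengths, max_len):
    """Fraction of sequences that fit within max_len without truncation."""
    fits = sum(1 for n in token_lengths if n <= max_len)
    return fits / len(token_lengths)


# Made-up token counts for illustration only (not the paper's data).
lengths = [8, 20, 37, 37, 45, 60, 70, 130]
summary = length_summary(lengths)
share_64 = coverage(lengths, 64)
share_128 = coverage(lengths, 128)
```

A shorter maximum length truncates the few longest tweets but reduces padding for the majority, which is one plausible reason a length of 64 can perform at least as well as 128.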

As additional baselines, we also train two more techniques. Long Short Term Memory Networks (LSTMs) [14] have been very popular with text data as they can learn the dependencies of various words in the context of a text. Also, machine learning algorithms such as Support Vector Machines (SVMs) [13] have been used previously by researchers for text classification tasks. We adopt the same data pre-processing and implementation technique as mentioned earlier and train the SVM with grid search, a 5-layer LSTM (using the pre-trained GloVe [24] embeddings) and the BERT model for the category detection of the racist and xenophobic tweets.

To evaluate the machine learning and deep learning approaches on our test dataset, we use average accuracy and weighted F1-score over the five categories. The performance of the models is shown in Table 2. The fine-tuned BERT model performs best compared to SVM and LSTM in terms of both accuracy and F1-score. Thus, we employ this fine-tuned BERT model to categorize all the tweets in the remaining dataset. This gives a refined dataset of the four categories of tweets spread across the three stages, as shown in Table 3.
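For reference, the two evaluation metrics can be computed as in the following sketch. It is a plain-Python illustration under the usual definitions (per-class F1 weighted by class support), not the evaluation code used in this research; the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)
```

Weighting by support means that larger classes contribute proportionally more to the score, which matters little here because the five annotated classes are of similar size.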

| Technique | Accuracy (%) | F1-score |
|---|---|---|
| SVM | 69 | 0.66 |
| LSTM | 74 | 0.72 |
| BERT | 86 | 0.81 |

Table 2: Performance of different models on the manually annotated test dataset.

| Category | Total | S1 | S2 | S3 |
|---|---|---|---|---|
| Stigmatization | 116584 | 3723 | 5687 | 107174 |
| Offensiveness | 10503 | 1722 | 1808 | 6973 |
| Blame | 39765 | 31 | 777 | 38957 |
| Exclusion | 10293 | 872 | 1341 | 8080 |

Table 3: Distribution of tweets amongst the four categories across the three stages.

| Stage | Topic | Keywords |
|---|---|---|
| S1 | T1. Virus | virus spread country travel year control chinese ban corona show |
| S1 | T2. China/Chinese | chinese virus deadly china situation mask stop animal source eat |
| S1 | T3. Infection | people case health infect confirm death sar number report market |
| S1 | T4. Outbreak | china coronavirus wuhan outbreak city hospital news patient put state |
| S1 | T5. Travel | world china government make people time day bad flight start |
| S2 | T1. Emergency | virus spread day year corona show emergency food kit supply |
| S2 | T2. Globe | china world time country report death global health travel confirm |
| S2 | T3. Infection | people case call ncov infect kill pack state flu number |
| S2 | T4. China | china coronavirus wuhan outbreak quarantine stop find man dead thing |
| S2 | T5. Chinese | chinese make mask government news good work citizen start respirator |
| S3 | T1. Government | china world spread country lie pay communist government ccp make |
| S3 | T2. ? | time make india good give work day back fight buy |
| S3 | T3. China | china coronavirus case death covid country economy war number wuhan |
| S3 | T4. Chinese | chinese virus people call stop racist start die blame corona |
| S3 | T5. US | american trump state medium president america news great propaganda show |

Table 4: Extracted topics and their corresponding keywords for the category of stigmatization spread across the three stages S1, S2, and S3.

| Stage | Topic | Keywords |
|---|---|---|
| S1 | T1. ? | country ccp citizen virus arrest live system security foreign understand |
| S1 | T2. Government | people government democracy support life year regime uyghur camp give |
| S1 | T3. ? | china world spread stop communist happen taiwan wuhan govt ban |
| S1 | T4. Muslim | chinese make muslim good kill police terrorist bad party lie |
| S1 | T5. Human right | world freedom hong_kong human human_right time free stand hk fight |
| S2 | T1. Freedom | world stop freedom truth spread good free hk speech life |
| S2 | T2. CCP | china chinese ccp virus happen wuhan evil communist time uyghur |
| S2 | T3. People | people make kill lie ppl trust camp police thing man |
| S2 | T4. China | china country regime pay money outbreak start work force control |
| S2 | T5. Human right | government citizen human fight support hong_kong taiwan give democracy death |
| S3 | T1. Death | world people pay lie kill truth fight life die humanity |
| S3 | T2. Government | time call government india communist pandemic give global send real |
| S3 | T3. Virus | chinese virus spread wuhan corona product buy control big day |
| S3 | T4. China | china country make ccp stop good coronavirus human trust support |
| S3 | T5. World | china world war case start covid economy death state italy |

Table 5: Extracted topics and their corresponding keywords for the category of offensiveness spread across the three stages S1, S2, and S3.

| Stage | Topic | Keywords |
|---|---|---|
| S1 | T1. Lie | lie spread virus autocracy deceit imagine true horrible infect country |
| S1 | T2. Death | china dead die day order monstrosity true thing kong high |
| S1 | T3. Safety | coronavirus move lot cvirus epicenter safety march careful knowingly health |
| S1 | T4. Time | wuhan lunar_new sick year time absolutely medium mutate emperor truth |
| S1 | T5. Infection | people chinese make online pandemic catch number infect community official |
| S2 | T1. Government | lie chinese coronavirus government wuhan cover day body thing care |
| S2 | T2. Spread | world country spread happen trust kill threat steal dead face |
| S2 | T3. China | china truth bad free money communist case find start move |
| S2 | T4. Virus | virus stop make control good china fight live report human |
| S2 | T5. Death | people time number die real life entire back citizen death |
| S3 | T1. World | world china country pay pandemic kill global economy war |
| S3 | T2. ? | people stop human american eat put president market happen live |
| S3 | T3. Lie | china lie coronavirus wuhan blame die case cover truth number |
| S3 | T4. ? | make time china good start buy trust back thing country |
| S3 | T5. Government | chinese virus china government call communist ccp covid spread hold |

Table 6: Extracted topics and their corresponding keywords for the category of blame spread across the three stages S1, S2, and S3.

| Stage | Topic | Keywords |
|---|---|---|
| S1 | T1. Government | support gov join people evil time stand sanction government money |
| S1 | T2. Human right | product world stop human_right freedom tag good challenge ppl economic_infiltration |
| S1 | T3. Boycott | china hong_kong fight regime boycott show international control trust communist |
| S1 | T4. Trade | make buy ccp day thing friend taiwan japan hope today |
| S1 | T5. Virus | country chinese people spread year human animal protect virus eat |
| S2 | T1. Nation | people chinese animal happen government initiative nation show economy law |
| S2 | T2. Virus | virus control truth support live kill boycott start stand cover |
| S2 | T3. Threat | china time lie threat company trust big entire spy wuhan |
| S2 | T4. Human right | world country freedom spread human_right economic thing evil steal raise |
| S2 | T5. Trade | make product stop buy day china good ccp challenge coronavirus |
| S3 | T1. Virus | china virus world pay spread ccp covid corona market call |
| S3 | T2. Pandemic | world china company communist coronavirus pandemic global nation trust war |
| S3 | T3. Trade | chinese make product buy boycott stop good India economy Indian |
| S3 | T4. Human right | people lie government human life back animal kill eat bring |
| S3 | T5. China | china country time start business give thing app sell money |

Table 7: Extracted topics and their corresponding keywords for the category of exclusion spread across the three stages S1, S2, and S3.

3.2 Topic modelling

Topic modelling is one of the most extensively used methods in natural language processing for finding relationships across text documents, topic discovery and clustering, and extracting semantic meaning from a corpus of unstructured data [16]. Researchers have developed many techniques for extracting semantic topic clusters from a corpus, such as Latent Semantic Analysis (LSA) [8] and Probabilistic Latent Semantic Analysis (pLSA) [15]. In the last decade, Latent Dirichlet Allocation (LDA) [4] has become a successful and standard technique for inferring topic clusters from texts for various applications such as opinion mining [29], social media analysis [6], and event detection [19], and several variants of LDA have consequently been developed [2], [3].

For our research, we adopt the baseline LDA model with Variational Bayes inference from Gensim (https://pypi.org/project/gensim/) and the LDA Mallet model [21] with Gibbs sampling for extracting the topic clusters from the text data. Before passing the corpus to the LDA models, we perform data pre-processing and cleaning, which include the following steps. Firstly, we remove any new line characters, punctuation, URLs, mentions and hashtags. Next, we tokenize the texts in the corpus and remove stopwords using the Gensim pre-processing utility and the stopwords defined in the NLTK (https://pypi.org/project/nltk/) corpus. Finally, we form bigrams and lemmatize the words in the text.
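The cleaning, tokenization and stopword-removal steps above can be sketched as follows. This is a simplified, standard-library illustration and not the pipeline used in this research: the small `STOPWORDS` set is a hypothetical stand-in for the NLTK stopword list, the regular expressions are assumptions, and bigram formation and lemmatization are omitted.

```python
import re

# Hypothetical stand-in for the NLTK English stopword list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "this"}


def clean_for_lda(text):
    """Strip newlines, URLs, mentions and hashtags, then tokenize and filter."""
    text = text.replace("\n", " ")
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    # Keeping only alphabetic runs also discards punctuation and digits.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

Unlike the classifier pre-processing in Section 3.1.1, hashtags are removed entirely here, so the hashtags used to collect the tweets do not dominate the extracted topics.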

After applying the above pre-processing to our corpus, we perform topic modelling using LDA from Gensim and LDA Mallet. We vary the number of topics from 5 to 25 at an interval of 5 and check the corresponding coherence score of the models, as was done in [10]. We train the models for 1000 iterations for each number of topics, optimizing the hyperparameters every 10 passes after each 100 pass period. We set the values of α and β, which control the distribution of topics and of the vocabulary words amongst the topics, to the default setting of 1 divided by the number of topics. In our experiments, LDA Mallet achieves a higher coherence score (0.60-0.65) than the LDA model from Gensim (0.49-0.55), and thus we select the LDA Mallet model for topic modelling on our corpus.
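The model-selection loop described above can be sketched generically. In this sketch, `score_fn` is a hypothetical callback that would train a topic model with the given number of topics and symmetric prior and return its coherence score; no real library call is implied, and the helper names are assumptions.

```python
def candidate_topic_counts(start=5, stop=25, step=5):
    """Topic counts to try: 5, 10, 15, 20, 25 by default."""
    return list(range(start, stop + 1, step))


def symmetric_prior(num_topics):
    """Default symmetric Dirichlet prior: 1 divided by the number of topics."""
    return 1.0 / num_topics


def select_num_topics(score_fn, counts=None):
    """Return (best_k, scores), where scores maps each k to its coherence.

    score_fn(k, prior) is a hypothetical callback that trains a model with
    k topics and the given prior, and returns a coherence score.
    """
    counts = counts if counts is not None else candidate_topic_counts()
    scores = {k: score_fn(k, symmetric_prior(k)) for k in counts}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

In practice `score_fn` would wrap the training and coherence evaluation of whichever LDA implementation is being compared.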

The above strategy is employed for each racist and xenophobic category and for every stage individually. We find the highest coherence score corresponding to a specific number of topics for each category and stage. To analyse the results, we reduce the number of topics to 5 by clustering closely related topics using equation 1.

$$T_c = \left\{ \frac{1}{N} \sum_{i=1}^{N} p_{ij} \right\}_{j=1}^{M} \qquad (1)$$

where $N$ refers to the number of topics to be clustered, $M$ represents the number of keywords in each topic, $p_{ij}$ corresponds to the probability of the $j$-th word in the $i$-th topic, and $T_c$ is the resultant topic containing the average probabilities of all the words from the $N$ topics. We then report the 10 highest-probability words in the resultant topic for every category and stage, as shown in Tables 4 to 7.
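The averaging step described above can be sketched as follows. This is an illustrative sketch, not the code used in this research: the function name is hypothetical, each topic is assumed to be a mapping from word to probability, and a word missing from a topic's keyword list is assumed to contribute a probability of 0 for that topic.

```python
def merge_topics(topics, top_n=10):
    """Average word probabilities over several topics and keep the top words.

    topics: list of dicts mapping word -> probability within that topic.
    Returns a list of (word, average_probability), highest first.
    """
    n = len(topics)
    vocab = set().union(*topics)
    # Assumption: a word absent from a topic counts as probability 0 there.
    averaged = {w: sum(t.get(w, 0.0) for t in topics) / n for w in vocab}
    # Sort by descending probability, breaking ties alphabetically.
    ranked = sorted(averaged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]
```

For example, merging two toy topics `{"china": 0.4, "virus": 0.2}` and `{"china": 0.2, "lie": 0.3}` ranks "china" first, followed by "lie" and then "virus".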

4 Findings

Tables 4, 5, 6 and 7 show the ten most salient terms for each of the five topics generated for each stage (S1, S2, and S3) of the four categories, and we summarize each topic through the correlation between the ten terms. We put a question mark for topics from which no pattern can be derived. In general, under all four categories, China and Chinese are always at the centre of discussion. Across the stages, tweets of all four categories extended the discussion to the world situation, and terms representing nations and races/ethnicities other than China and Chinese started to emerge.

Notably, the category-based detection and analysis enable us to capture the nuances of themes and how themes develop through different trajectories across the stages. Specifically, the topics in the category of stigmatization centre on the virus. Discussion tends to associate China and Chinese with the infection and outbreak of the virus as well as its negative influences (e.g. emergency; travel). In stage 3, discussion around America became a new focus, with the terms trump, president, and propaganda showing up.

The discussion in the category of offensiveness is more politically oriented than in the other categories. In the first two stages especially, discussion included sensitive political terms concerning China (e.g., hk, uyghur, taiwan). In addition, ccp (Chinese Communist Party) and human right are two important topics. Only in stage 3 did the topics in offensiveness gradually shift their focus to the virus.

The data in the category of blame focuses on attributing the cause and consequences of the virus to a particular political system (e.g., lie; autocracy; deceit) in the early stages of the discussion. As with stigmatization, american and president emerged as new terms in stage 3 for the category of blame, although across all three stages the focus remained on terms such as lie and cover, referring to cover-up by the government.

The category of exclusion emphasizes the virus, trade and human right. In terms of trade in particular, more negative words become associated with it as Covid-19 develops (e.g. from stop in stage 2 to stop and boycott in stage 3). Additionally, in stage 3, india and indian were related to china under the topic of trade.

5 Discussion and Conclusions

Bridging computational methods with social science theories, this research proposes a four-dimensional categorization for the detection of racist and xenophobic texts in the context of Covid-19. This categorization, combined with a stage-wise analysis, enables us to capture the diversity of the topics emerging from racist and xenophobic expression on Twitter, and their dynamic changes across the early stages of Covid-19. It also allows the methodological advancement proposed by this research to be translated into constructive policy suggestions. For instance, as demonstrated in the findings, the topics falling under the category of offensiveness are more likely to be associated with sensitive political issues around China than with the virus in stage 1 and stage 2. Therefore, how to separate the discussion of the virus from its association with other political topics should draw attention from the governments of different countries, and this agenda should be incorporated into official government media coverage. Another example is from the category of blame. As shown in the findings, blame usually targets the transparency of information from the government (the Chinese government especially in early Covid-19). Consequently, it is critical for the governments of different countries to work on effective and prompt communication with the public under Covid-19. We believe the contribution of this research can be generalized beyond the context of Covid-19 to provide insights for future research on racism and xenophobia on digital platforms.

References

  • [1] O. G. Bailey and R. Harindranath (2005) Racialised ‘othering’. Journalism: critical issues, pp. 274–286.
  • [2] D. M. Blei, T. L. Griffiths, M. I. Jordan, J. B. Tenenbaum, et al. (2003) Hierarchical topic models and the nested Chinese restaurant process. In NIPS, Vol. 16.
  • [3] D. M. Blei and J. D. McAuliffe (2010) Supervised topic models. arXiv preprint arXiv:1003.0783.
  • [4] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3, pp. 993–1022.
  • [5] M. Cinelli, W. Quattrociocchi, A. Galeazzi, C. M. Valensise, E. Brugnoli, A. L. Schmidt, P. Zola, F. Zollo, and A. Scala (2020) The covid-19 social media infodemic. Scientific Reports 10 (1), pp. 1–10.
  • [6] R. Cohen and D. Ruths (2013) Classifying political orientation on Twitter: it’s not easy!. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 7.
  • [7] T. Coombs and L. Schmidt (2000) An empirical analysis of image restoration: Texaco’s racism crisis. Journal of Public Relations Research 12 (2), pp. 163–178.
  • [8] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), pp. 391–407.
  • [9] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [10] Z. Fang, X. Zhao, Q. Wei, G. Chen, Y. Zhang, C. Xing, W. Li, and H. Chen (2016) Exploring key hackers and cybersecurity threats in Chinese hacker communities. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pp. 13–18.
  • [11] O. Gencoglu and M. Gruber (2020) Causal modeling of Twitter activity during covid-19. Computation 8 (4), pp. 85.
  • [12] Y. Guo, C. Xypolopoulos, and M. Vazirgiannis (2021) How covid-19 is changing our language: detecting semantic shift in Twitter word embeddings. arXiv preprint arXiv:2102.07836.
  • [13] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf (1998) Support vector machines. IEEE Intelligent Systems and their Applications 13 (4), pp. 18–28.
  • [14] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [15] T. Hofmann (1999) Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 50–57.
  • [16] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, and L. Zhao (2019) Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimedia Tools and Applications 78 (11), pp. 15169–15211.
  • [17] R. Jeshion (2013) Expressivism and the offensiveness of slurs. Philosophical Perspectives 27 (1), pp. 231–259.
  • [18] X. Li, M. Zhou, J. Wu, A. Yuan, F. Wu, and J. Li (2020) Analyzing covid-19 on online social media: trends, sentiments and emotions. arXiv preprint arXiv:2005.14464.
  • [19] C. X. Lin, B. Zhao, Q. Mei, and J. Han (2010) PET: a statistical model for popular events tracking in social communities. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 929–938.
  • [20] S. Masud, S. Dutta, S. Makkar, C. Jain, V. Goyal, A. Das, and T. Chakraborty (2020) Hate is the new infodemic: a topic-aware modeling of hate speech diffusion on Twitter. arXiv preprint arXiv:2010.04377.
  • [21] A. K. McCallum (2002) MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu.
  • [22] C. T. Miller and C. R. Kaiser (2001) A theoretical perspective on coping with stigma. Journal of Social Issues 57 (1), pp. 73–92.
  • [23] X. Pei and D. Mehta (2020) #Coronavirus or #Chinesevirus?!: understanding the negative sentiment reflected in tweets with racist hashtags across the development of covid-19. arXiv preprint arXiv:2005.08224.
  • [24] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
  • [25] L. Schild, C. Ling, J. Blackburn, G. Stringhini, Y. Zhang, and S. Zannettou (2020) “Go eat a bat, Chang!”: an early look on the emergence of sinophobic behavior on web communities in the face of covid-19. arXiv preprint arXiv:2004.04046.
  • [26] M. Trajkova, F. Cafaro, S. Vedak, R. Mallappa, S. R. Kankara, et al. (2020) Exploring casual covid-19 data visualizations on Twitter: topics and challenges. In Informatics, Vol. 7, pp. 35.
  • [27] N. Vishwamitra, R. R. Hu, F. Luo, L. Cheng, M. Costello, and Y. Yang (2020) On analyzing covid-19-related hate speech using BERT attention. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 669–676.
  • [28] S. Wang, X. Chen, Y. Li, C. Luu, R. Yan, and F. Madrisotti (2021) ‘I’m more afraid of racism than of the virus!’: racism awareness and resistance among Chinese migrants and their descendants in France during the covid-19 pandemic. European Societies 23 (sup1), pp. S721–S742.
  • [29] Z. Zhai, B. Liu, H. Xu, and P. Jia (2011) Constrained LDA for grouping product features in opinion mining. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 448–459.
  • [30] C. Ziems, B. He, S. Soni, and S. Kumar (2020) Racism is a virus: anti-Asian hate and counterhate in social media during the covid-19 crisis. arXiv preprint arXiv:2005.12423.