Hate speech detection (HSD) is a difficult task for both humans and machines because identifying hateful content requires more than keyword detection. Hatred may be implied, the sentence may be grammatically incorrect, and abbreviations and slang may be numerous. Recently, the use of machine learning methods for HSD has gained attention, as evidenced by these systems [20, 14]. A comparative study of machine learning models concluded that deep learning models are more accurate. Current HSD systems are based on natural language processing (NLP) advances and rely on deep neural networks (DNN) [18, 15].
Finding the features that best represent the underlying hate speech phenomenon is challenging. Early works on automatic HSD used different word representations, such as bag of words, surface forms, and character n-grams, with machine learning classifiers. Combining features such as n-grams, linguistic features, and syntactic features has also proved useful. Lexical syntactic features, incorporating style features, structure features, and context-specific features, have been proposed to better predict hate speech in social media. User features have been investigated to detect bullying and aggressive behavior in tweets.
The integration of word embeddings, sentence embeddings, or emoji features in DNN systems allows learning semantics, contexts, and long-term dependencies. For instance, fastText word embeddings are used in a DNN-based HSD system. Universal Sentence Encoder or InferSent embeddings make it possible to take into account the semantic information of the entire sentence, and sentence embeddings have been shown to outperform word embeddings. A hybrid emoji-based masked language model has been proposed to model the common information across different languages. A Convolutional Neural Network-gram based system has also been proposed and demonstrated good robustness in coarse-grained and fine-grained detection tasks.
In this paper, we focus on automatic HSD in tweets using DNNs. Our baseline system relies on Universal Sentence Encoder (USE) embeddings. We propose to enrich the baseline system with word-level features based on multiword expressions (MWEs). MWEs are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. We believe that MWE modelling could help to reduce the ambiguity of tweets and lead to better detection of hate speech (HS). To the best of our knowledge, MWE features have never been used in the framework of DNN-based automatic HSD. Our contributions are as follows. First, we extract different MWE categories and study their distribution in our tweet corpora. Second, we design a three-branch deep neural network to integrate MWE features: the baseline system based on USE embeddings is extended with a second branch that models the different MWE categories and a third branch that takes into account the semantic meaning of MWEs through word embeddings. The resulting system thus combines word-level and sentence-level features. Finally, we experimentally demonstrate the ability of the proposed MWE-based HSD system to better detect hate speech: a statistically significant improvement is obtained compared to the baseline system. We experiment on two tweet corpora to show that our approach is domain-independent.
2 Proposed methodology
In this section, we describe the proposed HSD system based on MWE features. This system is a three-branch DNN that combines a global feature computed at the sentence level (USE embeddings) with word-level features: MWE categories and word embeddings representing the words belonging to MWEs.
2.1 Universal Sentence Encoder
The Universal Sentence Encoder (USE) provides sentence-level embeddings. The USE model is trained on a variety of data sources and has demonstrated strong transfer performance on a number of NLP tasks. In particular, pre-trained USE showed very good performance on different sentiment detection and semantic textual similarity tasks. The HSD system based on USE obtained the best results at the SemEval2019 campaign (shared task 5). This strong performance motivated us to use USE in the design of our system.
2.2 MWE features
A multiword expression is a group of words that is treated as a unit. For example, the two MWEs "stand for" and "get out" have one meaning as a group but another if the words are taken separately. MWEs include idioms, light verb constructions, verb-particle constructions, and many compounds. We think that adding information about MWE categories and semantic information from MWEs might help the HSD task.
Automatic MWE identification and text tagging in terms of MWEs are difficult tasks. Different state-of-the-art deep learning methods have been studied for MWE identification, such as Convolutional Neural Networks (CNN), bidirectional Long Short-Term Memory (LSTM) networks, and transformers [11, 29, 28]. Over the past few years, interest in MWEs has increased, as evidenced by different shared tasks: DiMSUM and PARSEME [24, 21].
In our work, we focus on social media data. These textual data are very particular: they may be grammatically incorrect and may contain abbreviations or spelling mistakes. For this type of data, there are no state-of-the-art approaches for MWE identification, so a specific MWE identification system would be required to parse MWEs in social media corpora. As adapting an MWE identification system to a tweet corpus is a complex task and is not the goal of our paper, we adopt a lexicon-based approach to annotate our corpora in terms of MWEs. We extract MWEs from the STREUSLE web corpus (a corpus of English online reviews), which is annotated with MWEs. From this corpus, we create an MWE lexicon in which each MWE is assigned a lexical category. Table 1 presents these categories with examples. Each tweet of our tweet corpora is lemmatized and parsed with the MWE lexicon. Our parser tags MWEs and takes into account their possible discontinuity: we allow one word not belonging to the MWE to be present between the words of the MWE. If a word in a sentence belongs to two MWEs, we tag this word with the longest MWE. We do not take into account spelling or grammatical mistakes. We add a special category for words not belonging to any MWE.
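The lexicon-based tagging described above can be sketched as follows. This is a minimal illustration and not the parser used in the paper: the function names, the `NO_MWE` label, the lexicon format (tuples of lemmas mapped to a category name), and the reading of the discontinuity rule as "at most one foreign word in total inside an MWE" are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

NO_MWE = "O"  # hypothetical label for words outside any MWE


def match_at(lemmas: List[str], start: int, mwe: Tuple[str, ...],
             max_gap: int = 1) -> Optional[List[int]]:
    """Return the token positions matching `mwe` from `start`, or None.

    At most `max_gap` words not belonging to the MWE may be skipped
    (assumed interpretation of the discontinuity rule).
    """
    if lemmas[start] != mwe[0]:
        return None
    positions = [start]
    gaps_used = 0
    pos = start + 1
    for word in mwe[1:]:
        if pos < len(lemmas) and lemmas[pos] == word:
            positions.append(pos)
            pos += 1
        elif (gaps_used < max_gap and pos + 1 < len(lemmas)
              and lemmas[pos + 1] == word):
            positions.append(pos + 1)
            pos += 2
            gaps_used += 1
        else:
            return None
    return positions


def tag_mwes(lemmas: List[str],
             lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Tag each lemma with an MWE category; the longest MWE wins on overlap."""
    tags = [NO_MWE] * len(lemmas)
    owner_len = [0] * len(lemmas)  # length of the MWE currently tagging each word
    for start in range(len(lemmas)):
        for mwe, category in lexicon.items():
            positions = match_at(lemmas, start, mwe)
            if positions is None:
                continue
            for p in positions:
                if len(mwe) > owner_len[p]:
                    tags[p] = category
                    owner_len[p] = len(mwe)
    return tags


# Toy lexicon with two entries taken from the examples of Table 1.
toy_lexicon = {
    ("take", "off"): "full verb-particle construction",
    ("on", "the", "phone"): "adposition phrase (idiomatic)",
}
toy_tags = tag_mwes(["he", "take", "it", "off", "on", "the", "phone"], toy_lexicon)
```

In this toy example, "take ... off" is matched across the intervening word "it", which itself keeps the `NO_MWE` label.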
2.3 HSD system proposal
In this section, we describe our hate speech detection system using USE embeddings and MWE features. As USE is a feature at the sentence level and MWE features are at the word level, the architecture of our system is composed of a neural network with three branches: two branches are dedicated to the MWE features, the last one deals with USE features. Figure 1 shows the architecture of our system.
In the first branch, we associate each word of the tweet with the index of its MWE category, represented as a one-hot encoding. This branch is composed of 3 consecutive blocks of CNN (Conv1D) and MaxPooling layers. We focus on this architecture because of previous experiments with different DNN structures and because CNNs are fast to train. The second branch takes into account the semantic context of the words composing MWEs. If a given tweet has one or several MWEs, we associate a word embedding with each word composing these MWEs. We believe that the semantic meaning of MWEs is important to better understand and model them. This branch uses one LSTM layer. We propose to use two types of word embeddings: static, where a given word has a single embedding in any context, or dynamic, where a given word can have different embeddings according to its long-term context. We experiment with word2vec and BERT embeddings [17, 9]. BERT uses tokens instead of words. Therefore, we use the embedding of each token composing the words of the MWEs. We think that using two branches to model MWEs allows us to take into account complementary information and provides an efficient way of combining different features for a more robust HSD system.
The last branch, the USE embedding, supplies relevant semantic information at the sentence level. The outputs of the three branches are concatenated and passed through two dense layers to obtain the system output. The output layer has as many neurons as the number of classes to predict.
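As an illustration of the input to the first branch, the sketch below turns a sequence of per-word MWE category tags into one-hot vectors. The category names, the fixed `max_len`, and the zero-padding of short tweets are assumptions made for the example; the text does not specify how tweets of different lengths are handled.

```python
from typing import List


def one_hot_tags(tags: List[str], categories: List[str],
                 max_len: int) -> List[List[int]]:
    """Encode each tag as a one-hot row; pad or truncate to `max_len` rows."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for tag in tags[:max_len]:
        row = [0] * len(categories)
        row[index[tag]] = 1
        rows.append(row)
    # Assumed padding scheme: all-zero rows for positions past the tweet end.
    while len(rows) < max_len:
        rows.append([0] * len(categories))
    return rows


# Hypothetical category inventory: the special non-MWE category plus two MWE categories.
toy_categories = ["O", "verbal idiom", "full verb-particle construction"]
toy_matrix = one_hot_tags(["O", "verbal idiom"], toy_categories, max_len=4)
```

The resulting `max_len` by `len(categories)` matrix is the kind of input a Conv1D layer can slide over along the word axis.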
3 Experimental setup
The different time frames of collection, the various sampling strategies, and the targets of abuse induce a significant shift in the data distribution and can cause performance to vary across datasets. We use two tweet corpora to show that our approach is domain-independent: the English corpus of SemEval2019 task 5 subtask A (called HatEval in the following) and the Founta corpus. We study the influence of MWE features on the HatEval corpus, and we use the Founta corpus to confirm our results. Note that these corpora contain different numbers of classes and different percentages of hateful speech.
We evaluate our models using the official evaluation script of SemEval shared task 5 (https://github.com/msang/hateval/tree/master/SemEval2019-Task5/evaluation) in terms of the macro-F1 measure, which is the average of the F1 scores of all classes.
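For readers unfamiliar with the measure, macro-F1 can be computed as in the sketch below. It is a plain-Python illustration, not the official evaluation script; the function name and the choice to average over every label seen in either the gold or the predicted labels are assumptions.

```python
from typing import Hashable, Sequence


def macro_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred), key=str)
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with two classes (1 = hateful, 0 = non-hateful).
toy_score = macro_f1([1, 1, 0, 0], [1, 0, 0, 0])
```

Because every class contributes equally to the average, a rare class such as hateful tweets in an unbalanced corpus weighs as much as the majority class.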
HatEval corpus. In the HatEval corpus, the annotation of a tweet is a binary value indicating if HS is occurring against women or immigrants. The corpus contains 13k tweets. We use standard corpus partition in training, development, and test set with 9k, 1k, and 3k tweets respectively. Each set contains around % of hateful tweets. The vocabulary size of the corpus is 66k words.
|MWE5|Adposition phrase (idiomatic)|on the phone|9|36|134|
|VMWE5|Inherently adpositional verb|stand for|11|21|447|
|VMWE5|Full light verb construction|have option|9|10|36|
|VMWE5|Verbal idioms|Give a crap|14|24|384|
|VMWE5|Full verb-particle construction|take off|11|20|387|
|VMWE5|Semi verb-particle construction|walk out|6|18|153|
|Other|Auxiliary|be suppose to|4|0|475|
|Other|Coordinating conjunction|and yet|1|0|8|
|Other|Infinitive marker|to eat|0|0|12|
|Other|Non-possessive pronoun|my self|0|3|11|
|Other|Subordinating conjunction|even if|0|0|28|
|Other|Cause light verb construction|give liberty|1|0|0|
|Other|Interjection|lo and behold|0|0|0|
We apply the following pre-processing to each tweet: we remove mentions (words beginning with @), hashtags (words beginning with #), and URLs. We keep the case unchanged. We chose this pre-processing because the systems using it obtained the best results at the SemEval2019 shared task 5 subtask A.
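A minimal sketch of this pre-processing is given below. The function name and the whitespace tokenization are assumptions, as is the rule that a URL is any token starting with `http://` or `https://`; the actual implementation may differ.

```python
def preprocess(tweet: str) -> str:
    """Drop mentions, hashtags, and URLs; leave the case unchanged."""
    dropped_prefixes = ("@", "#", "http://", "https://")
    kept = [token for token in tweet.split()
            if not token.startswith(dropped_prefixes)]
    return " ".join(kept)


toy_clean = preprocess("@user Check THIS out #wow https://t.co/abc now")
```

A tweet made only of mentions, hashtags, and URLs becomes an empty string under this scheme.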
For train and development sets, we keep only tweets that contain at least two words. Thus, we obtain tweets for the training set and tweets for the development set. We split the training part into two subsets, the first one ( tweets) to train the models, and the second one ( tweets) for model validation. In the test set, we keep all tweets after pre-processing, even empty tweets. We tag empty tweets as non-hateful.
The Founta corpus contains 100k tweets annotated with normal, abusive, hateful, and spam labels. Our experiments focus on HSD, so we remove the spam tweets and keep around 86k tweets. The vocabulary size of the corpus is 132k words. We apply the same pre-processing as for the HatEval corpus. We divide the Founta corpus into sets: train, development, and test with %, %, and % respectively. As for the HatEval corpus, we use a small part of training as the validation part. Each set contains about %, %, and % of normal, abusive, and hateful tweets.
3.2 System parameters
Our baseline system uses only USE features and corresponds to Figure 1 without the MWE branches. The system proposed in this article uses USE and the MWE features as presented in Figure 1 (https://github.com/zamp13/MWE-HSD).
For the USE embedding, we use the pre-trained model provided by Google (https://tfhub.dev/google/universal-sentence-encoder-large/3; embedding dimension is 512) without fine-tuning.
We tag the MWEs of each tweet using the lexicon presented in Section 2.2. If an MWE is found, we assign the corresponding MWE category to all words of the MWE. To perform a fine-grained analysis, we select MWE categories that have more than 50 occurrences (arbitrary value) and appear less than 97% in hate and non-hate tweets at the same time. We obtain MWE categories: called MWE5 and VMWE5, which are respectively the first and second part of Table 1. VMWE5 is composed of the verbal MWE categories and MWE5 of the rest of the categories. The training part of the HatEval corpus contains occurrences of VMWE5 and occurrences of MWE5.
In our experiments, we use either all MWE categories presented in Table 1 (19 categories: 18 MWE categories and a special category for words not belonging to any MWE) or the combination of VMWE5 and MWE5 (10 MWE categories and a special category).
Concerning the MWE one-hot branch of the proposed system, we set the number of filters to 32, 16, and 8 for the 3 Conv1D layers. The kernel size of each CNN is set to 3.
For the MWE word embedding branch, we set the LSTM layer to 192 neurons. For BERT embeddings, we use a pre-trained uncased BERT model (embedding dimension is 768). The BERT embeddings are extracted from the last layer of this model. The BERT model is token-based, so we model each token of the words belonging to an MWE. For word2vec embeddings, we use pre-trained embeddings trained on a large tweet corpus (embedding dimension is 400). In our systems, each dense layer contains 256 neurons.
For each system configuration, we train 9 models with different random initializations. We select the model that obtains the best result on the development set to make predictions on the test set.
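The selection procedure can be sketched as follows. The names `select_best_run` and `fake_train_and_score` are hypothetical: the second is only a stand-in that returns a pseudo-random development score, where a real setup would train a model with the given seed and return its development macro-F1.

```python
import random
from typing import Callable, Iterable, Tuple


def select_best_run(train_and_score: Callable[[int], float],
                    seeds: Iterable[int]) -> Tuple[int, float]:
    """Train one model per seed and keep the seed with the best dev score."""
    best_seed, best_score = -1, float("-inf")
    for seed in seeds:
        score = train_and_score(seed)
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score


def fake_train_and_score(seed: int) -> float:
    """Stand-in for real training: a seed-dependent pseudo-random dev score."""
    return random.Random(seed).uniform(0.60, 0.70)


# Nine runs with different random initializations, as in the setup above.
best_seed, best_dev_f1 = select_best_run(fake_train_and_score, range(9))
```

Only the selected run is then applied to the test set, so the test data plays no role in choosing the model.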
4 MWE statistics
We first analyze the distribution of the MWEs in our corpora. Figure 2 presents the percentage of occurrences of MWEs per tweet in HatEval. We observe that about 25% of the HatEval training tweets contain at least one MWE, which allows MWE features to influence HSD performance.
As a further investigation, we analyze the MWEs appearing per MWE category and for the hate/non-hate classes. In the training set of the HatEval corpus our parser, described in Section 2.2, annotated MWEs. Table 1 shows MWEs that appear only in hateful tweets, only in non-hateful tweets, or in both in the HatEval training part. We observe that some MWE categories, such as symbol and interjection, do not appear in the HatEval training set. We decided not to use these two categories in our experiments. Most of the categories appear in both hateful and non-hateful tweets. For the majority of MWE categories, there are MWEs that occur only in hateful speech and MWEs that occur only in non-hateful tweets.
Figure 3 presents the statistics of each MWE category for the hate and non-hate classes. As the classes in HatEval are almost balanced (42% of hateful tweets, 58% of non-hateful tweets), there is no bias due to imbalanced classes. Concerning the MWE categories, no category is used only in hateful speech or only in non-hateful speech, except for the cause light verb construction category, but this category is underrepresented. We can note that there is a difference between the use of MWEs in hateful and non-hateful tweets: MWEs are used more often in non-hateful speech. For some MWE categories this difference is larger, as for adposition or full verb-particle construction. In contrast, the determiner category occurs more in hateful tweets. These observations reinforce our idea that MWE features can be useful for hate speech detection.
5 Experimental results
The goal of our experiments is to study the impact of MWEs on automatic hate speech detection for two different corpora: HatEval with two classes (hate and non-hate) and Founta with three classes (hate, abusive, and normal). We carried out experiments with the different groups of MWE categories: MWEall, including all MWE categories, and the combination of VMWE5 and MWE5. We also evaluated different embeddings for the word embedding branch of the proposed system: word2vec and BERT. As described previously, we select the best-performing configuration on the development data to be applied to the test data.
Table 5 displays the macro-F1 on the HatEval and Founta test sets. Our baseline system without MWE features, called USE in Table 5, achieves a % macro-F1 score on the HatEval test set. Using MWE features with word2vec or BERT embeddings, the system proposed in this paper performs better than the baseline. For instance, on HatEval, the MWEall with BERT embedding configuration achieves the best result with % of macro-F1. Regarding the Founta corpus, we observe a similar improvement: the baseline system achieves % and systems with MWE features obtain scores ranging from % to % of macro-F1. It is important to note that, according to a matched pair test in terms of accuracy with 5% risk, the systems using MWE features and word2vec or BERT embeddings significantly outperform the baseline system on the two corpora. Finally, the proposed system with MWEall and BERT embedding for HatEval outperforms the state-of-the-art system FERMI submitted at the HatEval competition (SemEval task 5): % for our system versus % for FERMI of macro-F1.
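The exact form of the matched pair test is not detailed here. As an assumed stand-in, the sketch below uses the exact two-sided McNemar test, a common paired test on per-item correctness of two classifiers; the function name and the toy data are hypothetical and do not reproduce any result of the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(gold: Sequence[int], pred_a: Sequence[int],
                  pred_b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value for two systems scored on the same items."""
    # Discordant pairs: items that exactly one of the two systems gets right.
    only_a = sum(a == g and b != g for g, a, b in zip(gold, pred_a, pred_b))
    only_b = sum(a != g and b == g for g, a, b in zip(gold, pred_a, pred_b))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability of a split at least this uneven.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data: system A is right on 8 items where system B is wrong.
toy_gold = [1] * 10
toy_a = [1] * 10
toy_b = [0] * 8 + [1] * 2
toy_p = mcnemar_exact(toy_gold, toy_a, toy_b)
```

With a 5% risk, the difference between the two systems is declared significant when the returned p-value is below 0.05.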
|All test sets|
|USE, MWEall, word2vec|64.5|68.2|66.3|93.8|86.9|36.5|72.4|
|USE, VMWE5, MWE5, word2vec|66.1|67.0|66.5|93.9|87.1|37.2|72.7|
|USE, MWEall, BERT|64.2|69.4|66.8|94.0|87.1|37.5|72.9|
|USE, VMWE5, MWE5, BERT|64.8|68.2|66.5|93.8|86.9|38.2|73.0|
|Tweets containing at least one MWE|
|USE, MWEall, word2vec|71.7|61.4|66.6|91.4|86.9|44.6|76.5|
|USE, MWEall, BERT|73.9|61.3|67.6|90.9|94.0|43.3|76.1|
To further analyze MWE features, we experiment with different groups of MWE categories: VMWE5, MWE5, and MWEall. Preliminary experiments with a two-branch system, with USE and word embedding branches only, gave a marginal improvement compared to the baseline system. Using the three-branch neural network with only VMWE5 or MWE5 instead of MWEall seems to be interesting only for word2vec embeddings. With BERT embeddings, it is better to use the MWEall categories. Overall, using all MWEs appears more helpful than using a subgroup of MWE categories. Comparing word2vec and BERT embeddings, the dynamic word embedding performs slightly better than the static one, but the difference is not significant.
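Table 3: Confusion matrices (predicted labels) on the HatEval test set for the baseline system (USE) and the proposed system (USE, MWEall, BERT).
Table 4: Confusion matrices (predicted labels) on the Founta test set for the baseline system (USE) and the proposed system (USE, MWEall, BERT).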
Table 3 compares the confusion matrices of two systems: the baseline system and the proposed one with MWEall and BERT embeddings. On the HatEval test corpus, the proposed system classifies non-hateful tweets better than the baseline system (62.0% versus 58.4%). On the other hand, it classifies hate speech slightly worse (72.4% versus 75.6%). On the Founta test set (see Table 4), the conclusions are different: the proposed system classifies hateful tweets better than the baseline system (27.5% versus 24.7%), and normal and abusive speech slightly worse: 95.2% versus 95.4% for normal speech and 88.5% versus 90.1% for abusive speech. This difference in detection results between HatEval and Founta is confirmed by the F1 score per class in the first part of Table 5. We think that the balance between the classes plays an important role here: in the HatEval corpus the classes are balanced, whereas in the Founta corpus they are unbalanced.
To perform a deeper analysis, we focus on the tweets from the test sets containing at least one MWE: 758 tweets from the HatEval test set and 3508 tweets from the Founta test set. Indeed, according to Section 4, about 25% of the tweets contain MWEs. The second part of Table 5 shows that the results are consistent with those observed previously in this section, and that the improvement obtained is larger.
In this work, we explored a new way to design an HSD system for short texts, like tweets. We proposed to add new features to our DNN-based detection system: multiword expression features. We integrated MWE features into a USE-based neural network using a three-branch neural network. This network makes it possible to take into account sentence-level features (USE embedding) and word-level features (MWE categories and the embeddings of the words belonging to the MWEs). The results were validated on two tweet corpora: HatEval and Founta. The models we proposed yielded significant improvements in macro-F1 over the baseline system (USE system). Furthermore, on the HatEval corpus, the proposed system with MWEall categories and BERT embedding significantly outperformed the state-of-the-art system FERMI, ranked first at the SemEval2019 shared task 5. These results show that MWE features enrich our baseline system. The proposed approach can be adapted to other NLP tasks, like sentiment analysis or automatic translation.
- (2017) Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion.
- (2019-06) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 54–63.
- (2018-11) Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174.
- (2017) Mean birds: detecting aggression and bullying on Twitter. In Proceedings of the 2017 ACM on Web Science Conference, pp. 13–22.
- (2012) Detecting offensive language in social media to protect adolescent online safety. In Proceedings of the 2012 ASE/IEEE International Conference on Social Computing, Washington, USA, pp. 71–80.
- (2017-09) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680.
- (2017-12) Multiword expression processing: a survey. Computational Linguistics 43 (4), pp. 837–892.
- Hybrid emoji-based masked language models for zero-shot abusive language detection. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 943–949.
- (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- (2018) Large scale crowdsourcing and characterization of Twitter abusive behavior. CoRR abs/1802.00393.
- (2017-08) Deep learning models for multiword expression identification. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Vancouver, Canada, pp. 54–64.
- (1989) Some statistical issues in the comparison of speech recognition algorithms. In International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, Glasgow, UK, pp. 532–535.
- (2019) Improving and interpreting neural networks for word-level prediction tasks in natural language processing. Ph.D. Thesis, Ghent University, Belgium.
- (2019-06) FERMI at SemEval-2019 task 5: using sentence embeddings to identify hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 70–74.
- (2020-11) Using transfer-based language models to detect hateful and offensive language online. In Proceedings of the Fourth Workshop on Online Abuse and Harms, Online, pp. 16–27.
- (2018) Comparative studies of detecting abusive language on Twitter. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp. 101–106.
- (2013) . In ICLR Workshop Papers.
- A BERT-based transfer learning approach for hate speech detection in online social media. CoRR abs/1910.12574.
- (2016) Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, Montréal, Canada, pp. 145–153.
- (2018) Automatic identification of misogyny in English and Italian tweets at EVALITA 2018 with a multilingual hate lexicon. In EVALITA@CLiC-it.
- (2018-08) Edition 1.1 of the PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), Santa Fe, New Mexico, USA, pp. 222–240.
- (2020) Hate-speech and offensive language detection in Roman Urdu. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2512–2522.
- (2002-02) Multiword expressions: a pain in the neck for NLP. In Proceedings of CICLING-2002, Mexico City, Mexico, pp. 1–15.
- (2017-04) The PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), Valencia, Spain, pp. 31–47.
- (2016-06) SemEval-2016 task 10: detecting minimal semantic units and their meanings (DiMSUM). In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 546–559.
- (2015-May–June) A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1537–1547.
- (2020-12) Multi-word expressions for abusive speech detection in Serbian. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, Online, pp. 74–84.
- (2020-12) MTLB-STRUCT @PARSEME 2020: capturing unseen multiword expressions using multi-task learning and pre-trained masked language models. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, Online, pp. 142–148.
- (2018) SHOMA at PARSEME shared task on automatic identification of VMWEs: neural multiword expression tagging with high generalisation. CoRR abs/1809.03056.
- (2016-06) Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, San Diego, California, pp. 88–93.