Multilingual Offensive Language Identification for Low-resource Languages

05/12/2021 · Tharindu Ranasinghe et al. · Rochester Institute of Technology and University of Wolverhampton

Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have recently been published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partially because most annotated datasets available contain English data. In this paper, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report macro F1 scores of 0.8415 for Bengali in the TRAC-2 shared task, 0.8532 for Danish and 0.8701 for Greek in OffensEval 2020, 0.8568 for Hindi in the HASOC 2019 shared task, and 0.7513 for Spanish in SemEval-2019 Task 5 (HatEval), showing that our approach compares favourably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of the OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.


1. Introduction

Offensive posts on social media result in a number of undesired consequences to users. They have been investigated as triggers of suicide attempts and ideation, and mental health problems (Bonanno and Hymel, 2013; Bannink et al., 2014). One of the most common ways to cope with offensive content online is training systems capable of recognizing offensive messages or posts. Once recognized, such offensive content can be set aside for human moderation or deleted from the respective platform (e.g. Facebook, Twitter) preventing harm to users and controlling the spread of abusive behavior in social media.

There have been several recent studies published on this topic focusing on different aspects of offensiveness such as abuse (Mubarak et al., 2017), aggression (Kumar et al., 2018, 2020), cyber-bullying (Rosa et al., 2019), and hate speech (Malmasi and Zampieri, 2017, 2018; Röttger et al., 2020) to name a few. While there are a few studies published on languages such as Arabic (Mubarak et al., 2020) and Greek (Pitenis et al., 2020), most studies and datasets created thus far have focused on English. Data augmentation (Ghadery and Moens, 2020) and multilingual word embeddings (Pamungkas and Patti, 2019) have been applied to take advantage of existing English datasets to improve the performance in systems dealing with languages other than English. To the best of our knowledge, however, state-of-the-art cross-lingual contextual embeddings such as XLM-R (Conneau et al., 2020) have not yet been applied to offensive language identification. To address this gap, we evaluate the performance of cross-lingual contextual embeddings and transfer learning (TL) methods in projecting predictions from English to other languages. We show that our methods compare favourably to state-of-the-art approaches submitted to recent shared tasks on all datasets. The main contributions of this paper are the following:

  1. We apply cross-lingual contextual word embeddings to offensive language identification. We take advantage of existing English data to project predictions in seven other languages: Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish.

  2. We tackle both off-domain and off-task data for Bengali. We show that these methods can project predictions not only across languages but also across domains (e.g. Twitter vs. YouTube) and tasks (e.g. binary vs. three-way classification).

  3. We provide important resources to the community: the code, and the English model will be freely available to everyone interested in working on low-resource languages using the same methodology.

Finally, it is worth noting that the seven languages other than English included in this study all have millions of speakers. Their situation in terms of language resources is more favorable and not comparable to that of minority or endangered languages, for which there are virtually no datasets available, often called truly low-resource languages (Agić et al., 2016) or extremely low-resource languages (Tapo et al., 2020). That said, we consider these seven languages low-resourced in the context of offensive language identification due to the lack of large offensive language datasets in these languages, especially compared to English. The methods presented in this paper can also be applied to truly low-resource languages, addressing data scarcity.

2. Related Work

There have been different types of abusive content addressed in recent studies including hate speech (Malmasi and Zampieri, 2018), aggression (Kumar et al., 2018, 2020), and cyberbullying (Rosa et al., 2019). A few annotation taxonomies, such as the one proposed by OLID (Zampieri et al., 2019a) and replicated in other studies (Rosenthal et al., 2020), try to take advantage of the similarities between these sub-tasks allowing us to consider multiple types of abusive language at once.

In terms of computational approaches, a variety of methods have been tested to identify abusive posts automatically, including simple profanity lexicons, machine learning classifiers like Naive Bayes and SVMs, and deep learning representations and neural networks like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) (Hettiarachchi and Ranasinghe, 2019), LSTMs (Aroyehun and Gelbukh, 2018; Majumder et al., 2018), and transformers (Ranasinghe et al., 2019). In this section we briefly summarise related work addressing the most common abusive language phenomena.

Aggression identification: The two editions of the TRAC shared task on Aggression Identification, organized in 2018 and 2020 (Kumar et al., 2018, 2020), provided participants with an annotated Facebook dataset containing posts and comments labeled as non-aggressive, covertly aggressive, and overtly aggressive. The best-performing systems in the 2018 edition used deep learning approaches based on CNNs, RNNs, and LSTMs (Aroyehun and Gelbukh, 2018; Majumder et al., 2018). In Section 5 we compare our results for Bengali with the best-performing methods of TRAC 2020.

Cyberbullying detection: Cyberbullying is another popular topic, with several studies published, such as Xu et al. (2012), who used sentiment analysis methods and topic modelling, and Dadvar et al. (2013), who addressed the problem using profiling-related features such as the frequency of curse words in users’ messages.

Hate Speech: Hate speech is by far the most studied phenomenon with several studies published on various languages (Burnap and Williams, 2015; Djuric et al., 2015). The recent HatEval (Basile et al., 2019) competition at SemEval 2019 addressed hate speech against women and migrants featuring datasets in English and Spanish.

Offensive Language: In addition to the popular OffensEval shared task, which introduced a taxonomy encompassing multiple types of abusive and offensive content discussed in more detail in Section 5.1, several studies have been published on identifying offensive posts (Pitenis et al., 2020). The GermEval shared task (Wiegand et al., 2018) provided participants with a dataset containing 8,500 annotated German tweets. Two sub-tasks were organized: the first was to discriminate between offensive and non-offensive tweets, and the second to discriminate between profanity, insult, and abuse.

Other Languages: While the clear majority of studies deal with English, there have been a number of studies dealing with other languages as well. Examples include Dutch (Van Hee et al., 2015; Tulkens et al., 2016), German (Ross et al., 2016), Italian (Pelosi et al., 2017), and the languages studied in the present paper.

3. Data

To carry out the experiments presented in this paper, we acquired datasets in English and seven other languages: Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish, listed in Table 1.

Lang. Instances Source Labels
Arabic 8,000 T offensive, non-offensive
Bengali 4,000 Y overtly aggressive, covertly aggressive, non aggressive
Danish 2,961 F, R offensive, non-offensive
English 14,100 T offensive, non-offensive
Greek 8,743 T offensive, non-offensive
Hindi 8,000 T hate offensive, non hate-offensive
Spanish 6,600 T hateful, non-hateful
Turkish 31,756 T offensive, non-offensive
Table 1. Instances, source, and labels in all datasets. F stands for Facebook, R for Reddit, T for Twitter, and Y for YouTube.

The Bengali, Hindi, and Spanish datasets have been used in shared tasks in 2019 and 2020, allowing us to compare the performance of our methods to other approaches. The Hindi dataset (Mandl et al., 2019) was used in the HASOC 2019 shared task, while the Spanish dataset (Basile et al., 2019) was used in SemEval-2019 Task 5 (HatEval). Both contain Twitter data and two labels. The Bengali dataset (Bhattacharya et al., 2020) was used in the TRAC-2 shared task (Kumar et al., 2020) on aggression identification. It differs from the other datasets in terms of domain (YouTube instead of Twitter) and label set (three classes instead of two), allowing us to compare the performance of cross-lingual embeddings on off-domain and off-task data.

The Arabic, Danish, Greek, and Turkish datasets are official datasets of the OffensEval 2020 competition hosted at SemEval 2020 (Task 12) (Zampieri et al., 2020). We report the results we obtained on the Danish and Greek test sets in Section 5.1. However, since Arabic and Turkish are morphologically rich languages, they require language-specific segmentation approaches, which we plan to explore in the future. Therefore, for these two languages we only report results on the development sets in Section 5.1, and a detailed discussion of future work on Arabic and Turkish appears in Section 6.

Finally, as our English dataset, we choose the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a), used in the SemEval-2019 Task 6 (OffensEval) (Zampieri et al., 2019b). OLID is arguably one of the most popular offensive language datasets. It contains manually annotated tweets with the following three-level taxonomy and labels:

  • Offensive language identification - offensive vs. non-offensive;

  • Categorization of offensive language - targeted insult or threat vs. untargeted profanity;

  • Offensive language target identification - individual vs. group vs. other.

We chose OLID due to the flexibility provided by its hierarchical annotation model, which considers multiple types of offensive content in a single taxonomy (e.g. targeted insults to a group are often hate speech, whereas targeted insults to an individual are often cyberbullying). This allows us to map OLID level A (offensive vs. non-offensive) to the labels in the other datasets. OLID’s annotation model is intended to serve as a general-purpose model for multiple abusive language detection sub-tasks, as described by Waseem et al. (2017). The transfer-learning strategy used in this paper gives us the opportunity to evaluate how closely the OLID labels relate to the classes in other datasets that were annotated using different guidelines and definitions (e.g. aggression and hate speech).
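The level-A projection can be sketched as a simple lookup. This is a hypothetical illustration, not the authors' code; the label strings follow Table 1, and the `project_label` helper is our own naming:

```python
# Hypothetical sketch of projecting OLID level-A labels (OFF/NOT) onto
# the binary label sets of the other datasets (label strings per Table 1).
OLID_TO_TARGET = {
    "hindi":   {"OFF": "hate offensive", "NOT": "non hate-offensive"},
    "spanish": {"OFF": "hateful",        "NOT": "non-hateful"},
    "danish":  {"OFF": "offensive",      "NOT": "non-offensive"},
}

def project_label(olid_label: str, target: str) -> str:
    """Map an OLID level-A label to a target dataset's label.

    The three-way Bengali task has no such direct mapping, which is why
    its softmax layer is re-initialised instead (Section 4.2).
    """
    return OLID_TO_TARGET[target][olid_label]

print(project_label("OFF", "spanish"))
```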

4. Methodology

Transformer models have been used successfully for various NLP tasks (Devlin et al., 2019), such as text classification (Ranasinghe and Hettiarachchi, 2020; Hettiarachchi and Ranasinghe, 2020b), NER (Ranasinghe and Zampieri, 2021; Taher et al., 2019), context similarity (Hettiarachchi and Ranasinghe, 2020a), and language identification (Jauhiainen et al., 2021). Most of this work has focused on English, since most pre-trained transformer models were trained on English data. Although multilingual models such as BERT-m (Devlin et al., 2019) exist, there has been speculation about their ability to represent all of their languages equally well (Pires et al., 2019), and while BERT-m shows some cross-lingual characteristics, it was not trained on cross-lingual data (Karthikeyan et al., 2020). The motivation behind our methodology is the recently released cross-lingual transformer model XLM-R (Conneau et al., 2020), trained on large multilingual corpora covering 100 languages. Notably, XLM-R is competitive on monolingual benchmarks while achieving the best results on cross-lingual benchmarks (Conneau et al., 2020). The main idea of our methodology is to train a classification model on a resource-rich language, typically English, using a cross-lingual transformer model, and then perform transfer learning to a less-resourced language.

There are two main parts to the methodology. Subsection 4.1 describes the classification architecture we used for all languages. Subsection 4.2 describes the transfer-learning strategies we used to leverage English offensive language data when predicting offensive content in less-resourced languages.

4.1. XLM-R for Text Classification

Like other transformer architectures, XLM-R can be used for text classification tasks (Conneau et al., 2020). The XLM-R-large model contains approximately 550M parameters, with 24 layers, 1,024 hidden states, 4,096 feed-forward hidden states, and 16 attention heads (Conneau et al., 2020). It takes as input a sequence of no more than 512 tokens and outputs a representation of the sequence. The first token of the sequence is always [CLS], which contains the special classification embedding (Sun et al., 2019).

For text classification tasks, XLM-R takes the final hidden state h of the first token [CLS] as the representation of the whole sequence. A simple softmax classifier is added on top of XLM-R to predict the probability of label c, as shown in Equation 1, where W is the task-specific parameter matrix (Ranasinghe and Hettiarachchi, 2020; Hettiarachchi and Ranasinghe, 2020b).

p(c | h) = softmax(W h)    (1)

We fine-tune all the parameters of XLM-R as well as W jointly by maximising the log-probability of the correct label. The classification architecture is shown in Figure 1. Specifically, we used the XLM-R-large model.

Figure 1. Text Classification Architecture (Ranasinghe and Zampieri, 2020)
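A minimal pure-Python sketch of the classification head in Equation 1, with toy dimensions. This is an illustration of the softmax-over-Wh computation only, not the actual model code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h, W):
    """p(c | h) = softmax(W h), as in Equation 1.

    h is the [CLS] hidden state; W is the task-specific parameter
    matrix of shape (num_labels, hidden_size).
    """
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)

# Toy 4-dimensional hidden state and a 2-label weight matrix.
h = [0.5, -1.0, 0.25, 2.0]
W = [[0.1, 0.2, -0.3, 0.4],   # row for the non-offensive label
     [-0.2, 0.1, 0.5, 0.3]]   # row for the offensive label
probs = classify(h, W)
```

In the real model, h is the 1,024-dimensional output of XLM-R-large rather than a toy vector.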

4.2. Transfer-learning strategies

When we adopt XLM-R for multilingual offensive language identification, we perform transfer learning in two different ways.

Inter-language transfer learning

We first train the XLM-R classification model on the first level of the English Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019a). We then save the weights of the XLM-R model as well as the softmax layer, and use these saved weights from English to initialise the weights for a new language. To explore this transfer-learning strategy, we experimented on the Hindi data released for the HASOC 2019 shared task (Mandl et al., 2019), the Spanish data released for HatEval 2019 (Basile et al., 2019), and the Arabic, Danish, Greek, and Turkish data released for OffensEval 2020 (Zampieri et al., 2020).

Inter-task and inter-language transfer learning

Similar to the inter-language transfer-learning strategy, we first train the XLM-R classification model on the first level of the English offensive language identification dataset (OLID) (Zampieri et al., 2019a). We then save only the weights of the XLM-R model and use them to initialise the weights for a new language. We do not reuse the weights of the final softmax layer, because this strategy targets data with a different number of offensive classes to predict: since the shape of the softmax layer reflects the number of classes, its weights cannot be transferred when the class counts differ. We explored this strategy on the Bengali dataset released for the TRAC-2 shared task (Kumar et al., 2020). As described in Section 3, the classifier must make a three-way distinction between ‘Overtly Aggressive’, ‘Covertly Aggressive’, and ‘Non Aggressive’ texts.
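The two strategies differ only in whether the softmax layer's weights are carried over. A toy sketch of that decision, with plain dicts standing in for a real checkpoint (function and key names are illustrative, not from the paper's code):

```python
def transfer_weights(english_state, num_target_labels, hidden_size=4,
                     num_english_labels=2):
    """Initialise a target-language model from the saved English OLID model.

    Inter-language TL: the target task has the same label count, so the
    trained softmax layer is copied as well. Inter-task TL (e.g. the
    3-way Bengali task): the softmax layer's shape no longer matches, so
    it is re-initialised while the encoder weights are still reused.
    """
    target_state = {k: v for k, v in english_state.items()
                    if k != "classifier"}
    if num_target_labels == num_english_labels:
        target_state["classifier"] = english_state["classifier"]
    else:
        # Fresh (here: zero-initialised) softmax layer for the new label set.
        target_state["classifier"] = [[0.0] * hidden_size
                                      for _ in range(num_target_labels)]
    return target_state

# Toy "checkpoint": one encoder tensor and a 2-label softmax layer.
english = {"encoder.layer0": [0.1, 0.2],
           "classifier": [[1.0] * 4, [2.0] * 4]}
bengali = transfer_weights(english, num_target_labels=3)  # inter-task TL
danish = transfer_weights(english, num_target_labels=2)   # inter-language TL
```

With a real PyTorch model, the same idea would operate on a `state_dict`, dropping or keeping the classifier keys.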

We used an Nvidia Tesla K80 GPU to train the models. We divided each dataset into a training set and a validation set using a 0.8:0.2 split. We mainly fine-tuned the learning rate and the number of epochs of the classification model manually to obtain the best results on the validation set. The same learning rate worked best for all languages, and 3 was the best number of epochs throughout. The other configurations of the transformer model were kept constant across all languages in order to ensure consistency between them; this also provides a good starting configuration for researchers who intend to apply this technique to a new language. We used a batch size of eight, the Adam optimiser (Kingma and Ba, 2014), and a linear learning-rate warm-up over 10% of the training data. The models were trained using only the training data, with early stopping if the evaluation loss did not improve over ten evaluation rounds.

Training on the English data took around one hour, while training on each of the other languages took around 30 minutes.
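The training configuration above can be collected in one place. This is a sketch for readers reproducing the setup; the tuned learning-rate value is not hard-coded here, and the `train_val_split` helper is our own illustration:

```python
# Hyperparameters stated in Section 4.2 (the tuned learning rate is set
# per the validation procedure and is deliberately not hard-coded here).
TRAIN_CONFIG = {
    "epochs": 3,                    # best value for all languages
    "batch_size": 8,
    "optimizer": "Adam",            # Kingma and Ba (2014)
    "warmup_ratio": 0.10,           # linear LR warm-up over 10% of the data
    "train_ratio": 0.8,             # 0.8:0.2 train/validation split
    "early_stopping_patience": 10,  # evaluation rounds without improvement
}

def train_val_split(data, ratio=TRAIN_CONFIG["train_ratio"]):
    """Split a dataset into training and validation portions."""
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

train, val = train_val_split(list(range(100)))
```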

5. Results and Evaluation

We evaluate the results obtained by all models using the test set provided by the organizers of each competition. We compare our results to the best systems in TRAC-2 for Bengali, HASOC for Hindi, and HatEval for Spanish, in terms of weighted and macro F1 according to the metrics reported by the task organizers: TRAC-2 reported only weighted F1, HatEval only macro F1, and HASOC both. Finally, we quantify the improvement the transfer-learning strategy brings to the performance of both BERT-m and XLM-R. TL indicates that the model used the transfer-learning strategy described in Subsection 4.2.
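For reference, the macro and weighted F1 scores used throughout the tables below can be computed as follows. This is a self-contained sketch (shared-task evaluation scripts typically use scikit-learn's `f1_score`); the toy labels are illustrative:

```python
from collections import Counter

def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of its precision and recall."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(y_true)
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)

def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by class support."""
    counts = Counter(y_true)
    n = len(y_true)
    return sum(f1_per_class(y_true, y_pred, l) * c / n
               for l, c in counts.items())

# Toy predictions (OLID-style OFF/NOT labels).
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "OFF"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
```

Macro F1 treats all classes equally, which is why it is the preferred metric for the imbalanced datasets in this task.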

Non Hate Offensive Hate Offensive Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.84 0.87 0.88 0.84 0.86 0.85 0.86 0.85 0.86 0.86
BERT-m (TL) 0.80 0.83 0.84 0.80 0.82 0.81 0.82 0.81 0.82 0.82
Bashar and Nayak (2019) NR NR NR NR NR NR NR NR 0.82 0.81
XLM-R 0.84 0.96 0.90 0.86 0.61 0.71 0.85 0.85 0.81 0.80
BERT-m 0.77 0.99 0.86 0.94 0.33 0.49 0.82 0.78 0.80 0.80
Table 2. Results for offensive language detection in Hindi. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Bashar and Nayak (2019) is the best result submitted to the competition.
Non Hateful Hateful Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.78 0.76 0.77 0.76 0.78 0.73 0.76 0.76 0.76 0.75
BERT-m (TL) 0.76 0.74 0.75 0.74 0.76 0.71 0.74 0.74 0.74 0.73
Vega et al. (2019) NR NR NR NR NR NR NR NR 0.73 NR
Pérez and Luque (2019) NR NR NR NR NR NR NR NR 0.73 NR
XLM-R 0.77 0.74 0.74 0.73 0.76 0.70 0.73 0.73 0.73 0.72
BERT-m 0.75 0.72 0.73 0.71 0.74 0.68 0.72 0.72 0.72 0.71
Table 3. Results for offensive language detection in Spanish. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Vega et al. (2019) and Pérez and Luque (2019) are the best results submitted to the competition.
Non Aggressive Overtly Aggressive Covertly Aggressive Weighted Average
Model P R F1 P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.96 0.94 0.95 0.73 0.81 0.77 0.72 0.80 0.76 0.86 0.85 0.84 0.84
Risch and Krestel (2020) NR NR NR NR NR NR NR NR NR NR NR 0.82 0.82
BERT-m (TL) 0.94 0.92 0.93 0.71 0.79 0.75 0.70 0.78 0.68 0.80 0.83 0.82 0.82
XLM-R 0.91 0.90 0.91 0.70 0.78 0.74 0.69 0.77 0.67 0.79 0.83 0.82 0.81
BERT-m 0.90 0.89 0.90 0.69 0.77 0.73 0.68 0.76 0.66 0.78 0.81 0.81 0.81
Table 4. Results for offensive language detection in Bengali. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Risch and Krestel (2020) is the best result submitted to the competition.

For Hindi, as presented in Table 2, transfer learning with XLM-R cross-lingual embeddings provided the best results, achieving 0.86 for both weighted and macro F1. In HASOC 2019 (Mandl et al., 2019), the best model, by Bashar and Nayak (2019), scored 0.81 macro F1 and 0.82 weighted F1 using convolutional neural networks.

For Spanish, transfer learning with XLM-R cross-lingual embeddings also provided the best results, achieving 0.75 macro and 0.76 weighted F1. The best two models in HatEval (Basile et al., 2019) for Spanish scored 0.73 macro F1. Both applied SVM classifiers trained on a variety of features such as character and word n-grams, POS tags, offensive word lexica, and embeddings.

The results on Bengali shown in Table 4 deserve special attention because the Bengali data is off-domain with respect to the English data (YouTube instead of Twitter) and it contains three labels (covertly aggressive, overtly aggressive, and non-aggressive) instead of the two in the English dataset (offensive and non-offensive). Here, TL indicates that the model used the inter-task, inter-domain, and inter-language transfer-learning strategy described in Subsection 4.2. As with Hindi and Spanish, for Bengali transfer learning with XLM-R cross-lingual embeddings provided the best results, achieving 0.84 for both macro and weighted F1 and thus outperforming the other models by a significant margin. The best model in the TRAC-2 shared task (Kumar et al., 2020) scored 0.82 weighted F1 on Bengali using a BERT-based system.

We take a closer look at the test-set predictions by XLM-R (TL) for Bengali in Figure 2. We observe that the performance for the non-aggressive class is substantially better than the performance for the overtly aggressive and covertly aggressive classes, following a trend observed by the TRAC-2 participants including Risch and Krestel (2020).

Figure 2. Heat map of the Bengali test set predictions by XLM-R (TL).

Finally, it is clear that in all the experimental settings, the cross-lingual embedding models fine-tuned with transfer learning outperform the best systems available for the three languages. Furthermore, the results show that the cross-lingual nature of the XLM-R model provided a boost over the multilingual BERT model in all the languages tested.

5.1. OffensEval 2020

We also evaluated our methods on the recently released OffensEval 2020 (Zampieri et al., 2020) datasets in four languages: Arabic, Danish, Greek, and Turkish. For Danish and Greek we evaluate our results on the test sets, comparing them to the best systems in the competition. As shown in Table 5 and Table 6, our methodology outperforms the best system submitted for each of these languages. As mentioned earlier, the remaining OffensEval 2020 languages, Arabic and Turkish, are morphologically rich and require language-specific segmentation techniques, which we hope to explore in future research. That said, to show that our transfer-learning strategy works for a wider range of languages, we present the results we obtained on their development sets in Tables 7 and 8. Although not fully comparable, the high results obtained on the development sets give us an indication that the proposed approach is likely to perform well on the Arabic and Turkish test sets as well. We aim to carry out this evaluation with different morphological segmentation techniques for both languages.

Finally, in all of the OffensEval 2020 languages, XLM-R with the transfer-learning strategy performed best in our experiments. The transfer-learning approach boosts the performance of both BERT-m and XLM-R across all the languages, and there is a clear advantage in using cross-lingual transformers with transfer learning rather than training transformer models from scratch.

Not Offensive Offensive Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.85 0.88 0.89 0.69 0.75 0.72 0.87 0.87 0.87 0.83
BERT-m (TL) 0.89 0.87 0.88 0.67 0.73 0.70 0.85 0.86 0.86 0.82
Pamies et al. (2020) NR NR NR NR NR NR NR NR NR 0.81
XLM-R 0.87 0.85 0.86 0.66 0.72 0.70 0.85 0.85 0.85 0.81
BERT-m 0.86 0.84 0.85 0.65 0.71 0.69 0.84 0.84 0.84 0.80
Table 5. Results for offensive language detection in Danish. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Pamies et al. (2020) is the best result submitted to the competition.
Not Offensive Offensive Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.95 0.93 0.94 0.72 0.80 0.76 0.91 0.91 0.91 0.87
BERT-m (TL) 0.94 0.92 0.93 0.71 0.79 0.75 0.90 0.90 0.90 0.86
Ahn et al. (2020) NR NR NR NR NR NR NR NR NR 0.85
XLM-R 0.91 0.89 0.90 0.68 0.75 0.71 0.86 0.86 0.87 0.83
BERT-m 0.90 0.88 0.89 0.67 0.74 0.70 0.85 0.85 0.85 0.81
Table 6. Results for offensive language detection in Greek. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Ahn et al. (2020) is the best result submitted to the competition.
Not Offensive Offensive Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.96 0.94 0.95 0.73 0.81 0.77 0.92 0.92 0.92 0.88
BERT-m (TL) 0.95 0.92 0.93 0.72 0.78 0.76 0.90 0.90 0.90 0.86
XLM-R 0.93 0.90 0.91 0.70 0.76 0.74 0.89 0.89 0.89 0.85
BERT-m 0.91 0.89 0.90 0.70 0.75 0.49 0.87 0.87 0.88 0.84
Table 7. Results for offensive language detection in Arabic. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Alami et al. (2020) is the best result submitted to the competition.
Not Offensive Offensive Weighted Average
Model P R F1 P R F1 P R F1 F1 Macro
XLM-R (TL) 0.83 0.81 0.82 0.70 0.78 0.74 0.89 0.89 0.89 0.85
BERT-m (TL) 0.81 0.79 0.80 0.69 0.77 0.72 0.87 0.88 0.88 0.84
XLM-R 0.79 0.77 0.78 0.67 0.75 0.70 0.85 0.85 0.86 0.80
BERT-m 0.77 0.75 0.77 0.66 0.74 0.69 0.84 0.84 0.85 0.79
Table 8. Results for offensive language detection in Turkish. For each model, Precision (P), Recall (R), and F1 are reported on all classes, and weighted averages. Macro-F1 is also listed. NR indicates that the research did not report that value. Wang et al. (2020) is the best result submitted to the competition.

5.2. Progress Tests

Figure 3. Transfer-learning impact on offensive language identification for (a) Hindi, (b) Spanish, (c) Danish, (d) Greek, and (e) Bengali. XLM-R and BERT-m indicate models trained from scratch, while XLM-R (TL) and BERT-m (TL) indicate models that followed the transfer-learning strategy.

The biggest challenge in developing deep learning models for offensive language identification in low-resource languages is finding suitable annotated data. We therefore investigate whether our transfer-learning approach can thrive in even lower-resource settings. To analyse this, we run experiments with 0 (unsupervised), 100, 200, 300, and up to 1,000 training instances, in five languages: Hindi, Spanish, Danish, Greek, and Bengali. We compare the results with and without transfer learning for both BERT-m and XLM-R, and report the macro F1 score between the predictions and the gold labels of the test set against the number of training instances in Figure 3.
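The progress-test protocol can be sketched as a simple loop over training-set sizes. The `train_and_eval` callable is a hypothetical stand-in for the actual train-and-score routine, not the authors' code:

```python
def progress_test(target_data, train_and_eval,
                  sizes=(0, 100, 200, 300, 400, 500,
                         600, 700, 800, 900, 1000)):
    """Train on growing slices of the target-language data and record
    the test-set macro F1 for each slice size (0 == unsupervised)."""
    return {n: train_and_eval(target_data[:n]) for n in sizes}

# Toy stand-in for the real train-and-evaluate routine: the score
# simply grows with the number of training instances.
fake_eval = lambda subset: 0.5 + 0.0003 * len(subset)
curve = progress_test(list(range(2000)), fake_eval)
```

Plotting each language's curve with and without the transferred weights yields the panels of Figure 3.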

The transfer-learning strategy significantly impacts the results. For Hindi, with only 100 training instances, XLM-R trained from scratch achieves only 0.25 macro F1 between the predictions and the gold labels of the test set, whereas XLM-R trained with the transfer-learning strategy achieves 0.61. As the number of training instances grows, the results of the XLM-R models trained with and without transfer learning converge. A similar pattern can be observed with the BERT-m model, and in all the languages we experimented with. In Bengali, however, increasing the number of training instances from 0 to 100 slightly dropped the transfer-learning results, most probably because a certain number of examples is needed to train the newly added softmax layer.

Interestingly, in all the languages, the transfer-learning strategy with zero training instances already provides a good macro F1 score. This shows that the cross-lingual nature of these transformer models allows them to identify offensive content even in an unsupervised setting, once transfer learning from English has been performed.

Our results confirm that XLM-R with the transfer learning strategy can be hugely beneficial to low-resource languages in offensive language identification where annotated training instances are scarce.

6. Conclusion and Future Work

This paper is the first study to apply cross-lingual contextual word embeddings to offensive language identification, projecting predictions from English to other languages. We developed systems based on different transformer architectures and evaluated their performance on benchmark datasets for Bengali (Kumar et al., 2020), Hindi (Mandl et al., 2019), and Spanish (Basile et al., 2019). Finally, we compared them with the systems that participated in shared tasks and showed that our best models outperform the best systems submitted to these competitions. Furthermore, our methods obtained promising results on the Arabic, Danish, Greek, and Turkish training and development sets released as part of the recent OffensEval 2020 (Zampieri et al., 2020).

We showed that XLM-R with transfer learning outperforms all the other methods we tested, as well as the best results obtained by the participants of the three competitions. Furthermore, the results on Bengali show that it is possible to achieve good performance using transfer learning on off-domain and off-task data when the labels do not have a direct correspondence to those in the projected dataset (two in English and three in Bengali). This opens exciting new avenues for future research given the multitude of phenomena (e.g. hate speech, abuse, cyberbullying), annotation schemes, and guidelines used in offensive language datasets.

We are keen to cover more languages for offensive language identification using this strategy. Furthermore, when experimenting with morphologically rich languages such as Arabic and Turkish, we are interested in whether language-specific preprocessing such as segmentation would improve the results. Finally, we would also like to apply our models to more low-resource languages, creating important resources to cope with the problem of offensive language in social media.

We are currently experimenting with the recently finished HASOC 2020 shared task on identifying offensive YouTube comments in code-mixed Malayalam. We are keen to test our transfer-learning strategy on this task since the dataset is off-domain and contains code-mixed data. To the best of our knowledge, this is the first time such methods have been tested on code-mixed data. Initial experiments (Ranasinghe et al., 2020) show that XLM-R with the transfer-learning strategy achieves 0.85 weighted and 0.73 macro F1, while XLM-R trained from scratch reaches only 0.83 and 0.71, respectively. These results suggest that our transfer-learning approach is likely to perform well on code-mixed data compared to other state-of-the-art methods.

Acknowledgments

We would like to thank the shared task organizers for making the datasets used in this paper available. We further thank the anonymous reviewers who provided us with constructive and insightful feedback to improve the quality of this paper.

References

  • Ž. Agić, A. Johannsen, B. Plank, H. M. Alonso, N. Schluter, and A. Søgaard (2016) Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics 4, pp. 301–312. Cited by: §1.
  • H. Ahn, J. Sun, C. Y. Park, and J. Seo (2020) NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer. In Proceedings of SemEval, Cited by: Table 6.
  • H. Alami, S. O. E. Alaoui, A. Benlahbib, and N. En-nahnahi (2020) LISAC FSDM-USMBA Team at SemEval 2020 Task 12: Overcoming AraBERT’s pretrain-finetune discrepancy for Arabic offensive language identification. In Proceedings of SemEval, Cited by: Table 7.
  • S. T. Aroyehun and A. Gelbukh (2018) Aggression detection in social media: using deep neural networks, data augmentation, and pseudo labeling. In Proceedings of TRAC, Cited by: §2, §2.
  • R. Bannink, S. Broeren, P. M. van de Looij–Jansen, F. G. de Waart, and H. Raat (2014) Cyber and Traditional Bullying Victimization as a Risk Factor for Mental Health Problems and Suicidal Ideation in Adolescents. PloS one 9 (4). Cited by: §1.
  • M. A. Bashar and R. Nayak (2019) QutNocturnal@ HASOC’19: CNN for hate speech and offensive content identification in Hindi language. In Proceedings of FIRE, Cited by: Table 2, §5.
  • V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, and M. Sanguinetti (2019) SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of SemEval, Cited by: Multilingual Offensive Language Identification for Low-resource Languages, §2, §3, §4.2, §5, §6.
  • S. Bhattacharya, S. Singh, R. Kumar, A. Bansal, A. Bhagat, Y. Dawer, B. Lahiri, and A. Kr. Ojha (2020) Developing a multilingual annotated corpus of misogyny and aggression. In Proceedings of TRAC, Cited by: §3.
  • R. A. Bonanno and S. Hymel (2013) Cyber bullying and internalizing difficulties: above and beyond the impact of traditional forms of bullying. Journal of youth and adolescence 42 (5), pp. 685–697. Cited by: §1.
  • P. Burnap and M. L. Williams (2015) Cyber hate speech on Twitter: an application of machine classification and statistical modeling for policy and decision making. Policy & Internet 7 (2), pp. 223–242. Cited by: §2.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of ACL, Cited by: §1, §4.1, §4.
  • M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong (2013) Improving cyberbullying detection with user context. In Advances in Information Retrieval, pp. 693–696. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, Cited by: §4.
  • N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, and N. Bhamidipati (2015) Hate speech detection with comment embeddings. In Proceedings of the Web Conference (WWW), Cited by: §2.
  • E. Ghadery and M. Moens (2020) LIIR at SemEval-2020 Task 12: A cross-lingual augmentation approach for multilingual offensive language identification. In Proceedings of SemEval, Cited by: §1.
  • H. Hettiarachchi and T. Ranasinghe (2019) Emoji powered capsule network to detect type and target of offensive posts in social media. In Proceedings of RANLP, Cited by: §2.
  • H. Hettiarachchi and T. Ranasinghe (2020a) BRUMS at SemEval-2020 task 3: contextualised embeddings for predicting the (graded) effect of context in word similarity. In Proceedings of SemEval, Cited by: §4.
  • H. Hettiarachchi and T. Ranasinghe (2020b) InfoMiner at WNUT-2020 task 2: transformer-based covid-19 informative tweet extraction. In Proceedings of W-NUT, Cited by: §4.1, §4.
  • T. Jauhiainen, T. Ranasinghe, and M. Zampieri (2021) Comparing approaches to Dravidian language identification. In Proceedings of VarDial, Cited by: §4.
  • K. Karthikeyan, Z. Wang, S. Mayhew, and D. Roth (2020) Cross-lingual ability of multilingual BERT: an empirical study. In Proceedings of ICLR, Cited by: §4.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Cited by: §4.2.
  • R. Kumar, A. Kr. Ojha, S. Malmasi, and M. Zampieri (2018) Benchmarking Aggression Identification in Social Media. In Proceedings of TRAC, Cited by: §1, §2, §2.
  • R. Kumar, A. Kr. Ojha, S. Malmasi, and M. Zampieri (2020) Evaluating Aggression Identification in Social Media. In Proceedings of TRAC, Cited by: Multilingual Offensive Language Identification for Low-resource Languages, §1, §2, §2, §3, §4.2, §5, §6.
  • P. Majumder, T. Mandl, et al. (2018) Filtering aggression from the multilingual social media feed. In Proceedings of TRAC, Cited by: §2, §2.
  • S. Malmasi and M. Zampieri (2017) Detecting Hate Speech in Social Media. In Proceedings of RANLP, Cited by: §1.
  • S. Malmasi and M. Zampieri (2018) Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence. Cited by: §1, §2.
  • T. Mandl, S. Modha, P. Majumder, D. Patel, M. Dave, C. Mandlia, and A. Patel (2019) Overview of the HASOC track at FIRE 2019: Hate speech and offensive content identification in Indo-European languages. In Proceedings of FIRE, Cited by: Multilingual Offensive Language Identification for Low-resource Languages, §3, §4.2, §5, §6.
  • H. Mubarak, K. Darwish, and W. Magdy (2017) Abusive language detection on Arabic social media. In Proceedings of ALW, Cited by: §1.
  • H. Mubarak, A. Rashed, K. Darwish, Y. Samih, and A. Abdelali (2020) Arabic offensive language on Twitter: analysis and experiments. Cited by: §1.
  • M. Pamies, E. Ohman, K. Kajava, and J. Tiedemann (2020) LT@Helsinki at SemEval-2020 Task 12: Multilingual or language-specific BERT?. In Proceedings of SemEval, Cited by: Table 5.
  • E. W. Pamungkas and V. Patti (2019) Cross-domain and cross-lingual abusive language detection: a hybrid approach with deep learning and a multilingual lexicon. In Proceedings ACL:SRW, Cited by: §1.
  • S. Pelosi, A. Maisto, P. Vitale, and S. Vietri (2017) Mining Offensive Language on Social Media. In Proceedings of CLiC-it, Cited by: §2.
  • J. M. Pérez and F. M. Luque (2019) Atalaya at SemEval-2019 Task 5: Robust embeddings for tweet classification. In Proceedings of SemEval, Cited by: Table 3.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT?. In Proceedings of ACL, Cited by: §4.
  • Z. Pitenis, M. Zampieri, and T. Ranasinghe (2020) Offensive Language Identification in Greek. In Proceedings of LREC, Cited by: §1, §2.
  • T. Ranasinghe, S. Gupte, M. Zampieri, and I. Nwogu (2020) WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments. In Proceedings of FIRE, Cited by: §6.
  • T. Ranasinghe and H. Hettiarachchi (2020) BRUMS at SemEval-2020 task 12: transformer based multilingual offensive language identification in social media. In Proceedings of SemEval, Cited by: §4.1, §4.
  • T. Ranasinghe, M. Zampieri, and H. Hettiarachchi (2019) BRUMS at HASOC 2019: Deep learning models for multilingual hate speech and offensive language identification. In Proceedings of FIRE, Cited by: §2.
  • T. Ranasinghe and M. Zampieri (2020) Multilingual offensive language identification with cross-lingual embeddings. In Proceedings of EMNLP, Cited by: Figure 1.
  • T. Ranasinghe and M. Zampieri (2021) MUDES: Multilingual Detection of Offensive Spans. In Proceedings of NAACL, Cited by: §4.
  • J. Risch and R. Krestel (2020) Bagging BERT models for robust aggression identification. In Proceedings of TRAC, Cited by: Table 4, §5.
  • H. Rosa, N. Pereira, R. Ribeiro, P. C. Ferreira, J. P. Carvalho, S. Oliveira, L. Coheur, P. Paulino, A. V. Simão, and I. Trancoso (2019) Automatic cyberbullying detection: a systematic review. Computers in Human Behavior 93, pp. 333–345. Cited by: §1, §2.
  • S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, and P. Nakov (2020) A large-scale semi-supervised dataset for offensive language identification. arXiv preprint arXiv:2004.14454. Cited by: §2.
  • B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, and M. Wojatzki (2016) Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. In Proceedings of NLP4CMC, Cited by: §2.
  • P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Margetts, and J. Pierrehumbert (2020) HateCheck: functional tests for hate speech detection models. arXiv preprint arXiv:2012.15606. Cited by: §1.
  • C. Sun, X. Qiu, Y. Xu, and X. Huang (2019) How to fine-tune BERT for text classification?. In Chinese Computational Linguistics, External Links: ISBN 978-3-030-32381-3 Cited by: §4.1.
  • E. Taher, S. A. Hoseini, and M. Shamsfard (2019) Beheshti-NER: Persian named entity recognition using BERT. In Proceedings of NSURL:ICNLSP, Cited by: §4.
  • A. A. Tapo, B. Coulibaly, S. Diarra, C. Homan, J. Kreutzer, S. Luger, A. Nagashima, M. Zampieri, and M. Leventhal (2020) Neural machine translation for extremely low-resource african languages: a case study on bambara. In Proceedings of LowResMT, Cited by: §1.
  • S. Tulkens, L. Hilte, E. Lodewyckx, B. Verhoeven, and W. Daelemans (2016) A Dictionary-based Approach to Racism Detection in Dutch Social Media. In Proceedings of TA-COS, Cited by: §2.
  • C. Van Hee, E. Lefever, B. Verhoeven, J. Mennes, B. Desmet, G. De Pauw, W. Daelemans, and V. Hoste (2015) Automatic detection and prevention of cyberbullying. In Proceedings of HUSO, Cited by: §2.
  • L. E. A. Vega, J. C. Reyes-Magaña, H. Gómez-Adorno, and G. Bel-Enguix (2019) MineriaUNAM at SemEval-2019 Task 5: Detecting hate speech in Twitter using multiple features in a combinatorial framework. In Proceedings of SemEval, Cited by: Table 3.
  • S. Wang, J. Liu, X. Ouyang, and Y. Sun (2020) Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification using Pre-trained Language Models. In Proceedings of SemEval, Cited by: Table 8.
  • Z. Waseem, T. Davidson, D. Warmsley, and I. Weber (2017) Understanding abuse: a typology of abusive language detection subtasks. In Proceedings of ALW, Cited by: §3.
  • M. Wiegand, M. Siegel, and J. Ruppenhofer (2018) Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval, Cited by: §2.
  • J. Xu, K. Jun, X. Zhu, and A. Bellmore (2012) Learning from bullying traces in social media. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technology (NAACL-HLT), Cited by: §2.
  • M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019a) Predicting the type and target of offensive posts in social media. In Proceedings of NAACL, Cited by: §2, §3, §4.2, §4.2.
  • M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019b) SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of SemEval, Cited by: §3.
  • M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin (2020) SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of SemEval, Cited by: Multilingual Offensive Language Identification for Low-resource Languages, §3, §4.2, §5.1, §6.