Deep Multi-Task Models for Misogyny Identification and Categorization on Arabic Social Media

by Abdelkader El Mahdaouy, et al.

The prevalence of toxic content on social media platforms, such as hate speech, offensive language, and misogyny, presents serious challenges to our interconnected society. These challenging issues have attracted widespread attention in the Natural Language Processing (NLP) community. In this paper, we present our submitted systems for the first Arabic Misogyny Identification shared task. We investigate three multi-task learning models as well as their single-task counterparts. To encode the input text, our models rely on the pre-trained MARBERT language model. The overall obtained results show that all our submitted models achieved the best performances (top three ranked submissions) in both the misogyny identification and categorization tasks.





1 Introduction

With the popularity of the Internet and the rise of social media platforms, users around the world enjoy greater freedom of expression. They can express their thoughts and opinions with minimal limitations and restrictions. As a result, they can share their positive thoughts about a specific product or service, a political decision, etc., as well as their negative thoughts about other matters. Unfortunately, many users employ these communication channels and this freedom of expression to bully other people or groups. Misogyny is one of these phenomena; it is defined as hate speech towards the female gender [Moloney2018AssessingOM]. Misogyny can be classified into several categories such as sexual harassment, damning, dominance, etc.


Misogynistic behavior is prevalent on social media platforms such as Facebook and Twitter. The ease of use and richness of these platforms have raised misogyny to new levels of violence around the globe. Moreover, women suffer from misogyny in first-world countries just as they do in second- and third-world countries, regardless of their race, language, age, etc. In the Arab world, women's rights and liberties have always been a controversial subject. Consequently, women there are also exposed to online misogyny, where people can start campaigns of intimidation and harassment against them for one reason or another.

Fighting online misogyny has become a topic of interest for several Internet players; social media networks such as Facebook and Twitter provide reporting systems that allow users to report posts expressing misogynistic behavior. Such systems can also detect these behaviors in users' posts and delete them automatically. For high-resource languages such as English, Spanish, and French, these systems have been shown to perform well. However, for languages such as Arabic, automatic reporting systems are not yet deployed, mainly due to: 1) the lack of annotated data needed to build such systems and 2) the complexity of the Arabic language compared to other languages.

Fine-tuning pre-trained transformer-based language models [devlin-etal-2019-bert] on downstream tasks has shown state-of-the-art (SOTA) performances on various languages including Arabic [antoun-etal-2020-arabert, abdul-mageed-etal-2021-arbert, el-mekki-etal-2021-domain, el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep]. Although several research works based on pre-trained transformers have been introduced for misogyny detection in Indo-European languages [safi-samghabadi-etal-2020-aggression, FersiniNR20, 9281090], work on the Arabic language remains underexplored [mulki2021letmi].

In this paper, we present our participating system and submissions to the first Arabic Misogyny Identification (ArMI) shared task [armi2021overview]. We introduce three Multi-Task Learning (MTL) models and their single-task counterparts. To embed the input texts, our models employ the pre-trained MARBERT language model [abdul-mageed-etal-2021-arbert]. Moreover, for Task 2, we tackle the class imbalance problem by training our models to minimize the Focal Loss [abs-1708-02002]. The obtained results demonstrate that our three submissions achieved the best performances for both ArMI tasks in comparison to the other participating systems. The results also show that MTL models outperform their single-task counterparts on most evaluation measures. Additionally, the Focal Loss has shown effective performances, especially on F1 measures.

The rest of this paper is organized as follows. Section 2 describes the ArMI tasks and the provided dataset. In Section 3, we introduce our participating system and the investigated deep learning models. Section 4 presents the conducted experiments and shows the obtained results. In Section 5, we conclude the paper.

2 Tasks and dataset description

The Arabic Misogyny Identification (ArMI) task consists of the automatic detection of misogyny in Arabic tweets [armi2021overview]. It is composed of two main sub-tasks: the first sub-task is a binary classification task where the objective is to classify whether a tweet is misogynistic or not. In the second sub-task, the objective is to detect the misogynistic behavior expressed in a tweet; it is modeled as a multi-class classification problem over seven misogynistic behaviors (labels). The organizers have provided 7,866 labeled tweets serving both sub-tasks for model training, while 1,966 tweets are used for model testing and evaluation. Figure 1 presents the label distribution for both tasks. It shows that the class labels are imbalanced for both the misogyny identification and categorization tasks.

(a) Distribution of misogynistic tweets
(b) Distribution of misogynistic categories
Figure 1: Label distribution for both the misogyny and category detection tasks.

The provided tweets are expressed mainly in Modern Standard Arabic (MSA), while several tweets are expressed in Arabic dialects such as Egyptian, Gulf, and Levantine. The Levantine tweets are taken from the Let-Mi misogyny detection dataset proposed by mulki2021letmi. The rest of the tweets have been scraped from Twitter using hashtags related to the misogyny phenomenon. The provided dataset is manually annotated by Arabic native speakers.

3 Methodology

We propose three deep Multi-Task Learning (MTL) models based on the pre-trained MARBERT encoder [abdul-mageed-etal-2021-arbert] for the ArMI shared task. We also investigate the single-task versions of the proposed MTL models. The choice of the MARBERT encoder is motivated by the fact that this language model is pre-trained on a corpus of 1B tweets, containing both dialectal Arabic and MSA. Moreover, fine-tuning MARBERT on downstream NLP tasks has shown effective results in many Arabic NLP applications [abdul-mageed-etal-2021-arbert, el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep]. In what follows, we describe each component of our submitted system.

3.1 Preprocessing

The tweet preprocessing component performs emoji extraction, user mention and URL substitution, and hashtag normalization. Following MARBERT's tweet preprocessing guidelines, user mentions and URLs are replaced by the "user" and "url" tokens, respectively. For hashtag normalization, we remove the "#" symbol and replace "_" with white space. It is worth mentioning that diacritics are already removed from the training and testing datasets. Based on our preliminary experiments, emojis are not removed from the normalized text but are instead appended after the [SEP] token of the employed encoder. Finally, each tweet is represented using its normalized text and its emojis, as follows:

  • [CLS] normalized tweet [SEP] emojis [SEP]
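The steps above can be sketched as follows. This is a minimal illustration of the described pipeline, not the authors' exact code; the function name is ours and the emoji regex covers only the common Unicode emoji ranges.

```python
import re

# Rough emoji character ranges; a production system would use a dedicated emoji library.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\u2190-\u21FF\uFE0F]")

def preprocess_tweet(text: str):
    """Return (normalized_text, emojis), to be encoded by the tokenizer as
    [CLS] normalized tweet [SEP] emojis [SEP]."""
    emojis = "".join(EMOJI_RE.findall(text))
    text = EMOJI_RE.sub("", text)                          # emojis are moved after [SEP]
    text = re.sub(r"@\w+", "user", text)                   # user mentions -> "user"
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)   # URLs -> "url"
    text = text.replace("#", "").replace("_", " ")         # hashtag normalization
    text = re.sub(r"\s+", " ", text).strip()               # collapse whitespace
    return text, emojis
```

Passing the returned pair to the tokenizer as a sentence pair (e.g. `tokenizer(text, emojis)`) produces the [CLS] ... [SEP] ... [SEP] layout described above, with the special tokens added by the tokenizer itself.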

3.2 Deep Learning Models

In this section, we describe the employed MTL models and their single-task counterparts. All our models utilize the MARBERT encoder to represent the input tweets. The models are described as follows:

  • MT_CLS uses a classification layer for each task on top of the MARBERT encoder. It relies on the [CLS] token embedding to predict the class label for each task. The single-task version of this model is denoted by ST_CLS.

  • MT_ATT consists of the MARBERT encoder, two task-specific attention layers, and two classification layers. Each attention layer [Bahdanau2015NeuralMT, yang-etal-2016-hierarchical] extracts task-discriminative features by weighting the output token embeddings of the encoder according to their contribution to the task at hand. Each classification layer is fed the concatenation of the task attention output and the [CLS] token embedding. This architecture has shown effective performances in many NLP tasks, including dialect identification, sentiment analysis, and sarcasm detection for the Arabic language [el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep], as well as humor detection and rating and lexical complexity prediction in English [essefar-etal-2021-cs, el-mamoun-etal-2021-cs]. The single-task counterpart of MT_ATT is denoted by ST_ATT.

  • MT_VHATT is an extension of the MT_ATT model. In addition to the task-specific attention layers (called horizontal attention layers), it employs vertical attention layers to incorporate the features of the top intermediate layers of the MARBERT encoder for both tasks. This model utilizes six attention layers to extract features from the token embeddings of the top six intermediate layers of the encoder [Bahdanau2015NeuralMT, yang-etal-2016-hierarchical]. Then, another attention layer is employed to aggregate the features of the six vertical attention layers. Note that we exclude the top output layer of the encoder, as its features are already used by the horizontal (task-specific) attention layers. Finally, the input of the classification layer for each task is the concatenation of the [CLS] token embedding of the last layer of the encoder, the task-specific attention output, and the aggregated features of the intermediate layers. The single-task version of MT_VHATT is denoted by ST_VHATT.
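The task-specific (horizontal) attention layers used by MT_ATT and MT_VHATT can be sketched as an additive, Bahdanau-style attention pooling module. The class name and layer sizes below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """Additive attention pooling over token embeddings: scores each token,
    masks out padding, and returns the weighted sum as a task-specific feature."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.context = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        scores = self.context(torch.tanh(self.proj(token_embeddings))).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)              # (batch, seq_len)
        # Weighted sum of token embeddings -> task-discriminative representation
        return (weights.unsqueeze(-1) * token_embeddings).sum(dim=1)
```

In MT_ATT, one such module per task pools the encoder's output, and its result is concatenated with the [CLS] embedding before the task's classification layer.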

For misogyny identification (Task 1), all models are trained to minimize the binary cross-entropy loss. For misogyny categorization (Task 2), we have investigated the Cross-Entropy (CE) loss as well as the Focal Loss (FL) [abs-1708-02002]. The latter is employed to handle the class imbalance problem: it reduces the loss contribution from easy examples and assigns higher importance weights to hard-to-classify examples. The FL is given by:

    FL(p, y) = -α_y (1 - p_y)^γ log(p_y)

where y denotes the category's label, p is a vector representing the predicted probability distribution over the labels, α_y is the weight of label y, and γ controls the contribution of high-confidence predictions to the loss. In other words, a higher value of γ implies a lower loss contribution for well-classified examples [abs-1708-02002].
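Under these definitions, the multi-class Focal Loss can be implemented in a few lines of PyTorch. This is a sketch, not the authors' exact implementation; the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=None, gamma=2.0):
    """FL(p, y) = -alpha_y * (1 - p_y)^gamma * log(p_y), averaged over the batch.
    logits: (N, C); targets: (N,) integer class labels; alpha: optional (C,) weights."""
    log_p = F.log_softmax(logits, dim=-1)
    # Log-probability and probability assigned to the true class of each example
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    # The (1 - pt)^gamma factor down-weights easy, high-confidence examples
    loss = -((1 - pt) ** gamma) * log_pt
    if alpha is not None:
        loss = alpha[targets] * loss
    return loss.mean()
```

Setting gamma=0 with no alpha weights recovers the standard cross-entropy loss, which makes for a convenient sanity check.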

4 Experiments and results

In this section, we present the experiment settings as well as the obtained results for our development set and the provided test set.

4.1 Experiment settings

All our models are implemented using the PyTorch framework and the open-source Transformers library. Experiments are performed on a PowerEdge R740 server with a 44-core Intel Xeon Gold 6152 2.1GHz CPU, 384 GB of RAM, and a single Nvidia Tesla V100 GPU with 16GB of RAM. The provided training set is split into training and development subsets. Based on our preliminary results, all models are trained using the Adam optimizer with fixed values for the learning rate, the number of epochs, and the batch size, as well as for the γ hyper-parameter of the Focal Loss and the weights of the Task 2 labels. All models are evaluated using Accuracy as well as the macro-averaged Precision, Recall, and F1 measures.

4.2 Results

In order to select the best models for our official submissions, we have evaluated the three MTL models and their single-task counterparts on the development set. For Task 2, we have investigated both the CE and FL losses. Table 1 presents the obtained results on the development set using the three single-task models. The overall results for Task 1 show that the ST_ATT model outperforms the other models on most evaluation measures. It also shows the best Recall and F1 measures for Task 2. Moreover, ST_VHATT yields slightly better performances on Task 1 and achieves far better Precision and F1 scores on Task 2 in comparison to the ST_CLS model. Furthermore, FL outperforms the CE loss on most evaluation measures for Task 2, except for the Accuracy and Precision of the ST_CLS model. Table 2 presents the classification reports for Task 2 of the ST_ATT model using the CE and FL loss functions. The obtained results show that FL leads to better F1 scores for all categories except the "Discredit" and "Damning" misogynistic behaviours. Indeed, the classification of rare categories is improved while the overall performance is maintained.
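Per-category reports like Table 2, and the macro-averaged measures used throughout, can be reproduced with scikit-learn. The label arrays below are toy values for illustration only.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy gold and predicted category labels (illustrative only)
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 0, 0]

# Per-category precision/recall/F1 and support, as in Table 2
print(classification_report(y_true, y_pred, zero_division=0))

# Macro-averaged Precision, Recall, and F1, as used for the ArMI evaluation
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```

`zero_division=0` mirrors the usual convention of scoring 0 precision for a class the model never predicts, rather than raising a warning.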

Model     Task 1: Acc / Prec / Rec / F1     Cat. Task Loss   Task 2: Acc / Prec / Rec / F1
ST_CLS    90.72 / 90.48 / 89.91 / 90.18     CE               81.58 / 71.80 / 56.05 / 60.66
                                            FL               79.67 / 62.15 / 63.00 / 62.05
ST_ATT    90.98 / 90.80 / 90.12 / 90.43     CE               80.81 / 67.79 / 56.70 / 59.63
                                            FL               80.94 / 70.22 / 62.12 / 64.60
ST_VHATT  90.85 / 90.50 / 90.20 / 90.34     CE               80.43 / 64.87 / 60.15 / 61.99
                                            FL               80.94 / 68.43 / 61.79 / 63.96
Table 1: The obtained results on the dev set using the three single-task models for both Task 1 and Task 2.
Category                        CE: Prec / Rec / F1         FL: Prec / Rec / F1         Support
None                            0.8509 / 0.8954 / 0.8726    0.8845 / 0.8758 / 0.8801    306
Damning                         0.8841 / 0.9104 / 0.8971    0.8955 / 0.8955 / 0.8955    67
Derailing                       0.2500 / 0.0909 / 0.1333    0.4286 / 0.2727 / 0.3333    11
Discredit                       0.8247 / 0.8362 / 0.8304    0.7980 / 0.8397 / 0.8183    287
Dominance                       0.3636 / 0.3636 / 0.3636    0.4375 / 0.3182 / 0.3684    22
Sexual harassment               1.0000 / 0.3333 / 0.5000    1.0000 / 0.5000 / 0.6667    6
Stereotyping & objectification  0.6786 / 0.5846 / 0.6281    0.6897 / 0.6154 / 0.6504    65
Threat of violence              0.5714 / 0.5217 / 0.5455    0.4839 / 0.6522 / 0.5556    23
Table 2: ST_ATT model’s classification reports on the dev set of Task 2 using CE and FL loss functions.
Model     Cat. Task Loss   Task 1: Acc / Prec / Rec / F1     Task 2: Acc / Prec / Rec / F1
MT_CLS    CE               90.98 / 90.39 / 90.72 / 90.55     79.67 / 67.68 / 57.02 / 60.18
          FL               91.49 / 91.34 / 90.66 / 90.97     80.43 / 67.65 / 60.55 / 62.92
MT_ATT    CE               91.11 / 91.48 / 89.75 / 90.46     80.56 / 66.80 / 58.63 / 60.52
          FL               91.74 / 91.42 / 91.16 / 91.28     80.81 / 67.90 / 61.55 / 63.29
MT_VHATT  CE               91.11 / 90.81 / 90.40 / 90.60     80.18 / 66.91 / 57.39 / 60.01
          FL               91.49 / 91.19 / 90.84 / 91.01     80.05 / 66.82 / 58.92 / 61.67
Table 3: The obtained results on the dev set using the three multi-task models for both Task 1 and Task 2.

Table 3 presents the obtained results on the dev set using the three multi-task models for both Task 1 and Task 2. The overall results show that MT_ATT outperforms all other models on both tasks for most evaluation measures. The results also demonstrate that, in multi-task settings, using the FL for Task 2 improves the model's performance on Task 1 as well. In accordance with the results obtained using single-task models, MT_VHATT shows slightly better performances on Task 1 than the ST_CLS model. Overall, the obtained results show that multi-task learning models surpass their single-task counterparts on Task 1. This can be explained by the fact that MT models leverage signals from both tasks [Caruana94, sun2019ernie20].

4.3 Official submissions results

Based on the obtained results on the development set, we have submitted models trained using the FL for misogyny categorization (Task 2). This choice is motivated by the fact that the FL has led to better F1 scores than the CE loss on the dev set (Tables 1 and 3). Our three official submissions are described as follows:

  • run1: corresponds to the results obtained on both tasks using the single-task model ST_ATT.

  • run2: corresponds to the obtained results on both tasks using the multi-task model MT_ATT.

  • run3: corresponds to the ensembling of the three multi-task learning models, namely MT_CLS, MT_ATT, and MT_VHATT. In this submission, the logits of the three models are averaged. Depending on the task, either the sigmoid or the softmax activation is applied to obtain the label probabilities.
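The run3 logit-averaging scheme can be sketched as follows; the function name and task flag are illustrative, not part of the authors' code.

```python
import torch

def ensemble_logits(logits_list, task="categorization"):
    """Average the logits of several models, then apply the task's activation:
    sigmoid for binary misogyny identification, softmax for categorization."""
    avg = torch.stack(logits_list, dim=0).mean(dim=0)
    if task == "identification":
        return torch.sigmoid(avg)          # Task 1: binary probability
    return torch.softmax(avg, dim=-1)      # Task 2: distribution over categories
```

Averaging in logit space before the activation, rather than averaging probabilities, is the design choice stated in the submission description above.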

System          Accuracy   Precision   Recall   F1
UM6P-NLP_run3   91.9       92          90.9     91.4
UM6P-NLP_run2   91.5       91.5        90.5     91
UM6P-NLP_run1   91.5       91.1        91.1     91.1
UoT_run1        90.5       90.1        89.9     90
SOA_NLP_run1    88.3       87.8        87.6     87.7
Table 4: Top five submitted systems' performance on ArMI Task 1.
System          Accuracy   Precision   Recall   F1
UM6P-NLP_run2   82.7       69.7        64.7     66.5
UM6P-NLP_run3   83.3       71.7        63.6     65.3
UM6P-NLP_run1   81.6       69.2        65.2     65.1
SOA_NLP_run2    76.4       67.6        48       53.1
SOA_NLP_run3    74.5       54.9        50.8     52.6
Table 5: Top five submitted systems' performance on ArMI Task 2.

Tables 4 and 5 summarize the official results of the top five systems submitted to Task 1 and Task 2, respectively. The results show that all our submissions are ranked in the top three among all submitted systems. In accordance with our previous results, our multi-task models achieved the first and second ranking positions. Although the ensembling of the three MTL models (run3) yielded the best performances on most evaluation measures for both tasks, the best F1 score for Task 2 is obtained by run2 (the MT_ATT model).

5 Conclusion

In this paper, we have presented our participating system in the first Arabic Misogyny Identification shared task. We have investigated three Multi-Task Learning models and their single-task counterparts using the pre-trained MARBERT encoder. In order to deal with the class label imbalance in Task 2, we have employed the Focal Loss. The results show that our three submitted systems are top-ranked among the systems participating in both ArMI tasks. The overall results demonstrate that MTL models outperform their single-task versions in most evaluation scenarios. Besides, the Focal Loss has shown effective performances, especially on F1 measures.

Experiments presented in this paper were carried out using the supercomputer simlab-cluster, supported by Mohammed VI Polytechnic University, and the facilities of the simlab-cluster HPC & IA platform.