With the popularity of the Internet and the rise of social media platforms, users around the world enjoy greater freedom of expression. They can express their thoughts and opinions with minimal limitations and restrictions. As a result, they can share positive thoughts about a specific product or service, a political decision, etc., as well as negative ones. Unfortunately, many users employ these communication channels and this freedom of expression to bully other people or groups. Misogyny is one of these phenomena; it is defined as hate speech towards the female gender [Moloney2018AssessingOM]. Misogyny can be classified into several categories such as sexual harassment, damning, dominance, etc. [bookhaters].
Misogynistic behavior is prevalent on social media platforms such as Facebook and Twitter. The ease of use and richness of these platforms have raised misogyny to new levels of violence around the globe. Moreover, women suffer from misogyny in first-world countries as much as in developing countries, regardless of their race, language, age, etc. In the Arab world, women's rights and liberty have always been a controversial subject. Consequently, women there are also exposed to online misogyny, where people can start campaigns of intimidation and harassment against them for one reason or another.
Fighting online misogyny has become a topic of interest for several Internet players: social media networks such as Facebook and Twitter provide reporting systems that allow users to flag messages expressing misogynistic behavior, and automatic systems can then detect such behavior in users' posts and delete them. For high-resource languages such as English, Spanish, and French, these systems have been shown to perform well. However, for languages such as Arabic, automatic reporting systems are not yet deployed, mainly due to: 1) the lack of annotated data needed to build such systems and 2) the complexity of the Arabic language compared to other languages.
Fine-tuning pre-trained transformer-based language models [devlin-etal-2019-bert] on downstream tasks has shown state-of-the-art (SOTA) performance on various languages, including Arabic [antoun-etal-2020-arabert, abdul-mageed-etal-2021-arbert, el-mekki-etal-2021-domain, el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep]. Although several research works based on pre-trained transformers have been introduced for misogyny detection in Indo-European languages [safi-samghabadi-etal-2020-aggression, FersiniNR20, 9281090], work on the Arabic language remains underexplored [mulki2021letmi].
In this paper, we present our participating system and submissions to the first Arabic Misogyny Identification (ArMI) shared task [armi2021overview]. We introduce three Multi-Task Learning (MTL) models and their single-task counterparts. To embed the input texts, our models employ the pre-trained MARBERT language model [abdul-mageed-etal-2021-arbert]. Moreover, for Task 2, we tackle the class imbalance problem by training our models to minimize the Focal Loss [abs-1708-02002]. The obtained results demonstrate that our three submissions achieved the best performance for both ArMI tasks in comparison to the other participating systems. The results also show that MTL models outperform their single-task counterparts on most evaluation measures. Additionally, the Focal Loss has shown effective performance, especially on F1 measures.
The rest of this paper is organized as follows. In Section 3, we introduce our participating system and the investigated deep learning models. Section 4 presents the conducted experiments and the obtained results. In Section 5, we conclude the paper.
2 Tasks and dataset description
The Arabic Misogyny Identification (ArMI) task consists of the automatic detection of misogyny in Arabic tweets [armi2021overview]. It is composed of two main sub-tasks: the first sub-task is a binary classification task whose objective is to classify whether a tweet is misogynistic or not. In the second sub-task, the objective is to detect the misogynistic behavior expressed in a tweet; it is modeled as a multi-class classification problem with seven misogynistic behaviors (labels). The organizers of the task have provided 7,866 labeled tweets serving both sub-tasks for model training, while 1,966 tweets are used for model testing and evaluation. Figure 1 presents the label distribution of both tasks. It shows that the class labels are imbalanced for both the misogyny identification and categorization tasks.
The provided tweets are expressed mainly in Modern Standard Arabic (MSA), while several tweets are written in Arabic dialects such as Egyptian, Gulf, and Levantine. The Levantine tweets are taken from the Let-Mi misogyny detection dataset [mulki2021letmi]. The rest of the tweets have been scraped from Twitter using hashtags related to the misogyny phenomenon. The provided dataset was manually annotated by Arabic native speakers.
3 System description
We propose three deep Multi-Task Learning (MTL) models based on the pre-trained MARBERT encoder [abdul-mageed-etal-2021-arbert] for the ArMI shared task. We also investigate the single-task versions of the proposed MTL models. The choice of the MARBERT encoder is motivated by the fact that this language model is pre-trained on a corpus of 1B tweets, containing both dialectal Arabic and MSA. Moreover, fine-tuning MARBERT on downstream NLP tasks has shown effective results in many Arabic NLP applications [abdul-mageed-etal-2021-arbert, el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep]. In what follows, we describe each component of our submitted system.
3.1 Tweet preprocessing
The tweet preprocessing component performs emoji extraction, user mention and URL substitution, and hashtag normalization. Following MARBERT's tweet preprocessing guidelines, user mentions and URLs are replaced by the "user" and "url" tokens, respectively. For hashtag normalization, we remove the "#" symbol and replace "_" with white space. It is worth mentioning that diacritics are already removed from the training and test datasets. Based on our preliminary experiments, emojis are not removed from the normalized text; they are also appended after the [SEP] token of the employed encoder. Finally, each tweet is represented using its normalized text and its emojis, as follows:
[CLS] normalized tweet [SEP] emojis [SEP]
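A minimal sketch of this preprocessing pipeline; the regular expressions and the emoji codepoint ranges are our own assumptions, not the exact rules used by the system:

```python
import re

def _is_emoji(ch: str) -> bool:
    # rough codepoint-range check -- an assumption, not the authors' exact rule
    cp = ord(ch)
    return 0x1F000 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF

def preprocess_tweet(text: str):
    """Normalize a tweet as described above and extract its emojis.

    Mentions -> "user", URLs -> "url", "#" removed, "_" -> space.
    Emojis stay in the normalized text and are also returned separately,
    so they can be fed as the second segment: [CLS] text [SEP] emojis [SEP].
    """
    text = re.sub(r"https?://\S+", "url", text)     # URL substitution
    text = re.sub(r"@\w+", "user", text)            # user-mention substitution
    text = text.replace("#", "").replace("_", " ")  # hashtag normalization
    emojis = "".join(ch for ch in text if _is_emoji(ch))
    return text.strip(), emojis
```

With a Hugging Face tokenizer, passing the normalized text and the emoji string as a sentence pair (e.g. `tokenizer(normalized, emojis)`) yields the `[CLS] ... [SEP] ... [SEP]` layout shown above.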
3.2 Deep Learning Models
In this section, we describe the employed MTL models and their single task counterparts. All our models utilize MARBERT encoder to represent the input tweets. The models are described as follows:
MT_CLS uses a classification layer for each task on top of the MARBERT encoder. It relies on the [CLS] token embedding to predict the class label of each task. The single-task version of this model is denoted ST_CLS.
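A minimal sketch of the MT_CLS architecture, assuming an encoder (MARBERT in our case) that returns per-token hidden states; the class name, layer names, and label counts are illustrative:

```python
import torch
import torch.nn as nn

class MultiTaskCLS(nn.Module):
    """MT_CLS sketch: a shared encoder with one classification head per
    task, both applied to the [CLS] token embedding (token position 0)."""

    def __init__(self, encoder, hidden_size, n_labels_t1=2, n_labels_t2=7):
        super().__init__()
        self.encoder = encoder
        self.head_t1 = nn.Linear(hidden_size, n_labels_t1)  # misogynistic or not
        self.head_t2 = nn.Linear(hidden_size, n_labels_t2)  # behavior category

    def forward(self, input_ids, attention_mask):
        # hidden: (batch, seq_len, hidden_size)
        hidden = self.encoder(input_ids, attention_mask)
        cls = hidden[:, 0, :]  # [CLS] embedding
        return self.head_t1(cls), self.head_t2(cls)
```

With the actual `transformers` MARBERT model, `hidden` would be taken from the output's `last_hidden_state` field.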
MT_ATT consists of the MARBERT encoder, two task-specific attention layers, and two classification layers. Each attention layer [Bahdanau2015NeuralMT, yang-etal-2016-hierarchical] extracts task-discriminative features by weighting the output token embeddings of the encoder according to their contribution to the task at hand. Each classification layer is fed with the concatenation of the task attention output and the [CLS] token embedding. This model has shown effective performance in many NLP tasks, including dialect identification, sentiment analysis, and sarcasm detection for the Arabic language [el-mekki-etal-2021-bert, el-mahdaouy-etal-2021-deep], as well as humor detection and rating and lexical complexity prediction in English [essefar-etal-2021-cs, el-mamoun-etal-2021-cs]. The single-task counterpart of MT_ATT is denoted ST_ATT.
MT_VHATT is an extension of the MT_ATT model. In addition to the task-specific attention layers (called horizontal attention layers), it employs vertical attention layers to incorporate the features of the top intermediate layers of the MARBERT encoder for both tasks. This model utilizes six attention layers to extract features from the token embeddings of the top six intermediate layers of the encoder [Bahdanau2015NeuralMT, yang-etal-2016-hierarchical]. Then, another attention layer aggregates the features of the six vertical attention layers. Note that we exclude the top output layer of the encoder, as its features are already used by the horizontal (task-specific) attention layers. Finally, the input of each task's classification layer is the concatenation of the [CLS] token embedding of the encoder's last layer, the task-specific attention output, and the aggregated features of the intermediate layers. The single-task version of this model is denoted ST_VHATT.
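The attention layers above can be sketched as additive attention pooling over token embeddings, in the spirit of [Bahdanau2015NeuralMT, yang-etal-2016-hierarchical]; the class and layer names are illustrative, not the system's exact implementation:

```python
import torch
import torch.nn as nn

class TaskAttention(nn.Module):
    """Additive attention pooling: scores each token embedding, masks out
    padding positions, and returns the weighted sum as a feature vector."""

    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.score = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, token_embeddings, attention_mask):
        # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        u = torch.tanh(self.proj(token_embeddings))
        scores = self.score(u).squeeze(-1)  # (batch, seq_len)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (weights * token_embeddings).sum(dim=1)  # (batch, hidden)
```

In MT_ATT, the output of each task's attention layer is concatenated with the [CLS] embedding before classification; MT_VHATT applies the same kind of layer to the token embeddings of each of the top six intermediate encoder layers.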
For misogyny identification (Task 1), all models are trained to minimize the binary cross-entropy loss. For misogyny categorization (Task 2), we have investigated the Cross-Entropy (CE) loss as well as the Focal Loss (FL) [abs-1708-02002]. The latter is employed to handle the class imbalance problem: it reduces the loss contribution of easy examples and assigns higher importance weights to hard-to-classify examples. The FL is given by:
$$\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)$$

where $t$ denotes the category's label, $\alpha_t$ is the weight of label $t$, and $\gamma$ controls the contribution of high-confidence predictions to the loss. In other words, a higher value of $\gamma$ implies a lower loss contribution from well-classified examples [abs-1708-02002].
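A sketch of the multi-class focal loss in PyTorch, matching the formula above (the function name and signature are ours):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over the batch.

    logits:  (batch, n_classes) raw scores
    targets: (batch,) integer class labels
    alpha:   (n_classes,) per-class weights
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-alpha[targets] * (1.0 - pt) ** gamma * log_pt).mean()
```

With `gamma=0` and uniform `alpha`, this reduces to the standard cross-entropy loss; increasing `gamma` progressively down-weights well-classified examples.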
4 Experiments and results
In this section, we present the experiment settings as well as the obtained results for our development set and the provided test set.
4.1 Experiment settings
All our models are implemented using the PyTorch framework (https://pytorch.org/) and the open-source Transformers library (https://huggingface.co/transformers/). Experiments are performed on a PowerEdge R740 server with a 44-core Intel Xeon Gold 6152 2.1GHz CPU, 384 GB of RAM, and a single Nvidia Tesla V100 GPU with 16GB of RAM. The provided training set is split into a training subset and a development subset. Based on our preliminary results, all models are trained using the Adam optimizer; the learning rate, the number of epochs, the batch size, the $\gamma$ hyper-parameter of the Focal Loss, and the weights of the Task 2 labels are all fixed based on these preliminary results. All models are evaluated using Accuracy as well as macro-averaged Precision, Recall, and F1 measures.
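One multi-task training step under this setup might look as follows; this is a sketch assuming a model that returns one logit tensor per task (names and signature are ours), with the Task 2 focal-loss term inlined:

```python
import torch
import torch.nn.functional as F

def mtl_train_step(model, optimizer, input_ids, mask, y1, y2, alpha, gamma=2.0):
    """One Adam step on the summed losses: binary cross-entropy for Task 1
    and focal loss for Task 2. `model` returns (task1_logits, task2_logits)."""
    model.train()
    optimizer.zero_grad()
    logits1, logits2 = model(input_ids, mask)
    # Task 1: misogynistic or not, with a single logit per tweet
    loss1 = F.binary_cross_entropy_with_logits(logits1.squeeze(-1), y1.float())
    # Task 2: focal loss over the behavior categories
    log_pt = F.log_softmax(logits2, dim=-1).gather(1, y2.unsqueeze(1)).squeeze(1)
    loss2 = (-alpha[y2] * (1.0 - log_pt.exp()) ** gamma * log_pt).mean()
    loss = loss1 + loss2
    loss.backward()
    optimizer.step()
    return loss.item()
```

Summing the two task losses is the simplest MTL objective; per-task loss weights could be added, but the shared-task description does not specify any.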
4.2 Results on the development set
In order to select the best models for our official submissions, we have evaluated the three MTL models and their single-task counterparts. For Task 2, we have investigated both the CE and FL losses. Table 1 presents the results obtained on the development set using the three single-task models. The overall results for Task 1 show that the ST_ATT model outperforms the other models on most evaluation measures. It also achieves the best Recall and F1 measures for Task 2. Moreover, ST_VHATT yields slightly better performance on Task 1 and achieves far better Precision and F1 scores on Task 2 in comparison to the ST_CLS model. Furthermore, FL outperforms the CE loss on most evaluation measures for Task 2, except for the Accuracy and Precision of the ST_CLS model. Table 2 presents the classification report for Task 2 of the ST_ATT model using the CE and FL loss functions. The obtained results show that FL leads to better F1 scores for all categories except the "Discredit" and "Damning" misogynistic behaviors. Indeed, the classification performance on rare categories is improved while the overall performance is maintained.
Table 1 header:
| Model | Task 1: Accuracy | Precision | Recall | F1 | Task 2: Cat. Task Loss | Accuracy | Precision | Recall | F1 |
| Category | CE loss (P / R / F1) | FL loss (P / R / F1) | Support |
| Stereotyping & objectification | 0.6786 / 0.5846 / 0.6281 | 0.6897 / 0.6154 / 0.6504 | 65 |
| Threat of violence | 0.5714 / 0.5217 / 0.5455 | 0.4839 / 0.6522 / 0.5556 | 23 |
Table 3 header:
| Model | Cat. Task Loss | Task 1: Accuracy | Precision | Recall | F1 | Task 2: Accuracy | Precision | Recall | F1 |
Table 3 presents the results obtained on the development set using the three multi-task models for both Task 1 and Task 2. The overall results show that MT_ATT outperforms all other models on both tasks for most evaluation measures. The results also demonstrate that using the FL for Task 2 improves the model's performance on Task 1 in the multi-task setting. In accordance with the results obtained using the single-task models, MT_VHATT shows slightly better performance on Task 1 than the MT_CLS model. The overall results show that multi-task learning models surpass their single-task counterparts on Task 1. This can be explained by the fact that MTL models leverage training signals from both tasks [Caruana94, sun2019ernie20].
4.3 Official submissions results
Based on the results obtained on the development set, we have submitted models that are trained using the FL for misogyny categorization (Task 2). This choice is motivated by the fact that the FL has led to better F1 scores than the CE loss on the dev set (Tables 1 and 3). Our three official submissions are described as follows:
run1: corresponds to the results obtained on both tasks using the single-task model ST_ATT.
run2: corresponds to the results obtained on both tasks using the multi-task model MT_ATT.
run3: corresponds to the results obtained on both tasks using an ensemble of the three multi-task models.
Tables 4 and 5 summarize the official results of the top five systems submitted to Task 1 and Task 2, respectively. The results show that all our submissions are ranked in the top three among all submitted systems. In accordance with our previous results, our multi-task models achieved the first and second ranking positions. Although the ensemble of the three MTL models (run3) yielded the best performance on most evaluation measures for both tasks, the best F1 score for Task 2 was obtained by run2 (the MT_ATT model).
5 Conclusion
In this paper, we have presented our participating system in the first Arabic Misogyny Identification shared task. We have investigated three Multi-Task Learning models and their single-task counterparts built on the pre-trained MARBERT encoder. To deal with the class label imbalance of Task 2, we have employed the Focal Loss. The results show that our three submitted systems are top-ranked among the systems participating in both ArMI tasks. The overall results demonstrate that MTL models outperform their single-task versions in most evaluation scenarios. Besides, the Focal Loss has shown effective performance, especially on F1 measures.