This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
* These two authors contributed equally.
1 Introduction

Nowadays, offensive content has invaded social media and become a serious problem for government organizations, online communities, and social media platforms. Therefore, it is essential to automatically detect and throttle offensive content before it appears on social media. Previous studies have investigated different aspects of offensive language, such as abusive language [22, 21] and hate speech [20, 5].
Recently,  first studied the target of offensive language on Twitter, and  expanded it into a multilingual version, which is practical for studying hate speech concerning a specific target. The task is based on a three-level hierarchical annotation schema that encompasses the following three general sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification.
To tackle this problem, we emphasize that it is crucial to leverage a pre-trained language model (e.g., BERT ) to better understand the meaning of sentences and generate expressive word-level representations, given the inherent data noise (e.g., misspellings, grammatical mistakes) in social media (e.g., Twitter). In addition, we hypothesize that internal connections exist among the three general sub-tasks, and that to improve one task, we can leverage the information of the other two. Therefore, we first generate representations of the input text based on the pre-trained language model BERT, and then conduct multi-task learning on top of these representations.
Experimental results show that leveraging more task information can improve offensive language detection performance. In the OffensEval-2020 competition, our system achieves a 91.51% macro-F1 score on English Sub-task A (ranked 7th out of 85 submissions). Notably, only the OLID  is used to train our model, and no additional data is used. Our code is available at: https://github.com/wenliangdai/multi-task-offensive-language-detection.
2 Related Work
In general, offensive language detection includes some particular sub-types, such as aggression identification , bullying detection , and hate speech identification .  applied concepts from NLP to exploit the lexical syntactic features of sentences for offensive language detection.  integrated textual features with social network features, which significantly improved cyberbullying detection.  and  used convolutional neural networks for hate-speech detection in tweets. Recently,  introduced an offensive language identification dataset, which aims to detect the type and the target of offensive posts in social media.  expanded this dataset into a multilingual version, which advances multilingual research in this area.
3 Datasets

This work involves two datasets: those of OffensEval-2019 and OffensEval-2020, respectively. In this section, we introduce their details and discuss our data pre-processing methods. Table 1 shows the types of labels and how they overlap.
3.1 Offensive Language Identification Dataset (OLID)
The OLID  is a hierarchical dataset for identifying the type and the target of offensive texts in social media. The dataset was collected from Twitter and is publicly available. There are 14,100 tweets in total, of which 13,240 are in the training set and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical. If a tweet is offensive, it can have a target or no target. If it is offensive towards a specific target, the target can be an individual, a group, or some other object. This dataset was used in the OffensEval-2019 competition at SemEval-2019 . The competition contains three sub-tasks, each corresponding to recognizing one level of labels in the dataset.
3.2 Multilingual Offensive Language Identification Dataset (MOLID)
A multilingual offensive language detection dataset  was proposed in the OffensEval-2020 competition at SemEval-2020. It contains five languages: Arabic, Danish, English, Greek, and Turkish. The English data, similar to OLID , still has three levels, but this time only confidence scores generated by different models are provided rather than human-annotated labels. In addition, the data in level A is separate from that in levels B and C: level A contains 9,089,140 tweets, while levels B and C contain a different set of 188,973 tweets. The remaining languages only have data in level A, but with human-annotated labels.
3.3 Data Pre-processing
Data pre-processing is crucial to this task as the data from Twitter is noisy and sometimes disordered. Moreover, people tend to use more Emojis and hashtags on Twitter, which are unusual in other situations.
Emoji to word.
We convert all emojis to words with the corresponding semantic meanings. For example, the 👍 emoji is converted to thumbs up. We achieve this by first utilizing a third-party Python library (https://github.com/carpedm20/emoji), and then removing the useless punctuation in the output.
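A minimal sketch of this step. A tiny hand-made mapping stands in for the third-party library; `EMOJI_MAP` and `emoji_to_words` are illustrative names, not the library's API:

```python
# Simplified stand-in for the emoji library's demojize step: map each glyph to
# its ':name:' form, then strip the punctuation so only plain words remain.
EMOJI_MAP = {
    "\U0001F44D": ":thumbs_up:",                # 👍
    "\U0001F602": ":face_with_tears_of_joy:",   # 😂
}

def emoji_to_words(text: str) -> str:
    """Replace each known emoji with its textual name, minus punctuation."""
    for glyph, name in EMOJI_MAP.items():
        words = name.strip(":").replace("_", " ")  # ':thumbs_up:' -> 'thumbs up'
        text = text.replace(glyph, words)
    return text
```

For instance, `emoji_to_words("nice \U0001F44D")` yields `"nice thumbs up"`.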
Hashtag segmentation.
All hashtags in the tweets are segmented by recognizing the capital characters. For example, #KeithEllisonAbuse is transformed to keith ellison abuse. This is also achieved with a third-party Python library (https://github.com/grantjenks/python-wordsegment).
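The capital-letter splitting described above can be sketched with a small regex (the library handles harder cases such as all-lowercase hashtags; this sketch only covers camel-cased ones):

```python
import re

def segment_hashtag(tag: str) -> str:
    """Split a camel-cased hashtag on capital letters and lowercase it.
    '#KeithEllisonAbuse' -> 'keith ellison abuse'. All-caps or all-lowercase
    hashtags would need a dictionary-based segmenter like python-wordsegment."""
    body = tag.lstrip("#")
    parts = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", body)
    return " ".join(p.lower() for p in parts)
```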
User mention replacement.
After reviewing the dataset, we find that the token @USER appears very frequently (a single tweet can contain several of them), which is a typical phenomenon in tweets. Therefore, for tweets with more than one @USER token, we replace all of them with a single @USERS token. In this way, we remove redundant words while keeping the key information, which is useful for recognizing targets if there are any.
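A possible implementation of this replacement, assuming the collapsed @USERS token takes the position of the first mention (the paper does not specify where the single token is placed):

```python
import re

def collapse_user_mentions(text: str) -> str:
    """If a tweet has more than one @USER token, keep a single @USERS token
    at the position of the first mention and drop the rest."""
    if len(re.findall(r"@USER\b", text)) <= 1:
        return text
    # First occurrence becomes @USERS; remaining occurrences are removed.
    text = re.sub(r"@USER\b", "@USERS", text, count=1)
    return re.sub(r"\s*@USER\b", "", text)
```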
Rare word substitution.
We substitute some out-of-vocabulary (OOV) words with their synonyms. For example, every URL is replaced with a special token, http.
Truncation.
We truncate all tweets to a maximum length of 64 tokens. Although this discards some information in the data, it lowers GPU memory usage and slightly improves performance.
4 Methodology

We propose a Multi-Task Learning (MTL) method (Figure 1(b)) for this offensive language detection task. It takes advantage of the nature of the OLID , and achieves an excellent result, comparable to state-of-the-art performance, using only the OLID  and no external data resources. A thorough analysis is provided in Section 5.2 to explain the reasons for not using the new multilingual dataset created in OffensEval-2020 .
4.1 Task Description
OffensEval-2020  is a task organized at the SemEval-2020 workshop. As mentioned in Section 3.2, it provides a multilingual offensive language detection dataset that contains five different languages. It has three sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification. In this paper, we mainly focus on Sub-task A of the English data .
4.2 Baseline

We re-implement the model of the best-performing team  in OffensEval-2019  as our baseline. As illustrated in Figure 1(a),  fine-tuned the pre-trained model BERT  by adding a linear layer on top of it.
Bidirectional Encoder Representations from Transformers (BERT)  is a large-scale masked language model based on the encoder of the Transformer model . It is pre-trained on the BookCorpus  and English Wikipedia using two unsupervised tasks: (a) Masked Language Modelling (MLM) and (b) Next Sentence Prediction (NSP). In MLM, 15% of the input tokens are masked, and the model is trained to recover them at the output. In NSP, two sentences are fed into the model, which is trained to predict whether the second sentence is the actual next sentence of the first. As shown in , by fine-tuning, BERT achieves superior results on many downstream NLP tasks.
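The MLM masking scheme can be sketched as follows. This is an illustrative stand-alone version (BERT additionally applies an 80/10/10 rule to selected positions: 80% become [MASK], 10% a random token, 10% stay unchanged):

```python
import random

def mask_for_mlm(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Select ~p of the positions for prediction; of those, 80% are replaced
    by [MASK], 10% by a random token, and 10% are left unchanged. Returns the
    corrupted sequence plus per-position targets (None = no loss computed)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < p:
            targets.append(tok)  # training target: the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)
            elif roll < 0.9:
                masked.append(rng.choice(tokens))  # random replacement
            else:
                masked.append(tok)                 # kept as-is
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets
```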
4.3 Multi-task Offense Detection Model
In recent years, the multi-task learning (MTL) technique has been used in many machine learning fields to improve the performance and generalization ability of models [12, 19, 13, 8, 17, 4]. Generally, MTL has three advantages. Firstly, with multiple supervision signals, it can improve the quality of representation learning, because a good representation should perform well on more tasks. Secondly, MTL can help the model generalize better, because multiple tasks introduce more noise and prevent the model from over-fitting. Thirdly, features that are hard to learn from one task are sometimes easier to learn from another. MTL provides complementary supervision to a task and makes it possible to eavesdrop on other tasks and obtain more information.
As described in Section 3.1, the three levels of labels in OLID are hierarchical, and they are designed to be inclusive from top to bottom. This makes it possible for one sub-task to eavesdrop information from the other tasks. For example, if a tweet is labelled as Targeted in sub-task B, then it must be classified as Offensive in sub-task A.
Our MTL architecture is shown in Figure 1(b). The bottom part is a BERT model, which is shared among all three sub-tasks. The upper parts are three separate modules dedicated to each sub-task; each module contains a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells. The input is first fed into the shared BERT; then, each sub-task module takes the contextualized embeddings generated by BERT and produces a probability distribution over its own target labels. The overall loss L is calculated by L = λ_A·L_A + λ_B·L_B + λ_C·L_C, where λ_t is the loss weight for the task-specific cross-entropy loss L_t (t ∈ {A, B, C}) and λ_A + λ_B + λ_C = 1. The loss weights are chosen by cross-validation.
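The weighted combination of task losses can be sketched in plain Python (the cross-entropy helper and the example weights here are illustrative; the actual model computes each task loss with PyTorch's cross-entropy over the LSTM-head logits):

```python
import math

def cross_entropy(probs, gold):
    """Cross-entropy of one example: negative log-probability of the gold label."""
    return -math.log(probs[gold])

def overall_loss(task_losses, weights):
    """Overall MTL loss: weighted sum of per-task losses, with weights
    summing to 1 (e.g. 0.4/0.3/0.3 for sub-tasks A/B/C)."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * l for w, l in zip(weights, task_losses))
```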
5 Experiments

During the training phase, we evaluate our models on the test set of OLID (OffensEval-2019). As a reference, we also evaluate them on the test set of MOLID (OffensEval-2020), which was only released after the submission date.
5.1 Experimental Settings
To find the optimal architecture for this task among our models, we set up five different experiments. For the first two, we train our baseline model on OLID and MOLID separately. As MOLID's labels are AVG_CONF scores between 0 and 1 rather than binary classes, we set a threshold of 0.3, based on statistical analysis, to convert MOLID into a classification dataset. Then, we set up an experiment that pre-trains the baseline model on MOLID and fine-tunes it on OLID, using the pre-training strategy described in Section 5.2. Finally, we train our Multi-task Offense Detection Model only on OLID and tune the hyper-parameters based on Sub-task A. To further improve the generalization performance of our method, we ensemble five MTL models with different random seeds and generate the final results through majority voting.
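The majority-voting ensemble step can be sketched as follows (with five models, ties cannot occur for binary labels; `Counter.most_common` would otherwise break ties by first occurrence):

```python
from collections import Counter

def majority_vote(model_predictions):
    """Combine per-example labels from several models by majority voting.
    `model_predictions` is a list of prediction lists, one list per model."""
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*model_predictions)]
```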
To evaluate the performance of each model, we use macro-F1, computed as the simple arithmetic mean of the per-class F1-scores. Since OLID released its test set last year, we use it as our validation set and optimize the hyper-parameters manually over successive runs. For our best MTL model, we set the learning rate to 3e-6 and the batch size to 32; the loss weights of sub-tasks A, B, and C are 0.4, 0.3, and 0.3, respectively. We train the model for a maximum of 20 epochs and employ an early-stopping strategy that halts training if the validation macro-F1 does not increase for three consecutive epochs. Our code is implemented in PyTorch, and all experiments are run on a single GTX 1080Ti.
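The macro-F1 metric described above can be computed as follows (equivalent to scikit-learn's `f1_score(..., average="macro")`, written out here for clarity):

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: unweighted mean of per-class F1 scores, so minority
    classes count as much as majority ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```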
5.2 Result Analysis
The results in Table 2 show the macro-F1 scores on the OLID and MOLID test sets; they are consistent except for the model with pre-training. Our ensembled MTL model achieves the best performance on both test sets.
Pre-train vs. No pre-train on MOLID.
The MOLID  contains more than 9 million samples with AVG_CONF scores. To make full use of this dataset, we adopt a pre-training strategy in which the model is pre-trained on MOLID and then fine-tuned on the OLID . To pre-train the model on MOLID, we regard Sub-task A as a regression problem based on the AVG_CONF score. Instead of setting a threshold to divide the data into two classes (OFF, NOT), we directly apply a Mean Squared Error (MSE) loss on AVG_CONF. However, our results show that this pre-training makes little difference. We believe this is because MOLID contains a lot of noisy data, which is also the reason why the baseline model trained on MOLID is much worse than the one trained on OLID.
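The two views of the AVG_CONF score can be sketched side by side; `conf_to_label` and `mse_loss` are illustrative names, and 0.3 is the threshold chosen in Section 5.1:

```python
def conf_to_label(avg_conf, threshold=0.3):
    """Classification view: binarize AVG_CONF at a chosen threshold."""
    return "OFF" if avg_conf >= threshold else "NOT"

def mse_loss(predicted, target):
    """Regression view used for pre-training: mean squared error between
    predicted scores and the AVG_CONF targets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```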
BERT and Multi-Task Learning.
From the results, we find that incorporating BERT and multi-task learning substantially improves the macro-F1 score of Sub-task A. This can be attributed to two reasons. Firstly, the BERT model is pre-trained on a huge corpus, which helps it produce more meaningful representations of the input text; meanwhile, its large model size increases the learning capacity for the task. Secondly, given the large capacity of BERT, through multi-task learning, sub-task A can obtain more information from the shared parts of the model and become more certain in some cases. For example, if the label of sub-task B is NULL, then the label of sub-task A must be NOT; if the label of sub-task B is TIN or UNT, then the label of sub-task A must be OFF.
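The label constraint described above can be made explicit as a small helper (illustrative only; in the model this information flows implicitly through the shared BERT encoder rather than through a hard rule):

```python
def infer_task_a_from_b(label_b):
    """Sub-task B's label constrains sub-task A under OLID's hierarchy:
    NULL implies NOT, while TIN (targeted insult) or UNT (untargeted)
    imply OFF."""
    if label_b == "NULL":
        return "NOT"
    if label_b in ("TIN", "UNT"):
        return "OFF"
    return None  # unknown label: no constraint
```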
6 Conclusion and Future Work
From all of our experiments, we conclude that MTL improves the performance of Sub-task A on both OLID and MOLID. Moreover, our findings show that pre-training Sub-task A as a regression task does not improve the model's performance. We see several paths for future work. Firstly, more combinations of the sub-tasks can be investigated for MTL. This would reveal more about the interaction between sub-tasks and how much one influences another. Secondly, as mentioned in , simultaneously updating the model's parameters during MTL can have negative effects on optimization, as the total gradients are too noisy. This becomes more significant when the number of tasks is large or the batch size is small. As a result, asynchronous optimization for each task may provide a more stable gradient descent.
References

- TwitterBERT: framework for Twitter sentiment analysis based on pre-trained language model representations. In International Conference of Reliable Information and Communication Technology, pp. 428–437.
- SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3606–3611.
- (2012) Detecting offensive language in social media to protect adolescent online safety. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pp. 71–80.
- (2019) Modelling the interplay of metaphor and emotion through multitask learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2218–2229.
- (2017) Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2017) Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pp. 85–90.
- DensePose: dense human pose estimation in the wild. CoRR abs/1802.00434.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339.
- (2014) Cyber bullying detection using social and textual analysis. In Proceedings of the 3rd International Workshop on Socially-Aware Multimedia, pp. 3–6.
- (2011) Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning, ICML'11, USA, pp. 521–528.
- (2016) UberNet: training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CoRR abs/1609.02132.
- (2018) Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pp. 1–11.
- (2019) Team yeon-zi at SemEval-2019 task 4: hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 1052–1056.
- NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 87–91.
- (2018) PAD-Net: a perception-aided single image dehazing network. CoRR abs/1805.03146.
- (2019) Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. arXiv preprint arXiv:1911.09273.
- (2015) Learning multiple tasks with deep relationship networks. CoRR abs/1506.02117.
- (2017) Detecting hate speech in social media. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pp. 467–472.
- (2017) Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pp. 52–56.
- (2016) Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pp. 145–153.
- (2017) One-step and two-step classification for abusive language detection on Twitter. arXiv preprint arXiv:1706.01206.
- (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
- (2020) A large-scale semi-supervised dataset for offensive language identification. arXiv preprint.
- (2019) Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 203–211.
- (2017) Attention is all you need. CoRR abs/1706.03762.
- (2019) Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1415–1420.
- (2019) SemEval-2019 task 6: identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 75–86.
- (2019) Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666.
- (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of SemEval.
- (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. CoRR abs/1506.06724.