Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection

04/28/2020 ∙ by Wenliang Dai, et al. ∙ The Hong Kong University of Science and Technology

Nowadays, offensive content in social media has become a serious problem, and automatically detecting offensive language is an essential task. In this paper, we build an offensive language detection system, which combines multi-task learning with BERT-based models. Using a pre-trained language model such as BERT, we can effectively learn the representations for noisy text in social media. In addition, to boost the performance of offensive language detection, we leverage the supervision signals from other related tasks. In the OffensEval-2020 competition, our model achieves a 91.51% F1 score in English Sub-task A, which is comparable to the first place (92.23% F1). An analysis is provided to explain the effectiveness of our approaches.




1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

* These two authors contributed equally.

Nowadays, offensive content has invaded social media and has become a serious problem for government organizations, online communities, and social media platforms. Therefore, it is essential to automatically detect and throttle offensive content before it appears on social media. Previous studies have investigated different aspects of offensive language, such as abusive language [22, 21] and hate speech [20, 5].

Recently, [30] first studied the target of offensive language on Twitter, and [31] expanded it into a multilingual version, which is practical for studying hate speech concerning a specific target. The task is based on a three-level hierarchical annotation schema that encompasses the following three general sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification.

To tackle this problem, we emphasize that it is crucial to leverage a pre-trained language model (e.g., BERT [6]) to better understand the meaning of sentences and generate expressive word-level representations, because of the inherent data noise (e.g., misspellings, grammatical mistakes) in social media (e.g., Twitter). In addition, we hypothesize that internal connections exist among the three general sub-tasks, so that to improve one task we can leverage the information of the other two. Therefore, we first generate the representations of the input text based on the pre-trained language model BERT, and then we conduct multi-task learning based on these representations.

Experimental results show that leveraging more task information can improve the offensive language detection performance. In the OffensEval-2020 competition, our system achieves a 91.51% macro-F1 score in English Sub-task A (ranked 7th out of 85 submissions). Notably, only the OLID [30] is used to train our model, and no additional data is used. Our code is available at: https://github.com/wenliangdai/multi-task-offensive-language-detection.

2 Related Works

In general, offensive language detection includes some particular types, such as aggression identification [14], bullying detection [11] and hate speech identification [23]. [3] applied concepts from NLP to exploit the lexical syntactic features of sentences for offensive language detection. [11] integrated the textual features with social network features, which significantly improved cyberbullying detection. [23] and [7] used convolutional neural networks for hate-speech detection in tweets. Recently, [30] introduced an offensive language identification dataset, which aims to detect the type and the target of offensive posts in social media. [31] expanded this dataset into a multilingual version, which advances multilingual research in this area.

Pre-trained language models, such as ELMo [24] and BERT [6], have achieved great performance on a variety of tasks. Many recent papers have used a basic recipe of fine-tuning such pre-trained models on a certain domain [1, 15, 2] or on downstream tasks [10, 18, 26].

3 Datasets

In this project, two datasets are involved: the datasets of OffensEval-2019 and OffensEval-2020. In this section, we introduce them in detail and discuss our data pre-processing methods. Table 1 shows the types of labels and how they overlap.

Table 1: Labels of three subtasks.

3.1 Offensive Language Identification Dataset (OLID)

The OLID [28] is a hierarchical dataset for identifying the type and the target of offensive texts in social media. The dataset is collected from Twitter and is publicly available. There are 14,100 tweets in total, of which 13,240 are in the training set and 860 are in the test set. For each tweet, there are three levels of labels: (A) Offensive/Not-Offensive, (B) Targeted-Insult/Untargeted, (C) Individual/Group/Other. The relationship between them is hierarchical. If a tweet is offensive, it can have a target or no target. If it is offensive to a specific target, the target can be an individual, a group, or some other object. This dataset was used in the OffensEval-2019 competition in SemEval-2019 [29]. The competition contains three sub-tasks, each corresponding to recognizing one level of label in the dataset.

3.2 Multilingual Offensive Language Identification Dataset (MOLID)

A multilingual offensive language detection dataset [31] is proposed in the OffensEval-2020 competition in SemEval-2020. It contains five languages: Arabic, Danish, English, Greek, and Turkish. For the English data, similar to OLID [28], there are still three levels, but this time only confidence scores, generated by different models, are provided rather than human-annotated labels. In addition, the data in level A is separate from the data in levels B and C. Level A contains 9,089,140 tweets, while levels B and C contain a different set of 188,973 tweets. The remaining languages only have data in level A, but with human-annotated labels.

3.3 Data Pre-processing

Data pre-processing is crucial to this task as the data from Twitter is noisy and sometimes disordered. Moreover, people tend to use more Emojis and hashtags on Twitter, which are unusual in other situations.

Firstly, all characters are converted to lowercase, and the spaces at ends are stripped. Then, inspired by [29, 16], we further process the dataset in five specific aspects:

Emoji to word.

We convert all emojis to words with corresponding semantic meanings. For example, 👍 is converted to thumbs up. We achieve this by first utilizing a third-party Python library (https://github.com/carpedm20/emoji), and then removing useless punctuation from its output.

Hashtag segmentation.

All hashtags in the tweets are segmented by recognizing the capital characters. For example, #KeithEllisonAbuse is transformed to keith ellison abuse. This is also achieved by using a third-party Python library (https://github.com/grantjenks/python-wordsegment).

User mention replacement.

After reviewing the dataset, we find that the token @USER appears very frequently (a single tweet can contain several of them), which is a typical phenomenon in tweets. As a result, for tweets with more than one @USER token, we replace all of them with a single @USERS token. In this way, we remove the redundant words while keeping the key information, which is useful for recognizing targets if there are any.

Rare word substitution.

We substitute some out-of-vocabulary (OOV) words with their synonyms. For example, every URL is replaced with a special token, http.


Truncation.

We truncate all tweets to a maximum length of 64. Although this discards some information in the data, it lowers the GPU memory usage and slightly improves the performance.
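The pre-processing steps above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: `EMOJI_WORDS` is a hypothetical two-entry stand-in for the emoji library, hashtags are split with a regular expression on capital letters instead of the wordsegment library, the single @USERS token is placed at the position of the first mention, and the maximum length is counted in whitespace tokens. All of these are assumptions. The sketch also lowercases last, so that capital letters are still available when hashtags are segmented.

```python
import re

# Hypothetical stand-in for the emoji-to-text mapping of a real emoji library.
EMOJI_WORDS = {"\U0001F44D": "thumbs up", "\U0001F602": "face with tears of joy"}


def segment_hashtag(match):
    """Split the body of a hashtag on capital letters (assumed heuristic)."""
    body = match.group(1)
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", body)
    return " ".join(words) if words else body


def preprocess(tweet, max_len=64):
    # Emoji to word.
    for symbol, words in EMOJI_WORDS.items():
        tweet = tweet.replace(symbol, f" {words} ")
    # Hashtag segmentation, done before lowercasing so capitals are visible.
    tweet = re.sub(r"#(\w+)", segment_hashtag, tweet)
    # User mention replacement: several @USER tokens collapse into one @USERS.
    if len(re.findall(r"@USER\b", tweet)) > 1:
        tweet = re.sub(r"@USER\b", "@USERS", tweet, count=1)
        tweet = re.sub(r"@USER\b", "", tweet)
    # Rare word substitution: raw links and a URL placeholder become "http".
    tweet = re.sub(r"https?://\S+|\bURL\b", "http", tweet)
    # Lowercase, strip surrounding spaces, and truncate to max_len tokens.
    tokens = tweet.lower().split()
    return " ".join(tokens[:max_len])


example = "@USER @USER #KeithEllisonAbuse is real \U0001F44D https://t.co/x"
print(preprocess(example))
# prints: @users keith ellison abuse is real thumbs up http
```

Because lowercasing is applied to the whole string, the @USERS token also ends up in lowercase here; whether the original pipeline keeps it in uppercase is not stated in the text.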

Figure 1: (a) BERT baseline model. (b) Our MTL model. The bottom is a shared BERT, and the upper parts are separate modules dedicated for each sub-task.

4 Methodology

We propose a Multi-Task Learning (MTL) method (Figure 1(b)) for this offensive language detection task. It takes good advantage of the nature of the OLID [28], and achieves an excellent result comparable to state-of-the-art performance using only the OLID [28] and no external data resources. A thorough analysis is provided in Section 5.2 to explain the reasons for not using the new multilingual dataset created in OffensEval-2020 [31].

4.1 Task Description

OffensEval-2020 [31] is a task organized at the SemEval-2020 Workshop. As mentioned in Section 3.2, it proposes a multilingual offensive language detection dataset which contains five different languages. It has three sub-tasks: (A) Offensive Language Detection; (B) Categorization of Offensive Language; (C) Offensive Language Target Identification. In this paper, we mainly focus on sub-task A of the English data [25].

4.2 Baseline

We re-implement the model of the best performing team [16] in OffensEval-2019 [29] as our baseline. As illustrated in Figure 1(a), [16] fine-tuned the pre-trained model, BERT [6], by adding a linear layer on top of it.


Bidirectional Encoder Representations from Transformers (BERT) [6] is a large-scale masked language model based on the encoder of the Transformer model [27]. It is pre-trained on the BookCorpus [32] and English Wikipedia datasets using two unsupervised tasks: (a) Masked Language Model (MLM) and (b) Next Sentence Prediction (NSP). In MLM, 15% of the input tokens are masked, and the model is trained to recover them at the output. In NSP, two sentences are fed into the model and it is trained to predict whether the second sentence is the actual next sentence of the first one. As shown in [6], by fine-tuning, BERT achieves superior results on many NLP downstream tasks.
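As a rough illustration of the MLM objective described above, the following sketch hides 15% of the tokens in a sequence and records the original tokens as prediction targets. It is a simplification under stated assumptions: the function name `mask_tokens` is hypothetical, exactly 15% of positions are chosen instead of sampling each token independently, and every chosen token is replaced by `[MASK]`, whereas BERT replaces only 80% of the chosen tokens with the mask token, swaps 10% for random tokens, and leaves 10% unchanged.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a fraction of tokens with [MASK] and return (inputs, targets)."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    # Inputs: the masked sequence fed to the model.
    inputs = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    # Targets: the original token at masked positions, None elsewhere.
    targets = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, targets


tokens = [f"w{i}" for i in range(20)]
inputs, targets = mask_tokens(tokens)
print(inputs.count("[MASK]"))  # prints: 3
```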

4.3 Multi-task Offense Detection Model

In recent years, multi-task learning (MTL) techniques have been used in many machine learning fields to improve the performance and generalization ability of a model [12, 19, 13, 8, 17, 4]. Generally, MTL has three advantages. Firstly, with multiple supervision signals, it can improve the quality of representation learning, because a good representation should perform well on more tasks. Secondly, MTL can help the model generalize better because multiple tasks introduce more noise and prevent the model from over-fitting. Thirdly, some features are hard to learn from one task but easier to learn from another. MTL provides complementary supervision to a task and makes it possible to eavesdrop on other tasks and get more information.

For this task, MTL is a very effective strategy. As mentioned in Section 3.1 and shown in Table 1, the three labels in OLID are hierarchical and are designed to be inclusive from top to bottom. This makes it possible for one sub-task to eavesdrop on information from the other tasks. For example, if a tweet is labelled as Targeted in sub-task B, then it must be classified as Offensive in sub-task A.
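The hierarchy between the three label levels can be written down as a small consistency check. This is only a sketch: the short label names IND, GRP and OTH for Individual, Group and Other, and the use of NULL at level C for tweets without a target, are assumptions, and the function names are hypothetical.

```python
def is_consistent(label_a, label_b, label_c):
    """Check that a (level A, level B, level C) triple respects the hierarchy."""
    if label_a == "NOT":
        # Non-offensive tweets carry no category and no target.
        return label_b == "NULL" and label_c == "NULL"
    if label_a == "OFF":
        if label_b == "UNT":
            # Untargeted offensive tweets have no target label.
            return label_c == "NULL"
        if label_b == "TIN":
            # Targeted insults must name a kind of target.
            return label_c in {"IND", "GRP", "OTH"}
    return False


def infer_a_from_b(label_b):
    """Level A label implied by a level B label."""
    return "NOT" if label_b == "NULL" else "OFF"


print(is_consistent("OFF", "TIN", "GRP"))  # prints: True
print(is_consistent("NOT", "TIN", "GRP"))  # prints: False
print(infer_a_from_b("UNT"))               # prints: OFF
```

The second helper is the kind of information one sub-task can pick up from another: any non-NULL level B label already determines the level A label.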

Our MTL architecture is shown in Figure 1(b). The bottom part is a BERT model, which is shared among all three sub-tasks. The upper parts are three separate modules, one dedicated to each sub-task; each module contains a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cells [9]. The input $X$ is first fed into the shared BERT, then each sub-task module takes the contextualized embeddings generated by BERT and produces a probability distribution over its own target labels. The overall loss $L$ is calculated by $L = \sum_{i \in \{A, B, C\}} w_i L_i$. Here, $i \in \{A, B, C\}$ indexes the sub-tasks and $w_i$ is the loss weight for the task-specific Cross-Entropy loss $L_i$, where $\sum_{i} w_i = 1$. The loss weights are chosen by cross validation.
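The weighted combination of the three task-specific Cross-Entropy losses can be illustrated for a single tweet with the standard library alone. This is a sketch, not the training code: the function names, the example logits, and the number of classes per sub-task (including a NULL class for sub-tasks B and C) are assumptions, while the weights 0.4, 0.3 and 0.3 are the ones reported in Section 5.1.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def multitask_loss(logits, targets, weights):
    """Weighted sum of per-task cross-entropy losses; weights must sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("loss weights must sum to 1")
    return sum(
        weight * cross_entropy(logits[task], targets[task])
        for task, weight in weights.items()
    )


weights = {"A": 0.4, "B": 0.3, "C": 0.3}
# Hypothetical outputs of the three task-specific modules for one tweet.
logits = {"A": [2.0, -1.0], "B": [0.1, 1.5, -0.3], "C": [0.0, 0.2, 1.1, -0.5]}
targets = {"A": 0, "B": 1, "C": 2}
loss = multitask_loss(logits, targets, weights)
```

In a real training loop the same weighted sum would be applied to batch-averaged losses, so that one backward pass updates both the shared BERT and the three task modules.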

5 Experiments

During the training phase, we evaluate our models on the test set of OLID (OffensEval-2019). As a reference, we also evaluate them on the test set of MOLID (OffensEval-2020), which was only released after the submission date.

5.1 Experimental Settings

To find the optimal architecture for this task among the models we have, we set up five different experiments. For the first two, we train our baseline model on OLID and MOLID separately. As MOLID's labels are AVG_CONF scores between 0 and 1 rather than binary classes, we set the threshold to 0.3 based on statistical analysis to convert MOLID to a classification dataset. Then, we set up an experiment that pre-trains the baseline model on MOLID and fine-tunes it on OLID, using the pre-training strategy described in Section 5.2. Finally, we train our Multi-task Offense Detection Model only on OLID and tune the hyper-parameters based on Sub-task A. To further improve the generalization performance of our method, we ensemble five MTL models trained with different random seeds and generate the final results through majority voting.
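Majority voting over the five models can be sketched as follows. The function name and the example predictions are hypothetical; with five voters and two classes a tie cannot occur, so no tie-breaking rule is needed in this setting.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote."""
    voted = []
    # zip(*...) groups the predictions of all models for the same tweet.
    for votes in zip(*predictions_per_model):
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Hypothetical Sub-task A predictions of five models on three tweets.
runs = [
    ["OFF", "NOT", "OFF"],
    ["OFF", "NOT", "NOT"],
    ["NOT", "NOT", "OFF"],
    ["OFF", "OFF", "NOT"],
    ["OFF", "NOT", "NOT"],
]
print(majority_vote(runs))  # prints: ['OFF', 'NOT', 'NOT']
```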

To evaluate the performance of each model, we use macro-F1, which is computed as a simple arithmetic mean of the per-class F1-scores. Since OLID released its test set last year, we use this test set as our validation set and optimize the hyper-parameters manually over successive runs on it. For our best MTL model, we set the learning rate to 3e-6 and the batch size to 32; the loss weights of sub-tasks A, B and C are 0.4, 0.3 and 0.3 respectively. We train the model for a maximum of 20 epochs and use an early stopping strategy that stops training if the validation macro-F1 does not increase for three consecutive epochs. Our code is implemented in PyTorch and all experiments are run on a single GTX 1080Ti.
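The evaluation metric and the early stopping rule can be made concrete with a short sketch. Both function names are hypothetical, and the per-epoch validation scores in the example are invented purely to show the stopping behaviour.

```python
def macro_f1(y_true, y_pred, labels=("NOT", "OFF")):
    """Arithmetic mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def stopping_epoch(val_scores, patience=3):
    """1-based epoch at which training stops under the early stopping rule."""
    best, stale = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores)


y_true = ["OFF", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)  # F1(NOT) = 0.8, F1(OFF) = 2/3
print(stopping_epoch([0.80, 0.82, 0.81, 0.82, 0.80]))  # prints: 5
```

In the example the best score is reached at epoch 2 and is not exceeded in epochs 3 to 5, so training stops after epoch 5.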

5.2 Result Analysis

The results in Table 2 show the macro-F1 scores on the test sets of OLID and MOLID; they are consistent except for the model with pre-training. Our ensembled MTL model achieves the best performance on both test sets.

Pre-train vs. No pre-train on MOLID.

The MOLID [31] contains more than 9 million samples with AVG_CONF scores. To make full use of the dataset, we adopt a pre-training strategy in which the model is pre-trained on MOLID and then fine-tuned on the Offensive Language Identification Dataset (OLID) [28]. To pre-train the model on MOLID, we treat Sub-task A as a regression problem based on the AVG_CONF score. Instead of setting a threshold to divide the data into two classes (OFF, NOT), we directly apply the Mean Squared Error (MSE) loss function to AVG_CONF. However, our results show that pre-training makes little difference. We believe this is because the MOLID contains a lot of noisy data, which is also the reason why the baseline model trained on MOLID performs much worse on the OLID test set than the one trained on OLID (0.7280 vs. 0.8203 macro-F1).
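The two ways of using the AVG_CONF scores, thresholding them into classes or regressing on them directly, can be contrasted in a few lines. This is a sketch with hypothetical function names; treating scores strictly above the threshold as OFF is an assumption, since the text only gives the threshold value of 0.3.

```python
def to_binary_label(avg_conf, threshold=0.3):
    """Classification view: turn a confidence score into an OFF/NOT label."""
    # Assumption: scores strictly above the threshold count as offensive.
    return "OFF" if avg_conf > threshold else "NOT"


def mse_loss(predictions, avg_conf_scores):
    """Regression view: mean squared error against the raw AVG_CONF scores."""
    squared = [(p - s) ** 2 for p, s in zip(predictions, avg_conf_scores)]
    return sum(squared) / len(squared)


print(to_binary_label(0.45))                # prints: OFF
print(to_binary_label(0.10))                # prints: NOT
print(mse_loss([0.5, 0.0], [0.0, 0.0]))     # prints: 0.125
```

The regression view keeps the graded confidence information that the threshold discards, which is the motivation given for using MSE during pre-training.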

BERT and Multi-Task Learning

From the results, we find that combining BERT with multi-task learning improves the macro-F1 score of Sub-task A (from 0.8203 to 0.8341 on the OLID test set for a single model). This can be attributed to two reasons. Firstly, the BERT model is pre-trained on a huge corpus, which helps to produce more meaningful representations of the input text. Meanwhile, the large model size increases the learning ability for the task. Secondly, with the large capacity of BERT and through multi-task learning, sub-task A can get more information from the shared part of the model, which makes it more certain in some cases. For example, if the label of sub-task B is NULL, then the label of sub-task A must be NOT. If the label of sub-task B is TIN or UNT, then the label of sub-task A must be OFF.

| Model                             | F1 - OLID | F1 - MOLID |
|-----------------------------------|-----------|------------|
| BERT (OLID)                       | 0.8203    | 0.9088     |
| BERT (MOLID)                      | 0.7280    | 0.9060     |
| BERT (Pre-trained with MSE loss)  | 0.8138    | 0.9107     |
| BERT + MTL                        | 0.8341    | 0.9139     |
| BERT + MTL (Ensemble)             | 0.8382    | 0.9151     |

Table 2: Experimental results on sub-task A. The evaluation metric is the macro F1 score, which is the official metric in OffensEval-2020.

6 Conclusion and Future work

From all of our experiments, we conclude that MTL improves the performance of Sub-task A on both OLID and MOLID. Moreover, our findings show that pre-training Sub-task A as a regression task does not improve the model's performance. We think that there are several paths for further work. Firstly, the combination of the sub-tasks in MTL can be studied further. This can show us more about the interaction between sub-tasks, and how much one influences another. Secondly, as mentioned in [13], simultaneously updating the model's parameters during MTL can have negative effects on optimization as the total gradients are too noisy. This becomes more significant when the number of tasks is large or the batch size is small. As a result, asynchronous optimization for each task may provide a more stable gradient descent.


References

  • [1] N. Azzouza, K. Akli-Astouati, and R. Ibrahim (2019) TwitterBERT: framework for twitter sentiment analysis based on pre-trained language model representations. In International Conference of Reliable Information and Communication Technology, pp. 428–437.
  • [2] I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3606–3611.
  • [3] Y. Chen, Y. Zhou, S. Zhu, and H. Xu (2012) Detecting offensive language in social media to protect adolescent online safety. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pp. 71–80.
  • [4] V. Dankers, M. Rei, M. Lewis, and E. Shutova (2019) Modelling the interplay of metaphor and emotion through multitask learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2218–2229.
  • [5] T. Davidson, D. Warmsley, M. Macy, and I. Weber (2017) Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [7] B. Gambäck and U. K. Sikdar (2017) Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pp. 85–90.
  • [8] R. A. Güler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. CoRR abs/1802.00434.
  • [9] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [10] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339.
  • [11] Q. Huang, V. K. Singh, and P. K. Atrey (2014) Cyber bullying detection using social and textual analysis. In Proceedings of the 3rd International Workshop on Socially-Aware Multimedia, pp. 3–6.
  • [12] Z. Kang, K. Grauman, and F. Sha (2011) Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML'11, USA, pp. 521–528. ISBN 978-1-4503-0619-5.
  • [13] I. Kokkinos (2016) UberNet: training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CoRR abs/1609.02132.
  • [14] R. Kumar, A. K. Ojha, S. Malmasi, and M. Zampieri (2018) Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), pp. 1–11.
  • [15] N. Lee, Z. Liu, and P. Fung (2019) Team yeon-zi at SemEval-2019 task 4: hyperpartisan news detection by de-noising weakly-labeled data. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 1052–1056.
  • [16] P. Liu, W. Li, and L. Zou (2019) NULI at SemEval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 87–91.
  • [17] Y. Liu and G. Zhao (2018) PAD-net: a perception-aided single image dehazing network. CoRR abs/1805.03146.
  • [18] Z. Liu, G. I. Winata, Z. Lin, P. Xu, and P. Fung (2019) Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. arXiv preprint arXiv:1911.09273.
  • [19] M. Long and J. Wang (2015) Learning multiple tasks with deep relationship networks. CoRR abs/1506.02117.
  • [20] S. Malmasi and M. Zampieri (2017) Detecting hate speech in social media. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pp. 467–472.
  • [21] H. Mubarak, K. Darwish, and W. Magdy (2017) Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pp. 52–56.
  • [22] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang (2016) Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pp. 145–153.
  • [23] J. H. Park and P. Fung (2017) One-step and two-step classification for abusive language detection on twitter. arXiv preprint arXiv:1706.01206.
  • [24] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • [25] S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, and P. Nakov (2020) A large-scale semi-supervised dataset for offensive language identification. arXiv preprint.
  • [26] D. Su, Y. Xu, G. I. Winata, P. Xu, H. Kim, Z. Liu, and P. Fung (2019) Generalizing question answering system with pre-trained language model fine-tuning. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pp. 203–211.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • [28] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1415–1420.
  • [29] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) SemEval-2019 task 6: identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 75–86.
  • [30] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666.
  • [31] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of SemEval.
  • [32] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. CoRR abs/1506.06724.