
LT@Helsinki at SemEval-2020 Task 12: Multilingual or language-specific BERT?

This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12. Our team participated in sub-tasks A and C, titled offensive language identification and offense target identification, respectively. In both cases we used the so-called Bidirectional Encoder Representations from Transformers (BERT), a model pre-trained by Google and fine-tuned by us on the OLID and SOLID datasets. The results show that offensive tweet classification is one of several language-based tasks where BERT can achieve state-of-the-art results.



1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License.

The number of social media users has reached 3.5 billion, and an average of 6,000 tweets are generated every second [12] . With such a large volume of tweets, it seems inevitable that some use offensive language. A study from 2014 found that 67% of social media users had been exposed to online hate, and 21% had been the target of online hate [17].

The usual approach on social media sites is to forbid hate speech in the terms of service and to censor any inappropriate content detected by their algorithms, but still, companies like Facebook or Twitter have been harshly criticized for not doing enough [4]. This criticism has forced companies to try and find accurate and scalable solutions that solve the problem of offensive language detection using automated methods. Workshops dealing with offensive language, such as TRAC [7], TA-COS [9], or ALW1 [24], as well as shared tasks like GermEval [25], TRAC-1 [8], and OffensEval 2019 [27], are becoming more and more prevalent.

The task we address here, SemEval 2020 task 12 [28], is titled Multilingual Offensive Language Identification in Social Media and is divided into the following sub-tasks:


Offensive Language Identification in several languages (Arabic, Danish, English, Greek, Turkish): whether a tweet is offensive or not.


Categorization of Offense Types: whether an offensive tweet is targeted or untargeted.


Offense Target Identification: whether a targeted offensive tweet is directed towards an individual, a group or otherwise.

In this paper we describe the systems created by the LT@Helsinki team for sub-tasks A and C. In sub-task A we participated in all the language tracks; for sub-task C the only language available was English. We ranked second in sub-task C, and our submissions for sub-task A ranked first for Danish, seventh for Greek, eighteenth for Turkish, and forty-sixth for Arabic. In all submissions we used BERT-Base models [5] fine-tuned on the datasets provided by the task organizers. We also experimented with random forest over TF-IDF and other kinds of features, but the results on the development set were not as good as those of transfer-learning techniques based on pre-trained language models. We found that, at least for this data, language-specific models worked better than the multilingual one.

2 Background

All the data used to train our classifiers was provided by the OffensEval organizers [26]. The tweets were retrieved with the Twitter Search API and manually labeled by at least two human annotators. As a pre-processing step, the organizers desensitized all tweets, replacing usernames and website URLs with generic tokens. The following datasets were used for the offensive language identification problem:

Language Training Test Total
Arabic 8,000 2,000 10,000
Danish 2,960 329 3,289
Greek 8,743 1,544 10,287
Turkish 31,756 3,528 35,284
Table 1: Sub-task A datasets

Some of the algorithms that can be found in the literature are random forest [1], logistic regression [3], and Support Vector Machines [11], as well as deep learning approaches such as Convolutional Neural Networks [6] or Convolutional-GRU [29]. However, in 2018 deep pre-trained language models obtained state-of-the-art results in several downstream NLP tasks, text classification being one of them. In particular, Google's Bidirectional Encoder Representations from Transformers (BERT) [5] stood above the rest for being deeply bidirectional and for using the self-attention layers of the transformer model [23], which allow it to better capture a word's context by looking at both its left and right neighbours. BERT provides out-of-the-box pre-trained monolingual and multilingual models that, after massive training on general corpora, can be fine-tuned with a small amount of task-specific data and still offer excellent performance. The results published in OffensEval's previous edition [27] proved that BERT is well suited for offensive language detection, since it was the method chosen by most of the top teams, including the winners of sub-tasks A [10] and C [15].

For the Danish dataset [22] we used Nordic BERT, which is pre-trained on Danish Wikipedia texts, Danish text from Common Crawl, Danish OpenSubtitles, and text from popular Danish online forums. All in all, the training corpus consists of over 90M sentences and almost 20M unique tokens. For the other languages we used the standard BERT-Base models [5] with no further pre-training and only the provided datasets for each language, i.e. Arabic [14], Greek [20], and Turkish [2].

In sub-task C, since the language was English, we could use the Offensive Language Identification Dataset (OLID) provided by the organizers of OffensEval 2019. All 2019 sub-tasks shared a single dataset annotated according to a three-level hierarchical model, so each sub-task used a subset of the previous sub-task's data. First, all tweets were labeled as offensive (OFF) or not offensive (NOT). Then, for sub-task B, all offensive tweets were labeled as targeted (TIN) or untargeted (UNT) insults. Finally, the third level of the hierarchy labeled targeted insults according to the recipient of the offense: an individual (IND), a group (GRP), or a different kind of entity (OTH). Table 2 displays OLID's label distribution.

A B C Training Test Total
OFF TIN IND 2,407 100 2,507
OFF TIN GRP 1,074 78 1,152
OFF TIN OTH 395 35 430
OFF UNT - 524 27 551
NOT - - 8,840 620 9,460
All - - 13,240 860 14,100
Table 2: Distribution of label combinations in OLID.

The corpus from sub-task C in OffensEval 2019 contains 4,089 English tweets, of which 3,876 originally belonged to the training set and the remaining 213 to the test set (Table 2). We also had at our disposal over 9M English tweets provided by the OffensEval 2020 organizers [21]. However, these were labeled by unsupervised learning methods rather than human annotators, which is why each tweet came not with a label but with two values: (1) the confidence that it belongs to a specific class and (2) the standard deviation of that confidence.

3 System Overview and Experimental Setup

In this section we discuss the systems we created and the setup we used for sub-tasks A and C.

3.1 Sub-task A

For the binary classification problem (sub-task A), two baseline methods were implemented and evaluated on the Danish training dataset: random forest and BERT.

For the random forest implementation, the pre-processing steps were lower-casing all characters, removing irrelevant punctuation marks, collapsing characters repeated more than twice in a row, and converting hashtags to sentences by adding white spaces before every capital letter. Tokenization was done with the TweetTokenizer tool from NLTK. The same library was also used to perform stopword removal and word stemming. Emojis were removed after storing their "sentiment score" as a feature, using the emosent Python utility package [16]. We also used surface-level features such as the number of URL tokens and @USER mentions, the total number of characters, punctuation marks, and words in each post, the average word length, the percentage of capital letters, and the number of abusive terms. A ratio of 10:1 was applied when splitting the dataset into training and validation sets. Since the Danish dataset was relatively small, we used 10-fold cross-validation to obtain reliable results; otherwise the F1 scores depended too heavily on which samples fell into the validation set. The random forest implementation from scikit-learn [18] was used, and the optimal parameters were found with grid search.
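The surface-level features described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the punctuation set, and the abusive-term list are our own assumptions, and the URL/@USER placeholders follow the desensitized tokens mentioned in Section 2.

```python
# Hedged sketch of the surface-level feature extraction; names and the
# punctuation/abusive-term sets are illustrative, not the authors' implementation.
def surface_features(post, abusive_terms=frozenset()):
    tokens = post.split()
    words = [t for t in tokens if t not in ("URL", "@USER")]
    n_chars = len(post)
    return {
        "n_url": tokens.count("URL"),          # desensitized URL placeholders
        "n_mentions": tokens.count("@USER"),   # desensitized user mentions
        "n_chars": n_chars,
        "n_punct": sum(1 for c in post if c in ".,!?;:"),
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "pct_caps": sum(1 for c in post if c.isupper()) / max(n_chars, 1),
        "n_abusive": sum(1 for w in words
                         if w.strip(".,!?;:").lower() in abusive_terms),
    }

feats = surface_features("@USER that is SO dumb!! URL", abusive_terms={"dumb"})
```

A feature dictionary like this would then be concatenated with the TF-IDF vector before being passed to the random forest classifier.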

System macro-F1
All NOT 0.465
Random Forest with TFIDF 0.773
Multilingual BERT-Base 0.768
Nordic BERT 0.804
Table 3: Development results sub-task A, Danish dataset

The other approach was to apply pre-trained BERT models. After experimenting with both the original Base version (12 layers, hidden size 768, 12 attention heads, 110M parameters) and the publicly available Nordic BERT, it was clear that the latter was better suited for the task. This shows that further pre-training of BERT can significantly boost performance, especially in cases like this where there is very little data (the Danish dataset was by far the smallest of the shared task). Our final submission was generated by the Danish version of Nordic BERT fine-tuned on the OffensEval 2020 data, using a batch size of 32 for training and 16 for both validation and testing. The learning rate of the Adam optimizer was set to 2e-5 and the model was trained for 4 epochs. The sequence length was set to 128 because, even though some instances are very long (some are not tweets but Reddit comments), an analysis of the length distribution showed that only 3.4% of the training examples reached this limit after being tokenized by BertTokenizer.

Due to time constraints we were not able to experiment in depth with all languages available for sub-task A, which is why Danish was our main focus. However, seeing that BERT could perform so well with almost no pre-processing, we decided to generate results for the other three languages as well. For Turkish and Arabic, we used the BERT-base-multilingual-cased model with a maximum sequence length of 128, a batch size of 32, and a learning rate of 2e-5. On the other hand, for Greek we used the BERT-base-uncased model after lowercasing and translating the entire dataset into English. In all cases we trained the model for 4 epochs, used BertTokenizer to tokenize the tweets, and padded and truncated the sequences to make sure each data instance had the same length.

3.2 Sub-task C

The target identification problem (sub-task C, English) had an additional level of difficulty with respect to the first task because the dataset was highly imbalanced and composed of three classes instead of two. Thus, in this case we focused our efforts on balancing the given dataset to prevent a high number of misclassifications of minority-class instances (which would notably affect the resulting macro-F1 score). To mitigate the class imbalance, we trained our model on all the data from 2019 plus additional instances from the non-majority classes (GRP and OTH) of the 2020 dataset. Only the 300 OTH instances and 237 GRP instances of highest confidence were added, slightly improving the balance of the dataset. We experimented with different thresholds to select more samples, but in the end we kept the value low to ensure that all tweets used for training were tagged correctly. Finally, an over-sampling technique with replacement was applied: we simply produced copies of instances from the minority classes to end up with a fully balanced dataset of 11,628 tweets (3,876 per class). Under-sampling was not an option because it would have caused a data scarcity problem, and we experimented with ratios other than 1:1:1, obtaining promising results but none significantly better than the proposed approach. Another interesting approach, chosen by the winners of last year's edition [15], is to modify the classification thresholds (i.e. lower the thresholds for classes OTH and GRP) so that fewer new examples are classified as the majority class.
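The over-sampling step can be sketched as follows; `oversample` is our own helper, not the authors' code. Minority-class instances are duplicated by sampling with replacement until every class matches the majority count.

```python
import random

# Hedged sketch of over-sampling with replacement: duplicate minority-class
# instances until every class has as many examples as the majority class.
def oversample(dataset, seed=0):
    rng = random.Random(seed)
    by_label = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)                                # originals
        balanced.extend(rng.choices(items, k=target - len(items)))  # copies
    return balanced

# Toy imbalanced dataset: 10 IND, 4 GRP, 2 OTH instances.
data = ([("t%d" % i, "IND") for i in range(10)]
        + [("g%d" % i, "GRP") for i in range(4)]
        + [("o%d" % i, "OTH") for i in range(2)])
balanced = oversample(data)  # 10 instances per class, 30 in total
```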

The now balanced dataset was then used as input for a BERT-base-uncased model with a maximum sequence length of 128 and batch sizes of 32, 16, and 8 for training, validation, and prediction, respectively. The learning rate was set to 2e-5 and the training lasted 4 epochs. In this case we did not perform any pre-processing step other than lowercasing, since none of our attempts (e.g. translating emojis to sentences, splitting hashtags into separate words) significantly boosted performance. Moreover, we believe that for this specific task it might not be a good idea to remove @USER mentions or certain stop words, since they can carry valuable information for target identification.
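The threshold-modification alternative mentioned above [15] can be sketched as follows. Instead of taking a plain argmax over the softmax probabilities, the minority classes (GRP, OTH) are given effectively lower decision thresholds, here implemented as class weights; the weight values are invented for illustration and are not from either paper.

```python
# Hedged sketch of adjusting classification thresholds for minority classes.
# Multiplying a class's probability by a weight > 1 is equivalent to lowering
# its decision threshold. The weight values below are made up.
CLASSES = ["IND", "GRP", "OTH"]
WEIGHTS = {"IND": 1.0, "GRP": 1.3, "OTH": 1.6}  # boost minority classes

def predict(probs):
    """probs: mapping from class label to softmax probability for one tweet."""
    return max(CLASSES, key=lambda c: probs[c] * WEIGHTS[c])

# A tweet whose raw argmax would be IND (0.45) is reassigned to OTH
# because 0.30 * 1.6 = 0.48 > 0.45.
pred = predict({"IND": 0.45, "GRP": 0.25, "OTH": 0.30})
```

In practice the weights (or thresholds) would be tuned on the development set to maximize macro-F1.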

3.3 Unfruitful Attempts at Performance Improvement

pos neg ang ant dis fea joy sad sur tru
NOT 51.476 79.136 30.028 17.623 29.685 47.055 16.588 25.843 10.753 26.785
OFF 68.640 110.350 41.481 23.474 37.543 58.420 22.374 46.046 9.190 26.838
Table 4: Emotion word distribution in offensive and non-offensive messages.

Offensive tweets and posts contained statistically significantly more emotion-laden words, including positive ones. We arrived at these results by augmenting the data (see Ohman, 2016) using the NRC Emotion Lexicon [13] for the languages in question and then comparing the normalized word counts per offensive post. We did not take context or negation into account. Unfortunately, we were unable to meaningfully utilize this information to improve the performance of our system: even though offensive tweets were more likely to include emotion words, and more of them, a significant number of offensive tweets contained no emotion words and many non-offensive tweets did contain them. Nonetheless, the results were quite interesting, so we share them here in the hope that they might be useful to someone else, perhaps for the OffensEval 2021 tasks. The numbers in Table 4 represent normalized counts of words classified as carrying a specific sentiment or emotion, for the Danish dataset.
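The comparison behind Table 4 can be sketched as below. The lexicon here is a tiny made-up stand-in (the paper used the NRC Emotion Lexicon [13]), counts are normalized per post, and context and negation are ignored, as in the paper.

```python
# Hedged sketch of per-post normalized emotion-word counts; the toy lexicon
# is invented for illustration and ignores context and negation.
def emotion_rate(posts, lexicon):
    """Average number of lexicon hits per post."""
    hits = sum(1 for p in posts for w in p.lower().split() if w in lexicon)
    return hits / len(posts)

anger_lexicon = {"hate", "idiot", "angry"}  # toy stand-in for one NRC category
offensive = ["you idiot", "i hate this idiot"]
non_offensive = ["nice day", "i love cake", "hello"]

off_rate = emotion_rate(offensive, anger_lexicon)
not_rate = emotion_rate(non_offensive, anger_lexicon)
```

Comparing `off_rate` and `not_rate` per emotion category, for each class, yields a table of the same shape as Table 4.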

4 Results

For Danish, using Nordic BERT for the final submission, we obtained an accuracy of 92.38% and an F1-score of 81.18% (Figure 1a). The confusion matrix shows that only 9 of 34 offensive tweets (OFF) were misclassified, while 16 of 294 non-offensive tweets (NOT) were wrongly classified as offensive. Regarding the other languages, we obtained F1-scores of 82.6% for Greek, 77.2% for Turkish, and 73.1% for Arabic.

Figure 1: Official results of the LT@Helsinki Team.
Left: Danish, sub-task A, Right: Sub-task C (English)

In sub-task C our submission ranked second with an accuracy of 79.74% and an F1-score of 66.99%. The relatively low F1-score is due to the high number of misclassifications in the minority class: only 33 of 82 OTH instances were correctly classified (see Figure 1b). Despite our efforts to balance the dataset, the skewed class distribution inevitably biased the model.

It is impressive how well BERT performed overall, even though we applied very few pre-processing steps to the data. Moreover, the results obtained on previously unseen data indicate that none of our models were heavily overfitted.

5 Conclusions

Our scores on evaluation data show that models pre-trained on general corpora can obtain competitive results even when there is very little task data available, as was the case for the Danish sub-task, where we obtained our best results. BERT performed well on both tasks, but there is still room for improvement. In the future, we intend to experiment with the languages we could not focus on this time. It should be noted that multilingual BERT works best with languages similar to English [19], so the other languages would very likely have benefited from language-specific models even more than Danish did, Danish being the closest to English of the four compared to Greek, Turkish, and Arabic.

A more thorough comparison between multilingual and language-specific BERT would yield more definitive answers as to whether language-specific is always best or whether that is language-dependent. Further examination and use of emotion-laden words in offensive texts could also help with the detection of offensive texts. Examining other baseline methods and combining them into an ensemble model with majority voting is another approach to consider for future work. With regards to the class imbalance problem, other techniques such as adjusting the classification thresholds or enriching the dataset with back and forth machine translation of the minority class could prove to be useful.


  • [1] P. Burnap and M. L. Williams (2015) Cyber hate speech on twitter: an application of machine classification and statistical modeling for policy and decision making. Policy & Internet 7 (2), pp. 223–242. Cited by: §2.
  • [2] Ç. Çöltekin (2020) A Corpus of Turkish Offensive Language on Social Media. In Proceedings of the 12th International Conference on Language Resources and Evaluation, Cited by: §2.
  • [3] T. Davidson, D. Warmsley, M. W. Macy, and I. Weber (2017) Automated hate speech detection and the problem of offensive language. In ICWSM, Cited by: §2.
  • [4] F. Del Vigna, A. Cimino, F. Dell’Orletta, M. Petrocchi, and M. Tesconi (2017) Hate me, hate me not: hate speech detection on facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pp. 86–95. Cited by: §1.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §2, §2.
  • [6] B. Gambäck and U. K. Sikdar (2017) Using convolutional neural networks to classify hate-speech. In Proceedings of the first workshop on abusive language online, pp. 85–90. Cited by: §2.
  • [7] R. Kumar, A. K. Ojha, M. Zampieri, and S. Malmasi (2018) Proceedings of the first workshop on trolling, aggression and cyberbullying (trac-2018). In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Cited by: §1.
  • [8] R. Kumar, A. Kr. Ojha, S. Malmasi, and M. Zampieri (2018) Benchmarking aggression identification in social media. In TRAC@COLING 2018, Cited by: §1.
  • [9] E. Lefever, B. Desmet, and G. De Pauw (2018) TA-cos 2018: 2nd workshop on text analytics for cybersecurity and online safety: proceedings. In TA-COS 2018–2nd Workshop on Text Analytics for Cybersecurity and Online Safety, collocated with LREC 2018, 11th edition of the Language Resources and Evaluation Conference, Cited by: §1.
  • [10] P. Liu, W. Li, and L. Zou (2019) NULI at semeval-2019 task 6: transfer learning for offensive language detection using bidirectional transformers. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 87–91. Cited by: §2.
  • [11] S. Malmasi and M. Zampieri (2018) Challenges in discriminating profanity from hate speech. Journal of Experimental & Theoretical Artificial Intelligence 30 (2), pp. 187–202. Cited by: §2.
  • [12] B. Mathew, R. Dutt, P. Goyal, and A. Mukherjee (2019) Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, WebSci ’19, New York, NY, USA, pp. 173–182. External Links: ISBN 9781450362023, Link, Document Cited by: §1.
  • [13] S. M. Mohammad and P. D. Turney (2013) Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29 (3), pp. 436–465. Cited by: §3.3.
  • [14] H. Mubarak, A. Rashed, K. Darwish, Y. Samih, and A. Abdelali (2020) Arabic offensive language on twitter: analysis and experiments. arXiv preprint arXiv:2004.02192. Cited by: §2.
  • [15] A. Nikolov and V. Radivchev (2019) Nikolov-radivchev at semeval-2019 task 6: offensive tweet classification with bert and ensembles. In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 691–695. Cited by: §2, §3.2.
  • [16] P. K. Novak, J. Smailović, B. Sluban, and I. Mozetič (2015) Sentiment of emojis. PloS one 10 (12), pp. e0144296. Cited by: §3.1.
  • [17] A. Oksanen, J. Hawdon, E. Holkeri, M. Näsi, and P. Räsänen (2014) Exposure to online hate among young social media users. Sociological studies of children & youth 18 (1), pp. 253–273. Cited by: §1.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830. Cited by: §3.1.
  • [19] T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual bert?. arXiv preprint arXiv:1906.01502. Cited by: §5.
  • [20] Z. Pitenis, M. Zampieri, and T. Ranasinghe (2020) Offensive Language Identification in Greek. In Proceedings of the 12th Language Resources and Evaluation Conference, Cited by: §2.
  • [21] S. Rosenthal, P. Atanasova, G. Karadzhov, M. Zampieri, and P. Nakov (2020) A Large-Scale Semi-Supervised Dataset for Offensive Language Identification. arXiv preprint. Cited by: §2.
  • [22] G. I. Sigurbergsson and L. Derczynski (2019) Offensive language and hate speech detection for danish. arXiv preprint arXiv:1908.04531. Cited by: §2.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • [24] Z. Waseem, W. H. K. Chung, D. Hovy, and J. Tetreault (2017) Proceedings of the first workshop on abusive language online. In Proceedings of the First Workshop on Abusive Language Online, Cited by: §1.
  • [25] M. Wiegand, M. Siegel, and J. Ruppenhofer (2018) Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval, Cited by: §1.
  • [26] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666. Cited by: §2.
  • [27] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of The 13th International Workshop on Semantic Evaluation (SemEval), Cited by: §1, §2.
  • [28] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin (2020) SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of SemEval, Cited by: §1.
  • [29] Z. Zhang, D. Robinson, and J. Tepper (2018) Detecting hate speech on twitter using a convolution-gru based deep neural network. In European semantic web conference, pp. 745–760. Cited by: §2.