This work is licensed under a Creative Commons
Attribution 4.0 International License.
License details: http://creativecommons.org/licenses/by/4.0/.
With the development of information technology, social media has become an increasingly popular place for people to express their views and exchange ideas publicly. However, some people take advantage of the anonymity of social media platforms to post rude comments and to verbally attack others with offensive language. To keep the online environment healthy for adolescents and to filter offensive messages for users, it is necessary for technology companies to develop efficient and effective computational methods that identify offensive language automatically.
Transformer-based contextualized embedding approaches such as BERT, XLNet, RoBERTa, ALBERT, and ELECTRA have re-established the state of the art for many natural language classification tasks, notably on the GLUE benchmark. These models are pre-trained on different large corpora; for example, BERT was pre-trained on the BookCorpus and English Wikipedia, while RoBERTa was pre-trained on CC-News, OpenWebText, and Stories, which enables each model to learn different language features.
This paper presents six transformer-based offensive language identification models that learn different features from the target utterance. To combine the distinctive learned language features, we introduce an ensemble strategy that concatenates the representations of the individual models and feeds them into a linear decoder to make the binary classification (Section 4.2). This strategy largely improves performance over the baselines on our dev set (Section 4.4).
2 Related Work
Offensive language in Twitter, Facebook, and Wikipedia has been widely studied. In addition, different aspects of offensive language have been studied, such as the type and target of offensive posts, cyberbullying [7, 10], aggression, toxic comments, and hate speech [1, 4, 15, 16].
Many deep learning approaches have been used to address the task. Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and FastText have been applied to hate speech detection. Gambäck and Sikdar (2017) used four CNN models, with random word vectors, word2vec word vectors, character n-grams, and a concatenation of word2vec embeddings and character n-grams as feature embeddings, to categorize each tweet into four classes: racism, sexism, both (racism and sexism), and non-hate-speech.
3 Data Description
The datasets we use are the Offensive Language Identification Dataset (OLID) and the Semi-Supervised Offensive Language Identification Dataset (SOLID). Given a tweet, the task is to predict whether its content involves offensive language. Table 1 shows examples of offensive and non-offensive tweets from these two datasets.
OLID is a collection of 14,100 English tweets annotated as OFF or NOT. It is divided into a training set of 13,240 tweets and a test set of 860 tweets. SOLID is a collection of about 9 million English tweets labeled in a semi-supervised manner. The data are annotated with AVG_CONF (the average confidence) and CONF_STD (the standard deviation of the confidences) predicted by several supervised models. The test set provided by the organizers this year contains 3,887 tweets. Table 2 shows the statistics of OLID and SOLID.
4.1 Data Split
For our experiments, a combination of OLID and SOLID (Section 3) is used. We find that about 1.0% of SOLID consists of duplicates, which we remove before splitting the data. For the dataset used to fine-tune the classification models, we set the AVG_CONF threshold (Section 3) to 0.5 in SOLID, meaning that tweets with AVG_CONF above 0.5 are labelled OFF. 90% of the OLID TRN is combined with the whole of SOLID as the new training set TRN for fine-tuning the default transformer-based models (FT). The remaining 10% of the OLID TRN plus the OLID TST form the development set DEV for FT. All the existing datasets are combined as the training set TRN for model pre-training (PT). After pre-training, 99.5% of SOLID is randomly selected as the training set TRN and the remaining 0.5% as the development set DEV for fine-tuning our pre-trained models into classification and regression models (PT-C and PT-R). In PT-C, tweets with AVG_CONF above 0.5 are labelled OFF, while in PT-R the original AVG_CONF value is used. Furthermore, 90% of the OLID TRN is randomly selected as the new training set TRN, and the remaining 10% combined with the OLID TST becomes the development set DEV for further fine-tuning the classification and regression models (PT-C-C and PT-R-C). The ensemble model is fine-tuned on the same dataset as PT-C-C. Table 3 shows the detailed statistics of the data split in our experiments.
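The deduplication and AVG_CONF thresholding step described above can be sketched as follows (a minimal sketch with hypothetical field names; SOLID itself ships tweet IDs with AVG_CONF and CONF_STD scores):

```python
def label_solid(rows, threshold=0.5):
    """Deduplicate tweets and binarize AVG_CONF into OFF/NOT labels."""
    seen = set()
    labelled = []
    for text, avg_conf in rows:
        if text in seen:  # about 1% of SOLID are duplicates
            continue
        seen.add(text)
        label = "OFF" if avg_conf > threshold else "NOT"
        labelled.append((text, label))
    return labelled

rows = [("you are an idiot", 0.93),
        ("you are an idiot", 0.93),   # duplicate, dropped
        ("have a nice day", 0.04)]
print(label_solid(rows))
# → [('you are an idiot', 'OFF'), ('have a nice day', 'NOT')]
```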
In general, the default transformer-based models are fine-tuned as baseline models. The sequence of input embeddings generated by the transformer encoder is fed into a linear decoder to obtain the output vector that makes the binary classification. We then pre-train these default models and choose the checkpoints with the lowest perplexity. Next, we fine-tune the pre-trained models into regression models and classification models on the corresponding datasets. The regression and classification models are then fine-tuned again into classification models. Finally, the sentence representations of the individual models are concatenated and fed into a linear decoder to generate the output vector that makes the binary decision of whether a tweet is offensive.
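The difference between the two fine-tuning targets can be illustrated with a toy sketch. The paper does not state the loss functions; mean squared error for the AVG_CONF regression and binary cross-entropy for the OFF/NOT classification are assumed here as the standard choices:

```python
import math

# Assumed PT-R objective (sketch): regress the model's predicted
# confidence onto the AVG_CONF value provided by SOLID.
def mse_loss(pred_conf, avg_conf):
    return (pred_conf - avg_conf) ** 2

# Assumed PT-C objective (sketch): binary cross-entropy on the
# thresholded label; p_off is the predicted probability of OFF.
def bce_loss(p_off, label_is_off):
    return -math.log(p_off if label_is_off else 1.0 - p_off)
```

Under these assumptions, a tweet with AVG_CONF = 0.9 contributes `mse_loss(p, 0.9)` when fine-tuning PT-R but `bce_loss(p, True)` when fine-tuning PT-C, since 0.9 is above the 0.5 threshold.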
In our experiments, two types of transformer-based models are used as the default models: BERT-Base and RoBERTa-Base. For default model fine-tuning, BERT-Base and RoBERTa-Base are fine-tuned on FT (Section 4.1) as baseline models. For pre-training, BERT-Base and RoBERTa-Base are pre-trained on PT (Section 4.1). The two pre-trained models with the lowest perplexity are then fine-tuned into regression and classification models on PT-R and PT-C, respectively. Next, these fine-tuned models are further fine-tuned into classification models on PT-R-C and PT-C-C. Finally, the sentence representations of the six individual models are concatenated to form the ensemble model, which is fine-tuned on E. Figure 1 shows the overview of the six individual models and the ensemble model.
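The concatenation-plus-linear-decoder head described above can be sketched in plain Python (the weights below are illustrative, not trained ones; in the real system the inputs are the six models' sentence embeddings):

```python
def ensemble_logits(reps, weight, bias):
    """reps: per-model sentence vectors; returns the two class logits."""
    x = [v for rep in reps for v in rep]  # concatenate representations
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weight, bias)]

# Two toy models with 3-dim representations -> 6-dim concatenated input.
reps = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
weight = [[1, 0, 0, 0, 0, 0],   # row producing the NOT logit
          [0, 0, 0, 0, 0, 1]]   # row producing the OFF logit
bias = [0.0, 0.0]
logits = ensemble_logits(reps, weight, bias)
label = "OFF" if logits[1] > logits[0] else "NOT"
```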
4.3 Experimental Setup
According to our experiments, data preprocessing does not contribute significantly to the final prediction results on such a huge dataset, so we skip it. Based on an analysis of sentence lengths in the dataset, we set max_length of the models to be . After an extensive hyper-parameter search, we set learning_rate to be , seed_value to be , and epochs to be for our six individual models and the ensemble model. We then experiment further with the ensemble model and find that the best result is obtained by changing learning_rate to and dropout to .
Table 4 shows the results achieved by our individual models and the ensemble model. The selected pre-trained BERT-Base and RoBERTa-Base models have the lowest perplexities, 21.3 and 47.5, respectively. Our fine-tuned pre-trained classification-classification BERT and RoBERTa models outperform their baseline counterparts by about 1.7% and 1.1%, respectively. In addition, our fine-tuned pre-trained regression-classification BERT and RoBERTa models show 2.1% and 1.8% improvements over their baselines. The ensemble model with learning_rate of and dropout of 0.5 (E_2) achieves a significant improvement on the development set, outperforming the BERT and RoBERTa baselines by 8.5% and 8.6%, respectively. As a result, we use this ensemble model as our final model and submit its predictions to the shared task's CodaLab page (https://competitions.codalab.org/competitions/23285). We achieve a macro-F1 score of 90.901% on the test set and rank 36th among 85 participants in sub-task A. After the release of the gold labels, we also calculate our other models' performance on the test set (Table 4) and make a detailed comparison and analysis among the models (Section 4.5.1).
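The reported score is the macro-average of the per-class F1 over OFF and NOT, which can be computed as follows (a sketch on toy labels, not the shared task's evaluation script):

```python
def macro_f1(gold, pred, classes=("OFF", "NOT")):
    """Macro-averaged F1 over the given classes."""
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["OFF", "OFF", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT"]
# OFF: P=1, R=0.5, F1=2/3; NOT: P=2/3, R=1, F1=0.8 -> macro-F1 = 11/15
```

Because the macro average weights both classes equally regardless of their frequency, it penalizes errors on the minority OFF class more than plain accuracy does, which is why accuracy and F1 rank the models differently below.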
4.5.1 Ablation Analysis
When we fine-tune our pre-trained models B-PT-C, B-PT-R, R-PT-C, and R-PT-R on only 10% of PT-R and PT-C (Section 4.4), the resulting models B-PT-C-C, B-PT-R-C, R-PT-C-C, and R-PT-R-C achieve accuracies of 82.822%, 83.326%, 83.280%, and 83.646%, lower than the results obtained using the full data (Table 4). This indicates that deep learning models trained on a larger dataset perform better. For the ensemble model, when we decrease the learning_rate from (E) to (E_LL), the performance improves from 88.548% to 90.701%, which shows that the ensemble model is sensitive to changes in the learning rate. Changing the default dropout from 0.1 (E_LL) to 0.5 (E_HD) increases the performance to 90.884%, which indicates the influence of the dropout rate. After comparing the predicted labels from our unsubmitted models with the released gold labels (Table 4), we see that the model with the highest accuracy on the development set does not perform best on the test set, which may be caused by overfitting. The purely fine-tuned BERT-Base model (B_FT) achieves the same accuracy as the other two ensemble models. In addition, higher accuracy does not guarantee a higher F1-score because of the data imbalance.
4.5.2 Error Analysis
The confusion matrix in Figure 2 further displays the error pattern of our classifier on the test set. Only three instances labeled OFF are misclassified as NOT, while many more instances labeled NOT are misclassified as OFF. Table 5 shows these three misclassified offensive examples together with several misclassified non-offensive tweets.
One explanation is that the imbalance of the dataset leads the classifier to prefer the majority class. It is also possible that our classifier fails to capture some of the subtle nuances of meaning and context, so our system still needs improvement on these subtle details.
This paper explores the performance of six individual transformer-based models and their ensemble for the task of offensive language identification in social media. Default fine-tuned BERT-Base and RoBERTa-Base models are adapted to establish strong baselines for the ensemble model. Sentence representations from the six individual models are concatenated and fed into a linear decoder to make the binary decision. Our ensemble model with higher dropout improves accuracy on the dev set by up to 8.6% over the baseline models. However, it performs worse on the test set than the baseline model B-FT and the original ensemble model E, which reaches 92.153% accuracy. This may be caused by model overfitting and data imbalance, problems we need to take into consideration in future experiments.
We gratefully acknowledge the support of the AWS Machine Learning Research Awards (MLRA). Any contents in this material are those of the authors and do not necessarily reflect the views of AWS.
- (2017) Deep Learning for Hate Speech Detection in Tweets. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion), Republic and Canton of Geneva, CHE, pp. 759–760.
- (2012) Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. In 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pp. 71–80.
- (2020) ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In International Conference on Learning Representations.
- (2017) Automated Hate Speech Detection and the Problem of Offensive Language. In International AAAI Conference on Web and Social Media.
- (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- (2011) Modeling the Detection of Textual Cyberbullying. In International AAAI Conference on Web and Social Media.
- (2018) Convolutional Neural Networks for Toxic Comment Classification. arXiv:1802.09957.
- (2019) OpenWebText Corpus.
- (2014) Cyber Bullying Detection Using Social and Textual Analysis. In Proceedings of the 3rd International Workshop on Socially-Aware Multimedia (SAM '14), New York, NY, USA, pp. 3–6.
- (2018) Benchmarking Aggression Identification in Social Media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, New Mexico, USA, pp. 1–11.
- (2020) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
- (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint.
- (2017) Detecting Hate Speech in Social Media. In Proceedings of RANLP 2017.
- (2018) Challenges in Discriminating Profanity from Hate Speech. Journal of Experimental & Theoretical Artificial Intelligence 30(2), pp. 187–202.
- (2016) News Dataset Available.
- (2010) Offensive Language Detection Using Multi-level Classification. In Advances in Artificial Intelligence, A. Farzindar and V. Kešelj (Eds.), Berlin, Heidelberg, pp. 16–27.
- (2020) A Large-Scale Weakly Supervised Dataset for Offensive Language Identification. arXiv preprint.
- (2018) A Simple Method for Commonsense Reasoning. arXiv:1806.02847.
- (2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355.
- (2018) Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, pp. 1–10.
- (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32, pp. 5754–5764.
- (2019) Predicting the Type and Target of Offensive Posts in Social Media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1415–1420.
- (2020) SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of SemEval.
- (2015) Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In The IEEE International Conference on Computer Vision (ICCV).