IIITG-ADBU@HASOC-Dravidian-CodeMix-FIRE2020: Offensive Content Detection in Code-Mixed Dravidian Text

07/29/2021
by   Arup Baruah, et al.
Accenture

This paper presents the results obtained by our SVM and XLM-RoBERTa based classifiers in the shared task Dravidian-CodeMix-HASOC 2020. The SVM classifier trained using TF-IDF features of character and word n-grams performed the best on the code-mixed Malayalam text. It obtained a weighted F1 score of 0.95 (1st rank) on the YouTube dataset and 0.76 (3rd rank) on the Twitter dataset. The XLM-RoBERTa based classifier performed the best on the code-mixed Tamil text. It obtained a weighted F1 score of 0.87 (3rd rank) on the code-mixed Tamil Twitter dataset.

1 Introduction

The use of offensive language in social media text has become a new social problem. Such language can have a negative psychological impact on readers and can adversely affect people's emotions and behavior. Hate speech has fueled riots in many places around the world. As such, it is important to keep social media free from offensive language. Considerable research has been performed on automated techniques for detecting offensive language. Among the many challenges that such systems have to tackle, the use of code-mixed text is one. Code-mixing is the phenomenon of mixing words from more than one language in the same sentence or between sentences.

The shared task “Dravidian-CodeMix-HASOC 2020” [1, 2] is an attempt to promote research on offensive language detection in code-mixed text. It was held as a sub-track of “Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC)” at FIRE-2020 and comprised two tasks. Task 1 required detection of offensive language in code-mixed Malayalam-English text from YouTube. Task 2 required detection of offensive language in code-mixed Tamil-English and Malayalam-English tweets. Both tasks were binary classification problems in which a given text had to be classified as offensive or not offensive.

We participated in both the tasks. We used SVM and XLM-RoBERTa classifiers in our study. The SVM classifier was trained using TF-IDF features of character n-grams, word n-grams, and character and word n-grams combined.

2 Related Work

Offensive language detection in English has witnessed the use of SVM [3, 4, 5, 6, 7], Logistic Regression [8, 9, 10, 6, 11], and deep learning techniques [12, 13, 14, 15, 16, 17]. The main focus of [5] was to tackle the use of code words for obfuscating hate words. Traditional machine learning and deep learning techniques have also been used to detect offensive language in code-mixed Hindi-English text [18, 19, 20, 21, 22, 23, 24]. Work on code-mixed Tamil-English and Malayalam-English text includes corpora created for sentiment analysis in these two languages [25, 26]. The work in [27] focused on machine translation of code-mixed text in Dravidian languages and found that removing code-mixing improves the quality of machine translation.

3 Dataset

Table 1 shows the statistics of the dataset provided as part of this shared task. The instances in the dataset were labeled as “not offensive” (NOT) or “offensive” (OFF). Task 1 was conducted for the Malayalam language only, and the source of its dataset was YouTube. As can be seen from the table, this dataset is imbalanced, with about 83% of the instances labeled as NOT. Task 2 was conducted for both the Tamil and Malayalam languages, and the source of its datasets was Twitter. As can be seen from the table, the datasets for this task were balanced. Train, development, and test sets were provided for Task 1. For Task 2, only train and test sets were provided. We created the development set for Task 2 by performing a stratified split of the train set, retaining 85% for training and using 15% as the development set.

Label   Task 1 - Malayalam                           Task 2 - Tamil                Task 2 - Malayalam
        Train          Dev            Test           Train          Test           Train          Test
NOT     2633 (82.3%)   328 (82%)      334 (83.5%)    2020 (50.5%)   465 (49.5%)    2047 (51.2%)   488 (48.8%)
OFF     567 (17.7%)    72 (18%)       66 (16.5%)     1980 (49.5%)   475 (50.5%)    1953 (48.8%)   512 (51.2%)
Total   3200           400            400            4000           940            4000           1000
Table 1: Data set statistics
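
The 85%/15% stratified split used to build the Task 2 development set can be sketched as follows. This is only an illustration in plain Python: the function name `stratified_split`, the seed, and the rounding rule are assumptions, since the paper does not say which tool or random seed was used. The label counts are those of the Task 2 Tamil train set in Table 1.

```python
import random
from collections import defaultdict

def stratified_split(labels, dev_fraction=0.15, seed=0):
    """Return (train_idx, dev_idx) so that each label keeps its share in both parts."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)          # fixed seed (assumed) for repeatability
    train_idx, dev_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)           # random choice within each class
        n_dev = round(len(indices) * dev_fraction)
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)

# Label counts of the Task 2 Tamil train set (Table 1).
labels = ["NOT"] * 2020 + ["OFF"] * 1980
train_idx, dev_idx = stratified_split(labels)
print(len(train_idx), len(dev_idx))
```

Splitting each class separately keeps the NOT/OFF ratio of the development set close to that of the train set, which matters when scores on the two sets are compared.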

4 Methodology

In this study, we used SVM and XLM-RoBERTa based classifiers. The SVM classifier was trained using TF-IDF features of character n-grams, word n-grams, and a combination of character and word n-grams. We used character n-grams of size 1 to 6 and word n-grams of size 1 to 3.
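
To make the feature definition concrete, the sketch below extracts character n-grams of size 1 to 6 and word n-grams of size 1 to 3 and weights them with TF-IDF. It is a plain-Python illustration, not our actual pipeline: the helper names, the whitespace tokenisation, the lower-casing, the smoothed IDF formula, and the L2 normalisation are all assumptions, and the two example strings are made up.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=6):
    """All character n-grams of length n_min..n_max (spaces included)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def word_ngrams(text, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max over whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def features(text):
    """Combined feature list; prefixes keep character and word n-grams apart."""
    return (["c:" + g for g in char_ngrams(text.lower())] +
            ["w:" + g for g in word_ngrams(text)])

def tfidf(docs):
    """One {feature: weight} dict per document, L2-normalised."""
    counts = [Counter(features(d)) for d in docs]
    df = Counter(f for c in counts for f in c)   # document frequency
    n = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF; an assumed formula, the paper does not specify one.
        vec = {f: tf * (math.log((1 + n) / (1 + df[f])) + 1) for f, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({f: w / norm for f, w in vec.items()})
    return vectors

docs = ["good movie bro", "bad movie bro"]   # made-up placeholder texts
vectors = tfidf(docs)
print(len(vectors[0]))
```

Features that occur in every document, such as the word "movie" here, receive a lower weight than features that occur in only one, and the resulting sparse vectors are what a linear classifier such as an SVM would be trained on.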

The XLM-RoBERTa model [28] is based on the RoBERTa model [29], which in turn is based on the transformer architecture. XLM-RoBERTa is a multilingual model trained on 100 different languages, including Tamil and Malayalam. In our study, we used the pre-trained base model. The Adam optimizer with weight decay was used during training. The learning rate and epsilon parameter for the optimizer were set to 2e-5 and 1e-8 respectively. We used the sequence classification class provided by the HuggingFace Transformers library (https://huggingface.co/transformers/). This class provides a linear layer on top of the pooled output to perform the binary classification.

5 Results

Task                System                Precision    Recall       F1
                                          (Weighted)   (Weighted)   (Weighted)
Task 1 Malayalam    SVM (char)            0.9187       0.9175       0.9096
Task 1 Malayalam    SVM (word)            0.9138       0.9075       0.8950
Task 1 Malayalam    SVM (char + word)     0.9330       0.9325       0.9278
Task 1 Malayalam    XLM-RoBERTa           0.9305       0.9325       0.9307
Task 2 Tamil        SVM (char)            0.8650       0.8633       0.8630
Task 2 Tamil        SVM (word)            0.8733       0.8717       0.8714
Task 2 Tamil        SVM (char + word)     0.8617       0.8600       0.8597
Task 2 Tamil        XLM-RoBERTa           0.8651       0.8650       0.8650
Task 2 Malayalam    SVM (char)            0.7519       0.7500       0.7490
Task 2 Malayalam    SVM (word)            0.7190       0.7100       0.7056
Task 2 Malayalam    SVM (char + word)     0.7630       0.7617       0.7610
Task 2 Malayalam    XLM-RoBERTa           0.5732       0.5483       0.5171
Table 2: Dev Set Results

Task                System                Precision    Recall       F1           Rank
                                          (Weighted)   (Weighted)   (Weighted)
Task 1 Malayalam    SVM (char + word)     0.9505       0.9500       0.9471       1st
Task 1 Malayalam    XLM-RoBERTa           0.9241       0.9250       0.9245       -
Task 2 Tamil        SVM (word)            0.8524       0.8521       0.8520       -
Task 2 Tamil        XLM-RoBERTa           0.8680       0.8670       0.8669       3rd
Task 2 Malayalam    SVM (char + word)     0.7686       0.7630       0.7623       3rd
Task 2 Malayalam    XLM-RoBERTa           0.6181       0.5800       0.5337       -
Table 3: Test Set Results

        Task 1 (Malayalam)              Task 2 (Tamil)                  Task 2 (Malayalam)
        SVM char+word   XLM-RoBERTa     SVM word        XLM-RoBERTa     SVM char+word   XLM-RoBERTa
        NOT     OFF     NOT     OFF     NOT     OFF     NOT     OFF     NOT     OFF     NOT     OFF
NOT     332     2       320     14      389     76      390     75      403     85      127     361
OFF     18      48      16      50      63      412     50      425     152     360     59      453
(Rows: true label. Columns: predicted label.)
Table 4: Confusion Matrices of the submitted classifiers on the Test Set

Table 2 shows the results obtained by our SVM and XLM-RoBERTa classifiers on the development set. For task 1, the development set was provided as part of the dataset. For task 2, the development set was created by performing a stratified split on the train set. 15% of the train set was set aside as the development set. The XLM-RoBERTa classifier performed the best with a weighted F1 score of 0.9307 in the development set for task 1. Among the SVM classifiers, the one trained using the combination of TF-IDF features of character and word n-grams performed the best with a weighted F1 score of 0.9278.

On the task 2 dev set, the SVM classifier trained using the TF-IDF features of word n-grams performed the best for code-mixed Tamil-English text. It obtained a weighted F1 score of 0.8714. The XLM-RoBERTa classifier obtained a weighted F1 score of 0.8650 and was the second best performing classifier on the dev set for this task. For the code-mixed Malayalam-English text of the task 2 dev set, the best performing classifier was the SVM classifier trained using the combination of TF-IDF features of character and word n-grams. It obtained a weighted F1 score of 0.7610. The XLM-RoBERTa classifier obtained a weighted F1 score of 0.5171 and was the worst performing classifier for this task.

Table 3 shows the results that our submitted classifiers obtained on the test set. The SVM classifiers listed in this table are the only ones submitted for the tasks. These classifiers were selected based on their performance on the development set. As can be seen from the table, the SVM classifier trained on the combination of TF-IDF features of character and word n-grams performed the best in task 1 with a weighted F1 score of 0.9471. It obtained the 1st rank for the task. XLM-RoBERTa was the best performing classifier for the Tamil-English dataset of task 2. It obtained a weighted F1 score of 0.8669 and the 3rd rank for the task. The SVM classifier trained on the combination of TF-IDF features of character and word n-grams again performed the best for the Malayalam-English dataset of task 2 with a weighted F1 score of 0.7623. It obtained the 3rd rank for the task. Table 4 shows the confusion matrices obtained on the test set by the classifiers submitted for the shared task.
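
The weighted scores in Table 3 can be recomputed from the confusion matrices in Table 4. The sketch below assumes that the rows of each matrix are the true labels and the columns the predicted labels, and that "weighted" means averaging the per-class scores with the class supports as weights. The function name `weighted_scores` is ours and is used only for illustration.

```python
def weighted_scores(confusion):
    """confusion[t][p] = number of items with true class t predicted as class p."""
    total = sum(sum(row) for row in confusion)
    precision = recall = f1 = 0.0
    for k in range(len(confusion)):
        tp = confusion[k][k]
        support = sum(confusion[k])                    # items whose true class is k
        predicted = sum(row[k] for row in confusion)   # items predicted as class k
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total                       # class share of the test set
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Task 1 (Malayalam), SVM char+word, from Table 4 (class order: NOT, OFF).
svm_task1 = [[332, 2], [18, 48]]
p, r, f = weighted_scores(svm_task1)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.9505 0.9500 0.9471
```

Under these assumptions the output matches the Task 1 SVM row of Table 3. With support weighting, weighted recall equals accuracy, so a weighted F1 near 0.95 on the imbalanced Task 1 data can coexist with a much lower recall on the OFF class (48 of 66).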

6 Conclusion

We used SVM and XLM-RoBERTa based classifiers to detect offensive language in code-mixed Tamil-English and Malayalam-English text. In our study, the SVM classifier trained using a combination of TF-IDF features of character and word n-grams performed the best for code-mixed Malayalam-English text (both the YouTube and Twitter datasets). This classifier obtained weighted F1 scores of 0.95 (1st rank) and 0.76 (3rd rank) for Task 1 and Task 2 (Malayalam) respectively. The XLM-RoBERTa based classifier performed the best for the Tamil-English dataset of Task 2 and obtained a weighted F1 score of 0.87 (3rd rank) for the task. Comparing the performance of our SVM models on the YouTube and Twitter data for the Malayalam language, we observe that the performance of the classifier degraded considerably on the Twitter dataset. Whether this degradation is due to the type of language used in Twitter conversations, the length of the text, or other factors can be investigated in a future study.

Acknowledgements.
Supported by Visvesvaraya PhD Scheme, MeitY, Govt. of India, MEITY-PHD-3050.

References

  • Chakravarthi et al. [2020a] B. R. Chakravarthi, M. A. Kumar, J. P. McCrae, P. B, S. KP, T. Mandl, Overview of the track on "hasoc-offensive language identification- dravidiancodemix", in: Proceedings of the 12th Forum for Information Retrieval Evaluation, FIRE ’20, 2020a.
  • Chakravarthi et al. [2020b] B. R. Chakravarthi, M. A. Kumar, J. P. McCrae, P. B, S. KP, T. Mandl, Overview of the track on "hasoc-offensive language identification- dravidiancodemix", in: Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2020). CEUR Workshop Proceedings. In: CEUR-WS. org, Hyderabad, India, 2020b.
  • Malmasi and Zampieri [2017] S. Malmasi, M. Zampieri, Detecting Hate Speech in Social Media, in: RANLP 2017, Varna, Bulgaria, 2017, pp. 467–472.
  • Malmasi and Zampieri [2018] S. Malmasi, M. Zampieri, Challenges in discriminating profanity from hate speech, Journal of Experimental & Theoretical Artificial Intelligence 30 (2018) 187–202.
  • Magu et al. [2017] R. Magu, K. Joshi, J. J.Luo, Detecting the Hate Code on Social Media, in: AAAI ICWSM 2017, Montreal, 2017, pp. 608–611.
  • Davidson et al. [2017] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated Hate Speech Detection and the Problem of Offensive Language, in: AAAI ICWSM 2017, Montreal, 2017, pp. 512–515.
  • Samghabadi et al. [2017] N. Samghabadi, S. Maharjan, A. Sprague, R. Diaz-Sprague, T. Solorio, Detecting Nastiness in Social Media, in: ALW1 at ACL 2017, Vancouver, 2017, pp. 63–72.
  • Wulczyn et al. [2017] E. Wulczyn, N. Thain, L. Dixon, Ex Machina: Personal Attacks Seen at Scale, in: WWW 2017, Perth, 2017, pp. 1391–1399.
  • Waseem and Hovy [2016] Z. Waseem, D. Hovy, Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter, in: NAACL-HLT 2016, California, 2016, pp. 88–93.
  • Djuric et al. [2015] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate Speech Detection with Comment Embeddings, in: WWW 2015, Florence, Italy, 2015, pp. 29–30.
  • Risch and Krestel [2018] J. Risch, R. Krestel, Delete or not Delete? Semi-Automatic Comment Moderation for the Newsroom, in: TRAC-1 at COLING 2018, Santa Fe, USA, 2018, pp. 166–176.
  • Badjatiya et al. [2017] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep Learning for Hate Speech Detection in Tweets, in: WWW 2017, Perth, 2017, pp. 759–760.
  • Gamback and Sikdar [2017] B. Gamback, U. Sikdar, Using Convolutional Neural Networks to Classify Hate-Speech, in: ALW1 at ACL 2017, Vancouver, 2017, pp. 85–90.
  • Park and Fung [2017] J. Park, P. Fung, One-step and Two-step Classification for Abusive Language Detection on Twitter, in: ALW1 at ACL 2017, Vancouver, 2017, pp. 41–45.
  • Pavlopoulos et al. [2017a] J. Pavlopoulos, P. Malakasiotis, I. Androutsopoulos, Deep Learning for User Comment Moderation, in: ALW1 at ACL 2017, Vancouver, 2017a, pp. 25–35.
  • Mehdad and Tetreault [2016] Y. Mehdad, J. Tetreault, Do Characters Abuse More Than Words?, in: SIGDIAL 2016, Los Angeles, 2016, pp. 299–303.
  • Baruah et al. [2019] A. Baruah, F. A. Barbhuiya, K. Dey, ABARUAH at semeval-2019 task 5 : Bi-directional LSTM for hate speech detection, in: Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, Minneapolis, MN, USA, June 6-7, 2019, 2019, pp. 371–376.
  • Santosh and Aravind [2019] T. Y. S. S. Santosh, K. V. S. Aravind, Hate speech detection in hindi-english code-mixed social media text, in: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, COMAD/CODS 2019, Kolkata, India, January 3-5, 2019, ACM, 2019, pp. 310–313.
  • Bohra et al. [2018] A. Bohra, D. Vijay, V. Singh, S. S. Akhtar, M. Shrivastava, A dataset of hindi-english code-mixed social media text for hate speech detection, in: Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, PEOPLES@NAACL-HLT 2018, New Orleans, Louisiana, USA, June 6, 2018, Association for Computational Linguistics, 2018, pp. 36–41.
  • Kamble and Joshi [2018] S. Kamble, A. Joshi, Hate speech detection from code-mixed hindi-english tweets using deep learning models, CoRR abs/1811.05145 (2018).
  • Sreelakshmi et al. [2020] K. Sreelakshmi, B. Premjith, K. Soman, Detection of hate speech text in hindi-english code-mixed data, in: Proceedings of the 3rd International Conference on Computing and Network Communications, 2019, India, Dec 18–21, 2019, Elsevier B.V., 2020, pp. 737–744.
  • Mathur et al. [2018] P. Mathur, R. R. Shah, R. Sawhney, D. Mahata, Detecting offensive tweets in hindi-english code-switched language, in: Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, SocialNLP@ACL 2018, Melbourne, Australia, July 20, 2018, Association for Computational Linguistics, 2018, pp. 18–26.
  • Baruah et al. [2019] A. Baruah, F. A. Barbhuiya, K. Dey, IIITG-ADBU at HASOC 2019: Automated hate speech and offensive content detection in english and code-mixed hindi text, in: Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, volume 2517 of CEUR Workshop Proceedings, 2019, pp. 229–236.
  • Baruah et al. [2020] A. Baruah, K. A. Das, F. A. Barbhuiya, K. Dey, Aggression identification in english, hindi and bangla text using bert, roberta and SVM, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, TRAC@LREC 2020, Marseille, France, May 2020, European Language Resources Association (ELRA), 2020, pp. 76–82.
  • Chakravarthi et al. [2020a] B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, J. P. McCrae, Corpus creation for sentiment analysis in code-mixed Tamil-English text, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020a, pp. 202–210.
  • Chakravarthi et al. [2020b] B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, J. P. McCrae, A sentiment analysis dataset for code-mixed Malayalam-English, in: Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), European Language Resources association, Marseille, France, 2020b, pp. 177–184.
  • Chakravarthi [2020] B. R. Chakravarthi, Leveraging orthographic information to improve machine translation of under-resourced languages, Ph.D. thesis, NUI Galway, 2020.
  • Conneau et al. [2020] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 8440–8451. URL: https://www.aclweb.org/anthology/2020.acl-main.747/.
  • Liu et al. [2019] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.