Toxic comments are rude, insulting, or offensive remarks that can severely affect a person’s mental health; in severe cases they amount to cyber-bullying. Such comments often instill insecurities in young people, leading to low self-esteem and even suicidal thoughts. This abusive environment also dissuades people from expressing their opinions in comment sections, which are supposed to be safe and supportive spaces for productive discussion. Young children may learn the mistaken idea that adopting profane language will win them attention and social acceptance. Hence, the task of flagging inappropriate content on social media is extremely important for creating a healthy social space.
In this paper, we discuss how we leveraged AI to solve this task. The task is challenging because of the lack of useful data for training powerful models. We used state-of-the-art transformer models pretrained with a masked language modeling (MLM) objective and fine-tuned them with different architectures. Furthermore, we discuss some creative post-processing techniques that help enhance the scores.
II Task Overview
II-A Problem Formulation
IIIT-D Multilingual Abusive Comment Identification is an innovative challenge aimed at combating abusive comments on Moj, one of India’s largest social media platforms, in multiple regional Indian languages. This paper addresses a novel research problem focused on improving the social space for members of the social media community, with a particular focus on abusive comment identification in low-resource Indian languages.
There are multiple challenges in dealing with multilingual text data. The main ones in this task include:
- There is a lack of lexical resources, literature, and grammatical documentation, despite these languages having millions of native speakers. Building NLP algorithms with such limited basic resources is highly challenging.
- Not all Indic languages belong to the same linguistic family: (1) the Indo-Aryan family includes Hindi, Marathi, Gujarati, Bengali, etc.; (2) the Dravidian family consists of Tamil, Telugu, Kannada, and Malayalam; (3) tribes of central India speak Austric languages; (4) Sino-Tibetan languages are spoken by tribes of north-eastern India.
- Posts and comments on social media do not adhere to a particular format, grammar, or sentence structure. They are short, incomplete, and filled with slang, emoticons, and abbreviations.
II-B Data Description
In this challenge, the data was split into two parts: training data with 665k+ samples and test data with 74.3k+ samples. Key novelties of this dataset include:
- Natural-language comment text in 13 Indic languages, labeled as Abusive (312k samples) or Not Abusive (352k samples). Fig. 1 shows that Hindi is the most common language.
- The data is human-annotated.
- Metadata-based explicit feedback: post identification number, report count of the comment, report count of the post, like count of the comment, and like count of the post.
III Model Building Approach and Evaluation
Following the introduction of the attention mechanism [vaswani2017attention] and the groundbreaking BERT model [devlin2019bert], the NLP space has been revolutionized, and state-of-the-art transformers have become the go-to option for almost all NLP tasks. For this multilingual task, we chose the following models:
XLM-RoBERTa - a transformer-based masked language model trained on 100 languages, using more than two terabytes of filtered CommonCrawl data [conneau2020unsupervised].
MuRIL - a multilingual language model built specifically for Indic languages, trained on significantly large amounts of Indian text corpora, including transliterated and translated document pairs that serve as supervised cross-lingual signals during training [khanuja2021muril].
mBERT - Multilingual BERT is a transformer-based language model trained on raw Wikipedia text of 104 languages. The model is contextual, and its training requires no supervision; no alignment between the languages is performed [k2020crosslingual].
IndicBERT - a model trained on IndicCorp and evaluated on IndicGLUE. Similar to mBERT, a single model is trained on all Indian languages with the hope of exploiting the relatedness amongst them [kakwani-etal-2020-indicnlpsuite].
RemBERT - a model pretrained on 110 languages using a masked language modeling (MLM) objective. Its main difference from mBERT is that the input and output embeddings are not tied; instead, RemBERT uses small input embeddings and large output embeddings [DBLP:conf/iclr/ChungFTJR21].
The evaluation metric for this task is the mean F1-score.
III-A MLM Pre-training
Pre-training with a masked language model (MLM) objective [song2019mass] has been one of the most popular and successful approaches for downstream tasks, mostly due to its ability to model higher-order word co-occurrence statistics [sinha2021masked].
Since MuRIL is already a BERT encoder pre-trained on Indic languages with the MLM objective, we performed MLM pre-training only on XLM-RoBERTa (3 epochs for the Base variant and 2 epochs for the Large variant). We held out 10% of the test data to evaluate the pre-training step. Table I summarizes this setup. With the same fine-tuning settings in both cases, the average test F1 score on the downstream task was approximately 0.87 without MLM and 0.88 with MLM. This led us to conclude that MLM on the given data certainly helps the downstream task.
| Info | XLM-RoBERTa Base | XLM-RoBERTa Large |
| --- | --- | --- |
| Accelerator | Tesla P100 16GB | Tesla T4 16GB |
| Time Taken (hr) | 11.26 | 35.3 |
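To illustrate the MLM objective described above, the sketch below randomly masks tokens in a comment so the model must recover them. It is a minimal pure-Python illustration, not the actual training code; the 15% masking rate is the standard BERT-style default, assumed rather than stated in the paper.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15  # standard BERT-style masking rate (assumed)

def mask_tokens(tokens, rng):
    """Replace ~15% of tokens with [MASK]; masked positions become targets."""
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK)
            labels.append(tok)    # target the model must predict
        else:
            masked.append(tok)
            labels.append(None)   # position not scored by the MLM loss
    return masked, labels

rng = random.Random(0)
masked, labels = mask_tokens("this comment is perfectly friendly".split(), rng)
```

In the real pipeline, a subword tokenizer and dynamic per-batch masking would replace this word-level sketch.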
III-B Data Augmentation
We performed data augmentation by adding transliterated data. We first removed emojis from the text and then used uroman (https://github.com/isi-nlp/uroman) to generate 219,114 additional transliterated samples. We also transliterated the test dataset and made the original plus transliterated dataset available on Kaggle (https://www.kaggle.com/harveenchadha/iitdtransliterated).
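The emoji-stripping step can be sketched with a Unicode-range regex, as below. The ranges are a rough, non-exhaustive filter, and the exact cleaning rules used in the paper are not specified; the uroman invocation in the comment is likewise an assumption about how the external tool would be called.

```python
import re

# Common emoji/pictograph blocks; a rough filter, not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # misc symbols, pictographs, emoticons
    "\U00002600-\U000027BF"   # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"   # regional indicator (flag) symbols
    "]+"
)

def strip_emojis(text: str) -> str:
    """Drop emoji characters, then trim leftover whitespace."""
    return EMOJI_RE.sub("", text).strip()

clean = strip_emojis("bura mat bolo 😡🔥")
# Each cleaned comment would then be romanized with uroman, e.g. (hypothetical):
# subprocess.run(["uroman.pl"], input=clean, capture_output=True, text=True)
```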
III-C Model Training
We mainly used two different model architectures: in the first, the original architecture, we took the transformer outputs and computed the probabilities directly; in the second, we added a custom attention head on top of the transformer output before calculating the probabilities.
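A custom attention head of this kind typically scores each token's hidden state, softmaxes the scores, and pools the states into one sentence vector. The sketch below is a dependency-free illustration under that assumption; the scoring vector `w` would be learned in practice, and the paper does not specify the exact head design.

```python
import math

def attention_pool(hidden_states, w):
    """Soft attention over token vectors: score each token against a
    learned vector w, softmax the scores, return the weighted sum."""
    scores = [sum(h_i * w_i for h_i, w_i in zip(h, w)) for h in hidden_states]
    m = max(scores)                          # subtract max for a stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(weights[t] * hidden_states[t][d]
                  for t in range(len(hidden_states)))
              for d in range(dim)]
    return pooled, weights

# Toy example: 3 tokens with 2-dim hidden states and a hypothetical w.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(H, w=[0.5, 0.5])
```

The pooled vector then feeds a linear classification layer that produces the abusive/not-abusive probability.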
We also experimented with different truncation lengths for the input text. We tried 64, 96, 128, and 256 tokens, and finally decided to go ahead with 128.
From Table II, we found that MLM and transliterated data both had a positive impact on performance; a custom attention head boosted the score as well. XLM-RoBERTa was the best performer, followed by MuRIL. We also observed a slight difference between GPU and TPU accelerators: models trained on GPU tended to perform slightly better, but since experimentation time on TPU was lower, we used TPUs for most experiments. We used wandb [wandb] to track most of our experiments, and we are releasing our experimentation logs (https://wandb.ai/harveenchadha/iitd-private).
| Id | Model | Data | Accelerator | Validation Strategy | CV | Test Score |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | XLM-RoBERTa Base | Original | TPU v3-8 | 10% split | 0.854 | 0.87674 |
| 2 | XLM-RoBERTa Large | Original | TPU v3-8 | 10% split | 0.8601 | 0.8819 |
| 3 | XLM-RoBERTa Base | Original + Transliterated | TPU v3-8 | 10% split | 0.86 | 0.87989 |
| 4 | XLM-RoBERTa Large | Original + Transliterated | TPU v3-8 | 10% split | 0.8626 | 0.88238 |
| 5 | XLM-RoBERTa Large + MLM | Original | TPU v3-8 | 10% split | 0.8631 | 0.88291 |
| 6 | XLM-RoBERTa Large + MLM | Original + Transliterated | TPU v3-8 | 10% split | 0.863 | 0.88316 |
| 7 | XLM-RoBERTa Base + MLM + Attention head | Original | Tesla P100 16GB | 4-Fold | 0.866075 | 0.8814 |
| 8 | XLM-RoBERTa Large + MLM + Attention head | Original + Transliterated | RTX5000 24GB | 10% split | 0.8669 | 0.88378 |
| 9 | MuRIL Base | Original | TPU v3-8 | 10% split | 0.8520 | 0.87539 |
| 10 | MuRIL Large | Original | TPU v3-8 | 10% split | 0.8546 | 0.87821 |
| 11 | XLM-R Base + MLM + last 5 hidden states average | Original | Tesla P100 16GB | 4-Fold | 0.8662 | 0.88149 |
| 12 | XLM-RoBERTa + MLM | Original + Transliterated | TPU v3-8 | 4-Fold | 0.86825 | 0.8872 |
| 13 | XLM-RoBERTa Large | Original | TPU v3-8 | 10-Fold | 0.85759 | 0.8829 |
| 14 | RemBERT (Truncation Length = 64) | Original | Tesla P100 | 10% split | 0.8529 | 0.877 |
| 15 | RemBERT | Original | TPU v3-8 | 10% split | 0.8397 | 0.8678 |
| 16 | mBERT | Original | TPU v3-8 | 10% split | 0.8474 | 0.8724 |
| 17 | mBERT (Truncation Length = 64) | Original | Tesla P100 | 4-Fold | 0.8435 | 0.86327 |
| 18 | IndicBERT | Original | TPU v3-8 | 10% split | 0.8306 | 0.8456 |
As mentioned above, we also experimented with other transformer models but did not achieve satisfactory results. RemBERT has outperformed XLM-RoBERTa on various tasks but underperformed here, and mBERT also gave meagre results. Surprisingly, IndicBERT was the worst performer for this task.
Note: we transliterated the test data as well, so whenever a model was trained with transliterated data, inference was run on both the original text and the transliterated text, and the probabilities were combined in a 7:3 ratio to form the final prediction.
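The 7:3 combination above is a simple weighted average of the two inference passes; a minimal sketch:

```python
def blend(p_original, p_transliterated, w=0.7):
    """Weighted average of per-sample abusive probabilities from the
    original-text pass and the transliterated-text pass (7:3 ratio)."""
    return [w * a + (1 - w) * b for a, b in zip(p_original, p_transliterated)]

# Two hypothetical comments scored by both passes.
final = blend([0.9, 0.2], [0.7, 0.4])
```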
III-D Ensembling
We wanted to select diverse models so that each model makes different mistakes; by combining the learning of diverse models, we can get a better result. Hence, after a few experiments, we went forward with Ids 2, 6, 8, 9, 12, and 13 from Table II. Giving equal weights to all models, our test score was 0.88756. After raising the probability inference threshold to 0.55, the score increased to 0.88827.
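The equal-weight ensemble with the raised threshold can be sketched as follows; the per-model probabilities here are hypothetical.

```python
def ensemble_predict(model_probs, threshold=0.55):
    """Average per-model probabilities with equal weights, then label a
    comment abusive when the mean meets the threshold."""
    n_models = len(model_probs)
    n_samples = len(model_probs[0])
    mean = [sum(p[i] for p in model_probs) / n_models for i in range(n_samples)]
    return [int(m >= threshold) for m in mean]

# Three hypothetical models scoring two comments.
labels = ensemble_predict([[0.8, 0.4], [0.7, 0.6], [0.9, 0.5]])
```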
III-E Using Metadata
As mentioned previously, we were also provided with weak metadata, which we utilized in the following ways:
III-E1 Feature Engineering
Because samples were captured at different timestamps, the same post could appear with different report and like counts across samples.
For example, consider a post with post_index = 1 that has one comment. When this comment was captured for the dataset, the post had 5 likes and 10 reports. A few minutes later, when a new comment was uploaded and captured, the counts had changed to 10 likes and 20 reports. There are now two samples with post_index = 1 in the dataset, but with different recorded report and like counts due to the variation in timestamp.
To deal with this, we created mean-aggregated and maximum-aggregated features for the report count and like count of both posts and comments.
We also added the character length of the comment, the token length of the comment, and the probabilities from the ensemble as features. We used this data to train boosting classifiers.
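The aggregation step can be sketched as a group-by over post_index; field names here are illustrative, not the dataset's actual column names.

```python
from collections import defaultdict

def aggregate_post_features(rows):
    """Per post_index, take the mean and max of the timestamp-varying
    like counts so every comment on a post sees the same feature value."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["post_index"]].append(r["post_likes"])
    feats = {}
    for post, likes in groups.items():
        feats[post] = {"likes_mean": sum(likes) / len(likes),
                       "likes_max": max(likes)}
    return feats

# Two snapshots of the same post, as in the example above (5 then 10 likes).
rows = [{"post_index": 1, "post_likes": 5},
        {"post_index": 1, "post_likes": 10}]
feats = aggregate_post_features(rows)
```

The same aggregation would be repeated for post report counts and for comment-level likes and reports.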
As per our findings, increasing the threshold gave us a boost, so we decided to tune the threshold for each language. After experimenting with several thresholds, we found that the values in Table III gave the best score.
All the threshold values we tested were multiples of 0.05, so some edge cases might still be misclassified; to account for that, we increased the predicted probabilities by 0.01.
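Language-wise thresholding plus the 0.01 nudge amounts to the rule below. The per-language threshold values shown are hypothetical placeholders; the tuned values are those in Table III.

```python
# Hypothetical per-language thresholds; the tuned values live in Table III.
THRESHOLDS = {"Hindi": 0.60, "Tamil": 0.55}
DEFAULT_THRESHOLD = 0.55
EPS = 0.01  # nudge probabilities up to catch borderline cases

def predict(prob, language):
    """Label a comment abusive if its nudged probability meets the
    language-specific threshold."""
    thr = THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    return int(prob + EPS >= thr)
```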
Table IV shows the performance of the different boosting models. XGBoost and CatBoost gave the highest scores. We combined the probabilities from these two models in a 6:4 XGBoost-to-CatBoost ratio and achieved the final best score of 0.90005.
| Model | CV Average | Test Score |
| --- | --- | --- |
IV Conclusion
In recent times, social media has become a hub for information exchange and entertainment. The results from this paper can be used to build systems that flag toxicity and provide users with a healthier experience. Surprisingly, MuRIL, which is trained specifically on Indian text, did not perform well individually (in both Large and Base variants), but it gave us an edge in the ensemble because it added to model diversity. For the same reason, we added non-pretrained models to the ensemble. The metadata by itself was very weak, but combining it with the transformer output probabilities improved our predictions substantially. Even small tweaks, such as increasing the probability by 0.01 and using language-wise inference thresholds, had a surprisingly large impact on building a better system.
Acknowledgments
We would like to thank Moj and ShareChat for providing the data and organizing the Multilingual Abusive Comment Identification Challenge along with IIIT-Delhi. We would also like to thank Kaggle for providing access to TPUs and jarvislabs.ai for providing credits to train the final model.