Toxicity Detection for Indic Multilingual Social Media Content

01/03/2022
by Manan Jhaveri, et al.

Toxic content is one of the most critical issues facing social media platforms today. India alone had 518 million social media users in 2020. To provide a good experience to content creators and their audience, it is crucial to flag toxic comments and the users who post them. The main challenge is identifying toxicity in low-resource Indic languages, where the same text can appear in multiple representations. Moreover, posts and comments on social media do not adhere to a particular format, grammar, or sentence structure, which makes abuse detection even more challenging for multilingual social media platforms. This paper describes the system proposed by team 'Moj Masti' using the data provided by ShareChat/Moj in the IIIT-D Multilingual Abusive Comment Identification challenge. We focus on how multilingual transformer-based pre-trained and fine-tuned models can be leveraged for code-mixed/code-switched classification tasks. Our best-performing system was an ensemble of XLM-RoBERTa and MuRIL, which achieved a mean F1 score of 0.9 on the test data/leaderboard. We also observed an increase in performance from adding transliterated data. Furthermore, weak metadata, ensembling, and some post-processing techniques boosted the performance of our system, placing us 1st on the leaderboard.
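
As a rough illustration of the ensembling idea mentioned in the abstract, the sketch below averages per-comment toxicity probabilities from two classifiers and scores the result with F1. It is not the authors' implementation: the probability lists, the equal weighting, the 0.5 threshold, and the function names (`ensemble_probs`, `to_labels`, `f1_score`) are all assumptions made for illustration, and the abstract does not say how the two models were actually combined.

```python
from typing import List, Sequence


def ensemble_probs(p_a: Sequence[float], p_b: Sequence[float],
                   weight_a: float = 0.5) -> List[float]:
    """Weighted average of two models' positive-class probabilities.

    Assumption: a simple weighted mean; the paper may combine models differently.
    """
    if len(p_a) != len(p_b):
        raise ValueError("both models must score the same comments")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(p_a, p_b)]


def to_labels(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Map probabilities to binary labels (1 = abusive); threshold is assumed."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores for four comments from two models (made-up numbers).
xlmr_probs = [0.9, 0.2, 0.6, 0.4]
muril_probs = [0.7, 0.4, 0.3, 0.8]
gold = [1, 0, 0, 1]

predictions = to_labels(ensemble_probs(xlmr_probs, muril_probs))
print(predictions, f1_score(gold, predictions))
```

In this toy data each model alone misclassifies one comment at the 0.5 threshold, while the averaged scores classify all four correctly, which is the usual motivation for ensembling models with complementary errors.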

research
11/18/2021

Pegasus@Dravidian-CodeMix-HASOC2021: Analyzing Social Media Content for Detection of Offensive Text

To tackle the conundrum of detecting offensive comments/posts which are ...
research
04/03/2023

Detection of Homophobia Transphobia in Dravidian Languages: Exploring Deep Learning Methods

The increase in abusive content on online social media platforms is impa...
research
10/25/2021

Battling Hateful Content in Indic Languages HASOC '21

The extensive rise in consumption of online social media (OSMs) by a lar...
research
02/18/2021

MUDES: Multilingual Detection of Offensive Spans

The interest in offensive content identification in social media has gro...
research
03/22/2019

Tending Unmarked Graves: Classification of Post-mortem Content on Social Media

User-generated content is central to social computing scholarship. Howev...
research
06/21/2022

muBoost: An Effective Method for Solving Indic Multilingual Text Classification Problem

Text Classification is an integral part of many Natural Language Process...
research
01/15/2021

Walk in Wild: An Ensemble Approach for Hostility Detection in Hindi Posts

As the reach of the internet increases, pejorative terms started floodin...
