Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

09/08/2021
by Saurabh Gaikwad, et al.

The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments with state-of-the-art cross-lingual transformers using existing data in Bengali, English, and Hindi.


