SOLD: Sinhala Offensive Language Dataset

12/01/2022
by Tharindu Ranasinghe, et al.

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic cover English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple experiments on this dataset. SOLD is a manually annotated dataset containing 10,000 posts from Twitter, each annotated as offensive or not offensive at both the sentence level and the token level; the token-level annotations improve the explainability of the ML models. SOLD is the first large publicly available offensive language dataset compiled for Sinhala. We also introduce SemiSOLD, a larger dataset containing more than 145,000 Sinhala tweets, annotated following a semi-supervised approach.
