A Large-Scale Semi-Supervised Dataset for Offensive Language Identification

04/29/2020
by Sara Rosenthal, et al.

Offensive language is a major problem on social media, and it has prompted an abundance of research on detecting content such as hate speech, cyberbullying, and cyber-aggression. There have been several attempts to consolidate and categorize these efforts. Recently, the OLID dataset used at SemEval-2019 proposed a hierarchical three-level annotation taxonomy that covers different types of offensive language as well as important information such as the target of such content. This categorization provides meaningful information for understanding offensive language. However, OLID is limited in size, especially for some of the low-level categories, which include only a few hundred instances, making it challenging to train robust deep learning models. Here, we address this limitation by creating the largest available dataset for this task, SOLID. SOLID contains over nine million English tweets labeled in a semi-supervised manner. We further demonstrate experimentally that using SOLID along with OLID yields improved performance on the OLID test set for two different models, especially for the lower levels of the taxonomy. Finally, we analyze the models' performance on easy and hard examples of offensive language using data annotated in a semi-supervised way.


