BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts

06/01/2022
by Nauros Romim, et al.

Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Given the massive amount of user-generated content on these sites, modern machine learning techniques have proven to be a feasible and cost-effective way to tackle this problem. However, training generalizable models requires linguistically diverse datasets that cover the different social contexts in which offensive language is typically used. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce BD-SHS, a large manually labeled dataset that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which to the best of our knowledge is the first of its kind in Bangla HS. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS dataset. We present benchmark results on our dataset by training different NLP models, the best of which achieves an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively on 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection than other pre-trained embeddings. Our dataset and all accompanying code are publicly available at github.com/naurosromim/hate-speech-dataset-for-Bengali-social-media
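For readers unfamiliar with the reported metric, the following is a minimal sketch of how a binary F1-score can be computed from gold labels and model predictions. The function name `f1_score_binary`, the label encoding (1 for hate speech, 0 otherwise), and the toy labels are assumptions made for illustration. The abstract does not state which averaging variant (binary, macro, or weighted) the authors report, so this should not be read as their evaluation code.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class.

    Assumption: labels are integers and `positive` marks the hate speech class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up labels (not taken from BD-SHS).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
score = f1_score_binary(gold, pred)
print(f"F1 = {score:.3f}")
```

Because F1 ignores true negatives, it is a common choice for HS detection, where the benign class can dominate and plain accuracy would look deceptively high.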

research
03/18/2021

Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese

The rise of social media has led to the increasing of comments on online...
research
12/03/2021

HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla

In this paper, we present HS-BAN, a binary class hate speech (HS) datase...
research
08/31/2016

A Dictionary-based Approach to Racism Detection in Dutch Social Media

We present a dictionary-based approach to racism detection in Dutch soci...
research
12/17/2020

Hate Speech detection in the Bengali language: A dataset and its baseline evaluation

Social media sites such as YouTube and Facebook have become an integral ...
research
08/21/2023

BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

Advances in automated detection of offensive language online, including ...
research
06/01/2022

Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data

Society needs to develop a system to detect hate and offense to build a ...
research
05/05/2020

Creating a Multimodal Dataset of Images and Text to Study Abusive Language

In order to study online hate speech, the availability of datasets conta...
