BD-SHS: A Benchmark Dataset for Learning to Detect Online Bangla Hate Speech in Different Social Contexts

by Nauros Romim, et al.

Social media platforms and online streaming services have spawned a new breed of Hate Speech (HS). Due to the massive amount of user-generated content on these sites, modern machine learning techniques are found to be feasible and cost-effective to tackle this problem. However, linguistically diverse datasets covering different social contexts in which offensive language is typically used are required to train generalizable models. In this paper, we identify the shortcomings of existing Bangla HS datasets and introduce a large manually labeled dataset, BD-SHS, that includes HS in different social contexts. The labeling criteria were prepared following a hierarchical annotation process, which is the first of its kind in Bangla HS to the best of our knowledge. The dataset includes more than 50,200 offensive comments crawled from online social networking sites and is at least 60% larger than any existing Bangla HS datasets. We present the benchmark result of our dataset by training different NLP models, with the best one achieving an F1-score of 91.0%. In our experiments, we found that a word embedding trained exclusively using 1.47 million comments from social media and streaming sites consistently resulted in better modeling of HS detection in comparison to other pre-trained embeddings. Our dataset and all accompanying code are publicly available at


Constructive and Toxic Speech Detection for Open-domain Social Media Comments in Vietnamese

The rise of social media has led to the increasing of comments on online...

HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla

In this paper, we present HS-BAN, a binary class hate speech (HS) datase...

A Dictionary-based Approach to Racism Detection in Dutch Social Media

We present a dictionary-based approach to racism detection in Dutch soci...

Hate Speech detection in the Bengali language: A dataset and its baseline evaluation

Social media sites such as YouTube and Facebook have become an integral ...

Vietnamese Hate and Offensive Detection using PhoBERT-CNN and Social Media Streaming Data

Society needs to develop a system to detect hate and offense to build a ...

Creating a Multimodal Dataset of Images and Text to Study Abusive Language

In order to study online hate speech, the availability of datasets conta...

Abusive Language Detection in Heterogeneous Contexts: Dataset Collection and the Role of Supervised Attention

Abusive language is a massive problem in online social platforms. Existi...