HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla

12/03/2021
by Nauros Romim, et al.

In this paper, we present HS-BAN, a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, including 40.17% [...] and detailed annotation guideline was followed to reduce human annotation bias. The HS dataset was also preprocessed linguistically to extract the different types of slang that people currently write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists, which are included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection in the Bangla language. Our experimental results show that existing word embedding models trained on informal text perform better than those trained on formal text. Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved 86.78% [...] public use.


