Hate Speech detection in the Bengali language: A dataset and its baseline evaluation

12/17/2020
by   Nauros Romim, et al.

Social media sites such as YouTube and Facebook have become an integral part of everyday life, and in the last few years hate speech in social media comment sections has increased rapidly. Detecting hate speech on social media websites faces a variety of challenges, including small, imbalanced datasets, finding an appropriate model, and choosing a feature analysis method. Furthermore, this problem is more severe for the Bengali-speaking community due to the lack of gold-standard labelled datasets. This paper presents a new dataset of 30,000 user comments tagged by crowdsourcing and verified by experts. All the comments were collected from YouTube and Facebook comment sections and classified into seven categories: sports, entertainment, religion, politics, crime, celebrity, and TikTok meme. Each comment was annotated three times by a pool of 50 annotators, and the majority vote was taken as the final annotation. We conducted baseline experiments and evaluated several deep learning models with pre-trained Bengali word embeddings such as Word2Vec, FastText, and BengFastText on this dataset to facilitate future research. The experiments showed that although all deep learning models performed well, SVM achieved the best result with 87.5% accuracy. The dataset is made available and accessible to facilitate further research in the field of Bengali hate speech detection.
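The majority-vote step described in the abstract can be illustrated with a minimal sketch. The label names (`hate`, `not_hate`), the comment identifiers, and the function name `majority_label` are hypothetical and chosen only for illustration; the abstract does not specify the authors' label scheme or tooling.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among the votes, or None on a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes).most_common()
    # A tie at the top means there is no majority label.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical example data: three annotations per comment.
annotations = {
    "comment_1": ["hate", "hate", "not_hate"],
    "comment_2": ["not_hate", "not_hate", "not_hate"],
}

final_labels = {cid: majority_label(votes) for cid, votes in annotations.items()}
```

With two possible labels and three votes per comment a tie cannot occur, so the `None` branch only matters if the number of votes or labels changes.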


