
HS-BAN: A Benchmark Dataset of Social Media Comments for Hate Speech Detection in Bangla

In this paper, we present HS-BAN, a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech. A detailed annotation guideline was followed to reduce human annotation bias. The HS dataset was also preprocessed linguistically to extract the different types of slang that people currently write using symbols, acronyms, or alternative spellings. These slang words were further categorized into traditional and non-traditional slang lists and are included in the results of this paper. We explored traditional linguistic features and neural network-based methods to develop a benchmark system for hate speech detection in the Bangla language. Our experimental results show that existing word embedding models trained on informal text perform better than those trained on formal text. Our benchmark shows that a Bi-LSTM model on top of the FastText informal word embedding achieved an 86.78% F1-score. The dataset will be made available for public use.





1 Introduction

Hate speech (HS) has spread rapidly in recent years through various social media sites. HS detection in user comment sections remains a difficult challenge because of the lack of formal language syntax, spelling mistakes, and the use of various slang and non-standard acronyms [nobata2016abusive]. Researchers have started working in this challenging domain, but much of their effort has concentrated on the English language [schmidt2017survey]. There are almost 46 million Facebook and 29 million YouTube users in Bangladesh, yet there is a severe lack of large, linguistically diverse Bangla HS datasets.

Existing datasets for Bangla HS have some significant drawbacks. First, as Table A.1 of Appendix A shows, the datasets are not large enough: most of them contain fewer than 10k sentences. Second, most of the data in these datasets comes from one or two domains, which makes them domain-dependent [Ishmam2019_FB_pages], [Emon2019], [Chakraborty_2019]. Third, annotating HS is an inherently complex and challenging task [nobata2016abusive]. [waseem2016you] showed how annotators' bias on HS can influence the annotation process and subsequently affect the classification task. To combat this, [de2018hate] created a stringent and detailed annotation guideline, providing specific points on what constitutes HS and what does not. Although [romim2021hate] followed a guideline to prepare a small Bangla HS dataset, it lacks inter-annotator agreement scores, which makes it difficult to judge annotation quality. To the best of our knowledge, only [karim2020deephateexplainer] reported a Cohen's kappa score in their paper.

In this paper, we present HS-BAN, the largest binary-class Bangla HS dataset, comprising more than 50,000 comments crawled from Facebook and YouTube. We present a few examples from the dataset in Table 1. To ensure cross-domain generalization, we collected comments from seven categories, as presented in Section 2. A strict annotation guideline was followed to reduce human annotation bias when labeling the dataset, achieving an inter-annotator agreement score of 0.658. In Section 3, we present the different natural language processing approaches we used to extract linguistic features from noisy social media text. Finally, in Section 4, we present the benchmarking outcome for hate speech classification, showing that a Bi-LSTM model on top of the FastText informal word embedding achieved an 86.78% F1-score. The dataset and source code will be made publicly available to foster future research.

2 Dataset

Table 1: Examples from the annotated HS-BAN dataset

Data Collection:

We collected more than 100,000 comments from different YouTube channels and Facebook pages in seven categories: sports, entertainment, crime, politics, religion, celebrity, and miscellaneous. For each category except miscellaneous, we compiled a list of controversial events that happened recently (2017-2020) in Bangladesh and fall under that category. Public comments on those issues were then extracted using an open-source tool called Facepager. For the miscellaneous category, we searched YouTube for videos related to Bangla TikTok, roasting, or similar content, as the comment sections of these videos tend to be very toxic. We removed duplicate and highly similar comments (Jaccard index > 0.8) to reduce repetitiveness and ensure a diverse vocabulary.
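The near-duplicate filter can be illustrated with a minimal sketch. The source gives only the threshold (Jaccard index > 0.8); the whitespace tokenization, the word-level token sets, the greedy keep-first strategy, and the function names below are assumptions made for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two token sets."""
    if not a and not b:
        return 1.0  # two empty comments are treated as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(comments, threshold=0.8):
    """Keep a comment only if its Jaccard index with every kept comment is <= threshold."""
    kept, kept_tokens = [], []
    for text in comments:
        tokens = set(text.split())  # assumption: whitespace tokenization
        if all(jaccard(tokens, seen) <= threshold for seen in kept_tokens):
            kept.append(text)
            kept_tokens.append(tokens)
    return kept


# Placeholder tokens stand in for real comments.
comments = [
    "a b c d e f",
    "a b c d e f",    # exact duplicate of the first comment
    "a b c d e f g",  # shares 6 of 7 distinct tokens with the first comment
    "a b c x y z",    # shares 3 of 9 distinct tokens with the first comment
]
unique = drop_near_duplicates(comments)
```

This pairwise comparison is quadratic in the number of comments; for a crawl of more than 100,000 comments, an approximate method such as MinHash would likely be needed, which the source does not discuss.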


We prepared a detailed annotation guideline based on the community guidelines of Facebook and YouTube (Appendix B). We worked with 50 annotators, all undergraduate students aged 20 to 25. They were aware of the sensitive nature of the task and volunteered willingly for this research. Three annotators tagged each comment, and the majority vote decided the final label. The inter-annotator agreement score [fleiss1971measuring] of 0.658 indicates that our dataset has moderate agreement among annotators.
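As a rough illustration of the labeling step, the sketch below takes a majority vote over three annotations per comment and computes Fleiss' kappa on toy data. The label strings, the toy ratings, and the function names are assumptions; the code does not reproduce the reported score of 0.658.

```python
from collections import Counter


def majority_label(votes):
    """Label chosen by most annotators (three votes over two labels cannot tie)."""
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(ratings, categories=("HS", "NHS")):
    """Fleiss' kappa for items that were each rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of annotators who assigned category j to item i
    counts = [[votes.count(c) for c in categories] for votes in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # every rating falls in one category
    return (p_bar - p_exp) / (1 - p_exp)


# Toy ratings: one inner list per comment, one entry per annotator.
ratings = [
    ["HS", "HS", "HS"],
    ["HS", "HS", "NHS"],
    ["NHS", "NHS", "NHS"],
    ["NHS", "NHS", "HS"],
]
labels = [majority_label(votes) for votes in ratings]
kappa = fleiss_kappa(ratings)
```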

Dataset Statistics:

Our dataset has 50,314 comments in total; 20,209 of them are HS and 30,105 are Not-HS (NHS), which implies that our dataset is moderately imbalanced. More detailed statistics are presented in Table 2. Each category is also moderately imbalanced, especially politics and celebrity, whereas crime has the highest hate speech percentage. Looking at the average word count (AWC) per category, the celebrity category has the highest AWC per sentence with the largest standard deviation, while miscellaneous has the lowest.

Category         Total    HS%     AWC    Vocabulary
Sports           5,937    40.02          16,299
Entertainment    6,843    41.31          16,915
Crime            4,969    43.71          15,272
Religion         4,978    38.31          13,717
Politics         2,665    33.06          9,059
Celebrity        2,394    30.49          14,220
Miscellaneous    22,528   41.35          33,646
Total            50,314   40.17          72,660
Table 2: Category-wise comment distribution

3 Experimental Setup

We removed all hashtags, numbers, and non-Bangla words. However, we kept emoji and punctuation (e&p) to test the claim of [nobata2016abusive] that e&p can be good indicators of HS. F1 score and Matthews correlation coefficient (MCC) were used to evaluate all models. We split the dataset into train (80%) and test (20%) sets using stratified sampling so that each category in both sets contains the same ratio of hate to not-hate comments. The training set was further divided into train (80%) and development (20%) sets for feature selection and hyper-parameter tuning.
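The two-stage split can be sketched as below. This is a standard-library illustration, not the authors' code: stratifying on the (category, label) pair, the record layout, the seeds, and the function name are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, held_out_fraction=0.2, seed=0):
    """Split rows so every (category, label) stratum keeps the same held-out fraction."""
    strata = defaultdict(list)
    for row in rows:
        strata[(row["category"], row["label"])].append(row)
    rng = random.Random(seed)
    main, held_out = [], []
    for key in sorted(strata):  # sorted keys keep the result reproducible
        group = strata[key]
        rng.shuffle(group)
        cut = round(len(group) * held_out_fraction)
        held_out.extend(group[:cut])
        main.extend(group[cut:])
    return main, held_out


# Toy data: two categories, each with 40 HS and 60 NHS comments.
rows = [
    {"text": f"comment {i}", "category": cat, "label": label}
    for cat in ("sports", "crime")
    for label, n in (("HS", 40), ("NHS", 60))
    for i in range(n)
]
train_full, test = stratified_split(rows)             # 80% / 20%
train, dev = stratified_split(train_full, seed=1)     # 80% / 20% of the training part
```

With scikit-learn, `train_test_split(..., stratify=...)` is the usual route to the same effect; the source does not say which tool was used.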

Machine Learning Models:

We trained a linear Support Vector Machine (SVM) with Term Frequency-Inverse Document Frequency (TF-IDF) weighted scores as features. The regularization parameter C, the penalty, and the loss were tuned to find the best hyperparameter combination. For the deep learning models, we tuned a Convolutional Neural Network (CNN) to get the best F1 score on the development set. We also implemented a Bi-directional Long Short-Term Memory (Bi-LSTM) network. Both networks follow previously described architectures.


Word Embedding Models:

We experimented with BengFastText (BFT) [Karim2020], pretrained on 250 million Bangla articles, and FastText [grave2018learning], a multilingual model pretrained on 157 languages. Both are pretrained on formal texts and are therefore denoted as formal embeddings. Additionally, we collected 1.47 million Bangla comments from Facebook and YouTube in eight categories: education, entertainment, health, influencer, religion, politics, sports, and technology. Four word-embedding models were then trained on this corpus: two word2vec [mikolov2013distributed] models, denoted W2V(SG) (Word2Vec skip-gram) and W2V(CBOW) (Word2Vec continuous bag of words), and two FastText models, denoted FT(SG) (FastText skip-gram) and FT(CBOW) (FastText continuous bag of words). Because they were trained on informal texts, they are denoted as informal embeddings.

Transformer Models:

We experimented with several transformer-based models: monolingual Bangla-BERT [Sagor_2020] and multilingual BERT (mBERT), cased and uncased [DBLP:journals/corr/abs-1810-04805]. The hyperparameters were set according to the literature [karim2020deephateexplainer], except for mBERT-uncased, where the learning rate was changed from its original value of 5e-5 to 3e-5 because the validation loss during the initial training phase was poor.

4 Result and Analysis

Baseline analysis:

We present our experimental findings in Table 3. We experimented with different combinations of character and word n-grams and present only the most notable ones. It becomes apparent that, in our dataset, e&p do not serve as valuable features. Unigrams with e&p scored lower than unigrams alone (82.38 vs. 82.89), and for char n-grams (1,6) and (2,6) the F1 score also drops when e&p are present. For this reason, e&p were removed in the preprocessing step for hyper-parameter tuning and the neural network experiments. The experimental results show that char n-grams perform better as features than word n-grams. Our dataset contains many spelling mistakes, so the same word can appear in multiple inconsistent spellings, a situation in which an SVM on word n-grams performs poorly [schmidt2017survey]. The performance drop in experiments combining char n-grams with word n-grams further supports this. Overall, char n-gram (1,6) is the best performing feature, with an F1 score of 85.97.
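A small sketch may help show why character n-grams tolerate spelling variation better than word features. The two romanized spellings are hypothetical stand-ins for variants of one word, and slicing Python strings by code point is an assumption (Bangla text contains combining marks, so a production tokenizer might segment differently).

```python
def char_ngrams(text, n_min=1, n_max=6):
    """Set of all character n-grams of length n_min..n_max in text."""
    return {
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def overlap(a, b):
    """Jaccard overlap of two feature sets."""
    return len(a & b) / len(a | b)


# Hypothetical spelling variants of the same word.
variant_a, variant_b = "shomossa", "somossa"

# As word unigrams the two spellings are unrelated features.
word_overlap = overlap({variant_a}, {variant_b})

# As character 1-6 grams they still share many features, such as "ossa".
char_overlap = overlap(char_ngrams(variant_a), char_ngrams(variant_b))
```

In scikit-learn, a comparable feature extractor would be `TfidfVectorizer(analyzer="char", ngram_range=(1, 6))`; the source does not state which toolkit was used.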

Formal vs Informal Embeddings:

We observed that the SVM with char (1,6)-grams as features outperformed the neural networks with FastText and BFT embeddings by a good margin. [kar-etal-2020-multiview] made similar observations on their informal sentiment analysis dataset and proposed that pre-trained models based on formal text do not perform well on informal datasets. The two FastText informal embeddings, FT(SG) and FT(CBOW), outperformed the SVM in three out of four instances; only CNN+FT(CBOW) achieved a lower F1 score. Most notably, they were trained on only 1.47 million informal sentences, a small corpus compared with those behind BFT and FastText, yet they outperformed the bigger embeddings: by roughly 1 to 3 F1 points over FastText and by more than 10 points over BFT with the same network. From the experiments, we conclude that word embeddings trained from scratch on an informal corpus perform better on informal datasets collected from social media.

FT(SG) and FT(CBOW) performed better than both W2V(SG) and W2V(CBOW), as presented in Table 3. In line with [joulin2016fasttext], FastText is generally a better choice than Word2Vec for our dataset. We can also see that W2V(SG) and FT(SG) outperformed W2V(CBOW) and FT(CBOW), respectively. From this, we conclude that word embeddings created with the SG method tend to perform better than those created with the CBOW method on our dataset. Overall, Bi-LSTM combined with FT(SG), i.e. Bi-LSTM+FT(SG), achieved the best result with an F1 score of 86.85%.

Transformer vs Other:

All BERT variants performed worse than the SVM baseline and the models built on FastText informal embeddings. A possible reason for the lower performance is that mBERT is under-tuned for informal Bangla text, as it was trained on Bangla Wikipedia. The success of informal word embeddings over formal embeddings in our experiments supports this assumption.

Group                 Model                     F1      MCC
Baseline              Unigram (U)               82.89   72.95
Baseline              U+e&p                     82.38   72.17
Baseline              Bigram (B)                63.12   49.23
Baseline              Trigram (T)               24.85   23.60
Baseline              U+B                       82.11   71.32
Baseline              U+B+T                     81.16   70.38
Baseline              Char 1-6 gram (C(1,6))    85.97   77.15
Baseline              C(1,6)+e&p                85.86   77
Baseline              C(2,6)                    85.89   77.04
Baseline              C(2,6)+e&p                85.76   76.86
Baseline              U+B+C(1,6)                84.71   75.04
Baseline              U+B+C(2,6)                84.73   75.10
Formal embedding      CNN+BFT                   71.58   53.06
Formal embedding      Bi-LSTM+BFT               76.09   61.51
Formal embedding      CNN+FastText              84      72.89
Formal embedding      Bi-LSTM+FastText          83.89   73.43
Informal embedding    CNN+W2V(CBOW)             80.03   68.42
Informal embedding    Bi-LSTM+W2V(CBOW)         80.48   69.46
Informal embedding    CNN+W2V(SG)               81.07   69.23
Informal embedding    Bi-LSTM+W2V(SG)           80.93   69.25
Informal embedding    CNN+FT(CBOW)              85.05   75.15
Informal embedding    Bi-LSTM+FT(CBOW)          86.73   77.76
Informal embedding    CNN+FT(SG)                86.58   77.33
Informal embedding    Bi-LSTM+FT(SG)            86.85   77.90
Transformer           mBERT-cased               81.66   72.81
Transformer           mBERT-uncased             82.56   73.33
Transformer           Bangla-BERT               83.68   74.63
Table 3: Benchmarking the hate speech classification models.
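
The F1 and MCC columns of Table 3 can both be derived from a binary confusion matrix. The sketch below shows one way to compute them; treating HS as the positive class for F1 is an assumption (the source does not state the averaging mode), and Table 3 appears to report both values multiplied by 100.

```python
import math


def confusion(y_true, y_pred, positive="HS"):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient, in [-1, 1]; 0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy predictions: 4 HS and 6 NHS comments, one error in each class.
y_true = ["HS"] * 4 + ["NHS"] * 6
y_pred = ["HS", "HS", "HS", "NHS", "HS", "NHS", "NHS", "NHS", "NHS", "NHS"]
counts = confusion(y_true, y_pred)
f1 = f1_score(*counts)
matthews = mcc(*counts)
```

Unlike F1, MCC also uses the true negatives, which makes it a useful second view on a moderately imbalanced dataset such as this one.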
Category         Total    HS%     F1 avg   F1 improved
Sports           4,748    40.01   84.99    87.70
Entertainment    5,468    41.32   87.09    88.51
Crime            3,973    43.71   92.27    92.77
Religion         3,978    38.31   85.92    88.67
Politics         2,137    33.04   80.69    86.39
Celebrity        1,909    30.49   78.84    83.02
Miscellaneous    18,021   41.35   85.99    88.00
Table 4: Category-wise result comparison

Category-wise Result:

Table 4 shows, for each category, the average F1 score of the five top-performing models (SVM, Bi-LSTM+FT(SG), CNN+FT(SG), Bi-LSTM+FT(CBOW), and CNN+FT(CBOW)) in the F1 avg column. The Politics and Celebrity categories achieved the lowest average F1 scores. Two factors might contribute to this: both categories have the least training data and the lowest HS%. To check this hypothesis, we dropped some not-HS (NHS) comments in each category so that each has an equal number of HS and NHS comments. The average F1 score of all five models was then recalculated and is shown in the F1 improved column. Note that the F1 scores for Politics and Celebrity improved the most. Crime, on the other hand, has the highest HS% and achieved the highest F1 avg score, and it improved the least (only 0.5) after balancing.
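The balancing step can be sketched as random undersampling of the NHS class within one category. Random selection, the record layout, and the function name are assumptions; the source says only that some NHS comments were dropped until both classes were equal in size.

```python
import random


def balance_by_undersampling(rows, seed=0):
    """Randomly drop NHS rows until HS and NHS counts match.

    If NHS is already the smaller class, nothing is dropped.
    """
    hs = [r for r in rows if r["label"] == "HS"]
    nhs = [r for r in rows if r["label"] == "NHS"]
    rng = random.Random(seed)
    kept_nhs = rng.sample(nhs, k=min(len(hs), len(nhs)))
    return hs + kept_nhs


# Toy category with roughly the class ratio reported for politics (33% HS).
politics = (
    [{"text": f"hs {i}", "label": "HS"} for i in range(33)]
    + [{"text": f"nhs {i}", "label": "NHS"} for i in range(67)]
)
balanced = balance_by_undersampling(politics)
```

Undersampling trades training data for class balance, so the gains in the F1 improved column reflect balance outweighing the loss of NHS examples.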

Influence of slang words:

The dataset vocabulary was extracted, and slang words were annotated with the help of a linguistic expert. These slang words were divided into two types: traditional slang (TS) and non-traditional slang (NTS). NTS words are not usually used as slang but can be interpreted as such depending on the context. Figure C.1 in Appendix C shows some examples of both types. We then looked at how the models predicted comments that contain at least one such slang word. For simplicity, we ignored comments that contained both TS and NTS words. Our initial hypothesis was that models would have more trouble with sentences containing NTS, as these words are used in both HS and NHS comments. Table 5 shows the result: TS acc is the accuracy on comments containing TS, and NTS acc is the accuracy on comments containing NTS. The models performed better when comments contained TS words rather than NTS words. NTS words are often vague in meaning, and so they pose special difficulty for the models.

Model               TS acc   NTS acc
SVM                 84.27    79.53
Bi-LSTM+FT(SG)      84.87    79.40
Bi-LSTM+FT(CBOW)    85.08    77.97
CNN+FT(SG)          84.95    77.59
CNN+FT(CBOW)        83.39    74.87
Table 5: Influence of traditional and non-traditional slang words
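
The TS acc and NTS acc columns can be computed along the following lines. Whitespace tokenization, exact token matching against the slang lists, the record layout, and all names are assumptions for illustration; the slang tokens are placeholders.

```python
def slang_subset_accuracy(examples, ts_words, nts_words):
    """Accuracy on comments containing only TS words and on comments containing only NTS words."""
    hits = {"TS": 0, "NTS": 0}
    totals = {"TS": 0, "NTS": 0}
    for ex in examples:
        tokens = set(ex["text"].split())  # assumption: whitespace tokenization
        has_ts = bool(tokens & ts_words)
        has_nts = bool(tokens & nts_words)
        if has_ts == has_nts:
            continue  # skip comments with both slang types or with neither
        group = "TS" if has_ts else "NTS"
        totals[group] += 1
        hits[group] += ex["gold"] == ex["pred"]
    return {g: hits[g] / totals[g] if totals[g] else None for g in totals}


ts_words, nts_words = {"ts1"}, {"nts1"}
examples = [
    {"text": "x ts1", "gold": "HS", "pred": "HS"},
    {"text": "ts1 y", "gold": "HS", "pred": "NHS"},
    {"text": "nts1 z", "gold": "NHS", "pred": "NHS"},
    {"text": "ts1 nts1", "gold": "HS", "pred": "HS"},  # both types: ignored
    {"text": "plain", "gold": "NHS", "pred": "NHS"},   # no slang: ignored
]
accuracy = slang_subset_accuracy(examples, ts_words, nts_words)
```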

5 Conclusion

In this paper, we presented HS-BAN, the largest Bangla HS dataset collected from social media comments. We followed various schemes to ensure linguistic diversity and reduce repetitiveness. We found that emoji and punctuation do not help HS detection. We also found that word embeddings trained on informal text outperform the available embeddings trained on formal text. We further observed that hate speech is easier to detect in comments with traditional slang than in comments with non-traditional slang. In the future, we plan to study the interpretability and generalizability of hate speech detection models.



A: Bangla HS dataset comparison

Dataset                                                  Size     No. of HS   Annotation guideline   Inter-annotator score
(Chakraborty and Seddiqui, 2019) [Chakraborty_2019]      5,644    2,500       No                     No
(Emon et al., 2019) [Emon2019]                           4,700    3,137       No                     No
(Awal et al., 2018) [Awal2018]                           2,665    1,214       No                     No
(Ishmam and Sharmin, 2019) [Ishmam2019_FB_pages]         5,126    3,178       No                     No
(Banik, 2019) [Banik2019]                                10,219   4,255       No                     No
(Karim et al., 2020a) [karim2020deephateexplainer]       6,115    6,115       No                     Yes
(Romim et al., 2021) [romim2021hate]                     30,000   10,000      Yes                    No
Ours                                                     50,314   20,209      Yes                    Yes
Table A.1: Overview of binary hate speech datasets of Bangla social media

B: Annotation Guidelines

We chose three criteria for a comment to be labeled as HS. These are:

  • Deliberate attack

  • Directed towards a specific group of people

  • Motivated by an aspect of the group's identity

Below, we also present a more nuanced guideline for annotation, based on the community standards of Facebook and YouTube.

Criteria for HS

Criteria for not HS

C: Traditional and non-traditional slang words

(a) Traditional Slang WordCloud
(b) Non Traditional Slang WordCloud
Figure C.1: Comparison of traditional and non-traditional slang words.