HateBERT: Retraining BERT for Abusive Language Detection in English

10/23/2020
by   Tommaso Caselli, et al.
1

In this paper, we introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have collected and made available to the public. We present the results of a detailed comparison between a general pre-trained language model and the abuse-inclined version obtained by retraining with posts from the banned communities on three English datasets for offensive, abusive language and hate speech detection tasks. In all datasets, HateBERT outperforms the corresponding general BERT model. We also discuss a battery of experiments comparing the portability of the general pre-trained language model and its corresponding abusive language-inclined counterpart across the datasets, indicating that portability is affected by compatibility of the annotated phenomena.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/09/2021

FPM: A Collection of Large-scale Foundation Pre-trained Language Models

Recent work in language modeling has shown that training large-scale Tra...
research
11/15/2022

Relationship of the language distance to English ability of a country

Language difference is one of the factors that hinder the acquisition of...
research
11/07/2019

The LIG system for the English-Czech Text Translation Task of IWSLT 2019

In this paper, we present our submission for the English to Czech Text T...
research
08/14/2020

Hate Speech Detection and Racial Bias Mitigation in Social Media based on BERT model

Disparate biases associated with datasets and trained classifiers in hat...
research
11/16/2020

Don't Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities

In this paper, we introduce a new annotated dataset which is aimed at su...
research
02/12/2021

Characterizing English Variation across Social Media Communities with BERT

Much previous work characterizing language variation across Internet soc...
research
08/27/2021

Exploring the Capacity of a Large-scale Masked Language Model to Recognize Grammatical Errors

In this paper, we explore the capacity of a language model-based method ...

Please sign up or login with your details

Forgot password? Click here to reset