ViHOS: Hate Speech Spans Detection for Vietnamese

01/24/2023
by   Phu Gia Hoang, et al.

One adverse side effect of the increased use of social networking platforms is a rise in hateful and offensive language directed at other users. This can make it difficult for human moderators to review the comments flagged by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus containing 26k spans on 11k comments. We also provide definitions of hateful and offensive spans in Vietnamese comments, together with detailed annotation guidelines. In addition, we conduct experiments with various state-of-the-art models. Specifically, XLM-R_Large achieved the best F1-scores in Single span detection and All spans detection, while PhoBERT_Large obtained the highest F1-score in Multiple spans detection. Finally, our error analysis highlights the difficulties in detecting specific types of spans in our data, which we leave for future research. Disclaimer: This paper contains real comments that could be considered profane, offensive, or abusive.

