LAHM : Large Annotated Dataset for Multi-Domain and Multilingual Hate Speech Identification

04/03/2023
by Ankit Yadav, et al.

Current research on hate speech analysis is typically oriented towards monolingual, single-task classification. In this paper, we present a new multilingual hate speech analysis dataset covering six languages (English, Hindi, Arabic, French, German and Spanish) and five hate speech domains: Abuse, Racism, Sexism, Religious Hate and Extremism. To the best of our knowledge, this paper is the first to address the problem of identifying various types of hate speech across these five broad domains in these six languages. We describe how we created the dataset, how we annotated it at a high level and a low level for the different domains, and how we use it to test current state-of-the-art multilingual and multitask learning approaches. We evaluate our dataset in various monolingual, cross-lingual and machine translation classification settings and compare it against open source English datasets that we aggregated and merged for this task. Finally, we discuss how this approach can be used to create large-scale hate speech datasets and how our annotations can be leveraged to improve hate speech detection and classification in general.
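
To make the monolingual and cross-lingual settings concrete, here is a minimal sketch in Python. It is not the authors' code or data format. The record fields (`is_hate` standing in for a high-level label, `domain` for a low-level one), the language codes, the helper names and the placeholder texts are all assumptions made for illustration; the abstract does not specify the annotation schema or the split procedure. The machine translation setting is left out because the abstract gives no details about it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Languages and domains are taken from the abstract; the codes are our own shorthand.
LANGUAGES = ["en", "hi", "ar", "fr", "de", "es"]
DOMAINS = ["abuse", "racism", "sexism", "religious_hate", "extremism"]


@dataclass(frozen=True)
class Example:
    """Hypothetical record layout; the real dataset schema may differ."""
    text: str
    lang: str
    is_hate: bool            # assumed high-level label
    domain: Optional[str]    # assumed low-level label, None when is_hate is False


def monolingual_split(
    examples: List[Example], lang: str, test_every: int = 5
) -> Tuple[List[Example], List[Example]]:
    """Train and test on one language; every test_every-th example is held out."""
    pool = [ex for ex in examples if ex.lang == lang]
    train = [ex for i, ex in enumerate(pool) if i % test_every != 0]
    test = [ex for i, ex in enumerate(pool) if i % test_every == 0]
    return train, test


def cross_lingual_split(
    examples: List[Example], source_lang: str, target_lang: str
) -> Tuple[List[Example], List[Example]]:
    """Train on the source language only, test on the target language only."""
    train = [ex for ex in examples if ex.lang == source_lang]
    test = [ex for ex in examples if ex.lang == target_lang]
    return train, test


def make_toy_data(per_lang: int = 10) -> List[Example]:
    """Build placeholder records; the texts carry no real content."""
    data = []
    for lang in LANGUAGES:
        for i in range(per_lang):
            is_hate = i % 2 == 0
            domain = DOMAINS[i % len(DOMAINS)] if is_hate else None
            data.append(Example(f"placeholder {lang} {i}", lang, is_hate, domain))
    return data


toy = make_toy_data()
mono_train, mono_test = monolingual_split(toy, "hi")
cross_train, cross_test = cross_lingual_split(toy, "en", "es")
```

The only difference between the two settings in this sketch is where the test examples come from: the same language as the training data, or a different one.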

