Offensive Hebrew Corpus and Detection using BERT

09/06/2023
by Nagham Hamad, et al.

Offensive language detection has been well studied in many languages, but it lags behind in low-resource languages such as Hebrew. In this paper, we present a new offensive language corpus in Hebrew. A total of 15,881 tweets were retrieved from Twitter. Each was labeled with one or more of five classes (abusive, hate, violence, pornographic, or non-offensive) by Arabic-Hebrew bilingual speakers. The annotation process was challenging because each annotator was expected to be familiar with Israeli culture, politics, and practices in order to understand the context of each tweet. We fine-tuned two Hebrew BERT models, HeBERT and AlephBERT, using our proposed dataset and another published dataset, D_OLaH. We observed that our data boosts HeBERT performance by 2% when combined with D_OLaH. Fine-tuning AlephBERT on our data and testing on D_OLaH yields 69% accuracy, while fine-tuning on D_OLaH and testing on our data yields 57% accuracy, which may be an indication of the generalizability our data offers. Our dataset and fine-tuned models are available on GitHub and Huggingface.
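Because each tweet may carry more than one of the five classes, a classifier trained on such data typically needs a multi-hot target rather than a single class index. The following is a minimal sketch of that encoding, not the authors' code: the label strings, their order, the helper names `encode_labels` and `exact_match_accuracy`, and the use of exact-match scoring are all assumptions made for illustration.

```python
# Hypothetical label names and order, based on the five classes named in the abstract.
CLASSES = ["abusive", "hate", "violence", "pornographic", "none"]


def encode_labels(labels):
    """Return a multi-hot vector over CLASSES for one tweet's set of labels."""
    labels = set(labels)
    if not labels:
        raise ValueError("each tweet needs at least one label")
    unknown = labels - set(CLASSES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in labels else 0 for name in CLASSES]


def exact_match_accuracy(gold, predicted):
    """Fraction of tweets whose predicted vector equals the gold vector.

    Exact match is an assumption; the abstract does not say how accuracy
    was computed.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return hits / len(gold)


if __name__ == "__main__":
    gold = [encode_labels(["hate", "violence"]), encode_labels(["none"])]
    predicted = [encode_labels(["hate", "violence"]), encode_labels(["abusive"])]
    print(gold[0])
    print(exact_match_accuracy(gold, predicted))
```

In a cross-dataset setup like the one described, the same scoring function would be applied to predictions from a model fine-tuned on one corpus and evaluated on the other.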
