Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

07/31/2023
by   João A. Leite, et al.

Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer number of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant and easier and cheaper to obtain. In this scenario, self-training methods, which use weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to enforce prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five pre-trained BERT architectures of varying size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in gains of up to +1.5 points, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on the offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
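The default and noisy self-training procedures contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: a toy keyword classifier stands in for the pre-trained BERT models, and a trivial token-duplication function stands in for augmentations such as backtranslation; all function names here are hypothetical.

```python
# Sketch of default vs. "noisy" self-training for offensive-text detection.
# A toy keyword model replaces BERT; augment() replaces backtranslation.

def train(examples):
    """'Train' a toy classifier: collect words seen in offensive posts (label 1)."""
    offensive_words = set()
    for text, label in examples:
        if label == 1:
            offensive_words.update(text.lower().split())
    return offensive_words

def predict(model, text):
    """Predict 1 (offensive) if any known offensive word appears."""
    return 1 if any(w in model for w in text.lower().split()) else 0

def augment(text):
    """Label-preserving perturbation; a stand-in for backtranslation
    or synonym replacement used in the actual experiments."""
    return " ".join(w for w in text.split() for _ in (0, 1))

def self_train(labelled, unlabelled, noisy=False):
    """One self-training round: teacher pseudo-labels the unlabelled pool,
    then a student is retrained on gold + pseudo-labelled data."""
    teacher = train(labelled)
    pseudo = [(t, predict(teacher, t)) for t in unlabelled]
    if noisy:
        # Noisy variant: perturb pseudo-labelled inputs so the student
        # must stay consistent under augmentation.
        pseudo = [(augment(t), y) for t, y in pseudo]
    return train(labelled + pseudo)

labelled = [("you are awful", 1), ("have a nice day", 0)]
unlabelled = ["truly awful take", "lovely weather today"]
student = self_train(labelled, unlabelled, noisy=True)
print(predict(student, "awful comment"))  # 1
```

In practice the teacher and student are fine-tuned transformer models and pseudo-labels are filtered by confidence, but the loop structure is the same.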

