A little goes a long way: Improving toxic language classification despite data scarcity

09/25/2020
by Mika Juuti, et al.

Detection of some types of toxic language is hampered by extreme scarcity of labeled training data. Data augmentation (generating new synthetic data from a labeled seed dataset) can help, but its efficacy for toxic language classification has not been fully explored. We present the first systematic study of how data augmentation techniques affect performance across toxic language classifiers, ranging from shallow logistic regression architectures to BERT, a state-of-the-art pre-trained Transformer network. We compare the performance of eight techniques on very scarce seed datasets. We show that while BERT performed best, shallow classifiers performed comparably when trained on data augmented with a combination of three techniques, including GPT-2-generated sentences. We discuss the interplay of performance and computational overhead, which can inform the choice of techniques under different constraints.
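
To make the idea of data augmentation concrete, the following is a minimal sketch, not the authors' method. It generates synthetic labeled examples from a tiny seed set by randomly dropping tokens and swapping one pair of words, which is a generic token-level perturbation and not necessarily one of the eight techniques studied. The function name `augment`, the parameter `p_drop`, and the two seed sentences are hypothetical and chosen only for illustration.

```python
import random


def augment(seed, n_new, p_drop=0.1, rng=None):
    """Create n_new synthetic (text, label) pairs from a labeled seed set.

    Hypothetical helper for illustration only: each synthetic example is a
    perturbed copy of a randomly chosen seed example and keeps its label.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        text, label = rng.choice(seed)
        tokens = text.split()
        # Randomly drop tokens, but always keep at least one.
        kept = [t for t in tokens if rng.random() >= p_drop] or tokens[:1]
        # Swap one random pair of positions to vary word order.
        if len(kept) > 1:
            i, j = rng.sample(range(len(kept)), 2)
            kept[i], kept[j] = kept[j], kept[i]
        synthetic.append((" ".join(kept), label))
    return synthetic


# Assumed toy seed dataset: 1 = toxic, 0 = non-toxic.
seed = [("you are a terrible person", 1), ("have a nice day", 0)]
augmented = seed + augment(seed, n_new=6)
```

The augmented list can then be fed to any classifier in place of the seed set alone. Whether such perturbations help depends on the task and the classifier, which is the question the study examines.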

research
06/12/2022

Data Augmentation for Intent Classification

Training accurate intent classifiers requires labeled data, which can be...
research
03/04/2020

Data Augmentation using Pre-trained Transformer Models

Language model based pre-trained models such as BERT have provided signi...
research
11/08/2019

Not Enough Data? Deep Learning to the Rescue!

Based on recent advances in natural language modeling and those in text ...
research
12/19/2021

Data Augmentation for Mental Health Classification on Social Media

The mental disorder of online users is determined using social media pos...
research
01/12/2021

Data augmentation and feature selection for automatic model recommendation in computational physics

Classification algorithms have recently found applications in computatio...
research
12/17/2018

Conditional BERT Contextual Augmentation

We propose a novel data augmentation method for labeled sentences called...
research
12/05/2020

Enhanced Offensive Language Detection Through Data Augmentation

Detecting offensive language on social media is an important task. The I...
