Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations

01/17/2022
by Chris Emmery, et al.

A limited number of studies investigate the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show that models become robust against word-level perturbations, at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora. The perturbed data, models, and code are available for reproduction at https://github.com/cmry/augtox
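As a rough illustration of the kind of pipeline the abstract describes, the sketch below applies dictionary-based word substitutions to labelled samples and appends the perturbed copies to the training set. Because no classifier scores or gradients are consulted, the substitutions are model-agnostic in the sense used above. The substitution table, function names, and toy data are hypothetical placeholders, not the substitutions or models used in the paper; see the linked repository for the actual implementation.

# Minimal sketch of model-agnostic word-level perturbation and augmentation.
# Illustrative only: the lexicon and examples below are hypothetical stand-ins.
import random

# Hypothetical lexicon of surface-form substitutes; a real attack would draw
# candidates from embedding neighbours or an obfuscation dictionary.
SUBSTITUTES = {
    "stupid": ["st*pid", "stoopid", "dumb"],
    "hate": ["h4te", "despise"],
    "loser": ["l0ser", "looser"],
}

def perturb(text, rate=0.5, seed=42):
    """Replace known words with a random surface variant; no model is queried."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        if tok.lower() in SUBSTITUTES and rng.random() < rate:
            out.append(rng.choice(SUBSTITUTES[tok.lower()]))
        else:
            out.append(tok)
    return " ".join(out)

def augment(corpus):
    """Append a perturbed copy of every (text, label) pair to the training set."""
    return corpus + [(perturb(text), label) for text, label in corpus]

if __name__ == "__main__":
    train = [("you are such a loser and i hate you", 1),
             ("hope you have a great day", 0)]
    for label, text in ((l, t) for t, l in augment(train)):
        print(label, text)

Evaluating a classifier on the perturbed copies alone approximates the adversarial setting, while training on the augmented set approximates the robustness experiment described in the abstract.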


Related research

09/21/2020  Stereopagnosia: Fooling Stereo Networks with Adversarial Perturbations
We study the effect of adversarial perturbations of images on the estima...

03/09/2023  Learning the Legibility of Visual Text Perturbations
Many adversarial attacks in NLP perturb inputs to produce visually simil...

12/14/2019  Towards Robust Toxic Content Classification
Toxic content detection aims to identify content that can offend or harm...

09/29/2020  Inverse Classification with Limited Budget and Maximum Number of Perturbed Samples
Most recent machine learning research focuses on developing new classifi...

09/12/2022  DoubleMix: Simple Interpolation-Based Data Augmentation for Text Classification
This paper proposes a simple yet effective interpolation-based data augm...

08/29/2022  Reducing Certified Regression to Certified Classification
Adversarial training instances can severely distort a model's behavior. ...

01/04/2023  UniHD at TSAR-2022 Shared Task: Is Compute All We Need for Lexical Simplification?
Previous state-of-the-art models for lexical simplification consist of c...
