On The Robustness of Offensive Language Classifiers

03/21/2022
by Jonathan Rusert, et al.

Social media platforms are deploying machine learning-based offensive language classification systems to combat hateful, racist, and other forms of offensive speech at scale. However, despite their real-world deployment, we do not yet comprehensively understand the extent to which offensive language classifiers are robust against adversarial attacks. Prior work in this space is limited to studying the robustness of offensive language classifiers against primitive attacks such as misspellings and extraneous spaces. To address this gap, we systematically analyze the robustness of state-of-the-art offensive language classifiers against more crafty adversarial attacks that leverage greedy- and attention-based word selection and context-aware embeddings for word replacement. Our results on multiple datasets show that these crafty adversarial attacks can degrade the accuracy of offensive language classifiers by more than 50% while preserving the readability and meaning of the modified text.
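
To illustrate the class of attack the abstract describes, the sketch below shows a generic greedy word-substitution attack against a text classifier. It is not the authors' exact method: score_offensive and propose_replacements are hypothetical stand-ins for the target classifier's offensive-class probability and a context-aware replacement-candidate generator, and the toy functions at the bottom exist only so the example runs end to end.

```python
# Minimal sketch of a greedy word-substitution attack on an offensive-language
# classifier. Illustrative only; not the attack implementation from the paper.
from typing import Callable, List


def greedy_word_substitution_attack(
    text: str,
    score_offensive: Callable[[str], float],                      # P(offensive | text) from the target classifier
    propose_replacements: Callable[[List[str], int], List[str]],  # context-aware candidates for position i
    max_changes: int = 3,
) -> str:
    """Greedily swap the words whose replacement most lowers the target
    classifier's offensive score, stopping once the prediction flips."""
    tokens = text.split()
    for _ in range(max_changes):
        best_score = score_offensive(" ".join(tokens))
        if best_score < 0.5:            # prediction already flipped to non-offensive
            break
        best_tokens = None
        # Try every (position, candidate) pair and keep the single best swap.
        for i in range(len(tokens)):
            for candidate in propose_replacements(tokens, i):
                trial = tokens[:i] + [candidate] + tokens[i + 1:]
                score = score_offensive(" ".join(trial))
                if score < best_score:
                    best_score, best_tokens = score, trial
        if best_tokens is None:         # no single swap lowers the score any further
            break
        tokens = best_tokens
    return " ".join(tokens)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any external models.
    BAD_WORDS = {"awful", "stupid"}

    def toy_score(text: str) -> float:
        words = text.lower().split()
        return min(1.0, 3.0 * sum(w in BAD_WORDS for w in words) / max(len(words), 1))

    def toy_candidates(tokens: List[str], i: int) -> List[str]:
        # A real attack would query a masked language model at position i instead.
        return ["questionable", "silly"] if tokens[i].lower() in BAD_WORDS else []

    print(greedy_word_substitution_attack("that was a stupid take", toy_score, toy_candidates))
```

In a realistic attack, propose_replacements would query a masked language model so that substitutes fit the surrounding context, which is what distinguishes these crafty attacks from the primitive misspelling- and whitespace-based perturbations studied in prior work.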
