Identifying Adversarial Attacks on Text Classifiers

01/21/2022
by Zhouhang Xie et al.

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternative approach: we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of classifiers for attack identification, determining whether a given text has been adversarially manipulated and, if so, by which attack. As a third contribution, we demonstrate the effectiveness of three classes of features for these tasks: text properties, capturing the content and presentation of the text; language model properties, measuring which tokens are more or less probable throughout the input; and target model properties, representing how the text classifier is influenced by the attack, including internal node activations. Overall, this represents a first step towards forensics for adversarial attacks against text classifiers.
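To make the feature classes concrete, the following is a minimal sketch of attack identification built from two of the three feature classes described above: surface text properties and language-model token probabilities. This is not the authors' pipeline; the choice of GPT-2 as the language model, the specific statistics, and the logistic-regression classifier are all illustrative assumptions.

```python
# Sketch: featurize texts for attack detection/labeling, assuming GPT-2 as
# the language model and hand-picked statistics (not the paper's exact setup).
import numpy as np
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sklearn.linear_model import LogisticRegression

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

def lm_features(text):
    """Language model properties: per-token log-probability statistics."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Log-probability each token receives given its preceding context.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    tok_logp = logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    return [tok_logp.mean().item(), tok_logp.min().item(), tok_logp.std().item()]

def text_features(text):
    """Text properties: simple content and presentation statistics."""
    n = max(len(text), 1)
    return [
        sum(c.isupper() for c in text) / n,      # casing anomalies
        sum(not c.isascii() for c in text) / n,  # homoglyph-style substitutions
        len(text.split()),                       # word count
    ]

def featurize(texts):
    return np.array([lm_features(t) + text_features(t) for t in texts])

# Usage (hypothetical data): texts is a list of strings; labels uses
# 0 = clean and 1..k = the attack that produced the text.
# clf = LogisticRegression(max_iter=1000).fit(featurize(texts), labels)
```

Given a labeled collection of clean and attacked texts, the same features serve both tasks: detection (clean vs. attacked) and attack labeling (which attack produced the text). The third feature class from the abstract, target model properties such as internal node activations, would require access to the victim classifier and is omitted here.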


