Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification

09/06/2019
by Yichao Zhou, et al.

Adversarial attacks against machine learning models have threatened various real-world applications such as spam filtering and sentiment analysis. In this paper, we propose a novel framework, learning to DIScriminate Perturbations (DISP), to identify and adjust malicious perturbations, thereby blocking adversarial attacks for text classification models. To identify adversarial attacks, a perturbation discriminator estimates how likely each token in the text has been perturbed and provides a set of potential perturbations. For each potential perturbation, an embedding estimator learns to restore the embedding of the original word based on the context, and a replacement token is chosen based on approximate kNN search. DISP can block adversarial attacks for any NLP model without modifying the model structure or training procedure. Extensive experiments on two benchmark datasets demonstrate that DISP significantly outperforms baseline methods in blocking adversarial attacks for text classification. In addition, in-depth analysis shows the robustness of DISP across different situations.
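The abstract describes a three-stage pipeline: discriminate suspicious tokens, estimate the original embedding from context, and replace via nearest-neighbor search. The sketch below is a minimal illustration of that flow, not the authors' implementation: the `discriminator` and `embedding_estimator` callables, the vocabulary embedding table, and the 0.5 threshold are all assumptions, and exact nearest-neighbor search stands in for the approximate kNN used in the paper.

```python
import numpy as np

def block_perturbations(tokens, discriminator, embedding_estimator,
                        vocab_embeddings, vocab_words, threshold=0.5):
    """Identify likely perturbed tokens and replace each with the vocabulary
    word whose embedding is closest to the estimator's reconstruction.

    Hypothetical sketch of the DISP pipeline; `discriminator` and
    `embedding_estimator` stand in for the paper's trained components.
    """
    # 1) Perturbation discriminator: per-token probability of being adversarial.
    probs = discriminator(tokens)                      # shape: (len(tokens),)
    suspicious = [i for i, p in enumerate(probs) if p > threshold]

    recovered = list(tokens)
    for i in suspicious:
        # 2) Embedding estimator: reconstruct the original word's embedding
        #    from the surrounding context (masking out the suspicious token).
        context = recovered[:i] + ["[MASK]"] + recovered[i + 1:]
        est_vec = embedding_estimator(context, position=i)   # shape: (d,)

        # 3) Nearest-neighbor search over the vocabulary embeddings
        #    (exact search here; the paper uses approximate kNN for speed).
        dists = np.linalg.norm(vocab_embeddings - est_vec, axis=1)
        recovered[i] = vocab_words[int(np.argmin(dists))]

    # The recovered text is then fed to the unmodified downstream classifier.
    return recovered
```

Because the recovery happens entirely on the input side, the downstream classifier needs no retraining, which matches the abstract's claim that DISP works with any NLP model as-is.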


Related research

05/01/2020  Evaluating Neural Machine Comprehension Model Robustness to Noisy Inputs and Adversarial Attacks
We evaluate machine comprehension models' robustness to noise and advers...

12/20/2021  Unifying Model Explainability and Robustness for Joint Text Classification and Rationale Extraction
Recent works have shown explainability and robustness are two crucial in...

03/09/2023  Learning the Legibility of Visual Text Perturbations
Many adversarial attacks in NLP perturb inputs to produce visually simil...

02/11/2022  Using Random Perturbations to Mitigate Adversarial Attacks on Sentiment Analysis Models
Attacks on deep learning models are often difficult to identify and ther...

09/22/2021  BFClass: A Backdoor-free Text Classification Framework
Backdoor attack introduces artificial vulnerabilities into the model by ...

10/30/2021  AdvCodeMix: Adversarial Attack on Code-Mixed Data
Research on adversarial attacks are becoming widely popular in the recen...

09/03/2019  Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation
Neural networks are part of many contemporary NLP systems, yet their emp...
