Certified Robustness to Text Adversarial Attacks by Randomized [MASK]

05/08/2021
by Jiehang Zeng, et al.

Recently, a few certified defense methods have been developed to provably guarantee the robustness of a text classifier against adversarial synonym substitutions. However, all existing certified defense methods assume that the defender knows how the adversary generates synonyms, which is not a realistic scenario. In this paper, we propose a certifiably robust defense method that randomly masks a certain proportion of the words in an input text, so that the above unrealistic assumption is no longer necessary. The proposed method can defend against not only word substitution-based attacks but also character-level perturbations. We can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on AGNEWS and 2 words on the SST2 dataset. The experimental results show that our randomized smoothing method significantly outperforms recently proposed defense methods across multiple datasets.
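As a rough sketch of the masking-and-voting idea described above (not the authors' exact certification procedure), the snippet below randomly masks a proportion of input words and takes a majority vote over many masked copies; `classify`, `mask_rate`, and `n_samples` are illustrative placeholders for a black-box text classifier and its smoothing parameters:

```python
import random
from collections import Counter

MASK_TOKEN = "[MASK]"  # BERT-style mask token; adjust for other tokenizers


def random_mask(words, mask_rate, rng):
    """Replace a randomly chosen mask_rate proportion of words with MASK_TOKEN."""
    n_mask = int(len(words) * mask_rate)
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    return [MASK_TOKEN if i in masked_idx else w for i, w in enumerate(words)]


def smoothed_predict(classify, text, mask_rate=0.3, n_samples=100, seed=0):
    """Majority vote over predictions on randomly masked copies of the input.

    `classify` is any function mapping a string to a class label; it stands
    in for a fine-tuned text classifier.
    """
    rng = random.Random(seed)
    words = text.split()
    votes = Counter(
        classify(" ".join(random_mask(words, mask_rate, rng)))
        for _ in range(n_samples)
    )
    return votes.most_common(1)[0][0]
```

Computing a certified radius would additionally require bounding how many of the votes an adversary perturbing a fixed number of words could flip, which the paper derives from the proportion of masked words.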


03/27/2022

Rebuild and Ensemble: Exploring Defense Against Text Adversaries

Adversarial attacks can mislead strong neural models; as such, in NLP ta...
11/21/2019

Robustness Certificates for Sparse Adversarial Attacks by Randomized Ablation

Recently, techniques have been developed to provably guarantee the robus...
06/20/2020

Defense against Adversarial Attacks in NLP via Dirichlet Neighborhood Ensemble

Although neural networks have achieved prominent performance on many natu...
07/28/2021

Towards Robustness Against Natural Language Word Substitutions

Robustness against word substitutions has a well-defined and widely acce...
02/19/2022

Data-Driven Mitigation of Adversarial Text Perturbation

Social networks have become an indispensable part of our lives, with bil...
05/29/2020

SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions

State-of-the-art NLP models can often be fooled by human-unaware transfo...
03/19/2022

Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense

We propose a novel algorithm, ANTHRO, that inductively extracts over 60...