ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation

10/22/2022
by Fan Yin, et al.

Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. Specifically, current search-based adversarial attacks in NLP stop searching as soon as the model's prediction changes, so most adversarial examples generated by those attacks lie near the model's decision boundary. To move beyond this shortcut and evaluate AED methods fairly, we propose testing them with Far-Boundary (FB) adversarial examples, on which existing methods perform worse than random guessing. To overcome this limitation, we propose a new technique, ADDMU (adversary detection with data and model uncertainty), which combines two types of uncertainty estimation to detect both regular and FB adversarial examples. Our new method outperforms previous methods by 3.6 and 6.0 AUC points in the two scenarios, respectively. Finally, our analysis shows that the two types of uncertainty provided by ADDMU can be leveraged to characterize adversarial examples and to identify those that contribute most to the model's robustness under adversarial training.
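The abstract does not spell out the uncertainty estimators, so the following is a minimal sketch of how data and model uncertainty could be combined into a single detection score: model uncertainty from the variance of Monte Carlo dropout predictions, and data uncertainty from the variance of predictions over randomly masked copies of the input. The function names, the masking rate `p`, and the weighted-sum combination in `addmu_score` are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an ADDMU-style detector combining data and model uncertainty.
# The estimators below (MC dropout for model uncertainty, random token
# masking for data uncertainty) and the weighted-sum combination are
# illustrative assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F

def model_uncertainty(model, input_ids, n_samples=20):
    # Model uncertainty: variance of class probabilities when dropout
    # stays active at inference time (Monte Carlo dropout).
    model.train()  # keeps dropout layers active
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(input_ids), dim=-1) for _ in range(n_samples)]
        )
    model.eval()
    return probs.var(dim=0).mean().item()

def data_uncertainty(model, input_ids, mask_id, n_samples=20, p=0.15):
    # Data uncertainty: variance of class probabilities over copies of
    # the input with a random fraction p of tokens replaced by mask_id.
    model.eval()
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = input_ids.clone()
            drop = torch.rand(noisy.shape) < p
            noisy[drop] = mask_id
            probs.append(F.softmax(model(noisy), dim=-1))
    return torch.stack(probs).var(dim=0).mean().item()

def addmu_score(model, input_ids, mask_id, alpha=0.5):
    # Higher score -> more likely adversarial. The mixing weight alpha
    # is a hypothetical knob; calibrate it on held-out clean data.
    mu = model_uncertainty(model, input_ids)
    du = data_uncertainty(model, input_ids, mask_id)
    return alpha * mu + (1 - alpha) * du
```

On clean inputs both scores should stay low, while an adversarial input, even one pushed far from the decision boundary, is expected to inflate at least one of them; thresholding `addmu_score` on held-out clean data then yields a detector for both regular and FB adversarial examples.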

Related research

05/06/2023 · Reactive Perturbation Defocusing for Textual Adversarial Defense
Recent studies have shown that large pre-trained language models are vul...

09/15/2019 · Natural Language Adversarial Attacks and Defenses in Word Level
Up until recent two years, inspired by the big amount of research about ...

07/16/2022 · Towards the Desirable Decision Boundary by Moderate-Margin Adversarial Training
Adversarial training, as one of the most effective defense methods again...

04/17/2022 · Residue-Based Natural Language Adversarial Attack Detection
Deep learning based systems are susceptible to adversarial attacks, wher...

12/11/2020 · Random Projections for Adversarial Attack Detection
Whilst adversarial attack detection has received considerable attention,...

07/03/2023 · Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)
Transformer-based text classifiers like BERT, Roberta, T5, and GPT-3 hav...

04/16/2021 · Towards Variable-Length Textual Adversarial Attacks
Adversarial attacks have shown the vulnerability of machine learning mod...
