ML-LOO: Detecting Adversarial Examples with Feature Attribution

06/08/2019
by   Puyudi Yang, et al.
3

Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to input. The perturbation is often human imperceptible on image data. We observe a significant difference in feature attributions of adversarially crafted examples from those of original ones. Based on this observation, we introduce a new framework to detect adversarial examples through thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle the attacks with mixed confidence levels. Through vast experiments, our method achieves superior performances in distinguishing adversarial examples from popular attack methods on a variety of real data sets among state-of-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels, and transfer between different attacking methods.

READ FULL TEXT
research
02/23/2021

Adversarial Examples Detection beyond Image Space

Deep neural networks have been proved that they are vulnerable to advers...
research
05/22/2019

Detecting Adversarial Examples and Other Misclassifications in Neural Networks by Introspection

Despite having excellent performances for a wide variety of tasks, moder...
research
03/14/2019

Attribution-driven Causal Analysis for Detection of Adversarial Examples

Attribution methods have been developed to explain the decision of a mac...
research
12/31/2020

Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions

Adversarial examples are input examples that are specifically crafted to...
research
05/18/2021

Detecting Adversarial Examples with Bayesian Neural Network

In this paper, we propose a new framework to detect adversarial examples...
research
04/26/2023

SHIELD: Thwarting Code Authorship Attribution

Authorship attribution has become increasingly accurate, posing a seriou...
research
03/07/2021

Detecting Adversarial Examples from Sensitivity Inconsistency of Spatial-Transform Domain

Deep neural networks (DNNs) have been shown to be vulnerable against adv...

Please sign up or login with your details

Forgot password? Click here to reset