Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

10/27/2018
by Guanhong Tao, et al.

Adversarial sample attacks perturb benign inputs to induce DNN misbehavior. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with the interpretability of DNN models: while classification results on benign inputs can be reasoned about in terms of human-perceptible features/attributes, results on adversarial samples can hardly be explained. We therefore propose a novel interpretability-based adversarial sample detection technique for face recognition models. It features a novel bi-directional correspondence inference between attributes and internal neurons to identify neurons critical for individual attributes. The activation values of critical neurons are strengthened to amplify the reasoning part of the computation, while the values of other neurons are weakened to suppress the uninterpretable part. The classification results after this transformation are compared with those of the original model to detect adversarial samples. Results show that our technique achieves 94% detection accuracy for 7 different kinds of attacks with 9.91% false positives on benign inputs, whereas a state-of-the-art feature squeezing technique achieves only 55% accuracy with 23.3% false positives.
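The core detection step, strengthening attribute-critical activations, weakening the rest, and comparing the steered prediction with the original one, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under assumed names and values: the critical_neurons map from layer names to neuron indices (which the paper derives through its bi-directional attribute/neuron correspondence inference, not reproduced here) and the STRENGTHEN/WEAKEN scale factors are hypothetical placeholders, not the authors' implementation.

import torch

STRENGTHEN = 1.5   # hypothetical amplification factor for attribute-critical neurons
WEAKEN = 0.5       # hypothetical suppression factor for the remaining neurons

def make_steering_hook(critical_idx):
    """Return a forward hook that rescales one layer's channel activations."""
    def hook(module, inputs, output):
        scaled = output * WEAKEN                                         # suppress the uninterpretable part
        scaled[:, critical_idx] = output[:, critical_idx] * STRENGTHEN   # amplify the reasoning part
        return scaled
    return hook

def detect_adversarial(model, x, critical_neurons):
    """Flag inputs on which the original and attribute-steered models disagree.

    critical_neurons: dict {layer_name: indices of critical neurons}, assumed to be
    precomputed by the attribute/neuron correspondence analysis.
    """
    model.eval()
    with torch.no_grad():
        original_pred = model(x).argmax(dim=1)

    # Attach steering hooks to the named layers, run the steered model, then clean up.
    modules = dict(model.named_modules())
    handles = [modules[name].register_forward_hook(make_steering_hook(idx))
               for name, idx in critical_neurons.items()]
    try:
        with torch.no_grad():
            steered_pred = model(x).argmax(dim=1)
    finally:
        for h in handles:
            h.remove()

    # Disagreement between the two predictions signals a likely adversarial sample.
    return original_pred != steered_pred

Using forward hooks keeps the original network untouched: the same weights serve as both the reference model and the attribute-steered model, and only the activations are rescaled on the fly during the second pass.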

Related research

- RAID: Randomized Adversarial-Input Detection for Neural Networks (02/07/2020)
  In recent years, neural networks have become the default choice for imag...

- Identifying Audio Adversarial Examples via Anomalous Pattern Detection (02/13/2020)
  Audio processing models based on deep neural networks are susceptible to...

- Detecting Trojaned DNNs Using Counterfactual Attributions (12/03/2020)
  We target the problem of detecting Trojans or backdoors in DNNs. Such mo...

- Can the state of relevant neurons in a deep neural networks serve as indicators for detecting adversarial attacks? (10/29/2020)
  We present a method for adversarial attack detection based on the inspec...

- Ptolemy: Architecture Support for Robust Deep Learning (08/23/2020)
  Deep learning is vulnerable to adversarial attacks, where carefully-craf...

- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP (08/04/2023)
  Backdoor attacks have emerged as a prominent threat to natural language ...
