Reliable Classification Explanations via Adversarial Attacks on Robust Networks

by Walt Woods, et al.
Portland State University

Neural Networks (NNs) have been found vulnerable to a class of imperceptible attacks, called adversarial examples, which arbitrarily alter the output of the network. These attacks have called the validity of NNs into question, particularly on sensitive problems such as medical imaging or fraud detection. We further argue that the fields of explainable AI and Human-In-The-Loop (HITL) algorithms are impacted by adversarial attacks, as attacks result in perturbations outside of the salient regions highlighted by state-of-the-art techniques such as LIME or Grad-CAM. This work makes three contributions that greatly reduce the impact of adversarial examples and pave the way for future HITL workflows: (1) we propose a novel regularization technique, inspired by the Lipschitz constraint, which greatly improves an NN's resistance to adversarial examples; (2) we propose a collection of novel network and training changes to complement the proposed regularization technique, including a Half-Huber activation function and an integrator-based controller for regularization strength; and (3) we demonstrate that networks trained with this technique may be deliberately attacked to generate rich explanations. Our techniques led to networks more robust than the previous state of the art: measured by the Accuracy-Robustness Area (ARA), our most robust ImageNet classification network reached 42.2% accuracy on clean data with an ARA of 0.0053, which is 2.4x greater than the previous state of the art at the same level of clean-data accuracy, and was achieved with a network one-third the size. A far-reaching benefit of this technique is its ability to intuitively demonstrate decision boundaries to a human observer, allowing for improved debugging of NN decisions and providing a means for improving the underlying model.
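The abstract reports results in terms of the Accuracy-Robustness Area (ARA) but does not define it. The sketch below is one plausible reading, not the authors' implementation: it assumes ARA is the area under the curve of classification accuracy plotted against the allowed adversarial perturbation size. Under that assumption, the area equals the mean, over all test examples, of the smallest perturbation that flips the prediction, with misclassified examples contributing zero. The function names, the inputs `min_perturbation` and `correct`, and the unit used for perturbation size are all hypothetical.

```python
def accuracy_at(eps, min_perturbation, correct):
    """Fraction of examples that stay correctly classified when the
    attacker may use any perturbation of size <= eps.

    min_perturbation[i]: smallest perturbation size that flips example i
                         (hypothetical input, e.g. from an iterative attack).
    correct[i]:          whether example i is classified correctly when clean.
    """
    survivors = sum(
        1 for r, ok in zip(min_perturbation, correct) if ok and r > eps
    )
    return survivors / len(correct)


def accuracy_robustness_area(min_perturbation, correct):
    """Area under accuracy_at(eps) for eps from 0 to infinity.

    Assumed definition: because accuracy_at is a step function that drops
    by 1/N at each example's flipping radius, its integral is the mean
    radius, counting misclassified examples as radius 0.
    """
    total = sum(r if ok else 0.0 for r, ok in zip(min_perturbation, correct))
    return total / len(correct)


if __name__ == "__main__":
    # Toy data: four examples, the third is misclassified even when clean.
    radii = [0.2, 0.4, 0.1, 0.3]
    is_correct = [True, True, False, True]
    print("clean accuracy:", accuracy_at(0.0, radii, is_correct))
    print("ARA (assumed definition):", accuracy_robustness_area(radii, is_correct))
```

Read this way, a single number rewards both clean accuracy (the height of the curve at zero perturbation) and robustness (how slowly the curve falls), which is consistent with the abstract comparing ARA values "at the same level of accuracy on clean data."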






Code Repositories


Code example for the paper, "Adversarial Explanations for Understanding Image Classification Decisions and Improved Neural Network Robustness."
