Bridging Adversarial Robustness and Gradient Interpretability

03/27/2019
by Beomsu Kim, et al.

Adversarial training is a training scheme designed to counter adversarial attacks by augmenting the training dataset with adversarial examples. Surprisingly, several studies have observed that loss gradients from adversarially trained DNNs are visually more interpretable than those from standard DNNs. Although this phenomenon is interesting, only a few works have offered an explanation. In this paper, we attempted to bridge the gap between adversarial robustness and gradient interpretability. To this end, we identified that loss gradients from adversarially trained DNNs align better with human perception because adversarial training restricts gradients to lie closer to the image manifold. We then demonstrated that adversarial training causes loss gradients to be quantitatively meaningful. Finally, we showed that under the adversarial training framework, there exists an empirical trade-off between test accuracy and loss gradient interpretability, and we proposed two potential approaches to resolving this trade-off.
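As a concrete illustration of the scheme the abstract describes, below is a minimal PyTorch sketch of one adversarial training step, together with the input loss gradient whose interpretability the paper studies. FGSM stands in here for whichever attack is actually used; the model, optimizer, eps value, and the [0, 1] pixel range are illustrative assumptions, not details drawn from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_example(model, x, y, eps=0.03):
    # Craft an FGSM adversarial example: x_adv = x + eps * sign(dL/dx).
    # eps and the [0, 1] clamp are illustrative choices.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    # One adversarial training step: perturb the batch with an attack,
    # then minimize the loss on the perturbed examples.
    x_adv = fgsm_example(model, x, y, eps)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

def loss_gradient(model, x, y):
    # The input loss gradient dL/dx -- the quantity the paper inspects
    # for visual interpretability.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return grad
```

In this framing, the paper's observation is that after many adversarial_training_step updates, loss_gradient(model, x, y) tends to align better with salient image structure than the gradient of a standardly trained model.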

Related research

Improved robustness to adversarial examples using Lipschitz regularization of the loss (10/01/2018)
Adversarial training is an effective method for improving robustness to ...

Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent (09/10/2020)
Adversarial training, especially projected gradient descent (PGD), has b...

Noise Modulation: Let Your Model Interpret Itself (03/19/2021)
Given the great success of Deep Neural Networks (DNNs) and the black-box ...

Do Perceptually Aligned Gradients Imply Adversarial Robustness? (07/22/2022)
In the past decade, deep learning-based networks have achieved unprecede...

On the Loss Landscape of Adversarial Training: Identifying Challenges and How to Overcome Them (06/15/2020)
We analyze the influence of adversarial training on the loss landscape o...

Do Input Gradients Highlight Discriminative Features? (02/25/2021)
Interpretability methods that seek to explain instance-specific model pr...

Fast and Scalable Adversarial Training of Kernel SVM via Doubly Stochastic Gradients (07/21/2021)
Adversarial attacks by generating examples which are almost indistinguis...
