Resilience of Bayesian Layer-Wise Explanations under Adversarial Attacks

by Ginevra Carbone, et al.

We consider the problem of the stability of saliency-based explanations of Neural Network predictions under adversarial attacks in a classification task. We empirically show that, for deterministic Neural Networks, saliency interpretations are remarkably brittle even when the attacks fail, i.e. for attacks that do not change the classification label. By leveraging recent results, we provide a theoretical explanation of this result in terms of the geometry of adversarial attacks. Based on these theoretical considerations, we suggest and demonstrate empirically that saliency explanations provided by Bayesian Neural Networks are considerably more stable under adversarial perturbations. Our results not only confirm that Bayesian Neural Networks are more robust to adversarial attacks, but also demonstrate that Bayesian methods have the potential to provide more stable and interpretable assessments of Neural Network predictions.
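To make the notion of saliency stability under attack concrete, here is a minimal sketch and not the authors' implementation. It rests on several assumptions that are not in the abstract: plain input-gradient saliency stands in for the layer-wise explanations in the title, the attack is a one-step FGSM, the model is a toy linear softmax classifier, and the "Bayesian" network is imitated by averaging over randomly perturbed copies of the weights rather than over samples from a real posterior. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def saliency(W, x, label):
    """Gradient of log p(label | x) w.r.t. x for a linear softmax model."""
    p = softmax(W @ x)
    # d/dx [z_label - logsumexp(z)] = W[label] - sum_k p_k * W[k]
    return W[label] - p @ W

def fgsm(W, x, label, eps):
    """One-step FGSM: step of size eps that lowers log p(label | x)."""
    return x - eps * np.sign(saliency(W, x, label))

def bayesian_saliency(Ws, x, label):
    """Average saliency over weight samples (a stand-in for a posterior)."""
    return np.mean([saliency(W, x, label) for W in Ws], axis=0)

def topk_overlap(a, b, k):
    """Fraction of the k largest-magnitude features shared by two maps."""
    top_a = set(np.argsort(-np.abs(a))[:k])
    top_b = set(np.argsort(-np.abs(b))[:k])
    return len(top_a & top_b) / k

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 20))             # toy classifier: 3 classes, 20 features
x = rng.normal(size=20)
label = int(np.argmax(W @ x))            # use the predicted class as the label
x_adv = fgsm(W, x, label, eps=0.1)

# Hypothetical "posterior samples": the same weights plus Gaussian noise.
Ws = [W + 0.1 * rng.normal(size=W.shape) for _ in range(50)]

det = topk_overlap(saliency(W, x, label), saliency(W, x_adv, label), k=5)
bay = topk_overlap(bayesian_saliency(Ws, x, label),
                   bayesian_saliency(Ws, x_adv, label), k=5)
print("deterministic top-5 overlap:", det)
print("averaged top-5 overlap:     ", bay)
```

The top-k overlap is only one possible stability score. On a toy linear model the two numbers need not differ, so this shows how such a comparison can be set up rather than reproducing the reported result.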
