Towards Robust Explanations for Deep Neural Networks

by Ann-Kathrin Dombrowski et al.

Explanation methods shed light on the decision process of black-box classifiers such as deep neural networks. However, their usefulness can be compromised because they are susceptible to manipulation. With this work, we aim to enhance the resilience of explanations. We develop a unified theoretical framework for deriving bounds on the maximal manipulability of a model. Based on these theoretical insights, we present three different techniques to boost robustness against manipulation: training with weight decay, smoothing activation functions, and minimizing the Hessian of the network. Our experimental results confirm the effectiveness of these approaches.
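The three techniques named in the abstract can be combined in an ordinary training loop. The sketch below is illustrative, not the authors' implementation: weight decay is applied through the optimizer's L2 term, ReLU is replaced by Softplus (whose `beta` parameter controls smoothness), and, as a simple stand-in for minimizing the network's Hessian, a double-backpropagation penalty on the input-gradient norm is used as a curvature proxy. The architecture, data, and penalty weight are all hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small classifier. Softplus replaces ReLU; larger beta
# approaches ReLU, smaller beta gives a smoother (lower-curvature) network.
model = nn.Sequential(
    nn.Linear(4, 16), nn.Softplus(beta=5.0),
    nn.Linear(16, 3),
)

# Technique 1: weight decay as L2 regularisation in the optimiser.
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy data; requires_grad on the inputs enables the curvature proxy below.
x = torch.randn(32, 4, requires_grad=True)
y = torch.randint(0, 3, (32,))

for _ in range(5):
    opt.zero_grad()
    task_loss = loss_fn(model(x), y)
    # Technique 3 (proxy): penalise the input-gradient norm via double
    # backprop (create_graph=True), discouraging sharp curvature that
    # makes saliency explanations easy to manipulate.
    (g,) = torch.autograd.grad(task_loss, x, create_graph=True)
    loss = task_loss + 0.1 * g.pow(2).sum()
    loss.backward()
    opt.step()
```

The gradient-norm penalty is a cheap surrogate; penalising the Hessian's Frobenius norm directly would require a further differentiation pass through `g`.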



Don't Paint It Black: White-Box Explanations for Deep Learning in Computer Security

Deep learning is increasingly used as a basic building block of security...

Defend Deep Neural Networks Against Adversarial Examples via Fixed and Dynamic Quantized Activation Functions

Recent studies have shown that deep neural networks (DNNs) are vulnerabl...

Explanations can be manipulated and geometry is to blame

Explanation methods aim to make neural networks more trustworthy and int...

Do Explanations make VQA Models more Predictable to a Human?

A rich line of research attempts to make deep neural networks more trans...

Fairwashing Explanations with Off-Manifold Detergent

Explanation methods promise to make black-box classifiers more transpare...

FitAct: Error Resilient Deep Neural Networks via Fine-Grained Post-Trainable Activation Functions

Deep neural networks (DNNs) are increasingly being deployed in safety-cr...

Iterative and Adaptive Sampling with Spatial Attention for Black-Box Model Explanations

Deep neural networks have achieved great success in many real-world appl...