Do Input Gradients Highlight Discriminative Features?

02/25/2021
by Harshay Shah, et al.

Interpretability methods that seek to explain instance-specific model predictions [Simonyan et al. 2014, Smilkov et al. 2017] are often based on the premise that the magnitude of the input gradient – the gradient of the loss with respect to the input – highlights discriminative features that are relevant for prediction over non-discriminative features that are irrelevant for prediction. In this work, we introduce an evaluation framework to study this hypothesis for benchmark image classification tasks, and we make two surprising observations on the CIFAR-10 and ImageNet-10 datasets: (a) contrary to conventional wisdom, input gradients of standard models (i.e., models trained on the original data) actually highlight irrelevant features over relevant features; (b) however, input gradients of adversarially robust models (i.e., models trained on adversarially perturbed data) starkly highlight relevant features over irrelevant features. To better understand input gradients, we introduce a synthetic testbed and theoretically justify our counter-intuitive empirical findings. Our observations motivate the need to formalize and verify common assumptions in interpretability, and our evaluation framework and synthetic dataset serve as a testbed for rigorously analyzing instance-specific interpretability methods.
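To make the premise concrete, the sketch below computes the quantity under study: a saliency map given by the per-pixel magnitude of the loss gradient with respect to the input. This is a minimal PyTorch illustration of input-gradient saliency in general, not the paper's evaluation framework; the function name and the assumption that `model` is any classifier returning logits are placeholders introduced here.

    import torch
    import torch.nn.functional as F

    def input_gradient_saliency(model: torch.nn.Module,
                                x: torch.Tensor,
                                y: torch.Tensor) -> torch.Tensor:
        """Return per-pixel |d loss / d input| for a batch of images.

        model: any classifier mapping images to logits (placeholder).
        x: images of shape (N, C, H, W); y: integer labels of shape (N,).
        """
        model.eval()
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        (grad,) = torch.autograd.grad(loss, x)
        # Aggregate over channels so each pixel gets one saliency score;
        # the premise under test is that high-magnitude pixels are the
        # discriminative (prediction-relevant) ones.
        return grad.abs().amax(dim=1)  # shape (N, H, W)

Under the stated premise, occluding the highest-saliency pixels should degrade accuracy more than occluding low-saliency pixels; an evaluation framework of the kind the abstract describes tests whether this feature ranking actually holds.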

