Interpretation of Neural Networks is Fragile

10/29/2017
by Amirata Ghorbani, et al.

In order for machine learning to be deployed and trusted in many applications, it is crucial to be able to reliably explain why the machine learning algorithm makes certain predictions. For example, if an algorithm classifies a given pathology image as a malignant tumor, the doctor may need to know which parts of the image led the algorithm to this classification. How to interpret black-box predictors is thus an important and active area of research. A fundamental question is: how much can we trust the interpretation itself? In this paper, we show that interpretation of deep learning predictions is extremely fragile in the following sense: two perceptibly indistinguishable inputs with the same predicted label can be assigned very different interpretations. We systematically characterize the fragility of several widely used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10. Our experiments show that even small random perturbations can change the feature importance, and new systematic perturbations can lead to dramatically different interpretations without changing the label. We extend these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile. Our analysis of the geometry of the Hessian matrix gives insight into why fragility could be a fundamental challenge for current interpretation approaches.
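
The fragility described above is straightforward to probe. Below is a minimal sketch, not the authors' code: it assumes a hypothetical PyTorch image classifier `model` (in eval mode) and a single preprocessed input `x` of shape (1, 3, H, W), computes a vanilla-gradient saliency map, applies a small random perturbation, and measures how many of the top-k "important" pixels are preserved while checking that the predicted label stays the same.

# Minimal sketch (illustrative only): vanilla-gradient saliency and a check of
# how much a tiny random perturbation moves the most important pixels.
# `model` and `x` are assumed stand-ins, not artifacts from the paper.
import torch

def saliency_map(model, x):
    """Absolute gradient of the top class score w.r.t. the input pixels."""
    x = x.clone().requires_grad_(True)
    scores = model(x)
    top_class = scores[0].argmax().item()
    grad = torch.autograd.grad(scores[0, top_class], x)[0]
    return grad.abs().max(dim=1).values.squeeze(0)  # max over color channels

def topk_overlap(s1, s2, k=1000):
    """Fraction of the k highest-saliency pixels shared by two maps."""
    i1 = set(torch.topk(s1.flatten(), k).indices.tolist())
    i2 = set(torch.topk(s2.flatten(), k).indices.tolist())
    return len(i1 & i2) / k

def random_fragility_check(model, x, eps=8 / 255):
    """Add small random noise; report whether the label holds and how saliency shifts."""
    x_pert = x + eps * torch.sign(torch.randn_like(x))
    with torch.no_grad():
        same_label = model(x).argmax(1).item() == model(x_pert).argmax(1).item()
    overlap = topk_overlap(saliency_map(model, x), saliency_map(model, x_pert))
    return same_label, overlap

This only reproduces the weakest setting in the abstract (random noise); the paper's systematic perturbations are designed to change the interpretation far more dramatically while keeping the label fixed.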


Related research

09/08/2018 · Interpreting Neural Networks With Nearest Neighbors
Local model interpretation methods explain individual predictions by ass...

08/16/2021 · Synthesizing Pareto-Optimal Interpretations for Black-Box Models
We present a new multi-objective optimization approach for synthesizing ...

11/17/2020 · Learning outside the Black-Box: The pursuit of interpretable models
Machine Learning has proved its ability to produce accurate models but t...

07/31/2019 · Local Interpretation Methods to Machine Learning Using the Domain of the Feature Space
As machine learning becomes an important part of many real world applica...

08/11/2021 · Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing
Interpretability methods like Integrated Gradient and LIME are popular c...

04/12/2021 · Evaluating Saliency Methods for Neural Language Models
Saliency methods are widely used to interpret neural network predictions...

10/31/2022 · Consistent and Truthful Interpretation with Fourier Analysis
For many interdisciplinary fields, ML interpretations need to be consist...
