Gradient-based Analysis of NLP Models is Manipulable

10/12/2020
by Junlin Wang, et al.

Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade that overwhelms the gradients without affecting the predictions. This Facade can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial for this paper are available at http://ucinlp.github.io/facade.
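To illustrate the core idea, here is a minimal toy sketch (not the paper's actual Facade architecture or training procedure, and all weights below are hypothetical): if a facade score g(x) is added uniformly to every class logit, the argmax prediction never changes, yet the input gradient of any logit, and hence any gradient-based saliency map, is dominated by the facade's gradient.

```python
# Toy linear setup, gradients computed analytically (no ML framework needed).
# This only illustrates the manipulation principle from the abstract, not the
# paper's layer-merging construction.

def target_logits(x, W):
    # f_i(x) = sum_j W[i][j] * x[j]: a toy linear classifier.
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def facade_score(x, v):
    # g(x) = sum_j v[j] * x[j], with a huge weight on a "stop word" position.
    return sum(vj * xj for vj, xj in zip(v, x))

def merged_logits(x, W, v):
    # The same g(x) is added to every class logit, so argmax is unchanged.
    g = facade_score(x, v)
    return [f + g for f in target_logits(x, W)]

def saliency(W, v, cls, dim):
    # Gradient of the merged logit for class `cls` w.r.t. the input:
    # d/dx_j [f_cls(x) + g(x)] = W[cls][j] + v[j].
    return [W[cls][j] + v[j] for j in range(dim)]

W = [[1.0, -0.5, 0.2], [0.3, 0.8, -0.1]]  # target weights (moderate scale)
v = [0.0, 0.0, 1e4]                       # facade: huge weight on token 2
x = [1.0, 1.0, 1.0]

plain = target_logits(x, W)
merged = merged_logits(x, W, v)
pred_plain = max(range(2), key=lambda i: plain[i])
pred_merged = max(range(2), key=lambda i: merged[i])

grad = saliency(W, v, pred_merged, len(x))
```

Here `pred_plain == pred_merged` (the facade shifts all logits equally), while `grad` is overwhelmingly concentrated on token 2, so a saliency map would flag that token as the most important feature regardless of the task.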

Related research

02/21/2023
Tell Model Where to Attend: Improving Interpretability of Aspect-Based Sentiment Classification via Small Explanation Annotations
Gradient-based explanation methods play an important role in the field o...

07/25/2023
Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions
Chain-of-thought (CoT) prompting has been shown to empirically improve t...

05/14/2020
Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
Modern deep learning models for NLP are notoriously opaque. This has mot...

06/09/2019
Is Attention Interpretable?
Attention mechanisms have recently boosted performance on a range of NLP...

07/22/2019
Sparsity Emerges Naturally in Neural Language Models
Concerns about interpretability, computational resources, and principled...

12/01/2020
Rethinking Positive Aggregation and Propagation of Gradients in Gradient-based Saliency Methods
Saliency methods interpret the prediction of a neural network by showing...

09/26/2014
Gradient-based Taxis Algorithms for Network Robotics
Finding the physical location of a specific network node is a prototypic...
