Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

08/11/2021
by Sanchit Sinha, et al.

Interpretability methods such as Integrated Gradients and LIME are popular choices for explaining natural language model predictions with relative word-importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations to an input text. Using only a small fraction of word-level swaps, these adversarial perturbations aim to keep the resulting text semantically and spatially similar to its seed input (and therefore expected to share similar interpretations). At the same time, the generated examples receive the same prediction label as the seed yet are assigned substantially different explanations by the interpretation methods. Our experiments generate fragile interpretations that attack two state-of-the-art interpretation methods, across three popular Transformer models and two NLP datasets. We observe that the rank-order correlation drops by over 20% on average, and that it keeps decreasing as more words are perturbed. Furthermore, we demonstrate that the candidates generated by our method score well on quality metrics.
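As a minimal sketch of how such fragility can be quantified (not the authors' code), one can compare the word-importance scores produced for the seed sentence and for a label-preserving perturbed sentence using a rank-order correlation such as Spearman's rho. The `interpretation_shift` helper and the example scores below are hypothetical; in practice the scores would come from an attribution method such as Integrated Gradients or LIME.

```python
# Illustrative sketch: rank-order correlation between word-importance scores
# for a seed input and a label-preserving, word-level perturbed input.
# A value near 1.0 means the explanation is stable; a large drop under the
# perturbation indicates a fragile interpretation.
from scipy.stats import spearmanr


def interpretation_shift(original_scores, perturbed_scores):
    """Spearman rank correlation between two aligned importance-score lists."""
    rho, _ = spearmanr(original_scores, perturbed_scores)
    return rho


# Hypothetical importance scores (one per token) from an attribution method
# applied to the seed sentence and to its perturbed counterpart.
seed_scores = [0.42, 0.05, 0.31, 0.12, 0.10]
perturbed_scores = [0.08, 0.44, 0.11, 0.29, 0.08]
print(f"rank-order correlation: {interpretation_shift(seed_scores, perturbed_scores):.2f}")
```

Under this view, the paper's observation of a 20%+ average drop corresponds to a substantially lower correlation between the two score rankings even though the model's predicted label is unchanged.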


