Faithful to Whom? Questioning Interpretability Measures in NLP

08/13/2023
by Evan Crothers, et al.

A common approach to quantifying model interpretability is to calculate faithfulness metrics based on iteratively masking input tokens and measuring how much the predicted label changes as a result. However, we show that such metrics are generally not suitable for comparing the interpretability of different neural text classifiers, because the response to masked inputs is highly model-specific. We demonstrate that iterative masking can produce large variation in faithfulness scores between comparable models, and show that masked samples are frequently outside the distribution seen during training. We further investigate the impact of adversarial attacks and adversarial training on faithfulness scores, and demonstrate the relevance of faithfulness measures for analyzing feature salience in text adversarial attacks. Our findings provide new insights into the limitations of current faithfulness metrics and key considerations for using them appropriately.
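
As a rough illustration of how such iterative-masking faithfulness metrics are typically computed, the Python sketch below masks the k most salient tokens and measures the resulting drop in the predicted-class probability (a comprehensiveness-style score). The mask token, the toy predictor, and the saliency values are illustrative assumptions, not the exact protocol or models used in the paper.

def masking_faithfulness(tokens, saliency, predict_proba, mask_token="[MASK]", k=3):
    """Mask the k most salient tokens; return the drop in predicted-class probability."""
    base = predict_proba(tokens)
    # Indices of the k highest-saliency tokens.
    top_k = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)[:k]
    masked = [mask_token if i in top_k else tok for i, tok in enumerate(tokens)]
    # A larger drop is usually read as the explanation being more "faithful".
    return base - predict_proba(masked)

if __name__ == "__main__":
    # Toy classifier: positive-class probability grows with occurrences of "good".
    def toy_predict(tokens):
        return min(1.0, 0.2 + 0.4 * tokens.count("good"))

    tokens = ["the", "movie", "was", "really", "good", "good"]
    saliency = [0.01, 0.05, 0.02, 0.10, 0.90, 0.85]  # e.g. attention or gradient scores
    print(masking_faithfulness(tokens, saliency, toy_predict, k=2))  # 0.8

Because masked inputs of this kind may lie far from anything a classifier saw during training, different models can react to them very differently, which is exactly the model-specific behavior the abstract highlights.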

Related research

- Robust Evaluation of Diffusion-Based Adversarial Purification (03/16/2023)
- Interpretable Computer Vision Models through Adversarial Training: Unveiling the Robustness-Interpretability Connection (07/04/2023)
- Identifying and Mitigating Spurious Correlations for Improving Robustness in NLP Models (10/14/2021)
- Adversarial Fooling Beyond "Flipping the Label" (04/27/2020)
- Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent (09/10/2020)
- Differentiable Language Model Adversarial Attacks on Categorical Sequence Classifiers (06/19/2020)
- On the Lack of Robust Interpretability of Neural Text Classifiers (06/08/2021)
