On the Faithfulness Measurements for Model Interpretations

by   Fan Yin, et al.

Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretations, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent they conform to the reasoning process behind the model. To tackle these issues, we start with three criteria that quantify different notions of faithfulness: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations. We then propose novel paradigms to systematically evaluate interpretations in NLP. Our results show that the performance of interpretations under different criteria of faithfulness can vary substantially. Motivated by the desiderata of these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial robustness domain. Empirical results show that our proposed methods achieve top performance under all three criteria. Through experiments and analysis on both text classification and dependency parsing tasks, we arrive at a more comprehensive understanding of the diverse set of interpretations.
