Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models

09/02/2021
by Sandipan Sikdar, et al.

In this paper, we introduce Integrated Directional Gradients (IDG), a method for attributing importance scores to groups of features, indicating their relevance to the output of a neural network model for a given input. The success of deep neural networks has been attributed to their ability to capture higher-level feature interactions, and capturing the importance of these interactions has consequently gained prominence in the ML interpretability literature in recent years. We formally define the feature group attribution problem and outline a set of axioms that any intuitive feature group attribution method should satisfy. Earlier axiomatic methods inspired by cooperative game theory borrowed axioms only from solution concepts (such as the Shapley value) for individual feature attributions and introduced their own extensions to model interactions. In contrast, our formulation is inspired by axioms satisfied by characteristic functions as well as by solution concepts in the cooperative game theory literature. We believe that characteristic functions are much better suited than solution concepts alone for modeling the importance of groups. We demonstrate that our proposed method, IDG, satisfies all the axioms. Using IDG, we analyze two state-of-the-art text classifiers on three benchmark sentiment analysis datasets. Our experiments show that IDG effectively captures semantic interactions in linguistic models, such as those arising from negation and conjunction.
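As background for readers unfamiliar with gradient-based attribution, the sketch below shows standard Integrated Gradients (Sundararajan et al., 2017), which IDG builds on, together with a deliberately simplified group score that just sums per-feature attributions over a group. Note that `group_attribution` here is a hypothetical illustration of the feature-group attribution *problem*, not the paper's actual IDG formula, and `numerical_grad` stands in for automatic differentiation.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f (stand-in for autodiff)."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def integrated_gradients(f, x, baseline, steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum
    along the straight-line path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += numerical_grad(f, baseline + a * (x - baseline))
    return (x - baseline) * total / steps

def group_attribution(f, x, baseline, group, steps=50):
    """Illustrative group score: sum of per-feature IG over the group.
    (A simplification for exposition; IDG defines group scores axiomatically.)"""
    ig = integrated_gradients(f, x, baseline, steps)
    return ig[list(group)].sum()
```

A useful sanity check is the completeness axiom: the attributions over all features should sum to `f(x) - f(baseline)`, which any reasonable group attribution method is also expected to respect for the grand coalition.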

