Feature Interactions Reveal Linguistic Structure in Language Models

06/21/2023
by   Jaap Jumelet, et al.
We study feature interactions in the context of feature attribution methods for post-hoc interpretability. In interpretability research, getting to grips with feature interactions is increasingly recognised as an important challenge, because interacting features are key to the success of neural networks. Feature interactions allow a model to build up hierarchical representations of its input, and may provide an ideal starting point for investigating linguistic structure in language models. However, uncovering the exact role that these interactions play is difficult, and a diverse range of interaction attribution methods has been proposed. In this paper, we focus on the question of which of these methods most faithfully reflects the inner workings of the target models. We develop a grey-box methodology in which we train models to perfection on a formal language classification task, using probabilistic context-free grammars (PCFGs). We show that under specific configurations, some methods are indeed able to uncover the grammatical rules acquired by a model. Based on these findings, we extend our evaluation to a case study on language models, providing novel insights into the linguistic structure that these models have acquired.
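To make the notion of a feature interaction concrete, here is a minimal sketch of one common family of interaction attribution scores: a second-order finite difference against a baseline, which measures how much two features contribute jointly beyond their individual contributions. The toy model, variable names, and baseline choice are illustrative assumptions, not the specific methods evaluated in the paper.

```python
def interaction_score(f, x, baseline, i, j):
    """Second-order difference interaction for features i and j:
    f(x_ij) - f(x_i) - f(x_j) + f(baseline), where x_S keeps the
    features in S from x and replaces the rest with the baseline."""
    def patched(indices):
        z = list(baseline)
        for k in indices:
            z[k] = x[k]
        return f(z)

    return patched([i, j]) - patched([i]) - patched([j]) + patched([])


# Toy model with a genuine (multiplicative) interaction between
# features 0 and 1, while feature 2 contributes purely additively.
f = lambda z: z[0] * z[1] + z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

print(interaction_score(f, x, baseline, 0, 1))  # 2.0: the pair interacts
print(interaction_score(f, x, baseline, 0, 2))  # 0.0: additive, no interaction
```

A purely additive model yields zero for every pair under this score, which is the sense in which interaction attribution goes beyond per-feature attribution methods.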
