Quantifying Context Mixing in Transformers

01/30/2023
by Hosein Mohebbi, et al.

Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models' decisions: they constitute only one component of an encoder layer, and the other components can have a considerable impact on how information is mixed into the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.
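The core idea behind Value Zeroing (zero out one token's value vector inside an encoder layer and measure how much every token's output representation changes after the full block, including residuals, LayerNorm, and the feed-forward sublayer) can be illustrated with a small sketch. The code below is a hypothetical toy, not the authors' implementation: the `ToyEncoderLayer`, `value_zeroing_scores`, the cosine-distance measure, the row normalization, and all dimensions are illustrative assumptions; a real analysis would apply the same perturbation to each layer of a pretrained Transformer.

```python
# Minimal sketch of the Value Zeroing idea (illustrative assumptions, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoderLayer(nn.Module):
    """A toy single-head encoder block: attention + FFN, each with residual and LayerNorm."""
    def __init__(self, d_model=16, d_ff=32):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x, zero_value_of=None):
        # x: (seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        if zero_value_of is not None:
            v = v.clone()
            v[zero_value_of] = 0.0                       # zero the value vector of one token
        attn = F.softmax(q @ k.T / (x.size(-1) ** 0.5), dim=-1)
        h = self.ln1(x + self.out(attn @ v))             # attention + residual + LayerNorm
        return self.ln2(h + self.ffn(h))                 # FFN + residual + LayerNorm

def value_zeroing_scores(layer, x):
    """scores[i, j]: how much token i's output changes when token j's value vector is zeroed."""
    with torch.no_grad():
        base = layer(x)                                  # unperturbed layer outputs
        n = x.size(0)
        scores = torch.zeros(n, n)
        for j in range(n):
            alt = layer(x, zero_value_of=j)              # re-run the block without token j's value
            scores[:, j] = 1 - F.cosine_similarity(base, alt, dim=-1)
        # Row-normalize so each token's context mixing scores sum to 1 (a common convention).
        scores = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return scores

x = torch.randn(5, 16)                                   # 5 toy token representations
print(value_zeroing_scores(ToyEncoderLayer(), x))
```

Because the change is measured on the block's output rather than on the attention weights alone, the score reflects the residual connections, normalization, and feed-forward sublayer as well, which is the point the abstract makes about expanding the analysis to the whole encoder block.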


Related research

03/08/2022
Measuring the Mixing of Contextual Information in the Transformer
The Transformer architecture aggregates input information through the se...

05/06/2022
GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers
There has been a growing interest in interpreting the underlying dynamic...

06/05/2023
DecompX: Explaining Transformers Decisions by Propagating Token Decomposition
An emerging solution for explaining Transformer-based models is to use v...

09/15/2021
Incorporating Residual and Normalization Layers into Analysis of Masked Language Models
Transformer architecture has become ubiquitous in the natural language p...

10/06/2021
PoNet: Pooling Network for Efficient Token Mixing in Long Sequences
Transformer-based models have achieved great success in various NLP, vis...

05/24/2022
Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT
We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with ...

11/22/2021
PointMixer: MLP-Mixer for Point Cloud Understanding
MLP-Mixer has newly appeared as a new challenger against the realm of CN...
