Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

09/15/2021
by   Goro Kobayashi, et al.

The Transformer architecture has become ubiquitous in natural language processing. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture does not consist solely of multi-head attention; other components can also contribute to Transformers' impressive performance. In this study, we extend the scope of the analysis of Transformers from the attention patterns alone to the whole attention block, i.e., multi-head attention, residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect performance. The code for our experiments is publicly available.
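For readers unfamiliar with the block structure the abstract refers to, the sketch below shows how the residual connection and layer normalization wrap the attention output in a post-layer-norm block (as in BERT-style masked language models). This is a minimal pure-Python illustration with a toy single-head attention and identity query/key/value projections; all function names are illustrative assumptions, not the authors' code.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(x):
    # toy scaled dot-product self-attention over token vectors,
    # with identity Q/K/V projections for simplicity
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in x])
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

def layer_norm(v, eps=1e-5):
    # normalize a single token vector to zero mean and unit variance
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def attention_block(x):
    # the whole attention block analyzed in the paper:
    # LayerNorm(x + Attention(x)), i.e. the residual stream carries x
    # past the attention output before normalization
    attn = attention(x)
    return [layer_norm([xi + ai for xi, ai in zip(tok, a)])
            for tok, a in zip(x, attn)]
```

The point of the residual term is visible here: each output token is a normalized mix of its own input vector and the attention-weighted average of all tokens, so the attention pattern alone does not determine the intermediate representation.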


Related research:

- Feed-Forward Blocks Control Contextualization in Masked Language Models (02/01/2023): Understanding the inner workings of neural network models is a crucial s...
- Measuring the Mixing of Contextual Information in the Transformer (03/08/2022): The Transformer architecture aggregates input information through the se...
- A Neural ODE Interpretation of Transformer Layers (12/12/2022): Transformer layers, which use an alternating pattern of multi-head atten...
- Traveling Words: A Geometric Interpretation of Transformers (09/13/2023): Transformers have significantly advanced the field of natural language p...
- RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling (08/07/2023): The Attention module finds common usage in language modeling, presenting...
- Attention-Only Transformers and Implementing MLPs with Attention Heads (09/15/2023): The transformer architecture is widely used in machine learning models a...
- Quantifying Context Mixing in Transformers (01/30/2023): Self-attention weights and their transformed variants have been the main...
