Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms

04/21/2020
by Goro Kobayashi, et al.

Attention modules are core components of Transformer-based models, which have recently achieved considerable success in natural language processing, so there is great interest in why they work and what kind of linguistic information they capture. Previous studies have mainly analyzed attention weights to see how much information the attention modules gather from each input to produce an output. In this study, we point out that attention weights are only one of two factors determining the output of a self-attention module, and we propose incorporating the other factor, the transformed input vectors, into the analysis as well. That is, we measure the norm of the weighted vectors as the contribution of each input to an output. Contrary to previous findings, our analysis of the self-attention modules in BERT and a Transformer-based neural machine translation system shows that these modules behave very intuitively: (1) BERT's attention modules pay little attention to special tokens, and (2) the Transformer's attention modules capture word alignment quite well.
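The abstract describes measuring the norm of each weighted, transformed input vector rather than the attention weight alone. Below is a minimal sketch of that idea, assuming a single attention head with small, randomly initialized projection matrices and with bias terms omitted; the shapes and variable names are illustrative and are not taken from the authors' released code.

```python
import numpy as np

# Sketch of norm-based attention analysis for one self-attention head.
# The head's output for position i is y_i = sum_j alpha[i, j] * f(x_j),
# so the contribution of input j to output i can be measured as
# || alpha[i, j] * f(x_j) || instead of the weight alpha[i, j] alone.

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8

X = rng.normal(size=(seq_len, d_model))    # input token vectors (hypothetical)
W_Q = rng.normal(size=(d_model, d_head))   # query projection
W_K = rng.normal(size=(d_model, d_head))   # key projection
W_V = rng.normal(size=(d_model, d_head))   # value projection
W_O = rng.normal(size=(d_head, d_model))   # this head's slice of the output projection

# Attention weights alpha[i, j] = softmax_j(q_i . k_j / sqrt(d_head))
scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_head)
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)

# Transformed inputs f(x_j) = x_j W_V W_O  (biases omitted for brevity)
f_x = X @ W_V @ W_O                        # shape (seq_len, d_model)

# Weight-based view: contribution of input j to output i is alpha[i, j].
weight_contrib = alpha

# Norm-based view: contribution is || alpha[i, j] * f(x_j) ||,
# which equals alpha[i, j] * ||f(x_j)|| since alpha is non-negative.
norm_contrib = alpha * np.linalg.norm(f_x, axis=-1)

print("attention weights:\n", weight_contrib.round(2))
print("norms of weighted vectors:\n", norm_contrib.round(2))
```

The two matrices can rank inputs very differently: a token with a large attention weight but a small transformed vector f(x_j) contributes little to the output, which is the effect the norm-based measure is meant to expose.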

