The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

01/22/2021
by Madhura Pande, et al.

Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often average quantities measured across sentences without capturing statistical significance. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employ hypothesis testing on this score for robust inference. This provides us with the right lens to systematically analyze attention heads and confidently comment on many commonly posed questions about the BERT model. In particular, we comment on the co-location of multiple functional roles in the same attention head, the distribution of attention heads across layers, and the effect of fine-tuning for specific NLP tasks on these functional roles.
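To make the kind of analysis described in the abstract concrete, below is a minimal sketch of how one might score a single attention head for one functional role and test that score for statistical significance. The specific score used here (the fraction of query tokens whose most-attended key satisfies the role's relation), the binomial test against a chance-level null, and all names (local_relation, head_role_score, test_head) are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch: score one attention head for a functional role and
# test significance. The score here is the fraction of query tokens whose
# top-attended key satisfies the role's relation, aggregated over sentences
# and compared against a chance-level null with a one-sided binomial test.
import numpy as np
from scipy.stats import binomtest

def local_relation(i, j, window=2):
    """Example 'local' role: key j lies within a small window around query i."""
    return i != j and abs(i - j) <= window

def head_role_score(attn, relation):
    """attn: (seq_len, seq_len) attention matrix of one head for one sentence
    (rows = queries, softmax-normalised). Returns (hits, number of queries)."""
    top_keys = attn.argmax(axis=-1)  # key each query attends to most
    hits = sum(relation(i, j) for i, j in enumerate(top_keys))
    return hits, attn.shape[0]

def test_head(attn_matrices, relation, chance_rate):
    """Aggregate the score over sentences and test H0: score <= chance_rate."""
    hits = n = 0
    for attn in attn_matrices:
        h, q = head_role_score(attn, relation)
        hits, n = hits + h, n + q
    result = binomtest(hits, n, p=chance_rate, alternative="greater")
    return hits / n, result.pvalue

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake attention for 50 sentences of length 12 (stand-in for real BERT heads).
    fake = [rng.dirichlet(np.ones(12), size=12) for _ in range(50)]
    # Chance rate for the 'local' role with window 2 in a length-12 sentence is
    # roughly 4/12 (conservative, since edge tokens have fewer neighbours).
    score, pval = test_head(fake, local_relation, chance_rate=4 / 12)
    print(f"score={score:.3f}  p-value={pval:.3f}")

In this sketch a head would be labelled "local" only if the aggregated score significantly exceeds the chance rate, which mirrors the abstract's point that average scores alone, without a test of statistical significance, are not enough to assign a functional role.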


Related research

08/13/2020  On the Importance of Local Information in Transformer Based Models
The self-attention module is a key component of Transformer-based models...

06/22/2022  Towards Unsupervised Content Disentanglement in Sentence Representations via Syntactic Roles
Linking neural representations to linguistic factors is crucial in order...

11/10/2019  Syntax-Infused Transformer and BERT models for Machine Translation and Natural Language Understanding
Attention-based models have shown significant improvement over tradition...

12/28/2020  Syntax-Enhanced Pre-trained Model
We study the problem of leveraging the syntactic structure of text to en...

03/25/2021  BERT4SO: Neural Sentence Ordering by Fine-tuning BERT
Sentence ordering aims to arrange the sentences of a given text in the c...

12/22/2020  Multi-Head Self-Attention with Role-Guided Masks
The state of the art in learning meaningful semantic representations of ...

05/25/2019  Are Sixteen Heads Really Better than One?
Attention is a powerful and ubiquitous mechanism for allowing neural mod...
