What Does BERT Look At? An Analysis of BERT's Attention

06/11/2019
by   Kevin Clark, et al.

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT's attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT's attention.
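As a rough illustration of the kind of head-level analysis described in the abstract, the sketch below uses the HuggingFace transformers library (an assumption for illustration, not the authors' released code) to pull BERT's per-head attention maps and report, for each layer, the head that places the most attention mass on the [SEP] delimiter, one of the patterns the paper highlights.

```python
# Minimal sketch (not the paper's code): extract BERT attention maps with the
# HuggingFace `transformers` library and measure how much each head attends to [SEP].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The scientist ran the experiment and reported the results."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]

# For every layer and head, average (over all query positions) the attention
# weight placed on [SEP]; high values reproduce the "attend to delimiter" pattern.
for layer, layer_att in enumerate(attentions):
    att = layer_att[0]                                     # (num_heads, seq_len, seq_len)
    sep_mass = att[:, :, sep_positions].sum(-1).mean(-1)   # (num_heads,)
    top_head = int(sep_mass.argmax())
    print(f"layer {layer:2d}: head {top_head} puts "
          f"{sep_mass[top_head]:.2f} of its attention on [SEP] on average")
```

The same per-head tensors can be scanned for the other patterns the paper reports, such as attention to fixed positional offsets (previous or next token) or broad attention spread over the whole sentence.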


Related research

03/07/2021
Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees
Pre-trained language models like BERT achieve superior performances in v...

11/02/2020
How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention
Recent research on the multi-head attention mechanism, especially that i...

10/11/2020
SMYRF: Efficient Attention using Asymmetric Clustering
We propose a novel type of balanced clustering algorithm to approximate ...

12/30/2020
Improving BERT with Syntax-aware Local Attention
Pre-trained Transformer-based neural language models, such as BERT, have...

05/12/2020
On the Robustness of Language Encoders against Grammatical Errors
We conduct a thorough study to diagnose the behaviors of pre-trained lan...

03/02/2022
Discontinuous Constituency and BERT: A Case Study of Dutch
In this paper, we set out to quantify the syntactic capacity of BERT in ...

07/06/2022
The Role of Complex NLP in Transformers for Text Ranking?
Even though term-based methods such as BM25 provide strong baselines in ...
