Do Attention Heads in BERT Track Syntactic Dependencies?

11/27/2019
by Phu Mon Htut, et al.

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods—taking the maximum attention weight and computing the maximum spanning tree—to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets—the syntax-oriented CoLA and the semantics-oriented MNLI—to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.
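To make the two extraction methods concrete, the sketch below (Python, using numpy and networkx; not the authors' released code) shows how implicit dependency edges can be read off a single head's attention matrix. Method 1 predicts, for each token, the position it attends to most strongly; method 2 treats the attention weights as edge scores and extracts a maximum spanning arborescence (Chu-Liu/Edmonds) over the sentence. The function names, the choice of edge direction, and the omission of special-token and word-piece handling are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of the two extraction
# methods described above, applied to one head's attention matrix `attn`
# of shape (seq_len, seq_len), where attn[i, j] is how strongly token i
# attends to token j.
import numpy as np
import networkx as nx


def max_attention_heads(attn: np.ndarray) -> list[int]:
    """Method 1: predict each token's syntactic head as the position it
    attends to most strongly (self-attention excluded)."""
    scores = attn.copy()
    np.fill_diagonal(scores, -np.inf)  # a token cannot head itself
    return scores.argmax(axis=1).tolist()


def mst_heads(attn: np.ndarray) -> dict[int, int]:
    """Method 2: build a dense directed graph weighted by attention and
    extract a maximum spanning arborescence (Chu-Liu/Edmonds, as
    implemented in networkx), yielding a single tree over the sentence."""
    n = attn.shape[0]
    g = nx.DiGraph()
    for head in range(n):
        for dep in range(n):
            if head != dep:
                # Edge head -> dep scored by how much the dependent attends
                # to the head; the opposite convention is equally plausible.
                g.add_edge(head, dep, weight=float(attn[dep, head]))
    tree = nx.maximum_spanning_arborescence(g, attr="weight")
    return {dep: head for head, dep in tree.edges()}


if __name__ == "__main__":
    # Toy attention matrix; real inputs would come from a pretrained model,
    # e.g. a Hugging Face BERT called with output_attentions=True.
    rng = np.random.default_rng(0)
    toy = rng.random((5, 5))
    toy /= toy.sum(axis=1, keepdims=True)  # rows behave like softmax outputs
    print(max_attention_heads(toy))
    print(mst_heads(toy))
```

A real evaluation would additionally need to exclude special tokens such as [CLS] and [SEP], align word pieces back to words, and score the predicted edges against gold UD trees; those steps are omitted here.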

Related research

04/30/2020 · Universal Dependencies according to BERT: both more specific and more general
01/26/2021 · Attention Can Reflect Syntactic Structure (If You Let It)
02/16/2021 · Have Attention Heads in BERT Learned Constituency Grammar?
01/27/2021 · On the Evolution of Syntactic Information Encoded by BERT's Contextualized Representations
05/22/2023 · GATology for Linguistics: What Syntactic Dependencies It Knows
06/05/2019 · From Balustrades to Pierre Vinken: Looking for Syntax in Transformer Self-Attentions
11/11/2022 · The Architectural Bottleneck Principle
