How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

11/07/2022
by Michael Hassid, et al.

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones – the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance – an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.
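
To make the probing idea concrete, here is a minimal PyTorch sketch of constant-attention probing: one head's input-dependent weights are swapped for a matrix averaged over a dataset. It is an illustration under stated assumptions, not the authors' released code; the names ConstantAttention and average_attention are hypothetical, and the averaging step assumes a HuggingFace-style model that exposes per-layer attention via output_attentions=True.

import torch
import torch.nn as nn

class ConstantAttention(nn.Module):
    """One attention head whose weights are a fixed, input-independent matrix."""

    def __init__(self, avg_attn: torch.Tensor, v_proj: nn.Linear):
        super().__init__()
        # avg_attn: (max_len, max_len) attention weights averaged over many inputs
        self.register_buffer("avg_attn", avg_attn)
        self.v_proj = v_proj  # the pretrained value projection is reused as-is

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden); the output projection is omitted for brevity
        n = x.size(1)
        attn = self.avg_attn[:n, :n]
        attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize rows after cropping
        return attn @ self.v_proj(x)  # constant mixing of the value vectors

def average_attention(model, loader, layer: int, head: int, max_len: int) -> torch.Tensor:
    """Average one head's attention weights over a dataset.

    Assumes a HuggingFace-style model returning per-layer attention tensors of
    shape (batch, heads, seq, seq) when called with output_attentions=True.
    Positions beyond a batch's padded length are simply left untouched here; a
    fuller implementation would track per-position counts.
    """
    total = torch.zeros(max_len, max_len)
    count = 0
    with torch.no_grad():
        for batch in loader:
            out = model(**batch, output_attentions=True)
            a = out.attentions[layer][:, head]  # (batch, seq, seq)
            n = a.size(-1)
            total[:n, :n] += a.sum(dim=0)
            count += a.size(0)
    return total / count

Swapping some fraction of a model's heads for such constant matrices and measuring the downstream performance drop is, at a high level, the kind of probe the abstract describes.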

