Memorization Capacity of Multi-Head Attention in Transformers

06/03/2023
by Sadegh Mahdavi et al.

In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear-independence assumption on the input data, we present a theoretical analysis showing that an H-head attention layer with context size n, dimension d, and O(Hd^2) parameters can memorize O(Hn) examples. We conduct experiments on an image classification task with a Vision Transformer to verify our assumptions, and we perform synthetic experiments that validate our theoretical findings by showing a linear relationship between memorization capacity and the number of attention heads.
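As a rough illustration of the claimed scaling (a minimal sketch, not the authors' construction), one can train a single multi-head self-attention layer with a linear readout to fit random labels on synthetic data and watch how training accuracy behaves as the number of heads H grows. The dimensions, pooling, optimizer settings, and label setup below are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n = 32, 8          # embedding dimension and context size (assumed values)
num_examples = 128    # synthetic examples with random labels
num_classes = 10

# Random token sequences (linearly independent with high probability) and random labels.
X = torch.randn(num_examples, n, d)
y = torch.randint(0, num_classes, (num_examples,))

def memorization_accuracy(H, steps=1000):
    """Train an H-head self-attention layer plus a linear readout to fit random labels."""
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)
    readout = nn.Linear(d, num_classes)
    params = list(attn.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        out, _ = attn(X, X, X)              # self-attention over the context
        logits = readout(out.mean(dim=1))   # mean-pool tokens, then classify
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        out, _ = attn(X, X, X)
        preds = readout(out.mean(dim=1)).argmax(dim=-1)
    return (preds == y).float().mean().item()

for H in (1, 2, 4, 8):
    print(f"H={H}: training accuracy {memorization_accuracy(H):.2f}")
```

If the linear O(Hn) scaling holds, a fixed synthetic dataset that a small H cannot fit should become memorizable (training accuracy near 1) once H is increased enough.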


Related research

06/17/2021 · Multi-head or Single-head? An Empirical Comparison for Transformer Training
Multi-head attention plays a crucial role in the recent success of Trans...

07/18/2022 · Multi-manifold Attention for Vision Transformers
Vision Transformers are very popular nowadays due to their state-of-the-a...

06/30/2021 · On the Power of Saturated Transformers: A View from Circuit Complexity
Transformers have become a standard architecture for many NLP problems. ...

06/17/2022 · SimA: Simple Softmax-free Attention for Vision Transformers
Recently, vision transformers have become very popular. However, deployi...

05/31/2021 · Cascaded Head-colliding Attention
Transformers have advanced the field of natural language processing (NLP...

06/05/2023 · Representational Strengths and Limitations of Transformers
Attention layers, as commonly used in transformers, form the backbone of...

09/15/2023 · Attention-Only Transformers and Implementing MLPs with Attention Heads
The transformer architecture is widely used in machine learning models a...
