Memorization Capacity of Multi-Head Attention in Transformers

06/03/2023
by Sadegh Mahdavi et al.

In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear independence assumption on the input data, we present a theoretical analysis demonstrating that an H-head attention layer with context size n, dimension d, and O(Hd^2) parameters can memorize O(Hn) examples. We conduct experiments that verify our assumptions on the image classification task using the Vision Transformer. To further validate our theoretical findings, we perform synthetic experiments and show a linear relationship between memorization capacity and the number of attention heads.
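The following is a minimal sketch, not the authors' code, of the kind of synthetic memorization experiment the abstract describes: an H-head attention layer whose per-head query/key/value projections are full d x d matrices (so the layer has O(Hd^2) parameters), trained to fit random data, with the fraction of memorized examples tracked as H varies. All hyperparameters (n, d, H, training steps, learning rate, example counts) are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn


class MultiHeadAttnClassifier(nn.Module):
    """One attention layer with H heads of full dimension d, plus a linear readout."""

    def __init__(self, d: int, num_heads: int, num_classes: int):
        super().__init__()
        # Each head keeps the full model dimension d, so the Q/K/V projections
        # contribute roughly 3*H*d^2 parameters, i.e. O(H d^2).
        self.Wq = nn.Parameter(torch.randn(num_heads, d, d) / d**0.5)
        self.Wk = nn.Parameter(torch.randn(num_heads, d, d) / d**0.5)
        self.Wv = nn.Parameter(torch.randn(num_heads, d, d) / d**0.5)
        self.readout = nn.Linear(num_heads * d, num_classes)

    def forward(self, x):                       # x: (batch, n, d)
        q = torch.einsum("bnd,hde->bhne", x, self.Wq)
        k = torch.einsum("bnd,hde->bhne", x, self.Wk)
        v = torch.einsum("bnd,hde->bhne", x, self.Wv)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v)[:, :, 0, :]            # first token's output from each head
        return self.readout(out.flatten(1))     # concatenate heads, then classify


def fraction_memorized(num_examples, n=8, d=16, num_heads=2, num_classes=10,
                       steps=2000, lr=1e-2, seed=0):
    """Train on random (sequence, label) pairs and report the fraction fit exactly."""
    torch.manual_seed(seed)
    X = torch.randn(num_examples, n, d)                  # random token sequences
    y = torch.randint(0, num_classes, (num_examples,))   # random labels
    model = MultiHeadAttnClassifier(d, num_heads, num_classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    return (model(X).argmax(-1) == y).float().mean().item()


if __name__ == "__main__":
    # Sweep H while scaling the number of random examples proportionally;
    # the paper's theory predicts memorization capacity growing linearly with H.
    for H in (1, 2, 4):
        print(H, fraction_memorized(num_examples=200 * H, num_heads=H))
```

Scaling the dataset size in proportion to H and checking whether training accuracy stays near 1.0 is one simple way to probe the claimed linear relationship between capacity and the number of heads.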
