Memorization Capacity of Multi-Head Attention in Transformers

06/03/2023
by Sadegh Mahdavi et al.

In this paper, we investigate the memorization capabilities of multi-head attention in Transformers, motivated by the central role attention plays in these models. Under a mild linear-independence assumption on the input data, we present a theoretical analysis showing that an H-head attention layer with context size n, dimension d, and O(Hd^2) parameters can memorize O(Hn) examples. We conduct experiments on an image classification task with a Vision Transformer to verify our assumptions, and we perform synthetic experiments that validate our theoretical findings by showing a linear relationship between memorization capacity and the number of attention heads.
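As a rough illustration of the claimed scaling (a minimal sketch, not the authors' construction), one can train a single multi-head self-attention layer with a linear readout to fit random labels on synthetic data and watch how training accuracy behaves as the number of heads H grows. The dimensions, pooling, optimizer settings, and label setup below are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, n = 32, 8          # embedding dimension and context size (assumed values)
num_examples = 128    # synthetic examples with random labels
num_classes = 10

# Random token sequences (linearly independent with high probability) and random labels.
X = torch.randn(num_examples, n, d)
y = torch.randint(0, num_classes, (num_examples,))

def memorization_accuracy(H, steps=1000):
    """Train an H-head self-attention layer plus a linear readout to fit random labels."""
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=H, batch_first=True)
    readout = nn.Linear(d, num_classes)
    params = list(attn.parameters()) + list(readout.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        out, _ = attn(X, X, X)              # self-attention over the context
        logits = readout(out.mean(dim=1))   # mean-pool tokens, then classify
        loss = loss_fn(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        out, _ = attn(X, X, X)
        preds = readout(out.mean(dim=1)).argmax(dim=-1)
    return (preds == y).float().mean().item()

for H in (1, 2, 4, 8):
    print(f"H={H}: training accuracy {memorization_accuracy(H):.2f}")
```

If the linear O(Hn) scaling holds, a fixed synthetic dataset that a small H cannot fit should become memorizable (training accuracy near 1) once H is increased enough.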


Related research

06/17/2021 · Multi-head or Single-head? An Empirical Comparison for Transformer Training
Multi-head attention plays a crucial role in the recent success of Trans...

07/18/2022 · Multi-manifold Attention for Vision Transformers
Vision Transformers are very popular nowadays due to their state-of-the-a...

06/30/2021 · On the Power of Saturated Transformers: A View from Circuit Complexity
Transformers have become a standard architecture for many NLP problems. ...

06/17/2022 · SimA: Simple Softmax-free Attention for Vision Transformers
Recently, vision transformers have become very popular. However, deployi...

05/31/2021 · Cascaded Head-colliding Attention
Transformers have advanced the field of natural language processing (NLP...

06/05/2023 · Representational Strengths and Limitations of Transformers
Attention layers, as commonly used in transformers, form the backbone of...

09/15/2023 · Attention-Only Transformers and Implementing MLPs with Attention Heads
The transformer architecture is widely used in machine learning models a...
