Max-Margin Token Selection in Attention Mechanism

06/23/2023
by Davoud Ataee Tarzanagh, et al.

The attention mechanism is a central component of the transformer architecture and has driven the phenomenal success of large language models. However, the theoretical principles underlying attention are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model f(X) = ⟨Xv, softmax(XWp)⟩, where X is the token sequence and (v, W, p) are trainable parameters. We prove that running gradient descent on p, or equivalently on W, converges in direction to a max-margin solution that separates locally optimal tokens from non-optimal ones. This formalizes attention as an optimal token-selection mechanism. Remarkably, our results apply to general data and precisely characterize the optimality of tokens in terms of the value embeddings Xv and the problem geometry. We also provide a broader regularization-path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When v and p are optimized simultaneously with the logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions, where v separates the input features based on their labels. Interestingly, the SVM formulation for p is influenced by the support-vector geometry of v. Finally, we verify our theoretical findings via numerical experiments and provide insights.
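The token-selection claim is easy to probe numerically. Below is a minimal NumPy sketch (our own illustration, not the authors' code) that runs gradient ascent on f(X) = ⟨Xv, softmax(XWp)⟩ over p, with v fixed and W = I; the dimensions, learning rate, and step count are illustrative assumptions. Consistent with the theory, the attention weights concentrate on a single (typically the highest value-score) token while ‖p‖ keeps growing, i.e., p diverges in norm but stabilizes in direction.

```python
# Minimal sketch (not the paper's code): gradient ascent on the attention
# parameter p in f(X) = <Xv, softmax(XWp)>, with v and W held fixed.
# All sizes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                       # tokens per sequence, embedding dimension
X = rng.normal(size=(T, d))       # token sequence
v = rng.normal(size=d)            # fixed value / prediction head
W = np.eye(d)                     # fixed key-query matrix (identity for simplicity)
p = np.zeros(d)                   # trainable attention parameter

def softmax(z):
    z = z - z.max()               # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = X @ v                    # per-token value scores (Xv)
K = X @ W                         # key features (XW)
lr = 0.1
for step in range(10_000):
    s = softmax(K @ p)            # attention weights over tokens
    # df/dp = K^T (diag(s) - s s^T) (Xv), written elementwise:
    grad_f = K.T @ (s * (scores - s @ scores))
    p += lr * grad_f              # ascent on f == descent on the loss -f

s = softmax(K @ p)
print("token with largest value score:", int(np.argmax(scores)))
print("final attention weights:", np.round(s, 3))     # near one-hot
print("||p||:", float(np.linalg.norm(p)))             # keeps growing with steps
```

In this random instance the attention weights become nearly one-hot on a locally optimal token, while ‖p‖ grows without bound; only the direction p/‖p‖ stabilizes, which is the sense in which convergence to the max-margin separator holds.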

Related research:
Transformers as Support Vector Machines (08/31/2023)
Robustifying Token Attention for Vision Transformers (03/20/2023)
On the Role of Attention in Prompt-tuning (06/06/2023)
The Performance Analysis of Generalized Margin Maximizer (GMM) on Separable Data (10/29/2020)
Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models (05/17/2019)
On the Dynamics of Training Attention Models (11/19/2020)
A Study on Token Pruning for ColBERT (12/13/2021)
