The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

04/26/2023
by Zhao Song et al.

Large language models (LLMs) are known for their exceptional performance in natural language processing, making them highly effective in many everyday and professional tasks. The attention mechanism in the Transformer architecture is a critical component of LLMs, as it allows the model to selectively focus on specific parts of the input. The softmax unit, a key part of the attention mechanism, normalizes the attention scores. Hence, the performance of LLMs on various NLP tasks depends significantly on the attention mechanism with its softmax unit. In-context learning, one of the celebrated abilities of recent LLMs, is an important concept in querying LLMs such as ChatGPT: without any further parameter updates, Transformers can learn to predict from a few in-context examples. However, why Transformers become in-context learners is not well understood. Recently, several works [ASA+22, GTLV22, ONR+22] have studied in-context learning from a mathematical perspective based on the linear regression formulation min_x ‖Ax − b‖_2, showing Transformers' capability of learning linear functions in context. In this work, we study in-context learning based on the softmax regression formulation min_x ‖⟨exp(Ax), 1_n⟩^{-1} exp(Ax) − b‖_2 of the Transformer's attention mechanism. We show upper bounds on the data transformations induced by a single self-attention layer and by gradient descent on an ℓ_2 regression loss for the softmax prediction function. These bounds imply that when self-attention-only Transformers are trained on fundamental regression tasks, the models learned by gradient descent and by the Transformer are highly similar.
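To make the formulation above concrete, here is a minimal NumPy sketch of the softmax prediction function ⟨exp(Ax), 1_n⟩^{-1} exp(Ax) and a plain gradient-descent step on a squared ℓ_2 regression loss against a target b. The dimensions, learning rate, step count, and toy data are illustrative assumptions rather than the paper's setup, and the squared loss is used only because it gives a clean closed-form gradient.

```python
import numpy as np

def softmax_pred(A, x):
    """Softmax prediction f(x) = <exp(Ax), 1_n>^{-1} exp(Ax)."""
    u = np.exp(A @ x)
    return u / u.sum()

def loss(A, x, b):
    """Squared ell_2 regression loss L(x) = 0.5 * ||f(x) - b||_2^2."""
    return 0.5 * np.sum((softmax_pred(A, x) - b) ** 2)

def grad_step(A, x, b, lr=0.1):
    """One gradient-descent step on L(x).

    For f = softmax(Ax), the Jacobian of f with respect to Ax is
    diag(f) - f f^T, so dL/dx = A^T (diag(f) - f f^T) (f - b).
    """
    f = softmax_pred(A, x)
    r = f - b                         # residual f(x) - b
    J = np.diag(f) - np.outer(f, f)   # softmax Jacobian w.r.t. Ax
    return x - lr * (A.T @ (J @ r))

# Toy usage: a realizable target, then a few hundred gradient steps.
rng = np.random.default_rng(0)
n, d = 8, 4
A = rng.standard_normal((n, d))
b = softmax_pred(A, rng.standard_normal(d))  # target in the model class
x = np.zeros(d)
for _ in range(200):
    x = grad_step(A, x, b, lr=1.0)
print(f"final loss: {loss(A, x, b):.2e}")
```

The softmax Jacobian diag(f) − f fᵀ is exactly where the attention-style normalization enters the gradient dynamics, which is the quantity the paper's bounds compare against a single self-attention layer's transformation.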


Related research

- Transformers learn in-context by gradient descent (12/15/2022)
  Transformers have become the state-of-the-art neural network architectur...

- On the Role of Attention in Prompt-tuning (06/06/2023)
  Prompt-tuning is an emerging strategy to adapt large language models (LL...

- CausalLM is not optimal for in-context learning (08/14/2023)
  Recent empirical evidence indicates that transformer based in-context le...

- A Closer Look at In-Context Learning under Distribution Shifts (05/26/2023)
  In-context learning, a capability that enables a model to learn from inp...

- In-Context Learning for Attention Scheme: from Single Softmax Regression to Multiple Softmax Regression via a Tensor Trick (07/05/2023)
  Large language models (LLMs) have brought significant and transformative...

- Transformer Uncertainty Estimation with Hierarchical Stochastic Attention (12/27/2021)
  Transformers are state-of-the-art in a wide range of NLP tasks and have ...

- Birth of a Transformer: A Memory Viewpoint (06/01/2023)
  Large language models based on transformers have achieved great empirica...
