Trained Transformers Learn Linear Models In-Context

06/16/2023
by   Ruiqi Zhang, et al.

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, transformers can behave like supervised learning algorithms. Indeed, recent work has shown that when transformer architectures are trained over random instances of linear regression problems, the models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer remains brittle under mild covariate shifts.
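The connection between a linear self-attention layer and a linear learning algorithm can be made concrete with a small numerical sketch. The snippet below (an illustration with hand-picked attention weights and a noise-free task, not the paper's trained parameters) embeds labeled pairs (x_i, y_i) and a query (x_query, 0) as tokens, applies a softmax-free linear attention read-out, and checks that the resulting prediction coincides with one gradient-descent step on the least-squares loss started from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                             # feature dimension, prompt length

# A linear regression task: y_i = <w, x_i> (noise-free for clarity).
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w
x_query = rng.normal(size=d)

# Prompt tokens z_i = (x_i, y_i); query token z_query = (x_query, 0),
# with the unknown label slot set to zero.
Z = np.hstack([X, y[:, None]])            # shape (n, d+1)
z_query = np.append(x_query, 0.0)

# Hand-picked weights (an assumption for illustration): the key/query
# product reads only the x-part of each token, and the value projection
# reads only the y-part.
W_KQ = np.diag(np.append(np.ones(d), 0.0))

# Linear self-attention prediction for the query token: no softmax,
# scores averaged over the n labeled tokens.
attn_scores = Z @ W_KQ @ z_query          # = x_i . x_query for each i
pred_lsa = (attn_scores * y).sum() / n    # values are the labels y_i

# One gradient-descent step on the least-squares loss from w = 0 with
# step size 1/n gives the estimate (1/n) X^T y, hence the prediction:
pred_gd = x_query @ (X.T @ y) / n

print(np.isclose(pred_lsa, pred_gd))      # the two predictions coincide
```

The algebra is exact here: summing (x_i . x_query) y_i over the prompt equals x_query^T X^T y, which is why this particular weight choice makes linear self-attention reproduce a single preconditioner-free gradient step.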


Related research

05/26/2023 — A Closer Look at In-Context Learning under Distribution Shifts
06/01/2023 — Transformers learn to implement preconditioned gradient descent for in-context learning
06/08/2023 — In-Context Learning through the Bayesian Prism
07/07/2023 — One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention
06/07/2023 — Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection
11/28/2022 — What learning algorithm is in-context learning? Investigations with linear models
08/31/2023 — Transformers as Support Vector Machines
