What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

by Yufeng Zhang, et al.

In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned within language models? (b) What are suitable performance metrics to evaluate ICL accurately, and what are the error rates? (c) How does the transformer architecture enable ICL? To answer (a), we take a Bayesian view and demonstrate that ICL implicitly implements the Bayesian model averaging algorithm. This Bayesian model averaging algorithm is proven to be approximately parameterized by the attention mechanism. For (b), we analyze the ICL performance from an online learning perspective and establish a regret bound of $\mathcal{O}(1/T)$, where $T$ is the ICL input sequence length. To address (c), in addition to the encoded Bayesian model averaging algorithm in attention, we show that during pretraining, the total variation distance between the learned model and the nominal model is bounded by the sum of an approximation error and a generalization error of $\tilde{\mathcal{O}}(1/\sqrt{N_p T_p})$, where $N_p$ and $T_p$ are the number of token sequences and the length of each sequence in pretraining, respectively. Our results provide a unified understanding of the transformer and its ICL ability with bounds on ICL regret, approximation, and generalization, which deepens our knowledge of these essential aspects of modern language models.
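To make the Bayesian model averaging view and the $\mathcal{O}(1/T)$ regret rate concrete, the following is a minimal toy sketch, not the paper's construction. It assumes a finite set of candidate Bernoulli hypotheses (`thetas`), a uniform prior, binary tokens, and regret measured as average log-loss relative to the best fixed hypothesis in hindsight; the function name `bma_average_regret` and all parameters are hypothetical. Under these assumptions, the cumulative regret of the posterior-weighted predictor is at most $\log K$ for $K$ hypotheses, so the average regret is at most $\log(K)/T$.

```python
import math
import random


def bma_average_regret(thetas, sequence):
    """Average log-loss regret of Bayesian model averaging (BMA) against
    the best fixed hypothesis in hindsight.

    thetas   -- candidate Bernoulli parameters, each strictly in (0, 1)
    sequence -- observed binary tokens (0 or 1)
    Toy assumptions: finite hypothesis class, uniform prior.
    """
    k = len(thetas)
    log_post = [math.log(1.0 / k)] * k  # unnormalized log posterior
    bma_loss = 0.0
    expert_loss = [0.0] * k
    for x in sequence:
        # Normalize the posterior with a max-shift for numerical stability.
        m = max(log_post)
        w = [math.exp(lp - m) for lp in log_post]
        z = sum(w)
        w = [wi / z for wi in w]
        # Likelihood of the observed token under each hypothesis.
        lik = [th if x == 1 else 1.0 - th for th in thetas]
        # BMA prediction: posterior-weighted average of the likelihoods.
        p = sum(wi * li for wi, li in zip(w, lik))
        bma_loss += -math.log(p)
        for i, li in enumerate(lik):
            expert_loss[i] += -math.log(li)
            log_post[i] += math.log(li)  # Bayes update
    return (bma_loss - min(expert_loss)) / len(sequence)


if __name__ == "__main__":
    random.seed(0)
    thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
    for t in (10, 100, 1000):
        seq = [1 if random.random() < 0.7 else 0 for _ in range(t)]
        bound = math.log(len(thetas)) / t
        print(t, bma_average_regret(thetas, seq), "bound:", bound)
```

The averaged regret shrinks as the sequence grows because the total regret is capped by the log of the inverse prior weight; the paper's result concerns a far more general latent-variable model, so this sketch only illustrates the shape of the bound.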



