An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

12/30/2022
by Yufeng Zhang, et al.

With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass?

We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis.

- To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks.
- To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error that decreases as the input size grows. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences (see the sketch after this list).
- To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error that is independent of input sizes. In particular, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
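To make the conditional-mean reading in (b) concrete, here is a minimal NumPy sketch (an illustration under simplifying assumptions, not the paper's construction; all function names and toy data are hypothetical). It writes single-query softmax attention as a kernel-weighted average of values and checks that, when the query and keys have unit norm, it coincides with a Gaussian-kernel Nadaraya-Watson estimate of the conditional mean of the value given the key.

import numpy as np

def softmax_attention(query, keys, values, temperature=1.0):
    # Single-query softmax attention: the weights form a probability
    # distribution over tokens, so the output is a weighted average of values.
    scores = keys @ query / temperature            # query-key similarities
    weights = np.exp(scores - scores.max())        # exponential kernel (numerically stable)
    weights /= weights.sum()                       # normalize into attention weights
    return weights @ values                        # kernel-weighted average of values

def gaussian_kernel_regression(query, keys, values, bandwidth=1.0):
    # Nadaraya-Watson estimate of E[value | key = query] with a Gaussian kernel.
    sq_dists = np.sum((keys - query) ** 2, axis=-1)
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    weights /= weights.sum()
    return weights @ values

# Toy check: when the query and keys lie on the unit sphere, exp(q . k) is
# proportional to exp(-||q - k||^2 / 2), so softmax attention coincides with
# the Gaussian-kernel conditional-mean estimate above.
rng = np.random.default_rng(0)
n, d = 512, 16
keys = rng.normal(size=(n, d))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(n, 4))
query = rng.normal(size=d)
query /= np.linalg.norm(query)
print(np.allclose(softmax_attention(query, keys, values),
                  gaussian_kernel_regression(query, keys, values)))   # True

The unit-norm assumption is what ties the exponential dot-product kernel to a localized Gaussian kernel; without it, the two estimators generally differ, and the paper's analysis quantifies the approximation in its own latent variable setting.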

