A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity

02/12/2023
by Hongkang Li, et al.

Vision Transformers (ViTs) with self-attention modules have recently achieved great empirical success in many vision tasks. Due to non-convex interactions across layers, however, theoretical learning and generalization analysis is mostly elusive. Based on a data model characterizing both label-relevant and label-irrelevant tokens, this paper provides the first theoretical analysis of training a shallow ViT, i.e., one self-attention layer followed by a two-layer perceptron, for a classification task. We characterize the sample complexity to achieve a zero generalization error. Our sample complexity bound is positively correlated with the inverse of the fraction of label-relevant tokens, the token noise level, and the initial model error. We also prove that a training process using stochastic gradient descent (SGD) leads to a sparse attention map, which formally verifies the general intuition about why attention succeeds. Moreover, this paper indicates that a proper token sparsification can improve the test performance by removing label-irrelevant and/or noisy tokens, including spurious correlations. Empirical experiments on synthetic data and the CIFAR-10 dataset justify our theoretical results and generalize to deeper ViTs.
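For concreteness, below is a minimal NumPy sketch of the shallow architecture the abstract describes: a single self-attention layer followed by a two-layer perceptron, plus a simple top-k token-sparsification step that drops the tokens receiving the least attention mass. The weight names (W_Q, W_K, W_V, W_1, a), the ReLU hidden layer, and the averaged readout are illustrative assumptions, not the paper's exact parameterization.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shallow_vit_forward(X, params):
    """One self-attention layer followed by a two-layer perceptron.

    X: (L, d) array of L token (patch) embeddings of dimension d.
    params: W_Q, W_K, W_V of shape (d, d); W_1 of shape (m, d); a of shape (m,).
    Returns a scalar logit (sign = binary label) and the (L, L) attention map.
    """
    W_Q, W_K, W_V = params["W_Q"], params["W_K"], params["W_V"]
    W_1, a = params["W_1"], params["a"]

    scores = (X @ W_Q) @ (X @ W_K).T      # (L, L) attention scores
    attn = softmax(scores, axis=-1)       # each row is a distribution over tokens
    Z = attn @ (X @ W_V)                  # (L, d) attended token features

    hidden = np.maximum(Z @ W_1.T, 0.0)   # (L, m) ReLU features per token
    logit = float((hidden @ a).mean())    # readout averaged over tokens
    return logit, attn

def sparsify_tokens(X, attn, k):
    """Token sparsification: keep the k tokens receiving the most attention mass."""
    received = attn.sum(axis=0)           # total attention each token receives
    keep = np.sort(np.argsort(received)[-k:])
    return X[keep]

# Example with random tokens and random initialization.
rng = np.random.default_rng(0)
L, d, m = 16, 8, 32
X = rng.normal(size=(L, d))
params = {
    "W_Q": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_K": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_V": rng.normal(size=(d, d)) / np.sqrt(d),
    "W_1": rng.normal(size=(m, d)) / np.sqrt(d),
    "a": rng.choice([-1.0, 1.0], size=m) / m,
}
logit, attn = shallow_vit_forward(X, params)
X_pruned = sparsify_tokens(X, attn, k=8)  # drop half of the tokens
print("predicted label:", 1 if logit >= 0 else -1, "| kept tokens:", X_pruned.shape[0])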


