SP-ViT: Learning 2D Spatial Priors for Vision Transformers

06/15/2022
by Yuxuan Zhou, et al.

Recently, transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. However, compared to CNNs, transformers converge slowly and are prone to overfitting in low-data regimes due to the lack of spatial inductive biases. Such spatial inductive biases can be especially beneficial since the 2D structure of an input image is not well preserved in transformers. In this work, we present Spatial Prior-enhanced Self-Attention (SP-SA), a novel variant of vanilla Self-Attention (SA) tailored for vision transformers. Spatial Priors (SPs) are our proposed family of inductive biases that highlight certain groups of spatial relations. Unlike convolutional inductive biases, which are forced to focus exclusively on hard-coded local regions, our proposed SPs are learned by the model itself and take a variety of spatial relations into account. Specifically, the attention score is calculated with emphasis on certain kinds of spatial relations at each head, and such learned spatial foci can be complementary to each other. Based on SP-SA, we propose the SP-ViT family, which consistently outperforms other ViT models with similar GFLOPs or parameter counts. Our largest model, SP-ViT-L, achieves a record-breaking 86.3% Top-1 accuracy with a reduction in the number of parameters of almost 50% compared to the previous state-of-the-art model (150M for SP-ViT-L vs. 271M for CaiT-M-36) among all ImageNet-1K models trained at 224x224 and fine-tuned at 384x384 resolution without extra data.
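
The abstract leaves the exact formulation of SP-SA open, but its description (a learned, per-head emphasis on particular spatial relations folded into the attention score) suggests something in the spirit of the sketch below: an attention head that adds a learnable bias over 2D relative token positions to the attention logits before the softmax. This is a minimal PyTorch sketch under that assumption; the class name SpatialPriorAttention, the bias-table parameterization, and the shapes are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of spatial-prior-enhanced attention: each head learns
# one scalar per 2D relative offset between tokens, added to the attention
# logits. This follows the abstract's description, not the paper's code.
import torch
import torch.nn as nn


class SpatialPriorAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Learnable spatial prior: one scalar per head for each of the
        # (2H-1)*(2W-1) possible relative offsets on an H x W token grid.
        h, w = grid_size
        self.prior = nn.Parameter(torch.zeros(num_heads, (2 * h - 1) * (2 * w - 1)))

        # Precompute, for every (query, key) token pair, the index of their
        # relative offset into the prior table.
        coords = torch.stack(
            torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij"),
            dim=-1,
        ).reshape(-1, 2)                                      # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]         # (N, N, 2)
        rel += torch.tensor([h - 1, w - 1])                   # shift offsets >= 0
        idx = rel[..., 0] * (2 * w - 1) + rel[..., 1]         # (N, N)
        self.register_buffer("idx", idx)

    def forward(self, x):                                     # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                  # (B, heads, N, hd)
        attn = (q @ k.transpose(-2, -1)) * self.scale         # (B, heads, N, N)
        attn = attn + self.prior[:, self.idx]                 # add spatial prior
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)
```

With grid_size=(14, 14), i.e. a 224x224 image split into 16x16 patches, there are N = 196 tokens and each head owns a table of 27 * 27 = 729 learnable prior scalars, one per distinct 2D relative offset.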


Related research:

ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases (03/19/2021)
Convolutional architectures have proven extremely successful for vision ...

Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training (12/07/2021)
Recently, vision Transformers (ViTs) are developing rapidly and starting...

Remote Sensing Change Detection With Transformers Trained from Scratch (04/13/2023)
Current transformer-based change detection (CD) approaches either employ...

MaiT: Leverage Attention Masks for More Efficient Image Transformers (07/06/2022)
Though image transformers have shown competitive results with convolutio...

Co-advise: Cross Inductive Bias Distillation (06/23/2021)
Transformers recently are adapted from the community of natural language...

CoAtNet: Marrying Convolution and Attention for All Data Sizes (06/09/2021)
Transformers have attracted increasing interests in computer vision, but...

Trivial or impossible – dichotomous data difficulty masks model differences (on ImageNet and beyond) (10/12/2021)
"The power of a generalization system follows directly from its biases" ...
