Can Vision Transformers Perform Convolution?

11/02/2021

by Shanda Li, et al.

Several recent studies have demonstrated that attention-based networks, such as the Vision Transformer (ViT), can outperform Convolutional Neural Networks (CNNs) on several computer vision tasks without using convolutional layers. This naturally leads to the following question: Can a self-attention layer of ViT express any convolution operation? In this work, we prove constructively that a single ViT layer with image patches as input can perform any convolution operation, where the multi-head attention mechanism and relative positional encoding play essential roles. We further provide a lower bound on the number of heads Vision Transformers need to express CNNs. Consistent with our analysis, experimental results show that the construction in our proof can help inject convolutional bias into Transformers and significantly improve the performance of ViT in low-data regimes.
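To make the claim concrete, here is a hedged NumPy sketch (not the paper's exact construction) of the core idea: give a multi-head attention layer one head per relative offset in a K x K window, hard-code each head's attention map as a one-hot lookup of that offset over the patch grid, and fold the convolution kernel into the per-head output weights. All names and shapes below are illustrative; the paper's construction realizes the hard one-hot attention via relative positional encodings, which this sketch simply assumes.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4                   # patch grid size
d_in, d_out, K = 3, 2, 3    # channels and kernel size
X = rng.standard_normal((H * W, d_in))             # patch embeddings
kernel = rng.standard_normal((K, K, d_in, d_out))  # conv weights

# One attention head per relative offset in the K x K window.
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def conv_reference(X, kernel):
    """Direct K x K convolution with zero padding, for comparison."""
    Xg = X.reshape(H, W, d_in)
    out = np.zeros((H, W, d_out))
    for y in range(H):
        for x in range(W):
            for (dy, dx), w in zip(offsets, kernel.reshape(K * K, d_in, d_out)):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    out[y, x] += Xg[yy, xx] @ w
    return out.reshape(H * W, d_out)

def attention_as_conv(X, kernel):
    """Each head's attention matrix is a one-hot shift by its offset;
    the value projection is the identity and the kernel weights sit in
    the per-head output projection."""
    out = np.zeros((H * W, d_out))
    for (dy, dx), w in zip(offsets, kernel.reshape(K * K, d_in, d_out)):
        A = np.zeros((H * W, H * W))  # hard (one-hot) attention map
        for y in range(H):
            for x in range(W):
                yy, xx = y + dy, x + dx
                if 0 <= yy < H and 0 <= xx < W:
                    A[y * W + x, yy * W + xx] = 1.0
        out += A @ X @ w
    return out

# The K * K = 9 heads together reproduce the convolution exactly.
assert np.allclose(conv_reference(X, kernel), attention_as_conv(X, kernel))
```

Note that the head count here (K * K = 9 for a 3x3 kernel) matches the flavor of the paper's lower bound: expressing a K x K convolution requires enough heads to cover every relative offset.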


