Less is More: Pay Less Attention in Vision Transformers

05/29/2021
by Zizheng Pan, et al.

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To address this, we present a novel Less attention vIsion Transformer (LIT), building on the observation that convolutions, fully-connected (FC) layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer that uses pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages, while applying self-attention modules in the deeper layers to capture longer-range dependencies. We further propose a learned deformable token merging module that adaptively fuses informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks.

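For concreteness, the sketch below illustrates the two ideas from the abstract in PyTorch: MLP-only blocks in the early stages, standard self-attention blocks in the deeper stages, and a deformable token merging layer that downsamples the token grid by sampling features at learned offsets. This is a minimal reading of the abstract, not the authors' released implementation; all names (MLPBlock, AttentionBlock, DeformableTokenMerging, build_stages) and the stage configuration (dims, depths, heads) are illustrative assumptions.

```python
# Hedged sketch of the architecture described in the abstract (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPBlock(nn.Module):
    """Early-stage block: a residual MLP applied to each patch token, no self-attention."""

    def __init__(self, dim, mlp_ratio=4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):  # x: (B, N, dim)
        return x + self.mlp(self.norm(x))


class AttentionBlock(nn.Module):
    """Later-stage block: multi-head self-attention followed by a residual MLP."""

    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):  # x: (B, N, dim)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DeformableTokenMerging(nn.Module):
    """Downsamples the token grid by `stride`, sampling each merged token at a learned
    offset from its regular grid position rather than from a fixed uniform window."""

    def __init__(self, dim_in, dim_out, stride=2):
        super().__init__()
        self.stride = stride
        # Predict a 2-D offset (in normalized [-1, 1] coordinates) per output location.
        self.offset = nn.Conv2d(dim_in, 2, kernel_size=stride, stride=stride)
        # Zero-init so merging starts from the regular grid and offsets are learned.
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x, H, W):  # x: (B, H*W, dim_in), tokens laid out on an H x W grid
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset(feat).permute(0, 2, 3, 1)          # (B, h, w, 2)
        h, w = offsets.shape[1], offsets.shape[2]
        # Regular downsampled grid in grid_sample's (x, y) order, normalized to [-1, 1].
        ys = torch.linspace(-1.0, 1.0, h, device=x.device)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).unsqueeze(0)        # (1, h, w, 2)
        merged = F.grid_sample(feat, base + offsets, align_corners=True)  # (B, C, h, w)
        return self.proj(merged.flatten(2).transpose(1, 2)), h, w


def build_stages(dims=(64, 128, 320, 512), depths=(2, 2, 6, 2), heads=(1, 2, 5, 8)):
    """Hierarchical stage layout: the first two stages use MLP-only blocks to encode
    local patterns cheaply; the last two use self-attention for long-range dependencies."""
    stages = nn.ModuleList()
    for i, (dim, depth, num_heads) in enumerate(zip(dims, depths, heads)):
        blocks = [MLPBlock(dim) if i < 2 else AttentionBlock(dim, num_heads)
                  for _ in range(depth)]
        stages.append(nn.Sequential(*blocks))
    return stages
```

In this reading, a patch embedding feeds the first stage, a DeformableTokenMerging layer halves the token grid between stages, and only the deeper stages pay the quadratic cost of self-attention, over a sequence that is already much shorter.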

Related research

04/27/2023
Vision Conformer: Incorporating Convolutions into Vision Transformer Layers
Transformers are popular neural network models that use layers of self-a...

05/04/2022
Sequencer: Deep LSTM for Image Classification
In recent computer vision research, the advent of the Vision Transformer...

04/04/2022
BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning
Attention mechanisms have been very popular in deep neural networks, whe...

05/25/2023
Making Vision Transformers Truly Shift-Equivariant
For computer vision tasks, Vision Transformers (ViTs) have become one of...

05/28/2021
KVT: k-NN Attention for Boosting Vision Transformers
Convolutional Neural Networks (CNNs) have dominated computer vision for ...

12/21/2021
Learned Queries for Efficient Local Attention
Vision Transformers (ViT) serve as powerful vision models. Unlike convol...

10/05/2021
Transformer Assisted Convolutional Network for Cell Instance Segmentation
Region proposal based methods like R-CNN and Faster R-CNN models have pr...
