Less is More: Pay Less Attention in Vision Transformers

by   Zizheng Pan, et al.

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages while applying self-attention modules to capture longer dependencies in deeper layers. Moreover, we further propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks.


page 7

page 8


UFO-ViT: High Performance Linear Vision Transformer without Softmax

Vision transformers have become one of the most important models for com...

Sequencer: Deep LSTM for Image Classification

In recent computer vision research, the advent of the Vision Transformer...

An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

The task of action detection aims at deducing both the action category a...

Global Interaction Modelling in Vision Transformer via Super Tokens

With the popularity of Transformer architectures in computer vision, the...

BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning

Attention mechanisms have been very popular in deep neural networks, whe...

MAXIM: Multi-Axis MLP for Image Processing

Recent progress on Transformers and multi-layer perceptron (MLP) models ...

gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window

Following the success in language domain, the self-attention mechanism (...