Less is More: Pay Less Attention in Vision Transformers

05/29/2021
by Zizheng Pan, et al.

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over a long sequence of representations, especially for high-resolution dense prediction tasks. To address this, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that convolutions, fully-connected (FC) layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences. Specifically, we propose a hierarchical Transformer where we use pure multi-layer perceptrons (MLPs) to encode rich local patterns in the early stages, while applying self-attention modules to capture longer-range dependencies in the deeper layers. Moreover, we propose a learned deformable token merging module to adaptively fuse informative patches in a non-uniform manner. The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks.
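
To make the described architecture concrete, below is a minimal PyTorch sketch of a LIT-style backbone, not the authors' implementation: the first two stages use attention-free MLP blocks over patch tokens, the last two use standard multi-head self-attention blocks, and downsampling between stages is approximated by a plain strided convolution rather than the paper's learned deformable token merging. All module names, stage widths, depths, and head counts here are illustrative assumptions.

```python
# Sketch of the LIT idea under the assumptions stated above:
# early stages are attention-free (pure MLP), deeper stages use self-attention.
import torch
import torch.nn as nn


class MLPBlock(nn.Module):
    """Attention-free block: pre-norm channel MLP with a residual connection."""
    def __init__(self, dim, mlp_ratio=4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: (B, N, C)
        return x + self.mlp(self.norm(x))


class AttentionBlock(nn.Module):
    """Standard Transformer block: pre-norm MSA followed by a pre-norm MLP."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                      # x: (B, N, C)
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class TinyLITLikeBackbone(nn.Module):
    """Four-stage hierarchy: stages 1-2 are MLP-only, stages 3-4 use self-attention.
    Token merging here is a plain strided convolution; the paper's deformable
    token merging instead learns offsets to fuse patches non-uniformly."""
    def __init__(self, in_chans=3, dims=(64, 128, 256, 512),
                 depths=(2, 2, 4, 2), heads=(0, 0, 4, 8)):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4)
        self.merges = nn.ModuleList(
            [nn.Conv2d(dims[i], dims[i + 1], kernel_size=2, stride=2) for i in range(3)]
        )
        self.stages = nn.ModuleList()
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            if i < 2:   # early stages: attention-free MLP blocks for local patterns
                blocks = [MLPBlock(dim) for _ in range(depth)]
            else:       # deeper stages: self-attention for longer-range dependencies
                blocks = [AttentionBlock(dim, heads[i]) for _ in range(depth)]
            self.stages.append(nn.Sequential(*blocks))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)
        for i, stage in enumerate(self.stages):
            B, C, H, W = x.shape
            tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens = stage(tokens)
            x = tokens.transpose(1, 2).reshape(B, C, H, W)
            if i < 3:
                x = self.merges[i](x)                      # downsample between stages
        return x                                           # final feature map


if __name__ == "__main__":
    feats = TinyLITLikeBackbone()(torch.randn(1, 3, 224, 224))
    print(feats.shape)                                     # torch.Size([1, 512, 7, 7])
```

Because the early, high-resolution stages contain no self-attention, the quadratic cost on long token sequences is avoided where it is most expensive, which is the core trade-off the abstract describes.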


Related research:

09/29/2021  UFO-ViT: High Performance Linear Vision Transformer without Softmax
Vision transformers have become one of the most important models for com...

05/04/2022  Sequencer: Deep LSTM for Image Classification
In recent computer vision research, the advent of the Vision Transformer...

07/21/2022  An Efficient Spatio-Temporal Pyramid Transformer for Action Detection
The task of action detection aims at deducing both the action category a...

11/25/2021  Global Interaction Modelling in Vision Transformer via Super Tokens
With the popularity of Transformer architectures in computer vision, the...

04/04/2022  BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning
Attention mechanisms have been very popular in deep neural networks, whe...

01/09/2022  MAXIM: Multi-Axis MLP for Image Processing
Recent progress on Transformers and multi-layer perceptron (MLP) models ...

08/24/2022  gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window
Following the success in language domain, the self-attention mechanism (...