Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

by   Xuran Pan, et al.

Self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), which enables adaptive feature extraction from global contexts. However, existing self-attention methods either adopt sparse global attention or window attention to reduce the computation complexity, which may compromise the local feature learning or subject to some handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitution. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in both efficient and flexible manner. Extensive experiments show that our slide attention module is applicable to a variety of advanced Vision Transformer models and compatible with various hardware devices, and achieves consistently improved performances on comprehensive benchmarks. Code is available at


page 1

page 7

page 10

page 11


FLatten Transformer: Vision Transformer using Focused Linear Attention

The quadratic computation complexity of self-attention has been a persis...

Conformer: Local Features Coupling Global Representations for Visual Recognition

Within Convolutional Neural Network (CNN), the convolution operations ar...

Are we ready for a new paradigm shift? A Survey on Visual Deep MLP

Multilayer perceptron (MLP), as the first neural network structure to ap...

Towards an Effective and Efficient Transformer for Rain-by-snow Weather Removal

Rain-by-snow weather removal is a specialized task in weather-degraded i...

Local Attention Graph-based Transformer for Multi-target Genetic Alteration Prediction

Classical multiple instance learning (MIL) methods are often based on th...

Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting

Class-agnostic counting (CAC) aims to count objects of interest from a q...

Long Document Ranking with Query-Directed Sparse Transformer

The computing cost of transformer self-attention often necessitates brea...

Please sign up or login with your details

Forgot password? Click here to reset