DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

02/03/2023
by Jiayu Jiao, et al.

As a de facto solution, vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches, but the globally attended receptive field incurs quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, modeling only the interactions between patches within small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between computational complexity and the size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in the shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within a sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves performance comparable to existing state-of-the-art models with 70% fewer FLOPs. Our DilateFormer-Base achieves 85.6% top-1 accuracy on the ImageNet-1K classification task, 53.5% box mAP/46.1% mask mAP on the COCO object detection/instance segmentation task, and 51.1% MS mIoU on the ADE20K semantic segmentation task.
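The abstract describes MSDA only at a high level, so here is a minimal PyTorch sketch of how such a mechanism could work: channels are split into head groups, each group performs sliding-window attention over a k×k neighborhood sampled at a different dilation rate, and the multi-scale outputs are concatenated and fused by a linear projection. The class names, the 3×3 kernel with dilation rates (1, 2, 3), and the use of F.unfold to gather dilated neighborhoods are illustrative assumptions, not the authors' reference implementation.

```python
# Illustrative sketch of Multi-Scale Dilated Attention (MSDA); shapes,
# head splits, and the unfold-based neighborhood gathering are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedWindowAttention(nn.Module):
    """Each query attends to a k x k neighborhood sampled with a dilation rate."""
    def __init__(self, head_dim, kernel_size=3, dilation=1):
        super().__init__()
        self.k = kernel_size
        self.d = dilation
        self.scale = head_dim ** -0.5
        # Padding keeps the output resolution equal to the input resolution.
        self.pad = dilation * (kernel_size - 1) // 2

    def forward(self, q, k, v):
        # q, k, v: (B, C, H, W) feature maps for one group of heads.
        B, C, H, W = q.shape
        n = self.k * self.k
        # Gather the dilated k x k neighborhood of every position: (B, C*n, H*W).
        k_win = F.unfold(k, self.k, dilation=self.d, padding=self.pad)
        v_win = F.unfold(v, self.k, dilation=self.d, padding=self.pad)
        k_win = k_win.view(B, C, n, H * W)
        v_win = v_win.view(B, C, n, H * W)
        q = q.view(B, C, 1, H * W)
        # Scaled dot-product attention over the n neighbors of each query.
        attn = (q * k_win).sum(dim=1, keepdim=True) * self.scale  # (B, 1, n, H*W)
        attn = attn.softmax(dim=2)
        out = (attn * v_win).sum(dim=2)                           # (B, C, H*W)
        return out.view(B, C, H, W)

class MSDA(nn.Module):
    """Splits channels into groups, each attending with a different dilation rate."""
    def __init__(self, dim, dilations=(1, 2, 3), kernel_size=3):
        super().__init__()
        assert dim % len(dilations) == 0
        self.group_dim = dim // len(dilations)
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.attns = nn.ModuleList(
            DilatedWindowAttention(self.group_dim, kernel_size, d) for d in dilations
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        # x: (B, C, H, W)
        q, k, v = self.qkv(x).chunk(3, dim=1)
        outs = []
        for i, attn in enumerate(self.attns):
            s = slice(i * self.group_dim, (i + 1) * self.group_dim)
            outs.append(attn(q[:, s], k[:, s], v[:, s]))
        # Concatenate the multi-scale outputs and fuse them.
        return self.proj(torch.cat(outs, dim=1))
```

In the pyramid architecture described above, blocks built around an attention module like this would occupy the low-level, high-resolution stages, while the later, lower-resolution stages would use standard global multi-head self-attention.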


