What Makes for Hierarchical Vision Transformer?

07/05/2021
by Yuxin Fang, et al.

Recent studies show that hierarchical Vision Transformers with interleaved non-overlapped intra-window self-attention and shifted-window self-attention are able to achieve state-of-the-art performance in various visual recognition tasks, challenging CNNs' dense sliding-window paradigm. Most follow-up works try to replace the shifted-window operation with other kinds of cross-window communication, while treating self-attention as the de facto standard for intra-window information aggregation. In this short preprint, we question whether self-attention is the only choice for hierarchical Vision Transformers to attain strong performance, and what makes for a hierarchical Vision Transformer. We replace the self-attention layers in Swin Transformer and Shuffle Transformer with simple linear mappings and keep the other components unchanged. The resulting architecture, with 25.4M parameters and 4.2G FLOPs, achieves 80.5% Top-1 accuracy, compared to 81.3% for Swin Transformer with 28.3M parameters and 4.5G FLOPs. We also experiment with other alternatives to self-attention for context aggregation inside each non-overlapped window, all of which give similarly competitive results under the same architecture. Our study reveals that the macro architecture of the Swin model family (i.e., interleaved intra-window and cross-window communications), rather than the specific aggregation layers or the specific means of cross-window communication, may be more responsible for its strong performance and is the real challenger to CNNs' dense sliding-window paradigm.
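To make the core substitution concrete, below is a minimal PyTorch-style sketch of a Swin-like block in which the window self-attention layer is swapped for a simple linear mapping over the tokens of each non-overlapped window, while LayerNorm, the channel MLP, and the residual connections are kept as in Swin. The module names, shapes, and hyperparameters here are illustrative assumptions based on the abstract, not the authors' released code; (shifted) window partitioning is assumed to happen outside the block.

import torch
import torch.nn as nn

class LinearWindowMixer(nn.Module):
    # Stand-in for window self-attention: one linear mapping applied
    # across the token positions inside each non-overlapped window.
    def __init__(self, dim, window_size):
        super().__init__()
        tokens = window_size * window_size
        self.mix = nn.Linear(tokens, tokens)  # mixes information across window positions

    def forward(self, x):
        # x: (num_windows * batch, tokens, dim)
        x = x.transpose(1, 2)     # (B_w, dim, tokens)
        x = self.mix(x)           # linear aggregation over the window's tokens
        return x.transpose(1, 2)  # back to (B_w, tokens, dim)

class SwinLikeBlock(nn.Module):
    # Swin-style block with attention replaced by LinearWindowMixer;
    # norms, MLP, and residuals are unchanged.
    def __init__(self, dim, window_size, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = LinearWindowMixer(dim, window_size)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))  # intra-window aggregation
        x = x + self.mlp(self.norm2(x))    # channel MLP
        return x

if __name__ == "__main__":
    block = SwinLikeBlock(dim=96, window_size=7)
    windows = torch.randn(8, 7 * 7, 96)   # 8 windows of 49 tokens, 96 channels
    print(block(windows).shape)           # torch.Size([8, 49, 96])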


