Twins: Revisiting Spatial Attention Design in Vision Transformers

by Xiangxiang Chu et al.

Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to success in these tasks. In this work, we revisit the design of spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code will be released soon.
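The abstract notes that the attention mechanism involves only matrix multiplications. As a rough illustration of the idea of restricting spatial self-attention to local groups of tokens (the kind of windowed scheme these architectures build on), here is a minimal NumPy sketch; the function name, the use of single-head attention, and plain `(C, C)` projection matrices are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def locally_grouped_attention(x, wq, wk, wv, window):
    """Self-attention restricted to non-overlapping square windows.

    x: (H, W, C) feature map; wq, wk, wv: (C, C) projection matrices
    (illustrative single-head setup); window: side length of each
    square group, which must divide H and W.
    """
    H, W, C = x.shape
    out = np.empty_like(x)
    for i in range(0, H, window):
        for j in range(0, W, window):
            # Flatten one window into a group of window*window tokens.
            g = x[i:i + window, j:j + window].reshape(-1, C)
            q, k, v = g @ wq, g @ wk, g @ wv
            # Scaled dot-product attention within the group only.
            attn = softmax(q @ k.T / np.sqrt(C))
            out[i:i + window, j:j + window] = (attn @ v).reshape(window, window, C)
    return out
```

Because attention is computed within each window independently, the cost grows linearly in the number of windows rather than quadratically in the total number of tokens, and every step is an ordinary matrix multiplication.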


Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

Very recently, Window-based Transformers, which computed self-attention ...

Vision Transformer Architecture Search

Recently, transformers have shown great superiority in solving computer ...

When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Attention mechanism has been widely believed as the key to success of vi...

QuadTree Attention for Vision Transformers

Transformers have been successful in many vision tasks, thanks to their ...

k-means Mask Transformer

The rise of transformers in vision tasks not only advances network backb...

MDMLP: Image Classification from Scratch on Small Datasets with MLP

The attention mechanism has become a go-to technique for natural languag...

On Vision Features in Multimodal Machine Translation

Previous work on multimodal machine translation (MMT) has focused on the...

Code Repositories


Two simple and effective designs of vision transformer, which are on par with the Swin Transformer
