
LocalViT: Bringing Locality to Vision Transformers

by Yawei Li, et al.

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet, locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and all proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows the generalization of the locality concept. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1% with a negligible increase in the number of parameters and computational effort. Code is available at <>.
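The idea described above — inserting a depth-wise convolution between the two point-wise projections of a transformer's feed-forward network, so that the FFN resembles an inverted residual block — can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' exact implementation: the class name, the 3x3 kernel size, and the omission of class-token handling are assumptions for clarity.

```python
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    """Illustrative FFN with locality: expand (1x1 conv), mix locally
    (depth-wise 3x3 conv), project back (1x1 conv). Mirrors the
    structure of an inverted residual block. Sketch only; assumes the
    token sequence contains no class token."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.conv1 = nn.Conv2d(dim, hidden, kernel_size=1)     # point-wise expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)      # depth-wise: local information exchange
        self.conv2 = nn.Conv2d(hidden, dim, kernel_size=1)     # point-wise projection
        self.act = nn.GELU()

    def forward(self, x, h, w):
        # x: (B, N, C) token embeddings with N = h * w.
        # Reshape tokens back to a 2D feature map so convolutions
        # can exchange information within local neighbourhoods.
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.act(self.conv1(x))
        x = self.act(self.dwconv(x))
        x = self.conv2(x)
        # Flatten back to the (B, N, C) token sequence.
        return x.reshape(b, c, n).transpose(1, 2)
```

Because the depth-wise convolution operates per channel over a small window, the added parameters and FLOPs are negligible relative to the point-wise projections, which is consistent with the abstract's claim of minimal overhead.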




Transformers Solve the Limited Receptive Field for Monocular Depth Prediction

While convolutional neural networks have shown a tremendous impact on va...

Locality Guidance for Improving Vision Transformers on Tiny Datasets

While the Vision Transformer (VT) architecture is becoming trendy in com...

Video Super-Resolution Transformer

Video super-resolution (VSR), with the aim to restore a high-resolution ...

MaiT: Leverage Attention Masks for More Efficient Image Transformers

Though image transformers have shown competitive results with convolutio...

Not all parameters are born equal: Attention is mostly what you need

Transformers are widely used in state-of-the-art machine translation, bu...

Incorporating Convolution Designs into Visual Transformers

Motivated by the success of Transformers in natural language processing ...

Efficient Frequency Domain-based Transformers for High-Quality Image Deblurring

We present an effective and efficient method that explores the propertie...