LocalViT: Bringing Locality to Vision Transformers

by Yawei Li, et al.

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. Although the global interaction between the token embeddings can be well modelled by the self-attention mechanism of transformers, what is lacking is a locality mechanism for information exchange within a local region. Yet, locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) is available for incorporating locality mechanisms, and all proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows the generality of the locality concept. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1%, respectively, with a negligible increase in the number of parameters and computational effort. Code is available at <https://github.com/ofsoundof/LocalViT>.
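The core idea can be sketched as follows: treat the two linear layers of the transformer's feed-forward network as 1x1 convolutions over the 2D token grid, and insert a depth-wise 3x3 convolution between them so neighbouring tokens exchange information, mirroring an inverted residual block. The snippet below is a minimal NumPy sketch of that structure, not the authors' implementation; all names, the ReLU activation, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def depthwise_conv3x3(x, w):
    """Depth-wise 3x3 convolution: each channel is filtered independently.
    x: (H, W, C), w: (3, 3, C). Zero padding keeps the spatial size."""
    H, W, C = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] * w[i, j, :]
    return out

def local_ffn(tokens, W1, w_dw, W2, H, W):
    """Locality-enhanced feed-forward block (a sketch of the LocalViT idea):
    expand channels (acts as a 1x1 conv), apply a depth-wise 3x3 conv over
    the 2D token grid for local mixing, then project back.
    tokens: (H*W, C) sequence of patch embeddings."""
    x = tokens @ W1                     # channel expansion: (H*W, gamma*C)
    x = x.reshape(H, W, -1)             # restore the 2D grid of tokens
    x = np.maximum(depthwise_conv3x3(x, w_dw), 0.0)  # local mixing + ReLU
    x = x.reshape(H * W, -1)            # flatten back to a sequence
    return x @ W2                       # projection back to C channels

# Toy dimensions (assumed): 4x4 token grid, embedding dim 8, expansion ratio 4.
H_, W_, C, gamma = 4, 4, 8, 4
tokens = rng.standard_normal((H_ * W_, C))
W1 = rng.standard_normal((C, gamma * C)) * 0.1
w_dw = rng.standard_normal((3, 3, gamma * C)) * 0.1
W2 = rng.standard_normal((gamma * C, C)) * 0.1
out = local_ffn(tokens, W1, w_dw, W2, H_, W_)
print(out.shape)  # (16, 8): same shape as the input token sequence
```

Because the depth-wise convolution touches each channel independently, the extra parameters (3x3 per expanded channel) and FLOPs are tiny compared with the two linear layers, which is why the paper reports a negligible overhead.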




