LocalViT: Bringing Locality to Vision Transformers

04/12/2021
by Yawei Li et al.

We study how to introduce locality mechanisms into vision transformers. The transformer network originates from machine translation and is particularly good at modelling long-range dependencies within a long sequence. While the self-attention mechanism of transformers models the global interaction between token embeddings well, what is lacking is a mechanism for information exchange within a local region. Yet locality is essential for images, since it pertains to structures like lines, edges, shapes, and even objects. We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network. This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks. The importance of locality mechanisms is validated in two ways: 1) a wide range of design choices (activation function, layer placement, expansion ratio) are available for incorporating locality mechanisms, and all proper choices lead to a performance gain over the baseline; and 2) the same locality mechanism is successfully applied to 4 vision transformers, which shows the generality of the locality concept. In particular, for ImageNet2012 classification, the locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by 2.6% and 3.1%, respectively, with a negligible increase in the number of parameters and computation. Code is available at <https://github.com/ofsoundof/LocalViT>.
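To make the idea concrete, here is a minimal PyTorch sketch of a locality-enhanced feed-forward network along the lines the abstract describes: a 3x3 depth-wise convolution placed between the two point-wise (1x1) convolutions that play the role of the usual FFN linear layers, mirroring an inverted residual block. The module name `LocalityFeedForward`, the Hardswish activation, the default expansion ratio, and the sequence-to-map reshaping details are illustrative assumptions, not the authors' exact code (see the linked repository for that).

```python
import torch
import torch.nn as nn


class LocalityFeedForward(nn.Module):
    """Hypothetical FFN with a 3x3 depth-wise convolution inserted between
    two point-wise (1x1) convolutions, mirroring an inverted residual block."""

    def __init__(self, dim: int, expansion_ratio: int = 4):
        super().__init__()
        hidden = dim * expansion_ratio
        self.net = nn.Sequential(
            # point-wise expansion, equivalent to the first FFN linear layer
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.Hardswish(),  # activation is one of the design choices studied
            # depth-wise convolution: the locality mechanism
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.Hardswish(),
            # point-wise projection, equivalent to the second FFN linear layer
            nn.Conv2d(hidden, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x has shape (batch, num_patches, dim); any class token is assumed
        # to have been split off by the caller before this spatial reshape.
        b, n, c = x.shape
        assert n == h * w, "token count must match the h x w patch grid"
        feat = x.transpose(1, 2).reshape(b, c, h, w)  # sequence -> 2D map
        feat = self.net(feat)
        # the transformer block's residual connection is applied by the caller
        return feat.reshape(b, c, n).transpose(1, 2)  # 2D map -> sequence
```

Because this module is a drop-in replacement for the standard two-layer MLP, the rest of the transformer block (attention, layer norm, residual connections) is left untouched, which is consistent with the abstract's observation that the same locality mechanism transfers across several vision transformers.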


Related research

Transformers Solve the Limited Receptive Field for Monocular Depth Prediction (03/22/2021)
While convolutional neural networks have shown a tremendous impact on va...

Locality Guidance for Improving Vision Transformers on Tiny Datasets (07/20/2022)
While the Vision Transformer (VT) architecture is becoming trendy in com...

Video Super-Resolution Transformer (06/12/2021)
Video super-resolution (VSR), with the aim to restore a high-resolution ...

Improving Transformer-based Networks With Locality For Automatic Speaker Verification (02/17/2023)
Recently, Transformer-based architectures have been explored for speaker...

Preserving Locality in Vision Transformers for Class Incremental Learning (04/14/2023)
Learning new classes without forgetting is crucial for real-world applic...

Not all parameters are born equal: Attention is mostly what you need (10/22/2020)
Transformers are widely used in state-of-the-art machine translation, bu...

Robust Transformer with Locality Inductive Bias and Feature Normalization (01/27/2023)
Vision transformers have been demonstrated to yield state-of-the-art res...
