LightViT: Towards Light-Weight Convolution-Free Vision Transformers

07/12/2022
by Tao Huang, et al.

Vision transformers (ViTs) are usually considered less light-weight than convolutional neural networks (CNNs) due to their lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that convolutional kernels perform information aggregation to connect all tokens; however, they would actually be unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance with pure transformer blocks, without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both the self-attention and the feed-forward network (FFN) of ViTs: additional learnable tokens are introduced to capture global dependencies, and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.
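To make the two ingredients concrete, here is a minimal PyTorch sketch, intended as a reading aid rather than the authors' implementation (see the repository above for that). The class names `GlobalAggregation` and `BiDimFFN`, the number of global tokens, and the exact gating layout are illustrative assumptions; the paper's full block is more elaborate and also pairs the global branch with local attention.

```python
# Minimal sketch of the abstract's two ideas, under the assumptions stated above.
import torch
import torch.nn as nn

class GlobalAggregation(nn.Module):
    """Learnable global tokens aggregate information from all image tokens,
    then broadcast it back, so every token gains a global receptive field
    without any convolutional aggregation."""
    def __init__(self, dim, num_heads=4, num_global_tokens=8):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.aggregate = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, N, C)
        g = self.global_tokens.expand(x.size(0), -1, -1)    # (B, G, C)
        g, _ = self.aggregate(g, x, x)    # global tokens attend to all image tokens
        out, _ = self.broadcast(x, g, g)  # image tokens read the global summary back
        return x + out                    # residual connection

class BiDimFFN(nn.Module):
    """FFN whose hidden features are re-weighted along both dimensions:
    a channel gate (squeeze over tokens) and a spatial gate (squeeze over
    channels), i.e. bi-dimensional channel and spatial attention."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.channel_gate = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4), nn.ReLU(),
            nn.Linear(hidden_dim // 4, hidden_dim), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):                                   # x: (B, N, C)
        h = self.act(self.fc1(x))                           # (B, N, H)
        c = self.channel_gate(h.mean(dim=1, keepdim=True))  # (B, 1, H) channel weights
        s = self.spatial_gate(h)                            # (B, N, 1) spatial weights
        return self.fc2(h * c * s)

if __name__ == "__main__":
    x = torch.randn(2, 196, 64)             # 14x14 patch tokens, 64 channels
    x = GlobalAggregation(dim=64)(x)
    x = x + BiDimFFN(dim=64, hidden_dim=256)(x)
    print(x.shape)                          # torch.Size([2, 196, 64])
```

Note that in this sketch the N image tokens only ever attend to a handful of global tokens (and vice versa), so the global branch costs O(N·G) rather than O(N²) attention, which is what makes such explicit global aggregation affordable in a light-weight model.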

Related research

10/05/2021
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Light-weight convolutional neural networks (CNNs) are the de-facto for m...

10/12/2022
Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets
There still remains an extreme performance gap between Vision Transforme...

07/15/2022
Lightweight Vision Transformer with Cross Feature Attention
Recent advances in vision transformers (ViTs) have achieved great perfor...

05/15/2023
Enhancing Performance of Vision Transformers on Small Datasets through Local Inductive Bias Incorporation
Vision transformers (ViTs) achieve remarkable performance on large datas...

11/10/2022
Demystify Transformers & Convolutions in Modern Image Deep Networks
Recent success of vision transformers has inspired a series of vision ba...

08/14/2023
SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers
This paper presents a module, Spatial Cross-scale Convolution (SCSC), wh...

05/06/2022
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
Self-attention based models such as vision transformers (ViTs) have emer...
