MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

10/05/2021
by   Sachin Mehta, et al.

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.

research
07/15/2022

Lightweight Vision Transformer with Cross Feature Attention

Recent advances in vision transformers (ViTs) have achieved great perfor...
research
03/08/2022

EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers

Recently, vision transformers started to show impressive results which o...
research
05/06/2022

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers

Self-attention based models such as vision transformers (ViTs) have emer...
research
05/25/2022

MoCoViT: Mobile Convolutional Vision Transformer

Recently, Transformer networks have achieved impressive results on a var...
research
09/30/2022

MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features

MobileViT (MobileViTv1) combines convolutional neural networks (CNNs) an...
research
07/12/2022

LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Vision transformers (ViTs) are usually considered to be less light-weigh...
research
12/15/2022

Rethinking Vision Transformers for MobileNet Size and Speed

With the success of Vision Transformers (ViTs) in computer vision tasks,...
