Lightweight Vision Transformer with Cross Feature Attention

07/15/2022
by Youpeng Zhao et al.

Recent advances in vision transformers (ViTs) have achieved great performance in visual recognition tasks. Convolutional neural networks (CNNs) exploit spatial inductive bias to learn visual representations, but these representations are spatially local. ViTs can learn global representations with their self-attention mechanism, but they are usually heavy-weight and unsuitable for mobile devices. In this paper, we propose cross feature attention (XFA) to bring down the computation cost of transformers, and combine it with efficient mobile CNNs to form a novel, efficient, light-weight CNN-ViT hybrid model, XFormer, which can serve as a general-purpose backbone to learn both global and local representations. Experimental results show that XFormer outperforms numerous CNN- and ViT-based models across different tasks and datasets. On the ImageNet1K dataset, XFormer achieves a top-1 accuracy of 78.5%, which is 2.2% and 6.3% more accurate than EfficientNet-B0 (CNN-based) and DeiT (ViT-based) for a similar number of parameters. Our model also performs well when transferred to object detection and semantic segmentation tasks. On the MS COCO dataset, XFormer exceeds MobileNetV2 by 10.5 AP (22.7 -> 33.2 AP) in the YOLOv3 framework with only 6.3M parameters and 3.8G FLOPs. On the Cityscapes dataset, with only a simple all-MLP decoder, XFormer achieves an mIoU of 78.5 and an FPS of 15.3, surpassing state-of-the-art lightweight segmentation networks.
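The efficiency claim rests on XFA being cheaper than standard self-attention, whose cost grows quadratically with the number of tokens. One established way to achieve this is to attend along the feature (channel) dimension instead, so the attention map is d x d rather than N x N and the cost is linear in token count. The PyTorch sketch below illustrates that idea only; the class name `CrossFeatureAttention`, the per-head learnable temperature, and the L2 normalization are assumptions borrowed from related transposed-attention designs (e.g., XCiT), not the paper's exact XFA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossFeatureAttention(nn.Module):
    """Illustrative attention over the feature (channel) dimension.

    Cost is O(N * d^2) in the token count N, instead of the O(N^2 * d)
    of standard self-attention -- the kind of saving XFA targets.
    This is a sketch, not the paper's actual XFA module.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Learnable softmax temperature, one scalar per head (assumption).
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        # Shape (3, B, heads, head_dim, N): tokens end up on the last axis.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)
        # L2-normalize along tokens so the d x d attention map is bounded.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        # (B, heads, head_dim, head_dim): feature-by-feature attention.
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        attn = attn.softmax(dim=-1)
        out = attn @ v  # (B, heads, head_dim, N)
        out = out.permute(0, 3, 1, 2).reshape(B, N, D)
        return self.proj(out)

if __name__ == "__main__":
    x = torch.randn(2, 196, 64)  # e.g., a 14x14 token grid, embedding dim 64
    y = CrossFeatureAttention(dim=64, num_heads=4)(x)
    print(y.shape)  # torch.Size([2, 196, 64])
```

Note that for the 196-token input above, each head's attention map is only 16 x 16, independent of the number of tokens, which is why this family of designs scales well to the high-resolution inputs used in detection and segmentation.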

Related research

10/05/2021 - MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Light-weight convolutional neural networks (CNNs) are the de-facto for m...

09/16/2022 - ConvFormer: Closing the Gap Between CNN and Vision Transformers
Vision transformers have shown excellent performance in computer vision ...

07/18/2023 - RepViT: Revisiting Mobile CNN From ViT Perspective
Recently, lightweight Vision Transformers (ViTs) demonstrate superior pe...

02/24/2023 - Spatial Bias for Attention-free Non-local Neural Networks
In this paper, we introduce the spatial bias to learn global knowledge w...

05/29/2022 - EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition
Vision Transformer (ViT) has achieved remarkable performance in many vis...

05/06/2022 - EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
Self-attention based models such as vision transformers (ViTs) have emer...

07/12/2022 - LightViT: Towards Light-Weight Convolution-Free Vision Transformers
Vision transformers (ViTs) are usually considered to be less light-weigh...
