EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

05/29/2022
by Han Cai, et al.

Vision Transformer (ViT) has achieved remarkable performance in many vision tasks. However, ViT is inferior to convolutional neural networks (CNNs) when targeting high-resolution mobile vision applications. The key computational bottleneck of ViT is the softmax attention module, whose computational cost grows quadratically with the input resolution. Reducing this cost is essential for deploying ViT on edge devices. Existing methods (e.g., Swin, PVT) restrict softmax attention to local windows or reduce the resolution of the key/value tensors, which sacrifices ViT's core advantage in global feature extraction. In this work, we present EfficientViT, an efficient ViT architecture for high-resolution low-computation visual recognition. Instead of restricting the softmax attention, we propose to replace softmax attention with linear attention while enhancing its local feature extraction ability with depthwise convolution. EfficientViT maintains both global and local feature extraction capability while enjoying linear computational complexity. Extensive experiments on COCO object detection and Cityscapes semantic segmentation demonstrate the effectiveness of our method. On the COCO dataset, EfficientViT achieves 42.6 AP with 4.4G MACs, surpassing EfficientDet-D1 by 2.4 AP while having 27.9% fewer MACs. On Cityscapes, EfficientViT reaches 78.7 mIoU with 19.1G MACs, outperforming SegFormer by 2.5 mIoU while requiring less than 1/3 of the computational cost. On a Qualcomm Snapdragon 855 CPU, EfficientViT is 3x faster than EfficientNet while achieving higher ImageNet accuracy.
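To make the complexity argument concrete, here is a minimal NumPy sketch contrasting softmax attention with a kernelized linear attention, plus a depthwise 3x3 convolution of the kind that could supply local features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the feature map, the kernel size, or how the convolution is wired into the block, so the ReLU feature map, the 3x3 kernel, and all function names below are hypothetical choices.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard softmax attention over N tokens; q, k, v have shape (N, d).

    Materializes an N x N weight matrix, so cost is O(N^2 * d).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention with a ReLU feature map (an assumption).

    Uses associativity, phi(Q) (phi(K)^T V), so only d x d intermediates
    are built and cost is O(N * d^2), linear in the number of tokens.
    """
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = k.T @ v          # (d, d) summary of keys and values
    z = k.sum(axis=0)     # (d,) normalizer term
    return (q @ kv) / (q @ z + eps)[:, None]


def linear_attention_quadratic(q, k, v, eps=1e-6):
    """Same function computed the slow way, via the explicit N x N matrix."""
    q, k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    sim = q @ k.T
    return (sim @ v) / (sim.sum(axis=-1, keepdims=True) + eps)


def depthwise_conv3x3(x, w):
    """Depthwise 3x3 convolution with zero padding.

    x has shape (H, W, C); w has shape (3, 3, C). Each channel is filtered
    independently, which adds local context at low cost.
    """
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += xp[i:i + h, j:j + wd, :] * w[i, j, :]
    return out
```

The two linear-attention functions return the same values; the difference is only the order of multiplication, which is what removes the quadratic dependence on the number of tokens.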

research
07/15/2022

Lightweight Vision Transformer with Cross Feature Attention

Recent advances in vision transformers (ViTs) have achieved great perfor...
research
11/18/2022

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference

Vision Transformers (ViTs) have shown impressive performance but still r...
research
07/27/2022

Rethinking Efficacy of Softmax for Lightweight Non-Local Neural Networks

Non-local (NL) block is a popular module that demonstrates the capabilit...
research
03/29/2022

SepViT: Separable Vision Transformer

Vision Transformers have witnessed prevailing success in a series of vis...
research
02/04/2020

Selective Convolutional Network: An Efficient Object Detector with Ignoring Background

It is well known that attention mechanisms can effectively improve the p...
research
03/30/2023

SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

High-resolution images enable neural networks to learn richer visual rep...
research
11/18/2020

End-to-End Object Detection with Adaptive Clustering Transformer

End-to-end Object Detection with Transformer (DETR) proposes to perform o...
