EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

05/11/2023
by Xinyu Liu, et al.

Vision transformers have shown great success due to their high model capacity. However, their remarkable performance comes with heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory-inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module that feeds attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate that EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy while achieving higher throughput on both an Nvidia V100 GPU and an Intel Xeon CPU. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% higher accuracy while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models are available at https://github.com/microsoft/Cream/tree/main/EfficientViT.
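The two design ideas in the abstract can be sketched in a few lines. The NumPy code below is a minimal illustration only: the function names, shapes, and per-head weight lists are assumptions for exposition, and the paper's actual blocks additionally include projections, normalization, depthwise convolutions, and other details omitted here.

```python
import numpy as np

def _softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cascaded_group_attention(x, Wq, Wk, Wv, heads):
    """Sketch of cascaded group attention: each head attends over its own
    channel split of x, and the output of head i is added to the input
    split of head i+1 (the cascade), encouraging diverse attention maps."""
    n, d = x.shape
    dh = d // heads
    outs, carry = [], np.zeros((n, dh))
    for i in range(heads):
        xi = x[:, i * dh:(i + 1) * dh] + carry    # add previous head's output
        q, k, v = xi @ Wq[i], xi @ Wk[i], xi @ Wv[i]
        attn = _softmax(q @ k.T / np.sqrt(dh))    # (n, n) attention map
        carry = attn @ v                          # this head's output
        outs.append(carry)
    return np.concatenate(outs, axis=-1)          # back to (n, d)

def sandwich_block(x, ffn1, attn, ffn2):
    """Sketch of the sandwich layout: a single memory-bound attention layer
    placed between two efficient FFN layers, each with a residual add."""
    x = x + ffn1(x)
    x = x + attn(x)
    return x + ffn2(x)
```

Because each head sees only a d/heads-wide split (plus the cascaded carry), the per-head Q/K/V projections are smaller than in standard MHSA, which is where the computation saving comes from.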
