Compressing Vision Transformers for Low-Resource Visual Learning

09/05/2023
by Eric Youn, et al.

Vision transformers (ViT) and their variants have swept visual learning leaderboards, offering state-of-the-art accuracy in tasks such as image classification, object detection, and semantic segmentation by attending to different parts of the visual input and capturing long-range spatial dependencies. However, these models are large and computation-heavy. For instance, the recently proposed ViT-B model has 86M parameters, making it impractical for deployment on resource-constrained devices and limiting its use in mobile and edge scenarios. In our work, we take a step toward bringing vision transformers to the edge by applying popular model compression techniques such as distillation, pruning, and quantization. Our chosen application environment is an unmanned aerial vehicle (UAV) that is battery-powered and memory-constrained, carrying a single-board computer on the scale of an NVIDIA Jetson Nano with 4GB of RAM. At the same time, the UAV requires accuracy close to that of state-of-the-art ViTs to ensure safe object avoidance in autonomous navigation and correct localization of humans in search-and-rescue, and inference latency must also be minimized. Hence, our target is to enable rapid inference of a vision transformer on an NVIDIA Jetson Nano (4GB) with minimal accuracy loss, allowing ViTs to be deployed on resource-constrained devices and opening up new possibilities in surveillance, environmental monitoring, and related applications. Our implementation is made available at https://github.com/chensy7/efficient-vit.
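
The abstract names distillation, pruning, and quantization without detailing the recipe, so the sketch below only illustrates how the three steps could be chained in PyTorch; it is not the authors' pipeline. The teacher/student pairing (timm's vit_base_patch16_224 and deit_tiny_patch16_224), the distillation temperature, the 30% sparsity ratio, and dynamic int8 quantization are illustrative assumptions.

```python
# Minimal sketch of the three compression steps named in the abstract.
# All model names and hyperparameters below are illustrative assumptions,
# not the configuration used in the paper.
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune
import timm

# 1) Distillation: match softened teacher and student logits with KL divergence.
teacher = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
student = timm.create_model("deit_tiny_patch16_224", pretrained=False)

def distillation_loss(images, labels, temperature=4.0, alpha=0.5):
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# 2) Pruning: zero out the 30% smallest-magnitude weights in every linear layer.
for module in student.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the sparsity permanent

# 3) Quantization: dynamic int8 quantization of linear layers for CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    student, {torch.nn.Linear}, dtype=torch.qint8
)

# Forward pass on a dummy batch to check the compressed model still runs.
dummy = torch.randn(1, 3, 224, 224)
print(quantized(dummy).shape)  # torch.Size([1, 1000])
```

In practice these steps interact (e.g., pruning and quantization after distillation can erode the distilled accuracy), so a real deployment would re-evaluate accuracy and latency on the target device after each stage.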

