SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

03/30/2023
by Xuanyao Chen, et al.

High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to translate into actual speedup for CNNs, since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
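As a rough illustration of the window activation pruning idea described above, the following PyTorch sketch (not the authors' code; the function name prune_windows and the L2-norm scoring rule are illustrative assumptions) scores each attention window by the magnitude of its activations, keeps only the top-scoring fraction, and batches attention over the surviving windows:

import torch

def prune_windows(windows: torch.Tensor, sparsity: float):
    """windows: (num_windows, tokens_per_window, channels).
    Returns the retained windows and their original indices."""
    num_windows = windows.shape[0]
    num_keep = max(1, int(round(num_windows * (1.0 - sparsity))))
    # Importance score: L2 norm of each window's activations (an assumed heuristic).
    scores = windows.flatten(1).norm(dim=1)
    keep_idx = scores.topk(num_keep).indices
    return windows[keep_idx], keep_idx

# Usage: prune 60% of the windows, attend only over the survivors,
# then scatter the results back into the dense layout.
x = torch.randn(64, 49, 96)           # 64 windows of 7x7 tokens, 96 channels
kept, idx = prune_windows(x, sparsity=0.6)
attn = torch.nn.MultiheadAttention(embed_dim=96, num_heads=4, batch_first=True)
out, _ = attn(kept, kept, kept)       # attention batched over retained windows only
y = x.clone()                          # pruned windows pass through unchanged
y[idx] = out

Because pruned windows are skipped outright rather than masked, the retained windows still form a dense batch, which is what allows the activation sparsity to translate into actual latency reduction rather than just fewer nominal FLOPs.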

Related research

SBNet: Sparse Blocks Network for Fast Inference (01/07/2018)
Conventional deep convolutional neural networks (CNNs) apply convolution...

A closer look at network resolution for efficient network design (09/27/2019)
There is growing interest in designing lightweight neural networks for m...

Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations (11/25/2022)
The formidable accomplishment of Transformers in natural language proces...

High-Resolution Representations for Labeling Pixels and Regions (04/09/2019)
High-resolution representation learning plays an essential role in many ...

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition (05/29/2022)
Vision Transformer (ViT) has achieved remarkable performance in many vis...

EAPruning: Evolutionary Pruning for Vision Transformers and CNNs (10/01/2022)
Structured pruning greatly eases the deployment of large neural networks...

Adaptive Window Pruning for Efficient Local Motion Deblurring (06/25/2023)
Local motion blur commonly occurs in real-world photography due to the m...
