Boost Vision Transformer with GPU-Friendly Sparsity and Quantization

05/18/2023
by Chong Yu et al.

The transformer has extended its success from the language domain to the vision domain. Because of the stacked self-attention and cross-attention blocks, accelerating the deployment of vision transformers on GPU hardware is challenging and rarely studied. This paper designs a compression scheme that maximally utilizes GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, a large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's hardware acceleration of the 2:4 structured sparse pattern with the FP16 data type; the floating-point sparse model is then quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra speedup GPUs provide for 2:4 sparse computation on integer tensors. A mixed-strategy knowledge distillation is applied during both the pruning and quantization steps. The proposed compression scheme is flexible enough to support both supervised and unsupervised learning styles. Experimental results show that the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer models by 6.4-12.7 times in model size and 30.3-62 times in FLOPs with negligible accuracy degradation on the ImageNet classification, COCO detection, and ADE20K segmentation benchmarks. Moreover, GPUSQ-ViT boosts actual deployment performance by 1.39-1.79 times in latency and 3.22-3.43 times in throughput on the A100 GPU, and by 1.57-1.69 times in latency and 2.11-2.51 times in throughput on AGX Orin.
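For intuition about the pattern being exploited: 2:4 fine-grained structured sparsity requires that every group of four consecutive weights keep at most two nonzero values, a pattern Ampere-class GPUs accelerate with Sparse Tensor Cores. The PyTorch sketch below shows magnitude-based 2:4 mask generation and a generic soft-label distillation loss; the helper names (prune_2_4, distill_loss), the temperature, and the loss weighting are illustrative assumptions, not the authors' GPUSQ-ViT implementation.

```python
# Minimal sketch of 2:4 structured pruning plus knowledge distillation.
# Illustrative only: names, temperature, and weighting are assumptions,
# not the GPUSQ-ViT authors' actual mixed-strategy implementation.
import torch
import torch.nn.functional as F

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero all but the 2 largest-magnitude values in every group of
    4 consecutive weights (assumes weight.numel() is divisible by 4)."""
    w = weight.reshape(-1, 4)                 # group into blocks of 4
    idx = w.abs().topk(2, dim=1).indices      # keep top-2 per block
    mask = torch.zeros_like(w).scatter_(1, idx, 1.0)
    return (w * mask).reshape(weight.shape)

def distill_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Mix hard-label cross-entropy with soft-label KL distillation."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * soft + (1.0 - alpha) * hard
```

In practice, mask generation and maintenance during fine-tuning are typically handled by tooling such as NVIDIA's Automatic SParsity (ASP) library rather than written by hand, and the resulting 2:4 pattern can then be exploited at deployment time.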

Related research

Quantized Feature Distillation for Network Quantization (07/20/2023)
Neural network quantization aims to accelerate and trim full-precision n...

Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders (11/20/2022)
Knowledge distillation (KD) has been a ubiquitous method for model compr...

SaiT: Sparse Vision Transformers through Adaptive Token Pruning (10/11/2022)
While vision transformers have achieved impressive results, effectively ...

Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases (01/27/2023)
Improving the deployment efficiency of transformer-based language models...

HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers (11/15/2022)
While vision transformers (ViTs) have continuously achieved new mileston...

Variation-aware Vision Transformer Quantization (07/01/2023)
Despite the remarkable performance of Vision Transformers (ViTs) in vari...

S4: a High-sparsity, High-performance AI Accelerator (07/16/2022)
Exploiting sparsity underlying neural networks has become one of the mos...
