Bi-ViT: Pushing the Limit of Vision Transformer Quantization

05/21/2023
by Yanjing Li, et al.

Vision transformer (ViT) quantization offers a promising route to deploying large pre-trained networks on resource-limited devices. Fully binarized ViTs (Bi-ViT), which push ViT quantization to its limit, remain largely unexplored and highly challenging because of their unacceptable performance drop. Through extensive empirical analysis, we identify that the severe degradation under ViT binarization is caused by attention distortion in self-attention, which in turn stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor that reactivates the vanished gradients, and we demonstrate its effectiveness through theoretical and experimental analysis. We then propose a ranking-aware distillation method that rectifies the disordered ranking within a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method outperforms the baselines by up to 22.1% in Top-1 accuracy, while delivering 61.5x and 56.1x theoretical acceleration in FLOPs, respectively, compared with the real-valued counterparts on ImageNet.
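
The abstract names two mechanisms: a learnable scaling factor that keeps binarized attention scores inside the gradient-carrying band of a straight-through estimator, and a ranking-aware distillation loss that makes the student preserve the teacher's attention ordering. The following is a minimal PyTorch sketch of those two ideas under stated assumptions: the clipped-STE rule, the per-head scale named `alpha`, the `BinaryAttention` module, and the pairwise hinge form of `ranking_aware_distill_loss` are all illustrative choices, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a clipped straight-through estimator.

    Forward: sign(x). Backward: pass the gradient only where |x| <= 1;
    outside that band the gradient is zero, which is the kind of
    vanishing-gradient behaviour the abstract refers to.
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)


class BinaryAttention(nn.Module):
    """Toy self-attention head whose attention map is binarized.

    `alpha` is a learnable per-head scale applied to the softmax scores
    before binarization. Softmax scores live in (0, 1) and are mostly
    tiny, so an unscaled sign() leaves them where the clipped STE gives
    (near-)zero gradients; a learnable scale stretches them back into
    the STE's active range. The placement and naming here are
    illustrative assumptions.
    """

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable scale per head, initialised > 1.
        self.alpha = nn.Parameter(torch.full((num_heads, 1, 1), 2.0))

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        # Rescale, then binarize to {0, 1} via (sign + 1) / 2.
        attn_bin = (BinarizeSTE.apply(self.alpha * attn - 0.5) + 1) / 2
        out = (attn_bin @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


def ranking_aware_distill_loss(student_attn, teacher_attn, margin=0.0):
    """Pairwise ranking loss between student and teacher attention maps.

    Penalises token pairs whose order in the student contradicts the
    teacher's order, so the binarized student keeps the teacher's
    attention ranking. This is a generic pairwise hinge formulation used
    to illustrate ranking-aware distillation; the paper's exact loss may
    differ.
    """
    s_diff = student_attn.unsqueeze(-1) - student_attn.unsqueeze(-2)
    t_diff = teacher_attn.unsqueeze(-1) - teacher_attn.unsqueeze(-2)
    return F.relu(margin - torch.sign(t_diff) * s_diff).mean()
```

In this sketch the scale acts before the sign function, so its gradient flows through the same STE mask as the attention scores; the ranking loss is applied between the student's real-valued attention and a full-precision teacher's attention in a standard teacher-student setup.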

Related research

10/13/2022 - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer
The large pre-trained vision transformers (ViTs) have demonstrated remar...

06/27/2021 - Post-Training Quantization for Vision Transformer
Recently, transformer has achieved remarkable performance on a variety o...

01/19/2022 - Q-ViT: Fully Differentiable Quantization for Vision Transformer
In this paper, we propose a fully differentiable quantization method for...

05/24/2023 - BinaryViT: Towards Efficient and Accurate Binary Vision Transformers
Vision Transformers (ViTs) have emerged as the fundamental architecture ...

03/11/2021 - Improving Bi-encoder Document Ranking Models with Two Rankers and Multi-teacher Distillation
BERT-based Neural Ranking Models (NRMs) can be classified according to h...

11/20/2022 - Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Knowledge distillation (KD) has been a ubiquitous method for model compr...

06/19/2019 - Back to Simplicity: How to Train Accurate BNNs from Scratch?
Binary Neural Networks (BNNs) show promising progress in reducing comput...
