DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

04/27/2022
by Xianing Chen, et al.

Transformers have been successfully applied to computer vision thanks to the powerful modeling capacity of self-attention. However, their excellent performance depends heavily on enormous amounts of training data, so a data-efficient transformer solution is urgently needed. In this work, we propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers. DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation. Furthermore, DearKD can readily be applied to the extreme data-free case where no real images are available at all. For this case, we propose a boundary-preserving intra-divergence loss based on DeepInversion to further close the performance gap to the full-data counterpart. Extensive experiments on ImageNet, partial ImageNet, the data-free setting, and other downstream tasks demonstrate the superiority of DearKD over its baselines and state-of-the-art methods.
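The abstract only sketches the method, so below is a minimal, hypothetical PyTorch sketch of how a two-stage objective of this kind could be wired up. All identifiers (dearkd_loss, intra_divergence_loss, alpha, beta) and the concrete loss forms (MSE feature alignment, KL logit distillation, mean pairwise distance) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dearkd_loss(student_logits, labels, teacher_logits=None,
                student_feats=None, teacher_feats=None,
                stage=1, alpha=0.5, beta=1.0):
    """Hypothetical two-stage objective in the spirit of DearKD.

    Stage 1: task loss + logit distillation from a CNN teacher +
             alignment of early transformer features with early CNN
             features (the "inductive bias" distillation).
    Stage 2: task loss only, so the transformer trains unconstrained.
    """
    loss = F.cross_entropy(student_logits, labels)
    if stage == 1:
        # Soft-target distillation at the logit level.
        loss = loss + alpha * F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        # Align early transformer layers with early CNN feature maps;
        # projecting both to a common shape is assumed to happen upstream.
        for s_f, t_f in zip(student_feats, teacher_feats):
            loss = loss + beta * F.mse_loss(s_f, t_f)
    return loss

def intra_divergence_loss(embeddings, labels):
    """Hypothetical simplification of the boundary-preserving
    intra-divergence idea: push same-class synthetic samples apart
    so DeepInversion-style images do not collapse to a single mode."""
    loss = embeddings.new_zeros(())
    for c in labels.unique():
        z = embeddings[labels == c]
        if z.size(0) > 1:
            n = z.size(0)
            # Maximize the mean pairwise distance within the class
            # (i.e., minimize its negative).
            loss = loss - torch.cdist(z, z).sum() / (n * (n - 1))
    return loss
```

Note that the paper's boundary-preserving constraint, which keeps synthesized samples from drifting across class boundaries, is omitted in this simplified sketch.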


Related research

07/03/2021: Efficient Vision Transformers via Fine-Grained Manifold Distillation
This paper studies the model compression problem of vision transformers....

07/17/2023: Cumulative Spatial Knowledge Distillation for Vision Transformers
Distilling knowledge from convolutional neural networks (CNNs) is a doub...

05/31/2020: Transferring Inductive Biases through Knowledge Distillation
Having the right inductive biases can be crucial in many tasks or scenar...

08/01/2023: LGViT: Dynamic Early Exiting for Accelerating Vision Transformer
Recently, the efficient deployment and acceleration of powerful vision t...

11/30/2022: HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
Transformers have attained superior performance in natural language proc...

12/10/2021: Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization
The transformer multi-head self-attention mechanism has been thoroughly ...

11/20/2022: Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Knowledge distillation (KD) has been a ubiquitous method for model compr...
