Generic-to-Specific Distillation of Masked Autoencoders

02/28/2023
by Wei Huang, et al.

Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms have achieved unprecedented progress. Lightweight ViT models, however, are limited by their capacity and benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm for transferring representations from large (teacher) models to small (student) ones. However, conventional single-stage distillation easily gets stuck on task-specific transfer and fails to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, the decoder of the small model is encouraged to align its feature predictions with the hidden representations of the large model, so that task-agnostic knowledge is transferred. In specific distillation, the predictions of the small model are constrained to be consistent with those of the large model, transferring the task-specific features that guarantee task performance. With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, respectively, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD.
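To make the two-stage recipe concrete, below is a minimal PyTorch-style sketch of the two distillation objectives described above. The function names, the smooth-L1 feature-alignment metric, the KL-plus-cross-entropy mixture, and the `tau`/`alpha` hyperparameters are illustrative assumptions rather than the paper's exact losses: the generic stage aligns the student decoder's feature predictions with the teacher's hidden representations at selected token positions, and the specific stage constrains the student's task predictions to match the teacher's.

```python
import torch
import torch.nn.functional as F


def generic_distillation_loss(student_pred, teacher_feat, mask):
    """Generic stage: align the small model's decoder predictions with the
    hidden representations of the MAE-pre-trained teacher.

    student_pred, teacher_feat: (B, N, D) token features.
    mask: (B, N) boolean tensor selecting the token positions to align
    (e.g., the masked patches). Smooth-L1 is an assumed choice of metric.
    """
    pred = student_pred[mask]             # (num_selected, D)
    target = teacher_feat[mask].detach()  # no gradient through the teacher
    return F.smooth_l1_loss(pred, target)


def specific_distillation_loss(student_logits, teacher_logits, labels,
                               tau=1.0, alpha=0.5):
    """Specific stage: keep the small model's task predictions consistent
    with the fine-tuned teacher's, mixed with ground-truth supervision.
    tau (temperature) and alpha (mixing weight) are assumed hyperparameters.
    """
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

In both stages the teacher in this sketch is treated as frozen, so its outputs are detached from the gradient graph and only the student is updated.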
