Masked Autoencoders Enable Efficient Knowledge Distillers

08/25/2022
by Yutong Bai, et al.

This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward propagating inputs through the first few layers, to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% top-1 ImageNet accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with a 95% masking ratio, where merely ten patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%, and it can still secure 82.4% with only four visible patches (a 98% masking ratio). The code and models are publicly available at https://github.com/UCSC-VLAA/DMAE.
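To make the described objective concrete, the sketch below shows, in PyTorch-style Python, how a combined loss of this kind could look: pixel reconstruction on the masked patches plus a feature-distance term against a partially executed, frozen teacher. This is only an illustrative sketch, not the official implementation; the class name DistillMAELoss, the feat_weight parameter, and the assumed tensor layout are inventions for clarity, and the released code at the GitHub link above is authoritative.

import torch.nn as nn
import torch.nn.functional as F

class DistillMAELoss(nn.Module):
    """Illustrative sketch: pixel reconstruction on masked patches plus a
    feature-distance term against the teacher's intermediate feature map."""

    def __init__(self, feat_weight: float = 1.0):
        super().__init__()
        self.feat_weight = feat_weight

    def forward(self, pred_pixels, target_pixels, mask, student_feats, teacher_feats):
        # pred_pixels, target_pixels: (B, N, patch_dim); mask: (B, N), 1 = masked patch.
        recon = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)
        recon = (recon * mask).sum() / mask.sum()  # average over masked patches only

        # Distance between student features and the (frozen) teacher's intermediate
        # feature map, obtained by running only the teacher's first few layers on the
        # small visible subset of patches.
        feat_dist = F.smooth_l1_loss(student_feats, teacher_feats.detach())

        return recon + self.feat_weight * feat_dist

Because the teacher forwards only the small visible subset of patches through its first few layers, the added cost of such a distillation term over plain MAE pre-training stays modest, which is the efficiency argument made in the abstract.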

