DisWOT: Student Architecture Search for Distillation WithOut Training

03/28/2023
by Peijie Dong, et al.

Knowledge distillation (KD) is an effective training strategy for improving lightweight student models under the guidance of cumbersome teachers. However, large architecture differences across teacher-student pairs limit the distillation gains. In contrast to previous adaptive distillation methods that reduce the teacher-student gap, we explore a novel training-free framework that searches for the best student architecture for a given teacher. Our work first shows empirically that the optimal model under vanilla training is not necessarily the winner in distillation. Secondly, we find that the similarity of feature semantics and sample relations between randomly initialized teacher-student networks correlates well with the final distillation performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation maps and select the optimal student via an evolutionary algorithm without any training. In this way, our student architecture search for Distillation WithOut Training (DisWOT) significantly improves the performance of the model in the distillation stage, with at least a 180× training acceleration. Additionally, we extend the similarity metrics in DisWOT as new distillers and KD-based zero-proxies. Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces. Our project and code are available at https://lilujunai.github.io/DisWOT-CVPR2023/.
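To make the idea concrete, the sketch below (PyTorch, not the authors' released code) shows how such a training-free score could be computed for one candidate student: both networks are randomly initialized, a single mini-batch is passed through, and feature-semantic and sample-relation similarities are measured. The helper names (`diswot_score`, `semantic_similarity`, `relation_similarity`), the assumption that each network exposes a (B, C, H, W) feature map, and the use of channel-averaged activations as a stand-in for the semantic activation maps are illustrative assumptions; the project page linked above contains the actual implementation.

```python
# Minimal sketch of a DisWOT-style training-free proxy (illustrative, not the official code).
import torch
import torch.nn.functional as F


def semantic_similarity(ft: torch.Tensor, fs: torch.Tensor) -> float:
    """Cosine similarity of channel-averaged activation maps (teacher vs. student).

    ft, fs: feature maps of shape (B, C, H, W); channel counts and spatial sizes may differ.
    """
    at = ft.mean(dim=1).flatten(1)                               # (B, H_t * W_t)
    a_s = F.adaptive_avg_pool2d(fs, ft.shape[-2:]).mean(dim=1).flatten(1)
    at, a_s = F.normalize(at, dim=1), F.normalize(a_s, dim=1)
    return (at * a_s).sum(dim=1).mean().item()                   # mean cosine similarity


def relation_similarity(ft: torch.Tensor, fs: torch.Tensor) -> float:
    """Agreement of sample-relation (batch Gram) matrices; higher is better."""
    gt = F.normalize(ft.flatten(1), dim=1)
    gs = F.normalize(fs.flatten(1), dim=1)
    rt, rs = gt @ gt.t(), gs @ gs.t()                            # (B, B) relation matrices
    return -torch.norm(rt - rs, p="fro").item()


@torch.no_grad()
def diswot_score(teacher, student, images: torch.Tensor) -> float:
    """Score a randomly initialized candidate student without any gradient step.

    Both models are assumed to return a (B, C, H, W) feature map (e.g. the backbone
    output before pooling). In practice the two terms would be normalized or weighted
    before being combined for ranking.
    """
    ft = teacher(images)
    fs = student(images)
    return semantic_similarity(ft, fs) + relation_similarity(ft, fs)
```

In an evolutionary search, each candidate student in the population would be scored this way on a single mini-batch, and only the top-ranked architecture would then be trained with distillation under the given teacher.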
