AUTOKD: Automatic Knowledge Distillation Into A Student Architecture Family

11/05/2021
by Roy Henha Eyono, et al.

State-of-the-art results in deep learning have been improving steadily, in large part due to the use of ever-larger models. However, widespread deployment is constrained by device hardware limitations, leaving a substantial performance gap between state-of-the-art models and those that can realistically run on small devices. While Knowledge Distillation (KD) in principle enables a small student model to emulate a larger teacher model, selecting a good student architecture in practice requires considerable human expertise. Neural Architecture Search (NAS) appears to be a natural solution to this problem, but most approaches are inefficient: the bulk of the computation is spent comparing architectures sampled from the same distribution, whose performance differs only negligibly. In this paper, we propose instead to search for a family of student architectures that share the property of being good at learning from a given teacher. Our approach, AutoKD, powered by Bayesian Optimization, explores a flexible graph-based search space, allowing us to automatically learn the optimal student architecture distribution and the KD parameters, while being 20x more sample-efficient than the existing state of the art. We evaluate our method on 3 datasets; on large images in particular, we match the teacher's performance while using 3x less memory and 10x fewer parameters. Finally, although AutoKD uses the traditional KD loss, it outperforms more advanced KD variants that use hand-designed students.
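The abstract refers to "the traditional KD loss". As a point of reference, the sketch below shows the standard Hinton-style soft-target formulation of that loss in PyTorch; it is not the authors' code. The function name `kd_loss` and the default values of `temperature` and `alpha` are illustrative assumptions, and in AutoKD such KD parameters are among the quantities learned by the search rather than fixed by hand.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Standard (Hinton-style) knowledge distillation loss.

    Combines a soft-target term (KL divergence between temperature-softened
    teacher and student distributions) with ordinary cross-entropy on the
    ground-truth labels. `temperature` and `alpha` are the kind of KD
    parameters AutoKD tunes alongside the student architecture.
    """
    # Soft-target term: student log-probs vs. teacher probs at temperature T.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    distill = distill * temperature ** 2

    # Hard-target term: cross-entropy with the true labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * hard

# Example usage with random logits for a batch of 8 examples and 10 classes.
if __name__ == "__main__":
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    print(kd_loss(student_logits, teacher_logits, labels))
```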

