Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

03/16/2023
by Aashka Trivedi, et al.

Large pre-trained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. KD, however, remains ineffective when the student is manually selected from a set of existing options already pre-trained on large corpora, a sub-optimal choice within the space of all possible student architectures. This paper proposes KD-NAS, the use of Neural Architecture Search (NAS) guided by the Knowledge Distillation process to find the optimal student model for distillation from a teacher, for a given natural language task. In each episode of the search process, a NAS controller predicts a reward based on a combination of accuracy on the downstream task and latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture with the highest reward is selected and distilled on the full downstream-task training set. When distilling on the MNLI task, our KD-NAS model produces a 2-point improvement in accuracy on GLUE tasks at equivalent GPU latency, compared to a hand-crafted student architecture available in the literature. Using Knowledge Distillation, this model also achieves a 1.4x speedup in GPU latency (3.2x speedup on CPU) with respect to a BERT-Base teacher, while maintaining 97% performance on GLUE tasks (without CoLA). We also obtain an architecture with performance equivalent to the hand-crafted student model on the GLUE benchmark, but with 15% of the number of parameters.
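To make the search procedure concrete, the following is a minimal, runnable Python sketch of an episodic KD-NAS-style loop under the description above. Every component here (sample_architectures, Controller, distill_on_proxy, evaluate, measure_latency, and the additive reward with weight alpha) is a mock placeholder assumed for illustration; it is not the paper's implementation or its exact reward formula.

```python
import random

# --- Mock components: everything below is a placeholder assumed for illustration;
# --- the paper's actual search space, controller, and distillation are not shown here.

def sample_architectures(n):
    """Sample candidate student configurations (depth, hidden size, attention heads)."""
    return [{"layers": random.choice([4, 6, 8]),
             "hidden": random.choice([256, 384, 512]),
             "heads": random.choice([4, 8, 12])}
            for _ in range(n)]

class Controller:
    """Toy stand-in for the NAS controller that predicts a reward for an architecture."""
    def predict_reward(self, arch):
        # Placeholder heuristic favouring smaller models; a real controller is learned.
        return -(arch["layers"] * arch["hidden"]) / 4096.0 + random.random()

    def update(self, arch, observed_reward):
        # A real controller would be trained on (architecture, observed reward) pairs.
        pass

def distill_on_proxy(teacher, arch):
    return arch          # stand-in: pretend the distilled student is its own config

def evaluate(student):
    return random.uniform(0.70, 0.85)   # mock proxy-set accuracy

def measure_latency(model):
    return 1.0 if model == "teacher" else random.uniform(0.2, 0.8)   # mock latency

def reward(accuracy, latency, teacher_latency, alpha=0.5):
    # Combination of downstream accuracy and inference speed-up over the teacher;
    # the additive form and the weight alpha are assumptions, not the paper's formula.
    return accuracy + alpha * (teacher_latency / latency)

# --- Episodic search loop in the spirit of the abstract -------------------------

def kd_nas_search(teacher="teacher", num_episodes=5, top_k=3):
    controller, best_arch, best_reward = Controller(), None, float("-inf")
    for _ in range(num_episodes):
        candidates = sample_architectures(n=50)
        # The controller ranks candidates by predicted reward (accuracy + latency).
        ranked = sorted(candidates, key=controller.predict_reward, reverse=True)
        # Only the top candidates are distilled from the teacher on a small proxy set.
        for arch in ranked[:top_k]:
            student = distill_on_proxy(teacher, arch)
            r = reward(evaluate(student), measure_latency(student), measure_latency(teacher))
            controller.update(arch, r)
            if r > best_reward:
                best_arch, best_reward = arch, r
    # The highest-reward architecture would then be distilled on the full training set.
    return best_arch

if __name__ == "__main__":
    print("Selected student architecture:", kd_nas_search())
```

Distilling only the top-ranked candidates on a small proxy set keeps each search episode cheap; as the abstract notes, the final, highest-reward architecture is then distilled on the full downstream-task training set.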


Related research

Search to Distill: Pearls are Everywhere but not the Eyes (11/20/2019)
Standard Knowledge Distillation (KD) approaches distill the knowledge of...

AutoDistil: Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models (01/29/2022)
Knowledge distillation (KD) methods compress large models into smaller s...

Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search (07/15/2021)
Sequential recommender systems (SRS) have become a research hotspot due ...

AUTOKD: Automatic Knowledge Distillation Into A Student Architecture Family (11/05/2021)
State-of-the-art results in deep learning have been improving steadily, ...

FasterSeg: Searching for Faster Real-time Semantic Segmentation (12/23/2019)
We present FasterSeg, an automatically designed semantic segmentation ne...

AutoADR: Automatic Model Design for Ad Relevance (10/14/2020)
Large-scale pre-trained models have attracted extensive attention in the...

LE-NAS: Learning-based Ensemble with NAS for Dose Prediction (06/12/2021)
Radiation therapy treatment planning is a complex process, as the target...
