Understanding the Effect of Data Augmentation on Knowledge Distillation

05/21/2023
by Ziqi Wang, et al.

Knowledge distillation (KD) requires sufficient data to transfer knowledge from large-scale teacher models to small-scale student models. Therefore, data augmentation has been widely used to mitigate the shortage of data under specific scenarios. Classic data augmentation techniques, such as synonym replacement and k-nearest-neighbors, were initially designed for fine-tuning. To avoid severe semantic shifts and preserve task-specific labels, those methods prefer to change only a small proportion of tokens (e.g., changing 10% of tokens is generally the best option for fine-tuning). However, such data augmentation methods are sub-optimal for knowledge distillation, since the teacher model can provide label distributions and is more tolerant to semantic shifts. We first observe that KD prefers as much data as possible, unlike fine-tuning, where adding more data beyond a point yields no further gains. Since changing more tokens leads to more semantic shift, we use the proportion of changed tokens to reflect the semantic shift degree. We then find that KD prefers augmented data with a larger semantic shift degree (e.g., changing 30% of tokens is generally the best option for KD) than fine-tuning (changing 10% of tokens). Besides, our findings show that smaller datasets prefer larger degrees until the out-of-distribution problem occurs (e.g., datasets with fewer than 10k inputs may prefer the 50% degree, while larger datasets prefer the 10% degree). Our work sheds light on the difference in preferred data augmentation between fine-tuning and knowledge distillation and encourages the community to explore KD-specific data augmentation methods.
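
The two moving parts described above can be made concrete with a minimal sketch: a token-replacement augmenter whose change_ratio controls the semantic shift degree (around 0.1 for fine-tuning, around 0.3 for KD, per the findings in the abstract), and a soft-label distillation loss that consumes the teacher's label distribution. The synonym table, function names, and temperature value below are illustrative assumptions, not artifacts from the paper.

import random

import torch
import torch.nn.functional as F

# Toy synonym table; a real setup would use WordNet or embedding neighbors.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "wonderful"],
    "bad": ["poor", "terrible"],
}

def augment(tokens, change_ratio):
    """Replace roughly `change_ratio` of the tokens with synonyms.

    `change_ratio` is the semantic shift degree: ~0.1 for fine-tuning,
    larger values (e.g., ~0.3) for distillation.
    """
    n_change = max(1, int(len(tokens) * change_ratio))
    out = list(tokens)
    for idx in random.sample(range(len(tokens)), k=min(n_change, len(tokens))):
        candidates = SYNONYMS.get(out[idx])
        if candidates:
            out[idx] = random.choice(candidates)
    return out

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation loss: KL(teacher || student) at temperature T."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)

if __name__ == "__main__":
    tokens = "this movie was great but the ending was bad".split()
    print(augment(tokens, change_ratio=0.3))

    teacher_logits = torch.randn(4, 2)  # e.g., produced by a large fine-tuned model
    student_logits = torch.randn(4, 2)
    print(kd_loss(student_logits, teacher_logits).item())

In a KD pipeline the augmented inputs are labeled by the teacher's soft distribution rather than the original hard label, which is why larger shift degrees remain usable in distillation but not in fine-tuning.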

research
10/21/2022

Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation

Knowledge distillation is one of the primary methods of transferring kno...
research
07/03/2021

Isotonic Data Augmentation for Knowledge Distillation

Knowledge distillation uses both real hard labels and soft labels predic...
research
11/30/2022

Explicit Knowledge Transfer for Weakly-Supervised Code Generation

Large language models (LLMs) can acquire strong code-generation capabili...
research
05/02/2023

Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models

Fine-tuning large models is highly effective; however, inference using t...
research
12/06/2019

Explaining Sequence-Level Knowledge Distillation as Data-Augmentation for Neural Machine Translation

Sequence-level knowledge distillation (SLKD) is a model compression tech...
research
01/01/2022

Role of Data Augmentation Strategies in Knowledge Distillation for Wearable Sensor Data

Deep neural networks are parametrized by several thousands or millions o...
research
03/13/2022

CEKD: Cross Ensemble Knowledge Distillation for Augmented Fine-grained Data

Data augmentation has been proved effective in training deep models. Exi...
