Knowledge Distillation Thrives on Data Augmentation

12/05/2020
by Huan Wang, et al.

Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model. Many works have explored the rationale behind its success; however, its interplay with data augmentation (DA) has received little attention so far. In this paper, we are motivated by an interesting observation in classification: the KD loss can benefit from extended training iterations, while the cross-entropy loss cannot. We show that this disparity arises from data augmentation: the KD loss can tap into the extra information offered by the different input views that DA produces. Guided by this explanation, we propose to enhance KD with stronger data augmentation schemes (e.g., mixup, CutMix). Furthermore, we develop an even stronger DA approach specifically for KD, based on the idea of active learning. Extensive experiments on the CIFAR-100, Tiny ImageNet, and ImageNet datasets validate the findings and merits of the proposed method. Simply pairing the original KD loss with stronger augmentation schemes already yields better performance than existing state-of-the-art methods that employ more advanced distillation losses. Moreover, combining our approach with those advanced losses advances the state of the art even further. Beyond the encouraging performance, this paper also sheds light on why knowledge distillation succeeds. The discovered interplay between KD and DA may inspire more advanced KD algorithms.
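To make "the original KD loss combined with a stronger augmentation" concrete, the sketch below pairs the standard temperature-scaled KD objective with CutMix in PyTorch. It is only an illustrative reading of the abstract, not the authors' released code: the `teacher`, `student`, `optimizer`, temperature `T=4.0`, Beta parameter `alpha=1.0`, and loss weight `ce_weight=0.1` are all assumed placeholders.

```python
# Minimal sketch (assumed setup, not the authors' code): vanilla KD loss + CutMix.
import torch
import torch.nn.functional as F

def cutmix(x, y, alpha=1.0):
    """Paste a random patch from a shuffled copy of the batch into each image."""
    x = x.clone()                                   # avoid modifying the loader's tensor
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0), device=x.device)
    h, w = x.shape[-2:]
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    x[:, :, y1:y2, x1:x2] = x[perm, :, y1:y2, x1:x2]
    lam = 1 - (y2 - y1) * (x2 - x1) / (h * w)       # adjust mixing weight for clipping
    return x, y, y[perm], lam

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard KD term: KL divergence between softened teacher and student outputs."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)

def train_step(student, teacher, x, y, optimizer, ce_weight=0.1):
    """One training step: CE on mixed labels + KD on the same augmented view."""
    x, y_a, y_b, lam = cutmix(x, y)
    with torch.no_grad():
        t_logits = teacher(x)                       # teacher (in eval mode) sees the augmented input
    s_logits = student(x)
    ce = lam * F.cross_entropy(s_logits, y_a) + (1 - lam) * F.cross_entropy(s_logits, y_b)
    loss = ce_weight * ce + (1 - ce_weight) * kd_loss(s_logits, t_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that the KD term needs no ground-truth labels, so every additional augmented view of an image supplies usable teacher supervision; this is the mechanism the abstract credits for the KD loss, unlike cross-entropy, benefiting from longer training.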

Related research

06/24/2022  Online Distillation with Mixed Sample Augmentation
Mixed Sample Regularization (MSR), such as MixUp or CutMix, is a powerfu...

04/15/2022  CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation
Knowledge distillation (KD) is an efficient framework for compressing la...

02/21/2023  Evaluating the effect of data augmentation and BALD heuristics on distillation of Semantic-KITTI dataset
Active Learning (AL) has remained relatively unexplored for LiDAR percep...

05/29/2022  A General Multiple Data Augmentation Based Framework for Training Deep Neural Networks
Deep neural networks (DNNs) often rely on massive labelled data for trai...

05/25/2023  VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale
The tremendous success of large models trained on extensive datasets dem...

05/23/2023  Decoupled Kullback-Leibler Divergence Loss
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence...

03/10/2020  SuperMix: Supervising the Mixing Data Augmentation
In this paper, we propose a supervised mixing augmentation method, terme...
