Leveraging Angular Distributions for Improved Knowledge Distillation

02/27/2023
by Eun Som Jeon, et al.

Knowledge distillation, as a broad class of methods, has led to the development of lightweight and memory-efficient models by using a pre-trained model with large capacity (teacher network) to train a smaller model (student network). Recently, additional variations of knowledge distillation that use activation maps of intermediate layers as the source of knowledge have been studied. Generally, in computer vision applications, the feature activations learned by a higher-capacity model contain richer knowledge, highlighting complete objects while focusing less on the background. Based on this observation, we leverage the dual ability of the teacher to accurately distinguish between positive (relevant to the target object) and negative (irrelevant) areas. We propose a new loss function for distillation, called angular margin-based distillation (AMD) loss. AMD loss uses the angular distance between positive and negative features by projecting them onto a hypersphere, motivated by the near-angular distributions seen in many feature extractors. Then, we create a more attentive feature that is angularly distributed on the hypersphere by introducing an angular margin to the positive feature. Transferring such knowledge from the teacher network enables the student model to harness the teacher's higher discrimination of positive and negative features, thus distilling superior student models. The proposed method is evaluated for various student-teacher network pairs on four public datasets. Furthermore, we show that the proposed method is compatible with other learning techniques, such as using fine-grained features, augmentation, and other distillation methods.
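To make the idea concrete, here is a minimal sketch of an angular margin-based distillation loss in PyTorch. This is a hypothetical rendering under stated assumptions, not the authors' implementation: it assumes teacher and student feature maps of shape (B, C, H, W), uses the teacher's attention map with an illustrative threshold to split spatial locations into positive and negative regions, projects the pooled maps onto a hypersphere via L2 normalization, and adds an angular margin to the positive part before matching the student to it. The names amd_loss, margin, and threshold are illustrative only.

```python
# Hypothetical sketch of an AMD-style loss (not the paper's official code).
import torch
import torch.nn.functional as F

def amd_loss(f_student, f_teacher, margin=0.5, threshold=0.5):
    """Sketch: split the teacher's spatial attention into positive (object-relevant)
    and negative (background) regions, project the channel-pooled maps onto a
    hypersphere, add an angular margin to the positive locations, and have the
    student mimic the margin-augmented teacher map."""
    # Channel-wise pooled attention maps, flattened over spatial locations and
    # L2-normalized so each map lies on the unit hypersphere.
    a_s = F.normalize(f_student.pow(2).mean(dim=1).flatten(1), dim=1)
    a_t = F.normalize(f_teacher.pow(2).mean(dim=1).flatten(1), dim=1)

    # The teacher decides which spatial locations count as positive (illustrative rule).
    pos_mask = (a_t >= threshold * a_t.max(dim=1, keepdim=True).values).float()

    # Each component of a unit vector is the cosine of its angle to that axis;
    # add an angular margin to the positive locations (clamp keeps acos stable).
    theta_t = torch.acos(a_t.clamp(0.0, 1.0 - 1e-7))
    a_t_margin = torch.cos(theta_t + margin * pos_mask)

    # Student matches the margin-augmented positive map and the plain negative map.
    pos_loss = ((a_s - a_t_margin) * pos_mask).pow(2).sum(dim=1).mean()
    neg_loss = ((a_s - a_t) * (1.0 - pos_mask)).pow(2).sum(dim=1).mean()
    return pos_loss + neg_loss

# Example usage: teacher and student may differ in channel count, since the
# attention maps pool over channels; only spatial sizes must match.
f_t = torch.randn(8, 256, 8, 8)  # teacher features
f_s = torch.randn(8, 128, 8, 8)  # student features
loss = amd_loss(f_s, f_t)
```

In practice such a term would be added to the usual distillation objective (e.g., cross-entropy plus a soft-label term), with the margin and threshold treated as tunable hyperparameters.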
