
Embedded Knowledge Distillation in Depth-level Dynamic Neural Network

by Shuchang Lyu, et al.
Beihang University

In real applications, devices with different computational resources require networks of different depths (e.g., ResNet-18/34/50) that still achieve high accuracy. Existing strategies either design multiple networks (nets) and train them independently, or apply compression techniques (e.g., low-rank decomposition, pruning, teacher-to-student distillation) to derive a small net from a trained large model. These methods suffer either from the low accuracy of the small nets or from complicated training processes caused by dependence on an accompanying assistive large model. In this article, we propose an elegant Depth-level Dynamic Neural Network (DDNN) that integrates sub-nets of different depths but similar architectures. Instead of training individual nets with different depth configurations, we train a single DDNN that dynamically switches between different-depth sub-nets at runtime using one set of shared weight parameters. To improve the generalization of the sub-nets, we design an Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN that transfers semantic knowledge from the teacher (full) net to the multiple sub-nets. Specifically, the Kullback-Leibler divergence is introduced to constrain the posterior class probabilities of each sub-net to be consistent with those of the full net, and self-attention over same-resolution features at different depths is employed to drive richer feature representations in the sub-nets. We can thus obtain multiple high-accuracy sub-nets simultaneously in a DDNN via online knowledge distillation in each training iteration, without extra computation cost. Extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that sub-nets in a DDNN trained with EKD achieve better performance than depth-level pruning or individual training, while preserving the original performance of the full net.
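To make the KL-divergence term concrete, the following is a minimal sketch (not the authors' code) of a distillation objective of the kind the abstract describes: the sub-net's softened class posterior is pulled toward the full net's, and the result is combined with the usual cross-entropy on ground-truth labels. The function name, the `temperature`, and the `alpha` weighting are illustrative assumptions; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def ekd_distillation_loss(full_logits, sub_logits, labels,
                          temperature=4.0, alpha=0.5):
    """Sketch of an EKD-style objective: KL(sub || full) on softened
    posteriors plus cross-entropy on labels. Hyperparameters are
    illustrative, not taken from the paper."""
    # Soft targets from the full (teacher) net; detached so the KL term
    # only updates the sub-net branch, as in online distillation.
    soft_targets = F.softmax(full_logits.detach() / temperature, dim=1)
    log_probs = F.log_softmax(sub_logits / temperature, dim=1)
    # Standard temperature^2 scaling keeps gradient magnitudes comparable.
    kl = F.kl_div(log_probs, soft_targets,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(sub_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```

Because the teacher logits come from the full net of the same DDNN in the same forward pass, this loss can be evaluated for every sub-net in each training iteration without a separate pre-trained teacher model.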



