Born Again Neural Networks

05/12/2018
by Tommaso Furlanello, et al.

Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness without sacrificing too much performance relative to the teacher's. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
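To make the idea concrete, here is a minimal PyTorch-style sketch of one born-again generation, not the authors' exact recipe: the student shares the teacher's architecture and is trained on the usual label loss plus a soft-target term computed from the frozen teacher's logits. The helper names (make_model, distill_weight, temperature), the optimizer settings, and the CWTM-style weighting at the end are illustrative assumptions based on the abstract's description.

# A minimal sketch of born-again (self-)distillation, assuming PyTorch.
# Teacher and student share one architecture; the student is trained on the
# ground-truth loss plus a soft-target term from the frozen teacher's outputs.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=1.0, distill_weight=0.5):
    # Hard-label cross-entropy term.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence to the teacher's temperature-scaled distribution.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return (1.0 - distill_weight) * ce + distill_weight * kd

def born_again_generation(teacher, make_model, loader, epochs, device="cpu"):
    # Train one born-again student parameterized identically to the teacher.
    student = make_model().to(device)
    optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(inputs)
            loss = distillation_loss(student(inputs), teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student  # this student can serve as the teacher for the next generation

def cwtm_loss(student_logits, teacher_logits, labels):
    # Rough reading of the CWTM ablation: train on ground-truth labels only,
    # weighting each sample by the teacher's confidence in its top prediction,
    # with weights normalized over the mini-batch.
    max_conf = F.softmax(teacher_logits, dim=1).max(dim=1).values
    weights = max_conf / max_conf.sum()
    per_sample_ce = F.cross_entropy(student_logits, labels, reduction="none")
    return (weights * per_sample_ce).sum()

Iterating born_again_generation, each time using the previous student as the new teacher, gives the sequential setup the abstract describes; the DKPP ablation would instead permute the teacher's non-argmax outputs before computing the soft-target term.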


research
12/12/2022

Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization

Knowledge Distillation (KD) has been extensively used for natural langua...
research
07/12/2022

Knowledge Condensation Distillation

Knowledge Distillation (KD) transfers the knowledge from a high-capacity...
research
10/08/2022

Sparse Teachers Can Be Dense with Knowledge

Recent advances in distilling pretrained language models have discovered...
research
10/21/2021

Augmenting Knowledge Distillation With Peer-To-Peer Mutual Learning For Model Compression

Knowledge distillation (KD) is an effective model compression technique ...
research
07/18/2022

Learning Knowledge Representation with Meta Knowledge Distillation for Single Image Super-Resolution

Knowledge distillation (KD), which can efficiently transfer knowledge fr...
research
08/23/2019

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

Recent developments in NLP have been accompanied by large, expensive mod...
research
05/15/2018

Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

This paper studies teacher-student optimization on neural networks, i.e....
