Born Again Neural Networks

by Tommaso Furlanello et al.
University of Southern California
Carnegie Mellon University
California Institute of Technology

Knowledge distillation (KD) consists of transferring knowledge from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness without sacrificing too much of the teacher's performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on CIFAR-10 (3.5% validation error). Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes. We present experiments with students of various capacities, focusing on the under-explored case where students overpower teachers. Our experiments show significant advantages from transferring knowledge between DenseNets and ResNets in either direction.
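The distillation setup the abstract describes can be sketched as a standard KD loss: a cross-entropy term against temperature-softened teacher outputs (the "dark knowledge" over non-predicted classes) mixed with an ordinary hard-label term. This is a minimal NumPy illustration of the generic objective, not the paper's exact BAN training code; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Generic KD objective: alpha * soft-target term + (1 - alpha) * hard-label term.

    The soft-target term is the cross-entropy between the teacher's and
    student's temperature-softened distributions, so the student is also
    supervised on the teacher's mass over non-predicted classes.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    soft = -(p_teacher * np.log(p_student + 1e-12)).sum(axis=-1).mean()

    # Ordinary cross-entropy on the ground-truth labels (T = 1).
    p = softmax(student_logits)
    hard = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

    return alpha * soft + (1 - alpha) * hard
```

For a born-again student, `student_logits` would come from a network parameterized identically to the teacher; by Gibbs' inequality the soft-target term is minimized exactly when the student reproduces the teacher's softened distribution.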




