Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation

11/01/2022
by Cody Blakeney, et al.

Methods for improving the efficiency of deep network training (i.e., the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or to improve model quality, but it is unclear whether distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments on common enterprise hardware (8x NVIDIA A100) to investigate whether and how distillation can be used to accelerate training, using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE. We found that distillation can speed up training by up to 1.96x for ResNet-50 trained on ImageNet and up to 1.42x for BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal results when it is only performed for the first 20-50% of training. We also observed that training with distillation is almost always more efficient than training without it, even when using the poorest-quality model as a teacher, for both ResNet-50 and BERT. Finally, we found that it is possible to gain the benefit of distilling from an ensemble of teacher models, which has an O(n) runtime cost, by randomly sampling a single teacher from the pool of teacher models on each step, which has only an O(1) runtime cost. Taken together, these results show that distillation can substantially improve training efficiency in both image classification and language modeling, and that a few simple optimizations to distillation protocols can further enhance these efficiency improvements.
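As a concrete illustration of the per-step teacher sampling described above, here is a minimal PyTorch-style sketch. It is a hypothetical example, not the authors' implementation: the function and argument names (`student`, `teachers`, `alpha`, `temperature`) are illustrative assumptions, and the loss is the standard softened-softmax distillation objective. The key point is that only one teacher from the pool is queried on each step, so the per-step teacher cost stays O(1) regardless of ensemble size.

```python
# Hypothetical sketch of per-step random teacher sampling for distillation.
# One teacher is sampled from the pool each step (O(1) teacher forward passes
# per step, rather than O(n) for the full ensemble). Names and hyperparameters
# are illustrative assumptions, not the authors' code.
import random
import torch
import torch.nn.functional as F

def distillation_step(student, teachers, batch, optimizer,
                      alpha=0.5, temperature=4.0):
    """One training step: hard-label loss plus KD loss from one sampled teacher."""
    inputs, labels = batch

    teacher = random.choice(teachers)        # sample a single teacher per step
    with torch.no_grad():
        teacher_logits = teacher(inputs)     # no gradients through the teacher

    student_logits = student(inputs)
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soften both distributions with the temperature and match them via KL.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    loss = alpha * kd_loss + (1.0 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Over many steps the student receives supervision from every teacher in the pool in expectation, which is consistent with the finding above that sampling a single teacher per step retains the benefit of ensemble distillation at a fraction of the runtime cost.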

