Pipe-BD: Pipelined Parallel Blockwise Distillation

01/29/2023
by Hongsun Jang, et al.

Training large deep neural network models is highly challenging due to their tremendous computational and memory requirements. Blockwise distillation provides one promising path toward faster convergence by splitting a large model into multiple smaller models. In state-of-the-art blockwise distillation methods, training is performed block-by-block in a data-parallel manner using multiple GPUs. To produce inputs for the student blocks, the teacher model is executed from the beginning up to the block currently being trained. However, this incurs high overhead from redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively exploits pipeline parallelism for blockwise distillation, eliminating redundant teacher block execution and increasing the per-device batch size for better resource utilization. We also extend Pipe-BD to hybrid parallelism for efficient workload balancing. As a result, Pipe-BD achieves significant acceleration without modifying the mathematical formulation of blockwise distillation. We implement Pipe-BD in PyTorch, and experiments show that Pipe-BD is effective across multiple scenarios, models, and datasets.
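To make the redundancy the abstract describes concrete, below is a minimal PyTorch sketch of plain blockwise distillation (the baseline, not Pipe-BD itself), assuming a toy teacher split into simple MLP blocks. All names here (make_blocks, teacher_blocks, student_blocks) are illustrative and not taken from the Pipe-BD code base; the inner prefix loop is the repeated teacher execution that Pipe-BD's pipelining is designed to eliminate.

```python
# Minimal sketch of baseline blockwise distillation (hypothetical toy blocks).
import torch
import torch.nn as nn

def make_blocks(widths):
    """Build a list of simple MLP blocks; stands in for slices of a real model."""
    return nn.ModuleList(
        nn.Sequential(nn.Linear(w_in, w_out), nn.ReLU())
        for w_in, w_out in zip(widths[:-1], widths[1:])
    )

widths = [32, 64, 64, 64]                  # toy layer widths
teacher_blocks = make_blocks(widths)       # frozen teacher, split block-wise
student_blocks = make_blocks(widths)       # students would be smaller in practice
for p in teacher_blocks.parameters():
    p.requires_grad_(False)

criterion = nn.MSELoss()
optimizers = [torch.optim.SGD(s.parameters(), lr=1e-2) for s in student_blocks]

x = torch.randn(16, widths[0])             # one toy batch

# Baseline: to train student block k, the teacher is re-executed from block 0
# up to k-1 to produce that block's input, so early teacher blocks run many
# times. Pipe-BD removes this redundant prefix execution via pipelining.
for k, (t_blk, s_blk, opt) in enumerate(
        zip(teacher_blocks, student_blocks, optimizers)):
    with torch.no_grad():
        h = x
        for j in range(k):                 # redundant teacher prefix execution
            h = teacher_blocks[j](h)
        target = t_blk(h)                  # teacher output for block k
    opt.zero_grad()
    loss = criterion(s_blk(h), target)     # match the teacher block's output
    loss.backward()
    opt.step()
```

In the pipelined scheme, each teacher block's output would instead be forwarded once to the device training the corresponding student block, rather than recomputed from scratch for every block; the sketch only shows the baseline cost being removed.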

