VirtualFlow: Decoupling Deep Learning Model Execution from Underlying Hardware

09/20/2020 ∙ by Andrew Or, et al.

State-of-the-art deep learning systems tightly couple model execution with the underlying hardware. This coupling poses important challenges in a world where the scale of deep learning workloads is growing rapidly: workloads with high resource requirements are inaccessible to most users, experimentation on smaller test beds is impossible, and results are difficult to reproduce across different hardware. We propose VirtualFlow, a novel system approach leveraging virtual node processing to decouple model execution from the hardware. In each execution step, the batch is divided and processed with data parallelism on many virtual nodes instead of physical devices (GPUs, TPUs), and the gradients are aggregated and applied to the model after all virtual nodes finish processing their data. With multiple virtual nodes mapped to each device, the system allows users to run models at much larger batch sizes that would otherwise exceed the memory limits of the underlying physical resources. VirtualFlow significantly improves model training reproducibility across different hardware, and enables models to run on shared clusters with dynamically changing resources for better efficiency. Our implementation of VirtualFlow enables virtual node processing with elasticity for TensorFlow. Evaluation with representative deep learning models (ResNet, BERT, Transformer) demonstrates strong convergence guarantees on different hardware with out-of-the-box hyperparameters, and up to 48% lower job completion times with resource elasticity.
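
The execution step described above can be illustrated with a minimal NumPy sketch. This is not VirtualFlow's TensorFlow implementation; it is a toy linear model under stated assumptions: the names `grad_sum` and `virtual_node_step`, the round-robin mapping of virtual nodes to devices, and the mean-squared-error loss are all hypothetical choices made for illustration. The point it shows is that the update depends only on the number of virtual nodes and the batch, not on how many physical devices those virtual nodes are mapped to.

```python
import numpy as np


def grad_sum(w, X, y):
    """Gradient of the squared error of a linear model, summed over examples."""
    residual = X @ w - y
    return 2.0 * X.T @ residual


def virtual_node_step(w, X, y, num_virtual_nodes, num_devices, lr=0.01):
    """One training step with the batch split across virtual nodes.

    Assumption (illustrative only): virtual nodes are assigned to devices
    round-robin, and each device processes its virtual nodes one at a time,
    accumulating their gradients in a local buffer.
    """
    # Divide the batch into one shard per virtual node.
    shards_X = np.array_split(X, num_virtual_nodes)
    shards_y = np.array_split(y, num_virtual_nodes)

    # One gradient buffer per (simulated) physical device.
    device_buffers = [np.zeros_like(w) for _ in range(num_devices)]
    for vn, (Xs, ys) in enumerate(zip(shards_X, shards_y)):
        device = vn % num_devices
        device_buffers[device] += grad_sum(w, Xs, ys)

    # Aggregate across devices only after every virtual node has finished,
    # then apply a single update to the model.
    total_grad = sum(device_buffers) / len(X)
    return w - lr * total_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(64, 5))
    y = rng.normal(size=64)
    w0 = np.zeros(5)

    # Same 8 virtual nodes, mapped onto 4 devices and onto a single device.
    w_four = virtual_node_step(w0, X, y, num_virtual_nodes=8, num_devices=4)
    w_one = virtual_node_step(w0, X, y, num_virtual_nodes=8, num_devices=1)
    print(np.allclose(w_four, w_one))
```

Because each device only holds one virtual node's shard at a time in this sketch, the peak per-device memory scales with the shard size rather than the full batch size, which is the intuition behind running larger batches than a device could otherwise fit.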


