Automatic Configuration for Optimal Communication Scheduling in DNN Training

12/27/2021
by Yiqing Ma, et al.

ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of its hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler adopts Bayesian Optimization (BO) to find the optimal hyper-parameter configuration beforehand. In practice, however, various runtime factors (e.g., worker node status and network conditions) change over time, making the statically determined, one-shot configuration suboptimal for real-world DNN training. To address this problem, we present AutoByte, a real-time configuration method that automatically searches for the optimal hyper-parameters in a timely manner as the training system dynamically changes. AutoByte extends the ByteScheduler framework with a meta-network, which takes the system's runtime statistics as input and outputs speedup predictions under specific configurations. Evaluation results on various DNN models show that AutoByte can dynamically tune the hyper-parameters with low resource usage and deliver up to 33.2% higher performance than the best static configuration in ByteScheduler.
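To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of the meta-network idea: a small regressor maps the current runtime statistics plus a candidate (partition size, credit size) pair to a predicted speedup, and the tuner picks the highest-scoring candidate each monitoring interval. Everything here is an illustrative assumption: the statistics dimension, the SpeedupPredictor architecture, and the candidate grids are hypothetical, and in practice the network would be trained on profiled speedups rather than used with random weights.

```python
# Hypothetical sketch of a meta-network-style tuner; not AutoByte's actual code.
import itertools
import torch
import torch.nn as nn

RUNTIME_STAT_DIM = 8  # assumed: e.g., bandwidth, GPU utilization, queue lengths


class SpeedupPredictor(nn.Module):
    """Small MLP: [runtime stats ; partition size ; credit size] -> predicted speedup."""

    def __init__(self, stat_dim: int = RUNTIME_STAT_DIM, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(stat_dim + 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, stats: torch.Tensor, config: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([stats, config], dim=-1)).squeeze(-1)


def best_config(model, stats, partition_sizes, credit_sizes):
    """Score every (partition, credit) pair under the current runtime statistics."""
    candidates = list(itertools.product(partition_sizes, credit_sizes))
    cfg = torch.tensor(candidates, dtype=torch.float32)
    cfg = cfg.log2()  # log-scale byte counts so the MLP sees well-conditioned inputs
    batch_stats = stats.unsqueeze(0).expand(cfg.size(0), -1)
    with torch.no_grad():
        scores = model(batch_stats, cfg)
    return candidates[int(scores.argmax())]


# Example: re-tune as runtime conditions drift (stats here are placeholders).
model = SpeedupPredictor()
stats = torch.rand(RUNTIME_STAT_DIM)
partition, credit = best_config(
    model, stats,
    partition_sizes=[2**18, 2**20, 2**22],  # bytes, hypothetical grid
    credit_sizes=[2**20, 2**22, 2**24],     # bytes, hypothetical grid
)
print(f"partition={partition} bytes, credit={credit} bytes")
```

Because the candidate grid is small, scoring all configurations is a single cheap forward pass, which is consistent with the paper's claim of low-overhead runtime tuning compared with re-running Bayesian Optimization.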

