Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

02/03/2021
by Shang Wang, et al.

Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. To reduce this training cost and optimize cluster-wide hardware resource usage, we analyze GPU cluster usage statistics from a well-known research institute. Our study reveals that single-accelerator training jobs can dominate cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning), while severely underutilizing the hardware. This is because DL researchers and practitioners often lack the expertise required to independently optimize their own workloads. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models across jobs often share the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively and easily improve the hardware utilization of their novel DL training workloads, we propose the Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and then trains them simultaneously on a shared accelerator. On three emerging DL training workloads and state-of-the-art accelerators (GPUs and TPUs), HFTA demonstrates strong effectiveness in squeezing out hardware utilization and achieves up to 15.1× higher training throughput vs. the standard practice of running each job on a separate accelerator.
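To make the key observation concrete, here is a minimal, hypothetical sketch in PyTorch (not HFTA's actual API, and the shapes are illustrative assumptions): K same-shaped Linear layers belonging to K independent jobs can be horizontally fused into a single batched matmul (torch.baddbmm), which is an already well-optimized operator, while remaining mathematically equivalent to running the K layers separately.

```python
# Illustrative sketch only: horizontal fusion of K same-shaped Linear layers
# (one per repetitive job) into a single batched matmul. This is NOT HFTA's
# actual API; shapes and names below are hypothetical.
import torch

K, B, D_in, D_out = 4, 32, 256, 128   # K jobs, batch size B per job

# Per-job parameters and inputs, stacked along a leading "job" dimension.
weights = torch.randn(K, D_in, D_out)
biases = torch.randn(K, 1, D_out)
inputs = torch.randn(K, B, D_in)

# Unfused baseline: each job's Linear layer runs as a separate (small) matmul.
unfused = torch.stack([inputs[k] @ weights[k] + biases[k] for k in range(K)])

# Horizontally fused: one batched matmul (plus broadcast bias) covers all K jobs.
fused = torch.baddbmm(biases, inputs, weights)

# Mathematically equivalent, up to floating-point rounding.
assert torch.allclose(unfused, fused, atol=1e-4)
```

The fused version launches one large, well-optimized kernel instead of K small ones, which is the mechanism by which sharing a single accelerator across repetitive jobs can recover otherwise wasted hardware utilization.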
