Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching

05/20/2023
by Sahil Tyagi, et al.

Current techniques and systems for distributed model training mostly assume that clusters comprise homogeneous servers with constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times across all workers, and facilitates training on clusters built from servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than 4x on heterogeneous clusters.
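To make the controller concrete, below is a minimal Python sketch of a proportional mini-batch controller in the spirit of the abstract: each worker's batch size is nudged toward the size that would match the cluster-mean iteration time, with a proportional gain damping oscillation as in a PID loop. The function name adjust_batches, the gain k_p, and the linear timing model in the demo are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a proportional mini-batch controller (illustrative,
# not the paper's implementation). Each worker's batch size moves a
# fraction k_p of the way toward the size that would match the
# cluster-mean iteration time.

def adjust_batches(batch_sizes, iter_times, k_p=0.5, min_batch=1):
    """Rescale per-worker mini-batch sizes toward equal iteration times."""
    target = sum(iter_times) / len(iter_times)  # aim for the mean time
    new_sizes = []
    for b, t in zip(batch_sizes, iter_times):
        # Throughput is roughly b / t, so the batch size that would hit
        # the target time is b * target / t; move k_p of the way there.
        desired = b * target / t
        new_sizes.append(max(min_batch, round(b + k_p * (desired - b))))
    return new_sizes

# Toy demo: one fast GPU worker and two slower workers. In real training
# the iteration times would be re-measured each round; here we assume
# (purely for illustration) that time scales linearly with batch size.
base_sizes = [256, 256, 256]
base_times = [0.8, 1.6, 2.0]  # seconds per iteration at base_sizes
sizes, times = list(base_sizes), list(base_times)
for _ in range(10):
    sizes = adjust_batches(sizes, times)
    times = [t0 * s / s0 for t0, s0, s in zip(base_times, base_sizes, sizes)]
print(sizes)  # converges to roughly [469, 235, 188]: equal iteration times
```

Because the target here is the mean iteration time, the global batch size can drift as per-worker sizes are rescaled; a fuller controller would presumably renormalize the per-worker sizes so that the global batch size, and hence statistical efficiency, stays fixed.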
