DeepAI AI Chat
Log In Sign Up

Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching

05/20/2023
by   Sahil Tyagi, et al.
0

Current techniques and systems for distributed model training mostly assume that clusters are comprised of homogeneous servers with a constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure, and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch sizes on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times on all workers, and facilitates training on clusters comprised of servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than 4x on heterogeneous clusters.

READ FULL TEXT

page 1

page 2

page 3

page 4

02/28/2019

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as ...
02/09/2016

Nested Mini-Batch K-Means

A new algorithm is proposed which accelerates the mini-batch k-means alg...
03/12/2018

High Throughput Synchronous Distributed Stochastic Gradient Descent

We introduce a new, high-throughput, synchronous, distributed, data-para...
03/12/2023

Scavenger: A Cloud Service for Optimizing Cost and Performance of ML Training

While the pay-as-you-go nature of cloud virtual machines (VMs) makes it ...
01/25/2023

On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems

The performance of neural network-based speech enhancement systems is pr...
01/03/2014

A Framework for Creating a Distributed Rendering Environment on the Compute Clusters

This paper discusses the deployment of existing render farm manager in a...