Addressing Algorithmic Bottlenecks in Elastic Machine Learning with Chicle

09/11/2019
by Michael Kaufmann, et al.

Distributed machine learning training is one of the most common and important workloads running in data centers today, but it rarely runs alone. Instead, to reduce costs, computing resources are consolidated and shared by different applications. In this scenario, elasticity and proper load balancing are vital to maximize efficiency, fairness, and utilization. Most distributed training frameworks currently do not support these properties; the few that do support elasticity imitate generic distributed frameworks by relying on micro-tasks. In this paper we show that micro-tasks are problematic for machine learning applications, because they require a high degree of parallelism that hinders the convergence of distributed training at a purely algorithmic level (i.e., even ignoring overheads and scalability limitations). To address this, we propose Chicle, a new elastic distributed training framework that exploits the nature of machine learning algorithms to implement elasticity and load balancing without micro-tasks. We use Chicle to train deep neural networks as well as generalized linear models, and show that it achieves performance competitive with state-of-the-art rigid frameworks while efficiently enabling elastic execution and dynamic load balancing.
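To see why a high degree of parallelism can hinder convergence at a purely algorithmic level, consider synchronous data-parallel SGD: averaging the gradients of P micro-task workers is equivalent to a single worker with a P-times larger mini-batch, so the model takes P-times fewer update steps per epoch. The following minimal Python sketch is our own illustration of this effect, not Chicle's code or the paper's experiment; the problem sizes, learning rate, and the train helper are assumptions chosen for demonstration on a simple least-squares problem.

# Illustrative sketch (not Chicle's implementation): simulates synchronous
# data-parallel SGD, where the averaged gradients of `parallelism` micro-task
# workers act like one update with a proportionally larger mini-batch.
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 32
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

def train(parallelism, epochs=5, base_batch=8, lr=0.01):
    """Run SGD; higher parallelism means fewer model updates per epoch."""
    w = np.zeros(d)
    batch = base_batch * parallelism  # effective batch grows with parallelism
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            b = idx[start:start + batch]
            # Gradient of the mean squared error over the mini-batch.
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
    return np.mean((X @ w - y) ** 2)

for p in (1, 8, 64):
    print(f"parallelism={p:3d}  final MSE={train(p):.4f}")

Running the sketch shows the final loss growing with the simulated degree of parallelism, since the same number of epochs yields far fewer model updates; this is the kind of algorithmic convergence penalty the paper attributes to micro-task-based elasticity.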


