TonY: An Orchestrator for Distributed Machine Learning Jobs

03/24/2019
by Anthony Hsu, et al.

Training machine learning (ML) models on large datasets requires considerable computing power. To speed up training, it is typical to distribute training across several machines, often with specialized hardware like GPUs or TPUs. Managing a distributed training job is complex and requires dealing with resource contention, distributed configurations, monitoring, and fault tolerance. In this paper, we describe TonY, an open-source orchestrator for distributed ML jobs built at LinkedIn to address these challenges.
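One of the "distributed configurations" challenges mentioned above is telling each process where its peers are and what role it plays. As an illustration only (not TonY's actual code), the sketch below assumes the TF_CONFIG JSON convention used by distributed TensorFlow, which an orchestrator typically injects into each container's environment before training starts; the exact mechanism TonY uses may differ.

```python
# Illustrative sketch: a worker reading a TF_CONFIG-style cluster spec
# that an orchestrator is assumed to have placed in its environment.
import json
import os

def read_cluster_spec():
    """Parse the cluster layout and this process's role from TF_CONFIG.

    Example value an orchestrator might set (assumed format):
    {"cluster": {"worker": ["host1:2222", "host2:2222"],
                 "ps": ["host3:2222"]},
     "task": {"type": "worker", "index": 0}}
    """
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {"type": "worker", "index": 0})
    return cluster, task["type"], task["index"]

if __name__ == "__main__":
    cluster, task_type, task_index = read_cluster_spec()
    print(f"Running as {task_type}:{task_index} in cluster {cluster}")
```

Centralizing this bookkeeping in an orchestrator is what frees users from hand-writing host lists and role indices for every run.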

