Towards Self-Tuning Parameter Servers

10/06/2018
by Chris Liu, et al.

In recent years, advances in many applications have been driven by the use of Machine Learning (ML). Nowadays, it is common to see industrial-strength machine learning jobs that involve millions of model parameters, terabytes of training data, and weeks of training. Good efficiency, i.e., fast completion time of a specific ML job, is therefore a key feature of a successful ML system. While the completion time of a long-running ML job is determined by the time required to reach model convergence, in practice it is also largely influenced by the values of various system settings. In this paper, we contribute techniques towards building self-tuning parameter servers. Parameter Server (PS) is a popular system architecture for large-scale machine learning systems; and by self-tuning we mean that while a long-running ML job is iteratively training the expert-suggested model, the system itself is also iteratively learning which system settings are more efficient for that job and applying them online. While our techniques are general enough to apply to various PS-style ML systems, we have prototyped our techniques on top of TensorFlow. Experiments show that our techniques can reduce the completion times of a variety of long-running TensorFlow jobs by 1.4x to 18x.
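To make the self-tuning idea concrete, here is a minimal sketch of one way such an online tuning loop could work: an epsilon-greedy search that interleaves short training spans under alternative system settings with spans under the best setting observed so far, measuring throughput as it goes. The candidate settings, function names, and bandit strategy below are illustrative assumptions for exposition, not the paper's actual design.

```python
import random
import time

# Hypothetical candidate system settings for a PS-style job
# (illustrative only; the paper's real search space is not given here).
CANDIDATE_SETTINGS = [
    {"ps_shards": 2, "staleness_bound": 0},
    {"ps_shards": 4, "staleness_bound": 2},
    {"ps_shards": 8, "staleness_bound": 4},
]

def run_training_span(setting, num_steps=100):
    """Run num_steps training iterations under `setting` and return the
    observed throughput (steps/sec). Stub: a real system would apply the
    setting to the parameter server and run actual push/pull + SGD steps."""
    start = time.time()
    for _ in range(num_steps):
        pass  # placeholder for one gradient step against the PS
    return num_steps / max(time.time() - start, 1e-9)

def self_tune(total_spans=50, epsilon=0.2):
    """Epsilon-greedy online tuner: the job keeps making training progress
    every span, while the tuner learns which setting runs fastest."""
    throughput = {i: None for i in range(len(CANDIDATE_SETTINGS))}
    for span in range(total_spans):
        if span < len(CANDIDATE_SETTINGS):
            idx = span  # try every setting once first
        elif random.random() < epsilon:
            idx = random.randrange(len(CANDIDATE_SETTINGS))  # explore
        else:
            idx = max(throughput, key=lambda i: throughput[i])  # exploit
        observed = run_training_span(CANDIDATE_SETTINGS[idx])
        # Exponential moving average smooths noisy throughput measurements.
        prev = throughput[idx]
        throughput[idx] = observed if prev is None else 0.7 * prev + 0.3 * observed
    return CANDIDATE_SETTINGS[max(throughput, key=lambda i: throughput[i])]

if __name__ == "__main__":
    print("best setting found:", self_tune())
```

The key design point this sketch illustrates is that tuning happens inside the job's own training loop, so no iterations are wasted on offline profiling; every span still advances the model while informing the setting choice.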
