PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters

10/01/2020
by Isabelly Rocha, et al.

DNN learning jobs are common in today's clusters due to advances in AI-driven services such as machine translation and image recognition. The phase of these jobs most critical to model performance and learning cost is the tuning of hyperparameters. Existing approaches use techniques such as early-stopping criteria to reduce the impact of tuning on learning cost. However, these strategies do not consider the impact that certain hyperparameters and system parameters have on training time. This paper presents PipeTune, a framework for DNN learning jobs that addresses the trade-offs between these two types of parameters. PipeTune takes advantage of the high parallelism and recurring characteristics of such jobs to minimize the learning cost via a pipelined, simultaneous tuning of both hyper and system parameters. Our experimental evaluation using three different types of workloads indicates that PipeTune achieves up to a 22.6% reduction in tuning time, together with speed-ups in training time. PipeTune not only improves performance but also lowers energy consumption by up to 29%.
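The core idea described above can be illustrated with a toy sketch (this is not the authors' implementation; the parameter grids and the analytic cost model are assumptions made purely for illustration). Each hyperparameter trial starts with a default system configuration, and system parameters are re-tuned in a pipelined fashion from epoch timings observed while the trial is already running:

```python
# Toy sketch of pipelined hyper- and system-parameter tuning.
# Assumed for illustration: the grids below and the epoch_time() cost model.

HYPERPARAMS = [{"lr": lr, "batch": b} for lr in (0.1, 0.01) for b in (32, 64)]
SYSPARAMS = [{"cores": c} for c in (2, 4, 8)]


def epoch_time(hp, sp):
    """Toy cost model: larger batches and more cores shorten an epoch."""
    return 100.0 / (sp["cores"] * (hp["batch"] / 32))


def tune(num_epochs=4):
    """Return (total_time, hyperparams, sysparams) of the cheapest trial."""
    best = None
    for hp in HYPERPARAMS:          # outer loop: hyperparameter trials
        sp = SYSPARAMS[0]           # each trial starts with a default config
        total = 0.0
        for _ in range(num_epochs):
            total += epoch_time(hp, sp)
            # Pipelined system tuning: while the trial progresses, switch
            # to the system config that minimizes the observed epoch time,
            # instead of running a separate tuning pass up front.
            sp = min(SYSPARAMS, key=lambda s: epoch_time(hp, s))
        if best is None or total < best[0]:
            best = (total, hp, sp)
    return best


print(tune())
```

In a real cluster the inner adjustment would be driven by measured epoch times rather than a closed-form model, and recurring jobs could reuse previously discovered system configurations to skip the default-config warm-up epoch.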


