Parallax: Automatic Data-Parallel Training of Deep Neural Networks

08/08/2018
by   Soojeong Kim, et al.

The use of high-performance servers and GPU accelerators for training deep neural network models has greatly accelerated recent advances in machine learning (ML). ML frameworks such as TensorFlow, MXNet, and Caffe2 have emerged to help ML researchers train their models in a distributed fashion. However, correctly and efficiently utilizing multiple machines and GPUs is still not straightforward for framework users because of the non-trivial correctness and performance challenges that arise during distribution. This paper introduces Parallax, a tool for automatic parallelization of deep learning training in distributed environments. Parallax not only handles the subtle correctness issues but also applies various optimizations to minimize the communication overhead caused by scaling out. Experiments show that Parallax, built atop TensorFlow, achieves scalable training throughput on multiple CNN and RNN models while requiring little effort from its users.
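For readers unfamiliar with the pattern the paper automates, the sketch below shows synchronous data-parallel training in TensorFlow: the model is replicated across the visible GPUs and per-replica gradients are aggregated before each update. It uses TensorFlow's public tf.distribute.MirroredStrategy API, not Parallax itself; the model, dataset, and hyperparameters are placeholders chosen only for illustration.

```python
# Minimal sketch of synchronous data-parallel training (assumed example,
# not Parallax's API). Model, dataset, and hyperparameters are placeholders.
import tensorflow as tf

# Replicate the model on every visible GPU; per-replica gradients are
# aggregated (all-reduced) before the optimizer applies each update.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# The global batch is split across replicas, so each GPU processes a shard.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(256)

model.fit(dataset, epochs=1)
```

The sketch only illustrates the synchronization pattern; Parallax's contribution, per the abstract, is to set up such a distributed execution automatically from the user's model description while handling the correctness issues and minimizing the communication overhead of scaling out.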


Related research

11/05/2018  Workload-aware Automatic Parallelization for Multi-GPU DNN Training
Deep neural networks (DNNs) have emerged as successful solutions for var...

12/19/2015  Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines
Deep learning (DL) has achieved notable successes in many machine learni...

09/05/2023  Comparative Analysis of CPU and GPU Profiling for Deep Learning Models
Deep Learning (DL) and Machine Learning (ML) applications are rapidly incr...

07/08/2020  Distributed Training of Deep Learning Models: A Taxonomic Perspective
Distributed deep learning systems (DDLS) train deep neural network model...

11/07/2020  Exploring the limits of Concurrency in ML Training on Google TPUs
Recent results in language understanding using neural networks have requ...

12/30/2013  Consistent Bounded-Asynchronous Parameter Servers for Distributed ML
In distributed ML applications, shared parameters are usually replicated...

11/12/2020  Utilizing Ensemble Learning for Performance and Power Modeling and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks
Machine learning (ML) continues to grow in importance across nearly all ...
