HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

11/20/2021
by Ji Liu, et al.

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. In addition, heterogeneous computing resources, e.g., CPUs and GPUs of multiple types, are available for the distributed training process. Thus, the scheduling of multiple layers onto diverse computing resources is critical for the training process. To efficiently train a DNN model using heterogeneous computing resources, we propose a distributed framework, i.e., Paddle-Heterogeneous Parameter Server (Paddle-HeterPS), composed of a distributed architecture and a Reinforcement Learning (RL)-based scheduling method. The advantages of Paddle-HeterPS are three-fold compared with existing frameworks. First, Paddle-HeterPS enables efficient training of diverse workloads with heterogeneous computing resources. Second, Paddle-HeterPS exploits an RL-based method to schedule the workload of each layer to appropriate computing resources so as to minimize the monetary cost while satisfying throughput constraints. Third, Paddle-HeterPS manages data storage and data communication among distributed computing resources. We carry out extensive experiments to show that Paddle-HeterPS significantly outperforms state-of-the-art approaches in terms of throughput (14.5 times higher) and monetary cost (312.3% smaller). The code of the framework is publicly available at: https://github.com/PaddlePaddle/Paddle.
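The scheduling problem described above, assigning each layer to a device type so that monetary cost is minimized while a throughput target is met, can be illustrated with a toy example. The sketch below is not the Paddle-HeterPS implementation; it uses a simple epsilon-greedy bandit over made-up per-layer throughput/cost profiles (LAYER_PROFILES, THROUGHPUT_TARGET, and all numbers are hypothetical) to show how an RL-style policy can learn to keep IO-bound sparse layers on cheap CPUs and move compute-bound layers to GPUs.

```python
"""Illustrative sketch only (not the Paddle-HeterPS scheduler): an
epsilon-greedy bandit assigns each DNN layer to a device type, rewarding
low monetary cost and penalising assignments that miss a throughput
target. All profiles and prices are made-up numbers."""
import random

# Hypothetical per-layer profiles: (throughput in samples/s, $/hour) per device.
LAYER_PROFILES = {
    "embedding": {"cpu": (5000, 0.10), "gpu": (6000, 0.90)},  # IO-bound, sparse
    "fc_1":      {"cpu": (1200, 0.10), "gpu": (9000, 0.90)},  # compute-bound
    "fc_2":      {"cpu": (1500, 0.10), "gpu": (9500, 0.90)},
}
THROUGHPUT_TARGET = 4000  # required samples/s per layer (assumed)


def reward(layer, device):
    """Negative cost if the throughput target is met, else a large penalty."""
    throughput, cost = LAYER_PROFILES[layer][device]
    return -cost if throughput >= THROUGHPUT_TARGET else -100.0


def schedule(episodes=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    devices = ["cpu", "gpu"]
    # Running average reward estimate per (layer, device) pair.
    q = {l: {d: 0.0 for d in devices} for l in LAYER_PROFILES}
    n = {l: {d: 0 for d in devices} for l in LAYER_PROFILES}
    for _ in range(episodes):
        for layer in LAYER_PROFILES:
            if rng.random() < epsilon:
                device = rng.choice(devices)                       # explore
            else:
                device = max(devices, key=lambda d: q[layer][d])   # exploit
            r = reward(layer, device)
            n[layer][device] += 1
            q[layer][device] += (r - q[layer][device]) / n[layer][device]
    # Final placement: best estimated device per layer.
    return {l: max(devices, key=lambda d: q[l][d]) for l in LAYER_PROFILES}


if __name__ == "__main__":
    print(schedule())  # e.g. {'embedding': 'cpu', 'fc_1': 'gpu', 'fc_2': 'gpu'}
```

In the setting described by the paper, the reward would come from profiled runtimes and prices on real CPU/GPU clusters rather than hand-written tables, and the learned policy would also account for data communication among the distributed resources.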


Related research

- Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources (07/14/2022). Although more layers and more parameters generally improve the accuracy ...
- Gegelati: Lightweight Artificial Intelligence through Generic and Evolvable Tangled Program Graphs (12/15/2020). Tangled Program Graph (TPG) is a reinforcement learning technique based ...
- Reinforcement Learning in Computing and Network Convergence Orchestration (09/22/2022). As computing power is becoming the core productivity of the digital econ...
- DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms (07/31/2023). Datacenters are increasingly becoming heterogeneous, and are starting to...
- A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning (04/11/2021). Recently, Deep Neural Networks (DNNs) have recorded great success in han...
- Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models (09/24/2021). Deep Neural Networks have gained significant attraction due to their wid...
- HLC2: a highly efficient cross-matching framework for large astronomical catalogues on heterogeneous computing environments (01/18/2023). Cross-matching operation, which is to find corresponding data for the sa...