End-to-end Adaptive Distributed Training on PaddlePaddle

by Yulong Ao, et al.

Distributed training has become a pervasive and effective approach for training a large neural network (NN) model on massive data. However, it is very challenging to satisfy the requirements of various NN models, diverse computing resources, and their dynamic changes during a training job. In this study, we design our distributed training framework from a systematic end-to-end view to provide built-in adaptive ability for different scenarios, especially industrial applications and production environments, by fully considering resource allocation, model partition, task placement, and distributed execution. Based on the unified distributed graph and the unified cluster object, our adaptive framework is equipped with a global cost model and a global planner, which enable arbitrary parallelism, resource-aware placement, multi-mode execution, and fault-tolerant and elastic distributed training. The experiments demonstrate that our framework can satisfy various requirements arising from the diversity of applications and the heterogeneity of resources with highly competitive performance. The ERNIE language model with 260 billion parameters is efficiently trained on thousands of AI processors with 91.7% weak scalability. The throughput of the model from the recommender system, when employing heterogeneous pipeline asynchronous execution, can be increased to up to 2.1 times and 3.3 times that of GPU-only and CPU-only training, respectively. Moreover, fault-tolerant and elastic distributed training has been successfully applied to online industrial applications, giving a reduction of 34.49% in the number of failed long-term training jobs and an increase of 33.91% in global scheduling efficiency in the production environment.
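To make the idea of a global cost model driving a global planner concrete, here is a minimal Python sketch. It is not the framework's implementation: the function names, the cost formula, and every coefficient are hypothetical, chosen only to illustrate how a planner might enumerate parallel strategies and pick the cheapest one under an estimated cost.

```python
from itertools import product


def factorizations(num_devices):
    """Yield every (data, model, pipeline) degree triple whose product is num_devices."""
    for dp, mp in product(range(1, num_devices + 1), repeat=2):
        if num_devices % (dp * mp) == 0:
            yield dp, mp, num_devices // (dp * mp)


def estimated_cost(dp, mp, pp, compute=1.0, comm=0.05, micro_batches=8):
    """Toy per-step cost estimate (hypothetical coefficients, not the paper's model)."""
    # Ideal compute time if work splits evenly over all devices.
    compute_time = compute / (dp * mp * pp)
    # Idle time from pipeline fill and drain, amortized over micro-batches.
    bubble = compute_time * (pp - 1) / micro_batches
    # Crude communication terms: gradient all-reduce plus model-parallel traffic.
    comm_time = comm * ((dp - 1) / dp + (mp - 1))
    return compute_time + bubble + comm_time


def plan(num_devices):
    """Return the strategy triple with the lowest estimated cost."""
    return min(factorizations(num_devices), key=lambda s: estimated_cost(*s))


if __name__ == "__main__":
    dp, mp, pp = plan(8)
    print(f"data={dp} model={mp} pipeline={pp}")
```

A real planner would also need to account for the cluster topology, memory limits, and heterogeneous devices, which this sketch deliberately ignores.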

