Blink: Fast and Generic Collectives for Distributed ML

10/11/2019
by Guanhua Wang, et al.

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that, compared to the state of the art (NCCL), Blink can achieve up to 8x faster model synchronization and reduce end-to-end training time for image classification tasks by up to 40%.
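
To make the idea of packing spanning trees concrete, the following is a minimal sketch under stated assumptions, not Blink's actual algorithm. It models the GPU topology as a directed graph whose links carry integer capacities and greedily extracts spanning trees rooted at one GPU until no further tree fits. The function names (`find_spanning_tree`, `greedy_pack_trees`), the capacity model, and the example topology are all hypothetical. A greedy packing like this is generally not optimal, which is part of why a purpose-built packing method is needed.

```python
from collections import deque
from itertools import permutations


def find_spanning_tree(capacity, root, nodes):
    """BFS from root over directed links that still have capacity.

    Returns the tree as a list of (parent, child) edges, or None if
    some node cannot be reached with the remaining capacity.
    """
    visited = {root}
    queue = deque([root])
    edges = []
    while queue:
        u = queue.popleft()
        for v in nodes:
            if v not in visited and capacity.get((u, v), 0) > 0:
                visited.add(v)
                edges.append((u, v))
                queue.append(v)
    return edges if len(visited) == len(nodes) else None


def greedy_pack_trees(links, root):
    """Greedily pack spanning trees rooted at `root` (a heuristic).

    `links` maps a directed link (src, dst) to an integer capacity.
    Each packed tree consumes one unit of capacity on every link it uses.
    """
    capacity = dict(links)  # copy so the caller's topology is untouched
    nodes = sorted({n for link in links for n in link})
    trees = []
    while True:
        tree = find_spanning_tree(capacity, root, nodes)
        if tree is None:
            break
        for link in tree:
            capacity[link] -= 1
        trees.append(tree)
    return trees


# Hypothetical topology: 4 fully connected GPUs, 2 capacity units per directed link.
example_links = {(a, b): 2 for a, b in permutations(range(4), 2)}
for i, tree in enumerate(greedy_pack_trees(example_links, root=0)):
    print(f"tree {i}: {tree}")
```

Each packed tree could carry a separate chunk of the data being broadcast from the root, so more trees means more links used in parallel. Because the greedy pass always picks the first tree that breadth-first search finds, it can exhaust the root's outgoing links early and miss packings that use the remaining links.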

research
11/08/2021

Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

Large ML models and datasets have necessitated the use of multi-GPU syst...
research
11/16/2019

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) ...
research
09/17/2019

Heterogeneity-Aware Asynchronous Decentralized Training

Distributed deep learning training usually adopts All-Reduce as the sync...
research
04/12/2021

Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)

With increasing data and model complexities, the time required to train ...
research
01/17/2019

Accelerated Training for CNN Distributed Deep Learning through Automatic Resource-Aware Layer Placement

The Convolutional Neural Network (CNN) model, often used for image class...
research
05/10/2019

Priority-based Parameter Propagation for Distributed DNN Training

Data parallel training is widely used for scaling distributed deep neura...
research
03/30/2023

Enabling Cost-Benefit Analysis of Data Sync Protocols

The problem of data synchronization arises in networked applications tha...
