MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster

by   Xin Chen, et al.

In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. Nodes are configured with different specifications according to their roles in the system. Based on the topology of the system's network and the properties of the different node types, we develop and implement a novel job server parallel software framework, named "MiMatrix", for distributed deep learning training. In contrast to the parameter server framework, in which the parameter server is a bottleneck for data transfer in the AllReduce step of SSGD, the job server handles all controlling, scheduling, and monitoring tasks without transferring model data. In MiMatrix, we propose a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers, in which both computation and handshake messages are O(1) at each epoch.




1 Introduction

In recent years, deep Convolutional Neural Networks (CNNs) have prevailed in both academia and industry lecun2015deep ; schmidhuber2015deep ; liu2017survey . CNN models have not only outperformed most traditional machine learning and pattern recognition techniques in fields such as computer vision krizhevsky2012imagenet ; he2016deep ; bengio2009learning and speech recognition hinton2012deep ; amodei2016deep , but have also ranked number one in the game of Go silver2016mastering ; silver2017mastering . With the increasing size of training datasets russakovsky2015imagenet ; cocodatset and the complexity of deep learning (DL) models he2016deep ; simonyan2014very ; huang2017densely , DL training has become one of the most computationally demanding high performance computing (HPC) applications. Consequently, building powerful computing platforms, in both hardware and software, has become a hot research field. In this paper, we present our ad hoc solution: a co-designed peta-scale distributed heterogeneous system comprising an affordable GPU cluster costing less than 1 million dollars, named "Manoa", and a novel job server software framework, named "MiMatrix", designed around the properties of Manoa. It has two further advantages: 1) high density, with all nodes and switches installed in two 48U racks; and 2) a high convergence-speed-to-price ratio.

Compared to other gradient descent optimization algorithms ruder2016overview 111In this paper, Asynchronous Stochastic Gradient Descent (ASGD) and SSGD are categorized as two different optimization algorithms., Synchronous Stochastic Gradient Descent (SSGD) has two major advantages: 1) it obtains the highest accuracy in most cases zhang2018theory ; and 2) it is guaranteed to learn a function in polynomial time daniely2017sgd ; revistingSGD . In particular, most low-resource methods such as distilling distilling and pruning han2015learning ; Laine2017 rely heavily on the accuracy of pre-trained models. With this in mind, this paper targets designing and developing a feasible solution for DL training based on SSGD.

Before introducing our work, we briefly revisit three major challenges in designing both a distributed system and a parallel software framework:222In the rest of the paper, when we do not specify the type of distributed DL training approach, DL training refers to the SSGD approach.

  1. It is difficult to scale out DL training. Most large-scale CNN training approaches use mini-batch training333In this paper, we take the data-parallel training approach.. From a computational perspective, a bigger batch size should give higher parallel efficiency and scale out to more nodes. On the contrary, too big a batch size slows the convergence of learning, and can even lower the accuracy of the trained models szegedy2017inception ; masters2018revisiting ; cho2017powerai .

  2. Distributed DL training alternately hits computation bounds (convolution layers), memory bounds (fully-connected layers), and bandwidth bounds (data aggregation among workers) zhang2016parallel . This causes a dilemma when configuring the hardware of the system and when developing and optimizing CNN training algorithms;

  3. The DL training process has frequent I/O operations and synchronizations, loading a mini-batch and synchronizing model weights in each iteration, which greatly decreases the utilization of the whole system and, in turn, causes poor system performance.

To solve the three challenges above, inspired by the Harvard architecture, we design and build Manoa with a dedicated novel topology, which provides a powerful computational capacity of over 1.2 PetaFLOPS (PFLOPS) in single precision. Meanwhile, given the properties of Manoa, we develop and implement MiMatrix, a job server framework that maximizes Manoa's capacity to expedite DL training. In MiMatrix, we also propose a new parallel SSGD algorithm, referred to as GPUDirect RDMA-Aware AllReduce (GDRAA). Its computation and handshake messages are O(1).

As shown in Fig. 1, in analogy to the Harvard architecture, Manoa has a head node, referred to as the job server, which plays the role of the control unit (CU) for the whole system. This node has powerful CPUs handling task-intensive work such as job scheduling, system monitoring, real-time visualization, and I/O. Manoa also has sixteen computing nodes, referred to as computing servers, which play the role of the arithmetic logic unit (ALU) in the Harvard architecture and take on the compute-intensive portions. Besides running the forward and backward passes during training, the computing servers act as both client and server during AllReduce. In addition, Manoa has a 240TB storage node with RAID10. All nodes are connected by both InfiniBand (IB) and Ethernet.

We first propose and develop a job server parallel software framework, named MiMatrix. This software is co-designed with the dedicated topology of Manoa to fully utilize its computational capacity. We also propose a novel GPUDirect RDMA-Aware AllReduce algorithm, GDRAA, and implement it directly with the IB Verbs library rdmaprogramming1 , referred to as ibverbs.

Figure 1: Basic scheme of the architecture of Manoa. Manoa consists of one job server, sixteen computing servers, one storage node (240TB raw, 120TB usable with RAID10), one 10G Ethernet switch, and one 56G InfiniBand switch. Model data is transferred directly via InfiniBand with GPUDirect RDMA. Messages pass through Ethernet.

Along with the quick development of DL, some distributed systems coates2013deep ; chilimbi2014project ; kurth2017deep ; bhattacharjee2017distributed and parallel software frameworks dean2012large ; chen2015mxnet ; caffe2 ; Paddlepaddle ; MPItheano ; tensorflow ; CNTK ; xing2015petuum have been released successfully in recent years. However, all of them either design a cluster using existing general DL SDKs or develop a general distributed SDK running on various distributed systems. Clearly, such systems cannot be fully optimized end to end. To the best of our knowledge, Manoa and MiMatrix are the first hardware and software co-designed and co-developed together to accelerate distributed DL training.

In this paper, we make two major contributions:

  1. To meet the increasing computational demands of DL training, we first co-design and co-develop a peta-scale heterogeneous cluster, Manoa, and a job server DL parallel framework, MiMatrix. Our system is a high-density cluster with a high convergence-speed-to-price ratio (CS/P). The price of Manoa is 900K dollars444 Price in March, 2017. and less than 45% of the Nvidia DGX-1 solution555 https://www.nvidia.com/en-us/data-center/dgx-1/. All equipment is installed in two 48U racks. To the best of our knowledge, it is the highest-density distributed GPU cluster for DL training.

  2. We propose a novel GDRAA algorithm, in which both computation and handshake messages are O(1). We implement the proposed algorithm natively with ibverbs, which fully utilizes the bidirectional bandwidth of all computing servers and reduces data-copy latency.

The rest of the paper is organized as follows. Section 2 conceptually describes how we design Manoa and MiMatrix and briefly introduces our software implementation. Section 3 details our proposed GDRAA AllReduce algorithm and proves its properties. In section 4, we present our experimental results and some analysis. This paper closes with a conclusion of our work and some future directions in section 5.

2 System Design and Software Implementation

At the beginning of this section, we give a conceptual description of the ideas and considerations behind the co-designed distributed heterogeneous system. Following that, we describe our implementation of MiMatrix.

2.1 Description of system design and consideration

Parallelism in a distributed heterogeneous cluster is generally categorized into three levels: 1) worker level, spanning workers in a system; 2) processor level, spanning processors within a worker; and 3) core level, spanning cores within a processor. Since the NVIDIA Collective Communications Library (NCCL) nccl and the cuDNN library chetlur2014cudnn handle the second- and third-level tasks, respectively, we focus on first-level parallelism design and development.

Objective of Co-design System and Considerations Our goal is to design and develop an affordable distributed system that provides enough computational power to finish training most deep CNN models on a large-scale training dataset in one day. We take ResNet101 and ResNet50 he2016deep as the model benchmarks and ImageNet-1K russakovsky2015imagenet as the dataset benchmark. Since GPUs are well-suited to the types of computation in deep CNNs, they are widely used for DL training chen2014big . Additional advantages of GPUs over CPUs include more computational units and higher bandwidth. Consequently, we choose GPUs as the accelerator of Manoa.

We designed Manoa in November 2016. At that time, the Nvidia P100 was the best GPU card for DL training666 At that time, the price of the Nvidia Tesla P100 was about $5,500 per card. . Consequently, we took the Nvidia P100 card with 16GB memory as our GPU accelerator. Also, at that time, the latest generation of CPU was the Broadwell CPU, which has forty PCIe lanes. We selected Mellanox FDR56 InfiniBand adapter cards and switch. Given the port count of the InfiniBand switch and the price, we chose 128 P100 GPU cards and 16 computing nodes; each node has eight GPU cards and two InfiniBand adapter cards.

Scaling Efficiency Measurement A distributed DL system aims at speeding up convergence. As a result, any high computational scaling efficiency that is not measured against convergence to a desirable accuracy is meaningless for a distributed machine learning system. In this paper, we measure scaling using the ratio between an acceptable accuracy and training time, as did cho2017powerai .

To measure how affordable a machine is, we first define a novel criterion, the price and convergence ratio (PCR), defined as:

PCR = 1 / (T × P),                                               (1)

where T is the time (in minutes) for ResNet101 to reach 70% or higher accuracy on the validation data of ImageNet-1K, and P is the hardware price of the whole system in kilo-dollars (K$). PCR reveals how many dollars one minute of convergence demands; bigger is better.
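The PCR definition above can be checked numerically with a short Python sketch, using the Manoa figures reported later in Section 4 (888 minutes to 70% accuracy, 900 K$ system price):

```python
# PCR (price and convergence ratio) as defined above:
# PCR = 1 / (T * P), with T in minutes and P in K$.
def pcr(time_minutes: float, price_kusd: float) -> float:
    return 1.0 / (time_minutes * price_kusd)

# Manoa: ResNet101 reaches 70% top-1 in 888 minutes; price is 900 K$.
print(round(pcr(888, 900), 8))  # 1.25e-06, matching the value reported in Section 4
```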

Manoa Components Manoa consists of one job server, sixteen computing servers, and one storage node, all connected by Ethernet and IB FDR56, as shown in Fig. 1.

The storage node provides 120TB of usable storage from 240TB of hard disk drives in RAID10.

The job server has two Intel Xeon high-end Broadwell CPUs with 512GB memory and over 2TB of SSD storage.

As shown in Fig. 2, each computing server has two CPUs on a non-uniform memory access (NUMA) motherboard. A computing server has eight Nvidia Tesla GPUs; four of them connect to one socket through two PCIe switches. One InfiniBand host channel adapter (HCA) is located on each socket. A P100 GPU card has 9.3 TeraFLOPS single-precision and 18.7 TeraFLOPS half-precision performance, so the GPUs alone give the system over 1.19 PetaFLOPS in single precision and 2.39 PetaFLOPS in half precision. Adding the CPUs of the computing servers and the job server, Manoa has over 1.2 PetaFLOPS in single precision.

All equipment is installed in two 48U racks. To the best of our knowledge, it is the highest-density distributed GPU cluster for DL training kurth2017deep ; cho2017powerai .

Figure 2: Architecture of a computing server. Each computing node has two Intel Xeon CPUs, each of which connects to two PCIe switches and one IB HCA. Each PCIe switch connects two Nvidia Tesla P100 GPUs. The node has a NUMA architecture. Each CPU has forty PCIe3 lanes; the two PCIe switches take thirty-two lanes, and the IB HCA uses the remaining eight lanes.

High Speed Interconnection The data transfer fabric of our system is built with Mellanox FDR56 InfiniBand technology, which provides up to 56Gbit/s bandwidth per link and 4Tb/s bidirectional switching throughput, while the messages that control DL training pass through 10G Ethernet. As shown in Fig. 2, GPU cards connect to the CPU through PCIe 3. HCA cards connect the nodes through the IB switch. Since the Nvidia P100 supports GPUDirect remote direct memory access (RDMA), which transfers data peer-to-peer from GPU to GPU directly, the GPU and CPU memories of all nodes can broadly be considered one memory space connected by InfiniBand.

Three-level Data Cache MiMatrix is designed with a three-level data cache system. The first level is the GPU and CPU memories connected by IB, from which data is fed directly into the running DL training process. The second level is the 2TB SSD of the job server, which holds the current training dataset loaded during training tasks; the 2TB volume is enough for most deep CNN model training. The third level is the storage node, which stores a variety of training datasets. Before a CNN model is trained, the training data is copied to the second-level cache, the SSD of the job server.

2.2 Job Server and Software Implementation

Job Server Framework The parameter server framework is the most widely used parallel DL software architecture dean2012large ; chen2015mxnet . In it, a central node, referred to as the parameter server, receives the model weights from all workers and broadcasts the aggregated weights to all workers in each iteration. Two problems, stragglers and the bandwidth bottleneck, frequently lead to poor performance.

Based on the hardware and topology of Manoa, we propose a novel job server software architecture, named MiMatrix. At the same time, we propose and implement a new AllReduce algorithm, GPUDirect RDMA-Aware AllReduce (GDRAA), in which both computation and handshake messages are O(1), detailed in section 3.

MiMatrix adopts a message-driven framework for DL training. The job server acts as the central node and only receives and sends messages from and to the computing servers and storage. The training process is executed according to a user-defined protocol. Therefore, MiMatrix is flexible enough to support any training approach by redefining the protocol.

When updating the model in each iteration, each computing server is considered both master and slave. As illustrated in Fig. 3 and detailed in Algorithm 1, each computing server not only sends parts of its own weights to the other workers but also receives parts from them. After obtaining the data from all other computing servers, each server separately averages its portion and then broadcasts the result to the other computing servers. Each iteration needs only two synchronizations, which is the theoretical minimum.

Software Implementation We implement MiMatrix with C++11 on the CPU side and CUDA 8.0 cudaprogram with the cuDNN 6.0 library chetlur2014cudnn on the GPU side. The data transfer functions are written directly with ibverbs rdmaprogramming1 . The big advantage of implementing on the low-level ibverbs API rather than MPI mpiprogramming or other InfiniBand upper-layer protocols (ULPs) such as SDP over IB or RDS rdmaprogramming ; bedeir2010rdma ; 8291933 is lower latency and zero-copy transfer.

In our implementation, a single contiguous GPU memory region is registered with ibv_reg_mr and separated into two buffers: the Receive Buffer (RB) and the Send Buffer (SB), as shown in Fig. 3. While the forward and backward passes run, the weights of the model are in the SBs. The RB of each worker holds the parts of the weights of the other workers during weight aggregation. Data transfer directly calls ibv_post_send between GPUs in different workers, offloading the CPU.
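To make the SB/RB partitioning concrete, here is a small Python sketch of the block offset arithmetic. The buffer size, the function name, and the choice to pad blocks up to equal size are illustrative assumptions, not the actual MiMatrix layout:

```python
def block_layout(model_bytes: int, n_workers: int):
    """Split a model buffer into n_workers equal-size blocks; returns
    (offset, length) for each block, shortening the last block if needed."""
    block = (model_bytes + n_workers - 1) // n_workers  # ceiling division
    return [(j * block, min(block, model_bytes - j * block))
            for j in range(n_workers)]

# Worker i sends block j of its Send Buffer to slot i of worker j's
# Receive Buffer; the offsets are the same on both sides.
layout = block_layout(model_bytes=1000, n_workers=16)
print(layout[0])   # (0, 63)
print(layout[15])  # (945, 55)
```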

(a): Reduce Processing. (b): Broadcast Processing.
Figure 3: Illustration of our proposed GDRAA AllReduce algorithm, using worker i as an example. Image (a) shows the reduce process. As a master, worker i sends the data in its SB to the other workers' RBs, shown in the left column. As a slave, the RB of worker i receives data from the other workers, shown in the right column. Image (b) shows the broadcast process. After aggregation, the averaged data of worker i is in its RB. As a master, worker i broadcasts this data to the other workers, shown in the left column. As a slave, worker i receives the averaged data from the other workers.
1: Require: N workers in total; each worker i loads a different mini-batch
2: Require: gradient g_i computed at worker i
3: for t = 1, ..., EN do    EN is the maximum number of iterations
4:     for i = 1, ..., N, each worker in parallel do
5:         if t = 1 then
6:             Skip aggregation
7:         else
8:             if worker i has obtained A(j) from all workers j then    1st synchronization
9:             end if
10:            End of GDRAA AllReduce
11:            Update the model with the aggregated gradient    Update the model
12:        end if
13:        Each worker has the same model
14:        Run the forward and backward passes    Training task part
15:        Start of GDRAA AllReduce part
16:        Divide g_i into N blocks D(i, 1), ..., D(i, N)
17:        for j = 1, ..., N do
18:            Send D(i, j) to worker j    Shown in Fig. 3 (a)
19:        end for
20:        if worker i has obtained D(j, i) from all workers j then    2nd synchronization
21:            Average the received blocks and obtain A(i)
22:            Send A(i) to all workers    Shown in Fig. 3 (b)
23:        end if
24:    end for
25: end for
Algorithm 1 GDRAA Algorithm.
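The three aggregation steps of Algorithm 1 can be sketched in a single-process Python simulation. Plain lists stand in for the SB/RB buffers and for the ibverbs transfers, so this is only a functional sketch of the reduce/aggregate/broadcast logic, not the actual implementation:

```python
from statistics import mean

def gdraa_allreduce(gradients):
    """Simulate GDRAA for N workers, each holding a gradient of N blocks.
    gradients[i][j] is block j of worker i's gradient."""
    n = len(gradients)
    # Step 1 (reduce): worker i receives block i from every worker j.
    rb = [[gradients[j][i] for j in range(n)] for i in range(n)]
    # Step 2 (aggregation): each worker averages its own block.
    averaged = [mean(blocks) for blocks in rb]
    # Step 3 (broadcast): worker i sends averaged block i to all workers;
    # afterwards every worker holds the same full averaged gradient.
    return [averaged[:] for _ in range(n)]

grads = [[1.0, 2.0], [3.0, 4.0]]   # 2 workers, 2 blocks each
print(gdraa_allreduce(grads))       # [[2.0, 3.0], [2.0, 3.0]]
```

Every worker ends up with the element-wise mean of all gradients, which is exactly the result SSGD needs before the model update.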

3 GDRAA Algorithm

In this section, we detail our proposed AllReduce algorithm, GPUDirect RDMA-Aware AllReduce (GDRAA), in Algorithm 1; the basic idea is illustrated in Fig. 3. We then prove that GDRAA is O(1) in both computational complexity and handshake messages with respect to the number of workers N.

In real-world applications, a data copy is composed of data transfer and latency. To ensure that the data transfer time dominates the whole data copy time, our proposed algorithm is based on two assumptions:

  1. Compared to the data transfer time, the latency is tiny.

  2. The number of workers is small enough that the sum of the latency terms is also small.

The data-copy latency of Mellanox FDR56 IB is about 0.7 μs, and MiMatrix is designed for a maximum of 32 workers, so our design clearly satisfies the two assumptions above.

GDRAA is a three-step algorithm: 1) Reduce: each worker obtains weight blocks from all workers; 2) Aggregation: each worker averages the blocks it received; and 3) Broadcast: each worker sends the averaged data to all workers. Each worker allocates a contiguous memory region divided into two segments, the send buffer (SB) and the receive buffer (RB). Both are divided into N blocks if the system has N workers. As shown in Fig. 3, in the Reduce step the SB of worker i holds the blocks D(i, j), where i is the worker index and j is the block index, 1 ≤ j ≤ N, and D(i, j) is sent to block i at worker j. Taking worker i as an example, its SB has N blocks D(i, 1), ..., D(i, N), and block D(i, j) is sent to worker j, block i. Meanwhile, worker i receives data from the other workers: as shown in the right column of Fig. 3 (a), its RB receives the N blocks D(j, i), where j is the worker index. Once a worker has obtained the blocks from all workers, it averages them. Then comes the broadcast part: as shown in Fig. 3 (b), each worker sends the averaged result to all others and, meanwhile, receives the averaged data from the other workers. Again taking worker i as an example, A(i) is the averaged data at worker i and is broadcast to the other workers at block i, while its RB receives all A(j), j ≠ i, from the other workers.

GDRAA runs reduce, aggregation, and broadcast asynchronously and has two synchronizations per iteration to wait for data. The synchronizations are driven by the needs of DL training, rather than by AllReduce itself as in most existing distributed DL approaches. GDRAA synchronizes as late as the DL training allows. Consequently, MiMatrix greatly relieves the straggler problem.

In the rest of this section, we prove that GDRAA is O(1) in both computation and handshake messages.

Lemma 1.

For a distributed system with N workers, given the fixed model size M on each worker, the data volume each worker transfers is O(M), independent of N.

Proof. For any worker, the data of size M is divided into N blocks of size M/N. In the first step, consider worker i: it sends one block to each of the other N-1 workers. Therefore, the size of the data sent is

S_send = (N - 1) · M / N < M.                                    (2)

At the same time, worker i receives one block from each of the other N-1 workers. Therefore, the size of the data received is

S_recv = (N - 1) · M / N < M.                                    (3)

Equations 2 and 3 show that in the first step, before model data averaging, the data that a worker sends and receives is in both cases about M, which does not depend on N.

For step 2 (broadcast), the proof is similar to that of step 1. ∎
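Lemma 1 can be checked numerically: per reduce step, each worker sends and receives (N-1)·M/N bytes, which approaches but never exceeds M regardless of the worker count (the model size below is illustrative):

```python
def per_worker_traffic(model_bytes: float, n_workers: int) -> float:
    """Bytes a worker both sends and receives in the reduce step:
    one block of size M/N to each of the other N-1 workers."""
    return (n_workers - 1) * model_bytes / n_workers

m = 100e6  # an illustrative 100 MB model
for n in (2, 8, 16, 32):
    # traffic in MB: 50.0, 87.5, 93.75, 96.875 — bounded by 100 MB
    print(n, per_worker_traffic(m, n) / 1e6)
```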

Lemma 2.

For a distributed system with N workers, the computational complexity of the aggregation at any worker is O(1) with respect to N.

Proof. In the aggregation of distributed DL training, the operator is averaging. For any worker, the computational cost equals

C = (N - 1) · M / N + M / N.                                     (4)

In Equation 4, the first term is the cost of the add operations and the second term is the cost of the multiply operations. We can rewrite it as:

C = M.                                                           (5)

For any number of workers N, M is fixed. As a result, the computational complexity is O(M), independent of N. ∎
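The same kind of numerical check applies to Lemma 2: the add and multiply counts of Equation 4 sum to exactly M operations for any worker count (the value of M below is illustrative):

```python
def aggregation_ops(m: int, n: int) -> float:
    """Operation count of the per-worker aggregation (Equation 4)."""
    adds = (n - 1) * m / n   # summing N received blocks of size M/N
    muls = m / n             # scaling each element of the M/N-sized block
    return adds + muls

# The total is M for any worker count:
print([aggregation_ops(1_000_000, n) for n in (2, 4, 8, 16)])
# [1000000.0, 1000000.0, 1000000.0, 1000000.0]
```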

4 Experimental Results and Analysis

4.1 Configuration

We used two Intel Xeon E5-2650V4 CPUs with 512GB memory in the job server, and two Intel Xeon E5-2620V4 CPUs with 256GB memory and eight Nvidia Tesla P100 GPUs with 16GB memory in each computing server. In the computing servers, most computational tasks run on the GPUs, so low-end CPUs and a modest amount of memory are enough. This design greatly reduces the cost of the system and increases the PCR.

Manoa has sixteen computing servers installed in two 48U racks. Both the job server and the computing servers are 4U nodes. The operating system is CentOS 7. The kernel version is . The Nvidia driver version is . The GCC version is .

In our training procedure, the parameters are set as follows: the initial learning rate is , the momentum is , the weight decay is , the learning rate policy is "poly", and the gamma is . Recently, some researchers have presented tricks to quickly train with extremely large mini-batches goyal2017accurate ; you2017imagenet ; akiba2017extremely . Some of them could greatly speed up our training tasks. However, in this paper we aim at benchmarking the job server performance and presenting reproducible work, so we did not try these tricks.

4.2 Results and Analysis

As mentioned in section 2, the system performance was evaluated by the training time to reach a desirable accuracy. We conducted our experiments on the ImageNet-1K dataset russakovsky2015imagenet and took ResNet with batch normalization batch as our benchmark he2016deep . We ran two types of tests: 1) PCR measurement: the training task has 32 workers, each with 4 GPUs; in this experiment, each GPU processes 40 images. 2) Scaling efficiency: we measured the scaling efficiency using ResNet50 on ImageNet-1K with 1, 2, 4, 8, 16, and 32 workers. The time measured is for ResNet50 to reach 65% on the validation data, with 64 images per GPU and 4 GPUs per worker777All training times listed in this paper include the time to save models at each epoch..

Figure 4: ResNet101 accuracy on the validation data of ImageNet-1K. The mini-batch size is 1,280. The time to reach 70% accuracy is 888 minutes.

As shown in Fig. 4, the training time for the ResNet101 model to reach 70% accuracy is 888 minutes. The total price of Manoa is 900K dollars. The PCR of Manoa is therefore 0.00000125.

Number of Worker Time (minutes) Scalability
1 4841 1
2 3039 1.59
4 1644 2.84
8 850 5.70
16 430 11.26
32 333 14.56
Table 1: Scaling Efficiency of Manoa.
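The scalability column of Table 1 is the single-worker time divided by the N-worker time. A couple of rows can be recomputed from the reported times (small rounding differences from the published column are possible):

```python
# Training times in minutes, taken from Table 1.
times = {1: 4841, 2: 3039, 8: 850, 16: 430}

def speedup(n: int) -> float:
    """Scalability relative to a single worker, rounded as in Table 1."""
    return round(times[1] / times[n], 2)

print(speedup(8), speedup(16))  # 5.7 11.26
```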

5 Conclusion

In this paper, we propose a co-designed distributed heterogeneous cluster for speeding up DL training. We successfully design and build a high-density 128-GPU cluster, named Manoa, and propose and develop a job server parallel framework for DL training, named MiMatrix, which effectively and efficiently utilizes the whole system. Our system achieves a high ratio between convergence speed and price. Compared to the parameter server framework, the job server framework successfully solves the bandwidth bottleneck and straggler problems. Meanwhile, we also propose a novel AllReduce algorithm, GDRAA, which is O(1) in both computation and handshake messages. We conducted our experiments on the ImageNet-1K dataset, and the performance is state-of-the-art.

In the future, we plan to investigate better hyperparameters, such as an adaptive learning rate policy and momentum, to improve system performance. We will also obtain more results on larger-scale datasets, such as ImageNet-22K and/or Places-365 place2 .


Some of the technology described in this paper is patent pending. Many thanks to Super Micro Computer, Inc. for their help with building Manoa.



  • (1) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
  • (2) J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.
  • (3) W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
  • (4) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
  • (5) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • (6) Y. Bengio, Learning deep architectures for AI, Foundations and trends® in Machine Learning 2 (1) (2009) 1–127.
  • (7) G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, Signal Processing Magazine, IEEE 29 (6) (2012) 82–97.
  • (8) D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: End-to-end speech recognition in english and mandarin, in: International Conference on Machine Learning, 2016, pp. 173–182.
  • (9) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of go with deep neural networks and tree search, nature 529 (7587) (2016) 484–489.
  • (10) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al., Mastering the game of go without human knowledge, Nature 550 (7676) (2017) 354.
  • (11) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
  • (12) T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: ECCV, 2014, pp. 740–755.
  • (13) K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proc. International Conference on Learning Representations, 2015.
  • (14) G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • (15) S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint arXiv:1609.04747.
  • (16) C. Zhang, Q. Liao, A. Rakhlin, B. Miranda, N. Golowich, T. Poggio, Theory of deep learning iib: Optimization properties of SGD, arXiv preprint arXiv:1801.02254.
  • (17) A. Daniely, SGD learns the conjugate kernel class of the network, arXiv preprint arXiv:1702.08503.
  • (18) J. Chen, R. Monga, S. Bengio, R. Jozefowicz, Revisiting distributed synchronous SGD, in: International Conference on Learning Representations Workshop Track, 2016.
  • (19) G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, in: NIPS Deep Learning and Representation Learning Workshop, 2015.
  • (20) S. Han, J. Pool, J. Tran, W. Dally, Learning both weights and connections for efficient neural network, in: Advances in neural information processing systems, 2015, pp. 1135–1143.
  • (21) P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, 2017.
  • (22) C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: AAAI, Vol. 4, 2017, p. 12.

  • (23) D. Masters, C. Luschi, Revisiting small batch training for deep neural networks, arXiv preprint arXiv:1804.07612.
  • (24) M. Cho, U. Finkler, S. Kumar, D. Kung, V. Saxena, D. Sreedhar, PowerAI DDL, arXiv preprint arXiv:1708.02188.
  • (25) J. Zhang, C. De Sa, I. Mitliagkas, C. Ré, Parallel sgd: When does averaging help?, arXiv preprint arXiv:1606.07365.
  • (26) Mellanox Technology RDMA Aware Programming user manual, http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf, last accessed March 3, 2018.
  • (27) A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, N. Andrew, Deep learning with COTS HPC systems, in: International Conference on Machine Learning, 2013, pp. 1337–1345.
  • (28) T. M. Chilimbi, Y. Suzue, J. Apacible, K. Kalyanaraman, Project adam: Building an efficient and scalable deep learning training system., in: OSDI, Vol. 14, 2014, pp. 571–582.
  • (29) T. Kurth, J. Zhang, N. Satish, E. Racah, I. Mitliagkas, M. M. A. Patwary, T. Malas, N. Sundaram, W. Bhimji, M. Smorkalov, et al., Deep learning at 15PF: supervised and semi-supervised classification for scientific data, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2017, p. 7.
  • (30) B. Bhattacharjee, M. Hill, H. Wu, P. Chandakkar, J. Smith, M. Wegman, Distributed learning of deep feature embeddings for visual recognition tasks, IBM Journal of Research and Development 61 (4) (2017) 4–1.
  • (31) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., Large scale distributed deep networks, in: Advances in neural information processing systems, 2012, pp. 1223–1231.
  • (32) T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems, in: NIPS Workshop on Machine Learning Systems (LearningSys), 2016.
  • (33) Caffe2, https://caffe2.ai/, last accessed March 23, 2018.
  • (34) PaddlePaddle, https://github.com/PaddlePaddle/Paddle, last accessed March 25, 2018.
  • (35) H. Ma, F. Mao, G. W. Taylor, Theano-MPI: A theano-based distributed training framework, in: F. Desprez, P.-F. Dutot, C. Kaklamanis, L. Marchal, K. Molitorisz, L. Ricci, V. Scarano, M. A. Vega-Rodríguez, A. L. Varbanescu, S. Hunold, S. L. Scott, S. Lankes, J. Weidendorfer (Eds.), Euro-Par 2016: Parallel Processing Workshops, Springer International Publishing, 2017, pp. 800–813.
  • (36) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, Tensorflow: A system for large-scale machine learning, in: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, USENIX Association, Berkeley, CA, USA, 2016, pp. 265–283.
  • (37) CNTK, https://github.com/Microsoft/CNTK, last accessed March 16, 2018.
  • (38) E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, Y. Yu, Petuum: A new platform for distributed machine learning on big data, IEEE Transactions on Big Data 1 (2) (2015) 49–67.
  • (39) Nvidia collective communications library (NCCL), https://developer.nvidia.com/nccl, last accessed March 26, 2018.
  • (40) S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, E. Shelhamer, cuDNN: Efficient primitives for deep learning, arXiv preprint arXiv:1410.0759.
  • (41) X.-W. Chen, X. Lin, Big data deep learning: challenges and perspectives, IEEE Access 2 (2014) 514–525.
  • (42) Nvidia, CUDA C programming guide, http://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf, last accessed December 2017.
  • (43) Message Passing Interface Forum, MPI: A message-passing interface standard version 3.0, http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, last accessed March 23, 2018.
  • (44) Mellanox Technologies, RDMA aware programming user manual, http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf, last accessed March 8, 2018.
  • (45) T. Bedeir, RDMA read and write with IB verbs, Technical report, HPC Advisory Council, 2010. URL: http://www.hpcadvisorycouncil.com/pdf/rdma-read-and-write-with-ib-verbs.pdf.
  • (46) Y. Ren, X. Wu, L. Zhang, Y. Wang, W. Zhang, Z. Wang, M. Hack, S. Jiang, iRDMA: Efficient use of RDMA in distributed deep learning systems, in: 2017 IEEE 19th International Conference on High Performance Computing and Communications; IEEE 15th International Conference on Smart City; IEEE 3rd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2017, pp. 231–238.
  • (47) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, K. He, Accurate, large minibatch SGD: Training ImageNet in 1 hour, arXiv preprint arXiv:1706.02677.
  • (48) Y. You, Z. Zhang, C. Hsieh, J. Demmel, K. Keutzer, ImageNet training in minutes, arXiv preprint arXiv:1709.05011.
  • (49) T. Akiba, S. Suzuki, K. Fukuda, Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes, arXiv preprint arXiv:1711.04325.
  • (50) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, 2015, pp. 448–456.
  • (51) B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 million image database for scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6) (2018) 1452–1464.