RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems

11/28/2022
by Alessandro Ottino, et al.

Distributed deep learning (DDL) systems depend strongly on network performance. Current electronic packet-switched (EPS) network architectures and technologies suffer from variable-diameter topologies, low bisection bandwidth, and over-subscription, all of which increase the completion time of communication and collective operations. We introduce RAMP, a near-exascale, full-bisection-bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration that supports large-scale distributed and parallel computing systems (12.8 Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder are proposed to run MPI collective operations across the optical circuit-switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves a 7.6-171× speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It also delivers a 1.3-16× and 7.8-58× reduction in Megatron and DLRM training time, respectively, while offering a 42-53× and 3.3-12.4× improvement in energy consumption and cost, respectively.
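For context on why collective completion time dominates DDL performance, the sketch below simulates a textbook ring all-reduce, one of the MPI collectives whose cost depends on bisection bandwidth and hop count. This is a generic illustration, not the paper's RAMP-x strategy; the function name and structure are my own.

```python
def ring_allreduce(node_chunks):
    """Simulate a sum ring all-reduce (illustrative, not RAMP-x).

    node_chunks: list of N lists, each with N numeric chunks
    (one chunk per node). Returns the per-node buffers after the
    collective, where every node holds the element-wise sums.
    """
    n = len(node_chunks)
    data = [list(c) for c in node_chunks]
    # Reduce-scatter phase: in step s, node i sends chunk (i - s) % n
    # to its right neighbour, which accumulates it. Snapshot the sends
    # first so all nodes act on the same step's data.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n]) for i in range(n)]
        for i, idx, val in sends:
            data[(i + 1) % n][idx] += val
    # Now node i holds the fully reduced chunk (i + 1) % n.
    # All-gather phase: circulate the completed chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n])
                 for i in range(n)]
        for i, idx, val in sends:
            data[(i + 1) % n][idx] = val
    return data
```

Each of the 2(N-1) steps moves one chunk per link, so completion time scales with the slowest link per step; a flat, full-bisection-bandwidth fabric removes the over-subscription penalty that inflates these steps on conventional EPS topologies.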


research
10/29/2019

Decomposing Collectives for Exploiting Multi-lane Communication

Many modern, high-performance systems increase the cumulated node-bandwi...
research
07/29/2019

Improving MPI Collective I/O Performance With Intra-node Request Aggregation

Two-phase I/O is a well-known strategy for implementing collective MPI-I...
research
11/28/2022

OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems

All-gather collective communication is one of the most important communi...
research
02/10/2020

PULSE: Optical circuit switched Data Center architecture operating at nanosecond timescales

We introduce PULSE, a sub-microsecond optical circuit-switched data cent...
research
02/20/2018

Efficient Embedding of MPI Collectives in MXNET DAGs for scaling Deep Learning

Availability of high performance computing infrastructures such as clust...
research
09/17/2021

Sparbit: a new logarithmic-cost and data locality-aware MPI Allgather algorithm

The collective operations are considered critical for improving the perf...
research
12/02/2021

Memory-efficient array redistribution through portable collective communication

Modern large-scale deep learning workloads highlight the need for parall...
