Decomposing Collectives for Exploiting Multi-lane Communication

10/29/2019
by Jesper Larsson Träff, et al.

Many modern high-performance systems increase the cumulative node bandwidth by offering more than a single communication network and/or by having multiple connections to the network. Efficient algorithms and implementations for collective operations, as found in, e.g., MPI, must be explicitly designed to exploit such multi-lane capabilities. We discuss a model for the design of multi-lane algorithms, and in particular give a recipe for converting any standard, one-ported, (pipelined) communication tree algorithm into a multi-lane algorithm that can effectively use k lanes simultaneously. We first examine the problem from the perspective of self-consistent performance guidelines, and give simple, full-lane mock-up implementations of the MPI broadcast, reduction, gather, scatter, allgather, and alltoall operations using only similar operations of the given MPI library itself. Contrary to expectation, these mock-up implementations in many cases outperform the native collectives of different MPI libraries on a small 36-node, dual-socket, dual-lane Intel Omni-Path cluster, indicating severe problems with the native MPI library implementations. Our full-lane implementations are in many cases considerably more than a factor of two faster than the corresponding MPI collectives, and we see similar results on the larger Vienna Scientific Cluster, VSC-3. These experiments indicate considerable room for improvement in the MPI collectives of current libraries, including more efficient use of multi-lane communication.
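
To make the recipe concrete, here is a minimal mock-up sketch of a multi-lane broadcast built only from the MPI library's own collectives, in the spirit of the decomposition described above. The communicator layout (k MPI processes per node, one per lane; a per-node communicator nodecomm and a per-lane communicator lanecomm built with MPI_Comm_split), the function name multilane_bcast, and the simplifying assumptions that k divides the element count and that the datatype is contiguous are ours for illustration only; this is not code from the paper.

#include <mpi.h>
#include <stdlib.h>

/* Hypothetical setup: with k consecutive MPI ranks per node,
 * node = rank / k and lane = rank % k, so
 *   MPI_Comm_split(MPI_COMM_WORLD, rank / k, rank % k, &nodecomm);
 *   MPI_Comm_split(MPI_COMM_WORLD, rank % k, rank / k, &lanecomm);
 * give each node its own nodecomm and connect the i-th process of
 * every node through lane communicator i, with node indices
 * consistent across lanes. */

/* Broadcast count elements of buf from node-local rank 0 of rootnode
 * to every process, moving only count/k elements per lane across nodes. */
static void multilane_bcast(void *buf, int count, MPI_Datatype type,
                            int rootnode,
                            MPI_Comm nodecomm, MPI_Comm lanecomm)
{
    int k, lane, node;
    MPI_Comm_size(nodecomm, &k);     /* number of lanes = processes per node */
    MPI_Comm_rank(nodecomm, &lane);  /* which lane this process serves */
    MPI_Comm_rank(lanecomm, &node);  /* this node's position within the lane */

    int block = count / k;           /* assume k divides count, for brevity */
    MPI_Aint lb, extent;
    MPI_Type_get_extent(type, &lb, &extent);
    char *tmp = malloc((size_t)block * extent);  /* assumes contiguous type */

    /* 1. Root node only: scatter the k blocks over its k lane processes. */
    if (node == rootnode)
        MPI_Scatter(buf, block, type, tmp, block, type, 0, nodecomm);

    /* 2. All lanes concurrently: broadcast each block along its own lane. */
    MPI_Bcast(tmp, block, type, rootnode, lanecomm);

    /* 3. Reassemble the full buffer on every process of every node. */
    MPI_Allgather(tmp, block, type, buf, block, type, nodecomm);

    free(tmp);
}

Since each lane carries only count/k elements between nodes in step 2, the inter-node phase can in principle run up to k times faster than a one-lane broadcast; bracketing the lane-parallel phase with scatter and allgather in this way suggests analogous mock-ups for the other collectives measured in the paper.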


Related research

08/27/2020
k-ported vs. k-lane Broadcast, Scatter, and Alltoall Algorithms
In k-ported message-passing systems, a processor can simultaneously rece...

11/28/2022
RAMP: A Flat Nanosecond Optical Network and MPI Operations for Distributed Deep Learning Systems
Distributed deep learning (DDL) systems strongly depend on network perfo...

04/23/2020
Accurate runtime selection of optimal MPI collective algorithms using analytical performance modelling
The performance of collective operations has been a critical issue since...

02/06/2020
Scalable Communication Endpoints for MPI+Threads Applications
Hybrid MPI+threads programming is gaining prominence as an alternative t...

05/01/2020
How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI
MPI+threads is gaining prominence as an alternative to the traditional M...

12/13/2013
Transparent Checkpoint-Restart over InfiniBand
InfiniBand is widely used for low-latency, high-throughput cluster compu...

09/26/2021
A Doubly-pipelined, Dual-root Reduction-to-all Algorithm and Implementation
We discuss a simple, binary tree-based algorithm for the collective allr...
