COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

06/11/2021
by   Marko Kabić, et al.
0

Communication-avoiding algorithms for Linear Algebra have become increasingly popular, in particular for distributed memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, thus making data reshuffling a key to use them. For performance reasons, a straightforward all-to-all exchange must be avoided. Here, we show that process relabeling (i.e. permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that it can be efficiently found by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly-optimised algorithm implements A=α·op(B) + β· A, op∈{transpose, conjugate-transpose, identity} on distributed systems, where A, B are matrices with potentially different (distributed) layouts and α, β are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. The implementation not only outperforms the best available ScaLAPACK redistribute and transpose routines multiple times, but is also able to deal with more general matrix layouts, in particular it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. This way, we show that COSTA can be used to unlock the full potential of recent Linear Algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/29/2017

Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

Matrix-matrix multiplication is a basic operation in linear algebra and ...
research
08/28/2020

Distributed-memory ℋ-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

We introduce a data distribution scheme for ℋ-matrices and a distributed...
research
12/29/2020

Scalable Parallel Linear Solver for Compact Banded Systems on Heterogeneous Architectures

A scalable algorithm for solving compact banded linear systems on distri...
research
09/24/2019

A high-level characterisation and generalisation of communication-avoiding programming techniques

Today's hardware's explosion of concurrency plus the explosion of data w...
research
09/09/2019

Scheduling optimization of parallel linear algebra algorithms using Supervised Learning

Linear algebra algorithms are used widely in a variety of domains, e.g m...
research
05/08/2019

Performance Engineering for a Tall Skinny Matrix Multiplication Kernel on GPUs

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS lib...
research
04/20/2020

A Generalization of the Allreduce Operation

Allreduce is one of the most frequently used MPI collective operations, ...

Please sign up or login with your details

Forgot password? Click here to reset