Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

08/26/2019
by   Grzegorz Kwasniewski, et al.

We propose COSMA: a parallel matrix-matrix multiplication (MMM) algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix the processor decomposition upfront and then map it to the matrix dimensions, COSMA reduces communication volume by up to √3 times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open source implementation.
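To illustrate why fixing the processor decomposition upfront can cost communication, here is a minimal sketch under a deliberately simplified model. It is not COSMA's schedule and does not reproduce the paper's bounds. It assumes each of p processors owns one rectangular block of the m x n x k iteration space and must touch the matching blocks of A, B, and C; the function names `block_footprint` and `best_grid` are hypothetical.

```python
def divisors(p):
    """All positive divisors of p."""
    return [d for d in range(1, p + 1) if p % d == 0]


def block_footprint(m, n, k, grid):
    """Words of A, B and C touched by one processor (simplified model).

    grid = (pm, pn, pk) splits the m, n and k dimensions respectively.
    """
    pm, pn, pk = grid
    a, b, c = m / pm, n / pn, k / pk
    # A block is a x c, B block is c x b, C block is a x b.
    return a * c + c * b + a * b


def best_grid(m, n, k, p):
    """Exhaustively pick the grid pm * pn * pk == p with the smallest footprint."""
    candidates = [
        (pm, pn, p // (pm * pn))
        for pm in divisors(p)
        for pn in divisors(p // pm)
    ]
    return min(candidates, key=lambda g: block_footprint(m, n, k, g))


if __name__ == "__main__":
    # Square matrices: a cubic grid is the best choice in this model.
    print(best_grid(1024, 1024, 1024, 8))
    # Very large inner dimension k: splitting only k is cheaper than a cubic grid.
    print(best_grid(64, 64, 2**20, 8))
```

In this toy model the preferred grid changes with the matrix shape, which is the effect the abstract attributes to algorithms that adapt the decomposition to the matrix dimensions. Memory limits and the actual data movement pattern are ignored here.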


research
05/26/2022

Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

Communication lower bounds have long been established for matrix multipl...
research
12/15/2018

Layer Based Partition for Matrix Multiplication on Heterogeneous Processor Platforms

While many approaches have been proposed to analyze the problem of matri...
research
08/20/2021

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Matrix factorizations are among the most important building blocks of sc...
research
07/06/2019

Optimizing Xeon Phi for Interactive Data Analysis

The Intel Xeon Phi manycore processor is designed to provide high perfor...
research
11/13/2019

Improving the Space-Time Efficiency of Processor-Oblivious Matrix Multiplication Algorithms

Classic cache-oblivious parallel matrix multiplication algorithms achiev...
research
07/17/2023

Optimizing Distributed Tensor Contractions using Node-Aware Processor Grids

We propose an algorithm that aims at minimizing the inter-node communica...
research
04/11/2019

The MOMMS Family of Matrix Multiplication Algorithms

As the ratio between the rate of computation and rate with which data ca...
