Distributed-Memory Sparse Kernels for Machine Learning

03/15/2022
by Vivek Bharadwaj, et al.

Sampled Dense Times Dense Matrix Multiplication (SDDMM) and Sparse Times Dense Matrix Multiplication (SpMM) appear in diverse settings, such as collaborative filtering, document clustering, and graph embedding. Frequently, the SDDMM output becomes the input sparse matrix for a subsequent SpMM operation. Existing work has focused on shared-memory parallelization of these primitives. While there has been extensive analysis of communication-minimizing distributed 1.5D algorithms for SpMM, no such analysis exists for SDDMM or the back-to-back sequence of SDDMM and SpMM, termed FusedMM. We show that distributed-memory 1.5D and 2.5D algorithms for SpMM can be converted to algorithms for SDDMM with identical communication costs and input/output data layouts. Further, we give two communication-eliding strategies to further reduce costs for FusedMM kernels: either reusing the replication of an input dense matrix for the SDDMM and SpMM in sequence, or fusing the local SDDMM and SpMM kernels. We benchmark FusedMM algorithms on Cori, a Cray XC40 at LBNL, using Erdős–Rényi random matrices and large real-world sparse matrices. On 256 nodes with 68 cores each, 1.5D FusedMM algorithms using either communication-eliding approach can save at least 30% compared to executing a distributed-memory SpMM and SDDMM kernel in sequence. On real-world matrices with hundreds of millions of edges, all of our algorithms exhibit at least a 10x speedup over the SpMM algorithm in PETSc. On these matrices, our communication-eliding techniques exhibit runtimes up to 1.6 times faster than an unoptimized sequence of SDDMM and SpMM. We embed and test the scaling of our algorithms in real-world applications, including collaborative filtering via alternating least squares and inference for attention-based graph neural networks.
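For readers unfamiliar with the kernel pair, the following is a minimal local (single-node, SciPy-based) sketch of the SDDMM-then-SpMM sequence that FusedMM targets; it is not the paper's distributed implementation, and the matrix names S, A, B and the helper sddmm_then_spmm are illustrative assumptions.

    import numpy as np
    import scipy.sparse as sp

    def sddmm_then_spmm(S, A, B):
        # SDDMM: evaluate the dot products A[i, :] . B[j, :] only at the nonzeros
        # (i, j) of the sparse sampling matrix S, scaled by S's values, producing
        # a sparse matrix M with the same sparsity pattern as S.
        S = S.tocoo()
        rows, cols = S.row, S.col
        vals = S.data * np.einsum("ij,ij->i", A[rows, :], B[cols, :])
        M = sp.coo_matrix((vals, (rows, cols)), shape=S.shape).tocsr()
        # SpMM: multiply the sparse SDDMM output by the dense matrix B.
        return M @ B

    rng = np.random.default_rng(0)
    S = sp.random(1000, 1000, density=0.01, format="csr", random_state=0)
    A = rng.standard_normal((1000, 16))
    B = rng.standard_normal((1000, 16))
    out = sddmm_then_spmm(S, A, B)  # dense 1000 x 16 result

Because the dense matrix B feeds both the SDDMM and the subsequent SpMM, a distributed implementation can reuse its replication across the two kernels or fuse the two local loops, which is the intuition behind the communication-eliding strategies described in the abstract.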


