
GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks
Graph Neural Networks (GNNs) have achieved significant improvements in v...

Sparse Matrix to Matrix Multiplication: A Representation and Architecture for Acceleration (long version)
Accelerators for sparse matrix multiplication are important components i...

SpArch: Efficient Architecture for Sparse Matrix Multiplication
Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous...

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are recently getting much attention ...

A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much ...

On Optimal Partitioning For Sparse Matrices In Variable Block Row Format
The Variable Block Row (VBR) format is an influential blocked sparse mat...

A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an impor...
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices and that is crucial to getting excellent performance on SpMM. By combining these two ingredients, (i) merge-based load-balancing and (ii) row-major coalesced memory access, we demonstrate a 3.6x peak speedup and a 23.5% geomean speedup on real-world datasets.
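
To make the CSR-based SpMM operation in this abstract concrete, here is a minimal CPU sketch in Python. The function and variable names (`csr_spmm`, `row_ptr`, `col_idx`, `vals`) are illustrative assumptions, not from the paper, and the GPU-specific ingredients (merge-based load-balancing, coalesced row-major access, instruction-level parallelism) are deliberately not modeled; this only shows how CSR storage drives the computation.

```python
def csr_spmm(row_ptr, col_idx, vals, B, n_cols):
    """Compute C = A @ B, where A is an m x k sparse matrix in CSR form
    (row_ptr, col_idx, vals) and B is a dense k x n_cols matrix stored
    row-major as a list of rows. Returns C as a list of rows."""
    m = len(row_ptr) - 1
    C = [[0.0] * n_cols for _ in range(m)]
    for i in range(m):
        # Walk the nonzeros of row i of A.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k = col_idx[p]   # column of the nonzero in A
            a = vals[p]      # its value
            # Accumulate a scaled row of B into row i of C
            # (row-major traversal of B and C).
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C

# Example: A = [[1, 0], [0, 2]] in CSR, B = [[1, 2], [3, 4]]
row_ptr = [0, 1, 2]
col_idx = [0, 1]
vals = [1.0, 2.0]
print(csr_spmm(row_ptr, col_idx, vals, [[1.0, 2.0], [3.0, 4.0]], 2))
# → [[1.0, 2.0], [6.0, 8.0]]
```

Note how `row_ptr` alone delimits each row's nonzeros, which is why CSR input needs no format conversion before SpMM; the uneven gaps between consecutive `row_ptr` entries are exactly the load imbalance the paper's merge-based scheme addresses on the GPU.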