
GE-SpMM: General-purpose Sparse Matrix-Matrix Multiplication on GPUs for Graph Neural Networks
Graph Neural Networks (GNNs) have achieved significant improvements in v...

Sparse Matrix to Matrix Multiplication: A Representation and Architecture for Acceleration (long version)
Accelerators for sparse matrix multiplication are important components i...

SpArch: Efficient Architecture for Sparse Matrix Multiplication
Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous...

Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) are recently getting much attention ...

A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much ...

On Optimal Partitioning For Sparse Matrices In Variable Block Row Format
The Variable Block Row (VBR) format is an influential blocked sparse mat...

A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication
The symmetric sparse matrix-vector multiplication (SymmSpMV) is an impor...
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices and that is crucial to getting excellent performance on SpMM. By combining these two ingredients, (i) merge-based load-balancing and (ii) row-major coalesced memory access, we demonstrate a 3.6x peak speedup and a 23.5% geomean speedup on real-world datasets.
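
To make the CSR-based SpMM operation in this abstract concrete, here is a minimal CPU sketch in Python. The function and variable names (`csr_spmm`, `row_ptr`, `col_idx`, `vals`) are illustrative assumptions, not from the paper, and the GPU-specific ingredients (merge-based load-balancing, coalesced row-major access, instruction-level parallelism) are deliberately not modeled; this only shows how CSR storage drives the computation.

```python
def csr_spmm(row_ptr, col_idx, vals, B, n_cols):
    """Compute C = A @ B, where A is an m x k sparse matrix in CSR form
    (row_ptr, col_idx, vals) and B is a dense k x n_cols matrix stored
    row-major as a list of rows. Returns C as a list of rows."""
    m = len(row_ptr) - 1
    C = [[0.0] * n_cols for _ in range(m)]
    for i in range(m):
        # Walk the nonzeros of row i of A.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            k = col_idx[p]   # column of the nonzero in A
            a = vals[p]      # its value
            # Accumulate a scaled row of B into row i of C
            # (row-major traversal of B and C).
            for j in range(n_cols):
                C[i][j] += a * B[k][j]
    return C

# Example: A = [[1, 0], [0, 2]] in CSR, B = [[1, 2], [3, 4]]
row_ptr = [0, 1, 2]
col_idx = [0, 1]
vals = [1.0, 2.0]
print(csr_spmm(row_ptr, col_idx, vals, [[1.0, 2.0], [3.0, 4.0]], 2))
# → [[1.0, 2.0], [6.0, 8.0]]
```

Note how `row_ptr` alone delimits each row's nonzeros, which is why CSR input needs no format conversion before SpMM; the uneven gaps between consecutive `row_ptr` entries are exactly the load imbalance the paper's merge-based scheme addresses on the GPU.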