Efficient Sparse-Dense Matrix-Matrix Multiplication on GPUs Using the Customized Sparse Storage Format

05/29/2020
by   Shaohuai Shi, et al.
0

Multiplication of a sparse matrix to a dense matrix (SpDM) is widely used in many areas like scientific computing and machine learning. However, existing works under-look the performance optimization of SpDM on modern many-core architectures like GPUs. The storage data structures help sparse matrices store in a memory-saving format, but they bring difficulties in optimizing the performance of SpDM on modern GPUs due to irregular data access of the sparse structure, which results in lower resource utilization and poorer performance. In this paper, we refer to the roofline performance model of GPUs to design an efficient SpDM algorithm called GCOOSpDM, in which we exploit coalescent global memory access, fast shared memory reuse and more operations per byte of global memory traffic. Experiments are evaluated on three Nvidia GPUs (i.e., GTX 980, GTX Titan X Pascal and Tesla P100) with CUDA-8.0 using a large number of matrices including a public dataset and randomly generated matrices. Experimental results show that GCOOSpDM achieves 1.5-8× speedup over Nvidia's library cuSPARSE in many matrices. We also analyze instruction-level operations on a particular GPU to understand the performance gap between GCOOSpDM and cuSPARSE. The profiled instructions confirm that cuSPARSE spends a lot of time on slow memory access (including DRAM access and L2 cache access), while GCOOSpDM transfers such slow memory access to faster shared memory, which mainly contributes to the performance gain. Results also show that GCOOSpDM would outperform the dense algorithm (cuBLAS) with lower sparsity than cuSPARSE on GPUs.

READ FULL TEXT
research
03/24/2021

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although the matrix multiplication plays a vital role in computational l...
research
06/18/2020

Sparse GPU Kernels for Deep Learning

Scientific workloads have traditionally exploited high levels of sparsit...
research
02/09/2020

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs

Linear algebra operations have been widely used in big data analytics an...
research
08/24/2018

Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

Conventional GPU implementations of Strassen's algorithm (Strassen) typi...
research
11/09/2018

Spatter: A Benchmark Suite for Evaluating Sparse Access Patterns

Recent characterizations of data movement performance have evaluated opt...
research
01/11/2019

Automatic acceleration of Numpy applications on GPUs and multicore CPUs

Frameworks like Numpy are a popular choice for application developers fr...
research
11/22/2020

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads

Sparse matrices are the key ingredients of several application domains, ...

Please sign up or login with your details

Forgot password? Click here to reset