Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

01/09/2023
by Muhammad Osama, et al.

We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces peak speedups of up to 14× and 6.7×, and an average performance response that is both higher and more consistent across 32,824 GEMM problem geometries than state-of-the-art math libraries such as CUTLASS and cuBLAS. Furthermore, we achieve this performance from a single tile size configuration per floating-point precision, whereas today's math libraries employ complex kernel-selection heuristics to select from a large ensemble of kernel variants.
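To make the decomposition concrete, the sketch below is a minimal, self-contained C++ illustration (not the authors' CUDA implementation) of how the global MAC-loop iteration space could be split evenly across processing elements instead of assigning whole output tiles per thread block. The problem size, tile shape, and SM count are illustrative assumptions.

```cpp
// Host-side sketch of the Stream-K partitioning idea: split the total number of
// MAC-loop iterations evenly across a fixed number of processing elements (SMs),
// rather than mapping one output tile to one thread block.
#include <cstdio>

int main() {
    // Illustrative problem and tile geometry (assumptions for this sketch).
    const int M = 3072, N = 3072, K = 4096;
    const int BLK_M = 128, BLK_N = 128, BLK_K = 8;

    const int tiles_m = (M + BLK_M - 1) / BLK_M;
    const int tiles_n = (N + BLK_N - 1) / BLK_N;
    const int iters_per_tile = (K + BLK_K - 1) / BLK_K;   // MAC-loop iterations per output tile
    const long long total_iters = 1LL * tiles_m * tiles_n * iters_per_tile;

    const int num_sms = 108;  // e.g., one CTA per SM on an A100-class GPU (assumption)

    // Each processing element receives an (almost) equal share of the iteration space.
    for (int sm = 0; sm < num_sms; ++sm) {
        long long start = total_iters *  sm      / num_sms;
        long long end   = total_iters * (sm + 1) / num_sms;

        // A share may begin or end partway through an output tile, so partial results
        // for tiles shared by neighboring SMs must later be reduced (a fix-up step).
        long long first_tile = start / iters_per_tile;
        long long last_tile  = (end - 1) / iters_per_tile;

        if (sm < 3 || sm == num_sms - 1)
            std::printf("SM %3d: iters [%lld, %lld) -> output tiles %lld..%lld\n",
                        sm, start, end, first_tile, last_tile);
    }
    return 0;
}
```

Because every processing element owns the same number of inner-loop iterations, utilization stays balanced even when the number of output tiles does not divide evenly by the number of SMs.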

