Performance Engineering for a Tall Skinny Matrix Multiplication Kernel on GPUs

05/08/2019
by Dominik Ernst, et al.

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show poor performance for tall skinny matrices, which are much taller than wide. Nvidia's current cuBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches to parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables an implementation that is simultaneously flexible and specialized, with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest, on an Nvidia Volta GPGPU.
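The roofline bound mentioned above follows from the arithmetic intensity of a tall-skinny GEMM: for C = A^T B with A of size K×M and B of size K×N (M, N small, K large), the kernel is memory-bound, so the attainable rate is memory bandwidth times flops per byte. The sketch below illustrates this calculation; the peak-flop and bandwidth figures are approximate V100 values assumed for illustration, not taken from the paper.

```python
def roofline_gflops(M, N, K, peak_gflops=7800.0, bw_gbs=900.0):
    """Roofline performance bound (GFLOP/s) for a tall-skinny
    double-precision GEMM C = A^T * B, with A (K x M), B (K x N).

    peak_gflops and bw_gbs are assumed, rough V100 numbers
    (~7.8 TFLOP/s DP peak, ~900 GB/s HBM2 bandwidth).
    """
    flops = 2.0 * M * N * K                 # one FMA per (m, n, k) triple
    # Minimum traffic: A and B streamed once from memory, 8 bytes/double;
    # the M x N result C is negligible for large K.
    bytes_moved = 8.0 * K * (M + N)
    intensity = flops / bytes_moved         # flops per byte
    return min(peak_gflops, bw_gbs * intensity)

# For M = N = 4 the intensity is 2*16 / (8*8) = 0.5 flop/byte,
# so the bound is memory-side: 900 * 0.5 = 450 GFLOP/s,
# far below the compute peak -- this is the regime the paper targets.
print(roofline_gflops(4, 4, 10**7))
```

Note how the bound depends only on M and N, not K: growing the long dimension adds flops and bytes in the same proportion, which is why a whole family of tall-skinny sizes shares one memory-bound ceiling.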

