Automating the Last-Mile for High Performance Dense Linear Algebra

11/24/2016
by Richard Michael Veras, et al.

High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of the performance of the complex-valued Gemm kernels, of the entire level-3 BLAS, and of many of the real and complex LAPACK routines. Thus, achieving high performance for the Gemm kernel translates into high performance for the linear algebra stack built above it. However, it is a monumental task for a domain expert to manually implement the kernel for every architecture a library supports. This has led to the belief that crafting a Gemm kernel is more dark art than science, a premise that drives the popularity of autotuning with code generation in the domain of DLA. This paper, instead, takes an analytical approach to code generation of the Gemm kernel for different architectures, in order to shed light on the details, or voodoo, required to implement a high performance Gemm kernel. We distill the implementation of the kernel into an even smaller kernel, an outer product, and analytically determine how the available SIMD instructions can be used to compute the outer product efficiently. We codify this approach into a system that automatically generates a high performance SIMD implementation of the Gemm kernel. Experimental results demonstrate that our generated kernels achieve performance competitive with kernels implemented manually or via empirical search.
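To make the outer-product formulation concrete, the sketch below shows what such a micro-kernel might look like on an AVX2/FMA machine: a 4x4 block of C is held in vector registers while rank-1 (outer-product) updates are accumulated over the k dimension. The function name, packing layout, and 4x4 blocking size are illustrative assumptions for this sketch, not the paper's generated code.

```c
#include <immintrin.h>

/* Illustrative 4x4 double-precision micro-kernel (AVX2 + FMA), a
 * minimal sketch of the outer-product formulation: C(4x4) += A * B,
 * where A is a packed 4 x k panel (column-major) and B a packed
 * k x 4 panel (row-major), as is typical for Gemm micro-kernels.
 * Each k-iteration performs one rank-1 update a_p * b_p^T using one
 * vector load, four broadcasts, and four fused multiply-adds. */
void outer_product_4x4(int k, const double *A, const double *B,
                       double *C, int ldc)
{
    /* Keep the 4x4 block of C (one column per register) in the
     * register file for the duration of the k-loop. */
    __m256d c0 = _mm256_loadu_pd(&C[0 * ldc]);
    __m256d c1 = _mm256_loadu_pd(&C[1 * ldc]);
    __m256d c2 = _mm256_loadu_pd(&C[2 * ldc]);
    __m256d c3 = _mm256_loadu_pd(&C[3 * ldc]);

    for (int p = 0; p < k; p++) {
        __m256d a = _mm256_loadu_pd(&A[4 * p]); /* column a_p of A */
        /* Broadcast each element of row b_p of B and accumulate one
         * column of the rank-1 update C(:,j) += a_p * b_p[j]. */
        c0 = _mm256_fmadd_pd(a, _mm256_broadcast_sd(&B[4 * p + 0]), c0);
        c1 = _mm256_fmadd_pd(a, _mm256_broadcast_sd(&B[4 * p + 1]), c1);
        c2 = _mm256_fmadd_pd(a, _mm256_broadcast_sd(&B[4 * p + 2]), c2);
        c3 = _mm256_fmadd_pd(a, _mm256_broadcast_sd(&B[4 * p + 3]), c3);
    }

    _mm256_storeu_pd(&C[0 * ldc], c0);
    _mm256_storeu_pd(&C[1 * ldc], c1);
    _mm256_storeu_pd(&C[2 * ldc], c2);
    _mm256_storeu_pd(&C[3 * ldc], c3);
}
```

Holding the block of C in registers across the k-loop is what makes the outer-product view efficient: each element of the packed A and B panels is touched once per rank-1 update, while the accumulators never leave the register file. The choice of blocking size and instruction mix is exactly what varies by architecture, which is what the paper's analytical generator determines.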

