Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

05/15/2023
by Braedy Kuzma, et al.

The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, for data (re)organization, and micro kernels that perform the actual computations. The creation of high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is complicated by the recent introduction of matrix engines, such as IBM's POWER10 MMA, Intel's AMX, and Arm's ME, that deliver high-performance matrix operations. This paper presents a compiler-only alternative to the use of high-performance libraries by incorporating, for the first time to the best of our knowledge, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic as a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The use of intrinsics also enables a comprehensive performance study. On processors without hardware matrix engines, tiling and packing alone deliver performance up to 22x faster than PLuTo, a widely used polyhedral optimizer, for small matrices (Intel), and more than 6x faster for large matrices (POWER9). The performance also approaches that of high-performance libraries, and is only 34% slower than Eigen for large matrices. With MMA on POWER10 this solution is, for large matrices, over 2.6x faster than the vector-extension solution, matches Eigen's performance, and achieves up to 96% of the BLAS peak performance.
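The layered structure the abstract describes (tiling and packing layers feeding a small micro kernel) can be sketched in plain C. The sketch below is illustrative only: the block sizes, buffer sizes, and function names are assumptions rather than the paper's values, the matrix dimensions are assumed to be exact multiples of the block sizes, and the scalar micro kernel merely stands in for the code that the compiler approach would instead generate through LLVM's matrix-multiply intrinsic.

```c
#include <stddef.h>

/* Illustrative blocking parameters (assumptions, not the paper's values);
 * a real implementation derives them from cache and register-file sizes. */
enum { MC = 64, KC = 64, MR = 4, NR = 4, N_MAX = 4096 };

/* Packing layer: copy an MC x KC block of row-major A into a contiguous
 * buffer of MR-tall column panels so the micro kernel reads unit-stride. */
static void pack_A(size_t kc, const float *A, size_t lda, float *buf)
{
    for (size_t i = 0; i < MC; i += MR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t r = 0; r < MR; ++r)
                *buf++ = A[(i + r) * lda + p];
}

/* Packing layer: copy a KC x n block of row-major B into NR-wide row panels. */
static void pack_B(size_t kc, size_t n, const float *B, size_t ldb, float *buf)
{
    for (size_t j = 0; j < n; j += NR)
        for (size_t p = 0; p < kc; ++p)
            for (size_t r = 0; r < NR; ++r)
                *buf++ = B[p * ldb + j + r];
}

/* Micro kernel: MR x NR outer-product updates over the packed panels.
 * This scalar loop is the layer that the compiler-only approach replaces
 * with the llvm.matrix.multiply intrinsic, to be lowered either to a
 * matrix engine (e.g., POWER10 MMA) or to plain vector code. */
static void micro_kernel(size_t kc, const float *Ap, const float *Bp,
                         float *C, size_t ldc)
{
    for (size_t p = 0; p < kc; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                C[i * ldc + j] += Ap[p * MR + i] * Bp[p * NR + j];
}

/* Tiling layer: walk cache-sized blocks, pack them, then sweep the
 * register-sized tiles. C must be zero-initialized by the caller, and
 * m, n, k are assumed to be multiples of the block sizes for brevity. */
void gemm(size_t m, size_t n, size_t k,
          const float *A, const float *B, float *C)
{
    static float Abuf[MC * KC], Bbuf[KC * N_MAX]; /* assumes n <= N_MAX */
    for (size_t kk = 0; kk < k; kk += KC) {
        pack_B(KC, n, &B[kk * n], n, Bbuf);
        for (size_t ii = 0; ii < m; ii += MC) {
            pack_A(KC, &A[ii * k + kk], k, Abuf);
            for (size_t j = 0; j < n; j += NR)
                for (size_t i = 0; i < MC; i += MR)
                    micro_kernel(KC, &Abuf[i * KC], &Bbuf[j * KC],
                                 &C[(ii + i) * n + j], n);
        }
    }
}
```

The point of the layering is that only the body of micro_kernel is architecture-specific: the tiling and packing code above it stays unchanged when the kernel is retargeted, which is exactly the separation the paper obtains by placing the matrix-multiply intrinsic at that interface.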
