Systolic Computing on GPUs for Productive Performance

10/29/2020
by Hongbo Rong et al.

We propose a language and compiler to productively build high-performance software systolic arrays that run on GPUs. Based on a rigorous mathematical foundation (uniform recurrence equations and space-time transforms), our language has a high abstraction level and covers a wide range of applications. A programmer specifies a projection of a dataflow computation onto a linear systolic array, while leaving the detailed implementation of the projection to a compiler; the compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs. In this way, both productivity and performance are achieved at the same time. This approach neatly combines loop transformations, data shuffling, and vector register allocation into a single framework. Many other optimizations can be applied as well; the compiler composes them to generate efficient code. We implemented the approach on Intel GPUs. This is the first system that allows productive construction of systolic arrays on GPUs. We allow multiple projections, arbitrary projection directions, and linear schedules, which can express most, if not all, systolic arrays in practice. Experiments with 1- and 2-D convolution on an Intel GEN9.5 GPU demonstrate the generality of the approach and its productivity in expressing various systolic designs to find the best candidate. Although our systolic arrays are purely software running on generic SIMD hardware, some of our best designs are up to 59% faster than the GPU's specialized hardware samplers performing the same convolutions. Overall, this approach holds promise for productive high-performance computing on GPUs.
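To make the projection idea concrete, below is a minimal, self-contained C++ sketch (an illustration under assumed sizes and names, not the paper's actual language or compiler) that simulates one classic space-time transform of a 1-D convolution: the 2-D iteration domain (i, k) is projected along the i direction onto K processing elements (PE index p = k, a weight-stationary design) under the linear schedule t = i + k.

    // Illustrative only: simulates a weight-stationary linear systolic array
    // for 1-D convolution derived from a space-time transform.
    //   URE:        Y(i, k) = Y(i, k-1) + w[k] * x[i + k],   y[i] = Y(i, K-1)
    //   allocation: PE index p = k   (projection along the i direction)
    //   schedule:   time     t = i + k   (a linear schedule)
    #include <cstdio>
    #include <vector>

    int main() {
        const int N = 8, K = 3;                      // output length and filter taps (hypothetical)
        std::vector<float> x(N + K - 1), w(K), y(N, 0.0f);
        for (int j = 0; j < (int)x.size(); ++j) x[j] = 1.0f + j;       // example input
        for (int k = 0; k < K; ++k)             w[k] = 0.5f * (k + 1); // example weights

        std::vector<float> acc(K, 0.0f);  // one accumulator per PE (the vector-register analog)

        for (int t = 0; t <= (N - 1) + (K - 1); ++t) {   // global clock of the linear schedule
            for (int p = K - 1; p >= 0; --p) {           // update PEs back to front so each
                int i = t - p;                           // reads its neighbor's previous value
                if (i < 0 || i >= N) continue;           // point (i, k=p) outside the domain
                float carried = (p == 0) ? 0.0f : acc[p - 1];  // partial sum from PE p-1
                acc[p] = carried + w[p] * x[i + p];            // the URE at point (i, k=p)
                if (p == K - 1) y[i] = acc[p];                 // last PE drains the result
            }
        }

        for (int i = 0; i < N; ++i) printf("y[%d] = %g\n", i, y[i]);
        return 0;
    }

Roughly speaking, the PE dimension p is what gets mapped to SIMD lanes and the per-PE accumulators to vector registers on the GPU; choosing a different projection direction or schedule vector yields the alternative systolic designs the abstract refers to.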

Related research

C-for-Metal: High Performance SIMD Programming on Intel GPUs (01/26/2021)
The SIMT execution model is commonly used for general GPU development. C...

Efficient Memory Partitioning in Software Defined Hardware (02/02/2022)
As programmers turn to software-defined hardware (SDH) to maintain a hig...

Towards High Performance, Portability, and Productivity: Lightweight Augmented Neural Networks for Performance Prediction (03/17/2020)
Writing high-performance code requires significant expertise in the prog...

SoAx: A generic C++ Structure of Arrays for handling Particles in HPC Codes (10/10/2017)
The numerical study of physical problems often require integrating the d...

Programmatic Control of a Compiler for Generating High-performance Spatial Hardware (11/21/2017)
This methodology paper addresses high-performance high-productivity prog...

From array algebra to energy efficiency on GPUs: Data and hardware shapes with dimension-lifting to optimize memory-processor layouts (06/19/2023)
We present a new formulation for parallel matrix multiplication (MM) to ...

Control Flow Duplication for Columnar Arrays in a Dynamic Compiler (02/20/2023)
Columnar databases are an established way to speed up online analytical ...
