SIMD^2: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

05/03/2022
by Yunan Zhang, et al.

Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ only in the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have the potential to be accelerated by a general-purpose matrix-operation architecture, instead of common MXUs. In this paper, we propose SIMD^2, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD^2 instructions accelerate eight more types of matrix operations, in addition to matrix multiplication. Since SIMD^2 instructions resemble a matrix-multiplication instruction, we are able to build the SIMD^2 architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD^2 using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD^2 provides up to 38.59× speedup and more than 10.63× on average over optimized CUDA programs, with only 5% of full-chip area overhead.
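To make the semiring-like structure concrete, here is a minimal NumPy sketch (not from the paper; the function name and example matrix are illustrative) of one generalized operation of the kind the abstract describes: a min-plus product that swaps GEMM's multiply-add for add-minimum. The loop nest, and therefore the tiling and data-reuse opportunities that MXUs exploit, is identical to ordinary matrix multiplication.

```python
import numpy as np

def minplus_matmul(A, B):
    """Matrix 'product' over the min-plus (tropical) semiring:
    C[i, j] = min_k (A[i, k] + B[k, j]).
    Same triple-loop structure as GEMM; only the core operation
    changes from multiply-add to add-minimum, which is why
    MXU-style hardware can be generalized to cover it.
    """
    n, _ = A.shape
    _, m = B.shape
    C = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            C[i, j] = np.min(A[i, :] + B[:, j])
    return C

# Repeatedly squaring a graph's weight matrix under min-plus
# performs relaxation steps of all-pairs shortest paths, one
# example of a semiring-structured workload beyond GEMM.
INF = np.inf
D = np.array([[0.0, 3.0, INF],
              [INF, 0.0, 1.0],
              [2.0, INF, 0.0]])
print(minplus_matmul(D, D))  # e.g., path 0 -> 1 -> 2 has cost 4
```

Because only the scalar combine/reduce pair differs, the same blocking, vectorization, and operand-reuse machinery that serves multiply-add can, in principle, serve any such operator pair.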


