High Throughput Matrix-Matrix Multiplication between Asymmetric Bit-Width Operands

08/03/2020
by Dibakar Gope, et al.

Matrix multiplications between asymmetric bit-width operands, especially between 8- and 4-bit operands, are likely to become a fundamental kernel of many important workloads, including neural networks and machine learning. While existing SIMD matrix multiplication instructions for symmetric bit-width operands can support operands of mixed precision by zero- or sign-extending the narrow operand to match the size of the other operands, they cannot exploit the benefit of the narrow bit-width of one of the operands. We propose a new SIMD matrix multiplication instruction that uses mixed precision on its inputs (8- and 4-bit operands) and accumulates product values into narrower 16-bit output accumulators, in turn allowing the SIMD operation at 128-bit vector width to process a greater number of data elements per instruction to improve processing throughput and memory bandwidth utilization without increasing the register read- and write-port bandwidth in CPUs. The proposed asymmetric-operand-size SIMD instruction offers a 2x improvement in matrix multiplication throughput over existing symmetric-operand-size instructions, while causing negligible (<0.05%) accuracy loss for representative machine learning workloads. The asymmetric-operand-size instruction can not only improve matrix multiplication throughput in CPUs, but can also support multiply-and-accumulate (MAC) operations between 8- and 4-bit operands in state-of-the-art DNN hardware accelerators (e.g., the systolic array microarchitecture in the Google TPU) and offer a similar improvement in matrix-multiply performance without violating their implementation constraints. We demonstrate how a systolic array architecture designed for symmetric-operand-size instructions could be modified to support an asymmetric-operand-size instruction.
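To make the accumulation scheme concrete, the following is a minimal scalar sketch of what one such 128-bit instruction might compute. The lane layout, element counts, and the 2-way dot product per accumulator are illustrative assumptions modeled on existing dot-product SIMD instructions; the paper defines the actual encoding.

```python
def asym_mac_128(acc16, a8, b4):
    """Hypothetical scalar emulation of one 128-bit asymmetric-operand-size
    SIMD instruction: eight signed 16-bit accumulator lanes, each adding a
    2-way dot product of signed 8-bit and signed 4-bit inputs.

    With 16-bit (rather than 32-bit) accumulators, a 128-bit register holds
    8 accumulator lanes instead of 4, which is how the narrower accumulators
    let one instruction cover more output elements.
    """
    assert len(acc16) == 8 and len(a8) == 16 and len(b4) == 16
    assert all(-128 <= x <= 127 for x in a8)   # int8 operand range
    assert all(-8 <= x <= 7 for x in b4)       # int4 operand range
    out = []
    for lane in range(8):
        s = acc16[lane]
        for k in range(2):
            s += a8[2 * lane + k] * b4[2 * lane + k]
        # wrap into a signed 16-bit accumulator, as narrow hardware would
        out.append((s + 2**15) % 2**16 - 2**15)
    return out

# Example: zeroed accumulators, all a-elements = 1, all b-elements = 2
# -> each lane accumulates 1*2 + 1*2 = 4
print(asym_mac_128([0] * 8, [1] * 16, [2] * 16))
```

A real implementation would also have to handle the 16-bit accumulators' limited range, which is part of why the paper reports only a negligible accuracy impact on representative workloads.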
