DGEMM on Integer Matrix Multiplication Unit

06/21/2023
by Hiroyuki Ootomo, et al.

Deep learning hardware achieves high throughput and low power consumption by reducing computing precision and specializing in matrix multiplication. For machine learning inference, fixed-point computation is commonplace: the input and output values and the model parameters are quantized. Thus, many processors are now equipped with fast integer matrix multiplication units (IMMUs). It is of significant interest to find a way to harness these IMMUs to improve the performance of HPC applications while maintaining accuracy. We focus on the Ozaki scheme, which computes a high-precision matrix multiplication using lower-precision computing units, and show the advantages and disadvantages of using IMMUs. Experiments using integer Tensor Cores show that we can compute double-precision matrix multiplication faster than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum circuit simulation by up to 4.33x while maintaining FP64 accuracy.
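To make the idea of building a high-precision product from low-precision integer products more concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `split_rows` and `int_sliced_matmul`, the choice of 8 slices of 7 bits each, and the truncation-based splitting are all hypothetical choices made for this example. Each row of A and each column of B is scaled by a shared power of two, cut into small signed integer digits that would fit in int8, and the digit matrices are multiplied with exact integer arithmetic before being recombined.

```python
import math

def split_rows(M, num_slices=8, slice_bits=7):
    """Split each row of M into signed integer digit slices (hypothetical helper).

    Every row shares one power-of-two scale 2**e chosen from its largest
    magnitude, so each entry becomes a fixed-point integer that is cut into
    num_slices digits of slice_bits bits each (magnitude below 2**slice_bits).
    """
    n_rows, n_cols = len(M), len(M[0])
    total_bits = num_slices * slice_bits
    mask = (1 << slice_bits) - 1
    slices = [[[0] * n_cols for _ in range(n_rows)] for _ in range(num_slices)]
    exps = []
    for i, row in enumerate(M):
        max_abs = max(abs(x) for x in row)
        e = math.frexp(max_abs)[1]  # max_abs < 2**e (e is 0 for an all-zero row)
        exps.append(e)
        for j, x in enumerate(row):
            # Fixed-point magnitude; bits below 2**(e - total_bits) are truncated.
            q = int(math.ldexp(abs(x), total_bits - e))
            sign = -1 if x < 0 else 1
            for s in range(num_slices):
                shift = (num_slices - 1 - s) * slice_bits
                slices[s][i][j] = sign * ((q >> shift) & mask)
    return slices, exps

def int_sliced_matmul(A, B, num_slices=8, slice_bits=7):
    """Approximate A @ B in float64 using only integer products of digit slices."""
    Bt = [list(col) for col in zip(*B)]  # columns of B as rows
    a_slices, a_exp = split_rows(A, num_slices, slice_bits)
    b_slices, b_exp = split_rows(Bt, num_slices, slice_bits)
    total_bits = num_slices * slice_bits
    C = []
    for i in range(len(A)):
        c_row = []
        for j in range(len(Bt)):
            acc = 0  # Python integers are arbitrary precision, so this sum is exact
            for s in range(num_slices):
                for t in range(num_slices):
                    # This small-integer dot product stands in for the work
                    # an integer matrix unit would do on one pair of slices.
                    dot = sum(a * b for a, b in zip(a_slices[s][i], b_slices[t][j]))
                    acc += dot << ((2 * num_slices - 2 - s - t) * slice_bits)
            # Undo the row and column scaling with a single power-of-two shift.
            c_row.append(math.ldexp(float(acc), a_exp[i] + b_exp[j] - 2 * total_bits))
        C.append(c_row)
    return C

A = [[1.5, -2.0], [0.25, 4.0]]
B = [[2.0, 0.5], [-1.0, 8.0]]
C = int_sliced_matmul(A, B)
```

In this sketch the only rounding comes from truncating each entry relative to the largest magnitude in its row or column and from the final conversion to a float; every slice product is exact. A hardware version would additionally have to keep each slice product within the accumulator width of the integer unit, which this example sidesteps by using Python's unbounded integers.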


research
03/07/2022

Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance

Tensor Core is a mixed-precision matrix-matrix multiplication unit on NV...
research
11/24/2018

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processor...
research
10/27/2020

Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?

Matrix engines or units, in different forms and affinities, are becoming...
research
06/22/2018

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable Computing

Matrix-matrix multiplication is a key computational kernel for numerous ...
research
04/10/2023

Mixed-Precision Random Projection for RandNLA on Tensor Cores

Random projection can reduce the dimension of data while capturing its s...
research
10/01/2019

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

Quantization has emerged to be an effective way to significantly boost t...
research
05/12/2017

CLBlast: A Tuned OpenCL BLAS Library

This work demonstrates how to accelerate dense linear algebra computatio...
