NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

10/01/2019
by Wenlei Bao, et al.

Quantization has emerged as an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite their lower numerical precision, quantized DNNs reduce both memory bandwidth and computation cycles with little loss of accuracy. Integer GEMM (General Matrix Multiplication) is critical to running quantized DNN models efficiently, as GEMM operations often dominate the computation in these models. Various approaches have been developed to improve the performance of integer GEMM, leveraging techniques such as vectorization and memory layout optimization. However, these existing approaches are not fast enough in certain scenarios. We developed NGEMM, a compiler-based GEMM implementation for accelerating lower-precision training and inference. NGEMM makes better use of the vector units by avoiding unnecessary vector computation that is introduced during tree reduction. Our experimental results show that NGEMM outperforms state-of-the-art BLAS libraries such as MKL by an average of 1.4x. We have applied NGEMM to a number of production services in Microsoft.
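To make the quantized-GEMM pipeline concrete, below is a minimal sketch, not NGEMM's implementation: quantize float operands to int8, multiply with 32-bit accumulation so the int8 products cannot overflow, then dequantize the result. The helper names (`quantize_int8`, `int8_gemm`) and the symmetric per-tensor quantization scheme are assumptions for illustration. The 32-bit accumulation step is where kernels differ in practice; a naive vectorized kernel widens and tree-reduces partial sums across vector lanes, the extra vector work the abstract says NGEMM avoids, while this sketch only shows the numeric pipeline.

```python
import numpy as np

def quantize_int8(x):
    # Hypothetical helper: symmetric per-tensor quantization,
    # mapping max |x| to 127 (not NGEMM's actual scheme).
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_gemm(a_q, b_q):
    # Widen int8 operands to int32 before multiplying so products
    # accumulate in 32-bit precision without overflow.
    return a_q.astype(np.int32) @ b_q.astype(np.int32)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)).astype(np.float32)
B = rng.standard_normal((128, 32)).astype(np.float32)

a_q, sa = quantize_int8(A)
b_q, sb = quantize_int8(B)
C_q = int8_gemm(a_q, b_q) * (sa * sb)   # dequantize the int32 result

# Quantized result stays close to the float reference.
print(np.max(np.abs(C_q - A @ B)))
```

The combined scale factor (sa * sb) folds both operands' quantization back into the float domain in a single multiply, which is why the integer kernel never needs to see the scales.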
