I. Introduction
As the number of parameters in DNNs increases to improve model accuracy on various tasks, reducing inference latency is becoming more challenging. Reducing response time becomes highly critical when real-time services are demanded (e.g., autonomous driving, automatic speech recognition, and neural machine translation). Note that most of the response time is usually consumed by general matrix-to-matrix multiplication (GEMM) or general matrix-to-vector multiplication (GEMV) of high-order time complexity (see Fig.
1). Efficient computation of matrix multiplication, therefore, directly corresponds to response time reduction. Previously, in order to accelerate GEMM operations, both hardware- and software-based approaches have been introduced [li2019edge, reuther2019survey, choudhary2020comprehensive, cheng2017survey, zhou2019edge, he2018survey]. As an effort to reduce latency, few-batch multiplications^1 are strongly preferred for DNN inference at the cost of reduced weight reuse. (^1 In this paper, we refer to either GEMV or GEMM as few-batch multiplication for convenience.) Note that if GEMV is conducted to support single-batch inference, weight matrix data is accessed only once. Such a streaming-like operation is highly memory-bound on modern computing systems based on the von Neumann architecture, where main memory (DRAM) is separated from the processing unit [von1993first]. Moreover, as the weight matrix becomes larger, the portion of memory-bound operations grows as well. Executing workloads with little data reuse on such systems, therefore, prevents efficient utilization of computing resources because of the memory wall (also known as the von Neumann bottleneck) [hennessy2011computer]. To alleviate the memory-access bottleneck from a hardware perspective, in-memory computing (in which computational operations are performed within the memory unit) has been widely studied [eleftheriou2018memory]. In other words, for DNNs, combating the memory bottleneck is pressing enough to call for a new hardware architecture design paradigm.
As a practical solution at the algorithm level, model compression is an effective technique to achieve lower end-to-end latency. Model compression reduces not only off-chip memory accesses on mobile devices (and hence power consumption) but also main-memory bandwidth requirements, by shrinking the memory footprint with negligible accuracy drop [cheng2017survey, choudhary2020comprehensive]. Thus, model compression is being widely studied to accelerate inference computations. Popular model compression techniques include pruning [han2015deep, liu2018rethinking, lee2019network], low-rank approximation [sainath2013low, li2016recovery], and quantization [xu2018alternating, bhandare2019efficient, zhang2018lq].
In this work, we consider quantization because of its simple structure and high compression ratio [xu2018alternating, bhandare2019efficient, zhang2018lq]. The rationale behind quantization for DNNs is that, because DNNs include a lot of redundancy, the number of bits used to represent each parameter can be reduced without noticeable loss of model accuracy. Note that weights and activations need to be quantized at different times. Weights are fixed during inference, and hence, weight quantization is performed in advance, before inference. On the other hand, activation quantization must be conducted on-the-fly, with additional computations (for quantization) during inference. If the quantization algorithm is complicated, the cost of such dynamic quantization may outweigh the gains from quantization. In addition, activation compression may result in serious accuracy degradation if training is not aware of the quantized structure of activations. In this manuscript, thus, we study weight-only quantization, which is enough to accelerate matrix multiplication as we demonstrate later.
In this paper, we propose a novel matrix multiplication algorithm dedicated to quantized DNNs that can be performed on modern computer (von Neumann) architectures. Even though quantization obviously reduces storage requirements on off-chip memories, achieving higher performance with quantized DNNs on CPUs or GPUs is challenging. First, because data transfer on commercial processors is performed with a fixed width (such as 32 bits) while weights can be quantized with an arbitrary number of bits, accessing multiple quantized weights may waste bandwidth. Second, decoding quantized weights may induce additional instructions. Our proposed method, called BiQGEMM^2, addresses such concerns using lookup tables that accept quantized weights as indices. (^2 Non-GEneral Matrix Multiplication for Binary-coding-based Quantized neural networks.) BiQGEMM is built on the observation that quantization leads to a lot of redundant computations. The key idea is that for any real-number sub-vector (of activations) of length $\mu$, the number of possible outcomes of its dot product with a binary vector (of quantized weights) is limited to $2^{\mu}$, all of which can be precomputed and stored in lookup tables that can be reused. We show that by replacing a majority of arithmetic operations with table lookups, BiQGEMM calculates matrix multiplications with high performance and improved bandwidth utilization.
II. Background
II-A. Quantization
DNNs intentionally involve a lot of redundancy to expedite the search for a good local minimum. Thus, the model size of DNNs can often be significantly reduced by various compression algorithms. Quantization is gaining increasing popularity as an effective model compression technique. There exist various quantization formats: dequantized weights can be represented either by fixed-point numbers (based on uniform quantization) or by floating-point numbers (based on codebook lookups or binary-coding quantization).
Note that codebook-based quantization presents high compression ratios for various models with negligible model accuracy degradation because expected values after quantization are still maintained as floating-point numbers [facebook_lut_quant]. Even though codebook-based quantization is highly efficient at reducing the off-chip memory footprint, computational complexity is not reduced at all after dequantization. Fixed-point quantization, on the other hand, reduces both the storage requirement and the computational complexity. Since INT8 quantization, combined with additional techniques, has been shown to maintain the model accuracy of some well-known CNN models, INT8 has been adopted by various commercial tools [bhandare2019efficient]. Note that operations other than GEMV or GEMM need to be redesigned to function in INT8, and INT8-aware retraining may be necessary to avoid serious model accuracy degradation. For example, layer normalization and softmax operations in the attention blocks of Transformers demand floating-point computations [bhandare2019efficient]. Accordingly, frequent conversions between fixed-point and floating-point formats would incur 15%-30% computational overhead [bhandare2019efficient].
As an effort to significantly reduce both computations and footprint, binary-coding-based quantization has been proposed [xu2018alternating, rastegari2016xnor, zhang2018lq]. Since expected values after binary-coding quantization remain floating-point numbers, accuracy degradation is negligible even when only about 3 bits are used for quantization [xu2018alternating, zhang2018lq]. Despite its potential to greatly simplify computations, binary-coding-based quantization has not been useful in practice because it requires the computing system to allow bit-level memory access. In this manuscript, hence, we consider binary-coding-based quantization as a baseline to enable practical usage on commercialized computing systems.
Note that in the case of INT8, activations must also be quantized in order to allow fixed-point GEMV or GEMM, while activation quantization is optional for floating-point-based quantization. Activation quantization inherently demands a dynamic quantization process during inference. Even though inference operations can be made a lot more efficient by previously proposed methods such as the Method of Four Russians [aho1974design] or popcount with XNOR logic [rastegari2016xnor], activation quantization requires 1) modifications to the training algorithm to severely restrict the range of activation values and 2) computational overhead for format conversions [rastegari2016xnor, xu2018alternating]. In this paper, we show that BiQGEMM presents high performance even when activations are maintained as floating-point numbers.
II-B. Binary-coding-based Quantization
When a real-number vector $w \in \mathbb{R}^{n}$ is quantized into $q$ bits by a binary-coding-based quantization method, $w$ is mapped into $q$ scaling factors $\alpha_i \in \mathbb{R}$ and $q$ binary vectors $b_i \in \{-1,+1\}^{n}$ ($1 \le i \le q$). Then, $w$ is approximated as $\sum_{i=1}^{q} \alpha_i b_i$, where scaling factors are shared by multiple elements in $w$. Scaling factors and binary vectors are obtained as follows:

$$\underset{\alpha_i,\, b_i}{\arg\min} \left\| w - \sum_{i=1}^{q} \alpha_i b_i \right\|^2 \qquad (1)$$
such that the quantization error is minimized. Since there is no analytical solution minimizing this quantization error, numerous heuristic approaches have been proposed [guo2017network, xu2018alternating, zhang2018lq, courbariaux2016binarized, amc]. The same principle of binary-coding-based vector quantization can be applied to matrices, where quantization can be independently performed for each row or column. For a weight matrix $W \in \mathbb{R}^{m \times n}$ quantized into $q$ binary matrices $B_i \in \{-1,+1\}^{m \times n}$ with scaling-factor vectors $\alpha_i \in \mathbb{R}^{m}$, multiplication with a real-number vector $x \in \mathbb{R}^{n}$ produces an output vector $y$ as follows:
$$y = \sum_{i=1}^{q} \alpha_i \circ (B_i \cdot x) \qquad (2)$$
where $\circ$ denotes element-wise multiplication (i.e., Hadamard product) and $q$ is the number of quantization bits for weights. Fig. 2 illustrates how multi-bit quantized weight matrices are multiplied by a real-number vector. Note that, for convenience, the binary weight matrices $B_i$ can be concatenated in the vertical direction and multiplied by the vector $x$. Then, element-wise multiplication by the scaling factors $\alpha_i$ produces intermediate partial outputs $y_i$. Finally, the sum of the vectors $y_i$ yields the final output $y$.
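The scheme above can be sketched in a few lines of illustrative Python. The greedy row-wise quantizer below is one of the heuristic approaches mentioned earlier (sign of the residual as $B_i$, mean absolute residual as $\alpha_i$); the function name and shapes are ours, not taken from the paper:

```python
import numpy as np

def greedy_quantize_rows(W, q):
    # Greedy binary-coding quantization (illustrative): approximate W as
    # sum_i diag(alpha_i) * B_i, fitting one (alpha_i, B_i) pair at a time
    # to the residual. alpha_i holds one scaling factor per row of W.
    residual = W.astype(np.float64).copy()
    alphas, Bs = [], []
    for _ in range(q):
        B = np.where(residual >= 0, 1.0, -1.0)        # sign of the residual
        alpha = np.mean(np.abs(residual), axis=1)     # per-row scale
        alphas.append(alpha)
        Bs.append(B)
        residual -= alpha[:, None] * B                # quantize what is left
    return alphas, Bs

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 8))
x = rng.standard_normal(8)
alphas, Bs = greedy_quantize_rows(W, q=3)

# Eq. (2): y = sum_i alpha_i o (B_i . x)
y = sum(a * (B @ x) for a, B in zip(alphas, Bs))

# Dense reference using the dequantized weight matrix.
W_hat = sum(a[:, None] * B for a, B in zip(alphas, Bs))
```

The multiplication by the quantized representation agrees exactly with a dense multiplication by the dequantized matrix, which is the structure Eq. 2 exploits.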
Suppose that the real-number activation vector $x$ is also quantized, by using $q_a$ bits, into binary vectors $a_j \in \{-1,+1\}^{n}$ with scaling factors $\beta_j$ ($1 \le j \le q_a$). The previous output $y$ can now be computed as follows:
$$y = \sum_{i=1}^{q} \sum_{j=1}^{q_a} \beta_j \left( \alpha_i \circ (B_i \cdot a_j) \right) \qquad (3)$$
Eq. 3 suggests that activation quantization would increase the number of computations compared to Eq. 2, even though most of those computations are simple bit-wise logic operations. It should be noted that, without sophisticated hardware support for the bit-wise logic incurred by binary-coding quantization, activation quantization may degrade matrix multiplication performance.
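As a quick consistency check of Eq. 3 (an illustrative sketch; the variable names are ours), the doubly-indexed sum over weight and activation bit-planes reproduces the product of the dequantized weight matrix and the dequantized activation vector, at the cost of $q \cdot q_a$ binary partial products instead of the $q$ of Eq. 2:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, q, qa = 4, 8, 2, 3
Bw = rng.choice([-1.0, 1.0], size=(q, m, n))   # weight bit-planes B_i
alphas = rng.random((q, m))                    # weight scales (per row)
Ba = rng.choice([-1.0, 1.0], size=(qa, n))     # activation bit-planes a_j
betas = rng.random(qa)                         # activation scales

# Eq. (3): q * qa binary partial products.
y = sum(alphas[i] * betas[j] * (Bw[i] @ Ba[j])
        for i in range(q) for j in range(qa))

# Dequantized references: W = sum_i diag(alpha_i) B_i, x = sum_j beta_j a_j.
W = sum(alphas[i][:, None] * Bw[i] for i in range(q))
x = sum(betas[j] * Ba[j] for j in range(qa))
```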
II-C. Natural Language Processing
In order to set a range of parameters (such as matrix sizes) to be used in our experiments and to assess the impact of the proposed algorithm, we investigate natural language processing (NLP) as a representative application of BiQGEMM.
RNNs [rumelhart1986learning] and Transformers [vaswani2017attention]
are being widely accepted as timeseries data analysis tools to process natural language tasks. Long shortterm memory (LSTM)
[hochreiter1997long], compared with conventional RNNs, introduces additional gates in a unit to overcome the long-term dependency and vanishing-gradient problems of vanilla RNNs. As such, most recently proposed RNN-based networks employ LSTM as a basic unit to improve model accuracy on multiple benchmark language models. Transformers presented another noticeable advance in NLP. By breaking the recurrent structure and fully exploiting the attention mechanism [bahdanau2014neural], Transformers better capture the relevance between words in a sentence. Correspondingly, Transformers are primarily used in a wide range of NLP tasks including neural machine translation (NMT) and have been extended to various applications, including BERT [devlin2018bert], with impressive results on GLUE [wang2018glue] and SQuAD [rajpurkar2016squad]. The structure of Transformers can be divided into encoder layers and decoder layers. An encoder layer includes one attention block structured as four ($d \times d$) weight matrices and a feed-forward block with ($d \times 4d$) and ($4d \times d$) matrices, where $d$ is the hidden size. A decoder layer presents two attention blocks and a feed-forward block, while the structure of each block is the same as in the encoder. The number of encoder layers is chosen to be 6 (6) and $d$ is selected to be 512 (1024) for the base (big) model. Weight matrices are fed into matrix multiplication operations, and weight matrix sizes are rapidly increasing to support various complicated tasks with higher model accuracy goals. For example, for NMT, most models that show excellent performance are based on the big version of the Transformer [ahmed2017weighted, shaw2018self, ott2018scaling, edunov2018understanding]. T5, another variant of the Transformer, increases the number of weights to 11 billion and the number of layers to 24 [Raffel2019ExploringTL].
BERT [devlin2018bert] is a pre-training-based model for applications that require only the encoder part of Transformers. BERT models are known to continuously set new records on model accuracy with a large number of encoder layers and a large hidden size (such as 24 and 1024, respectively). Associated with new training algorithms based on the large model of BERT, various advanced models, such as XLNet [yang2019xlnet], RoBERTa [liu2019roberta], and ERNIE [zhang2019ernie, sun2019ernie], are being developed. Ever-increasing requests for higher accuracy demand ever-larger weight matrices. For instance, the biggest weight matrix in the xxlarge model of ALBERT [lan2019albert] is (4096 $\times$ 16384), which requires 256 MB (with FP32) of memory footprint. Such large weight matrices cannot avoid frequent DRAM accesses even if the same parameters are repeatedly reused over the whole network.
As for automatic speech recognition (ASR), similarly, the number of parameters is also increasing to accomplish higher model accuracy [karita2019comparative, karita2019improving, irie2019choice, luscher2019rwth, park2019specaugment, han2019state]. To illustrate, LAS is an end-to-end ASR model based on bidirectional LSTMs, using six encoder layers and two decoder layers whose weight matrix dimensions reach a few thousand [park2019specaugment].
In sum, fast matrix multiplication with matrix dimensions of (at least) a few thousand is essential to realize DNNs for NLP tasks. Such high-performance matrix multiplication needs to assume compressed DNNs because the number of parameters keeps increasing.
II-D. Quantizing Transformers
Now we estimate the number of quantization bits using Transformers that are being widely applied to various NLP tasks. Table
I lists quantization results of the (base model) Transformer (designed to perform English-to-German translation) using uniform quantization [bhandare2019efficient, prato2019fully] and binary-coding-based quantization with greedy approximation [guo2017network]. For the uniform quantization results, we refer to the numbers from [bhandare2019efficient, prato2019fully]. For binary-coding-based quantization with the greedy approximation method (to reduce quantization error), we retrain the model using the quantization-aware training algorithm introduced in [deeptwist] on the WMT13 data set. When retraining the model, all hyperparameters are the same as in the Transformer [vaswani2017attention], except for a 2x larger initial learning rate and the additional hyperparameter of distortion step (introduced in [deeptwist]), which is set to 2000. The baseline results for each quantization case are inherently different due to different initialization conditions and test sets; the number of quantization bits and the model accuracy (given as BLEU scores) are described in Table I. Note that uniform quantization requires activations to be quantized (on-the-fly) as well to enable fixed-point matrix multiplications, while such activation quantization is optional for binary-coding-based quantization. Besides, uniform quantization demands frequent conversions between floating-point and fixed-point formats (with conversion overhead) to maintain high-precision activation functions, while such conversions are not necessary for binary-coding-based quantization. Table II shows the memory usage when weights and activations are quantized into different numbers of bits while the matrix size is fixed to 512-by-512 (the size of an attention layer of the base Transformer). The number of subwords in the test data set is 18 on average, and thus, the batch size is 18.
Note that because of the relatively small dimension of activations, activation quantization does not reduce the memory footprint as much as weight quantization, while more bits need to be assigned to weight quantization for a given target model accuracy, as shown in Table I. This observation is consistent across other matrix sizes. Combining Table I and Table II, for BiQGEMM design considerations, we quantize only weights, and we are mainly interested in a few bits for quantization (e.g., 1 to 3 quantization bits).
Models                        Data format (bits)    English-to-German
                              (weight / activation) BLEU
Ref [bhandare2019efficient]
  Baseline                    32 / 32               27.68
  Uniform                      8 / 8                27.30 (-0.22)
Ref [prato2019fully]
  Baseline                    32 / 32               26.46
  Uniform                      8 / 8                26.38 (-0.80)
                               6 / 6                26.98 (+0.52)
                               4 / 4                18.32 (-8.14)
Ref [deeptwist]
  Baseline                    32 / 32               25.8
  Binary-Coding (Greedy)       4 / 32               25.5 (-0.3)
                               3 / 32               25.3 (-0.5)
                               2 / 32               23.9 (-1.9)
                               1 / 32                0.4 (-25.4)
Data format (bits)       Memory (MB)
W    A    O              W      I      O      Total
32   32   32             1.049  0.037  0.037  1.122
 8    8   32             0.262  0.009  0.037  0.308
 6    6   32             0.197  0.007  0.037  0.240
 4    4   32             0.131  0.005  0.037  0.173
 4   32   32             0.131  0.037  0.037  0.205
 3   32   32             0.098  0.037  0.037  0.172
 2   32   32             0.066  0.037  0.037  0.139

W: weights, A: activations, O: outputs, I: inputs.
III. Methodology
III-A. Motivation and Definitions
Definition 1.
LUT-unit $\mu$ is the length of a sub-vector to be used as an input argument of a table lookup function.
Definition 2.
Given an $n$-by-$b$ matrix denoted by $X$, $\widetilde{X}$ is a $\mu$-by-$(n \cdot b / \mu)$ matrix reshaped from $X$ while maintaining column-wise traversal.
Definition 3.
Given an $m$-by-$n$ matrix denoted by $A$, $A^{(i:j)}$ is a sub-matrix of $A$ formed by the $i$-th to $j$-th columns, when $1 \le i \le j \le n$.
Definition 4.
Given a column vector $v$ of length $n$, $v^{(i:j)}$ is a sub-vector of $v$ comprised of the $i$-th to $j$-th rows, when $1 \le i \le j \le n$.
Definition 5.
$Z \in \{-1,+1\}^{2^{\mu} \times \mu}$ denotes a matrix constructed by concatenating all possible (non-redundant) binary row vectors of length $\mu$.
We assume that a binary weight matrix $B \in \{-1,+1\}^{m \times n}$ and an input matrix $X \in \mathbb{R}^{n \times b}$ are given, where $m$, $n$, and $b$ are the output size, the input size, and the batch size, respectively. Fig. 3 shows an example of a quantized (binary) weight matrix and an input matrix. In Fig. 3, each matrix is equally divided into three parts along the LUT-unit $\mu = 4$. Considering a shaded part in Fig. 3, a row vector (having 4 binary digits) in $B$ is one of $2^{4}$ possible combinations. Correspondingly, each row of the product of such a sub-matrix of $B$ and the corresponding sub-matrix of $X$ is also limited to be one of $2^{4}$ possible vectors. As an attempt to exploit this strictly limited space of available outputs, the product of $Z$ and the reshaped input matrix $\widetilde{X}$ is precomputed and stored in lookup tables. Then, precomputed values are retrieved from the lookup tables using $\mu$-bit binary digits in the weight matrix as a key (or an index). Note that when $2^{\mu} \ll m$, computing efficiency is enhanced because most arithmetic operations can be replaced by retrieval operations if the output size is large enough.
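A minimal sketch of this observation (in Python, with hypothetical names): all $2^{\mu}$ dot products between a LUT-unit input slice and the rows of $Z$ are precomputed once, and any length-$\mu$ binary weight slice then indexes the table with its bit-packed key:

```python
import numpy as np

mu = 4
# Z: all 2^mu binary (+/-1) row vectors of length mu; row k encodes the
# mu-bit binary expansion of k (1 -> +1, 0 -> -1), so k doubles as the key.
codes = np.arange(2 ** mu)
Z = np.where((codes[:, None] >> np.arange(mu - 1, -1, -1)) & 1, 1.0, -1.0)

x_sub = np.array([0.5, -1.0, 2.0, 0.25])   # one LUT-unit slice of an input
lut = Z @ x_sub                            # all 2^mu partial dot products

# Any +/-1 weight slice's dot product with x_sub is already in the table:
w_sub = np.array([1.0, -1.0, -1.0, 1.0])
key = int("".join("1" if v > 0 else "0" for v in w_sub), 2)   # 0b1001 = 9
```

Looking up `lut[key]` returns the same value as computing `w_sub @ x_sub` directly, with no multiply-add at query time.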
III-B. Algorithm Description
When the LUT-unit $\mu$ is given as a parameter, each column in the product of $Z$ and $\widetilde{X}$ becomes the entries of a separate lookup table. Fig. 4 shows an exemplary process of building lookup tables, where we define a sub-vector $x_{t,s}$ of length $\mu$ and a lookup table $L_{t,s}$ corresponding to $x_{t,s}$ as follows:

$$x_{t,s} = x_t^{((s-1)\mu + 1 \,:\, s\mu)} \qquad (4)$$

$$L_{t,s} = Z \cdot x_{t,s} \qquad (5)$$

when $1 \le t \le b$ and $1 \le s \le n/\mu$, where $b$, $n$, and $\mu$ are the batch size, the input size, and the LUT-unit, respectively, and $x_t$ is the $t$-th column of $X$. Then, the product of a sub-matrix $B^{((s-1)\mu+1 : s\mu)}$ and a sub-vector $x_{t,s}$ can be found in the lookup table $L_{t,s}$, instead of performing GEMV. In other words, partial products of GEMM are replaced with table lookups in BiQGEMM.
As shown in Fig. 4(b), building a lookup table can be optimized by using a dynamic programming technique. Specifically, while constructing a lookup table $L_{t,s}$, dynamic programming reduces redundant arithmetic operations (described as the right-sided equations in Fig. 4(b)) compared to the case when GEMV using $Z$ and $x_{t,s}$ is performed (described as the left-sided equations in Fig. 4(b)). Algorithm 1 presents the pseudo code of building a lookup table with dynamic programming. In Fig. 4(b), each equation is annotated with the corresponding line numbers in Algorithm 1. Note that every sub-vector per input induces a distinct lookup table of $2^{\mu}$ entries, and hence, the time complexity of the proposed dynamic programming scheme is calculated as follows:

$$T_{\mathrm{build}} = O\!\left(2^{\mu} \cdot \frac{n}{\mu} \cdot b\right) \qquad (6)$$
$T_{\mathrm{build}}$ obtained by our proposed technique is $\mu$ times less than $O(2^{\mu} \cdot n \cdot b)$, the time complexity of the GEMM-based lookup table construction method (see Fig. 4(a)). Suppose our dynamic programming scheme is combined with multi-threading; then each thread is responsible for constructing one or more lookup tables (i.e., one lookup table is not built cooperatively by two or more threads). Another level of parallelism is achieved by vectorizing independent equations in Fig. 4(b) to utilize SIMD instructions. Note that because of the dependencies among equations in dynamic programming, however, conventional GEMV or GEMM schemes might be a better choice to fill up lookup table entries if a computing system (e.g., a GPU) embeds a sizeable number of simple computation units. In other words, the appropriate scheme to implement lookup tables depends on the characteristics of the processor running BiQGEMM.
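The dynamic-programming construction can be sketched as follows (this is our illustrative reading of the idea behind Algorithm 1, not a transcription; the indexing convention is ours): entry 0 is the all-minus-one dot product, and every other entry is derived from an already-computed entry with a single addition, since flipping one sign from $-1$ to $+1$ adds $2x_j$:

```python
def build_lut_dp(x_sub):
    # Build the 2^mu-entry table with one addition per new entry.
    # Entry k holds z_k . x_sub, where z_k is the +/-1 vector whose sign
    # pattern is the binary expansion of k (MSB first, 1 -> +1, 0 -> -1).
    mu = len(x_sub)
    lut = [0.0] * (1 << mu)
    lut[0] = -sum(x_sub)                 # all bits 0, i.e., all signs -1
    for j in range(mu):                  # bit j controls x_sub[mu - 1 - j]
        step = 2.0 * x_sub[mu - 1 - j]   # flipping -1 -> +1 adds 2 * x
        for k in range(1 << j):
            lut[k | (1 << j)] = lut[k] + step
    return lut
```

Beyond the initial sum, only $2^{\mu} - 1$ additions are needed, versus roughly $\mu \cdot 2^{\mu}$ multiply-adds for the GEMM-based construction of Fig. 4(a).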
Fig. 5 shows an illustrated process of querying partial results from lookup tables. Consecutive LUT-unit ($\mu$) binary digits in a quantized weight matrix are grouped into a sub-vector that can be converted into an integer in $[0, 2^{\mu})$ (e.g., $(-1,+1,+1,-1)$ is converted into the integer 6 when $\mu$ is 4). In other words, the key matrix in Fig. 5 is a bit-packed version of $B$, and each element in the key matrix serves as an index for table lookups^3. (^3 Note that the key matrix instead of $B$ can be loaded in advance into the system, since the weight matrices are fixed during inference.) Partial results retrieved from lookup table entries are accumulated over the $n/\mu$ sub-vectors per input vector, and then BiQGEMM is completed.
All keys in the key matrix are used $b$ (= the number of input vectors) times. For example, the leftmost lookup tables corresponding to every input (i.e., $L_{1,1}, \ldots, L_{b,1}$ in Fig. 5) are commonly accessed by the first column of the key matrix. In other words, different lookup tables are accessed through a shared key. By contiguously placing lookup table entries associated with the same key, as shown in Fig. 6, SIMD operations on CPU are encouraged, and bank conflicts on GPU can be mitigated. As for GPU, scratchpad memory or software-controlled caches (i.e., shared memory in CUDA) can store lookup tables. Then, even if the memory accesses are irregular, multiple threads can fetch multiple data in parallel unless multiple addresses within the same memory bank are accessed. Thus, a penalty of irregular accesses to a lookup table on GPU is not as critical as that on CPU.
A tiling approach for BiQGEMM is illustrated in Fig. 7 and described in Algorithm 2. To build lookup tables (LUTs) without redundancy, BiQGEMM adopts an LUT-stationary tiling scheme. The two input arguments of BiQGEMM are given as a key matrix and an input tensor reshaped from an input matrix (where the reshaped form is determined by the user-defined parameter $\mu$). For tiling, tile width and height need to be specified. Each tile of LUTs is operated by one thread (which is assigned to one SM in the case of GPU), and hence, multiple accesses to lookup tables in a block can be SIMDified (by multiple CUDA cores in the case of GPU). Lookup tables are not constructed in advance (i.e., not taken from DRAM); instead, they are implemented on-the-fly by Algorithm 1 with sub-vectors as inputs (Line 3 in Algorithm 2). After implementing the lookup tables of a tile, the corresponding pairs of tiles of the key matrix are loaded successively and individually, and computed by using the lookup tables (Lines 4-6 in Algorithm 2). With LUT-stationary tiling, partial output results obtained by multiple threads are processed through either sum reduction or atomic additions to obtain the final output values. Since each of the $m \cdot (n/\mu)$ keys is utilized $b$ times, the worst-case time complexity required for retrieval is given as follows:

$$T_{\mathrm{query}} = O\!\left(m \cdot \frac{n}{\mu} \cdot b\right) \qquad (7)$$
To process multi-bit quantization of weight matrices, BiQGEMM assumes that multiple binary weight matrices are concatenated, as described in Fig. 2. Note that such concatenation does not increase the number of lookup tables; thus for BiQGEMM, only the amount of lookup table retrieving operations increases as the number of quantization bits increases. In short, for multi-bit quantized weight matrices, $T_{\mathrm{query}}$ becomes $O(q \cdot m \cdot \frac{n}{\mu} \cdot b)$, where $q$ is the number of quantization bits.
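Putting the build and query phases together, a tiling-free reference version can be sketched as follows (illustrative Python with assumed names; the actual implementation is in C++/CUDA with the tiling of Algorithm 2 and multi-threading):

```python
import numpy as np

def biqgemm_ref(B, X, mu=4):
    # Tiling-free BiQGEMM reference for single-bit weights.
    # B: (m, n) matrix over {-1, +1}; X: (n, b) real inputs.
    m, n = B.shape
    b = X.shape[1]
    assert n % mu == 0
    n_sub = n // mu
    # Offline: bit-pack every length-mu weight slice into an integer key.
    bits01 = (B > 0).astype(np.int64).reshape(m, n_sub, mu)
    keys = bits01 @ (1 << np.arange(mu - 1, -1, -1))        # (m, n_sub)
    # Build: one 2^mu-entry table per (slice, batch column), L = Z . x_sub.
    codes = np.arange(1 << mu)
    Z = np.where((codes[:, None] >> np.arange(mu - 1, -1, -1)) & 1, 1.0, -1.0)
    X_sub = X.reshape(n_sub, mu, b)
    luts = np.einsum('km,smb->skb', Z, X_sub)               # (n_sub, 2^mu, b)
    # Query: accumulate one table lookup per slice per output row.
    Y = np.zeros((m, b))
    for s in range(n_sub):
        Y += luts[s][keys[:, s]]                            # fancy-index lookup
    return Y
```

On random data this reproduces `B @ X` exactly; every length-$\mu$ partial dot product has been replaced by one table lookup.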
III-C. Complexity Analysis
Given an input matrix $X \in \mathbb{R}^{n \times b}$ and a quantized binary weight matrix $B \in \{-1,+1\}^{m \times n}$, a matrix multiplication performed by GEMM has $O(m \cdot n \cdot b)$ time complexity, where $m$, $n$, and $b$ are the output size, the input size, and the batch size, respectively (Fig. 1 shows an example when $b$ is 1). Time complexity analysis of a matrix multiplication performed by BiQGEMM, on the other hand, is divided into the following two parts: i) constructing lookup tables (Eq. 6) and ii) retrieving lookup table entries (Eq. 7). Correspondingly, the time complexity of BiQGEMM is presented as follows:

$$T_{\mathrm{BiQGEMM}} = T_{\mathrm{build}} + T_{\mathrm{query}} \qquad (8)$$

$$= O\!\left(\frac{n}{\mu} \cdot \left(2^{\mu} + m\right) \cdot b\right) \qquad (9)$$

If $m \gg 2^{\mu}$ is satisfied, then $T_{\mathrm{BiQGEMM}}$ can be approximated as

$$T_{\mathrm{BiQGEMM}} \approx O\!\left(\frac{m \cdot n \cdot b}{\mu}\right) \qquad (10)$$

Note that as long as $m \gg 2^{\mu}$, Eq. 10 holds regardless of the choice of algorithm to build lookup tables (i.e., irrespective of a selection between dynamic programming and GEMM-based construction). Then, by using BiQGEMM instead of conventional GEMM, the time complexity of a matrix multiplication is reduced by a factor of $\mu$. For multi-bit quantized weights, the time complexity of both BiQGEMM and GEMM increases linearly with the number of quantization bits.
Since the underlying principles of BiQGEMM are fundamentally different from those of GEMM, rethinking hardware designs is necessary. First, while the performance of FMA units is directly related to GEMM performance, the usage of FMAs in BiQGEMM is limited to constructing lookup tables. Second, while cache designs let GEMM exploit spatial locality in SRAM when loading a matrix through successive data accesses, BiQGEMM cannot efficiently exploit such locality because accesses to lookup table entries are non-sequential in general (note that, nonetheless, such degraded locality is not fatal if BiQGEMM is associated with multi-batch inference on CPU or with scratchpad memory on GPU; see Fig. 6). In addition, because BiQGEMM needs to place lookup tables (which are usually larger than an input matrix) in SRAM, the available range of tile sizes is highly constrained compared to GEMM. Now, let us explain why BiQGEMM still operates efficiently on CPUs and GPUs with quantized weights despite such issues (i.e., low utilization of FMAs and low data access locality). Note that with quantized weights, GEMM needs to decompress bit-packed quantized weight data by 1) performing two-step operations to extract quantized weights bit-wise from a much wider data container (such as INT32) and 2) conducting two-step arithmetic operations to convert data from the form $\{0, 1\}$ into the form $\{-1, +1\}$ (see Algorithm 3). On the other hand, BiQGEMM directly accesses and utilizes the bit-packed weight data as keys (or indices) of lookup tables without such additional decompression steps. It should be noted that for quantized weights, the overhead of decompression can outweigh the gain from the reduced memory footprint, as we demonstrate in the next section.
Consideration of existing hardware architecture designs is one of the keys to understanding the impact of BiQGEMM on system performance. For example, even though a large tile size for BiQGEMM would result in improved data reuse, current CPU or GPU designs allow only a limited tile size for BiQGEMM, such that a large batch size (i.e., a compute-bound workload) might be less favorable to BiQGEMM. A new BiQGEMM-aware hardware design considering the unique computational properties of BiQGEMM, thus, would be an ultimate solution. For example, if the scratchpad memory on GPU could be shared by all processing units, the allowed range of tile sizes for BiQGEMM would be much wider. A new hardware design dedicated to BiQGEMM is, therefore, suggested as important future work.
IV. Experimental Results
IV-A. Setup
We implemented our proposed algorithm BiQGEMM in C++/CUDA with various compilers targeting different machines. Table III presents descriptions of the systems on which tests are performed. For tests on CPUs, the performance of BiQGEMM is compared with Intel MKL (mkl) [wang2014intel], Eigen (eigen) [eigenweb], and an algorithm introduced in [weissteinmatrix] (kCpu). Additionally, BiQGEMM is also run on GPU and compared with cuBLAS (cublas) [nvidia2008cublas], a kernel code in the CUDA samples (kGpu) [volkov2008benchmarking], and XNOR-popcount (xnor) [courbariaux2016binarized]. We generated synthetic matrices filled with random numbers as data sets for the tests.

              Mobile               PC                       GPGPU
Processor     Cortex-A76           i7-7700                  Tesla V100
# of Cores    4                    4                        80 (SMs)
L1 D-cache    64 KB/core           32 KB/core               128 KB/SM
SIMD lanes    4/core               8/core                   16*4/SM
DRAM          8 GB                 16 GB                    16 GB
GB/s [a]      31.8                 35.76                    900
FLOPS         19.36G x 4           57.6G x 4                181.87G x 4
Compiler      LLVM 8.0.7           gcc 5.4.0                nvcc 10.2
OS            Android 9 (4.14.78)  Ubuntu 16.04.6 (4.15.0)  -

[a] maximum memory bandwidth
Our proposed algorithm accepts the LUT-unit $\mu$ as a user-defined parameter that can affect system performance. Let us explain how the LUT-unit is optimized in practice. $\mu$ determines the physical space allocated for lookup tables. If $\mu$ increases, then the number of lookup tables decreases while the number of entries in each lookup table increases exponentially (see Fig. 4(a) and Eq. 6); i.e., there exists a trade-off between the number of LUTs and the number of entries inside each LUT. Combined with the output size $m$, the LUT-unit specifies the relative performance gain of BiQGEMM over GEMM. Specifically, from Eq. 9, for a given output size $m$, we can find the optimal $\mu$ by minimizing $(2^{\mu} + m)/\mu$. Note that in practical situations, hardware resources may limit the maximum $\mu$ (due to internal SRAM size), and thus restrict the tile size as well. Theoretically optimized $\mu$, therefore, should be verified empirically through extensive experiments. We use $\mu = 8$ for our entire tests, which turns out to be close to the theoretically optimized value.
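Under the cost model of Eq. 9, the per-element work is proportional to $(2^{\mu} + m)/\mu$, so a one-line search recovers the theoretically optimal LUT-unit (an illustrative sketch; `optimal_lut_unit` is our name, not part of the implementation):

```python
def optimal_lut_unit(m, mu_candidates=range(1, 17)):
    # Cost model from Eq. (9): per input element, BiQGEMM spends about
    # (2^mu + m) / mu operations (2^mu table-build entries plus m lookups,
    # amortized over the mu weight bits each lookup covers).
    return min(mu_candidates, key=lambda mu: (2 ** mu + m) / mu)
```

For example, $m = 1024$ yields $\mu = 8$ under this model, while $m = 512$ yields $\mu = 7$; hardware constraints (SRAM capacity, tile sizes) then narrow the choice further.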
IV-B. BiQGEMM Runtime Profiling
Fig. 8 represents the runtime portion of each operation when running BiQGEMM on CPU with various output sizes $m$. Operations are mainly categorized into 1) lookup table construction (build), 2) retrieving values (query), and 3) memory replacement for tiling (replace). As discussed in Section III-C, increasing the output size enlarges the proportion of time spent retrieving values, and correspondingly, more arithmetic operations in GEMM are replaced with retrieval operations in BiQGEMM. Note that even when more quantization bits are assigned to each weight, BiQGEMM increases only the retrieving operations, which are relatively inexpensive among the three operations (see Fig. 2). As such, when weights are quantized, BiQGEMM performs better (compared with a GEMM-based scheme) when the output size is larger and weights are quantized with more bits.
IV-C. GEMM with Quantized and Bit-packed Weights
To reduce the memory footprint, bit-packing is essential for quantized models to be densely stored in a general data type, such as INT32. Through bit-packing with few-batch multiplication, memory-bound algorithms are accelerated by reduced memory-bandwidth requirements. However, unpacking is required before running GEMM operations on packed data. Since unpacking fundamentally requires bit-level manipulations, unpacking on CPUs and GPUs may cause a large computational overhead. Indeed, Fig. 9 demonstrates such concerns. Assuming that weights are 1-bit quantized, Fig. 9 compares the runtime of 3 different scenarios (w/o unpack, sGEMM, and w/ unpack) depending on how the binary vectors are processed: 'sGEMM' indicates a case where only one bit is stored in a 32-bit container, while 'w/ unpack' means multiplying bit-packed data after extracting bits through an unpacking process. Since the 'sGEMM' version does not assume bit-packed data formats, quantization does not affect its performance (i.e., performance would be the same as that of full-precision weights). Note that 'w/o unpack' measures runtime when bit-packed data is multiplied by a real vector without unpacking (i.e., products of a 32-bit packed scalar and a vector of length 32), which produces incorrect results but is useful to identify the performance gain from decreased memory access latency. The runtime gap between 'w/o unpack' and 'sGEMM' implies the performance gain from the reduced memory footprint, whereas the difference between 'w/o unpack' and 'w/ unpack' runtime indicates the latency overhead of unpacking operations. Fig. 9 confirms that GEMM with quantized weights is inefficient in terms of response time even though quantization reduces DRAM access latency.
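The decompression that GEMM must perform can be sketched as follows (illustrative Python; the measured kernels are C++/CUDA, and the function name is ours): each 32-bit word is unpacked with shift-and-mask operations, and the extracted bits are remapped from $\{0,1\}$ to $\{-1,+1\}$, exactly the two steps BiQGEMM avoids by consuming packed words directly as lookup keys:

```python
import numpy as np

def unpack_int32(packed_words, total_bits):
    # The two decompression steps GEMM needs before multiplying:
    #   1) shift-and-mask each weight bit out of its 32-bit container;
    #   2) remap {0, 1} to {-1, +1} via 2*b - 1.
    words = np.asarray(packed_words, dtype=np.uint32)
    bits = (words[:, None] >> np.arange(31, -1, -1)) & 1    # step 1: extract
    signs = 2.0 * bits.astype(np.float64) - 1.0             # step 2: remap
    return signs.reshape(-1)[:total_bits]
```

Both steps run per weight element on the critical path of the inner loop, which is why 'w/ unpack' in Fig. 9 is slower than 'w/o unpack' despite reading the same packed data.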
IV-D Comparison with Others
Even though a small batch size is preferred at inference time to reduce response time, recently developed DNNs demand batch sizes larger than 1. For example, an input (in the form of a sequence) to a Transformer encoder contains several subwords (tokens) so that hidden relationships between subwords can be detected. Because all subwords in the input are multiplied by the same weights concurrently, those subwords are processed as a group. The number of subwords, thus, is equivalent to the batch size in terms of computation. Accordingly, we conduct experiments using various batch sizes ranging from 32 to 256, considering the number of subwords used for Transformers and their variants.
Since the unpacking process adds significant overhead to GEMM-based schemes with quantized, bit-packed weights as shown in Section IV-C, the ‘sGEMM’ version (which stores only one bit in a 32-bit container without packing) introduced in the previous subsection is selected for comparison with BiQGEMM. The ‘sGEMM’ version does not benefit from quantization, and therefore 1-bit quantized weights and full-precision weights yield the same measured performance when using MKL (mkl) and Eigen (eigen) (thus, we do not specify whether weights are quantized in Fig. 10 for eigen and mkl). The performance of BiQGEMM is measured when weights are quantized into 1, 2, or 3 bits. Note that the runtime of both BiQGEMM and GEMM with quantized weights increases linearly with the number of quantization bits. Combined with the observation that BiQGEMM 1-bit (BiQGEMM with 1-bit quantization) shows the highest performance in Fig. 10, this implies that BiQGEMM is always faster than GEMM given the same number of quantization bits. Moreover, if the batch size is small enough, BiQGEMM with 2- or 3-bit quantization outperforms full-precision GEMM. Thus, even if latency is the top priority in the inference system design (at the expense of increased memory footprint without quantization), BiQGEMM can be the choice if the number of quantization bits is small enough with allowable model accuracy degradation.
TABLE IV: Runtime (sec) with various weight matrix sizes and batch sizes (1-bit quantized weights)

  weights    batch    runtime (sec)
  (n-by-n)   size     BiQGEMM   kGpu   cublas   xnor
  512        1        4         22     12       18
             32       11        24     20       18
             128      30        39     25       19
             256      58        63     26       19
  1024       1        4         36     14       18
             32       20        57     27       19
             128      70        120    45       21
             256      135       204    64       24
  2048       1        5         93     31       19
             32       47        153    52       23
             128      175       366    109      29
             256      330       661    179      40
  4096       1        7         213    90       23
             32       130       614    130      34
             128      528       1396   339      64
             256      1005      2516   594      109
Fig. 10 shows that when the input size is fixed, a larger output size enhances the speedup of BiQGEMM significantly because of a higher reuse rate of lookup tables and the correspondingly increased number of arithmetic operations replaced by simple table lookups. A large batch size, on the other hand, may have adverse effects on BiQGEMM performed by CPUs or GPUs. Fig. 10 shows that BiQGEMM can be slower than GEMM if the batch size and the number of quantization bits are beyond certain threshold values. In theory, since the query operations of BiQGEMM replace every μ full-precision multiply-accumulates of GEMM with q table lookups (where q is the number of quantization bits and the sub-vector length μ is empirically optimized to be 8), BiQGEMM with under 8-bit quantization is supposed to be always faster than GEMM (of full precision) regardless of batch size. However, in reality, we need to consider limiting factors due to available hardware resources as discussed in Section III. If the batch size increases and data reuse improves correspondingly, the overall computational-efficiency improvement of mkl and eigen can be higher than that of BiQGEMM. For example, when the batch size exceeds 128 in Fig. 10(a), eigen and mkl are faster than BiQGEMM with 3-bit quantization. The specific batch size determining whether BiQGEMM can be faster than GEMM depends on the system configuration. For example, in the case of a mobile CPU with low computational power (see Table III), BiQGEMM outperforms full-precision GEMM even when the batch size becomes larger compared with the PC case, as described in Fig. 10(b).
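The theoretical argument can be checked with back-of-envelope operation counts. This is a sketch under stated assumptions (hypothetical function names; constant factors, table-build optimizations, and memory effects ignored): BiQGEMM performs roughly q·m·(n/μ)·b table lookups plus 2^μ·(n/μ)·b build steps, against m·n·b multiply-accumulates for full-precision GEMM, with sub-vector length μ = 8.

```python
# Back-of-envelope operation counts for BiQGEMM vs. full-precision GEMM.
# m: output size, n: input size, b: batch size, q: quantization bits,
# mu: sub-vector length (empirically 8 in the paper).

def gemm_ops(m, n, b):
    """Multiply-accumulate count of a dense full-precision GEMM."""
    return m * n * b

def biqgemm_ops(m, n, b, q, mu=8):
    """Approximate BiQGEMM cost: table build plus per-row queries."""
    build = (2 ** mu) * (n // mu) * b   # 2^mu entries per sub-vector table
    query = q * m * (n // mu) * b       # q lookups cover mu weights each
    return build + query

# Usage: for a large layer, the query term dominates and the advantage
# shrinks roughly in proportion to q/mu.
m, n, b = 4096, 4096, 32
for q in (1, 2, 3):
    ratio = gemm_ops(m, n, b) / biqgemm_ops(m, n, b, q)
    print(f"q={q}: GEMM/BiQGEMM operation ratio ~ {ratio:.1f}")
```

Since the query term is (q/μ)·m·n·b, any q below μ = 8 keeps the count below GEMM's m·n·b in this model; the crossover observed in practice at large batch sizes comes from hardware utilization, not from the operation counts themselves.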
Even though the experimental results in Fig. 10 assume only one thread, multithreading linearly improves the performance of both BiQGEMM and GEMM, as both can be parallelized by tiling techniques.
IV-E Experiments with GPU
On GPUs, we compare BiQGEMM with kGpu, cuBLAS, and xnor. Both cublas and kGpu assume that only 1 bit is occupied in a 32-bit container (with unnecessary storage of 31 bits) for each quantized weight (i.e., bit-packing is not considered because unpacking is as slow as sGEMM). In the case of xnor, activations are quantized as well, such that matrix multiplication is mainly computed by XNOR and popcount operations without an unpacking procedure. Assuming weights and activations are α- and β-bit quantized, xnor shows a time complexity of O(α·β·m·n·b) (with 32 weight bits processed per XNOR/popcount operation), where m, n, and b are the output size, input size, and batch size, respectively. Although activation quantization can simplify computations further, it demands training-algorithm modifications and computational overhead during inference, as discussed in Section II, at the cost of a model accuracy drop.
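The XNOR/popcount kernel that xnor-style GEMM relies on can be sketched as follows (a minimal illustration with hypothetical names, assuming both operands are 1-bit quantized to {-1, +1} and bit-packed): XNOR marks positions where the signs agree, and a popcount converts the match count into a dot product, so one word-sized operation covers many elements at once.

```python
# Sketch of a binary dot product via XNOR + popcount: with {-1, +1} vectors
# packed bitwise (bit = 1 for +1), matching bits contribute +1 and
# mismatching bits contribute -1 to the dot product.

def bin_dot(w_bits, x_bits, n):
    """Dot product of two {-1, +1} vectors given as n-bit packed integers."""
    mask = (1 << n) - 1
    xnor = (~(w_bits ^ x_bits)) & mask        # 1 where the signs agree
    matches = bin(xnor).count("1")            # popcount
    return 2 * matches - n                    # matches - mismatches

# Usage: identical vectors give +n, complementary vectors give -n.
print(bin_dot(0b1010, 0b1010, n=4))  # → 4
print(bin_dot(0b1010, 0b0101, n=4))  # → -4
```

On real hardware the popcount is a single instruction over a 32- or 64-bit word, which is why xnor's runtime in Table IV is nearly flat across matrix sizes; the trade-off, as noted above, is that activations must be quantized as well.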
cublas is provided as a library developed by the chip vendor, so we select kGpu as the baseline that we modify to implement BiQGEMM (for xnor, we use publicly available code that is also based on kGpu). Table IV shows the runtime with various matrix sizes and batch sizes when each weight is 1-bit quantized (for xnor, activations are also 1-bit quantized). The difference in performance between BiQGEMM and kGpu represents the gain from the reduced bandwidth requirement and the improved computational principles. As shown in Table IV, BiQGEMM is faster than kGpu by 1.08 to 30.42 times (as the weight matrix size increases and the batch size decreases, BiQGEMM becomes relatively faster). Note that if the batch size is small enough (to be memory-bound), BiQGEMM presents the best performance, even compared with xnor.
V Conclusion
We proposed an efficient matrix-to-matrix multiplication technique dedicated to quantized neural networks. When weights are quantized, the available output space of computational results is quite limited, so for a large matrix multiplication, many computations become redundant. BiQGEMM removes such redundancy by replacing multiplications with table lookups. Moreover, because commercial processors enable only a fixed data-transfer width, much memory bandwidth can be wasted when weights are quantized into a few bits. BiQGEMM provides a way to access multiple quantized weights simultaneously regardless of the number of quantization bits. Hence, memory-bandwidth utilization is significantly enhanced by BiQGEMM while the required memory bandwidth is reduced by quantization. We demonstrated that BiQGEMM is considerably faster than previous matrix multiplication schemes, especially when the matrix size is large and the batch size is small.