F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization

02/10/2022
by Qing Jin, et al.

Neural network quantization is a promising compression technique to reduce memory footprint and energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce this gap, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization, which introduces a noticeable cost in memory, speed, and energy. To tackle these issues, we present F8Net, a novel quantization framework consisting of only fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point numbers. Second, based on the statistical and algorithmic analysis, we apply different fixed-point formats to the weights and activations of different layers, and we introduce a novel algorithm to automatically determine the right format for each layer during training. Third, we analyze a previous quantization algorithm – parameterized clipping activation (PACT) – and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning with our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves comparable or better performance when compared not only to existing quantization techniques with INT32 multiplication or floating-point arithmetic, but also to the full-precision counterparts, achieving state-of-the-art results.
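
As a rough illustration of what 8-bit-only fixed-point arithmetic looks like, the sketch below quantizes floats to signed 8-bit integers with a per-tensor fractional length, multiplies them with integer arithmetic, and rescales with a single bit shift instead of an INT32 or floating-point rescale. The function names (quantize_fixed8, fixed8_matmul) and the particular fractional lengths are illustrative assumptions, not the F8Net implementation described in the paper.

```python
# Minimal sketch of fixed-point 8-bit multiplication (illustrative only).
# A real value x is stored as a signed 8-bit integer q with a fractional
# length FL, so x ~= q / 2**FL.

import numpy as np

def quantize_fixed8(x, frac_len):
    """Quantize a float array to signed 8-bit fixed point with frac_len fractional bits."""
    q = np.round(x * (1 << frac_len))
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_fixed8(q, frac_len):
    """Recover the approximate float value from the fixed-point representation."""
    return q.astype(np.float32) / (1 << frac_len)

def fixed8_matmul(qa, fl_a, qb, fl_b, fl_out):
    """Multiply two fixed-point tensors using only integer arithmetic.

    Products accumulate in a wider integer type; a single rounding right
    shift (rather than an INT32 or floating-point rescale) brings the result
    back to an 8-bit format with fl_out fractional bits.
    """
    acc = qa.astype(np.int32) @ qb.astype(np.int32)        # integer accumulate
    shift = fl_a + fl_b - fl_out                           # fractional bits to drop
    out = np.right_shift(acc + (1 << (shift - 1)), shift)  # rounding shift
    return np.clip(out, -128, 127).astype(np.int8)

# Example: activations and weights use different (hypothetical) formats.
rng = np.random.default_rng(0)
act = rng.uniform(-1, 1, size=(1, 4)).astype(np.float32)
wgt = rng.uniform(-0.5, 0.5, size=(4, 3)).astype(np.float32)

q_act = quantize_fixed8(act, frac_len=5)   # activations: 5 fractional bits
q_wgt = quantize_fixed8(wgt, frac_len=7)   # weights: 7 fractional bits
q_out = fixed8_matmul(q_act, 5, q_wgt, 7, fl_out=5)

print(dequantize_fixed8(q_out, 5))         # close to the float result
print(act @ wgt)
```

The point of the shift-based rescale is that the choice of fractional length per tensor replaces the per-layer floating-point scale factor; picking that format well for each layer is what the paper's training-time algorithm addresses.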

Related research:

04/06/2021 – TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT
In this research, we propose a new low-precision framework, TENT, to lev...

05/27/2020 – Accelerating Neural Network Inference by Overflow Aware Quantization
The inherent heavy computation of deep neural networks prevents their wi...

03/09/2023 – Dynamic Stashing Quantization for Efficient Transformer Training
Large Language Models (LLMs) have demonstrated impressive performance on...

08/02/2019 – U-Net Fixed-Point Quantization for Medical Image Segmentation
Model quantization is leveraged to reduce the memory consumption and the...

11/10/2017 – Quantized Memory-Augmented Neural Networks
Memory-augmented neural networks (MANNs) refer to a class of neural netw...

03/04/2023 – Fixed-point quantization aware training for on-device keyword-spotting
Fixed-point (FXP) inference has proven suitable for embedded devices wit...

02/24/2021 – FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism
In this paper, we present a deep reinforcement learning platform named F...
