Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices

09/14/2020
by Anton Trusov, et al.

Quantized low-precision neural networks are very popular because they require fewer computational resources for inference and can provide high performance, which is vital for real-time and embedded recognition systems. However, their advantages are most apparent on FPGA and ASIC devices, while general-purpose processor architectures are not always able to perform low-bit integer computations efficiently. The most frequently used low-precision neural network model for mobile central processors is an 8-bit quantized network. However, in a number of cases it is possible to use fewer bits for weights and activations; the only obstacle is the difficulty of efficient implementation. We introduce an efficient implementation of 4-bit matrix multiplication for quantized neural networks and perform timing measurements on a mobile ARM processor. It shows a 2.9x speedup over standard floating-point multiplication and is 1.5 times faster than the 8-bit quantized counterpart. We also demonstrate a 4-bit quantized neural network for OCR on the MIDV-500 dataset: 4-bit quantization gives 95.0% accuracy together with an overall inference speedup, while an 8-bit quantized network gives 95.4% accuracy and a 39% speedup. The results show that 4-bit quantized networks are viable for mobile devices, yielding good enough accuracy and low inference time.
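The abstract describes 4-bit matrix multiplication as the core primitive. The following scalar C++ sketch is not the authors' optimized ARM kernel; it is a minimal illustration of one common scheme, in which two unsigned 4-bit values are packed per byte and products are accumulated in 32-bit integers. The packing layout and the names get_nibble and matmul_u4 are assumptions made for this example.

// Illustrative sketch: nibble-packed 4-bit matrix multiplication with
// 32-bit accumulation (naive reference, no SIMD).
#include <cstdint>
#include <vector>
#include <iostream>

// Extract the 4-bit value at logical index i from a nibble-packed buffer
// (low nibble holds the even index, high nibble the odd index).
inline uint8_t get_nibble(const std::vector<uint8_t>& packed, size_t i) {
    uint8_t byte = packed[i / 2];
    return (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
}

// C[m][n] = sum_k A[m][k] * B[k][n]; A is row-major, B column-major,
// both stored as packed unsigned 4-bit values.
void matmul_u4(const std::vector<uint8_t>& A, const std::vector<uint8_t>& B,
               std::vector<int32_t>& C, size_t M, size_t K, size_t N) {
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k)
                acc += static_cast<int32_t>(get_nibble(A, m * K + k)) *
                       static_cast<int32_t>(get_nibble(B, n * K + k));
            C[m * N + n] = acc;
        }
}

int main() {
    const size_t M = 2, K = 4, N = 2;
    // A = [[1,2,3,4],[5,6,7,8]], packed two values per byte (low nibble first).
    std::vector<uint8_t> A = {0x21, 0x43, 0x65, 0x87};
    // B (column-major) has columns [1,0,0,0] and [0,1,0,0].
    std::vector<uint8_t> B = {0x01, 0x00, 0x10, 0x00};
    std::vector<int32_t> C(M * N);
    matmul_u4(A, B, C, M, K, N);
    for (int32_t v : C) std::cout << v << ' ';  // prints: 1 2 5 6
    std::cout << '\n';
}

A production kernel would keep the operands packed in vector registers and use SIMD multiply-accumulate instructions instead of unpacking one nibble at a time; the sketch only shows the data layout and the accumulation logic.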


Related research:

05/18/2022: Fast matrix multiplication for binary and ternary CNNs on ARM CPU
Low-bit quantized neural networks are of great interest in practical app...

02/26/2020: Quantized Neural Network Inference with Precision Batching
We present PrecisionBatching, a quantized inference algorithm for speedi...

09/12/2017: Streamlined Deployment for Quantized Neural Networks
Running Deep Neural Network (DNN) models on devices with limited computa...

04/02/2021: Inference of Recyclable Objects with Convolutional Neural Networks
Population growth in the last decades has resulted in the production of ...

02/12/2023: Quark: An Integer RISC-V Vector Processor for Sub-Byte Quantized DNN Inference
In this paper, we present Quark, an integer RISC-V vector processor spec...

02/18/2019: Low-bit Quantization of Neural Networks for Efficient Inference
Recent breakthrough methods in machine learning make use of increasingly...

01/13/2021: FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference
Deep learning models typically use single-precision (FP32) floating poin...
