Neural Network Compression using Binarization and Few Full-Precision Weights

06/15/2023
by Franco Maria Nardini, et al.

Quantization and pruning are two well-established methods for compressing Deep Neural Networks. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique that combines quantization with pruning. APB enhances the representational capability of binary networks with a few full-precision weights: it jointly maximizes the accuracy of the network and minimizes its memory footprint by deciding, for each weight, whether it should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed with APB by decomposing it into a binary matrix multiplication and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU that leverage highly efficient bitwise operations; the proposed algorithms are 6.9x and 1.5x faster, respectively, than available state-of-the-art solutions. We perform an extensive evaluation of APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) combinations of pruning and quantization. APB also outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
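The abstract only names the decomposition, so here is a minimal illustrative sketch of the idea: split a weight matrix into a binarized component plus a small set of full-precision corrections, so that a layer's forward pass becomes a binary matrix multiplication plus a sparse-dense one. The mean-magnitude scaling, the largest-error keep rule, the keep_ratio parameter, and all function names below are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def apb_decompose(W, keep_ratio=0.01):
    """Split W into a binary part plus a few full-precision corrections.
    The scaling and selection rules here are illustrative assumptions."""
    alpha = np.abs(W).mean()                 # scale of the binary part (assumed)
    W_bin = np.where(W >= 0, alpha, -alpha)  # binarized weights in {-alpha, +alpha}
    err = np.abs(W - W_bin)                  # per-weight binarization error
    k = max(1, int(keep_ratio * W.size))     # how many weights stay full precision
    thresh = np.partition(err.ravel(), -k)[-k]
    mask = err >= thresh
    W_corr = np.where(mask, W - W_bin, 0.0)  # sparse correction term
    return W_bin, W_corr

def apb_forward(x, W_bin, W_corr):
    """Forward pass as a binary matmul plus a sparse-dense matmul. Both are
    dense NumPy ops here for clarity; an optimized kernel would run the first
    term with XNOR/popcount bitmaps and keep W_corr in a sparse format."""
    return x @ W_bin.T + x @ W_corr.T

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=(4, 128)).astype(np.float32)
W_bin, W_corr = apb_decompose(W, keep_ratio=0.05)
print(np.abs(apb_forward(x, W_bin, W_corr) - x @ W.T).max())  # binarization residual
```

The two CPU kernels the abstract mentions are not described here, but the standard bitwise trick such kernels build on can be sketched: pack the signs of ±1 vectors into bitmaps, then recover each dot product as n - 2·popcount(x XOR w). This self-contained NumPy version (assuming n is a multiple of 8) emulates popcount with unpackbits; a fast implementation would instead apply the hardware popcnt instruction to 64-bit words.

```python
import numpy as np

def pack_signs(M):
    """Pack the sign bits of a +/-1 matrix into uint8 bitmaps (bit 1 = +1)."""
    return np.packbits(M > 0, axis=1)

def binary_matmul(Xb, Wb, n):
    """Binary matmul over packed sign bits: for +/-1 vectors of length n,
    dot(x, w) = n - 2 * popcount(x XOR w)."""
    xor = Xb[:, None, :] ^ Wb[None, :, :]                        # (batch, out, n/8)
    pop = np.unpackbits(xor, axis=2).sum(axis=2).astype(np.int64)  # popcount per pair
    return n - 2 * pop

rng = np.random.default_rng(0)
n = 128
A = rng.choice(np.array([-1, 1], dtype=np.int8), size=(4, n))
B = rng.choice(np.array([-1, 1], dtype=np.int8), size=(64, n))
assert np.array_equal(binary_matmul(pack_signs(A), pack_signs(B), n),
                      A.astype(np.int32) @ B.astype(np.int32).T)
```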
