SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference

07/20/2022
by   Sachit Kuhar, et al.

Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains that now rely on deep learning. A recent promising direction for speeding up inference is to exploit weight repetition. The key observation is that, due to DNN quantization schemes (which reduce DNN storage requirements by lowering the number of bits needed to represent each weight), the same weight value is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factorize and memoize common sub-computations, reducing arithmetic per inference while retaining the compression benefits of quantization. Yet significant challenges remain. For instance, weight repetition introduces significant irregularity into the inference operation and has therefore (up to this point) required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques that make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs with weight-dependent structure. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given the trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate its performance against Intel's optimized library oneDNN and the prior-art weight-repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves speedups of 1.09×-2.05× and 1.04×-1.51× relative to oneDNN and AGR, respectively, while simultaneously compressing the DNN model by 8.7×-15.4×.
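The C++ sketch below is not the paper's SumMerge kernel or its data-flow-graph formulation; it is only a minimal illustration of the underlying weight-repetition idea for a plain dot product, with a hypothetical helper name (repetition_aware_dot) and made-up example values. Inputs that share a weight value are summed first, so each distinct weight is multiplied exactly once rather than once per element.

// Minimal sketch of weight-repetition-aware computation (illustration only,
// not the SumMerge algorithm): group inputs by repeated weight value, sum
// each group, then perform one multiplication per distinct weight.
#include <cstdio>
#include <unordered_map>
#include <vector>

float repetition_aware_dot(const std::vector<float>& w,
                           const std::vector<float>& x) {
    // Accumulate the sum of inputs that share each distinct weight value.
    std::unordered_map<float, float> group_sums;
    for (std::size_t i = 0; i < w.size(); ++i) {
        group_sums[w[i]] += x[i];
    }
    // One multiplication per unique weight value, instead of per element.
    float result = 0.0f;
    for (const auto& [weight, partial_sum] : group_sums) {
        result += weight * partial_sum;
    }
    return result;
}

int main() {
    // Quantized filter with only two distinct weight values (0.5 and -0.25).
    std::vector<float> w = {0.5f, -0.25f, 0.5f, 0.5f, -0.25f, 0.5f};
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
    std::printf("dot = %f\n", repetition_aware_dot(w, x));
    return 0;
}

In this toy version the number of multiplications drops from one per element to one per distinct weight value, which is the kind of arithmetic reduction the abstract describes; SumMerge's offline-selected data-flow graphs pursue the same saving in a structured, vectorizable form.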

