Near-Precise Parameter Approximation for Multiple Multiplications on A Single DSP Block

04/05/2021
by Ercan Kalali, et al.

A multiply-accumulate (MAC) operation is the main computation unit for DSP applications. DSP blocks are one of the efficient solutions for implementing MACs in FPGAs. However, since DSP blocks have wide multiplier and adder blocks, MAC operations using low bit-length parameters lead to an underutilization problem. Hence, an efficient approximation technique is introduced. The technique includes manipulation and approximation of the low bit-length fixed-point parameters based upon a Single DSP - Multiple Multiplication (SDMM) execution. The SDMM changes the traditional MAC implementation in the DSP block by separating the multiplication and accumulation operations. While the accumulator hardware available in the DSP block is used for multiple parameter multiplication, parallel LUTs are employed for the accumulation part of the MAC operation. The accuracy of the developed optimization technique was evaluated for different CNN weight bit precisions using the AlexNet and VGG-16 networks and the Tiny ImageNet dataset. The optimization can be implemented without loss of accuracy in almost all cases, while it causes slight accuracy losses in a few cases. Through these optimizations, the SDMM is performed at the cost of a small hardware overhead. For example, a single DSP block executes three 8-bit fixed-point parameter multiplications. As a result of our optimizations, the parameters are represented in a different format in off-chip memory, providing up to 33% compression. The compression rate can be further increased by up to 97% when combined with other compression methods for the VGG-16. Reaching this compression rate requires extra hardware cost. A prototype systolic array architecture was implemented employing our optimizations on a Xilinx Zynq FPGA. It reduced the number of DSP blocks by 66.6%.
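The underutilization argument can be illustrated with a generic operand-packing sketch. This is not the SDMM scheme itself, whose parameter manipulation and approximation steps are not detailed in the abstract. It assumes unsigned 8-bit weights and inputs and a 16-bit field per product, and the names `pack_weights` and `multiply_packed` are hypothetical. Several small weights are placed in disjoint bit fields of one wide operand, a single wide multiplication is performed, and the individual products are read back from the fields.

```python
FIELD_BITS = 16  # assumed field width: an 8-bit x 8-bit unsigned product fits in 16 bits


def pack_weights(weights):
    """Place each unsigned 8-bit weight in its own 16-bit field of one wide integer."""
    packed = 0
    for i, w in enumerate(weights):
        if not 0 <= w < 256:
            raise ValueError("weights must be unsigned 8-bit values")
        packed |= w << (i * FIELD_BITS)
    return packed


def multiply_packed(packed, x, count):
    """Multiply all packed weights by x with one wide multiply, then split the fields."""
    if not 0 <= x < 256:
        raise ValueError("input must be an unsigned 8-bit value")
    wide = packed * x  # the single wide multiplication
    mask = (1 << FIELD_BITS) - 1
    # Each product is below 2**16, so neighbouring fields never overlap.
    return [(wide >> (i * FIELD_BITS)) & mask for i in range(count)]


weights = [3, 200, 255]
x = 255
products = multiply_packed(pack_weights(weights), x, len(weights))
assert products == [w * x for w in weights]
```

With exact 16-bit fields, three packed 8-bit weights occupy a 40-bit operand, which is wider than a typical FPGA DSP multiplier input. That gap is one way to see why a narrower, approximated parameter representation is attractive, although the specific representation used by the authors is not given here.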
