Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks

11/25/2020
by Nick Iliev, et al.

We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units for storing the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared with that of recent DNN hardware accelerators on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), improving on a recent FC8-layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We achieve this considerable improvement by fully utilizing the HBM units to store and read out column-specific FC-layer weights in a single cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath that processes these weights with the corresponding MAC and PE units. When scaled up to 128 16x16 PEs, for 16x16 tiles of weights, the design reduces latency for the large FC6 layer in VGG16 by 60% compared to an alternative EIE solution, which uses compression.
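
As a sanity check on the quoted figures, the short sketch below works through the FC8 latency arithmetic. Only the 4096-1000 layer shape, the GOPS numbers, and the clock rates come from the abstract; counting 2 operations (one multiply, one add) per MAC is the usual convention but is assumed here, so treat the printed latencies as idealized estimates rather than reported results.

```python
# Back-of-the-envelope latency/throughput arithmetic for the FC8 comparison.
def fc_ops(n_in: int, n_out: int) -> int:
    """Operations in a dense n_in x n_out FC layer, at 2 ops per MAC."""
    return 2 * n_in * n_out

def latency_us(ops: int, gops: float) -> float:
    """Ideal latency in microseconds at a sustained throughput in GOPS."""
    return ops / (gops * 1e9) * 1e6

ops_fc8 = fc_ops(4096, 1000)  # 8,192,000 ops
print(f"FC8 ops:              {ops_fc8:,}")
print(f"FC-ACCL  @ 48.4 GOPS: {latency_us(ops_fc8, 48.4):6.1f} us (100 MHz clock)")
print(f"baseline @ 28.8 GOPS: {latency_us(ops_fc8, 28.8):6.1f} us (150 MHz clock)")

# Normalizing by clock rate isolates the architectural gain:
# 48.4e9 / 100e6 = 484 ops/cycle vs 28.8e9 / 150e6 = 192 ops/cycle,
# i.e. roughly 2.5x more work per cycle despite the slower clock.
print(f"ops/cycle: FC-ACCL {48.4e9 / 100e6:.0f} vs baseline {28.8e9 / 150e6:.0f}")
```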

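To make the dataflow concrete, here is a functional (not cycle-accurate) sketch of the tiled matrix-vector product the PE array evaluates for the FC8 shape. The 8x8 tile size, the 128-PE count, and the idea of reading column-specific weights from HBM in one cycle come from the abstract; the loop ordering below is only one plausible illustration of a column-row-column schedule, not the paper's exact microarchitecture.

```python
import numpy as np

T = 8                                  # PE tile size (8x8, per the abstract)
n_in, n_out = 4096, 1000               # FC8 layer shape; both divide by T

rng = np.random.default_rng(0)
W = rng.standard_normal((n_out, n_in)).astype(np.float32)  # pretrained weights (HBM-resident)
x = rng.standard_normal(n_in).astype(np.float32)           # input activations
y = np.zeros(n_out, dtype=np.float32)                      # MAC accumulators

for c in range(0, n_in, T):            # outer: column strips of the weight matrix
    x_slice = x[c:c + T]               # T activations, reused across all row tiles
    for r in range(0, n_out, T):       # inner: row tiles, one per PE in hardware
        tile = W[r:r + T, c:c + T]     # TxT weight tile (one wide HBM read)
        y[r:r + T] += tile @ x_slice   # PE multiply, MAC accumulate

assert np.allclose(y, W @ x, rtol=1e-4, atol=1e-2)
```

In hardware the row tiles of each column strip would be distributed across the 128 PEs and processed in parallel, rather than by the sequential inner loop shown here.
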
Related research

07/27/2017 - Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability
Tartan (TRT), a hardware accelerator for inference with Deep Neural Netw...

07/01/2017 - Structured Sparse Ternary Weight Coding of Deep Neural Networks for Efficient Hardware Implementations
Deep neural networks (DNNs) usually demand a large amount of operations ...

07/09/2019 - A Targeted Acceleration and Compression Framework for Low bit Neural Networks
1 bit deep neural networks (DNNs), of which both the activations and wei...

05/22/2017 - A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity
It remains a challenge to run Deep Learning in devices with stringent po...

12/06/2021 - Kraken: An Efficient Engine with a Uniform Dataflow for Deep Neural Networks
Deep neural networks (DNNs) have been successfully employed in a multitu...

09/16/2023 - A Low-Latency FFT-IFFT Cascade Architecture
This paper addresses the design of a partly-parallel cascaded FFT-IFFT a...

03/03/2023 - PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption
High-speed long polynomial multiplication is important for applications ...
