Bit-Parallel Vector Composability for Neural Acceleration

04/11/2020
by Soroush Ghodrati, et al.

Conventional neural accelerators rely on isolated, self-sufficient functional units that perform an atomic operation and communicate their results through an operand delivery-aggregation logic. Each unit processes all the bits of its operands atomically and produces all the bits of its results in isolation. This paper explores a different design style, in which each unit is responsible for only a slice of the bit-level operations, interleaving and combining the benefits of bit-level parallelism with the abundant data-level parallelism in deep neural networks. A dynamic collection of these units cooperates at runtime to collectively generate the bits of the results. Such cooperation requires extracting new groupings among the bits, which is only possible if the operands and operations are vectorizable. The abundance of data-level parallelism and the mostly repetitive execution patterns of deep neural networks provide a unique opportunity to define and leverage this new dimension of bit-parallel vector composability. This design intersperses bit parallelism within data-level parallelism and dynamically interweaves the two. As such, the building block of our neural accelerator is a Composable Vector Unit: a collection of narrower-bitwidth vector engines that are dynamically composed or decomposed at the bit granularity. Using six diverse CNN and LSTM deep networks, we evaluate this design style across four design points: with and without algorithmic bitwidth heterogeneity, and with and without the availability of a high-bandwidth off-chip memory. Across these four design points, bit-parallel vector composability brings 1.4x to 3.5x speedup and 1.1x to 2.7x energy reduction. We also comprehensively compare our design style to the Nvidia RTX 2080 Ti GPU, which also supports INT4 execution. The benefits range from 28.0x to 33.7x improvement in performance per watt.
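To make the composition concrete, the sketch below illustrates the arithmetic behind this design style: a wide dot product assembled from narrow bit-slices. It is a minimal Python illustration, assuming 8-bit unsigned operands split into 4-bit slices; the names (split_slices, narrow_vector_engine, composed_dot_product) and the slice width are illustrative assumptions, not the paper's exact microarchitecture. The point it demonstrates is that, because the operands are vectorized, the shift-and-add composition is paid once per vector dot product rather than once per scalar multiply.

```python
# Illustrative sketch of bit-parallel vector composability (assumed
# names and slice widths; not the paper's exact design).
#
# A wide dot product is computed by four narrow-bitwidth vector engines,
# each handling one pair of 4-bit slices across the whole vector.

SLICE_BITS = 4                  # width handled by each narrow engine
SLICE_MASK = (1 << SLICE_BITS) - 1

def split_slices(x):
    """Split an 8-bit unsigned value into (low, high) 4-bit slices."""
    return x & SLICE_MASK, (x >> SLICE_BITS) & SLICE_MASK

def narrow_vector_engine(a_slices, b_slices):
    """A narrower-bitwidth vector engine: a 4-bit x 4-bit multiply-accumulate
    over a whole vector (the data-level-parallel dimension)."""
    return sum(a * b for a, b in zip(a_slices, b_slices))

def composed_dot_product(a_vec, b_vec):
    """Compose four narrow engines into one 8-bit x 8-bit dot product."""
    a_lo, a_hi = zip(*(split_slices(a) for a in a_vec))
    b_lo, b_hi = zip(*(split_slices(b) for b in b_vec))

    # Each partial sum comes from one narrow engine working on one slice
    # pair; the shifts realign each partial sum to its bit position, and
    # this shift-and-add step runs once per vector dot product.
    return ((narrow_vector_engine(a_hi, b_hi) << (2 * SLICE_BITS))
            + (narrow_vector_engine(a_hi, b_lo) << SLICE_BITS)
            + (narrow_vector_engine(a_lo, b_hi) << SLICE_BITS)
            + narrow_vector_engine(a_lo, b_lo))

# Sanity check against a plain dot product.
a = [12, 200, 37, 255]
b = [99, 3, 180, 41]
assert composed_dot_product(a, b) == sum(x * y for x, y in zip(a, b))
```

The decomposed mode follows the same logic in reverse: when a layer needs only 4-bit precision, the narrow engines can skip the shift-and-add composition and operate as independent 4-bit dot-product units, which is how algorithmic bitwidth heterogeneity translates into additional throughput in this design style.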
