Barrier-Free Large-Scale Sparse Tensor Accelerator (BARISTA) For Convolutional Neural Networks

04/18/2021
by Ashish Gondimalla, et al.

Convolutional neural networks (CNNs) are emerging as powerful tools for visual recognition. Recent architecture proposals for sparse CNNs exploit zeros in the feature maps and filters for performance and energy without losing accuracy. Sparse architectures that exploit two-sided sparsity in both feature maps and filters have been studied only at small scales (e.g., 1K multiply-accumulate (MAC) units). However, to realize their advantages in full, sparse architectures have to be scaled up to the levels of dense architectures (e.g., 32K MACs in the TPU). Such scaling is challenging because achieving reuse through broadcasts incurs an implicit barrier cost and raises the inter-related issues of load imbalance, buffering, and on-chip bandwidth demand. SparTen, a previous scheme, addresses one aspect of load balancing but not other aspects, nor the other issues of buffering and bandwidth. To that end, we propose the barrier-free large-scale sparse tensor accelerator (BARISTA). BARISTA (1) is the first architecture for scaling up sparse CNN accelerators; (2) reduces on-chip bandwidth demand by telescoping request-combining of the input-map requests and snarfing of the filter requests; (3) reduces buffering via basic buffer sharing and avoids the ensuing barriers between consecutive input maps by coloring the output buffers; (4) load balances intra-filter work via dynamic round-robin work assignment; and (5) employs hierarchical buffering, which achieves high cache bandwidth via a few, wide, shared buffers and low buffering via narrower, private buffers at the compute units. Our simulations show that, on average, BARISTA performs 5.4x, 2.2x, 1.7x, and 2.5x better than a dense, a one-sided, a naively-scaled two-sided, and an iso-area two-sided architecture, respectively. ASIC synthesis of our RTL design in 45-nm technology, for four clusters of 8K MACs at a 1-GHz clock, reports 213 mm^2 of area and 170 W of power.
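To give intuition for point (4), the Python sketch below contrasts a static split of sparse work with a dynamic scheme that hands the next chunk to whichever compute unit frees up first. It is a hypothetical illustration, not the paper's RTL or exact policy; the chunk costs, unit count, and function names are assumptions for illustration only.

# Minimal sketch (not from the paper): why dynamic assignment of
# variable-cost sparse chunks balances better than a static split.
import heapq
import random

def static_assignment(chunk_costs, num_units):
    # Split chunks into contiguous, equal-count groups; return per-unit total work.
    per_unit = [0] * num_units
    for i, cost in enumerate(chunk_costs):
        per_unit[i * num_units // len(chunk_costs)] += cost
    return per_unit

def dynamic_assignment(chunk_costs, num_units):
    # Hand the next chunk to whichever unit becomes free first.
    heap = [(0, unit) for unit in range(num_units)]  # (busy-until time, unit id)
    heapq.heapify(heap)
    for cost in chunk_costs:
        busy_until, unit = heapq.heappop(heap)
        heapq.heappush(heap, (busy_until + cost, unit))
    return [t for t, _ in heap]

if __name__ == "__main__":
    random.seed(0)
    # Sparse filters: per-chunk MAC counts vary widely because zeros are skipped.
    costs = [random.randint(1, 100) for _ in range(4096)]
    units = 64
    print("static  makespan:", max(static_assignment(costs, units)))
    print("dynamic makespan:", max(dynamic_assignment(costs, units)))

Under skewed chunk costs, the dynamic schedule's makespan (the slowest unit's finish time) is noticeably lower than the static split's, which is the load-imbalance effect the abstract refers to.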

