Sparse Systolic Tensor Array for Efficient CNN Hardware Acceleration

09/04/2020
by Zhi-Gang Liu, et al.

Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. In this paper, we address a key architectural challenge with structural sparsity: how to provide support for a range of sparsity levels while maintaining high utilization of the hardware. We describe a time-unrolled formulation of variable density-bound block (VDBB) sparsity that allows for a configurable number of non-zero elements per block, at constant utilization. We then describe a systolic array microarchitecture that implements this scheme, with two data reuse optimizations. Firstly, we increase reuse in both operands and partial products by increasing the number of MACs per PE. Secondly, we introduce a novel approach of moving the IM2COL transform into the hardware, which allows us to achieve a 3x data bandwidth expansion just before the operands are consumed by the datapath, reducing the SRAM power consumption. The optimizations for weight sparsity, activation sparsity and data reuse are all interrelated and therefore the optimal combination is not obvious. Therefore, we perform a design space evaluation to find the Pareto-optimal design characteristics. The resulting design achieves 16.8 TOPS/W in 16nm with modest 50% model compression. As well as successfully demonstrating the variable DBB technique, this result significantly outperforms previously reported sparse CNN accelerators.
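To make the two core ideas concrete, below is a minimal NumPy sketch of density-bound block pruning and of the IM2COL lowering that the accelerator performs in hardware. This is an illustration only, not the paper's implementation; the block size, non-zero bound, and convolution parameters are assumptions chosen for the example.

    import numpy as np

    def dbb_prune(weights, block_size=8, max_nonzeros=4):
        # Density-bound block (DBB) pruning: within each block of
        # `block_size` consecutive weights, keep at most `max_nonzeros`
        # largest-magnitude entries and zero the rest. A variable-DBB
        # (VDBB) scheme makes `max_nonzeros` configurable rather than fixed.
        assert weights.size % block_size == 0
        w = weights.reshape(-1, block_size).copy()
        # Indices of the smallest-magnitude entries to drop in each block.
        drop = np.argsort(np.abs(w), axis=1)[:, :block_size - max_nonzeros]
        np.put_along_axis(w, drop, 0, axis=1)
        return w.reshape(weights.shape)

    def im2col(x, k):
        # Lower an H x W x C activation tensor into the GEMM operand matrix
        # for a k x k convolution (stride 1, no padding): one flattened
        # receptive field per row. Performing this expansion in hardware,
        # just before the datapath, avoids reading the (much larger)
        # expanded matrix from SRAM.
        H, W, C = x.shape
        rows = [x[i:i + k, j:j + k, :].ravel()
                for i in range(H - k + 1)
                for j in range(W - k + 1)]
        return np.stack(rows)

With block_size=8 and max_nonzeros=4, for instance, half of the weights in every block are zeroed, corresponding to the modest 50% compression point quoted in the abstract; lowering the bound raises sparsity while keeping the per-block work predictable.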
