Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-temporal Sparsity

08/04/2021
by   Chang Gao, et al.

Long Short-Term Memory (LSTM) recurrent networks are frequently used for tasks involving time-sequential data such as speech recognition. However, it is difficult to deploy these networks on hardware to achieve high throughput and low latency because the fully connected structure makes LSTM networks a memory-bound algorithm. Previous LSTM accelerators either exploited spatial weight sparsity or temporal activation sparsity. This paper proposes a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low-latency inference. The spatial sparsity is induced using our proposed pruning method, Column-Balanced Targeted Dropout (CBTD), which structures sparse weight matrices for a balanced workload. CBTD achieved up to 96% weight sparsity with negligible accuracy loss for an LSTM network trained on the TIMIT phone recognition task. To induce temporal sparsity in LSTM, we extend the previous DeltaGRU method to the LSTM network, creating the DeltaLSTM. The combined sparsity simultaneously saves on weight memory accesses and the associated arithmetic operations. Spartus was implemented on a Xilinx Zynq-7100 FPGA. Its per-sample latency for a single 1024-neuron DeltaLSTM layer averages 1 µs. Spartus achieved 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/J energy efficiency, respectively 4X and 7X higher than the previous state-of-the-art.
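The abstract describes CBTD only at a high level. Below is a minimal NumPy sketch of what a column-balanced pruning mask could look like: within each fixed-size chunk of a column, only the largest-magnitude weights survive, so every column carries the same number of nonzeros and no processing element is left idle. The function name cbtd_mask and the parameters block_size and density are illustrative; the paper's full method additionally applies this selection as a targeted dropout during training rather than as a one-shot post-training mask.

```python
import numpy as np

def cbtd_mask(W, block_size, density):
    """Column-balanced mask: within every `block_size` chunk of each
    column, keep only the largest-magnitude weights, so every column
    (and every PE lane sharing it) gets the same workload."""
    rows, cols = W.shape
    assert rows % block_size == 0, "rows must divide evenly into blocks"
    keep = max(1, round(block_size * density))   # nonzeros kept per chunk
    mask = np.zeros_like(W, dtype=bool)
    for c in range(cols):
        for r0 in range(0, rows, block_size):
            chunk = np.abs(W[r0:r0 + block_size, c])
            top = np.argsort(chunk)[-keep:]      # indices of largest weights
            mask[r0 + top, c] = True
    return mask

# Prune a 1024x1024 LSTM weight matrix toward the sparsity regime
# reported in the abstract and report what was achieved.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
mask = cbtd_mask(W, block_size=64, density=0.05)
print(f"achieved sparsity: {1.0 - mask.mean():.1%}")
```

Because every chunk keeps exactly the same number of weights, the sparse matrix can be stored in a fixed-stride format and streamed to the PEs without load-balancing hardware.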
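The temporal-sparsity side, inherited from DeltaGRU, rests on the delta principle: an input element is propagated only when it has changed by more than a threshold since it was last propagated, and the matrix-vector product is accumulated across timesteps so skipped columns cost nothing. The sketch below, with illustrative names delta_matvec, x_ref, and theta, covers a single pre-activation; a full DeltaLSTM would maintain one such accumulator per gate and apply the same rule to the recurrent state.

```python
import numpy as np

def delta_matvec(W, x, state, theta=0.1):
    """One delta-update step: propagate only the input elements whose
    change since their last propagated value exceeds theta, and
    accumulate the partial matrix-vector product across timesteps."""
    delta = x - state["x_ref"]
    active = np.abs(delta) >= theta               # temporal sparsity mask
    state["x_ref"][active] = x[active]            # remember propagated values
    state["acc"] += W[:, active] @ delta[active]  # skip all inactive columns
    return state["acc"], active.mean()

# A slowly varying input stream: after the first timestep, most
# elements change too little to fire, so most columns of W are skipped.
rng = np.random.default_rng(1)
W = rng.standard_normal((256, 128)).astype(np.float32)
state = {"x_ref": np.zeros(128, dtype=np.float32),
         "acc": np.zeros(256, dtype=np.float32)}
x = rng.standard_normal(128).astype(np.float32)
for t in range(5):
    x += 0.02 * rng.standard_normal(128).astype(np.float32)
    z, frac = delta_matvec(W, x, state)
    print(f"t={t}: {frac:.0%} of inputs propagated")
```

Each skipped column saves both the weight fetch and the multiply-accumulates, which is why combining this temporal sparsity with CBTD's spatial sparsity compounds the savings.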


Related research:

01/07/2021
BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification
In this paper, first, a hardware-friendly pruning algorithm for reducing...

03/14/2022
Skydiver: A Spiking Neural Network Accelerator Exploiting Spatio-Temporal Workload Balance
Spiking Neural Networks (SNNs) are developed as a promising alternative ...

09/02/2023
Accelerating LSTM-based High-Rate Dynamic System Models
In this paper, we evaluate the use of a trained Long Short-Term Memory (...

12/05/2022
Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition
Long short-term memory (LSTM) is a type of powerful deep neural network ...

11/04/2019
LSTM-Sharp: An Adaptable, Energy-Efficient Hardware Accelerator for Long Short-Term Memory
The effectiveness of LSTM neural networks for popular tasks such as Auto...

12/25/2020
EdgeDRNN: Recurrent Neural Network Accelerator for Edge Inference
Low-latency, low-power portable recurrent neural network (RNN) accelerat...

10/16/2022
Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices
In this paper, we propose a data-model-hardware tri-design framework for...
