Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

07/21/2019
by Weiwen Jiang, et al.

Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with low batch sizes, FPGAs are expected to achieve further performance improvement. However, the performance gain of single-FPGA designs is obstructed by limited on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over single-FPGA designs. In implementing such systems, we identified two barriers that hinder us from achieving the design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) the insufficient bandwidth between off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which can support different kinds of DNNs. In this paper, we take Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology to effectively alleviate the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs can achieve a 3.48x speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
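The reported 3.48x speedup on 2 FPGAs is super-linear (more than 2x) because partitioning does more than halve the compute: once each FPGA's slice of a layer's weights fits in on-chip buffers, off-chip traffic drops sharply and the layer stops being memory-bound. The Python sketch below is a minimal roofline-style illustration of that capacity effect, not Super-LIP's actual system-level model; the function name layer_latency and all throughput, bandwidth, and buffer-size numbers are hypothetical, and inter-FPGA link cost is ignored for simplicity.

```python
# Toy latency model (hypothetical numbers, not the paper's model) showing why
# splitting a CNN layer across k FPGAs can yield super-linear speedup: once a
# device's weight slice fits on-chip, weights are loaded once instead of being
# re-fetched from DRAM, so the layer stops being memory-bound.

def layer_latency(macs, weight_bytes, act_bytes, k,
                  peak_macs_per_s=1e12,   # per-FPGA compute throughput (hypothetical)
                  dram_bw=4e9,            # effective off-chip bandwidth in B/s (hypothetical)
                  on_chip_bytes=4e6):     # per-FPGA on-chip buffer capacity (hypothetical)
    """Latency of one layer split evenly across k FPGAs (simplified)."""
    compute_t = (macs / k) / peak_macs_per_s
    w_slice = weight_bytes / k
    # Activations always cross the memory bus; weights only do if they spill.
    off_chip = act_bytes / k + (0 if w_slice <= on_chip_bytes else w_slice)
    memory_t = off_chip / dram_bw
    return max(compute_t, memory_t)      # assume compute and transfers overlap

# Example: a conv layer whose 8 MB of weights exceed one FPGA's buffers.
macs, weights, acts = 2e9, 8e6, 4e6
t1 = layer_latency(macs, weights, acts, k=1)
t2 = layer_latency(macs, weights, acts, k=2)
print(f"speedup with 2 FPGAs: {t1 / t2:.2f}x")
```

Under these toy numbers the layer is memory-bound on one FPGA but compute-bound on two, so the script prints a 3.00x speedup from a 2x resource increase; Super-LIP achieves the analogous effect with an accurate model and by routing the remaining traffic over inter-FPGA links instead of the memory bus.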


Related research

12/01/2018 · DeCoILFNet: Depth Concatenation and Inter-Layer Fusion based ConvNet Accelerator
01/06/2020 · AutoDNNchip: An Automated DNN Chip Predictor and Builder for Both FPGAs and ASICs
12/07/2022 · FPGA Implementation of Multi-Layer Machine Learning Equalizer with On-Chip Training
01/04/2019 · A Scalable Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Weight and Workload Balancing
11/05/2020 · Deep-Dup: An Adversarial Weight Duplication Attack Framework to Crush Deep Neural Network in Multi-Tenant FPGA
11/02/2020 · On the Impact of Partial Sums on Interconnect Bandwidth and Memory Accesses in a DNN Accelerator
03/26/2020 · Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud
