Balancing Efficiency and Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration

02/18/2020
by   Cong Guo, et al.

Research interest in specialized hardware accelerators for deep neural networks (DNNs) has spiked recently owing to their superior performance and efficiency. However, today's DNN accelerators primarily focus on accelerating specific "kernels" such as convolution and matrix multiplication, which are vital but only part of an end-to-end DNN-enabled application. Meaningful speedups over the entire application often require supporting computations that are massively parallel yet ill-suited to DNN accelerators. Integrating a general-purpose processor such as a CPU or a GPU incurs significant data-movement overhead and leads to resource under-utilization on the DNN accelerators. We propose Simultaneous Multi-mode Architecture (SMA), a novel architecture design and execution model that offers general-purpose programmability on DNN accelerators in order to accelerate end-to-end applications. The key to SMA is the temporal integration of the systolic execution model with the GPU-like SIMD execution model. SMA exploits the components shared between the systolic-array accelerator and the GPU, and provides a lightweight reconfiguration capability to switch between the two modes in-situ. SMA achieves up to 63% performance improvement over the baseline Volta architecture with TensorCore.
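To make the temporal-integration idea concrete, here is a minimal, purely illustrative Python sketch (not code from the paper): a single processing-element grid that runs a systolic-style matrix multiply, is reconfigured in place, and then runs a GPU-like SIMD element-wise kernel. All class and method names (`PEArray`, `switch_mode`, etc.) are hypothetical.

```python
# Illustrative sketch of SMA-style temporal mode switching: one shared
# PE grid executes in systolic mode, then reconfigures to SIMD mode.
# Names and structure are assumptions for exposition, not the paper's API.

class PEArray:
    def __init__(self, size):
        self.size = size
        self.mode = "systolic"

    def switch_mode(self, mode):
        # In SMA the reconfiguration is lightweight and done in-situ;
        # in this toy model it is just a flag flip on the shared grid.
        assert mode in ("systolic", "simd")
        self.mode = mode

    def systolic_matmul(self, a, b):
        # Output-stationary systolic dataflow (simplified): each PE (i, j)
        # accumulates a[i][k] * b[k][j] as operands stream through, one
        # wavefront of k per step.
        assert self.mode == "systolic"
        n = self.size
        acc = [[0] * n for _ in range(n)]
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    acc[i][j] += a[i][k] * b[k][j]
        return acc

    def simd_map(self, fn, xs):
        # GPU-like SIMD mode: every lane applies the same op to its element,
        # e.g. an activation function that a systolic array handles poorly.
        assert self.mode == "simd"
        return [fn(x) for x in xs]
```

A typical end-to-end fragment would multiply two matrices in systolic mode, switch modes on the same array, and apply a ReLU in SIMD mode, avoiding the data movement that offloading to a separate processor would incur.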


