Compute RAMs: Adaptable Compute and Storage Blocks for DL-Optimized FPGAs

07/19/2021
by Aman Arora, et al.

The configurable building blocks of current FPGAs – Logic blocks (LBs), Digital Signal Processing (DSP) slices, and Block RAMs (BRAMs) – make them efficient hardware accelerators for the rapidly changing world of Deep Learning (DL). Communication between these blocks happens through an interconnect fabric consisting of switching elements spread throughout the FPGA. In this paper, a new block, Compute RAM, is proposed. Compute RAMs provide highly parallel processing-in-memory (PIM) by combining computation and storage capabilities in one block. Compute RAMs can be integrated into the FPGA fabric just like the existing FPGA blocks, and they provide two modes of operation (storage or compute) that can be chosen dynamically. They reduce power consumption by reducing data movement, provide adaptable precision support, and increase the effective on-chip memory bandwidth. Compute RAMs also help increase the compute density of FPGAs. In our evaluation of addition, multiplication and dot-product operations across multiple data precisions (int4, int8 and bfloat16), we observe an average savings of 80% in energy and improvements in execution time starting at 20%. These gains are expected to carry over to full applications as well, making FPGAs more efficient, flexible, and performant accelerators.
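To make the abstract's "compute mode" concrete, the minimal Python sketch below illustrates the bit-serial, column-parallel style of in-array computation that processing-in-memory blocks like Compute RAMs build on. The transposed bit-plane layout, the per-column one-bit adders, and all function names here are illustrative assumptions drawn from the general PIM literature, not the paper's exact microarchitecture.

```python
import numpy as np

def to_bitplanes(values, bits):
    # Transposed layout (assumed): row b holds bit b of every operand,
    # so each RAM column stores one operand and acts as its own lane.
    vals = np.asarray(values, dtype=np.int64)
    return np.stack([(vals >> b) & 1 for b in range(bits)])  # (bits, lanes)

def from_bitplanes(planes):
    bits, _ = planes.shape
    return sum(planes[b].astype(np.int64) << b for b in range(bits))

def bit_serial_add(a_planes, b_planes):
    # Ripple-carry addition, one bit position per modeled "cycle":
    # serial over bit positions, but parallel across every column.
    bits, lanes = a_planes.shape
    out = np.zeros_like(a_planes)
    carry = np.zeros(lanes, dtype=np.int64)
    for b in range(bits):
        s = a_planes[b] + b_planes[b] + carry
        out[b] = s & 1
        carry = s >> 1
    return out

a = to_bitplanes([3, 7, 100, 15], bits=9)
b = to_bitplanes([1, 2, 27, 240], bits=9)
print(from_bitplanes(bit_serial_add(a, b)))  # [  4   9 127 255]
```

The sketch also shows why adaptable precision matters in this model: latency grows with operand width (an int4 add needs roughly half the cycles of an int8 add), while throughput grows with the number of columns operating in parallel, which is the source of the effective-bandwidth gains the abstract describes.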



Related research

03/23/2022 · CoMeFa: Compute-in-Memory Blocks for FPGAs
Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive ...

02/27/2020 · Optimizing Memory-Access Patterns for Deep Learning Accelerators
Deep learning (DL) workloads are moving towards accelerators for faster ...

04/23/2020 · Using DSP Slices as Content-Addressable Update Queues
Content-Addressable Memory (CAM) is a powerful abstraction for building ...

04/05/2021 · Near-Precise Parameter Approximation for Multiple Multiplications on A Single DSP Block
A multiply-accumulate (MAC) operation is the main computation unit for D...

05/19/2021 · Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA
Deep convolutional neural networks have achieved remarkable progress in ...

12/30/2018 · ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration of Learning
Memory bandwidth bottleneck is a major challenge in processing machine ...

09/04/2019 · Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures
We study the problem of multiplying two bit matrices with entries either...
