ORIGAMI: A Heterogeneous Split Architecture for In-Memory Acceleration of Learning

12/30/2018
by Hajar Falahati, et al.

The memory bandwidth bottleneck is a major challenge in processing machine learning (ML) algorithms. In-memory acceleration has the potential to address this problem; however, it must overcome two challenges. First, an in-memory accelerator should be general enough to support a large set of different ML algorithms. Second, it should be efficient enough to utilize the available bandwidth while meeting the limited power and area budgets of the logic layer of a 3D-stacked memory. We observe that previous work fails to address both challenges simultaneously. We propose ORIGAMI, which uses a heterogeneous set of in-memory accelerators to support the compute demands of different ML algorithms, together with an off-the-shelf compute platform (e.g., FPGA, GPU, or TPU) to utilize bandwidth without violating the strict area and power budgets. ORIGAMI employs a pattern-matching technique to identify similar computation patterns across ML algorithms and extracts a compute engine for each pattern. These compute engines constitute heterogeneous accelerators integrated on the logic layer of a 3D-stacked memory, and in combination they can execute any ML algorithm. To utilize the available bandwidth without violating the area and power budgets of the logic layer, ORIGAMI includes a computation-splitting compiler that divides an ML algorithm between the in-memory accelerators and an out-of-memory platform in a balanced way and with minimal inter-communication. The combination of pattern matching and split execution offers a new design point for the acceleration of ML algorithms. Evaluation results across 12 popular ML algorithms show that ORIGAMI outperforms a state-of-the-art accelerator with 3D-stacked memory in performance and energy-delay product (EDP) by 1.5x and 29x (up to 1.6x and 31x), respectively. Furthermore, the results are within a small margin of a system that has unlimited compute resources on the logic layer of a 3D-stacked memory.
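To convey the flavor of the computation-splitting idea described in the abstract, the sketch below shows a minimal, hypothetical two-way partitioner: each kernel of an ML workload is assigned either to the in-memory compute engines or to an external platform, trading off load balance against the data that must cross the split. This is not the actual ORIGAMI compiler; all names (Kernel, split_workload) and cost numbers are illustrative assumptions.

```python
# Minimal sketch of a balanced, communication-aware split between
# in-memory engines ("PIM") and an external platform ("EXT").
# Hypothetical example only, not the ORIGAMI implementation.
from dataclasses import dataclass, field

@dataclass
class Kernel:
    name: str
    pattern: str           # matched compute pattern, e.g. "GEMM", "CONV", "REDUCE"
    work: float             # estimated compute cost
    out_bytes: float         # size of data this kernel produces
    deps: list = field(default_factory=list)   # names of producer kernels

def split_workload(kernels, balance_weight=1.0, comm_weight=1.0):
    """Greedily place each kernel on the side with the lowest combined cost."""
    placement, load = {}, {"PIM": 0.0, "EXT": 0.0}
    by_name = {k.name: k for k in kernels}
    for k in kernels:  # kernels assumed to arrive in topological order
        cost = {}
        for side in ("PIM", "EXT"):
            other = "EXT" if side == "PIM" else "PIM"
            # imbalance: how uneven the two sides become if k goes to `side`
            imbalance = abs((load[side] + k.work) - load[other])
            # communication: bytes produced by dependencies placed on the other side
            comm = sum(by_name[d].out_bytes for d in k.deps
                       if placement.get(d) == other)
            cost[side] = balance_weight * imbalance + comm_weight * comm
        side = min(cost, key=cost.get)
        placement[k.name] = side
        load[side] += k.work
    return placement, load

if __name__ == "__main__":
    # Toy three-kernel workload: conv -> activation -> fully connected (GEMM).
    workload = [
        Kernel("conv1", "CONV", work=8.0, out_bytes=4.0),
        Kernel("relu1", "REDUCE", work=1.0, out_bytes=4.0, deps=["conv1"]),
        Kernel("fc1", "GEMM", work=6.0, out_bytes=1.0, deps=["relu1"]),
    ]
    placement, load = split_workload(workload)
    print(placement)  # which kernels run in memory vs. externally
    print(load)       # estimated work on each side
```

A real splitting compiler would use measured or modeled kernel costs and the actual bandwidth, area, and power budgets of the logic layer; the greedy heuristic here merely illustrates how balance and inter-communication can be traded off per kernel.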


