CUDA-PIM: End-to-End Integration of Digital Processing-in-Memory from High-Level C++ to Microarchitectural Design

08/27/2023
by Orian Leitersdorf, et al.

Digital processing-in-memory (PIM) architectures mitigate the memory wall problem by performing parallel bitwise operations directly within memory. Recent works have demonstrated their algorithmic potential for accelerating data-intensive applications; however, a significant gap remains in the programming model and microarchitectural design. This gap is further widened by the emerging model of partitions, which significantly complicates control and periphery. Therefore, inspired by NVIDIA CUDA, this paper provides an end-to-end architectural integration of digital memristive PIM, from an abstract high-level C++ programming interface for vector operations down to the low-level microarchitecture. We begin by proposing an efficient microarchitecture and instruction set architecture (ISA) that bridge the gap between the low-level control periphery and an abstraction of PIM parallelism into warps and threads. We subsequently propose a PIM compilation library that converts high-level C++ to ISA instructions, and a PIM driver that translates ISA instructions into PIM micro-operations. This drastically simplifies the development of PIM applications and enables PIM to be integrated into larger existing C++ CPU/GPU programs for heterogeneous computing. Lastly, we present an efficient GPU-accelerated simulator for the proposed PIM microarchitecture. Although slower than a theoretical PIM chip, this simulator provides an accessible platform for developers to start executing and debugging PIM algorithms. To validate our approach, we implement state-of-the-art PIM-based matrix-operation and FFT algorithms as case studies. These examples demonstrate drastically simplified development without compromising performance, showing the potential and significance of CUDA-PIM.
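The layering described in the abstract (a C++ vector interface, ISA instructions, and a driver that emits micro-operations) can be illustrated with a small host-side toy model. This is only a sketch under stated assumptions and is not the paper's library, ISA, or simulator: every name here (`MicroOp`, `ToyArray`, `lowerAdd`, `addOnToyPim`) is hypothetical, the 8-bit word width and register layout are arbitrary, and the choice of NOR as the only micro-operation is an assumption made for illustration. Each row of the toy array plays the role of one thread, and every micro-operation is applied to all rows at once.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy model, NOT the CUDA-PIM API.
constexpr int kBits = 8;                 // assumed word width per register
constexpr int kRegs = 3;                 // assumed registers per row
constexpr int kScratch = kRegs * kBits;  // first scratch column
constexpr int kCols = kScratch + 8;      // registers plus 8 scratch columns

// One micro-operation: column out = NOR(column a, column b) in every row.
struct MicroOp { int a, b, out; };

// "Driver" level: lower an ISA-style ADD dst, a, b into NOR micro-operations.
// Uses a 9-NOR full adder per bit; bit 0 of a register is its least significant bit.
std::vector<MicroOp> lowerAdd(int dst, int a, int b) {
    const int T = kScratch, X = T + 6, C = T + 7;
    std::vector<MicroOp> ops;
    ops.push_back({a * kBits, a * kBits, T});  // T = NOT a0
    ops.push_back({a * kBits, T, C});          // C = NOR(a0, NOT a0) = 0 (clear carry)
    for (int i = 0; i < kBits; ++i) {
        const int ai = a * kBits + i, bi = b * kBits + i, di = dst * kBits + i;
        ops.push_back({ai, bi, T});            // T0 = NOR(a, b)
        ops.push_back({ai, T, T + 1});
        ops.push_back({bi, T, T + 2});
        ops.push_back({T + 1, T + 2, X});      // X = XNOR(a, b)
        ops.push_back({X, C, T + 3});          // T3 = NOR(X, carry)
        ops.push_back({X, T + 3, T + 4});
        ops.push_back({C, T + 3, T + 5});
        ops.push_back({T + 4, T + 5, di});     // sum = XNOR(X, carry) = a ^ b ^ carry
        ops.push_back({T, T + 3, C});          // carry out = NOR(T0, T3)
    }
    return ops;
}

// "Memory" level: a bit matrix in which each row acts as one thread.
struct ToyArray {
    std::vector<std::vector<std::uint8_t>> cells;
    explicit ToyArray(int rows) : cells(rows, std::vector<std::uint8_t>(kCols, 0)) {}

    void write(int row, int reg, unsigned v) {
        for (int i = 0; i < kBits; ++i)
            cells[row][reg * kBits + i] = static_cast<std::uint8_t>((v >> i) & 1u);
    }
    unsigned read(int row, int reg) const {
        unsigned v = 0;
        for (int i = 0; i < kBits; ++i)
            v |= static_cast<unsigned>(cells[row][reg * kBits + i]) << i;
        return v;
    }
    // Apply each micro-operation to every row before moving to the next one.
    void run(const std::vector<MicroOp>& ops) {
        for (const MicroOp& op : ops)
            for (auto& row : cells)
                row[op.out] = !(row[op.a] | row[op.b]);
    }
};

// "Library" level: element-wise vector addition (mod 2^kBits) on the toy array.
// Assumes x and y have the same length.
std::vector<unsigned> addOnToyPim(const std::vector<unsigned>& x,
                                  const std::vector<unsigned>& y) {
    ToyArray pim(static_cast<int>(x.size()));
    for (int r = 0; r < static_cast<int>(x.size()); ++r) {
        pim.write(r, 0, x[r]);
        pim.write(r, 1, y[r]);
    }
    pim.run(lowerAdd(2, 0, 1));  // one ADD instruction, lowered once for all rows
    std::vector<unsigned> out;
    for (int r = 0; r < static_cast<int>(x.size()); ++r) out.push_back(pim.read(r, 2));
    return out;
}
```

In this toy model, `addOnToyPim({1, 200, 255}, {2, 100, 1})` returns `{3, 44, 0}` because results wrap modulo 256. The micro-operation sequence produced by `lowerAdd` has the same length (74 operations) regardless of how many rows the array has, which is the sense in which the rows execute in parallel.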

research
01/26/2023

A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX Code

Various kinds of applications take advantage of GPUs through automation ...
research
11/18/2021

NetQASM – A low-level instruction set architecture for hybrid quantum-classical programs in a quantum internet

We introduce NetQASM, a low-level instruction set architecture for quant...
research
03/10/2021

ELLA: Exploration through Learned Language Abstraction

Building agents capable of understanding language instructions is critic...
research
02/04/2019

Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers

Recent developments in engineering and algorithms have made real-world a...
research
10/21/2022

Programming Bare-Metal Accelerators with Heterogeneous Threading Models: A Case Study of Matrix-3000

As the hardware industry moves towards using specialized heterogeneous m...
research
06/09/2022

PartitionPIM: Practical Memristive Partitions for Fast Processing-in-Memory

Digital memristive processing-in-memory overcomes the memory wall throug...
research
08/14/2016

Machine Learning with Memristors via Thermodynamic RAM

Thermodynamic RAM (kT-RAM) is a neuromemristive co-processor design base...
