CXLMemUring: A Hardware Software Co-design Paradigm for Asynchronous and Flexible Parallel CXL Memory Pool Access

by Yiwei Yang, et al.

CXL is an emerging technology for expanding memory for both host CPUs and device accelerators through a load/store interface. Extending memory coherency to the PCIe root complex makes co-design more flexible, because coherent memory can be accessed using compute capability near the device. As demand grows for capacity at tolerable latency and bandwidth, we need a new hardware-software co-design approach that offloads synthesized memory operations to the CXL endpoint, the CXL switch, or cores near the CXL root complex, such as Intel DSA, to fetch data; meanwhile, the CPU or accelerators can perform other computation. When the CXL load completes, the data is placed in L1 if capacity permits, and the in-core ROB is notified through a mailbox and resumes the computation in the previous hardware context. Because the distance (timing window) of the load instruction sequence is unknown, a long-running job requires a profile-guided approach to generating and adaptively updating the offloaded code. We propose to evaluate CXLMemUring on a modified BOOMv3 with added in-core logic and CXL endpoint access simulated using CHI. We will also add a weaker RISC-V core near the endpoint for code offloading, and code generation will be based on program analysis combined with traditional profile-guided methods.
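The submit, offload, notify, and resume flow described above can be pictured with a toy software model. This is only an illustrative sketch, not the paper's hardware design: the names `OffloadRing`, `LoadRequest`, `Completion`, and `sum_remote` are hypothetical, the memory pool is a plain dictionary, and the offload engine runs inline rather than concurrently.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LoadRequest:
    tag: int      # identifies the computation waiting on this load (hypothetical)
    address: int  # address in the simulated memory pool


@dataclass
class Completion:
    tag: int
    value: int


class OffloadRing:
    """Toy submission/completion queue pair; an assumption for illustration."""

    def __init__(self, pool):
        self.pool = pool            # dict standing in for the CXL memory pool
        self.submissions = deque()  # requests handed to the offload engine
        self.completions = deque()  # results signalled back to the host

    def submit(self, tag, address):
        # Host side: queue a load without waiting for it.
        self.submissions.append(LoadRequest(tag, address))

    def run_offload_engine(self):
        # Stand-in for the endpoint, switch, or near-root-complex core.
        # In this single-threaded model it runs inline, not concurrently.
        while self.submissions:
            req = self.submissions.popleft()
            self.completions.append(Completion(req.tag, self.pool[req.address]))

    def poll(self):
        # Host side: take one completion notice, or None if nothing is ready.
        return self.completions.popleft() if self.completions else None


def sum_remote(pool, addresses):
    ring = OffloadRing(pool)
    for tag, addr in enumerate(addresses):
        ring.submit(tag, addr)
    local_work = sum(range(10))  # unrelated host work done before collecting results
    ring.run_offload_engine()
    total = 0
    while (done := ring.poll()) is not None:
        total += done.value      # resume the computation that needed the data
    return total, local_work
```

In this model the host queues every load up front, does unrelated work, and only then drains the completions. That ordering is the point of the asynchronous interface; real overlap would need the offload engine to run concurrently with the host.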




