Helper Without Threads: Customized Prefetching for Delinquent Irregular Loads

09/01/2020
by Karthik Sankaranarayanan, et al.

The growing memory footprints of cloud and big data applications mean that data center CPUs can spend significant time waiting for memory. An attractive approach to improving performance in such centralized compute settings is to employ prefetchers that are customized per application, where gains can be easily scaled across thousands of machines. Helper thread prefetching is such a technique but has yet to achieve wide adoption since it requires spare thread contexts or special hardware/firmware support. In this paper, we propose an inline software prefetching technique that overcomes these restrictions by inserting the helper code into the main thread itself. Our approach is complementary to and does not interfere with existing hardware prefetchers since we target only delinquent irregular load instructions (those with no constant or striding address patterns). For each chosen load instruction, we generate and insert a customized software prefetcher extracted from and mimicking the application's dataflow, all without access to the application source code. For a set of irregular workloads that are memory-bound, we demonstrate up to 2X single-thread performance improvement on recent high-end hardware (Intel Skylake) and up to 83% speedup over a helper thread implementation on the same hardware, due to the absence of thread spawning overhead.
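To make the idea of an inline prefetcher for an irregular load concrete, the following is a minimal source-level sketch in C. It is not the paper's tool, which works without access to source code; it only illustrates the general pattern of computing a future address from the same dataflow as the load and prefetching it in the main thread. The function names, the indirect-sum workload, and `PREFETCH_DISTANCE` are hypothetical, and `__builtin_prefetch` is assumed to be available (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance, in loop iterations. A real value would
 * have to be tuned per workload and machine. */
#define PREFETCH_DISTANCE 16

/* Baseline: data[idx[i]] is an irregular load. Its address depends on the
 * contents of idx, so it follows no constant or strided pattern. */
int64_t sum_indirect(const int64_t *data, const uint32_t *idx, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[idx[i]];
    }
    return sum;
}

/* Same loop with the "helper" code inlined into the main thread: each
 * iteration recomputes the address that a later iteration will load and
 * issues a prefetch for it. */
int64_t sum_indirect_prefetch(const int64_t *data, const uint32_t *idx,
                              size_t n) {
    assert(n == 0 || (data != NULL && idx != NULL));
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Mimic the load's dataflow: idx[...] -> &data[...].
             * Arguments: address, 0 = read, 3 = high temporal locality. */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

Both functions return the same result; the prefetch is only a hint and does not change program semantics. The bounds check keeps the extra read of `idx` inside the array, which matters because the inlined helper code runs in the main thread and must not fault.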

research
08/29/2022

Improving the Efficiency of OpenCL Kernels through Pipes

In an effort to lower the barrier to the adoption of FPGAs by a broader ...
research
07/14/2020

Irregular Accesses Reorder Unit: Improving GPGPU Memory Coalescing for Graph-Based Workloads

GPGPU architectures have become established as the dominant parallelizat...
research
04/11/2023

High-performance and Scalable Software-based NVMe Virtualization Mechanism with I/O Queues Passthrough

NVMe(Non-Volatile Memory Express) is an industry standard for solid-stat...
research
10/18/2020

Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages

We propose Atomic Active Messages (AAM), a mechanism that accelerates ir...
research
07/15/2020

Auto Adaptive Irregular OpenMP Loops

OpenMP is a standard for the parallelization due to the ease in programm...
research
05/19/2019

Safe and Chaotic Compilation for Hidden Deterministic Hardware Aliasing

Hardware aliasing occurs when the same logical address can access differ...
research
03/09/2018

ROLP: Runtime Object Lifetime Profiling for Big Data Memory Management

Low latency services such as credit-card fraud detection and website tar...