A NUMA-Aware Provably-Efficient Task-Parallel Platform Based on the Work-First Principle

06/28/2018
by Justin Deters, et al.

Task parallelism is designed to simplify parallel programming. When a task parallel program executes on modern NUMA architectures, however, it can fail to scale due to a phenomenon called work inflation, in which the overall time that multiple cores spend doing useful work exceeds the time required to perform the same work on one core, owing to effects that arise only during parallel execution, such as additional cache misses, remote memory accesses, and memory bandwidth contention. Work inflation can be mitigated by co-locating the computation with the data, but doing so is nontrivial for task parallel programs. First, by design, the scheduling of task parallel programs is automated, giving the user little control over where the computation is performed. Second, such platforms tend to employ work stealing, which provides strong theoretical guarantees, but whose randomized load-balancing protocol does not distinguish between work items that are far away and ones that are close by. In this work, we propose NUMA-WS, a NUMA-aware task parallel platform engineered based on the work-first principle. By abiding by the work-first principle, we obtain a platform that is work efficient, provides the same theoretical guarantees as the classic work-stealing scheduler, and mitigates work inflation. Furthermore, we implemented a prototype platform by modifying Intel's Cilk Plus runtime system and empirically demonstrate that the resulting system is work efficient and scalable.
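To make the locality issue concrete, below is a minimal C++ sketch of locality-biased victim selection in a work-stealing scheduler: a thief prefers victims on its own NUMA node and falls back to remote nodes only when no local victim exists. This is an illustration of the general idea the abstract alludes to, not NUMA-WS's actual protocol (which preserves the work-first principle and the theoretical guarantees of classic work stealing); the names Worker, choose_victim, and the 8-worker, 2-node machine in main are assumptions made for the example.

// Illustrative sketch only: NUMA-aware victim selection for work stealing.
// Not the NUMA-WS algorithm; names and topology are hypothetical.
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

struct Worker {
    int id;
    int numa_node;   // NUMA node this worker's core belongs to
};

// Pick a steal victim, trying same-node workers first so that stolen work
// stays close to the data it is likely to touch, then any remote worker.
int choose_victim(const std::vector<Worker>& workers, int thief,
                  std::mt19937& rng) {
    const int my_node = workers[thief].numa_node;
    std::vector<int> local, remote;
    for (const Worker& w : workers) {
        if (w.id == thief) continue;
        (w.numa_node == my_node ? local : remote).push_back(w.id);
    }
    auto pick = [&](const std::vector<int>& v) {
        std::uniform_int_distribution<std::size_t> d(0, v.size() - 1);
        return v[d(rng)];
    };
    if (!local.empty()) return pick(local);   // co-locate computation with data
    if (!remote.empty()) return pick(remote); // fall back to a remote node
    return -1;                                // no victim available
}

int main() {
    // Hypothetical 8-worker machine with two NUMA nodes
    // (workers 0-3 on node 0, workers 4-7 on node 1).
    std::vector<Worker> workers;
    for (int i = 0; i < 8; ++i) workers.push_back({i, i / 4});
    std::mt19937 rng(42);
    int victim = choose_victim(workers, /*thief=*/0, rng);
    std::printf("worker 0 steals from worker %d\n", victim);
    return 0;
}

Purely randomized stealing, by contrast, would pick uniformly among all other workers, which is what makes it oblivious to the distance between the thief and the stolen work.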

