Scheduling computations with provably low synchronization overheads

10/24/2018
by Guilherme Rito, et al.

Work Stealing has been a very successful algorithm for scheduling parallel computations, and is known to achieve high performance even for computations exhibiting fine-grained parallelism. In Work Stealing, each processor owns a deque that it uses to keep track of its assigned work. To ensure proper load balancing, each processor's deque is accessible not only to its owner but also to other processors that may be searching for work. However, due to the concurrent nature of deques, it has been proven that synchronization is required even when processors operate locally on their own deques. Moreover, many studies have found that synchronization-related overheads often account for a significant portion of the total execution time. For that reason, many efforts have been made to reduce the synchronization overheads incurred by traditional schedulers. In this paper we present and analyze a variant of Work Stealing that avoids most synchronization overheads by keeping processors' deques entirely private by default and exposing work only when requested by thieves. Consider any computation with work T_1 and critical-path length T_∞ executed by P processors using our scheduler. Our analysis shows that the expected execution time is O(T_1/P + T_∞), and that the expected synchronization overhead incurred during the execution is at most O((C_CAS + C_MFence)·P·T_∞), where C_CAS and C_MFence respectively denote the maximum cost of executing a Compare-And-Swap instruction and a Memory Fence instruction. The latter bound is an order-of-magnitude improvement over state-of-the-art Work Stealing algorithms that use concurrent deques.
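For intuition, the sketch below illustrates the general private-deque idea the abstract describes, not the paper's actual algorithm: each worker pushes and pops work on a purely private deque, a thief posts a steal request through a single Compare-And-Swap, and the victim exposes a task only when it notices a pending request. All names (Worker, poll_requests, try_steal, NUM_WORKERS) and the exact handshake are illustrative assumptions; release/acquire stores stand in for the explicit memory fences mentioned in the bound.

```cpp
// A minimal sketch (not the paper's algorithm): private deques plus an
// explicit request/answer handshake. Hypothetical names throughout.
#include <array>
#include <atomic>
#include <deque>
#include <functional>
#include <iostream>
#include <thread>

using Task = std::function<void()>;
constexpr int NO_REQUEST = -1;
constexpr int NUM_WORKERS = 2;

struct Worker {
    std::deque<Task> tasks;                // private deque: only its owner touches it
    std::atomic<int> request{NO_REQUEST};  // a thief posts its id here via Compare-And-Swap
    std::atomic<Task*> transfer{nullptr};  // a victim deposits a stolen task here for this worker
    std::atomic<bool> waiting{false};      // set while this worker waits for a steal answer
};

std::array<Worker, NUM_WORKERS> workers;

// Owner-side operations: plain deque accesses, no atomics, no fences.
void push(int id, Task t) { workers[id].tasks.push_back(std::move(t)); }

bool pop(int id, Task& out) {
    auto& d = workers[id].tasks;
    if (d.empty()) return false;
    out = std::move(d.back());
    d.pop_back();
    return true;
}

// Owner-side: called periodically. This is the only place the owner pays for
// synchronization, and only if a thief has actually posted a request.
void poll_requests(int id) {
    Worker& me = workers[id];
    int thief = me.request.load(std::memory_order_acquire);
    if (thief == NO_REQUEST) return;
    Task* given = nullptr;
    if (!me.tasks.empty()) {               // expose the oldest private task
        given = new Task(std::move(me.tasks.front()));
        me.tasks.pop_front();
    }
    workers[thief].transfer.store(given, std::memory_order_relaxed);
    workers[thief].waiting.store(false, std::memory_order_release);  // answer the thief
    me.request.store(NO_REQUEST, std::memory_order_release);         // accept new requests
}

// Thief-side: one Compare-And-Swap to post the request, then wait until the
// victim answers with either a task or nullptr (nothing to give).
bool try_steal(int my_id, int victim_id, Task& out) {
    Worker& me = workers[my_id];
    me.transfer.store(nullptr, std::memory_order_relaxed);
    me.waiting.store(true, std::memory_order_relaxed);
    int expected = NO_REQUEST;
    if (!workers[victim_id].request.compare_exchange_strong(
            expected, my_id, std::memory_order_acq_rel))
        return false;                      // another request is still being served
    while (me.waiting.load(std::memory_order_acquire))
        std::this_thread::yield();         // victim answers in poll_requests()
    Task* got = me.transfer.load(std::memory_order_relaxed);
    if (got == nullptr) return false;
    out = std::move(*got);
    delete got;
    return true;
}

int main() {
    // Worker 0 owns three tasks and only answers requests; worker 1 steals them all.
    for (int i = 0; i < 3; ++i)
        push(0, [i] { std::cout << "stolen task " << i << " ran\n"; });

    std::thread thief([] {
        Task t;
        for (int stolen = 0; stolen < 3; )
            if (try_steal(1, 0, t)) { t(); ++stolen; }
    });

    while (!workers[0].tasks.empty())
        poll_requests(0);                  // keep serving requests until the deque drains
    thief.join();
}
```

In this sketch the owner's common path (push, pop) performs no atomic operations at all; synchronization is paid only inside poll_requests and try_steal, i.e., proportionally to the number of steal requests, which is the intuition behind a bound of the form O((C_CAS + C_MFence)·P·T_∞). Compile with -pthread.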


