The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

by Ayesha Afzal, et al.

The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to capture the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are involved. In this paper, we investigate the effect of "bottleneck evasion" and how it can lead to automatic overlap of communication overhead with computation. Bottleneck evasion leads to a gradual loss of the initial bulk-synchronous behavior of a parallel code, so that its processes become desynchronized. This occurs most prominently in memory-bound programs. We therefore choose memory-bound benchmark and application codes, specifically an MPI-augmented STREAM Triad, sparse matrix-vector multiplication, and a collective-avoiding Chebyshev filter diagonalization code, to demonstrate the consequences of desynchronization on two different supercomputing platforms. We investigate the role of idle waves as possible triggers for desynchronization and show the impact of automatic asynchronous communication for a spectrum of code properties and parameters, such as saturation point, matrix structures, domain decomposition, and communication concurrency. Our findings reveal how eliminating synchronization points (such as collective communication or barriers) precipitates performance improvements that go beyond what can be expected by simply subtracting the overhead of the collective from the overall runtime.
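To make the notion of an idle wave more concrete, the following is a minimal toy sketch in Python and not the authors' model. It assumes an open chain of ranks in which each rank may start an iteration only after it and its direct neighbors have finished the previous one, a fixed execution time per iteration, zero communication cost, no system noise, and no memory-bandwidth saturation. The function name `simulate_idle_wave` and all of its parameters are hypothetical and chosen for illustration only.

```python
def simulate_idle_wave(n_ranks=8, n_iters=6, t_exec=1.0, delay=3.0, delayed_rank=0):
    """Toy model (assumption, not from the paper): return finish[k][i],
    the time at which rank i finishes iteration k."""
    finish = []
    prev = [0.0] * n_ranks  # all ranks start synchronized at t = 0
    for k in range(n_iters):
        cur = []
        for i in range(n_ranks):
            # A rank waits for itself and its nearest neighbors (open chain).
            neighbors = [j for j in (i - 1, i, i + 1) if 0 <= j < n_ranks]
            start = max(prev[j] for j in neighbors)
            # Inject a one-off delay on a single rank in the first iteration.
            extra = delay if (k == 0 and i == delayed_rank) else 0.0
            cur.append(start + t_exec + extra)
        finish.append(cur)
        prev = cur
    return finish


if __name__ == "__main__":
    for k, row in enumerate(simulate_idle_wave()):
        print(k, row)
```

In this simplified setting the delay injected on one rank reaches one additional neighbor per iteration, which is the basic propagation behavior of an idle wave. The sketch deliberately leaves out the effects the abstract is concerned with: a shared memory bottleneck, desynchronization, and the resulting overlap of communication with computation.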



