In-Place Parallel-Partition Algorithms using Exclusive-Read-and-Write Memory: An In-Place Algorithm With Provably Optimal Cache Behavior
We present an in-place algorithm for the parallel partition problem that has linear work and polylogarithmic span. The algorithm uses only exclusive read/write shared variables, and can be implemented using parallel-for-loops without any additional concurrency considerations (i.e., the algorithm is EREW). A key feature of the algorithm is that it exhibits provably optimal cache behavior, up to small-order factors. We also present a second in-place EREW algorithm that has linear work and span O(log n ·loglog n), which is within an O(loglog n) factor of the optimal span. By using this low-span algorithm as a subroutine within the cache-friendly algorithm, we are able to obtain a single EREW algorithm that combines their theoretical guarantees: the algorithm achieves span O(log n ·loglog n) and optimal cache behavior. As an immediate consequence, we also get an in-place EREW quicksort algorithm with work O(n log n), span O(log^2 n ·loglog n). Whereas the standard EREW algorithm for parallel partitioning is memory-bandwidth bound on large numbers of cores, our cache-friendly algorithm is able to achieve near-ideal scaling in practice by avoiding the memory-bandwidth bottleneck. The algorithm's performance is comparable to that of the Blocked Strided Algorithm of Francis, Pannan, Frias, and Petit, which is the previous state-of-the art for parallel EREW sorting algorithms, but which lacks theoretical guarantees on its span and cache behavior.
READ FULL TEXT