## 1 Introduction

Quicksort [6] is often used in the classroom as an example of a sorting algorithm with obvious potential for parallelization on different types of parallel computers, and with enough obstacles to make the discussion instructive. Still, distributed memory parallel Quicksort is of practical relevance, being fastest for certain ranges of (smaller) problem sizes and numbers of processors [1, 2].

This note presents two new parallel variants of the Quicksort scheme:
select a pivot, partition elements around pivot, recurse on two
disjoint sets of elements. A distributed memory implementation needs
to efficiently parallelize both the pivot selection and the
partitioning step in order to let the two recursive invocations
proceed concurrently. In standard implementations, partitioning
usually involves exchanging elements between neighboring processors in
a hypercube communication pattern. We observe that this explicit
element exchange is not necessary; it suffices instead to distribute
the chosen pivots over the involved processors. This leads to two new
*exchange-free* parallel Quicksort variants with a cost tradeoff
between element exchange and pivot distribution. We discuss
implementations of the two variants using the *Message-Passing
Interface* (MPI) [9], and compare these to standard
implementations of parallel Quicksort. Experiments on a medium scale
cluster show that the exchange-free variants can be faster than the
standard pairwise exchange variants when the number of elements per
process is not too small. The two approaches can be combined for a
smooth transition between exchange-based and exchange-free Quicksort.

Using MPI terminology, we let $p$ denote the number of MPI
*processes* that will be mapped to physical processor(core)s. For
the Quicksort variants discussed here, $p$ must be a power of two. MPI
processes are *ranked* consecutively, $0, 1, \ldots, p-1$. We let $n$
denote the total number of input elements, and assume that these
are initially distributed evenly, possibly
randomized [1], over the processes, such that
each process has roughly $n/p$ elements. Elements may be large and
complex and hence expensive to exchange between processes; but all
have a key from some ordered domain with a comparison function that
can be evaluated in $O(1)$ time. For each process, input and output
elements are stored consecutively in process local arrays. For each
process, the number of output elements should be close to the number
of input elements, but the actual load balance depends on the quality of
the selected pivots. The output elements for each process
must be sorted, and all output elements of process $i$ must be smaller than or
equal to all output elements of process $i+1$, for $0\le i<p-1$.
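
As a concrete illustration of these assumptions, a process-local element might be represented as follows in C; the type and names are illustrative, not taken from the implementation discussed later. The $O(1)$ key comparison is exactly what the C library qsort() used below expects.

```c
#include <stdlib.h>

/* Illustrative element type: a key from an ordered domain plus a
 * possibly large payload that makes exchanging elements expensive. */
typedef struct {
    double key;       /* sorting key from an ordered domain */
    char payload[48]; /* stands in for large, complex element data */
} element_t;

/* O(1) comparison function on keys, usable with qsort(). */
static int element_cmp(const void *a, const void *b)
{
    double ka = ((const element_t *)a)->key;
    double kb = ((const element_t *)b)->key;
    return (ka > kb) - (ka < kb);
}
```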

## 2 Standard, distributed memory Quicksort and HyperQuicksort

Standard, textbook implementations of parallel Quicksort for distributed memory systems work roughly as follows [11, 4, 8, 15, 16]. A global pivot is chosen by some means and distributed over the $p$ processes, after which the processes all perform a local partitioning of their input elements. The processes pair up such that process $i$ is paired with process $i\oplus p/2$ (with $\oplus$ denoting bitwise exclusive-or), and process pairs exchange data elements such that all elements smaller than (or equal to) the pivot end up at the lower ranked process, and elements larger than (or equal to) the pivot at the higher ranked process. After this, the set of processes is split into two groups, those with rank lower than $p/2$ and those with larger rank. The algorithm is invoked recursively on these two sets of processes, and terminates with a local sorting step when each process belongs to a singleton set of processes.
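
The rank arithmetic and the process-local part of one partitioning round can be sketched as follows in C; the pairwise MPI exchange itself is omitted, and the function names are illustrative:

```c
#include <stddef.h>

/* Partner of process `rank` in the pairwise exchange among `nprocs`
 * processes (nprocs a power of two): flip the highest rank bit. */
static int exchange_partner(int rank, int nprocs)
{
    return rank ^ (nprocs / 2);
}

/* In-place two-way partition of a[0..n-1] around `pivot`.
 * Returns the number of elements <= pivot (the low part). */
static size_t partition_around(double *a, size_t n, double pivot)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        if (a[lo] <= pivot) {
            lo++;
        } else {
            hi--;
            double t = a[lo]; a[lo] = a[hi]; a[hi] = t;
        }
    }
    return lo;
}
```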

Assuming that pivots close to the median element can be effectively
determined, the communication volume for the element exchange over all
recursive calls is $O(n\log p)$. With linear time communication costs,
the exchange time per process is $O(\frac{n}{p}\log p)$. Global pivot
selection and distribution is normally done by the processes locally
selecting a (sample of) pivot(s) and agreeing on a global pivot by
means of a suitable collective operation. If we assume the cost for
this to be $O(s+\log p)$, where $s$, $s\geq 1$, is the size of the pivot
sample per process, the total cost for the pivot selection becomes
$O(s\log p+\log^2 p)$. Some textbook implementations simply use the
local pivot from some designated process which is distributed by an
MPI_Bcast operation [11, 4]; others
use local pivots from all processes from which a global pivot closer
to the median is determined by an MPI_Allreduce-like
operation [1, 14]. Either of these takes
$O(\log p)$ time in a linear communication cost model, see,
e.g. [12]. Before the recursive calls, the set of MPI
processes is split in two, which can conveniently be done using the
collective MPI_Comm_create operation, and the recursive calls simply
consist in each process recursing on the subset of processes to which
it belongs. Ideally, MPI_Comm_create takes $O(\log p)$ time. (It can
be assumed that both MPI_Comm_create and MPI_Comm_create_group are
faster than the alternative MPI_Comm_split in any reasonable MPI
library implementation; if not, a better implementation of
MPI_Comm_create can trivially be given in terms of MPI_Comm_split.
For evidence, see, e.g. [2].) At the
end of the recursion, each process locally sorts (close to) $n/p$
elements. The best, overall running time for this parallel Quicksort
implementation becomes
$O(\frac{n}{p}\log\frac{n}{p}+\frac{n}{p}\log p+\log^2 p)=O(\frac{n}{p}\log n+\log^2 p)$,
assuming a small (constant) sample size is used,
with linear speed-up over sequential Quicksort when $p$
is in $O(n/\log n)$. We refer to this implementation variant
as standard, *parallel Quicksort*.

Wagar [16] observed that much better pivot selection results
from using the real medians from the processes to determine the
global pivots. In this variation of the Quicksort scheme, the
processes *first sort* their elements locally, and during
the recursion keep their local elements in order. The exact, local
medians for the processes are the middle elements in the local arrays,
among which a global pivot is selected and distributed by a suitable
collective operation. As above, this can be done in $O(\log p)$ time.
The local arrays are split into two halves of elements smaller (or
equal) and larger (or equal) than the global pivot. Instead of having
to scan through the array as in the parallel Quicksort implementation,
this can be done in $O(\log\frac{n}{p})$ time by binary search. Processes
pairwise exchange small and large elements, and to maintain order in
the local arrays, each process has to merge its own elements with
those received from its partner in the exchange. The processes then
recurse as explained above. The overall running time of
$O(\frac{n}{p}\log n+\log^2 p)$ is the same as for parallel Quicksort.
Wagar’s name for this Quicksort variant is *HyperQuicksort*.
Wagar [16], Quinn [10, 11], Axtmann and
Sanders [1] and others show that HyperQuicksort
can perform better than parallel Quicksort due to the possibly better
pivot selection. A potential drawback of HyperQuicksort is that the
process-local merge step can only be done after the elements from the
partner process have been received. In contrast, in parallel
Quicksort, the local copying of the elements that are kept at the
process can potentially be done concurrently (overlapped) with the
reception of elements from the partner process.
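
The two process-local building blocks of HyperQuicksort, splitting the sorted local array by binary search and merging the kept run with the received run, can be sketched as follows; this is a sketch with illustrative names, and keys stand in for whole elements:

```c
#include <stddef.h>

/* Index of the first element > pivot in the sorted array a[0..n-1],
 * i.e., the split point, found by binary search in O(log n) time. */
static size_t split_point(const double *a, size_t n, double pivot)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] <= pivot) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Merge the sorted runs a[0..n-1] and b[0..m-1] into out[0..n+m-1]. */
static void merge_runs(const double *a, size_t n,
                       const double *b, size_t m, double *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < n && j < m) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) out[k++] = a[i++];
    while (j < m) out[k++] = b[j++];
}
```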

## 3 Exchange-free, parallel Quicksort

We observe that the partitioning step can be done without actually exchanging any input elements. Instead, it suffices to distribute the pivots, and postpone the element redistribution to the end of the recursion.

Algorithm 1 shows how this is realized for the standard parallel Quicksort algorithm. The idea is best described iteratively. Before iteration $i$, $0\leq i<\log p$, each process maintains a partition of its elements into $2^i$ segments with all elements in segment $j$ being smaller than (or equal to) all elements in segment $j+1$. In iteration $i$, pivots for all $2^i$ segments are chosen locally by the processes, and by a collective communication operation they agree on a global pivot for each segment. The processes then locally partition their segments, resulting in $2^{i+1}$ segments for the next iteration. The process is illustrated in Figure 1. After the $\log p$ iterations, each process has $p$ segments with the mentioned ordering property, which with good pivot selection each contain about $n/p^2$ elements. By an all-to-all communication operation, all $i$th segments are sent to process $i$, after which the processes locally sort their received, approximately $n/p$ elements. Note that no potentially expensive process set splitting is necessary as was the case for the standard Quicksort variants. Also note that the algorithm as performed by each MPI process is actually oblivious to the process rank. All communication is done by process-symmetric, collective operations.
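
The process-local side of one refinement iteration can be sketched as follows in C, with segments represented by an array of boundary indices; the representation and names are illustrative, and the collective agreement on the global pivots is omitted:

```c
#include <stddef.h>

/* Two-way partition of a[lo..hi-1] around pivot; returns split index. */
static size_t partition_segment(double *a, size_t lo, size_t hi, double pivot)
{
    while (lo < hi) {
        if (a[lo] <= pivot) {
            lo++;
        } else {
            hi--;
            double t = a[lo]; a[lo] = a[hi]; a[hi] = t;
        }
    }
    return lo;
}

/* One iteration: each of the s segments delimited by beg[0..s] is
 * partitioned around its (globally agreed) pivot piv[j], giving the
 * 2*s segments delimited by newbeg[0..2*s] for the next iteration. */
static void refine_segments(double *a, const size_t *beg, size_t s,
                            const double *piv, size_t *newbeg)
{
    for (size_t j = 0; j < s; j++) {
        newbeg[2 * j] = beg[j];
        newbeg[2 * j + 1] = partition_segment(a, beg[j], beg[j + 1], piv[j]);
    }
    newbeg[2 * s] = beg[s];
}
```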

If pivot selection is done by a collective MPI_Bcast or MPI_Allreduce operation, the cost over all iterations will be $O(p+\log^2 p)$, since in iteration $i$, $2^i$ pivots need to be found and each collective operation takes $O(2^i+\log p)$ time [12]. A small difficulty here is that some processes’ segments could be empty (if pivot selection is bad) and for such segments no local pivot candidate is contributed. This can be handled using an all-reduction operation that reduces only contributions from processes with non-empty segments, or by relying on the reduction operator having a neutral element for the processes not contributing a local pivot.
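
The neutral-element approach can be sketched as a plain C combining function, as it would be applied inside a user-defined MPI reduction operator; here NAN marks an empty segment, and taking the maximum candidate is an illustrative (associative) agreement rule, not one prescribed by the algorithm:

```c
#include <math.h>

/* Combine two local pivot candidates for one segment. NAN is the
 * neutral element contributed by a process whose segment is empty;
 * choosing the maximum of two real candidates is illustrative. */
static double combine_pivot(double x, double y)
{
    if (isnan(x)) return y;
    if (isnan(y)) return x;
    return x > y ? x : y;
}
```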

This Quicksort algorithm variant can be viewed as a sample sort implementation (see the discussion in [7] and, e.g. [5]) with sample key selection done by the partitioning iterations. No communication of elements is necessary during the sampling process; only the pivots are distributed in each iteration. In each iteration, all processes participate, which can make it possible to select better pivots than in the standard Quicksort variants, where the number of participating processes is halved in each recursive call.

Compared to standard, parallel Quicksort, the pivot selection and exchange terms of $O(\frac{n}{p}\log p+\log^2 p)$ are traded for a single $O(\frac{n}{p})$ term accounting for the all-to-all distribution at the end of the partitioning, and an $O(p+\log^2 p)$ term for the pivot selection over all iterations, thus effectively saving a logarithmic factor on the expensive element exchange. The total running time becomes $O(\frac{n}{p}\log\frac{n}{p}+\frac{n}{p}+p+\log^2 p)$, which means that this version scales worse than the standard Quicksort variants, with linear speed-up when $p$ is in $O(\sqrt{n\log n})$.

## 4 Exchange-free HyperQuicksort

The observation that actual exchange of partitioned elements is not necessary also applies to the HyperQuicksort algorithm [1, 11, 16]. The exchange-free variant is shown as Algorithm 2. In iteration $i$, each process chooses (optimal) local pivots for each of its $2^i$ segments; each of these local pivots is just the middle element of the sorted segment. With a collective operation, $2^i$ global pivots are selected, and each process splits each of its segments by binary search for the corresponding pivot. Iterations thus become fast, namely $O(2^i+\log p)$ time for the collective all-reduce for global pivot selection, for $O(p+\log^2 p)$ time over all iterations, and $O(2^i\log\frac{n}{p})$ time for the binary searches per iteration. At the end, an (irregular) all-to-all operation is again necessary to send all $i$th segments to process $i$. Each of the $p$ segments received per process is ordered, therefore a multiway merge over $p$ segments is necessary to get the received elements into sorted order. This takes $O(\frac{n}{p}\log p)$ time.
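
The final multiway merge can be sketched as follows in C, using the straightforward binary heap also mentioned in Section 5; the flat representation with run boundaries and all names are illustrative:

```c
#include <stdlib.h>

/* Sketch of the final p-way merge: the p received, individually sorted
 * segments are stored back to back in a[] with boundaries beg[0..p],
 * and merged into out[] using a binary min-heap holding one item per
 * non-exhausted segment; O(N log p) for N received elements. */
typedef struct { double key; size_t run; } heap_item;

static void sift_down(heap_item *h, size_t size, size_t i)
{
    for (;;) {
        size_t c = 2 * i + 1;
        if (c >= size) return;
        if (c + 1 < size && h[c + 1].key < h[c].key) c++;
        if (h[i].key <= h[c].key) return;
        heap_item t = h[i]; h[i] = h[c]; h[c] = t;
        i = c;
    }
}

static void multiway_merge(const double *a, const size_t *beg, size_t p,
                           double *out)
{
    heap_item *heap = malloc(p * sizeof *heap);
    size_t *pos = malloc(p * sizeof *pos);
    size_t size = 0, k = 0;

    for (size_t r = 0; r < p; r++) {       /* one item per non-empty run */
        pos[r] = beg[r];
        if (pos[r] < beg[r + 1])
            heap[size++] = (heap_item){ a[pos[r]], r };
    }
    for (size_t i = size / 2; i-- > 0; )   /* bottom-up heapify */
        sift_down(heap, size, i);

    while (size > 0) {                     /* repeatedly extract minimum */
        size_t r = heap[0].run;
        out[k++] = heap[0].key;
        if (++pos[r] < beg[r + 1]) heap[0] = (heap_item){ a[pos[r]], r };
        else heap[0] = heap[--size];
        sift_down(heap, size, 0);
    }
    free(heap); free(pos);
}
```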

Since no element exchanges are done in this algorithm and in Algorithm 1 before the final all-to-all exchange, the amount of work for the different processes remains balanced over the iterations (assuming that all processes have roughly $n/p$ elements to start with).

## 5 Concrete Implementation and Experimental Results

We have implemented all four discussed parallel Quicksort variants with MPI [9]. For the process local sorting, we use the standard C library qsort() function. Speed-up is also evaluated relative to qsort().

In the experimental evaluation we seek to compare the standard parallel Quicksort variants against the proposed exchange-free variants. For that purpose we use inputs where (almost) optimal pivots can be determined easily. Concretely, each local pivot is selected by interpolation between the maximum and the minimum element found in a small sample. For the HyperQuicksort variants, where input elements are kept in order, process local minimum and maximum elements are simply the first and last element in the input element array. For the standard Quicksort variants, process local maximum and minimum elements are chosen from a small sample of elements. Global maximum and minimum elements over a set of processes are computed by an MPI_Allreduce operation with the MPI_MAX operator. The global pivot is interpolated as the average of the global maximum and minimum element. As inputs we have used either random permutations or uniformly generated random numbers. With these inputs, the chosen pivot selection procedure leads to (almost) perfect pivots, and all processes process almost the same number of elements throughout. For the standard parallel Quicksort variants, standard partition with sentinel elements into two parts is used, but such that sequences of elements equal to the pivot are evenly partitioned. For inputs with many equal elements, partition into three segments might be used to improve the load balance [3]. For the HyperQuicksort variants, the multiway merge is done using a straightforward binary heap, see, e.g. [13]. For both variants without element exchange, the final data redistribution is done by an MPI_Alltoall followed by an MPI_Alltoallv operation. (The implementations are available from the author.)
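
The interpolation step itself is trivial; a sketch, with the global min/max reduction by MPI_Allreduce omitted and the function name illustrative:

```c
/* Pivot by interpolation between the global minimum and maximum key:
 * the midpoint, written to avoid overflow for large magnitudes. */
static double interpolated_pivot(double global_min, double global_max)
{
    return global_min + 0.5 * (global_max - global_min);
}
```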

The plots in Figure 2 show a few strong scaling
results on a medium-sized InfiniBand cluster consisting of 2020
dual-socket nodes with two Intel Xeon E5-2650v2, 2.6 GHz, 8-core Ivy
Bridge-EP processors and an Intel QDR-80 dual-link high-speed InfiniBand
fabric. (This is the Vienna Scientific Cluster (VSC); the
author thanks for access and support.) The MPI library used is
OpenMPI 3.0.0, and the programs were compiled with gcc 6.4 with
optimization level -O3. The input elements are
uniformly randomly generated doubles, and $n$ is varied. The MPI
processes are distributed with 16 processes per compute node. For each
input, measurements were repeated 43 times with 5 non-timed, warm-up
measurements, and the best observed times for the slowest process are
shown in the plots and used for computing speed-up. As can be seen, the
exchange-free variant of parallel Quicksort gives consistently higher
speed-up, by about 20%, than the corresponding standard variant with
element exchanges, up to a certain number of processes. From then on,
the number of elements per process becomes so small that the
MPI_Allreduce on the vectors of pivots and the linear latency of the
MPI_Alltoallv operation become more expensive than the explicit element
exchanges. The exchange-free variant of HyperQuicksort is slightly
faster than HyperQuicksort, but by a much smaller margin.
Interestingly, with (almost) perfect pivot selection as in these
experiments, standard parallel Quicksort seems preferable to Wagar’s
HyperQuicksort.

To counter the degradation in the exchange-free variants for small $n/p$
, the explicit exchange and exchange-free variants can be
combined. Throughout the recursion in parallel Quicksort and
HyperQuicksort, the number of elements per process stays the same
(under the assumption that optimal pivots are chosen), whereas the
number of processes is halved in each recursive call. Thus, the
recursion is stopped when $n/p\geq c\,p'$, where $p'$ is the current
number of processes in the recursion and $c$ is some chosen,
implementation and system dependent constant, and the
corresponding exchange-free variant is invoked. By choosing $c$ well,
this gives a smoother transition from parallel Quicksort for small
per process input sizes to exchange-free Quicksort as $n/p$
grows. Results from these *combined Quicksort* variants are shown
in Figure 3. With the constant $c$
chosen experimentally, the combined Quicksort is never
worse than either the standard or the exchange-free variant. Combined
parallel Quicksort reaches a speed-up of more than 1000 at the largest
number of processes measured. This speed-up, and the observed speed-up
of 650 at a smaller process count, is larger than achieved with either
parallel Quicksort or exchange-free Quicksort alone.

The plots in Figure 4 show results from a weak scaling experiment where $p$ is kept fixed, and the initial input size per process is varied over randomly generated doubles. The experimental setup is as for the strong scaling experiment. Beyond a certain number of elements per process, the exchange-free variants perform better than the standard variants, with both exchange-free Quicksort and exchange-free HyperQuicksort reaching good parallel efficiency.

Repeating the experiments with random permutations and with integer type elements does not qualitatively change the results.

## 6 Concluding remarks

This note presented two new variants of parallel Quicksort for the classroom that trade explicit element exchanges throughout the Quicksort recursion against global selection of multiple pivots and a single element redistribution. All communication in the new variants is delegated to MPI collective operations, and the quality of the MPI library will co-determine the scalability of the implementations. For moderately large numbers of elements per process, these variants can be faster than standard parallel Quicksort variants by a significant factor, and can be combined with the standard, exchange-based variants to provide a smoothly scaling parallel Quicksort implementation.

## References

- [1] M. Axtmann and P. Sanders. Robust massively parallel sorting. In Proceedings of the Nineteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 83–97, 2017.
- [2] M. Axtmann, A. Wiebigke, and P. Sanders. Lightweight MPI communicators with applications to perfectly balanced quicksort. In 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.
- [3] J. L. Bentley and M. D. McIlroy. Engineering a sort function. Software – Practice and Experience, 23(11):1249–1265, 1993.
- [4] A. Grama, G. Karypis, V. Kumar, and A. Gupta. Introduction to Parallel Computing. Addison-Wesley, second edition, 2003.
- [5] V. Harsh, L. V. Kalé, and E. Solomonik. Histogram sort with sampling. arXiv:1803.01237, 2018.
- [6] C. A. R. Hoare. Quicksort. The Computer Journal, 5(1):10–16, 1962.
- [7] J. JáJá. A perspective on quicksort. Computing in Science and Engineering, 2(1):43–49, 2000.
- [8] Y. Lan and M. A. Mohamed. Parallel quicksort in hypercubes. In Proceedings of the ACM/SIGAPP Symposium on Applied Computing (SAC): Technological Challenges of the 1990’s, pages 740–746, 1992.
- [9] MPI Forum. MPI: A Message-Passing Interface Standard. Version 3.1, June 4th 2015. www.mpi-forum.org.
- [10] M. J. Quinn. Analysis and benchmarking of two parallel sorting algorithms: Hyperquicksort and quickmerge. BIT, 29(2):239–250, 1989.
- [11] M. J. Quinn. Parallel Programming in C with MPI and OpenMP. McGraw-Hill, 2003.
- [12] P. Sanders, J. Speck, and J. L. Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Computing, 35(12):581–594, 2009.
- [13] R. Sedgewick and K. Wayne. Algorithms. Addison-Wesley, 4th edition, 2011.
- [14] C. Siebert and F. Wolf. Parallel sorting with minimal data. In Recent Advances in Message Passing Interface. 18th European MPI Users’ Group Meeting, volume 6960 of Lecture Notes in Computer Science, pages 170–177, 2011.
- [15] H. Sundar, D. Malhotra, and G. Biros. Hyksort: A new variant of hypercube quicksort on distributed memory architectures. In International Conference on Supercomputing (ICS), pages 293–302, 2013.
- [16] B. Wagar. Hyperquicksort – a fast sorting algorithm for hypercubes. In Hypercube Multiprocessors, pages 292–299. SIAM Press, 1987.

## Appendix A Standard parallel Quicksort algorithms

For completeness, this appendix gives pseudo-code for the two standard
parallel Quicksort implementation variants, shown as
Algorithm 3 (parallel Quicksort) and
Algorithm 4
(HyperQuicksort). Algorithm 3 also shows how
standard, parallel Quicksort can transition to exchange-free Quicksort
when the input to be sorted rises above a threshold relative to the
number of processes employed. This is the *combined Quicksort*
discussed in the main text. In order that all processes switch
consistently, the maximum number of elements at any process is used as
basis for the decision. This can either be computed by a collective
MPI_Allreduce operation, or an identical estimate for all processes
can be used.