1 Algorithm and Implementation
The reduction-to-all collective communication problem is the following. Each of p successively ranked processors i, 0 ≤ i < p, has a vector x_i of n elements for which the elementwise product x_0 ⊕ x_1 ⊕ ... ⊕ x_(p-1) for some given associative (but not necessarily commutative) binary operator ⊕ has to be computed and distributed to all processors. The effect of a reduction-to-all operation is the same as performing the reduction of all input vectors onto some selected root processor, say processor r, and afterwards broadcasting the resulting vector to all the other processors.
1.1 Algorithm description
We assume that the p processors can be organized into two, roughly equally large, rooted, postorder numbered, as balanced and complete as possible binary trees. Processors that are in parent-child relationship are assumed to be able to communicate directly with each other with uniform communication costs, as are the two roots. Communication is bidirectional, and two parent-child processors can in the same communication operation both send and receive elements to and from each other (telephone-like bidirectional communication [1]). We assume a linear-cost model in which bidirectional communication of m elements in both directions can be done in α + βm time units for system-dependent constants α (communication start-up latency) and β (transmission time per element).
The algorithm is quite simple and follows the same idea as in [5], where a doubly pipelined algorithm for the parallel-prefix operation on large vectors was discussed and benchmarked. The input vector x_i of processor i is divided into a number N of successive blocks that are handled one after the other. Each processor works in a number of rounds, to be determined later. In each such round, a non-leaf processor (in either of the binary trees) receives a partial result block from its first child and bidirectionally sends an earlier block of the result to this first child, and then performs an elementwise reduction with the corresponding block of its own input vector. In the same way, the processor then receives a partial result block from and sends an earlier block to its second child, and performs a reduction of the received block with the partial result block computed from the first child. In the last step, the processor sends the partial result block to its parent, and receives an earlier block of the result from the parent. Processors that are leaves in their tree just send blocks to and receive blocks from their parent processor. Thus each round entails at most three bidirectional communication steps with roughly even-sized blocks of about n/N elements each. With the processors organized as a postorder numbered binary tree, the subtree rooted at some processor i consists of the successively numbered processors j, j+1, ..., i for some j ≤ i. The first child of processor i is processor k for some j ≤ k < i-1, with the first subtree spanning processors j, ..., k, and the second child is processor i-1, with the second subtree spanning processors k+1, ..., i-1. The partial result blocks computed by processor k are blocks of the product x_j ⊕ ... ⊕ x_k, and the partial result blocks computed by processor i-1 are blocks of the product x_(k+1) ⊕ ... ⊕ x_(i-1). Thus, processor i can compute partial result blocks of the product x_j ⊕ ... ⊕ x_i over all processors in its subtree while relying only on the associativity of the operator.
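The postorder child and subtree relationships can be made concrete for a complete binary tree. The following C sketch (with hypothetical helper names; the actual implementation may organize this differently) computes the children and the subtree range of a node from its postorder number and its height in the tree:

```c
#include <assert.h>

/* In a postorder numbered complete binary tree with 2^h - 1 nodes,
   numbered 0, ..., 2^h - 2, the root is node 2^h - 2 at height h - 1.
   A subtree root r at height H >= 1 has its second child at r - 1
   (the root of the second subtree) and its first child at r - 2^H
   (skipping over the 2^H - 1 nodes of the second subtree). */
static void children(int r, int H, int *first, int *second) {
  *second = r - 1;
  *first = r - (1 << H);
}

/* Smallest processor number j in the subtree rooted at r with height H;
   the subtree consists of the successively numbered processors j, ..., r,
   and has 2^(H+1) - 1 nodes in total. */
static int subtree_first(int r, int H) {
  return r - ((1 << (H + 1)) - 1) + 1;
}
```

For example, in a seven-processor tree the root is processor 6 at height 2, with first child 2, second child 5, and subtree 0, ..., 6.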
The root of either tree communicates with the root of the other tree, sending partial result blocks from its own tree and receiving partial result blocks from the other tree. Thus, for the two roots an extra application of the operator is needed, while the other non-leaf processors take at most two operator applications per round.
The per-processor algorithm is illustrated as Algorithm 1. The input x_i for processor i is initially stored in the array x, which will also contain the final result y = x_0 ⊕ ... ⊕ x_(p-1). The tth block in x is denoted x[t]. Blocks x[t] for which 0 ≤ t < N have roughly n/N elements, while blocks for which t < 0 or t ≥ N are for convenience assumed to have 0 elements. In round t, processor i receives block t from each of its children into an intermediate buffer z, and sends the previous block of the final result. Reductions are performed on the received blocks in z with the processor's own block x[t], and the result is sent to the parent. From the parent, processor i receives an earlier block of the final result y. Bidirectional, simultaneous send and receive communication is denoted by a Send in parallel (‖) with a Recv operation.
We claim that this correctly computes the result y in x on every processor if the rounds in which a processor participates are offset according to the depth of the processor in its postorder numbered tree. This can be seen by induction on the height of the two trees. When the height is 0, the algorithm runs just over the two roots, which in round t simply exchange their input blocks x[t] and compute block t of the result correctly by a single application of ⊕. Assume the claim holds for any two binary trees of height at most h-1, and consider a processor of depth d with one or two children at depth d+1. In a given round, such a processor needs to compute the partial result block that it sends to its parent in the third step. For this it needs to have received the corresponding blocks from its children, which is exactly what the algorithm does in the first two steps. Before this round, such a processor has, by the induction hypothesis, correctly received all earlier blocks of the result y in x. In the first two steps it sends the last of these blocks to its children, and in the third step it receives the next block of the result y from its parent.
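The height-0 base case of this induction can be simulated sequentially in a few lines. In the following C sketch (names are illustrative, and integer addition stands in for the associative operator ⊕), the two roots exchange block t in round t and each applies the operator once:

```c
#include <assert.h>

#define N 4 /* number of pipeline blocks */
#define B 3 /* elements per block */

/* Serial simulation of the height-0 base case: the two tree roots
   exchange block t in round t and each performs a single elementwise
   reduction of the received block with its own block. */
static void two_root_allreduce(int x0[N][B], int x1[N][B]) {
  for (int t = 0; t < N; t++) {          /* round t: exchange block t */
    for (int e = 0; e < B; e++) {
      int v0 = x0[t][e], v1 = x1[t][e];  /* blocks sent bidirectionally */
      x0[t][e] = v0 + v1;                /* root 0 reduces received block */
      x1[t][e] = v0 + v1;                /* root 1 likewise */
    }
  }
}
```

After N rounds, both roots hold the full elementwise reduction in place, one operator application per round each.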
1.2 Analysis remarks
Assume that for the number of processors p it holds that p = 2^(h+1) - 2 for some h ≥ 1, that is, each of the two binary trees is a complete binary tree with 2^h - 1 processors, and that the number of blocks N is at least h. Then, the height of the two binary trees is h-1, and the number of communication rounds for the first block to reach a tree root is h-1, amounting to 3(h-1) communication steps (as can easily be shown by induction on h). Each root receives a new block every third step, since the algorithm takes three send and receive operations per round. One extra step is required for either root to receive the first block from the other tree. For the first block to be broadcast down the tree, another 3(h-1) steps are needed. Thus the latency of the doubly pipelined algorithm, in terms of the number of communication steps for the first block of the result to reach the last leaf of either of the binary trees, is 6(h-1) + 1. Each subsequent block requires three steps.
Assuming a linear-cost communication model with known constants α and β, the time to perform the reduction-to-all operation on the n-element vectors when divided into N blocks of roughly n/N elements each is thus

T(N) = (6(h-1) + 1 + 3(N-1)) (α + βn/N).

By balancing (“Pipelining Lemma”) the terms that increase and decrease with N, the analytically best number of blocks N = Θ(√(βnh/α)) and with this the best possible running time can easily be deduced, which is 3βn + O(√(αβnh)) + O(αh).
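The balancing step can be checked numerically. The following C sketch evaluates the step count (3N + 6(h-1) - 2)(α + βn/N) from the analysis and finds the best block count by direct search; the concrete values of α and β in any use of it are illustrative, system-dependent assumptions, not measured constants:

```c
#include <assert.h>

/* Time of the doubly pipelined algorithm with N blocks, per the analysis:
   3N + 6(h-1) - 2 communication steps, each costing alpha + beta*n/N. */
static double T(long N, double n, double h, double alpha, double beta) {
  return (3.0 * N + 6.0 * (h - 1.0) - 2.0) * (alpha + beta * n / N);
}

/* Best number of blocks by direct search over N; the Pipelining Lemma
   predicts the minimum at N of order sqrt(beta*n*h/alpha). */
static long bestN(double n, double h, double alpha, double beta) {
  long best = 1;
  for (long N = 2; N <= (long)n; N++)
    if (T(N, n, h, alpha, beta) < T(best, n, h, alpha, beta)) best = N;
  return best;
}
```

For example, with n = 10000, h = 8, α = 1 and β = 0.01, the search returns a block count close to the analytic √((6h-8)βn/(3α)) ≈ 37.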
The analysis accounts only for the communication costs. All non-leaves except the roots perform two applications of the operator on blocks of n/N elements per round. The two roots unfortunately need one more reduction with the partial result block received from the other root. With a computation cost of γ time units per element, the added cost for the reductions is thus at most 3γn/N per round.
If instead only one, doubly pipelined binary tree is used, all non-leaves, including the root, perform at most two applications of the operator per round. On the other hand, with only one binary tree and p = 2^(h+1) - 1 processors, the tree has one level more, and the latency for the first block of the result to reach the last leaf is therefore slightly higher (by a small constant term).
If the reduction-to-all operation is implemented as a reduction operation to the root followed by a broadcast, both with pipelined binary trees, the total time is

T'(N) = (4(h-1) + 4(N-1)) (α + βn/N),

which results in a running time of 4βn + O(√(αβnh)) + O(αh) with the right choice of best number of blocks. Thus, in the βn term, exploiting bidirectional communication with doubly pipelined trees gives an improvement from a factor of 4 to a factor of 3.
The best-known, pipelined, binary tree-based algorithm places the processors in two trees, such that each processor is an internal node in one tree and a leaf in the other [4]. This gives a running time for the reduction-to-all operation, when implemented as a reduction followed by a broadcast operation, of 2βn plus lower-order terms.
1.3 Implementation sketch
A concrete implementation of the algorithm has been given using MPI [3]. The bidirectional communication is effected with the MPI_Sendrecv operation. When a processor has not yet partially reduced all N blocks, a block of roughly n/N elements is received from each of the children and sent to the parent; when all blocks have been reduced, virtual blocks of zero elements are received and sent. Likewise, blocks of zero elements are initially received from the parent and sent to the children, and only when the parent has received blocks with more than zero elements are these transmitted to the children. For the implementation, each MPI_Sendrecv operation gives the upper bound, namely ⌈n/N⌉ elements, on the size of the block expected to be received, and the actual number of elements in a received block is queried with MPI_Get_elements. Using this functionality, there is no need to explicitly keep track of the depth of the processor and the corresponding excess number of rounds. A non-leaf processor terminates when it has received its last non-zero element block from both its children and the parent, but since the blocks received from the parent are always behind (earlier than) the blocks from the children, a processor can terminate as soon as it has received the last non-zero element block from the parent. The MPI_Reduce_local function is used for performing the blockwise reductions; but since this function is less flexible than the assignments shown in Algorithm 1, some care has to be taken to respect the possible non-commutativity of the operator, and to avoid extra buffer copying.
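One round of the three MPI_Sendrecv steps might be sketched as follows. This is not the author's code: all buffer and rank names are hypothetical, termination handling and the zero-element virtual blocks are omitted, and the operand order in MPI_Reduce_local (which stores inbuf op inoutbuf into inoutbuf) is shown schematically; as noted above, it is exactly this order that needs care for non-commutative operators.

```c
#include <mpi.h>

/* Sketch of one round of the doubly pipelined algorithm for a non-leaf,
   non-root processor, for MPI_INT elements. first, second and parent are
   the ranks of the two children and the parent in the postorder tree. */
static void one_round(int first, int second, int parent,
                      int *resblock, int nres,   /* earlier result block */
                      int *ownblock, int nblock, /* block t of own input */
                      int *tmp, int maxblock,    /* receive buffer/bound */
                      MPI_Op op, MPI_Comm comm) {
  MPI_Status status;
  int count;

  /* Step 1: bidirectional exchange with the first child: receive its
     partial result for block t, send an earlier block of the result. */
  MPI_Sendrecv(resblock, nres, MPI_INT, first, 0,
               tmp, maxblock, MPI_INT, first, 0, comm, &status);
  MPI_Get_elements(&status, MPI_INT, &count);
  MPI_Reduce_local(tmp, ownblock, count, MPI_INT, op); /* order is schematic */

  /* Step 2: likewise with the second child. */
  MPI_Sendrecv(resblock, nres, MPI_INT, second, 0,
               tmp, maxblock, MPI_INT, second, 0, comm, &status);
  MPI_Get_elements(&status, MPI_INT, &count);
  MPI_Reduce_local(tmp, ownblock, count, MPI_INT, op);

  /* Step 3: send the partial result for block t to the parent and
     receive an earlier block of the final result from it. */
  MPI_Sendrecv(ownblock, nblock, MPI_INT, parent, 0,
               tmp, maxblock, MPI_INT, parent, 0, comm, &status);
}
```

Querying the received element count via the status object is what lets the pipeline run without explicit depth bookkeeping, as described above.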
The whole algorithm can be implemented in less than a hundred lines of MPI C code. Such code is available from the author.
Table 1: The “Hydra” cluster used for the experimental evaluation.

Name            Hydra
Nodes           36
Cores per node  32 (dual socket, Intel Xeon Gold 6130, 2.1 GHz)
Total cores     1152
Interconnect    dual (2-lane) Omni-Path
MPI library     Open MPI 4.0.5 with gcc 8.3.0
2 An experimental evaluation
We have done an initial evaluation of the doubly pipelined, dual-root reduction-to-all implementation on a small, Intel Skylake dual-socket, dual-rail Omni-Path “Hydra” cluster as described in Table 1. The nodes of this cluster consist of two 16-core sockets, each socket with a direct Omni-Path connection to a separate network.
The evaluation compares the following implementations of the reduction-to-all operation on vectors of given numbers of elements.

- The native MPI_Allreduce operation.

- An MPI_Reduce followed by an MPI_Bcast operation.

- A pipelined reduce followed by a pipelined broadcast with the same pipeline block size, using a single binary tree (UserAllreduce1).

- The doubly pipelined, dual-root algorithm described in Section 1.
The two implemented, pipelined algorithms use the same pipeline block size, which is set at compile time. The implementations do not attempt to find and use a best block size in dependence on the number of elements to reduce, the number of MPI processes used, or other characteristics of the system. Experiments with different block sizes, different numbers of MPI processes, and different mappings of the MPI processes to the cores of the compute nodes would have to be performed.
This is all not done here (see Section 3 for the justification). In Figure 1, results from a single run with 36 x 8 = 288 MPI processes (8 processes on each of the 36 compute nodes) with a fixed pipeline block size are shown. The elements are MPI_INT integers, and the reduction operator used is MPI_SUM. The results are gathered using the mpicroscope benchmark [6]. This benchmark defines the running time of an experiment as the minimum over a number of measurement rounds of the completion time of the slowest MPI process, and synchronizes individual measurements with MPI_Barrier operations [2].
Table 2: Raw running times of the four reduction-to-all implementations.

Elements (count)  MPI_Allreduce  MPI_Reduce+MPI_Bcast  Pipelined  Doubly pipelined
0  0.29  0.84  0.19  0.19 
1  16.75  24.44  29.95  33.60 
2  15.55  19.08  25.37  32.41 
8  19.96  27.80  33.60  36.18 
15  19.38  25.91  30.82  35.12 
21  20.63  24.64  31.45  35.49 
25  21.45  24.98  30.70  35.32 
87  21.46  27.82  37.08  38.48 
150  23.77  29.23  38.84  40.88 
212  24.96  31.21  41.32  43.41 
250  25.27  32.17  41.82  44.25 
875  42.19  71.44  75.20  73.15 
1500  63.98  94.17  112.31  104.39 
2125  99.07  124.31  162.24  152.74 
2500  1059.83  129.21  172.16  165.21 
8750  1122.91  456.07  689.72  621.82 
15000  1233.48  688.99  775.91  719.73 
21250  1218.00  1020.46  862.24  805.53 
25000  1211.81  1146.03  908.35  822.63 
87500  1563.37  4294.96  1630.25  1412.93 
150000  1854.84  6087.61  2276.29  1958.36 
212500  2472.61  7106.53  2941.19  2489.45 
250000  2893.00  7835.16  3289.41  2765.93 
875000  14083.86  21566.69  9392.92  8158.38 
1500000  12421.02  36192.82  15557.71  13434.51 
2125000  16154.38  34915.25  21776.97  18955.76 
2500000  19579.38  39681.02  25773.33  22346.98 
4597152  31391.74  63723.56  46497.68  40701.29 
6694304  45622.58  88317.08  67372.14  59036.27 
8388608  56249.24  204326.0  84081.41  73116.03 
The raw data are listed in Table 2. The data range is from 0 to 2^25 Bytes (8388608 MPI_INT elements) with exponentially distributed measure points as chosen by the mpicroscope benchmark. As can be seen, the doubly pipelined algorithm consistently (except for small counts) beats the pipelined reduction followed by broadcast algorithm, but the ratio of improvement in time is less than the factor of 4/3 expected from the analysis (for instance, for the largest count, the ratio is only about 1.15 and not 1.33), which may or may not indicate that the bidirectional communication capabilities are being exploited. In order to answer this question, a baseline for raw, bidirectional communication would need to be established experimentally. For small and large counts, the native MPI_Allreduce operation performs the best, but it is excessively poor in a midrange of counts, where it is the worst implementation by a sometimes large factor. This indicates a bad switch of algorithms in the Open MPI 4.0.5 library used. As counts get larger, the poorest implementation choice is MPI_Reduce followed by MPI_Bcast, which is the way an MPI library can be expected to behave [7].
3 Summary
This note is meant as an exercise in reduction-algorithm implementation and evaluation, and most concrete issues are therefore intentionally left open. The main question is whether bidirectional communication in message-passing systems can be exploited to make a noticeable and robust performance difference over algorithms that cannot exploit bidirectional communication. Further questions concern the experimental evaluation, in particular the determination of the best pipeline block size, and the role that the hierarchical structure (network and nodes) of a clustered, high-performance system plays. The invitation and challenge is to investigate these questions better than presented here. Concrete implementations can be compared against the author’s code.
References
 [1] Pierre Fraigniaud and Emmanuel Lazard. Methods and problems of communication in usual networks. Discrete Applied Mathematics, 53(1–3):79–133, 1994.
 [2] Sascha Hunold and Alexandra Carpen-Amarie. Reproducible MPI benchmarking is still not as easy as you think. IEEE Transactions on Parallel and Distributed Systems, 27(12):3617–3630, 2016.
 [3] MPI Forum. MPI: A Message-Passing Interface Standard. Version 3.1, June 4th 2015. www.mpi-forum.org.
 [4] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Computing, 35(12):581–594, 2009.
 [5] Peter Sanders and Jesper Larsson Träff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 of Lecture Notes in Computer Science, pages 49–57. Springer, 2006.
 [6] Jesper Larsson Träff. mpicroscope: Towards an MPI benchmark tool for performance guideline verification. In Recent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 of Lecture Notes in Computer Science, pages 100–109. Springer, 2012.
 [7] Jesper Larsson Träff, William D. Gropp, and Rajeev Thakur. Selfconsistent MPI performance guidelines. IEEE Transactions on Parallel and Distributed Systems, 21(5):698–709, 2010.