
A Doubly-pipelined, Dual-root Reduction-to-all Algorithm and Implementation

09/26/2021
by   Jesper Larsson Träff, et al.

We discuss a simple, binary tree-based algorithm for the collective allreduce (reduction-to-all, MPI_Allreduce) operation for parallel systems consisting of p suitably interconnected processors. The algorithm can be doubly pipelined to exploit bidirectional (telephone-like) communication capabilities of the communication system. In order to make the algorithm more symmetric, the processors are organized into two rooted trees with communication between the two roots. For each pipeline block, each non-leaf processor takes three communication steps, consisting of receiving from and sending to the two children, and sending to and receiving from the parent. In a round-based, uniform, linear-cost communication model in which simultaneously sending and receiving n data elements takes time α + βn for system dependent constants α (communication start-up latency) and β (time per element), the time for the allreduce operation on vectors of m elements is O(log p + √(m log p)) + 3βm by suitable choice of the pipeline block size. We compare the performance of an implementation in MPI to similar reduce followed by broadcast algorithms, and to the native MPI_Allreduce collective on a modern, small 36×32-processor cluster. With proper choice of the number of pipeline blocks, it is possible to achieve better performance than pipelined algorithms that do not exploit bidirectional communication.


1 Algorithm and Implementation

The reduction-to-all collective communication problem is the following. Each of the p successively ranked processors i, 0 ≤ i < p, has a vector x_i of m elements for which the element-wise product x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_{p−1} for some given associative (but not necessarily commutative) binary operator ⊕ has to be computed and distributed to all processors. The effect of a reduction-to-all operation is the same as performing the reduction of all input vectors onto some selected root processor, and afterwards broadcasting the resulting vector to all the other processors.
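As a concrete point of reference, this equivalence can be sketched directly in MPI C (a minimal sketch, not the pipelined algorithm developed below; MPI_INT elements and the commutative MPI_SUM operator are assumptions made only for concreteness):

  #include <mpi.h>

  /* Reference reduction-to-all: a reduction to some root followed by a
     broadcast of the result. Any fixed root gives the same effect. */
  void allreduce_by_reduce_bcast(const int *input, int *result, int m, MPI_Comm comm)
  {
    const int root = 0;
    MPI_Reduce(input, result, m, MPI_INT, MPI_SUM, root, comm);
    MPI_Bcast(result, m, MPI_INT, root, comm);
  }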

1.1 Algorithm description

We assume that the p processors can be organized into two, roughly equally large, rooted, post-order numbered, as balanced and complete as possible binary trees. Processors that are in a parent-child relationship are assumed to be able to communicate directly with each other with uniform communication costs, as are the dual roots. Communication is bidirectional, and two parent-child processors can in the same communication operation both send and receive n elements to and from each other (telephone-like bidirectional communication [1]). We assume a linear-cost model in which bidirectional communication of n elements in both directions can be done in α + βn time units for system dependent constants α (communication start-up latency) and β (transmission time per element).
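A single bidirectional communication step in this model can be sketched with MPI as follows (a minimal sketch under the stated model; the MPI_INT datatype, the message tag, and the function name are illustrative assumptions):

  #include <mpi.h>

  /* One telephone-like step with a fixed partner: n elements are sent and n
     elements are received in the same operation; in the linear cost model
     this step is charged alpha + beta*n. */
  void bidirectional_step(const int *sendblk, int *recvblk, int n,
                          int partner, MPI_Comm comm)
  {
    MPI_Sendrecv(sendblk, n, MPI_INT, partner, 0,
                 recvblk, n, MPI_INT, partner, 0,
                 comm, MPI_STATUS_IGNORE);
  }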

1: b[j] ← x[j] for j = 0, …, N−1   ▷ Initialize pipelining array b to input x
2: for k = 0, 1, …, N−1+δ do
3:     Send(b[k−δ−1], c1) ∥ Recv(t, c1)   ▷ First child
4:     b[k] ← t ⊕ b[k]   ▷ Post-order reduction
5:     Send(b[k−δ−1], c2) ∥ Recv(t, c2)   ▷ Second child
6:     b[k] ← t ⊕ b[k]   ▷ Post-order reduction
7:     if root then
8:         Send(b[k], r) ∥ Recv(t, r)   ▷ Dual root r
9:         b[k] ← b[k] ⊕ t   ▷ For non-commutative ⊕, this operand order for the lower numbered root, t ⊕ b[k] for the other
10:    else
11:        Send(b[k], parent) ∥ Recv(b[k−δ], parent)   ▷ Parent
12:    end if
13: end for
Algorithm 1: Doubly pipelined reduction-to-all algorithm as performed by processor i. The constant δ is the depth of processor i in its post-order numbered binary tree, c1 and c2 are its first and second child, and t is an intermediate buffer. Each tree root communicates with its dual r in the other tree, and communication with non-existing children is void. The kth pipeline block is denoted by b[k], and blocks for which either k < 0 or k ≥ N are assumed to contain 0 elements. The number of elements in the blocks that are sent and received is assumed to be implicitly known by the bidirectional send and receive operations.

The algorithm is quite simple and follows the same idea as in [5], where a doubly pipelined algorithm for the parallel-prefix operation on large vectors was discussed and benchmarked. The input vector for processor i is divided into a number of successive blocks, b[0], b[1], …, b[N−1], that are handled one after the other. Each processor works in a number of rounds, one per block, for some number of blocks N to be determined later. In each such round, a non-leaf processor (in either of the binary trees) receives a partial result block from its first child and bidirectionally sends an earlier block of the result to this first child, and then performs an element-wise reduction with the corresponding block of its own input vector. In the same way, the processor then receives a partial result block and sends an earlier block from and to its second child, and performs a reduction with the received block and the partial result block computed from the first child. In the last step, the processor sends the partial result block to its parent, and receives an earlier block of the result from the parent. Processors that are leaves in their tree just send and receive blocks from their parent processor. Thus each round entails at most three bidirectional communication steps with roughly even sized blocks of about m/N elements each. With the processors organized as a post-order numbered binary tree, the subtree rooted at some processor i consists of the successively numbered processors j, j+1, …, i for some j ≤ i. The first child of processor i is processor i−1, and the second child is processor i−1−s, where s is the number of processors in the subtree rooted at processor i−1. The partial result blocks computed by processor i−1 are blocks of the product x_{i−s} ⊕ ⋯ ⊕ x_{i−1}, and the partial result blocks computed by processor i−1−s are blocks of the product x_j ⊕ ⋯ ⊕ x_{i−1−s}. Thus, processor i can compute partial result blocks of the product x_j ⊕ ⋯ ⊕ x_i over all processors in its subtree while relying only on the associativity of the operator ⊕. The root of either tree communicates with the root of the other tree, sending partial result blocks from its own tree and receiving partial result blocks from the other tree. Thus, for the root an extra application of the ⊕ operation is needed, while the other non-leaf processors take at most two ⊕ operations per round.

The per-processor algorithm is illustrated as Algorithm 1. The input for the processor is initially stored in the array b, which will also contain the final result x_0 ⊕ ⋯ ⊕ x_{p−1}. The kth block in b is denoted b[k]. Blocks for which 0 ≤ k < N have roughly m/N elements, while blocks for which k < 0 or k ≥ N are for convenience assumed to have 0 elements. In round k, processor i receives block k from each of its children into an intermediate buffer t, and sends the previous block b[k−δ−1] of the final result. Reductions are performed on the received blocks in t with the processor's own block b[k], and the result is sent to the parent. From the parent, processor i receives block k−δ of the final result. Bidirectional, simultaneous send and receive communication is denoted by a Send in parallel (∥) with a Recv operation.

We claim that this correctly computes x_0 ⊕ ⋯ ⊕ x_{p−1} in b if δ is chosen as the depth of processor i in its post-order numbered tree. This can easily be seen by induction on the height of the two trees. When the height is 0, the algorithm runs just over the two roots, which in round k just exchange their input blocks b[k] and compute the result correctly into b[k] by a single application of ⊕. Assume the claim holds for any two binary trees of height at most h, and consider a processor of depth δ with one or two children at depth δ+1. In round k such a processor will need to compute a partial result in b[k] to send to its parent in the third step. For this it needs to receive the partial results for block k from its children, which is what the algorithm does. Before round k, such a processor has by the induction hypothesis correctly received all result blocks b[0], …, b[k−δ−1] in b. In the first two steps, it sends the last of these blocks, namely b[k−δ−1], to its children, and in the third step receives block k−δ of the result.
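For illustration, the post-order numbering can be made concrete for the special case of a perfect binary tree with 2^(q+1)−1 processors (a simplification of the as balanced and complete as possible trees assumed above). The following C sketch computes the first child (taken to be processor i−1 as described above), the second child, and the depth δ of a processor i, with −1 denoting a non-existing child:

  /* Children and depth of node i in a perfect binary tree with (1<<(q+1))-1
     nodes, post-order numbered 0,...,(1<<(q+1))-2 (the root has the highest
     number). Found by descending from the root towards node i. */
  typedef struct { int first, second, depth; } tree_pos;

  tree_pos postorder_pos(int i, int q)
  {
    int lo = 0, hi = (1 << (q + 1)) - 2;  /* current subtree covers lo..hi, root hi */
    int depth = 0;
    while (hi != i) {                     /* descend until i is the subtree root */
      int half = (hi - lo) / 2;           /* each of the two subtrees has half the nodes */
      if (i >= lo + half) { lo += half; hi -= 1; }   /* right (first) subtree  */
      else               { hi = lo + half - 1; }     /* left (second) subtree  */
      depth++;
    }
    tree_pos p;
    if (hi == lo) { p.first = p.second = -1; }                    /* leaf */
    else { p.first = hi - 1; p.second = lo + (hi - lo) / 2 - 1; } /* children */
    p.depth = depth;
    return p;
  }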

1.2 Analysis remarks

Assume that for the number of processors p it holds that p = 2^(q+1) for some integer q ≥ 1, that is, each of the two binary trees has 2^q processors, and that the number of blocks N is large enough for the pipeline to fill. Then, the height of the two binary trees is q, and the number of communication steps for the first block to reach a tree root is proportional to q (as can easily be shown by induction on q). Each root receives a new block every third step, since the algorithm takes three send and receive operations per round. One extra step is required for either root to receive the first block from the other tree. For the last block that is broadcast down the tree, another O(q) steps are needed. Thus the latency of the doubly pipelined algorithm, in terms of the number of communication steps for the first block of the result to reach the last leaf of either of the binary trees, is O(q) = O(log p). Each subsequent block requires three steps.

Assuming a linear-cost communication model with known constants α and β, the time to perform the allreduce operation on the m-element vectors when divided into N blocks of roughly m/N elements each is thus

(3N + O(log p)) (α + β m/N) = 3Nα + 3βm + O(α log p + β (m/N) log p).

By balancing (“Pipelining Lemma”) the terms that increase and decrease with N, the analytical best number of blocks N = Θ(√(β m log p / α)) and with this the best possible running time of

3βm + O(α log p + √(αβ m log p))

can easily be deduced, which is O(log p + √(m log p)) + 3βm for constant α and β.
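Under this cost expression, the balancing can also be carried out numerically. The following C sketch computes a block count by equating the α-term that grows with N against the β-term that shrinks with N; the constant c, standing for the number of pipeline-fill steps per tree level in the latency term, is an assumption, as is the function name:

  #include <math.h>

  /* Block count from balancing 3*N*alpha against c*log2(p)*beta*m/N.
     alpha is in seconds, beta in seconds per element. Compile with -lm. */
  int pipeline_blocks(double alpha, double beta, long m, int p, double c)
  {
    double N = sqrt(c * log2((double)p) * beta * (double)m / (3.0 * alpha));
    return N < 1.0 ? 1 : (int)(N + 0.5);
  }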

The analysis accounts only for the communication costs. All non-leaves except the roots perform two applications of the operator ⊕ on blocks of m/N elements per round. The two roots unfortunately need one more reduction with the partial result block received from the other root. With a cost of γ time units per element for applying the operator, the added cost for the reductions is thus at most 3γ m/N per round.

If instead only one, doubly pipelined binary tree is used, all non-leaves, including the root, perform at most two applications of the operator ⊕ per round. On the other hand, with only one binary tree over all p processors, the tree is one level deeper, so the latency for the first block of the result to reach the last leaf is slightly higher (by a small constant term).

If the reduction-to-all operation is implemented as a reduction operation to the root followed by a broadcast, both pipelined over binary trees, each non-leaf processor has to communicate each block with both of its children in each of the two phases, so the total time is

(4N + O(log p)) (α + β m/N)

which results in a running time of O(log p + √(m log p)) + 4βm with the right choice of best number of blocks. Thus, in the βm term, exploiting bidirectional communication with doubly pipelined trees gives an improvement from a factor of 4 to a factor of 3.

The best-known, pipelined, binary tree-based algorithm places the processors in two trees, such that each processor is an internal node in one tree and a leaf in the other [4]. This gives a running time for the reduction-to-all operation, when implemented as a reduction followed by a broadcast operation, of O(log p + √(m log p)) + 2βm.

1.3 Implementation sketch

A concrete implementation of the algorithm has been given using MPI [3]. The bidirectional communication is effected with the MPI_Sendrecv operation. When a processor has not yet partially reduced all blocks, a block of roughly m/N elements is received from each of the children and sent to the parent; when all blocks have been reduced, virtual blocks of zero elements are received and sent. Likewise, blocks of zero elements are initially received from the parent and sent to the children, and only when the parent has received blocks with more than zero elements are these transmitted to the children. For the implementation, each MPI_Sendrecv operation gives the upper bound, namely ⌈m/N⌉ elements, on the size of the block expected to be received, and the actual number of elements in a received block is queried with MPI_Get_elements. Using this functionality, there is no need to explicitly keep track of the depth δ of the processor and the corresponding excess number of rounds. A non-leaf processor terminates when it has received its last non-zero element block from both its children and the parent, but since the blocks received from the parent are always behind (earlier than) the blocks from the children, a processor can terminate as soon as it has received the last non-zero element block from the parent. The MPI_Reduce_local function is used for performing the blockwise reductions; but since this is less flexible than the assignments shown in Algorithm 1, some care has to be taken to respect the possible non-commutativity of the operator, and to avoid extra buffer copying.
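The per-step pattern just described can be sketched as follows (a minimal sketch, not the author's code; the buffer bookkeeping, the MPI_INT datatype, and the function name are assumptions, and only the step with a child is shown):

  #include <mpi.h>

  /* One bidirectional pipeline step with a child: send sendcount elements,
     receive at most blocksize elements, query how many actually arrived, and
     reduce the received block into the local block. Returns the number of
     elements received (0 for a virtual, empty block). */
  int pipeline_step(const int *sendblk, int sendcount,
                    int *recvblk, int blocksize,
                    int *localblk, MPI_Op op,
                    int child, MPI_Comm comm)
  {
    MPI_Status status;
    int recvcount;
    MPI_Sendrecv(sendblk, sendcount, MPI_INT, child, 0,
                 recvblk, blocksize, MPI_INT, child, 0, comm, &status);
    MPI_Get_elements(&status, MPI_INT, &recvcount);
    /* MPI_Reduce_local computes localblk = recvblk op localblk, i.e., the
       received block is the first operand, matching the post-order numbering. */
    if (recvcount > 0)
      MPI_Reduce_local(recvblk, localblk, recvcount, MPI_INT, op);
    return recvcount;
  }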

The whole algorithm can be implemented in less than a hundred lines of MPI C code. Such code is available from the author.

Name    Nodes   Cores/node   Cores   Processor                                     Interconnect              MPI library
Hydra   36      32           1152    Intel Xeon Gold 6130, 2.1 GHz, dual socket    OmniPath, dual (2-lane)   Open MPI 4.0.5 with gcc 8.3.0-6
Table 1: Systems (hardware and software) used for the experimental evaluation.

2 An experimental evaluation

We have done an initial evaluation of the doubly pipelined, dual-root reduction-to-all implementation on a small, Intel Skylake dual-socket, dual-rail OmniPath “Hydra” cluster as described in Table 1. The nodes of this cluster consist of two 16-core sockets, each with a direct OmniPath connection to a separate network.

The evaluation compares the following implementations of the reduction-to-all operation on vectors of given numbers of elements.

  1. The native MPI_Allreduce operation.

  2. An MPI_Reduce followed by an MPI_Bcast operation.

  3. A pipelined reduce followed by a pipelined broadcast with the same pipeline block size using a single binary tree (User-Allreduce1).

  4. The doubly pipelined, dual root reduction-to-all algorithm implemented as sketched in Sections 1.1 and 1.3 (User-Allreduce2).

The two implemented pipelined algorithms use the same pipeline block size, which is set at compile time. The implementations do not attempt to find and use a best block size depending on the number of elements to reduce, the number of MPI processes used, or other characteristics of the system. Experiments with different block sizes, different numbers of MPI processes, and different mappings of the MPI processes to the cores of the compute nodes would have to be performed for a thorough evaluation.

Figure 1: Runs with the four different reduction-to-all implementations. The two pipelined algorithms pipeline with fixed-size blocks of MPI_INT elements. User-Allreduce1 implements the pipelined reduction followed by broadcast algorithm. User-Allreduce2 implements the doubly pipelined, dual-root algorithm.

None of this is done here (see Section 3 for justification). In Figure 1, results from a single run with 288 MPI processes (8 processes on each of the 36 compute nodes) and a fixed pipeline block size are shown. The elements are MPI_INT integers, and the reduction operator used is MPI_SUM. The results are gathered with the mpicroscope benchmark [6]. This benchmark defines the running time of an experiment as the minimum over a number of measurement rounds of the completion time of the slowest MPI process, and synchronizes individual measurements with MPI_Barrier operations [2].
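The measurement scheme can be sketched as follows (a minimal sketch of the scheme just described, not the mpicroscope code; the function name and the callback interface are assumptions):

  #include <mpi.h>
  #include <float.h>

  /* Barrier-synchronized timing: per round, take the completion time of the
     slowest process; over all rounds, report the minimum. */
  double benchmark_min_of_max(void (*op)(void *), void *arg, int rounds, MPI_Comm comm)
  {
    double best = DBL_MAX;
    for (int r = 0; r < rounds; r++) {
      MPI_Barrier(comm);
      double start = MPI_Wtime();
      op(arg);                              /* the collective under test */
      double local = MPI_Wtime() - start;
      double slowest;
      MPI_Allreduce(&local, &slowest, 1, MPI_DOUBLE, MPI_MAX, comm);
      if (slowest < best) best = slowest;
    }
    return best;
  }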

Elements (count) MPI_Allreduce MPI_Reduce+MPI_Bcast Pipelined Doubly pipelined
0 0.29 0.84 0.19 0.19
1 16.75 24.44 29.95 33.60
2 15.55 19.08 25.37 32.41
8 19.96 27.80 33.60 36.18
15 19.38 25.91 30.82 35.12
21 20.63 24.64 31.45 35.49
25 21.45 24.98 30.70 35.32
87 21.46 27.82 37.08 38.48
150 23.77 29.23 38.84 40.88
212 24.96 31.21 41.32 43.41
250 25.27 32.17 41.82 44.25
875 42.19 71.44 75.20 73.15
1500 63.98 94.17 112.31 104.39
2125 99.07 124.31 162.24 152.74
2500 1059.83 129.21 172.16 165.21
8750 1122.91 456.07 689.72 621.82
15000 1233.48 688.99 775.91 719.73
21250 1218.00 1020.46 862.24 805.53
25000 1211.81 1146.03 908.35 822.63
87500 1563.37 4294.96 1630.25 1412.93
150000 1854.84 6087.61 2276.29 1958.36
212500 2472.61 7106.53 2941.19 2489.45
250000 2893.00 7835.16 3289.41 2765.93
875000 14083.86 21566.69 9392.92 8158.38
1500000 12421.02 36192.82 15557.71 13434.51
2125000 16154.38 34915.25 21776.97 18955.76
2500000 19579.38 39681.02 25773.33 22346.98
4597152 31391.74 63723.56 46497.68 40701.29
6694304 45622.58 88317.08 67372.14 59036.27
8388608 56249.24 204326.0 84081.41 73116.03
Table 2: Raw data for the experiment with the four algorithms with the fixed pipeline block size. The minimum completion times over a number of measurements are shown.

The raw data are listed in Table 2. The data range is from 0 to 8388608 elements (33554432 Bytes for 4-Byte MPI_INT elements), with exponentially distributed measurement points as chosen by the mpicroscope benchmark.

As can be seen, the doubly pipelined algorithm consistently (except for small counts) beats the pipelined reduction followed by broadcast algorithm, but the ratio of improvement in time is less than the factor of 4/3 expected from the analysis (for instance, for the largest count, the ratio is only about 1.15 and not 1.33), which may or may not indicate that bidirectional communication capabilities are being exploited. In order to answer this question, a baseline for raw, bidirectional communication would need to be established experimentally. For small and large counts, the native MPI_Allreduce operation performs best, but it is excessively poor in a midrange of counts, where it is the worst implementation by a sometimes large factor. This indicates a bad algorithm switch in the Open MPI 4.0.5 library used. As counts get larger, the poorest implementation choice is MPI_Reduce followed by MPI_Bcast, which is how an MPI library can be expected to behave [7].

3 Summary

This note is meant as an exercise in reduction-algorithm implementation and evaluation, and most concrete issues are therefore intentionally left open. The main question is whether bidirectional communication in message-passing systems can be exploited to make a noticeable and robust performance difference over algorithms that cannot exploit bidirectional communication. Further questions concern the experimental evaluation, in particular the determination of the best pipeline block size, and the role that the hierarchical structure (network and nodes) of a clustered, high-performance system plays. The invitation and challenge is to investigate these questions better than done here. Concrete implementations can be compared against the author's code.

References

  • [1] Pierre Fraigniaud and Emmanuel Lazard. Methods and problems of communication in usual networks. Discrete Applied Mathematics, 53(1–3):79–133, 1994.
  • [2] Sascha Hunold and Alexandra Carpen-Amarie. Reproducible MPI benchmarking is still not as easy as you think. IEEE Transactions on Parallel and Distributed Systems, 27(12):3617–3630, 2016.
  • [3] MPI Forum. MPI: A Message-Passing Interface Standard. Version 3.1, June 4th 2015. www.mpi-forum.org.
  • [4] Peter Sanders, Jochen Speck, and Jesper Larsson Träff. Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Computing, 35(12):581–594, 2009.
  • [5] Peter Sanders and Jesper Larsson Träff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 of Lecture Notes in Computer Science, pages 49–57. Springer, 2006.
  • [6] Jesper Larsson Träff. mpicroscope: Towards an MPI benchmark tool for performance guideline verification. In Recent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 of Lecture Notes in Computer Science, pages 100–109. Springer, 2012.
  • [7] Jesper Larsson Träff, William D. Gropp, and Rajeev Thakur. Self-consistent MPI performance guidelines. IEEE Transactions on Parallel and Distributed Systems, 21(5):698–709, 2010.