## I Introduction

Collective communication operations are an essential component of many data-parallel computation frameworks. Originally developed for high-performance computing frameworks such as MPI [walker1996mpi], they are now widely used for cloud-based distributed machine learning [vishnu2016distributed, WritingD41:online, ExtendMX17:online, NIPS2016_6381, Ke2017LightGBMAH, SparkMPI16:online, hpcincloud] workloads. With increasingly complex models [AIandCom71:online, TuringNL73:online, rajbh2019zero] calling for larger data exchanges, and rapid deployment of faster accelerators [sagemaker, brainwave, Jouppi:2017:IPA:3079856.3080246] demanding more frequent exchanges, efficient execution of these workloads is contingent on efficient collectives.

Unfortunately, achieving good collectives performance in a cloud environment is fundamentally more challenging than in an HPC world, because the user has no control over node placement or topology and has to share the infrastructure with other tenants. These constraints have strong implications for the performance of collectives. As a result, the bottleneck of running these workloads on the cloud has shifted from computation to communication [7092922, PLink, phub].

Consider the common practice of running the popular ring allreduce algorithm in the cloud: a randomly ordered IP list of VMs (obtained through the provider) is used to form a virtual ring along which data is passed, with the $i$-th VM sending data to the $(i+1)$-th VM. Do different ways of forming the ring (through permutation of the VMs in the list) exhibit the same performance? The answer is most likely no, as a ring with a shorter total hop cost will likely perform better (Figure 1). Not all rings achieve the same cost, because the point-to-point communication cost (bandwidth and latency, collectively referred to as locality in this work) differs across VM pairs (Figure 2), due to the hierarchical structure of the datacenter network and the dynamic nature of traffic from other tenants. Consequently, running collectives with a randomly ordered list of VMs results in unpredictable and subpar performance.

Our work focuses on discovering a permutation of the IP list that exploits network locality for efficient communication, in a completely transparent way, by minimizing the cost model of a given collective, parameterized with the actual hop costs. To do so, we need to (1) efficiently identify the underlying network constraints (or collectively, locality); (2) accurately build a cost model for the collective at hand; and (3) effectively approximate the minimum of complex cost functions.

This paper proposes Collectives, a tool that uses network probes to discover locality within the underlying datacenter network and uses it to solve a constrained communication cost minimization problem, with the rank of each VM as the unknowns. We use the reordered ranks as input to unmodified communication backends in microbenchmarks including OMB [10.1007/978-3-642-33518-1_16], Nvidia NCCL [NVIDIACo76:online], and Facebook Gloo [goyal2017accurate], and in real-world workloads of training deep neural networks with PyTorch/Caffe2 and gradient boosted decision trees using LightGBM [NIPS2016_6381, Ke2017LightGBMAH]. Our preliminary results show a speedup of up to 3.7x in various allreduce operations and 1.3x in end-to-end performance across EC2 and Azure.

## II Background

We provide an overview of typical structures and performance implications for datacenter networks and a brief introduction to the various popular collectives operations.

### II-A Locality in Datacenter Network

Modern datacenters are hierarchical, with machines connecting to top-of-rack switches, which are in turn connected to upper-level devices [Mysore2009PortLandAS, VL2, Roy2015InsideTS, incbricks]. This topology induces locality [PLink], as the communication performance between two physical hosts is not uniform. For example, VMs within the same rack have the best and most stable performance, as the physical link is not shared. On the other hand, links between hosts residing in different racks are shared, and the communication performance depends on factors like hop count, link congestion, oversubscription ratio [Bilal2012ACS], and dynamic load. Topology information is crucial for achieving optimal performance, as many collectives implementations generate routines based on this information [4154092, 1639364, 4228133, 1419910, phub]. But in a cloud environment, this information is hidden. Various attempts have been made to reconstruct the physical affinity, e.g., PLink [PLink] uses DPDK-based latency probes and K-Means clustering to find hosts with high physical affinity.

### II-B Collectives

Collectives work by decomposing an operation into a series of point-to-point operations between two nodes according to a predefined communication schedule. Collectives most often appear in MPI contexts [Sack:2011:SCM:2522220, mpich, collectivesOptimization, blum2000architectures, bala1995ccl]. Typically, all nodes in a collective participate in the communication, usually running symmetric tasks. Collectives can be used in many tasks such as (all)reduce, broadcast, (all)gather, and (reduce)scatter [pagestac0:online], and it is thus impractical to individually optimize each task. Fortunately, many of the tasks can be decomposed into multiple stages of collectives primitives (e.g., allreduce can be decomposed into reducescatter followed by allgather). Therefore, we only need to focus on accelerating such primitives. We now introduce these algorithms. We use as the number of participating nodes, as the amount of data to process per node.

Ring [patarasuk2009bandwidth]. As shown in Figure 3(a), ring algorithms work by connecting nodes to form a virtual ring. Data is then passed along the ring sequentially. Ring algorithms require steps to complete, sending amount of data.
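As a concrete reference, the following is a minimal single-process sketch of a ring allreduce, written as a reduce-scatter pass followed by an allgather pass. It assumes a sum reduction and a vector length divisible by the node count; the function name and the in-memory "send" are illustrative and not taken from any backend discussed here. With `n` nodes, the sketch takes `2 * (n - 1)` steps, each moving one chunk of `1/n` of the vector per node.

```python
def ring_allreduce(node_data):
    """Simulate a sum allreduce over a virtual ring.

    node_data[r] is the vector held by rank r. Returns the per-rank
    result vectors and the number of communication steps taken.
    """
    n = len(node_data)
    size = len(node_data[0])
    assert size % n == 0, "vector length must be divisible by node count"
    chunk = size // n
    bufs = [list(v) for v in node_data]  # work on copies
    steps = 0

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        sent = [[bufs[r][i] for i in span((r - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r - s) % n)):
                bufs[dst][i] += sent[r][k]
        steps += 1

    # Phase 2: allgather. Rank r now owns the fully reduced chunk
    # (r + 1) mod n; in step s it forwards chunk (r + 1 - s) mod n.
    for s in range(n - 1):
        sent = [[bufs[r][i] for i in span((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for k, i in enumerate(span((r + 1 - s) % n)):
                bufs[dst][i] = sent[r][k]
        steps += 1

    return bufs, steps
```

Every hop in both passes goes from rank `r` to rank `r + 1`, which is why the order in which VMs are assigned to ranks determines which physical links are exercised.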

Halving Doubling [mpich]. As shown in Figure 3(b), halving doubling works by recursively doubling the distance (in terms of rank ID) while halving the total amount of data sent in each round, requiring steps to finish while sending amount of data on the wire.
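The pairing pattern described above can be sketched as a per-round schedule. This is an illustration under stated assumptions: the node count is a power of two, the peer in each round is the rank whose ID differs in exactly one bit (distance 1, 2, 4, ...), and the payload starts at half the buffer and halves each round. The function name and the returned tuple layout are hypothetical.

```python
def halving_doubling_schedule(n, size):
    """Return one (distance, amount_sent, peers) tuple per round.

    peers[rank] is the rank that `rank` exchanges data with in that round.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    rounds = []
    distance, amount = 1, size / 2
    while distance < n:
        # XOR with the distance flips one bit of the rank ID, so ranks
        # pair up symmetrically: peers[peers[r]] == r.
        peers = [rank ^ distance for rank in range(n)]
        rounds.append((distance, amount, peers))
        distance *= 2
        amount /= 2
    return rounds
```

For 8 ranks this yields three rounds, with rank 0 talking to ranks 1, 2, and 4 in turn, so only a logarithmic number of distinct links per node matter for this algorithm.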

Tree. In one form, a single tree is built where data is transferred from leaves to the root and vice-versa [firecaffe]; in a more optimized setting, a pair of complementary binary trees is built to utilize the full bisection bandwidth [dbt], each sending and receiving . For binary trees, rounds of communication are required, sending bytes.

BCube [glooalgo70:online]. BCube is very similar to halving doubling from a structural perspective, in the sense that nodes are organized into a group of peers. BCube operates in rounds, and in each round each node peers with a unique node in another group. Each node communicates amount of data in round . BCube achieves a total bytes on wire of .

## III Motivation

This section motivates Collectives by demonstrating the implication of rank order on the performance of cloud-based collectives. We start by highlighting the asymmetric, non-uniform link cost in the cloud environment, by launching 64 Standard F64 VMs on the Azure cloud. We then run an in-house hybrid DPDK and ping-based [HomeDPDK76:online] latency probe (§IV-B) between each pair of VM nodes, using the same technique as PLink. The result is summarized in Figure 2 as a heatmap. We observe that the pairwise round-trip latency can range from sub-10 to hundreds of microseconds.

We proceed to examine the performance of the allreduce operation using Gloo [GitHubfa54:online], running the Ring chunked algorithm with 512 Standard F16 VMs. To derive a performance distribution, we use 500 randomly generated rank orders to generate 500 samples, each of which is the average runtime of a 10-iteration reduction of 100MB of data. The result is summarized in the yellow distribution in Figure 1. The performance of different rank orderings of VMs varies drastically, ranging from 330ms to 3400ms, with a mean of 1012ms and a standard deviation of 418ms. Having established the profound influence of the rank ordering of VM nodes on the performance of collectives algorithms, the goal of this paper is to derive an approximately optimal rank ordering that maximizes performance for a selected collectives algorithm.

## IV Design and Implementation

We now describe Collectives, a tool that takes in a list of VM nodes and a target algorithm, accurately and efficiently probes their pairwise distance, and uses that information to construct a rank order of VMs that attempts to minimize the total cost of communication.

### IV-A Cost Models for Collective Algorithms

Collectives builds a cost model for each popular algorithm used in collectives, parameterized with the number of participating nodes and size . This section details the cost models for popular algorithms. We use to refer to the cost for transferring amount of data from node to . We further define . We assume a power of 2 to simplify explanation, and allow arbitrary rank to alias to canonical rank .

Ring. The cost model of the ring algorithm is the sum of the cost of each hop when traversing the ring:
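In one plausible notation (assumed here, not recovered from the original typesetting), let $N$ be the number of nodes, $\pi_i$ the VM placed at rank $i$, and $c(a, b)$ the cost of one transfer from VM $a$ to VM $b$, with the dependence on transfer size omitted for brevity. The sum over hops then reads:

$$
C_{\text{ring}}(\pi) = \sum_{i=0}^{N-1} c\left(\pi_i,\ \pi_{(i+1) \bmod N}\right)
$$

Because every hop contributes directly to the total, any permutation that changes even one adjacent pair can change the cost.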

Halving Doubling. The cost of halving doubling is the sum of costs for each round of communication, which in turn is the max cost of all communications in that round.
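The sum-of-maximums structure can be sketched as follows. This is a simplified illustration, assuming a power-of-two node count, XOR-based peering with distances 1, 2, 4, ..., and a pairwise cost matrix that ignores the changing transfer size per round; `cost`, `order`, and the function name are hypothetical.

```python
def halving_doubling_cost(cost, order):
    """Sum over rounds of the slowest pairwise exchange in each round.

    cost[a][b] is the measured cost from VM a to VM b.
    order[r] is the VM assigned to rank r.
    """
    n = len(order)
    assert n > 0 and n & (n - 1) == 0, "node count must be a power of two"
    total, distance = 0.0, 1
    while distance < n:
        # All exchanges in a round run in parallel, so the round lasts
        # as long as its most expensive exchange.
        total += max(cost[order[r]][order[r ^ distance]] for r in range(n))
        distance *= 2
    return total
```

For example, with `cost = [[0, 1, 4, 3], [1, 0, 3, 9], [4, 3, 0, 2], [3, 9, 2, 0]]`, the order `[0, 1, 2, 3]` pairs VM 1 with VM 3 in the second round and pays for that expensive link, while `[0, 1, 3, 2]` avoids it.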

Tree. The cost of running tree algorithms depends on the number of trees and how trees are constructed. The total cost is the maximum cost of all trees, which is in turn determined by the maximum cost of each subtree. We provide a cost model for a popular variant of tree algorithm: double binary tree as used in [Massivel64:online].

where is expressed recursively:

Similarly, a mirrored tree is built by decrementing each node’s rank in the tree without changing the tree structure.

BCube. The cost of running the BCube algorithm is similar to halving doubling, except in each round, each node communicates with peers, instead of 1.

### IV-B Probing for Pairwise Distance

We need to determine values for with end-to-end measurements. In this work, we use a latency-centric view for the cost component. The rationale stems from the well-known theoretical TCP bandwidth model of [mathis1997macroscopic]: given constant drop rate and window , higher latency induces lower bandwidth in TCP streams. This conveniently lets us approximate costs by probing only for latency. We adopt the probing pipeline used in PLink, which focuses on discovering physical locality with an in-house DPDK-based echo tool, leveraging the network enhancements provided by the clouds [Createan37:online, Enablean80:online]. Each pair of nodes receives a total of probes from sequentially and bidirectionally. To derive an accurate reading, we take the 10th-percentile RTT to filter out interference during probes. Each probe is a UDP packet with a 32-bit payload that encodes a sequence number and round ID for fault tolerance. When DPDK cannot be used, we use fping, an ICMP Echo-based latency probing tool. For each entry in , we update to make it symmetric.
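The aggregation step can be sketched as below. This is a minimal illustration, not the tool's implementation: it assumes RTT samples have already been collected per ordered VM pair, uses a nearest-rank 10th percentile, and symmetrizes by averaging the two directions. The averaging rule is an assumption, since the exact update is not given here, and all names are hypothetical.

```python
def percentile_rtt(samples, pct=10):
    """Nearest-rank percentile of a non-empty list of RTT samples."""
    ordered = sorted(samples)
    # Integer ceiling of pct * len / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def build_cost_matrix(probes, n):
    """Build an n x n symmetric cost matrix.

    probes[(a, b)] is the list of RTTs measured from VM a to VM b.
    """
    cost = [[0.0] * n for _ in range(n)]
    for (a, b), samples in probes.items():
        cost[a][b] = percentile_rtt(samples)
    # Assumption: make the matrix symmetric by averaging both directions.
    for a in range(n):
        for b in range(a + 1, n):
            avg = (cost[a][b] + cost[b][a]) / 2
            cost[a][b] = cost[b][a] = avg
    return cost
```

Taking a low percentile rather than the mean keeps a single congested probe (such as a 300-microsecond outlier among 10-microsecond samples) from inflating the estimated distance.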

### IV-C Minimizing the Cost Model

We parameterize the cost model with values of probed . To derive a rank ordering that minimizes , we perform the following transformation: let a set of variables defined as be a permutation of to be solved, and we replace each with . We can then establish a bijection from the original rank ordering to the desired order once the s are solved. We flatten to use the theory of arrays to allow direct solving with conventional optimizing SMT solvers such as Z3 [de2008z3, ORToolsG24:online].
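To make the permutation formulation concrete, the sketch below evaluates a ring cost under a candidate rank order and finds the best order by exhaustive enumeration. This is only an illustration for a handful of VMs: it does not use an SMT solver, the ring objective is a simplified sum of hop costs, and the function names are hypothetical.

```python
from itertools import permutations


def ring_cost(cost, order):
    """Sum of hop costs around the ring formed by `order`.

    cost[a][b] is the cost from VM a to VM b; order[r] is the VM at rank r.
    """
    n = len(order)
    return sum(cost[order[i]][order[(i + 1) % n]] for i in range(n))


def best_ring_order(cost):
    """Try every rank order and return one with minimal ring cost.

    VM 0 is pinned to rank 0 because rotating a ring does not change
    its cost, which removes a factor of n from the search.
    """
    n = len(cost)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    return min(candidates, key=lambda order: ring_cost(cost, order))
```

With two "racks" of two VMs each, `cost = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]`, the best order keeps rack mates adjacent for a cost of 12, whereas interleaving the racks as `(0, 2, 1, 3)` costs 20. The number of candidates grows factorially with the node count, so enumeration is only feasible for very small inputs.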

Unfortunately, we find solvers inefficient, perhaps due to the non-convex, non-linear nature of the objective function and the large search space (). Thus, we use a two-stage solving process. The first stage employs a range of stochastic search techniques such as simulated annealing [simanneal], with a few standard heuristics (e.g., permuting a random sub-array, permuting random pairs) for obtaining neighboring states, and a timeout. When the search returns with an initial result , we generate an additional SMT constraint to better guide pruning for solvers. We let the solver continue to run for a few minutes, and we either find a better solution or use as the final value. The end product of this process is a rearranged list of VMs.

## V Preliminary Evaluation

We evaluate Collectives with a series of microbenchmarks from various communication backends and real-world applications that use collectives. We represent speedup by comparing the performance we get from the best rank order and the worst rank order. We avoid comparing with the original rank order because it is random and unstable (Figure 1).

### V-A Experimental Setup

Our experiments are conducted on two public clouds, Azure and EC2. We enable network acceleration on both clouds and set the TCP congestion control protocol to DCTCP. We include microbenchmarks that exercise ring, halving doubling, double binary tree, and BCube algorithms. All experiments run on Ubuntu 19. We focus our evaluation on allreduce, one of the most important and popular tasks in collectives.

### V-B Prediction Accuracy of Cost Model

The goal of the cost model is not to predict the actual performance; rather, it should preserve the relative order of performance, i.e., should hold true for as many pairs of s as possible. We demonstrate this for ring-based collectives by generating 10 different rank orders, with the -th order approximately corresponding to the -th percentile in the range of costs found by the solver. We obtain performance data for Facebook Gloo and OpenMPI running the OSU Benchmark on 64 F16 nodes on Azure and 64 C5 nodes on EC2. We then compute the Spearman [spearman] correlation coefficient between the predicted performance and the actual performance for each setup (Table I). We find that the predicted cost and the actual collectives performance are strongly correlated, with coefficients between 0.58 and 0.94.

| Setup | Azure | EC2 |
|---|---|---|
| Gloo Ring 100MB | 0.58 | 0.78 |
| OpenMPI Ring 100MB | 0.81 | 0.94 |

Table I: Spearman correlation coefficient between predicted and actual performance.
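For reference, a rank correlation of this kind can be computed with a few lines of standard-library Python. The sketch below is a generic Spearman implementation (Pearson correlation of average ranks) and is not the evaluation script used for these measurements; it assumes neither input is constant.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Here `xs` would hold the predicted costs of the rank orders and `ys` their measured runtimes. A coefficient of 1 means the cost model orders every pair of rank orders the same way the measurements do, which is the property the cost model is meant to preserve.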

### V-C Microbenchmark Performance

We evaluate Collectives’s efficacy with microbenchmarks of the algorithms introduced in §II. We report the mean speedup over 20 iterations. We run all benchmarks with 512 F16 nodes on Azure, except for NCCL, which runs on 64 P3.8xLarge GPU nodes on EC2. Specifically, we set for BCube; for NCCL, we use a single binary tree reduction for small buffers and a ring for large buffers. In all benchmarks, we reduce a buffer of 100MB, except for Nvidia NCCL, where we reduce a small buffer of 4B to trigger the tree algorithm.

Figure 4 summarizes the speedups achieved using Collectives on these benchmarks, with the ring-family algorithms benefiting the most, up to 3.7x. We speculate that they benefit most because they have a much wider performance distribution, as each permutation of the order can potentially yield a different performance (the cost of each hop is on the critical path); they also have a simpler cost model, allowing the solvers to quickly navigate the objective landscape. On the other hand, halving doubling, BCube, and tree algorithms have complex objectives (sums of maximums), resulting in a narrower performance distribution, because a mutation of the ordering may not change the cost at all if it does not change the critical path.

### V-D End-to-end Performance Impact on Real-world Applications

Distributed Gradient Boosted Decision Tree. We evaluate Collectives’s impact on LightGBM [Ke2017LightGBMAH], a gradient boosted decision tree training system. We use data parallelism to run lambdarank with metric ndcg. Communication-wise, this workload runs two tasks, allreduce and reducescatter, which are called sequentially in each split of the iteration. At our scale of 512 nodes, LightGBM automatically chooses halving doubling for both reducescatter and allreduce. We use a dataset that represents an actual workload in our commercial setting, with 5K columns and a total size of 10GB for each node. We train 1K trees, each with 120 leaves. We exclude the time it takes to load data from disk to memory and report the average speedup over 1000 iterations. Collectives-generated rank orders speed up training by 1.3x.

Distributed Deep Neural Network. We show Collectives’s effectiveness on distributed training of DNNs with Caffe2/PyTorch, on 64 EC2 p3.8xLarge nodes with data parallelism and a batch size of 64/GPU. We train AlexNet on the ImageNet dataset. Since Collectives does not change computation and only improves the communication efficiency of the allreduce operation at iteration boundaries, we report the speedup of training in terms of images/second, averaged across 50 iterations. We use the ring chunked algorithm, which achieves the best baseline performance; the Collectives-optimized rank ordering of VM nodes achieves a speedup of 1.2x.

## VI Discussion and Limitations

Generalizability Study. Due to lack of resources, we only evaluated Collectives on a few VM allocations on each cloud. Further investigation is needed to understand how Collectives performs across different physical allocations and whether Collectives’s rank order reacts well to cloud variance.

Limitations to Cost Modeling. Our use of a latency-centric cost function is reasonable and effective, but it is not perfect: depending on link conditions, different bandwidths can correspond to the same latency. This may lead to a suboptimal solution when a transfer is bandwidth-bound. Further study is needed to determine how to incorporate bandwidth-related cost into the cost function without incurring high probing overhead.

Complements to Cost Models. While building accurate cost models is difficult, we can dynamically adjust the rank ordering with help from tools such as TCP_INFO [mathis2003web100] that monitor link properties such as latency and bandwidth. Since we know the communication pattern, we can determine the critical path and find the bottleneck transfer between node and in the system. From there, we can find a to replace such that the replacement results in a minimized cost objective.

Adapting to Dynamic Traffic. The above mechanism can be applied to adapt to dynamic network load in the cloud environment. The framework, however, must support the dynamic change of node ranks, which is possible and should come with a small cost as a full mesh of connections can be established beforehand among all nodes.