1. Introduction
Ethernet continues to be important in the HPC landscape. While the most powerful Top500 systems use vendor-specific or InfiniBand (IB) interconnects, more than half of the Top500 machines (November 2018 issue) (dongarra1997top500, ) are based on Ethernet, see Figure 1 (left plot). We observe similar numbers for the Green500 list. The importance of Ethernet is increased by the “convergence of HPC and Big Data”, with cloud providers and data center operators aggressively aiming for high-bandwidth and low-latency fabric (valadarsky2015, ; handley2017re, ; vanini2017letflow, ). Another example is Mellanox, whose Ethernet sales for the 3rd quarter of 2017 were higher than its InfiniBand sales (mellanoxsales, ).
Yet, Ethernet systems are scarce in the highest 100 positions of the Top500. For example, in November 2018, only four such systems were among the highest 100. Ethernet systems are also less efficient than InfiniBand, custom, Omni-Path, and proprietary systems, see Figure 1 (right). This is also the case for systems with similar sizes, injection bandwidths, and topologies, indicating overheads caused by routing. Thus, enhancing routing in HPC Ethernet clusters would improve the overall performance of 50% of the Top500 systems. As Ethernet is prevalent in cloud systems (zhang2010cloud, ; azodolmolky2013cloud, ), it would similarly accelerate cloud infrastructure.
Clos is the most commonly deployed topology in data centers and supercomputers today, and it dominates the landscape of Ethernet clusters (niranjan2009portland, ; handley2017re, ; valadarsky2015, ). Yet, many low-diameter topologies have recently been proposed that claim to improve the cost-performance tradeoff compared to Clos networks. For instance, Slim Fly is more cost- and power-efficient than fat trees and Clos (alfares2008scalable, ) while offering 25% lower latency. Similar numbers have been reported for Jellyfish (singla2012jellyfish, ) and Xpander (valadarsky2015, ). These topologies could significantly enhance the compute capabilities of Ethernet clusters.
However, the above comparisons (low-diameter topologies vs. Clos) assume hard-to-deploy routing, for example in the case of Jellyfish (singla2012jellyfish, ). Moreover, the bar for comparison with Clos interconnects has been raised substantially. Clos was traditionally deployed using ECMP, which tries to approximate an equal split of a fluid flow across shortest paths. Bleeding-edge Clos proposals based on per-packet load balancing (these schemes account for packet reordering) and novel transport mechanisms achieve smaller tail flow completion time (FCT) than ECMP (handley2017re, ; ghorbani17drill, ).
The above two research threads raise two questions we have not seen addressed so far. First, what is a high-performance routing architecture for low-diameter networks, assuming an Ethernet stack? The key issue here is that traditional routing schemes such as ECMP cannot be directly used in networks such as Slim Fly, because (as we will show) shortest paths fall short in these topologies: there is almost always only one shortest path between endpoint pairs. Restricting traffic to these paths does not utilize such topologies’ path diversity, and it remains unclear how to split traffic across non-shortest paths of unequal lengths. Second, can low-diameter networks continue to claim an improvement in the cost-performance tradeoff against the new, superior Clos baselines? The key issue here is that the recent progress on Clos and fat trees also does not directly translate to topologies like Slim Fly, because the optimality of splitting traffic equally for Clos does not extend to recent low-diameter topologies.
In this work, we answer both questions affirmatively. We first analyze in detail the path diversity in five low-diameter topologies and discover that, even though these topologies fall short of shortest paths, they offer enough path diversity when using “almost” shortest paths. We then present FatPaths, a high-performance, simple, and robust routing architecture for low-diameter Ethernet networks, aiming to accelerate both HPC systems and cloud infrastructure. FatPaths encodes the rich diversity of non-minimal paths in low-diameter networks on commodity hardware using layered routing. It also uses a redesigned (“purified”) transport layer (based on recent data center designs for fat trees (handley2017re, )) with lossless metadata exchange (packet headers always reach their destinations), almost no dropped packet payload, fast start (senders start transmitting at line rate), shallow buffers, and priority queues for retransmitted packets to avoid head-of-line congestion (handley2017re, ), ultimately ensuring low latency and high bandwidth. Finally, FatPaths uses flowlet switching (kandula2007dynamic, ), a scheme proposed for Clos to prevent packet reordering, to enable very simple but powerful load balancing in non-Clos low-diameter networks.
Routing Scheme (Name, Abbreviation, Reference)  Stack Layer  Features of routing schemes  
SP  NP  SM  MP  DP  ALB  AT  
(1) SIMPLE ROUTING PROTOCOLS (often used as building blocks):  
Valiant load balancing (VLB) (valiant1982scheme, )  L2–L3  
Simple Spanning Tree (ST) (perlman1985algorithm, )  L2  
Simple routing, e.g., OSPF (moy1997ospf, ; rekhter2005border, ; malkin1994rip, ; oran1990osi, )  L2, L3  
ECMP (hopps2000analysis, ), OSPFOMP (villamizar1999ospf, )  L3  
UGAL (kim2008technology, )  L2–L3  
Simple Packet Spraying (PR) (dixit2013impact, ; sen2013localflow, )  L2–L3  
(2) ROUTING ARCHITECTURES:  
DCell (guo2008dcell, )  L2–L3  
Monsoon (greenberg2008towards, )  L2, L3  
PortLand (niranjan2009portland, )  L2  
DRILL (ghorbani2017drill, )  L2  
LocalFlow (sen2013localflow, ), DRB (cao2013per, )  L2  
VL2 (greenberg2009vl2, )  L3  
Architecture by Al-Fares et al. (alfares2008scalable, )  L2–L3  
BCube (guo2009bcube, )  L2–L3  
SEATTLE (kim2008floodless, ), others (lui2002star, ; rodeheffer2000smartbridge, ; perlman2004rbridges, ; garcia2003lsom, )  L2  
VIRO (jain2011viro, )  L2–L3  
Ethernet on Air (sampath2010ethernet, )  L2  
PAST (stephens2012past, )  L2  
MLAG, MCLAG, others (subramanian2014multi, )  L2  
MOOSE (scott2009addressing, )  L2  
MPA (narvaez1999efficient, )  L3  
AMP (gojmerac2003adaptive, )  L3  
MSTP (de2006improving, ), GOE (iwata2004global, ), Viking (sharma2004viking, )  L2  
SPB (allan2010shortest, ), TRILL (touch2009transparent, ), Shadow MACs (agarwal2014shadow, )  L2  
SPAIN (mudigonda2010spain, )  L2  
(3) Schemes for exposing/encoding paths (can be combined with FatPaths):  
XPath (hu2016explicit, )  L3  
Source routing for flexible DC fabric (jyothi2015towards, )  L3  
(4) FatPaths [This work]  L2–L3 
We extensively compare FatPaths to other routing schemes in Table 1. FatPaths is the only scheme that simultaneously (1) enables multipathing using both (2) shortest and (3) non-shortest paths, (4) explicitly considers disjoint paths for highest performance, (5) offers adaptive load balancing, and (6) is generic, being applicable across topologies. Table 1 focuses on various aspects of path diversity because, as topologies lower their diameter and reduce link count, path diversity, which is key to high-performance routing, becomes a scarce resource demanding careful examination and use.
Although FatPaths primarily targets Ethernet networks, most of its mechanisms are generic. We briefly discuss the feasibility of implementing Remote Direct Memory Access (RDMA) (fompipaper, ) technologies such as RDMA over Converged Ethernet (RoCE) (infiniband2014rocev2, ) and the Internet Wide Area RDMA Protocol (iWARP) (iwarp, ) on top of FatPaths. For wide applicability in data centers and cloud systems, we integrate FatPaths with TCP protocols such as Data Center TCP (DCTCP) (alizadeh2011data, ) and MPTCP (raiciu2011improving, ). We also summarize the advantages of FatPaths over flow control schemes such as Priority Flow Control (PFC) (ieee802.1bb, ; pfc_paper, ). Finally, we discuss how FatPaths could enhance InfiniBand, possibly starting a line of future work on more powerful lossless routing on low-diameter topologies.
We conduct extensive, large-scale packet-level simulations and a comprehensive theoretical analysis. We simulate topologies with up to 1 million endpoints (to the best of our knowledge, these are the largest shared-memory simulations so far). We motivate FatPaths in Figure 2. Slim Fly and Xpander equipped with FatPaths ensure 15% higher throughput and 2× lower latency than similar-cost fat trees, for various flow sizes (we use the term “flow” as equivalent to a “message”) and for heavily-skewed traffic.
Towards the above goals, we contribute:


The identification of the diversity of non-minimal paths as a key resource, and the first detailed analysis of the potential for multipath routing in five low-diameter network topologies, considering several metrics for their path diversity (§ 4).

A novel path diversity metric (Path Interference) that captures bandwidth loss between specific pairs of routers and enhances the path diversity analysis (§ 4).

A comprehensive analysis of existing routing schemes in terms of their support for path diversity (Table 1).

A theoretical analysis illustrating the advantages of FatPaths (§ 6).

Extensive, large-scale packet-level simulations (reaching around one million endpoints) that demonstrate the advantages of low-diameter network topologies equipped with FatPaths over very recent Clos designs: 15% higher net throughput at 2× lower latency for comparable cost (§ 7).
2. Notation, Background, Concepts
We first introduce the notation and basic concepts. The most important symbols are summarized in Table 2.
Network structure  V, E: sets of vertices/edges (routers/links).  
N, N_r: numbers of endpoints and routers in the network.  
p: number of endpoints attached to a router (concentration).  
k': number of channels to other routers (network radix).  
k: router radix (k = k' + p).  
D, d: network diameter and the average path length.  
Diversity of paths (§ 4)  Routers used in § 4.  
Router sets used in § 4.  
Count of disjoint paths (of at most a given length) between two router sets.  
Diversity of minimal paths between two routers.  
Lengths of minimal paths between two routers.  
Path interference between two pairs of routers.  
Layers (§ 5)  The total number of layers in FatPaths routing.  
A layer, defined by its forwarding function.  
Fraction of edges used in routing. 
2.1. Network Model
We model an interconnection network as an undirected graph G = (V, E), where V and E are the sets of routers (N_r = |V|) and full-duplex inter-router physical links (we abstract away HW details and use the term “router” for both L2 switches and L3 routers). Endpoints are not modeled explicitly. There are N endpoints in total; p endpoints are attached to each router (concentration) and each router has k' channels to other routers (network radix). The total router radix is k = k' + p. The diameter is denoted by D and the average path length by d.
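As a quick illustration, the bookkeeping of this model can be sketched in a few lines. The class, field names, and example values below are our own illustrative choices, not part of the paper.

```python
# Sketch of the network-model bookkeeping described above; the names
# (NetworkModel, n_routers, p, k_net) are our own illustrative labels.
from dataclasses import dataclass

@dataclass
class NetworkModel:
    n_routers: int   # number of routers (vertices of the graph)
    p: int           # concentration: endpoints attached to each router
    k_net: int       # network radix: channels to other routers

    @property
    def n_endpoints(self) -> int:
        # endpoints are not modeled explicitly; there are n_routers * p in total
        return self.n_routers * self.p

    @property
    def router_radix(self) -> int:
        # total router radix = endpoint ports + inter-router ports
        return self.p + self.k_net

# hypothetical example instance
m = NetworkModel(n_routers=722, p=14, k_net=29)
```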
2.2. Network Topologies
We summarize the considered topologies in Table 3. We consider Slim Fly (SF) (besta2014slim, ), Dragonfly (DF) (kim2008technology, ) (the “balanced” variant), Jellyfish (JF) (singla2012jellyfish, ), Xpander (XP) (valadarsky2015, ), and HyperX (Hamming graph) (HX) (ahn2009hyperx, ), which generalizes the Flattened Butterfly (FBF) (kim2007flattened, ). We also use established three-stage fat trees (FT3) (leiserson1996cm5, ), a variant of the Clos network (clos1953study, ).
2.2.1. Topology Types
Some selected networks are flexible (parameters determining their structure can have arbitrary values) while most are fixed (parameters must follow well-defined closed-form expressions). Next, networks can be group hierarchical (routers form groups connected with the same pattern of intra-group local cables, and groups are then connected with global inter-group links), semi-hierarchical (there is some structure but no such groups), or flat (no distinctive hierarchical structure at all). Finally, topologies can be random (based on randomized constructions) or deterministic.
Topology  Structure remarks  D  Variant  Deployed?  
Slim Fly (SF) (besta2014slim, )  Consists of groups  2  MMS  unknown 
HyperX (HX2) (ahn2009hyperx, )  Consists of groups  2  Flat. Butterfly (kim2007flattened, )  unknown 
Dragonfly (DF) (kim2008technology, )  Consists of groups  3  “balanced”  PERCS (arimilli2010percs, ), Cascade (faanes2012cray, ) 
HyperX (HX3) (ahn2009hyperx, )  Consists of groups  3  “regular” (cube)  unknown 
Xpander (XP) (valadarsky2015, )  Consists of metanodes  3  randomized  unknown 
Jellyfish (JF) (singla2012jellyfish, )  Random network  3  “homogeneous”  unknown 
Fat tree (FT) (leiserson1996cm5, )  Endpoints form pods  4  3 router layers  Many systems 
2.2.2. Fair Selection of Topology Parameters
We use four classes of network sizes: small, medium, large, and huge. We set the concentration to maximize throughput while minimizing congestion and network cost (we analyze this later in § 7). Finally, we select the network radix and router count so that, for a fixed network size, the compared topologies use similar amounts of networking hardware and thus have similar construction costs.
2.2.3. Special Case: Jellyfish
The considered topologies cannot use arbitrary values of the network radix and router count. An exception is Jellyfish, which is “fully flexible”: a JF instance exists for every such combination. Thus, to fully evaluate JF, for every other network X we consider an equivalent JF (denoted X-JF) with identical parameters.
2.3. Flow Model
We use a Poisson-distributed flow arrival rate and a matrix defined on endpoint pairs to model flow sizes and traffic.
2.4. Considered Traffic Patterns
We analyze recent works in high-performance and data-center networking (besta2014slim, ; prisacari2013fast, ; yuan2014lfti, ; yuan2013new, ; prisacari2014efficient, ; prisacari2014randomizing, ; kathareios2015cost, ; prisacari2015performance, ; chen2016evaluation, ; prisacari2013bandwidth, ; karacali2018assessing, ; sehery2017flow, ; kassing2017beyond, ) to select traffic patterns that represent important HPC workloads and cloud or data-center traffic. Formally, a traffic pattern is a mapping from source endpoint IDs to destination endpoint IDs.
2.4.1. Random Patterns
First, we select random uniform (each source sends to a destination chosen uniformly at random) and random permutation (destinations form a permutation of the source IDs, chosen uniformly at random).
These patterns represent irregular workloads such as graph computations, sparse linear algebra solvers, and adaptive mesh refinement methods (Yuan:2013:NRS:2503210.2503229, ). They are used in both HPC studies (besta2014slim, ) and data center and cloud infrastructure analyses (karacali2018assessing, ; sehery2017flow, ; lebiednik2016survey, ).
2.4.2. OffDiagonal Patterns
We also use off-diagonals, in which each source communicates with the endpoint whose ID differs from its own by a fixed offset (modulo the number of endpoints).
These patterns are often used in workloads such as nearest neighbor data exchanges (Yuan:2013:NRS:2503210.2503229, ), used in HPC and data centers (jyothi2016measuring, ).
2.4.3. Bit Permutation Patterns
Next, we pick shuffle, a bit permutation pattern in which the destination ID is obtained from the source ID by a bitwise left rotation.
These patterns represent collective operations such as MPI all-to-all or MPI all-gather (besta2014slim, ; Yuan:2013:NRS:2503210.2503229, ), used in HPC. They are also used for the evaluation of data center networks (sehery2017flow, ; lebiednik2016survey, ; kassing2017beyond, ).
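As a sketch, the patterns above (random uniform, random permutation, off-diagonal, and shuffle) can be generated as follows. The function names and the choice of a one-bit rotation for shuffle are our own illustrative assumptions.

```python
# Hedged sketches of the traffic patterns of Sec. 2.4; each pattern maps a
# source endpoint ID to a destination endpoint ID.
import random

def random_uniform(n):
    # every source picks a destination u.a.r. (self-sends allowed for brevity)
    return {s: random.randrange(n) for s in range(n)}

def random_permutation(n):
    # destinations form a permutation of sources, chosen u.a.r.
    dst = list(range(n))
    random.shuffle(dst)
    return {s: dst[s] for s in range(n)}

def off_diagonal(n, offset):
    # each source s talks to endpoint (s + offset) mod n
    return {s: (s + offset) % n for s in range(n)}

def shuffle_pattern(n_bits):
    # bit permutation: destination is the source ID rotated left by one bit
    n = 1 << n_bits
    rotl = lambda s: ((s << 1) | (s >> (n_bits - 1))) & (n - 1)
    return {s: rotl(s) for s in range(n)}
```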
2.4.4. Stencils
We also use stencils, realistic traffic patterns common in HPC. We model 2D stencils as four off-diagonals at fixed offsets. For large simulations, we use different offsets to reduce the number of communicating endpoint pairs that sit on the same switches.
2.4.5. AllToOne
In this pattern, traffic from all endpoints is directed towards a single random endpoint in the network.
2.4.6. Adversarial Pattern
We use a skewed off-diagonal with large offsets (we ensure that it has a very high number of colliding paths).
2.4.7. WorstCase Pattern
Finally, we use worstcase traffic patterns. We focus on a recently proposed pattern, developed specifically to maximize stress on the interconnect while hampering effective routing (jyothi2016measuring, ). This pattern is generated individually for each topology. It uses maximum weighted matching algorithms to find a pairing of endpoints that maximizes average flow path length, using both elephant and small flows.
As the generation process is individual for each network, our worstcase pattern stresses the interconnect in any setting, including HPC systems, data centers, or any other cloud infrastructure.
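The cited generator relies on maximum weighted matching. As a hedged illustration of the underlying idea (pair endpoints so that flow paths are as long as possible), a much simpler greedy stand-in might look as follows; this is not the algorithm of (jyothi2016measuring, ), only a sketch.

```python
# Greedy approximation of the worst-case pattern idea: pair endpoints so
# that the average path length is (close to) maximal. The real generator
# uses maximum weighted matching; this greedy stand-in only illustrates
# the optimization goal.
def greedy_worst_case(dist):
    """dist[s][t]: path length between endpoints s and t (a full matrix)."""
    n = len(dist)
    unmatched = set(range(n))
    pairing = {}
    while unmatched:
        s = min(unmatched)
        unmatched.remove(s)
        # pick the farthest still-unmatched partner for s
        t = max(unmatched, key=lambda x: dist[s][x], default=None)
        if t is None:
            break  # odd endpoint count: s stays unpaired
        unmatched.remove(t)
        pairing[s] = t
        pairing[t] = s
    return pairing
```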
3. FatPaths Architecture: Overview
We first outline the FatPaths architecture. A design summary is in Figure 3. FatPaths stands on four key design ideas that, combined, effectively use the “fat” diversity of minimal and non-minimal paths. These ideas are layered non-minimal routing, flowlet load balancing, “purified” transport, and randomized workload mapping.
3.1. Layered Routing
To encode the diversity of minimal and non-minimal paths with commodity hardware, FatPaths divides all links into (not necessarily disjoint) subsets called layers. Routing within each layer uses shortest paths; these paths are usually not shortest when considering all network links. Different layers encode different paths between each endpoint pair. This enables taking advantage of the diversity of non-minimal paths in low-diameter topologies. The number of layers is minimized to reduce the hardware resources needed to deploy them. Layers can easily be implemented with commodity schemes, e.g., VLANs or a simple partitioning of the address space.
We provide two schemes for the construction of layers: a simple randomized approach and an augmentation that minimizes the number of overlapping paths between communicating endpoints. Moreover, we encode existing routing schemes that enable multipathing, such as SPAIN (mudigonda2010spain, ), PAST (stephens2012past, ), and shortest paths (singla2012jellyfish, ), using FatPaths layers. We analyze which scheme is most advantageous for which topology.
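A minimal sketch of the randomized layer construction might look as follows, assuming layers are represented as plain edge subsets and intra-layer routing is BFS shortest paths. All names, and the convention that layer 0 keeps every link, are our own illustrative choices.

```python
# Minimal sketch of FatPaths-style layered routing: each layer keeps a
# random fraction rho of the links (layer 0 keeps all links, so minimal
# paths stay available), and forwarding inside a layer follows shortest
# paths of that layer's subgraph.
import random
from collections import deque

def build_layers(edges, n_layers, rho, seed=0):
    rng = random.Random(seed)
    layers = [list(edges)]  # layer 0: full graph -> shortest paths
    for _ in range(n_layers - 1):
        layers.append([e for e in edges if rng.random() < rho])
    return layers

def shortest_path(layer_edges, src, dst):
    # BFS over the (undirected) links of one layer
    adj = {}
    for u, v in layer_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev, frontier = {src: None}, deque([src])
    while frontier:
        u = frontier.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj.get(u, []):
            if w not in prev:
                prev[w] = u
                frontier.append(w)
    return None  # dst unreachable within this layer
```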
3.2. Load Balancing
To achieve very simple but powerful load balancing, we use flowlet switching (sinha2004burstiness, ; kandula2007dynamic, ), a technique originally used to alleviate packet reordering in TCP. A flowlet is a sequence (also referred to as a burst) of packets within one flow, separated from other flowlets by sufficiently large time gaps. Flowlet switching can also be used for very simple load balancing: a router simply picks a random path for each flowlet, without any probing for congestion. This scheme was used for Clos networks (vanini2017letflow, ). The power of such load balancing lies in the fact that flowlets are elastic: their size changes automatically based on network conditions. On paths with higher latency or lower bandwidth, flowlets are usually smaller, because time gaps large enough to separate two flowlets occur more frequently. Conversely, paths with lower latency and more bandwidth feature longer flowlets, because such time gaps appear less often.
We propose to use flowlets in low-diameter non-Clos networks as the load balancing part of FatPaths. Here, we combine flowlets with layered routing: each flow is divided into flowlets that are sent using different layers. The key observation is that the elasticity of flowlets automatically ensures that such load balancing takes into account both static network properties (e.g., longer vs. shorter paths) and dynamic ones (e.g., more vs. less congestion). Consider a pair of communicating routers. As we will show later (§ 4), virtually all router pairs are connected with exactly one shortest path but multiple non-minimal paths, possibly of different lengths. In many workload scenarios, the shortest path experiences the least congestion, while longer paths are more likely to be congested. Here, the elasticity of flowlet load balancing ensures that larger flowlets are sent over shorter and less congested paths, while shorter flowlets are transmitted over longer and usually more congested paths.
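The flowlet/layer interplay described above can be sketched as follows. The class name, the gap threshold, and the uniform layer choice are illustrative assumptions, not FatPaths' exact mechanism.

```python
# Sketch of flowlet-based load balancing: a new flowlet starts whenever
# the gap since the flow's previous packet exceeds a threshold, and each
# flowlet picks a layer u.a.r., with no congestion probing.
import random

class FlowletBalancer:
    def __init__(self, n_layers, gap_threshold):
        self.n_layers = n_layers
        self.gap = gap_threshold
        self.last_seen = {}   # flow id -> arrival time of previous packet
        self.layer_of = {}    # flow id -> layer of the current flowlet

    def route(self, flow_id, now):
        last = self.last_seen.get(flow_id)
        if last is None or now - last > self.gap:
            # time gap large enough: this packet starts a new flowlet
            self.layer_of[flow_id] = random.randrange(self.n_layers)
        self.last_seen[flow_id] = now
        return self.layer_of[flow_id]
```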
3.3. Purified Transport with NDP (handley2017re, )
The transport layer in FatPaths is inspired by recent Clos transport designs, namely NDP (handley2017re, ): it removes virtually all TCP and Ethernet issues that hamper latency and throughput. First, if router queues fill up, only the packet payload is dropped. As packet headers with all the metadata are preserved, the receiver has full information on the congestion in the network and can pull the data from the sender at a rate dictated by the evolving network conditions. Specifically, the receiver can request a layer change when packets within flowlets transmitted over paths belonging to that layer arrive without payload, indicating congestion.
Second, routers prioritize (1) headers of packets that lost their payload and (2) retransmitted packets. This ensures that congested flows finish quickly and reduces head-of-line blocking. Third, senders transmit the first RTT of data at line rate, without probing for available bandwidth. Finally, router queues are shallow. All these elements result in a low-latency, high-throughput transport layer that meets the demands of various traffic patterns and can be implemented with existing network technology.
3.4. Randomized Workload Mapping
We optionally use random assignments, where communicating endpoints are located at routers chosen u.a.r. (uniformly at random). First, one often cannot rely on locality due to schedulers or virtualization. For example, Cray machines often host processes from one job in different parts of the machine to increase utilization. Second, many workloads, such as distributed graph processing, have little or no locality (lumsdaine2007challenges, ). Finally, and perhaps most importantly, the low diameter of the used topologies, especially those with the lowest diameters, mostly eliminates the need for locality-aware software. We predict that this will be a future trend, as reducing cost and power consumption while simultaneously increasing scale is inherently associated with reducing diameter (besta2014slim, ). However, to cover applications tuned for locality, we also evaluate non-randomized workloads and show that FatPaths ensures the highest performance in such cases as well.
4. Path Diversity in Modern Topologies
FatPaths enables using the diversity of paths in low-diameter topologies for high-performance routing. To develop FatPaths, we first need to understand the “nature” of this path diversity. We also justify and motivate using multipathing in low-diameter networks. Namely, we show that low-diameter topologies exhibit congestion due to conflicting flows even in mild traffic scenarios, and we derive the minimum number of disjoint paths that would eliminate flow conflicts (§ 4.1). We then formalize the notion of “path diversity” (§ 4.2) and use our formal measures to show that all low-diameter topologies have few shortest but enough non-minimal paths to accommodate flow collisions, an important type of flow conflict (§ 4.3). In the evaluation (§ 7), we show that another type of flow conflict, flow overlaps, is also alleviated by FatPaths. To the best of our knowledge, compared to recent works on low-diameter networks (valadarsky2015, ; kathareios2015cost, ; jyothi2016measuring, ; singla2012jellyfish, ; kassing2017beyond, ; besta2018slim, ; li2018exascale, ; kawano2018k, ; harsh2018expander, ; kawano2016loren, ; truong2016layout, ; flajslik2018megafly, ; kawano2017layout, ; azizi2016hhs, ; truong2016distributed, ; al2017new, ), we provide the most extensive analysis of path diversity in low-diameter networks so far (with respect to the number of path diversity metrics and topologies).
4.1. How Much Path Diversity Do We Need?
FatPaths uses path diversity to avoid congestion due to conflicting flows. Consider two communicating pairs of endpoints. The generated flows conflict when their paths collide (i.e., the flows use an identical path) or overlap (i.e., the flows share some links), see Figure 5. Collisions depend only on how communicating endpoints are attached to routers (i.e., on the concentration and the router count, and thus also indirectly on the network size). Intuitively, collisions measure the workload’s demand for path diversity (multipathing). In contrast, overlaps depend on the topology details (i.e., how routers are connected to other routers). Intuitively, overlaps capture how well a topology can sustain a workload.
To understand how much path diversity is needed to alleviate flow conflicts, we analyze the impact of topology properties (diameter, concentration, size) and the traffic pattern on the number of colliding paths, see Figure 4. In most cases, the number of collisions is at most three, especially when lowering the concentration (while increasing the router count). Importantly, this holds even for the adversarial 4× oversubscribed patterns that stress the interconnect. In some configurations, however, at least nine collisions occur for more than 1% of router pairs, even in mild traffic patterns. While we do not consider such configurations in practical applications, we note that global DF links form a complete graph, demanding high path diversity at least with respect to the global links.
We consider five traffic patterns: a random permutation, a randomly mapped off-diagonal, a randomly mapped shuffle, four random permutations in parallel, and a randomly mapped 4-point stencil composed of four off-diagonals. The last two patterns are 4× oversubscribed and are thus expected to generate even more collisions.
Takeaway  We need at least three disjoint paths per router pair to handle colliding paths in all considered workloads, assuming random mapping. As the number of colliding paths lower-bounds the number of overlapping paths, the same holds for overlaps.
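The notion of colliding paths can be estimated with a quick sketch: assuming endpoint e is attached to router e // p and a single (shortest) path per router pair, flows collide exactly when they connect the same router pair. The code below is our own illustration, not the paper's measurement methodology.

```python
# Back-of-the-envelope collision count for a traffic pattern: with
# endpoint e attached to router e // p and one path per router pair,
# the most-loaded router pair bounds the worst collision count.
from collections import Counter

def collisions(pattern, p):
    """pattern: dict source endpoint -> destination endpoint;
    p: endpoints per router (concentration)."""
    per_router_pair = Counter(
        (s // p, t // p) for s, t in pattern.items() if s // p != t // p
    )
    # number of colliding flows on the most loaded router pair
    return max(per_router_pair.values(), default=0)
```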
4.2. How Should We Measure Path Diversity?
To analyze whether low-diameter topologies provide at least three disjoint paths per router pair, we first need to formalize the notions of “disjoint paths” and “path diversity” in general. For example, we must be able to distinguish between partially and fully disjoint paths that may have different lengths. Thus, we first define the count of disjoint paths (CDP), minimal and non-minimal, between routers (§ 4.2.1); this measure addresses path collisions. Moreover, to analyze path overlaps, we define two further measures: path interference (PI, § 4.2.2) and total network load (TNL, § 4.2.3). We summarize each measure and provide all formal details for reproducibility; these details can be skipped by readers interested only in intuition. We use several measures because no single measure that we tested can fully capture the rich concept of path diversity.
4.2.1. Count of Disjoint Paths (CDP)
We define the count of disjoint paths (CDP) between two router sets at a given length as the smallest number of edges that must be removed so that no path of at most that length exists from any router in the first set to any router in the second set.
To compute the CDP, we first define the i-step neighborhood of a router set as the set of routers at exactly i hops away from it. The condition that no path of length at most a given bound exists between the two router sets can then be expressed in terms of these neighborhoods. To derive the CDP values, we use a variant of the Ford–Fulkerson algorithm (ford1956maximal, ) (with various pruning heuristics) that removes edges in paths between designated routers (at various distances) and verifies whether the sets become disconnected at the considered length. We are most often interested in pairs of designated routers.

Minimal paths are vital in routing and congestion reduction, as they use the fewest resources per flow. We derive the distribution of minimal path distances and diversities. Intuitively, the former describes (statistically) the distances between router pairs, while the latter provides the respective path counts. Note that the maximum over the minimal path distances equals the diameter.

For non-minimal paths, we consider the CDP of random router pairs at path lengths above the minimal distance.
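As a hedged illustration of the quantity being computed, the following greedy procedure yields a lower bound on the CDP between two routers at a bounded length. The paper's actual computation uses a Ford–Fulkerson variant with pruning heuristics, which this sketch does not reproduce.

```python
# Greedy lower bound on the count of disjoint paths (CDP) between routers
# s and t at length <= l: repeatedly find a path of at most l hops by
# depth-bounded BFS and remove its edges.
from collections import deque

def cdp_lower_bound(edges, s, t, l):
    edges = {frozenset(e) for e in edges}
    count = 0
    while True:
        # BFS bounded to l hops over the remaining edges
        adj = {}
        for e in edges:
            u, v = tuple(e)
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
        prev, depth, frontier = {s: None}, {s: 0}, deque([s])
        found = False
        while frontier:
            u = frontier.popleft()
            if u == t:
                found = True
                break
            if depth[u] == l:
                continue  # do not expand beyond l hops
            for w in adj.get(u, []):
                if w not in prev:
                    prev[w], depth[w] = u, depth[u] + 1
                    frontier.append(w)
        if not found:
            return count
        # remove the edges of the found path and look for another one
        u = t
        while prev[u] is not None:
            edges.discard(frozenset((u, prev[u])))
            u = prev[u]
        count += 1
```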
4.2.2. Path Interference (PI)
We define Path Interference (PI), which is – to the best of our knowledge – the first metric that measures path overlap while considering the local topology structure. Paths between two router pairs (the first source communicating with the first target, and the second source with the second target) interfere if their total count of disjoint paths at a given length is lower than the sum of the two pairs’ individual counts of disjoint paths at that length; we define path interference as this difference. Path interference captures the fact that, if both pairs communicate simultaneously, the bandwidth available to each pair is reduced.
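One consistent reading of the defining equation (lost in extraction; the notation c_l(A, B) for the count of disjoint paths of length at most l between router sets A and B is ours) is:

```latex
% Path interference between router pairs (s1, t1) and (s2, t2):
% positive when the two pairs cannot use their full individual
% path diversity simultaneously.
I_l\big((s_1,t_1),(s_2,t_2)\big)
  = c_l(\{s_1\},\{t_1\}) + c_l(\{s_2\},\{t_2\})
  - c_l(\{s_1,s_2\},\{t_1,t_2\})
```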
4.2.3. Total Network Load (TNL)
TNL is a simple upper bound on the number of flows that a network can maintain without congestion. Intuitively, it constitutes the maximum supply of path diversity offered by a topology. It uses the notion that a flow occupying a path of length l “consumes” l links; TNL then follows from the total number of links in the network.
Takeaway  Due to the rich nature of path diversity, we suggest using several measures: for example, the count of minimal as well as non-minimal disjoint paths (measuring collisions), and path interference as well as total network load (measuring overlaps).
4.3. Do We Have Enough Path Diversity?
We now use our measures to analyze path diversity in low-diameter networks. First, selected results on minimal paths are in Figure 6. In DF and SF, most router pairs are connected with one minimal path. In XP, more than 30% of router pairs are connected with only one minimal path. In the corresponding JF networks, the results are more leveled out, but pairs of routers with only one shortest path in between still form large fractions. FT3 and HX show the highest diversity, with very few unique minimal paths, while the matching JFs have lower diversities. The results match the structure of each topology (e.g., one can distinguish intra- and inter-pod paths in FT3).
Takeaway: In all considered low-diameter topologies, shortest paths fall short: a large fraction of router pairs is connected by only one shortest path.
For non-minimal paths, we first summarize the results in Table 4. We report counts of disjoint paths as fractions of the router radix to make these counts radix-invariant. For example, the mean CDP of 89% in SF means that 89% of router links host disjoint paths. In general, all deterministic topologies provide higher disjoint path diversity than their corresponding JFs, but there are specific router pairs with lower diversity that lead to undesired tail behavior. JFs have more predictable tail behavior due to the Gaussian distribution of their disjoint path counts. A closer analysis of this distribution (Figure 7) reveals details about each topology. For example, for HX, router pairs can clearly be separated into classes sharing zero, one, or two coordinate values, corresponding to the HX array structure. Another example is SF, where lower counts are related to pairs connected with an edge, while higher counts in DF are related to pairs in the same group or pairs connected with specific sequences of local and global links. The considered topologies provide three disjoint “almost”-minimal (one hop longer) paths per router pair.

Next, we sample router pairs u.a.r. and derive full path interference distributions; they all follow the Gaussian distribution. Selected results are in Figure 8 (we omit XP and XP-JF; both are nearly identical to SF-JF). As the combination space is large, most samples fall into a common case where PI is small (cf. the small fractions). We thus focus on the extreme tail of the distribution (we show both mean and tail), see Table 4. We use radix-invariant PI values (as for CDP) at a distance selected with respect to the 99.9% tail of collisions. Thus, we analyze PI in cases where the demand from a workload is larger than the “supply of path diversity” from a network (three disjoint paths per router pair). All topologies except for DF achieve negligible PI at this distance, but the diameter-2 topologies do experience PI at shorter distances. SF shows the lowest PI in general, but has a few high-interference outliers. In general, random JFs have higher average PI but less PI in the tails, while deterministic topologies tend to perform better on average with worse tails.

Topology and parameters  Default topology variant  Equivalent Jellyfish  
CDP mean  CDP 1% tail  PI mean  PI 99.9% tail  CDP mean  CDP 1% tail  PI mean  PI 99.9% tail  
clique  1  2  100  101  10100  100%  100%  2%  2%  –  –  –  – 
SF  2  3  29  722  10108  89%  10%  26%  79%  56%  38%  23%  45% 
XP  3  3  32  1056  16896  49%  34%  20%  41%  51%  34%  21%  41% 
HX  3  3  30  1331  13310  25%  10%  9%  67%  50%  23%  17%  37% 
DF  3  4  23  2064  16512  25%  13%  8%  74%  87%  78%  13%  26% 
FT3  4  4  18  1620  11664  100%  100%  0  0  96%  90%  5%  14% 
4.4. Final Takeaway on Path Diversity
We show a fundamental tradeoff between path length and diversity. High-diameter topologies, such as FT, provide high path diversity, even on minimal paths. Yet, due to longer paths, more links are needed for an equivalent endpoint count and performance. Low-diameter topologies fall short in their diversity of shortest paths, but do provide enough path diversity on non-minimal paths, requiring non-minimal routing. This may reduce the cost advantage of low-diameter networks with adversarial workloads, since many non-minimal paths need to be used, consuming additional links. Workload randomization in FatPaths suffices to avoid this effect. We conclude that low-diameter topologies host enough path diversity to alleviate flow conflicts. We now show how to effectively use this diversity in FatPaths.
5. FatPaths: Design and Implementation
FatPaths is a high-performance, simple, and robust routing architecture that uses the rich path diversity in low-diameter topologies (analyzed in § 4) to enhance Ethernet stacks in clusters, data centers, and supercomputers. FatPaths aims to accelerate both cloud computing and HPC workloads. We now summarize the key design ideas behind FatPaths. First, we develop a layered routing scheme that (1) is capable of encoding the rich diversity of both minimal and non-minimal paths, and (2) can be implemented with commodity Ethernet hardware. Second, we combine layered routing with flowlet switching and a "purified" transport layer based on very recent Clos designs (handley2017re, ). The former enables very simple but powerful load balancing; the latter ensures low-latency and high-throughput transport. The design of both the load balancing and the transport layer is straightforward and presented in § 3 and Figure 3. Here, we focus on layered routing, the key element of FatPaths that enables using the rich "fat" path diversity analyzed in § 4.
5.1. Routing Model
We assume simple destination-based routing, compatible with any relevant technology, including source-based systems like NDP. To compute the output port in a router v for a packet addressed to a router t, and simultaneously the ID of the next-hop router, a routing function r(v, t) is evaluated. By iteratively applying r with fixed t, we eventually reach t and finish. The forwarding function must be defined such that the path from any v to any t is loop-free.
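The iterative evaluation of a destination-based routing function can be sketched as follows; this is a minimal illustration (not the paper's implementation), writing the routing function as a hypothetical lookup `routing[(v, t)]` that returns the next-hop router:

```python
# Destination-based forwarding sketch: repeatedly apply the routing
# function r(v, t) -> next hop until the packet reaches router t.
# `routing` maps (current_router, destination_router) to the next hop.

def forward(routing, src, dst, max_hops=16):
    """Walk the path induced by the routing function; raise on loops."""
    path = [src]
    v = src
    while v != dst:
        v = routing[(v, dst)]
        path.append(v)
        if len(path) > max_hops:  # a correct routing function is loop-free
            raise RuntimeError("forwarding loop detected")
    return path

# Toy 4-router ring that always forwards clockwise.
ring = {(v, t): (v + 1) % 4 for v in range(4) for t in range(4) if v != t}
print(forward(ring, 0, 2))  # -> [0, 1, 2]
```

In a switch, the same lookup happens independently at each hop; the loop above simply replays what the sequence of routers would do.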
5.2. Layered Routing
We use n layers associated with n routing functions. Each router uses the i-th routing function, denoted r_i, for a packet with layer tag i attached; the layer tags are chosen on the endpoint by the adaptivity algorithm. All layers but one accommodate only a fraction of the links, maintaining non-minimal paths. One layer (associated with r_1) uses all links, maintaining minimal paths. A single layer constitutes a Directed Acyclic Graph (DAG). The fraction of links in one layer is controlled by a parameter ρ. Now, the interplay between n and ρ is important. More layers (higher n) that are sparse (lower ρ) give more paths that are long, providing more path diversity but also more wasted bandwidth (as paths are long). More layers that are dense reduce wasted bandwidth but also give fewer disjoint paths; still, this may be enough, as we need three paths per router pair. One ideally needs more dense layers or fewer sparse layers. Thus, an important part of deploying FatPaths is selecting the best n and ρ for a given network (n = 1 can be used if there is high minimal-path diversity in the topology). To facilitate implementation of FatPaths, we provide configurations of layers (n, ρ) that ensure high-performance routing for each used topology. Files with full specifications are in a dedicated repository (see the link on page 1), while the performance analysis of different n and ρ is in § 6 and § 7.
5.3. Construction of Layers
We develop two schemes for constructing layers in FatPaths; we also adapt selected existing protocols.
5.3.1. Random Permutations
An overview of layer construction is in Listing LABEL:lst:layers. We start with one layer containing all links, for maintaining shortest paths. We use random permutations of vertices to generate random layers. Each such layer is a subgraph with a fraction ρ of edges sampled u.a.r. The network may become disconnected for small ρ, but for the used values of ρ this is unlikely, and a small number of attempts delivers a connected network.
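This construction can be sketched as follows; a minimal illustration under stated assumptions (the parameter names `n_layers` and `rho` and the retry loop are ours, not taken from the paper's listing):

```python
import random

def connected(nodes, edges):
    """DFS connectivity check on an undirected edge set."""
    adj = {v: [] for v in nodes}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)

def make_layers(nodes, edges, n_layers, rho, seed=0, max_tries=100):
    """Layer 1 keeps all links (minimal paths); each remaining layer keeps
    a fraction rho of links sampled u.a.r., retried until connected."""
    rng = random.Random(seed)
    layers = [list(edges)]            # the minimal-path layer
    k = int(rho * len(edges))
    for _ in range(n_layers - 1):
        for _ in range(max_tries):
            sample = rng.sample(edges, k)
            if connected(nodes, sample):
                layers.append(sample)
                break
        else:
            raise RuntimeError("could not sample a connected layer")
    return layers

# Demo: a 6-router ring with three chords (9 links total).
ring = [(i, (i + 1) % 6) for i in range(6)] + [(0, 3), (1, 4), (2, 5)]
layers = make_layers(list(range(6)), ring, n_layers=4, rho=0.8)
print([len(layer) for layer in layers])  # layer 1 keeps all 9 links
```

The retry loop mirrors the observation above: for the ρ values actually used, a connected sample is found after very few attempts.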
5.3.2. Minimizing Path Overlap
We also use a variant in which, instead of randomized edge picking while creating paths within layers, we use a simple heuristic that minimizes path interference. For each router pair, we pick a set of paths with minimized overlap with paths already placed in other layers. Importantly, while computing paths, we prefer paths that are one hop longer than minimal ones, using the insights from the path diversity analysis (§ 4).
5.3.3. Adapting Existing Schemes
In addition to our two schemes for generating layers, we also adapt existing approaches that provide multipathing. These are SPAIN (mudigonda2010spain, ), PAST (stephens2012past, ), and shortest paths (singla2012jellyfish, ), three recent schemes that support (1) multipathing and (2) disjoint paths (as identified in Table 1).
5.4. Populating Forwarding Entries
The functions r_i are deployed as forwarding tables. To derive these tables, we compute minimal paths between every two routers s and t within layer i. Then, for each router v, we populate the entry for t in r_i with the port that corresponds to the router that is the first step on a path from v to t. We compute all such paths and choose a random first-step port if there are multiple options. For any realistic network size, constructing layers is not a computational bottleneck, given the low polynomial complexity of Dijkstra's shortest-path algorithm (dijkstra1959note, ).
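A possible sketch of this table construction (an illustration, not the paper's code; we use BFS per destination, which computes the same minimal paths as Dijkstra on unweighted links, and break ties at random as described above):

```python
import random
from collections import deque

def bfs_dist(adj, t):
    """Hop distance from every router to destination t within one layer."""
    dist = {t: 0}
    q = deque([t])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

def build_table(nodes, layer_edges, seed=0):
    """Map each (router, destination) pair to a next hop on a minimal
    path within this layer; ties are broken at random."""
    rng = random.Random(seed)
    adj = {v: [] for v in nodes}
    for a, b in layer_edges:
        adj[a].append(b)
        adj[b].append(a)
    table = {}
    for t in nodes:
        dist = bfs_dist(adj, t)
        for v in nodes:
            if v != t:
                nexts = [w for w in adj[v] if dist[w] == dist[v] - 1]
                table[(v, t)] = rng.choice(nexts)
    return table

ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
table = build_table(range(4), ring)
print(table[(0, 1)])  # -> 1 (the direct neighbor)
```

Repeating this per layer yields the n forwarding tables; sparser layers simply produce longer minimal paths within that layer.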
5.5. Implementation Details
We briefly discuss the most important implementation details.
5.5.1. Implementation of Layers
We propose two schemes to deploy layers. First, a simple way to achieve separation is partitioning the address space. This requires no hardware support, except for sufficiently long addresses. One inserts the layer tag anywhere in the address; the resulting forwarding tables are then simply concatenated. The software stack must support multiple addresses per interface (deployed in Linux since v2.6.12, 2005). Second, similarly to schemes like SPAIN (mudigonda2010spain, ) or PAST (stephens2012past, ), one can use VLANs (frantz1999vlan, ), which are a part of the L2 forwarding tuple and provide full separation. Still, the number of available VLANs is limited by hardware, and FatPaths does not require separate queues per layer.
5.5.2. Implementation of Forwarding Functions
Forwarding functions can be implemented with well-known static schemes such as simple lookup tables: either flat Ethernet exact-matching tables or hierarchical TCAM longest-prefix-matching tables. In the former, one entry maps a single input tuple to a single next hop. The latter are usually much smaller but more powerful: one entry can provide the next-hop information for many input tuples.
As not all the considered topologies are hierarchical, we cannot use all the properties of longest-match tables. Still, we observe that all endpoints on one router share the routes towards that router. We can thus use prefix-match tables to reduce the required number of entries from N (the number of endpoints) to N_r (the number of routers). This only requires exact matching on a fixed address part. As we mainly target low-diameter topologies, the space savings due to moving from N to N_r can be significant; for example, an SF with N = 10,108 endpoints has only N_r = 722 routers. Such semi-hierarchical forwarding was proposed in, for example, PortLand (niranjan2009portland, ) and shadow MACs (agarwal2014shadow, ). Since we use a simple, static forwarding function, it can also be implemented on the endpoints themselves, using source routing (jyothi2015towards, ).
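The entry-count reduction can be illustrated with a toy lookup; the address layout `(router_id, local_id)` and the port names here are hypothetical, chosen only to show the idea:

```python
# With endpoint addresses of the form (router_id, local_id), forwarding
# needs one entry per destination *router*, not per endpoint: the lookup
# masks out the endpoint part and matches only the fixed router field.

def lookup(prefix_table, addr):
    router_id, _local_id = addr       # mask out the endpoint part
    return prefix_table[router_id]    # next-hop port for that router

# 3 routers with 4 endpoints each: 3 table entries instead of 12.
prefix_table = {0: "port1", 1: "port2", 2: "port3"}
print(lookup(prefix_table, (2, 3)))  # -> port3
```

With the SF numbers above, the same idea shrinks a table of 10,108 entries to 722.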
5.5.3. Addressing
To integrate FatPaths with L2/Ethernet, one can use exact-match tables; they need only support masking out a fixed field in the address before lookup, which could be achieved with, for example, P4 (bosshart2014p4, ). Alternatively, one could use a simple L3/IP scheme. First, every endpoint has one IP address per layer, encoding a router ID, an endpoint ID within the router, and the layer ID. Second, for the inter-router links, addresses from a disjoint range are used, with one subnet per link. Finally, each router has one forwarding rule for each other router, mapping that router's address prefix to an inter-router link address chosen from the router's ports according to the forwarding function r_i.
5.5.4. FaultTolerance
Fault tolerance in FatPaths is based on pre-provisioning multiple paths within different layers. For major (infrequent) topology updates, we recompute layers (mudigonda2010spain, ). In contrast, when a failure in some layer is detected, FatPaths redirects the affected flows to a different layer. We rely on established fault-tolerance schemes (mudigonda2010spain, ; jain2011viro, ; hu2016explicit, ; vanini2017letflow, ; handley2017re, ) for the exact mechanisms of failure detection. Traffic redirection is based on flowlets (vanini2017letflow, ). Failures are treated similarly to congestion: the elasticity of flowlets automatically ensures that no data is sent over an unavailable path.
Besides flowlet elasticity, the layered FatPaths design enables other fault-tolerance schemes. Assuming L2/Ethernet forwarding and addressing, we propose to adapt a scheme from SPAIN (mudigonda2010spain, ) or PAST (stephens2012past, ), both of which use a concept of layered routing similar to that in FatPaths. We first identify the layer with a failed element and then reroute the affected flows to a new, randomly selected layer. This is done only for endpoints directly affected by the failure; thus, the affected layer continues to operate for endpoints where no failure was detected. The utilization of the affected layer is reestablished upon receipt of any packet from this layer. Moreover, FatPaths could limit each layer to be a spanning tree and use mechanisms such as Cisco's proprietary Per-VLAN Spanning Tree (PVST) or IEEE 802.1s MST to fall back to the secondary backup ports offered by these schemes.
Finally, assuming L3/IP forwarding and addressing, one could rely on resilience schemes such as VIRO’s (jain2011viro, ).
6. Theoretical Analysis
We now conduct a theoretical analysis.
6.1. Traffic Patterns
We focus on a recently proposed worstcase traffic pattern, developed specifically to maximize stress on the interconnect while hampering effective routing (jyothi2016measuring, ). This pattern is generated individually for each topology; it uses maximum weighted matching algorithms to find a pairing of endpoints that maximizes average flow path length, using both elephant and small flows.
6.2. Considered Schemes
We use both variants of layered routing proposed in this work. We also consider SPAIN, PAST, and shortest paths, adapted to the layered setting. Originally, SPAIN uses a set of spanning trees, using greedy coloring to minimize their number and maximize path disjointness; one tree is one layer. PAST uses one spanning tree per host, aiming at distributing the trees uniformly over the available physical links. The shortest-paths scheme (singla2012jellyfish, ) spreads traffic over multiple shortest paths (if available) between endpoints.
6.3. Number of Layers
SPAIN and PAST use trees as layers, while FatPaths allows arbitrary DAGs. Trees bring drawbacks: each SPAIN layer, being a spanning tree, can use at most N_r − 1 links, while the topology contains on the order of N_r k′/2 links (for network radix k′). Thus, on the order of k′/2 layers are required to cover all minimal paths, and SPAIN may require even more. Moreover, PAST needs N trees by its design. By using layers that are arbitrary DAGs and contain a large, constant fraction of links, FatPaths provides sufficient path diversity with a low, constant number of layers.
6.4. Throughput
We also analyze the maximum achievable throughput (MAT) in various layered routing schemes. MAT is defined as the maximum value θ for which there exists a feasible multi-commodity flow (MCF) that routes a flow of θ · d(s, t) between all router pairs s and t, satisfying link capacity and flow conservation constraints. Here, d(s, t) specifies the traffic demand: the amount of requested flow from s to t (more details are provided by Jyothi et al. (jyothi2016measuring, )).
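The underlying MCF can be written out explicitly; this is a standard formulation consistent with the definition above (the notation δ⁺(v)/δ⁻(v) for outgoing/incoming links and c(e) for link capacity is ours):

```latex
\max\ \theta \quad \text{s.t.} \quad
\sum_{(s,t)} f_{st}(e) \le c(e) \quad \forall\, e \in E, \qquad
\sum_{e \in \delta^{+}(v)} f_{st}(e) \;-\; \sum_{e \in \delta^{-}(v)} f_{st}(e)
\;=\;
\begin{cases}
  \theta\, d(s,t) & v = s,\\
  -\theta\, d(s,t) & v = t,\\
  0 & \text{otherwise,}
\end{cases}
\quad \forall\, v,\ (s,t).
```

The layered extension described next replicates the capacity constraint per layer and forces each flow f_{st} to stay within a single layer.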
We test all layered routing schemes implemented in FatPaths (including SPAIN, PAST, and shortest paths) on all considered topologies, topology sizes, traffic patterns, and traffic intensities (fractions of communicating endpoint pairs). We use TopoBench, a throughput evaluation tool (jyothi2016measuring, ) that uses linear programming (LP) to derive MAT. We extended TopoBench's LP formulation of MCF so that it includes layered routing. Most importantly, instead of one network accommodating the MCF, we use n networks (representing the n layers) for allocating flows. We also introduce constraints that prevent one flow from being allocated over multiple layers.
Selected results are in Figure 9. As expected, SPAIN – a scheme developed specifically for Clos – delivers more performance on fat trees. However, it requires many more layers. The layered routing scheme that minimizes path interference generally outperforms the SPAIN variant (which we tuned to perform well on low-diameter topologies) on the other networks. Finally, also as expected, our heuristic that minimizes path overlap delivers more speedup than simple random edge picking (we only plot the former for clarity).
7. Simulations
We now analyze the performance of the FatPaths architecture, including layered routing but also adaptive load balancing, the transport protocol based on NDP (handley2017re, ), and randomized mapping. We consider the combined performance advantages, but we also investigate how each single element impacts the final performance. Specifically, we illustrate how low-diameter topologies equipped with FatPaths outperform recent high-performance fat-tree designs. We use two different simulation tools that reflect the two considered environments: HPC systems and cloud infrastructure.
7.1. Methodology, Parameters, Baselines
We first discuss used parameters, methodology, and baselines.
7.1.1. Topologies
We consider all the discussed topologies as specified in § 2: SF, XP, JF, HX, DF, and FT. We use their most advantageous variants (e.g., the "balanced" Dragonfly (kim2008technology, )) while fixing the network size (N varies by up to 10%, as each network offers only a limited number of configurations). Slim Fly represents a recent family of diameter-2 topologies such as Multi-Layer Full-Mesh (kathareios2015cost, ) and Two-Level Orthogonal Fat-Trees (valerio1994recursively, ; valiant1982scheme, ). To achieve fixed cost and network size, we use 2× oversubscribed fat trees.
7.1.2. Topology Parameters
We now extend the discussion on the selection of key topology parameters. We select the network radix k′ and the number of routers N_r so that the considered topologies use similar amounts of hardware. To analyze these amounts, we use the edge density: the ratio between the number of all cables and the number of endpoints N. It turns out to be (asymptotically) constant for all topologies (the left plot in Figure 10) and related to the diameter D. Higher-diameter networks such as DF require more cables; as explained before (besta2014slim, ), packets traverse more cables on the way to their destination. We also illustrate k′ as a function of N (the right plot in Figure 10). An interesting outlier is FT. Its radix scales with N similarly to that of diameter-3 networks, but with a much lower constant factor, at the cost of a higher edge density and thus more routers and cables. This implies that FT is most attractive for small networks using routers with constrained radix. We can also observe the unique SF properties: for a fixed (low) number of cables, the required radix is lower by a constant factor (than in, e.g., HX), resulting in better scaling.
7.1.3. Routing and Transport Schemes
We use flow-based non-adaptive ECMP as the baseline (routing performance lower bound). Low-diameter topologies use FatPaths, while fat trees use NDP with all optimizations (handley2017re, ), additionally enhanced with LetFlow (vanini2017letflow, ), a recent scheme that uses flowlet switching for load balancing in fat trees. We also compare to a fat-tree system using NDP with per-packet congestion-oblivious load balancing, as introduced by Handley et al. (handley2017re, ). For FatPaths, we vary n and ρ to account for different layer configurations, including n = 1 (minimal paths only). Finally, we consider simple TCP, MPTCP, and DCTCP with ECN (alizadeh2011data, ; ramakrishnan2001addition, ; floyd1994tcp, ). We use these schemes to illustrate that FatPaths can accelerate not only bare Ethernet systems but also cloud computing environments that usually use full TCP stacks (isobe2014tcp, ; azodolmolky2013cloud, ).
7.1.4. Flows and Messages
We vary flow sizes (and thus message sizes, as a flow is equivalent to a message) from 32 KiB to 2 MiB.
7.1.5. Metrics
We use flow completion time (FCT), which also determines the throughput per flow (flow size divided by FCT). We also consider the total time to complete a tested workload.
7.1.6. Performance Validation
Our evaluation is influenced by a plethora of parameters and effects, many of which are not necessarily related to the core paper domain. Some of them relate to the incorporated protocols (e.g., TCP), others to the used traffic patterns. Thus, we also establish baseline comparison targets and fix various parameters to ensure fair comparisons. To characterize TCP effects, one baseline is a star (ST) topology that contains a single crossbar switch and attached endpoints. It should not exhibit any behavior that depends on the topology structure, as it does not contain any inter-switch links. We use the same flow distribution and traffic pattern as in the large-scale simulation, as well as the same transport protocols. This serves as an upper bound on performance: we observe the lowest realistic latency and the maximum achievable link throughput, as well as flow-control effects that we did not explicitly model, such as TCP slow start. There is no additional congestion compared to the measured data, since we use randomized workloads. Second, as a lower bound on routing performance, we show results for flow-based ECMP as an example of a non-adaptive routing scheme, and LetFlow as an example of an adaptive routing scheme. We also include results of unmodified NDP (with oblivious load balancing) on FTs.
7.1.7. Simulation Infrastructure and Methodology
We use the OMNeT++ (varga2001omnet++, ; varga2008overview, ) parallel discrete event simulator with the INET model package (inet, ) and the htsim packet-level simulator with the NDP reference implementation (handley2017re, ). We use OMNeT++ to enable detailed simulations of the full networking stack based on Ethernet and TCP, together with all overheads coming from protocols such as ARP. We use htsim because its simplified structure enables simulations of networks of much larger scale. We extend both simulators with all required schemes, such as flowlets, ECMP, layered routing, and workload randomization. In LetFlow, we use precise timestamps to detect flowlets, with a low flowlet gap time to reflect the low-latency network. As INET does not model hardware or software latency, we add a fixed delay to each link. All our code is available online.
We extend the INET TCP stack with ECN (RFC 3168 (rfc3168, )), MPTCP (RFC 6824 (rfc6824, ), RFC 6356 (rfc6356, )), and DCTCP. We extend the default router model with ECMP (Fowler–Noll–Vo hash (kornblum2006identifying, )) and LetFlow. In htsim, we use similar parameters; they match those used by Handley et al. We extend htsim to support arbitrary topologies, FatPaths routing and adaptivity, and our workload model. Routers use tail-dropping with a maximum queue size of 100 packets per port. ECN marks packets once a queue holds more than 33 packets. Fast retransmissions use the default threshold of three segments. We also model latency in the software stack (corresponding to interrupt throttling) at a 100 kHz rate. For FatPaths, we use 9 KB jumbo frames, an 8-packet congestion window, and a queue length of 8 full-size packets.
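Timestamp-based flowlet detection, as used in our LetFlow extension, can be sketched as follows; this is an illustration of the mechanism only (the class name, the per-flow table, and the random path choice are our simplifications):

```python
import random

class FlowletTable:
    """Sketch of timestamp-based flowlet detection: a new flowlet starts
    whenever the inter-packet gap within a flow exceeds `gap`; each
    flowlet is pinned to a single path chosen at random."""

    def __init__(self, gap, n_paths, seed=0):
        self.gap = gap
        self.n_paths = n_paths
        self.rng = random.Random(seed)
        self.last = {}  # flow id -> (last packet timestamp, current path)

    def route(self, flow, now):
        entry = self.last.get(flow)
        if entry is None or now - entry[0] > self.gap:
            path = self.rng.randrange(self.n_paths)  # new flowlet
        else:
            path = entry[1]                          # stick to current path
        self.last[flow] = (now, path)
        return path
```

A low gap time makes flowlets short, so congestion on one path quickly diverts subsequent flowlets elsewhere; this is why the gap must be tuned to the network's latency.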
7.1.8. Scale of Simulations
We fix various scalability issues in INET and OMNeT++ to allow parallel simulation of large systems, with up to one million endpoints. To the best of our knowledge, we conduct the largest shared-memory simulations (in endpoint count) so far in the networking community for the used precision and simulation setup.
7.1.9. Gathering Results
We evaluate each combination of topology and routing method. As each such simulation contains thousands of flows with randomized source, destination, size, and start time, and we only record perflow quantities, this suffices for statistical significance. We simulate a fixed number of flows starting in a fixed window of time, and drop the results from the first half of that window for warmup. We summarize the resulting distributions with arithmetic means or percentiles of distributions.
7.1.10. Traffic Patterns
We use the traffic patterns discussed in § 2, in both randomized and skewed nonrandomized variants. We also vary the fraction of communicating endpoints.
7.1.11. Shown Data
When some variants or parameters are omitted (e.g., we only show SFJF to cover Jellyfish), this indicates that the shown data is representative.
7.2. Performance Analysis: HPC Systems
First, we analyze FatPaths on networks based on Ethernet, but without traditional TCP transport. This setting represents HPC systems that use Ethernet for its low cost but avoid TCP due to its performance overheads. We use htsim, which supports such a setting.
Low-Diameter Topologies with FatPaths Beat Fat Trees. We analyze both Figure 2 from page 2 (randomized workload) and Figure 11 (skewed non-randomized workload). In each case, low-diameter topologies outperform fat trees, with up to 2× and 4× improvement in throughput for the non-randomized and randomized workload, respectively. Both fat trees and low-diameter networks use similar load balancing based on flowlet switching and purified transport. Thus, the advantage of low-diameter networks lies in their low diameter combined with the ability of FatPaths to effectively use the diversity of "almost" minimal paths. Answering one of the two main questions from § 1, we conclude that FatPaths enables low-diameter topologies to outperform state-of-the-art fat trees.
FatPaths Uses "Fat" Non-Minimal Path Diversity Well. We now focus on the performance of FatPaths with heavily skewed non-randomized workloads, see Figure 11. Non-minimal routing with FatPaths, in each low-diameter topology, leads to a substantial FCT improvement over minimal routing (i.e., "circles on topology X outperform triangles on X"). The exception is HyperX, due to its higher diversity of minimal paths (cf. Figure 6). Thus, FatPaths effectively leverages the diversity of non-minimal paths.
What Layer Setup Fares Best? We also investigate the impact of the number (n) and the sparsity (ρ) of layers in FatPaths on performance and on the resolution of collisions; see Figure 12 (layers are computed with simple random edge sampling). Nine layers (one complete and eight sparsified) suffice to produce three disjoint paths per router pair, resolving most collisions for both SF and DF (other networks follow similar patterns). To understand what resolves collisions on global channels in DF, we also consider a clique. Here, more layers are required, since higher-multiplicity path collisions appear (visible in the 99% tail). We also observe that, when more layers can be used, it is better to use a higher ρ (cf. the FCT for a fixed n and different ρ). This reduces the maximum achievable path diversity, but it also keeps more links available for alternative routes within each layer, increasing the chances of choosing disjoint paths. It also increases the number of minimal paths in use across all entries, reducing the total network load.
FatPaths Scales to Large Networks. We also simulate large-scale SF, DF, and JF (we could not simulate several similar-size networks such as FT, whose high path diversity leads to excessive memory use in the simulator). We start with SF, SF-JF, and DF; see Figure 13. A slight mean throughput decrease compared to the smaller instances is noticeable, but latency and tail FCTs remain tightly bounded. The comparatively bad tail performance of DF is due to path overlap on the global links, where the adaptivity mechanism needs to handle high multiplicities of overlapping flows. We also conduct runs with up to one million endpoints. Here, we illustrate the distribution of the FCT of flows for SF and SF-JF. Our analysis indicates that flows on SF tend to finish later than on SF-JF.
7.3. Performance Analysis: Cloud Systems
We also analyze FatPaths on networks with Ethernet and full TCP stack. This setting represents TCP data centers and clusters often used as cloud infrastructure (isobe2014tcp, ). Here, we use OMNeT++/INET.
We compare FatPaths to ECMP (traditional static load balancing) and LetFlow (recent adaptive load balancing), see Figure 14. The number of layers was limited to keep routing tables small, as the tables are precomputed for all routers and loaded into the simulation from a configuration file (this turned out to be a major performance and memory concern). Most observations follow those from § 7.2; we only summarize TCP-related insights.
LetFlow improves tail and short-flow FCTs at the cost of long-flow throughput, compared to ECMP. Both are ineffective on SF and DF, which do not provide minimal-path diversity. Non-minimal routing in FatPaths fixes this, even with only a few layers. On the other topologies, even with minimal paths only (n = 1), FatPaths adaptivity outperforms ECMP and LetFlow. A detailed look into the FCT distributions in Figure 15 shows that, with minimal routing and low minimal-path diversity, there are many flows with low performance due to path collisions and overlap, although they do not vastly affect the mean throughput. FatPaths can fully resolve this problem. Short-flow FCTs are dominated by TCP flow-control effects, which are not affected much by routing changes.
We also observe a cost in long-flow throughput due to the higher total network load with non-minimal paths. To understand this effect better, Figure 16 shows the impact of the fraction ρ of remaining edges in each layer, and therefore the amount of non-minimal paths, on the FCT of long flows. The optimum choice of ρ matches the findings from the Ethernet simulations in § 7.2 for SF and DF.
Besides FCT means/tails, we also consider the full completion time of a stencil workload that is representative of an HPC application, in which processes conduct local computation, communicate, and synchronize with a barrier; see Figure 17. The results follow the same performance patterns as the others. An interesting outcome is JF: the high completion times under LetFlow are caused by packet loss and do not affect the mean/99% tail FCT (cf. Figure 14), only the total completion runtime.
7.4. Performance Analysis: Vertical Effects
To facilitate analysis of the large amounts of included performance data, we now summarize analyses of different FatPaths design choices ("vertical" analysis). First (1), different layer configurations (n, ρ) for various topologies are investigated in Figure 12 and § 7.2 (bare Ethernet systems) as well as in Figure 16 and § 7.3 (TCP systems). Differences in FCT across layer configurations are up to 4×; increasing both n and ρ maximizes performance. Second (2), the comparison of adaptive load balancing ("LetFlow") based on flowlet switching vs. static load balancing ("ECMP") is in Figure 14 and § 7.3; adaptivity improves tail and short-flow FCTs at the cost of long-flow throughput. Third (3), the comparison of FatPaths with and without Purified Transport is omitted due to space constraints; performance with no Purified Transport is always significantly worse. (4) We also analyze performance with and without layered routing (Figure 14; "ECMP" and "LetFlow" use no layers at all); not using layers is detrimental to performance on topologies that do not provide minimal-path diversity (e.g., SF or DF). Moreover (5), we also study the impact of using only the shortest paths in FatPaths (Figure 14, the minimal-paths baseline); it is almost always disadvantageous. Finally (6), the effect of workload randomization is illustrated in Figures 2 (randomization) and 11 (no randomization); randomization increases throughput by 2×.
Figure 11 shows that fat trees with NDP outperform low-diameter networks that do not use non-minimal paths (the "NDP" baseline). FatPaths, by accommodating non-minimal paths, enables low-diameter topologies to outperform fat trees, by up to 2× for the adversarial traffic pattern.
7.5. Final Takeaway on Performance
We are now able to answer the main question from § 1. Most importantly, a high-performance routing architecture for low-diameter networks should expose and use the diversity of almost-minimal paths (because they are numerous, as opposed to minimal paths). FatPaths is a routing architecture that enables this. Moreover, it combines random workload mapping, purified transport, flowlet load balancing, and layered routing, achieving high performance on both bare Ethernet systems and full TCP stacks. Thus, it enables speedups on HPC systems such as supercomputers or tightly coupled clusters, as well as on cloud infrastructure such as data centers.
8. Discussion
Relations Between Metrics. For deeper understanding, we intuitively connect our path diversity measures to established network performance measures and bounds (e.g., bisection bandwidth (BB) or throughput proportionality (kassing2017beyond, )) in Figure 18. The figure shows how various measures vary when increasing the network load, expressed by the count of communicating router pairs. The values of the measures are expressed as numbers of disjoint paths. In this expression, bandwidth measures are numbers of disjoint paths between two router sets; these numbers must match the corresponding counts in the original definitions of the bandwidth measures. For example, the path count associated with BB must equal the BB cut size.
Integration with TCP. For applicability in data centers and cloud services, we integrate FatPaths with simple TCP, DCTCP (alizadeh2011data, ), MPTCP (raiciu2011improving, ), and ECN (ramakrishnan2001addition, ) for congestion control. These integrations are less central to the design, so we exclude their description. Most importantly, all of them require only minor changes to the TCP stack.
Integration with RDMA. As FatPaths fully preserves the semantics of TCP, one could seamlessly use iWARP (iwarp, ) on top of FatPaths. FatPaths could also be used together with RoCE (infiniband2014rocev2, ). RoCE has traditionally relied on Ethernet with Priority Flow Control (pfc_paper, ) (PFC) for lossless data transfer. However, numerous works illustrate that PFC introduces inherent issues such as head-of-line blocking (mittal2018revisiting, ; le2018rogue, ; zhu2015congestion, ; guo2016rdma, ). The design of FatPaths reduces the count of dropped packets to almost zero thanks to flowlet load balancing. With its packet-oriented design and a thin protocol layer over simple Ethernet, FatPaths could become the basis for RoCE. Moreover, many modern RDMA schemes (e.g., the work by Lu et al. (lu2018multi, )) are similar to NDP in that they, e.g., also use packet spraying. Thus, many of our results may be representative for such RDMA environments. For example, using RDMA on top of FatPaths could provide similar advantages on low-diameter topologies as presented in Figures 2 and 11. We leave this for future work.
Enhancing Infiniband Although we focus on Ethernet, most of the schemes in FatPaths assume nothing Ethernet-specific and could be straightforwardly used to enhance the IB routing architecture. For example, all the insights from the path diversity analysis, layered routing for multipathing, and flowlet load balancing could also be applied to IB. We leave these directions for future work.
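As a rough sketch of the layered-routing idea mentioned here (our own simplification; function names and the layer-construction policy are illustrative, not the scheme described earlier in the paper): layer 0 keeps the full topology and thus routes on minimal paths, while each additional layer drops a random fraction of links, so that its shortest paths are often non-minimal in the full topology and expose disjoint alternatives.

```python
import random
from collections import deque

def next_hops(adj, dst):
    """Per-destination next-hop table via BFS shortest paths in one layer."""
    dist, nxt, q = {dst: 0}, {}, deque([dst])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                nxt[v] = u           # from v, forward to u to reach dst
                q.append(v)
    return nxt

def make_layers(adj, k, keep=0.7, seed=0):
    """Layer 0 = full topology (minimal paths). Layers 1..k-1 drop a
    random fraction of links (a drop is undone if it would disconnect
    the layer), so their shortest paths are often non-minimal."""
    rng, layers = random.Random(seed), [adj]
    for _ in range(k - 1):
        layer = {u: list(vs) for u, vs in adj.items()}
        edges = [(u, v) for u in adj for v in adj[u] if u < v]
        rng.shuffle(edges)
        for u, v in edges[: int((1 - keep) * len(edges))]:
            layer[u].remove(v); layer[v].remove(u)
            if len(next_hops(layer, u)) < len(adj) - 1:   # disconnected?
                layer[u].append(v); layer[v].append(u)    # undo the drop
        layers.append(layer)
    return layers

# 8 routers: a ring plus "diameter" chords; 3 routing layers.
ring = {i: sorted({(i - 1) % 8, (i + 1) % 8, (i + 4) % 8}) for i in range(8)}
layers = make_layers(ring, k=3)
# A flow hashed to layer i is forwarded using next_hops(layers[i], dst).
```

A flow (or flowlet) is then mapped to one layer and follows that layer's next-hop tables, which is how multipathing over both minimal and non-minimal paths can be exposed to commodity forwarding hardware.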
FatPaths Limitations To facilitate applicability of our work in real-world installations, we discuss FatPaths' limitations. First, as FatPaths targets low-diameter topologies, it is less advantageous on older high-diameter interconnects such as torus, mostly because such networks already provide multiple (almost disjoint) shortest paths between most router pairs. Second, FatPaths inherits some of NDP's limitations, namely interrupt throttling. However, similarly to NDP, we alleviate this by assuming that a single CPU core is dedicated to polling for incoming packets. Finally, even though FatPaths delivers decent performance for non-randomized workloads (as illustrated in § 7.2 and in Figure 11), it ensures much higher performance with workload randomization. Yet, as discussed in § 3, this is (1) a standard technique in HPC systems and (2) not detrimental to application performance on low-diameter networks, which, by design, have very low latencies between all router pairs.
9. Related Work
FatPaths touches on various areas. We now briefly discuss related works, excluding the ones covered in previous sections.
Topologies FatPaths' high-performance adaptive routing targets low-diameter networks: Slim Fly (besta2014slim, ), Jellyfish (singla2012jellyfish, ), Xpander (valadarsky2015, ), HyperX (ahn2009hyperx, ), and Dragonfly (kim2008technology, ). FatPaths enables these topologies to achieve low latency and high throughput under various traffic patterns (uniform, skewed), and to outperform similar-cost fat trees.
Routing We survey routing schemes in detail in Table 1 and in § 6. FatPaths is the first scheme to offer generic and adaptive multipathing using both shortest and non-shortest disjoint paths.
Load Balancing Adaptive load balancing can be implemented using flows (curtis2011mahout, ; rasley2014planck, ; sen2013localflow, ; tso2013longer, ; benson2011microte, ; zhou2014wcmp, ; al2010hedera, ; kabbani2014flowbender, ; hopps2000analysis, ), flowcells (fixed-size packet series) (he2015presto, ), and packets (zats2012detail, ; handley2017re, ; dixit2013impact, ; cao2013per, ; perry2015fastpass, ; raiciu2011improving, ; ghorbani2017drill, ). We choose an intermediate granularity, flowlets (variable-size packet series) (katta2016clove, ; alizadeh2014conga, ; vanini2017letflow, ; katta2016hula, ; kandula2007dynamic, ). FatPaths is the first architecture to use flowlet-based load balancing on low-diameter networks.
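A minimal sketch of the flowlet mechanism (our own illustration; the class name, timeout value, and random path-selection policy are hypothetical): packets of a flow keep their current path as long as inter-packet gaps stay below a flowlet timeout; once a gap exceeds it, in-flight packets have drained, so the next burst can safely take a different path without causing reordering.

```python
import random

class FlowletBalancer:
    """Assign each packet of a flow to a path; re-pick the path only
    at flowlet boundaries (inter-packet gap > timeout)."""
    def __init__(self, n_paths, timeout, seed=0):
        self.n_paths = n_paths
        self.timeout = timeout           # e.g. a few network RTTs
        self.rng = random.Random(seed)
        self.state = {}                  # flow id -> (path, last packet time)

    def route(self, flow, now):
        path, last = self.state.get(flow, (None, None))
        if path is None or now - last > self.timeout:
            path = self.rng.randrange(self.n_paths)   # new flowlet
        self.state[flow] = (path, now)
        return path

b = FlowletBalancer(n_paths=8, timeout=0.5)
first = b.route("flow-a", now=0.00)
assert b.route("flow-a", now=0.10) == first   # same flowlet, same path
assert b.route("flow-a", now=0.20) == first
b.route("flow-a", now=5.00)                   # gap > timeout: may switch path
```

Compared to per-packet spraying, this keeps bursts in order; compared to per-flow hashing, long flows still spread over many paths at burst boundaries.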
Congestion and Flow Control We do not compete with congestion or flow control schemes but instead use them to gain more performance. FatPaths can use any such scheme in its design (mittal2015timely, ; cardwell2016bbr, ; alizadeh2011data, ; he2016acdc, ; zhuo2016rackcc, ; handley2017re, ; raiciu2011improving, ; bai2014pias, ; alizadeh2013pfabric, ; vamanan2012deadline, ; lu2016sed, ; hwang2014deadline, ; montazeri2018homa, ; jiang2008explicit, ; banavalikar2016credit, ; alasmar2018polyraptor, ).
Multipathing Many works on multipathing exist (zhou2014wcmp, ; benet2018mp, ; cao2013per, ; greenberg2009vl2, ; sen2013localflow, ; caesar2010dynamic, ; kassing2017beyond, ; mudigonda2010spain, ; suchara2011network, ; perry2015fastpass, ; aggarwal2016performance, ; suurballe1984quick, ; huang2009performance, ; sohn2006congestion, ; li2013openflow, ; bredel2014flow, ; van2011revisiting, ). Our work differs from all of them: it focuses on path diversity in low-diameter topologies, it considers both minimal and non-minimal paths, and it proposes a routing scheme that uses the explored path diversity.
Network Analyses Some works analyze various properties of low-diameter topologies, for example path length, throughput, and bandwidth (valadarsky2015, ; kathareios2015cost, ; jyothi2016measuring, ; singla2012jellyfish, ; kassing2017beyond, ; besta2018slim, ; li2018exascale, ; kawano2018k, ; harsh2018expander, ; kawano2016loren, ; truong2016layout, ; flajslik2018megafly, ; kawano2017layout, ; azizi2016hhs, ; truong2016distributed, ; al2017new, ). FatPaths offers the most extensive analysis of path diversity so far.
Encoding Path Diversity Some schemes complement FatPaths. For example, XPath (hu2016explicit, ) and source routing (jyothi2015towards, ) could be used together with FatPaths, providing effective means to encode the rich path diversity that FatPaths exposes.
10. Conclusion
We introduce FatPaths: a simple, high-performance, and robust routing architecture. FatPaths enables modern low-diameter topologies to achieve unprecedented performance on Ethernet networks by exposing the rich (“fat”) diversity of minimal and non-minimal paths. We formalize and extensively analyze this path diversity and show that, even though the considered topologies offer few shortest paths, they provide enough non-minimal disjoint paths to avoid congestion. Our path diversity metrics and methodology can be used to analyze other properties of networks.
FatPaths routing stands on three core elements: purified transport, flowlet load balancing, and layered routing. Our theoretical analysis and simulations illustrate that all these elements contribute to the low-latency and high-bandwidth FatPaths design, which outperforms very recent fat tree architectures. Even though we focus on Ethernet in this work, most of these schemes (for example, adaptive flowlet load balancing and layered routing) are generic and could enhance technologies such as RDMA and Infiniband.
Simulations with up to one million endpoints show that low-diameter topologies equipped with FatPaths outperform state-of-the-art fat trees (handley2017re, ). Our code is available online and can be used to foster novel research on next-generation large-scale compute centers.
FatPaths uses Ethernet for maximum versatility. We argue that it can accelerate HPC clusters and supercomputers as well as data centers and other types of cloud infrastructure. FatPaths will help bring the areas of HPC networks and cloud computing closer together, fostering technology transfer and facilitating the exchange of ideas.
References
 (1) 802.1Qbb: Priority-based flow control. http://www.ieee802.org/1/pages/802.1bb.html. Retrieved 2015-02-06.
 (2) Priority flow control: Build reliable layer 2 infrastructure. http://www.cisco.com/c/en/us/products/collateral/switches/nexus7000seriesswitches/white_paper_c11542809.pdf.
 (3) Agarwal, K., Dixon, C., Rozner, E., and Carter, J. B. Shadow MACs: scalable label-switching for commodity ethernet. In Proceedings of the third workshop on Hot topics in software defined networking, HotSDN ’14, Chicago, Illinois, USA, August 22, 2014 (2014), pp. 157–162.
 (4) Aggarwal, S., and Mittal, P. Performance evaluation of single path and multipath regarding bandwidth and delay. International Journal of Computer Applications 145, 9 (2016).
 (5) Ahn, J. H., Binkert, N., Davis, A., McLaren, M., and Schreiber, R. S. HyperX: topology, routing, and packaging of efficient large-scale networks. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (2009), ACM, p. 41.
 (6) Al Faisal, F., Rahman, M. H., and Inoguchi, Y. A new power efficient high performance interconnection network for manycore processors. Journal of Parallel and Distributed Computing 101 (2017), 92–102.
 (7) Al-Fares, M., Loukissas, A., and Vahdat, A. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Seattle, WA, USA, August 17–22, 2008 (2008), pp. 63–74.
 (8) Al-Fares, M., Radhakrishnan, S., Raghavan, B., Huang, N., and Vahdat, A. Hedera: Dynamic flow scheduling for data center networks. In NSDI (2010), vol. 10, pp. 19–19.
 (9) Alasmar, M., Parisis, G., and Crowcroft, J. Polyraptor: embracing path and data redundancy in data centres for efficient data transport. In Proceedings of the ACM SIGCOMM 2018 Conference on Posters and Demos (2018), ACM, pp. 69–71.
 (10) Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Matus, F., Pan, R., Yadav, N., Varghese, G., et al. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of the 2014 ACM conference on SIGCOMM (2014), ACM, pp. 503–514.
 (11) Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM computer communication review 41, 4 (2011), 63–74.
 (12) Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., and Shenker, S. pFabric: Minimal near-optimal datacenter transport. ACM SIGCOMM Computer Communication Review 43, 4 (2013), 435–446.
 (13) Allan, D., Ashwood-Smith, P., Bragg, N., Farkas, J., Fedyk, D., Ouellete, M., Seaman, M., and Unbehagen, P. Shortest path bridging: Efficient control of larger ethernet networks. IEEE Communications Magazine 48, 10 (2010).
 (14) Arimilli, L. B., Arimilli, R., Chung, V., Clark, S., Denzel, W. E., Drerup, B. C., Hoefler, T., Joyner, J. B., Lewis, J., Li, J., Ni, N., and Rajamony, R. The PERCS high-performance interconnect. In IEEE 18th Annual Symposium on High Performance Interconnects, HOTI 2010, Google Campus, Mountain View, California, USA, August 18–20, 2010 (2010), pp. 75–82.
 (15) InfiniBand Trade Association, et al. RoCEv2, 2014.
 (16) Azizi, S., Hashemi, N., and Khonsari, A. HHS: an efficient network topology for large-scale data centers. The Journal of Supercomputing 72, 3 (2016), 874–899.
 (17) Azodolmolky, S., Wieder, P., and Yahyapour, R. Cloud computing networking: Challenges and opportunities for innovations. IEEE Communications Magazine 51, 7 (2013), 54–62.
 (18) Bai, W., Chen, L., Chen, K., Han, D., Tian, C., and Sun, W. PIAS: practical information-agnostic flow scheduling for data center networks. In Proceedings of the 13th ACM Workshop on Hot Topics in Networks, HotNets-XIII, Los Angeles, CA, USA, October 27–28, 2014 (2014), pp. 25:1–25:7.
 (19) Banavalikar, B. G., DeCusatis, C. M., Gusat, M., Kamble, K. G., and Recio, R. J. Credit-based flow control in lossless ethernet networks, Jan. 12 2016. US Patent 9,237,111.
 (20) Benet, C. H., Kassler, A. J., Benson, T., and Pongracz, G. MP-HULA: Multipath transport aware load balancing using programmable data planes. In Proceedings of the 2018 Morning Workshop on In-Network Computing (2018), ACM, pp. 7–13.
 (21) Benson, T., Anand, A., Akella, A., and Zhang, M. MicroTE: Fine grained traffic engineering for data centers. In Proceedings of the Seventh Conference on emerging Networking EXperiments and Technologies (2011), ACM, p. 8.
 (22) Besta, M., Hassan, S. M., Yalamanchili, S., Ausavarungnirun, R., Mutlu, O., and Hoefler, T. Slim NoC: A low-diameter on-chip network topology for high energy efficiency and scalability. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (2018), ACM, pp. 43–55.
 (23) Besta, M., and Hoefler, T. Slim Fly: A Cost-Effective Low-Diameter Network Topology. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC14), 2014.
 (24) Bondy, J. A., and Murty, U. S. R. Graph theory with applications, vol. 290. Macmillan London, 1976.
 (25) Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rexford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G., and Walker, D. P4: programming protocolindependent packet processors. Computer Communication Review 44, 3 (2014), 87–95.
 (26) Bredel, M., Bozakov, Z., Barczyk, A., and Newman, H. Flow-based load balancing in multipathed layer-2 networks using OpenFlow and multipath-TCP. In Proceedings of the third workshop on Hot topics in software defined networking (2014), ACM, pp. 213–214.
 (27) Caesar, M., Casado, M., Koponen, T., Rexford, J., and Shenker, S. Dynamic route recomputation considered harmful. ACM SIGCOMM Computer Communication Review 40, 2 (2010), 66–71.
 (28) Cao, J., Xia, R., Yang, P., Guo, C., Lu, G., Yuan, L., Zheng, Y., Wu, H., Xiong, Y., and Maltz, D. Per-packet load-balanced, low-latency routing for Clos-based data center networks. In Proceedings of the ninth ACM conference on Emerging networking experiments and technologies (2013), ACM, pp. 49–60.
 (29) Cardwell, N., Cheng, Y., Gunn, C. S., Yeganeh, S. H., and Jacobson, V. BBR: congestion-based congestion control. ACM Queue 14, 5 (2016), 20–53.
 (30) Chen, D., Heidelberger, P., Stunkel, C., Sugawara, Y., Minkenberg, C., Prisacari, B., and Rodriguez, G. An evaluation of network architectures for next generation supercomputers. In 2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (2016), IEEE, pp. 11–21.
 (31) Cherkassky, B. V., and Goldberg, A. V. On implementing the push-relabel method for the maximum flow problem. Algorithmica 19, 4 (1997), 390–410.
 (32) Cheung, H. Y., Lau, L. C., and Leung, K. M. Graph connectivities, network coding, and expander graphs. SIAM J. Comput. 42, 3 (2013), 733–751.
 (33) Clos, C. A study of nonblocking switching networks. Bell Labs Technical Journal 32, 2 (1953), 406–424.
 (34) Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to algorithms. MIT press, 2009.
 (35) Curtis, A. R., Kim, W., and Yalagandula, P. Mahout: Lowoverhead datacenter traffic management using endhostbased elephant detection. In INFOCOM, 2011 Proceedings IEEE (2011), IEEE, pp. 1629–1637.
 (36) De Sousa, A. F. Improving load balance and resilience of ethernet carrier networks with IEEE 802.1s multiple spanning tree protocol. In Networking, International Conference on Systems and International Conference on Mobile Communications and Learning Technologies, 2006. ICN/ICONS/MCL 2006. International Conference on (2006), IEEE, pp. 95–95.
 (37) Dijkstra, E. W. A note on two problems in connexion with graphs. Numerische mathematik 1, 1 (1959), 269–271.
 (38) Dixit, A., Prakash, P., Hu, Y. C., and Kompella, R. R. On the impact of packet spraying in data center networks. In INFOCOM, 2013 Proceedings IEEE (2013), IEEE, pp. 2130–2138.
 (39) Dongarra, J. J., Meuer, H. W., Strohmaier, E., et al. Top500 supercomputer sites. Supercomputer 13 (1997), 89–111.
 (40) Faanes, G., Bataineh, A., Roweth, D., Froese, E., Alverson, B., Johnson, T., Kopnick, J., Higgins, M., Reinhard, J., et al. Cray cascade: a scalable HPC system based on a Dragonfly network. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (2012), IEEE Computer Society Press, p. 103.
 (41) Flajslik, M., Borch, E., and Parker, M. A. Megafly: A topology for exascale systems. In International Conference on High Performance Computing (2018), Springer, pp. 289–310.
 (42) Floyd, R. W. Algorithm 97: shortest path. Communications of the ACM 5, 6 (1962), 345.
 (43) Floyd, S. TCP and explicit congestion notification. ACM SIGCOMM Computer Communication Review 24, 5 (1994), 8–23.
 (44) Ford, A., Raiciu, C., Handley, M., and Bonaventure, O. RFC 6824, TCP extensions for multipath operation with multiple addresses.
 (45) Ford, L. R., and Fulkerson, D. R. Maximal flow through a network. Canadian journal of Mathematics 8, 3 (1956), 399–404.
 (46) Frantz, P. J., and Thompson, G. O. Vlan frame format, Sept. 28 1999. US Patent 5,959,990.
 (47) García, R., Duato, J., and Silla, F. Lsom: A link state protocol over mac addresses for metropolitan backbones using optical ethernet switches. In Network Computing and Applications, 2003. NCA 2003. Second IEEE International Symposium on (2003), IEEE, pp. 315–321.
 (48) Gerstenberger, R., Besta, M., and Hoefler, T. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. In Proc. of ACM/IEEE Supercomputing (2013), SC ’13, pp. 53:1–53:12.
 (49) Ghorbani, S., Yang, Z., Godfrey, P., Ganjali, Y., and Firoozshahian, A. Drill: Micro load balancing for lowlatency data center networks. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication (2017), ACM, pp. 225–238.
 (51) Gojmerac, I., Ziegler, T., and Reichl, P. Adaptive multipath routing based on local distribution of link load information. In International Workshop on Quality of Future Internet Services (2003), Springer, pp. 122–131.
 (52) Grant, R., Rashti, M., Afsahi, A., and Balaji, P. RDMA Capable iWARP over Datagrams. In Par. Dist. Proc. Symp. (IPDPS), 2011 IEEE Intl. (2011), pp. 628–639.
 (53) Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: a scalable and flexible data center network. ACM SIGCOMM computer communication review 39, 4 (2009), 51–62.
 (54) Greenberg, A., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. Towards a next generation data center architecture: scalability and commoditization. In Proceedings of the ACM workshop on Programmable routers for extensible services of tomorrow (2008), ACM, pp. 57–62.
 (55) Guo, C., Lu, G., Li, D., Wu, H., Zhang, X., Shi, Y., Tian, C., Zhang, Y., and Lu, S. BCube: a high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Computer Communication Review 39, 4 (2009), 63–74.
 (56) Guo, C., Wu, H., Deng, Z., Soni, G., Ye, J., Padhye, J., and Lipshteyn, M. RDMA over commodity ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference (2016), ACM, pp. 202–215.
 (57) Guo, C., Wu, H., Tan, K., Shi, L., Zhang, Y., and Lu, S. DCell: a scalable and fault-tolerant network structure for data centers. In ACM SIGCOMM Computer Communication Review (2008), vol. 38, ACM, pp. 75–86.
 (58) Gusfield, D. Very simple methods for all pairs network flow analysis. SIAM J. Comput. 19, 1 (1990), 143–155.
 (59) Handley, M., Raiciu, C., Agache, A., Voinescu, A., Moore, A. W., Antichi, G., and Wojcik, M. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM 2017, Los Angeles, CA, USA, August 21–25, 2017 (2017), pp. 29–42.
 (60) Harsh, V., Jyothi, S. A., Singh, I., and Godfrey, P. Expander datacenters: From theory to practice. arXiv preprint arXiv:1811.00212 (2018).
 (61) He, K., Rozner, E., Agarwal, K., Felter, W., Carter, J. B., and Akella, A. Presto: Edge-based load balancing for fast datacenter networks. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM 2015, London, United Kingdom, August 17–21, 2015 (2015), pp. 465–478.
 (62) He, K., Rozner, E., Agarwal, K., Gu, Y. J., Felter, W., Carter, J. B., and Akella, A. AC/DC TCP: virtual congestion control enforcement for datacenter networks. In Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22–26, 2016 (2016), pp. 244–257.
 (63) Hopps, C. RFC 2992: Analysis of an Equal-Cost Multi-Path Algorithm, 2000.
 (64) Hu, S., Chen, K., Wu, H., Bai, W., Lan, C., Wang, H., Zhao, H., and Guo, C. Explicit path control in commodity data centers: Design and applications. IEEE/ACM Transactions on Networking 24, 5 (2016), 2768–2781.
 (65) Huang, X., and Fang, Y. Performance study of nodedisjoint multipath routing in vehicular ad hoc networks. IEEE Transactions on Vehicular Technology 58, 4 (2009), 1942–1950.
 (66) Hwang, J., Yoo, J., and Choi, N. Deadline and incast aware tcp for cloud data center networks. Computer Networks 68 (2014), 20–34.
 (67) Isobe, T., Tanida, N., Oishi, Y., and Yoshida, K. TCP acceleration technology for cloud computing: Algorithm, performance evaluation in real network. In 2014 International Conference on Advanced Technologies for Communications (ATC 2014) (2014), IEEE, pp. 714–719.
 (68) Iwata, A., Hidaka, Y., Umayabashi, M., Enomoto, N., and Arutaki, A. Global open ethernet (GOE) system and its performance evaluation. IEEE Journal on Selected Areas in Communications 22, 8 (2004), 1432–1442.
 (69) Jain, S., Chen, Y., and Zhang, Z.-L. VIRO: A scalable, robust and namespace independent virtual id routing for future networks. In INFOCOM, 2011 Proceedings IEEE (2011), Citeseer, pp. 2381–2389.
 (70) Jiang, J., Jain, R., and SoIn, C. An explicit rate control framework for lossless ethernet operation. In Communications, 2008. ICC’08. IEEE International Conference on (2008), IEEE, pp. 5914–5918.
 (71) Jyothi, S. A., Dong, M., and Godfrey, P. Towards a flexible data center fabric with source routing. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (2015), ACM, p. 10.
 (72) Jyothi, S. A., Singla, A., Godfrey, P. B., and Kolla, A. Measuring and understanding throughput of network topologies. In High Performance Computing, Networking, Storage and Analysis, SC16: International Conference for (2016), IEEE, pp. 761–772.
 (73) Kabbani, A., Vamanan, B., Hasan, J., and Duchene, F. FlowBender: Flow-level Adaptive Routing for Improved Latency and Throughput in Datacenter Networks. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (2014), ACM, pp. 149–160.
 (74) Kandula, S., Katabi, D., Sinha, S., and Berger, A. Dynamic load balancing without packet reordering. ACM SIGCOMM Computer Communication Review 37, 2 (2007), 51–62.
 (75) Karacali, B., Tracey, J. M., Crumley, P. G., and Basso, C. Assessing cloud network performance. In 2018 IEEE International Conference on Communications (ICC) (2018), IEEE, pp. 1–7.
 (76) Kassing, S., Valadarsky, A., Shahaf, G., Schapira, M., and Singla, A. Beyond fat-trees without antennae, mirrors, and disco-balls. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM 2017, Los Angeles, CA, USA, August 21–25, 2017 (2017), pp. 281–294.
 (77) Kathareios, G., Minkenberg, C., Prisacari, B., Rodriguez, G., and Hoefler, T. Cost-effective diameter-two topologies: Analysis and evaluation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2015), ACM, p. 36.
 (78) Katta, N., Hira, M., Kim, C., Sivaraman, A., and Rexford, J. Hula: Scalable load balancing using programmable data planes. In Proceedings of the Symposium on SDN Research (2016), ACM, p. 10.
 (79) Katta, N. P., Hira, M., Ghag, A., Kim, C., Keslassy, I., and Rexford, J. CLOVE: how I learned to stop worrying about the core and love the edge. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets 2016, Atlanta, GA, USA, November 9–10, 2016 (2016), pp. 155–161.
 (80) Kawano, R., Nakahara, H., Fujiwara, I., Matsutani, H., Koibuchi, M., and Amano, H. Loren: A scalable routing method for layoutconscious random topologies. In 2016 Fourth International Symposium on Computing and Networking (CANDAR) (2016), IEEE, pp. 9–18.
 (81) Kawano, R., Nakahara, H., Fujiwara, I., Matsutani, H., Koibuchi, M., and Amano, H. A layoutoriented routing method for lowlatency hpc networks. IEICE TRANSACTIONS on Information and Systems 100, 12 (2017), 2796–2807.
 (82) Kawano, R., Yasudo, R., Matsutani, H., and Amano, H. k-optimized path routing for high-throughput data center networks. In 2018 Sixth International Symposium on Computing and Networking (CANDAR) (2018), IEEE, pp. 99–105.
 (83) Kim, C., Caesar, M., and Rexford, J. Floodless in seattle: a scalable ethernet architecture for large enterprises. In Proceedings of the ACM SIGCOMM 2008 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Seattle, WA, USA, August 17–22, 2008 (2008), pp. 3–14.
 (84) Kim, J., Dally, W. J., and Abts, D. Flattened butterfly: a cost-efficient topology for high-radix networks. In ACM SIGARCH Computer Architecture News (2007), vol. 35, ACM, pp. 126–137.
 (85) Kim, J., Dally, W. J., Scott, S., and Abts, D. Technology-driven, highly-scalable dragonfly topology. In 35th International Symposium on Computer Architecture (ISCA 2008), June 21–25, 2008, Beijing, China (2008), pp. 77–88.
 (86) Kornblum, J. Identifying almost identical files using context triggered piecewise hashing. Digital investigation 3 (2006), 91–97.
 (87) Le, Y., Stephens, B., Singhvi, A., Akella, A., and Swift, M. M. RoGUE: RDMA over generic unconverged ethernet. In Proceedings of the ACM Symposium on Cloud Computing (2018), ACM, pp. 225–236.
 (88) Lebiednik, B., Mangal, A., and Tiwari, N. A survey and evaluation of data center network topologies. arXiv preprint arXiv:1605.01701 (2016).
 (89) Leiserson, C. E., Abuhamdeh, Z. S., Douglas, D. C., Feynman, C. R., Ganmukhi, M. N., Hill, J. V., Hillis, W. D., Kuszmaul, B. C., Pierre, M. A. S., Wells, D. S., Wong-Chan, M. C., Yang, S., and Zak, R. The network architecture of the connection machine CM-5. J. Parallel Distrib. Comput. 33, 2 (1996), 145–158.
 (90) Li, S., Huang, P.-C., and Jacob, B. Exascale interconnect topology characterization and parameter exploration. In 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (2018), IEEE, pp. 810–819.
 (91) Li, Y., and Pan, D. Openflow based load balancing for fat-tree networks with multipath support. In Proc. 12th IEEE International Conference on Communications (ICC’13), Budapest, Hungary (2013), pp. 1–5.
 (92) Lu, Y. SED: An SDN-based explicit-deadline-aware TCP for cloud data center networks. Tsinghua Science and Technology 21, 5 (2016), 491–499.
 (93) Lu, Y., Chen, G., Li, B., Tan, K., Xiong, Y., Cheng, P., Zhang, J., Chen, E., and Moscibroda, T. Multipath transport for RDMA in datacenters. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18) (2018), pp. 357–371.
 (94) Lui, K.-S., Lee, W. C., and Nahrstedt, K. STAR: a transparent spanning tree bridge protocol with alternate routing. ACM SIGCOMM Computer Communication Review 32, 3 (2002), 33–46.
 (95) Lumsdaine, A., Gregor, D., Hendrickson, B., and Berry, J. Challenges in parallel graph processing. Parallel Processing Letters 17, 01 (2007), 5–20.
 (96) Malkin, G. RIP version 2: Carrying additional information. Tech. rep., 1994.
 (97) McKay, B. D., Miller, M., and Širáň, J. A note on large graphs of diameter two and given maximum degree. J. Comb. Theory Ser. B 74, 1 (Sept. 1998), 110–118.
 (98) Mittal, R., Lam, V. T., Dukkipati, N., Blem, E. R., Wassel, H. M. G., Ghobadi, M., Vahdat, A., Wang, Y., Wetherall, D., and Zats, D. TIMELY: RTT-based congestion control for the datacenter. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM 2015, London, United Kingdom, August 17–21, 2015 (2015), pp. 537–550.
 (99) Mittal, R., Shpiner, A., Panda, A., Zahavi, E., Krishnamurthy, A., Ratnasamy, S., and Shenker, S. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (2018), ACM, pp. 313–326.
 (100) Montazeri, B., Li, Y., Alizadeh, M., and Ousterhout, J. Homa: A receiverdriven lowlatency transport protocol using network priorities. arXiv preprint arXiv:1803.09615 (2018).
 (101) Moy, J. OSPF version 2. Tech. rep., 1997.
 (102) Mudigonda, J., Yalagandula, P., Al-Fares, M., and Mogul, J. C. SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies. In NSDI (2010), pp. 265–280.
 (103) Narvaez, P., Siu, K.-Y., and Tzeng, H.-Y. Efficient algorithms for multipath link-state routing.
 (104) Niranjan Mysore, R., Pamboris, A., Farrington, N., Huang, N., Miri, P., Radhakrishnan, S., Subramanya, V., and Vahdat, A. PortLand: a scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM Computer Communication Review 39, 4 (2009), 39–50.
 (105) Oran, D. OSI IS-IS intra-domain routing protocol. Tech. rep., 1990.
 (106) Panigrahi, D. Gomory–hu trees. In Encyclopedia of algorithms. Springer, 2008, pp. 1–99.
 (107) Perlman, R. An algorithm for distributed computation of a spanning tree in an extended LAN. In ACM SIGCOMM Computer Communication Review (1985), vol. 15, ACM, pp. 44–53.
 (108) Perlman, R. Rbridges: transparent routing. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies (2004), vol. 2, IEEE, pp. 1211–1218.
 (109) Perry, J., Ousterhout, A., Balakrishnan, H., Shah, D., and Fugal, H. Fastpass: A centralized zeroqueue datacenter network. ACM SIGCOMM Computer Communication Review 44, 4 (2015), 307–318.
 (110) The Next Platform. The tug of war between InfiniBand and Ethernet. https://www.nextplatform.com/2017/10/30/tugwarinfinibandethernet/.
 (111) Prisacari, B., Rodriguez, G., Heidelberger, P., Chen, D., Minkenberg, C., and Hoefler, T. Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks. In Proceedings of the 23rd international symposium on Highperformance parallel and distributed computing (2014), ACM, pp. 129–140.
 (112) Prisacari, B., Rodriguez, G., Jokanovic, A., and Minkenberg, C. Randomizing task placement and route selection do not randomize traffic (enough). Design Automation for Embedded Systems 18, 34 (2014), 171–182.
 (113) Prisacari, B., Rodriguez, G., Minkenberg, C., Garcia, M., Vallejo, E., and Beivide, R. Performance optimization of load imbalanced workloads in large scale dragonfly systems. In 2015 IEEE 16th International Conference on High Performance Switching and Routing (HPSR) (2015), IEEE, pp. 1–6.
 (114) Prisacari, B., Rodriguez, G., Minkenberg, C., and Hoefler, T. Bandwidthoptimal alltoall exchanges in fat tree networks. In Proceedings of the 27th international ACM conference on International conference on supercomputing (2013), ACM, pp. 139–148.
 (115) Prisacari, B., Rodriguez, G., Minkenberg, C., and Hoefler, T. Fast patternspecific routing for fat tree networks. ACM Transactions on Architecture and Code Optimization (TACO) 10, 4 (2013), 36.
 (116) Raiciu, C., Barré, S., Pluntke, C., Greenhalgh, A., Wischik, D., and Handley, M. Improving datacenter performance and robustness with multipath TCP. In Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Toronto, ON, Canada, August 15–19, 2011 (2011), pp. 266–277.
 (117) Raiciu, C., Handley, M., and Wischik, D. RFC 6356, Coupled Congestion Control for Multipath Transport Protocols.
 (118) Ramakrishnan, K., Floyd, S., and Black, D. The addition of explicit congestion notification (ECN) to IP. Tech. rep., 2001.
 (119) Ramakrishnan, K., Floyd, S., and Black, D. RFC 3168, The addition of Explicit Congestion Notification (ECN) to IP.
 (120) Rasley, J., Stephens, B., Dixon, C., Rozner, E., Felter, W., Agarwal, K., Carter, J., and Fonseca, R. Planck: Millisecondscale monitoring and control for commodity networks. In ACM SIGCOMM Computer Communication Review (2014), vol. 44, ACM, pp. 407–418.
 (121) Rekhter, Y., Li, T., and Hares, S. A border gateway protocol 4 (bgp4). Tech. rep., 2005.
 (122) Rodeheffer, T. L., Thekkath, C. A., and Anderson, D. C. Smartbridge: A scalable bridge architecture. ACM SIGCOMM Computer Communication Review 30, 4 (2000), 205–216.
 (123) Sampath, D., Agarwal, S., and Garcia-Luna-Aceves, J. ’Ethernet on Air’: Scalable routing in very large Ethernet-based networks. In Distributed Computing Systems (ICDCS), 2010 IEEE 30th International Conference on (2010), IEEE, pp. 1–9.
 (124) Scott, M., Moore, A., and Crowcroft, J. Addressing the scalability of ethernet with MOOSE. In Proc. DC CAVES Workshop (2009).
 (125) Sehery, W., and Clancy, C. Flow optimization in data centers with clos networks in support of cloud applications. IEEE Transactions on Network and Service Management 14, 4 (2017), 847–859.
 (126) Seidel, R. On the all-pairs-shortest-path problem in unweighted undirected graphs. Journal of Computer and System Sciences 51, 3 (1995), 400–403.
 (127) Sen, S., Shue, D., Ihm, S., and Freedman, M. J. Scalable, optimal flow routing in datacenters via local link balancing. In Conference on emerging Networking Experiments and Technologies, CoNEXT '13, Santa Barbara, CA, USA, December 9–12, 2013 (2013), pp. 151–162.
 (128) Sharma, S., Gopalan, K., Nanda, S., and Chiueh, T.-c. Viking: A multi-spanning-tree Ethernet architecture for metropolitan area and cluster networks. In INFOCOM 2004. Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies (2004), vol. 4, IEEE, pp. 2283–2294.
 (129) Singla, A., Hong, C.-Y., Popa, L., and Godfrey, P. B. Jellyfish: Networking data centers randomly. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI) (2012).
 (130) Sinha, S., Kandula, S., and Katabi, D. Harnessing TCP's burstiness with flowlet switching. San Diego, November (2004).
 (131) Sohn, S., Mark, B. L., and Brassil, J. T. Congestion-triggered multipath routing based on shortest path information. In Computer Communications and Networks, 2006. ICCCN 2006. Proceedings. 15th International Conference on (2006), IEEE, pp. 191–196.
 (132) Stephens, B., Cox, A., Felter, W., Dixon, C., and Carter, J. PAST: Scalable Ethernet for data centers. In Proceedings of the 8th international conference on Emerging networking experiments and technologies (2012), ACM, pp. 49–60.
 (133) Subramanian, K. Multi-chassis link aggregation on network devices, June 24, 2014. US Patent 8,761,005.
 (134) Suchara, M., Xu, D., Doverspike, R., Johnson, D., and Rexford, J. Network architecture for joint failure recovery and traffic engineering. In Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems (2011), ACM, pp. 97–108.
 (135) Suurballe, J. W., and Tarjan, R. E. A quick method for finding shortest pairs of disjoint paths. Networks 14, 2 (1984), 325–336.
 (136) Touch, J., and Perlman, R. Transparent interconnection of lots of links (TRILL): Problem and applicability statement. Tech. rep., 2009.
 (137) Truong, N. T., Fujiwara, I., Koibuchi, M., and Nguyen, K.-V. Distributed shortcut networks: Low-latency low-degree non-random topologies targeting the diameter and cable length trade-off. IEEE Transactions on Parallel and Distributed Systems 28, 4 (2016), 989–1001.
 (138) Truong, T.-N., Nguyen, K.-V., Fujiwara, I., and Koibuchi, M. Layout-conscious expandable topology for low-degree interconnection networks. IEICE Transactions on Information and Systems 99, 5 (2016), 1275–1284.
 (139) Tso, F. P., Hamilton, G., Weber, R., Perkins, C., and Pezaros, D. P. Longer is better: Exploiting path diversity in data center networks. In IEEE 33rd International Conference on Distributed Computing Systems, ICDCS 2013, 8–11 July, 2013, Philadelphia, Pennsylvania, USA (2013), pp. 430–439.
 (140) Valadarsky, A., Dinitz, M., and Schapira, M. Xpander: Unveiling the secrets of high-performance datacenters. In Proceedings of the 14th ACM Workshop on Hot Topics in Networks, Philadelphia, PA, USA, November 16–17, 2015 (2015), pp. 16:1–16:7.
 (141) Valerio, M., Moser, L. E., and Melliar-Smith, P. Recursively scalable fat-trees as interconnection networks. In Phoenix Conference on Computers and Communications (1994), vol. 13, Citeseer, pp. 40–40.
 (142) Valiant, L. G. A scheme for fast parallel communication. SIAM Journal on Computing 11, 2 (1982), 350–361.
 (143) Vamanan, B., Hasan, J., and Vijaykumar, T. Deadline-aware datacenter TCP (D2TCP). ACM SIGCOMM Computer Communication Review 42, 4 (2012), 115–126.
 (144) Van der Linden, S., Detal, G., and Bonaventure, O. Revisiting next-hop selection in multipath networks. In ACM SIGCOMM Computer Communication Review (2011), vol. 41, ACM, pp. 420–421.
 (145) Vanini, E., Pan, R., Alizadeh, M., Taheri, P., and Edsall, T. Let it flow: Resilient asymmetric load balancing with flowlet switching. In 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, March 27–29, 2017 (2017), pp. 407–420.
 (146) Varga, A., et al. The OMNeT++ discrete event simulation system. In Proceedings of the European simulation multiconference (ESM’2001) (2001), vol. 9, sn, p. 65.
 (147) Varga, A., and Hornig, R. An overview of the OMNeT++ simulation environment. In Proceedings of the 1st international conference on Simulation tools and techniques for communications, networks and systems & workshops (2008), ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), p. 60.
 (148) Varga, A., and Hornig, R. INET Framework for OMNeT++. Tech. rep., 2012.
 (149) Villamizar, C. OSPF optimized multipath (OSPF-OMP).
 (150) Yuan, X., Mahapatra, S., Lang, M., and Pakin, S. LFTI: A new performance metric for assessing interconnect designs for extreme-scale HPC systems. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium (2014), IEEE, pp. 273–282.
 (151) Yuan, X., Mahapatra, S., Nienaber, W., Pakin, S., and Lang, M. A new routing scheme for Jellyfish and its performance with HPC workloads. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (2013), ACM, p. 36.
 (152) Yuan, X., Mahapatra, S., Nienaber, W., Pakin, S., and Lang, M. A New Routing Scheme for Jellyfish and Its Performance with HPC Workloads. In Proceedings of 2013 ACM/IEEE Supercomputing (2013), SC ’13, pp. 36:1–36:11.
 (153) Zats, D., Das, T., Mohan, P., Borthakur, D., and Katz, R. H. DeTail: Reducing the flow completion time tail in datacenter networks. In ACM SIGCOMM 2012 Conference, SIGCOMM '12, Helsinki, Finland, August 13–17, 2012 (2012), pp. 139–150.
 (154) Zhang, Q., Cheng, L., and Boutaba, R. Cloud computing: State-of-the-art and research challenges. Journal of Internet Services and Applications 1, 1 (2010), 7–18.
 (155) Zhou, J., Tewari, M., Zhu, M., Kabbani, A., Poutievski, L., Singh, A., and Vahdat, A. WCMP: weighted cost multipathing for improved fairness in data centers. In Proceedings of the Ninth European Conference on Computer Systems (2014), ACM, p. 5.
 (156) Zhu, Y., Eran, H., Firestone, D., Guo, C., Lipshteyn, M., Liron, Y., Padhye, J., Raindel, S., Yahia, M. H., and Zhang, M. Congestion control for large-scale RDMA deployments. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 523–536.
 (157) Zhuo, D., Zhang, Q., Liu, V., Krishnamurthy, A., and Anderson, T. Rack-level congestion control. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks (2016), ACM, pp. 148–154.
Appendix
We now provide full discussions, analyses, and results omitted in the main paper body to maintain its clarity.
Topology | Hierarchy | Flexibility | Input | Diameter | Remarks | Deployed?
Deterministic:
Slim Fly (besta2014slim, ) | group hierarchical | fixed | §A.1 | 2 | "MMS" variant (besta2014slim, ; mckay1998note, ) | unknown
Dragonfly (kim2008technology, ) | group hierarchical | fixed | §A.2 | 3 | "balanced" variant (kim2008technology, ) (§3.1) | PERCS (arimilli2010percs, ), Cascade (faanes2012cray, )
HyperX (ahn2009hyperx, ) | semi-hierarchical | fixed | §A.5 | 2 | "regular" variant, 2× oversubscribed, forms a Flattened Butterfly (kim2007flattened, ) | unknown
HyperX (ahn2009hyperx, ) | semi-hierarchical | fixed | §A.5 | 3 | "regular" variant, 2× oversubscribed, forms a cube | unknown
Fat tree (leiserson1996cm5, ) | semi-hierarchical | fixed | §A.6 | 4 | 2-stage variant (3 router layers) | many installations
Complete (clique) | flat | fixed | §A.7 | 1 | HyperX, 2× oversubscribed | crossbar routers
Randomized:
Jellyfish (singla2012jellyfish, ) | flat | flexible | §A.3 | n/a | "homogeneous" variant (singla2012jellyfish, ) | unknown
Xpander (valadarsky2015, ) | flat | semi-flexible | §A.4 | n/a | restricted parameter choices (§A.4) | unknown
Appendix A Formal Description of Topologies
We first extend the discussion of the considered topologies; Table 5 provides details. Each topology is defined by certain input parameters, which we describe per topology in the respective "Associated Parameters" paragraphs below.
a.1. Slim Fly
Slim Fly (besta2014slim, ) is a state-of-the-art cost-effective topology for large computing centers that uses mathematical optimization to minimize the network diameter for a given router radix while maximizing the network size. SF's low diameter (two) ensures the lowest latency for many traffic patterns, and it reduces the number of required network resources (packets traverse fewer routers and cables), lowering cost as well as static and dynamic power consumption. SF is based on graphs approaching the Moore Bound (MB): the upper bound on the number of vertices in a graph with a given diameter and radix. This ensures full global bandwidth and high resilience to link failures due to good expansion properties. Next, SF is group hierarchical. A group is not necessarily complete, but all the groups are connected to one another (with the same number of global links) and form a complete network of groups. We select SF because it is a state-of-the-art design based on optimization that outperforms virtually all other targets in most metrics and represents topologies with diameter two.

Associated Parameters. The network size and radix depend on a parameter q that is a prime power with certain properties (detailed in the original work (besta2014slim, )). Some flexibility is ensured by allowing changes to the size and radix via the large number of suitable values of q. We use the value suggested in the original work.
a.2. Dragonfly
Dragonfly (kim2008technology, ) is a group hierarchical network with diameter three and a layout that reduces the number of global wires. Routers form complete groups; groups are connected to one another to form a complete network of groups with one link between any two groups. DF comes with an intuitive design and represents deployed networks with diameter three.
Associated Parameters. The input is: the group size a (routers per group), the number of channels h from one router to routers in other groups, and the concentration p (endpoints per router). We use the maximum-capacity DF (with the number of groups g = a·h + 1) that is balanced, i.e., the load on global links is balanced to avoid bottlenecks (a = 2p = 2h). In such a DF, a single parameter determines all others.
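The balanced sizing rule above can be turned into a short calculator. This is a sketch; the notation (a, h, p) follows Kim et al., and the function name is ours:

```python
def balanced_dragonfly(p):
    """Sizes of a balanced, maximum-capacity Dragonfly for concentration p.
    Balanced condition: a = 2p = 2h; maximum capacity: g = a*h + 1 groups."""
    a, h = 2 * p, p
    groups = a * h + 1           # one global link between any two groups
    routers = groups * a
    endpoints = routers * p
    return groups, routers, endpoints

# Example: p = 4 gives a = 8, h = 4, 33 groups, 264 routers, 1056 endpoints.
print(balanced_dragonfly(4))
```

This illustrates how a single parameter determines all others in the balanced variant.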
a.3. Jellyfish
Jellyfish (singla2012jellyfish, ) networks are random regular graphs constructed by a simple greedy algorithm that adds randomly selected edges until no further additions are possible. The resulting construction has good expansion properties (bondy1976graph, ). Yet, all guarantees are probabilistic, and rare degenerate cases, although unlikely, do exist. Even though the diameter can be arbitrarily high in such degenerate cases, it is usually much lower. We select JF as it represents flexible topologies that use randomization and offer very good performance properties.
Associated Parameters. JF is flexible: the network size and router radix can be arbitrary; we use parameters matching the less flexible topologies. To compensate for the different amounts of hardware used in different topologies, we include a Jellyfish network constructed from the same routers for each topology; the performance differences observed between those networks are due to the different hardware and need to be factored in when comparing the deterministic topologies.
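The greedy construction can be sketched in a few lines. This is a simplified sketch (function name and the quadratic candidate scan are ours; the full construction additionally performs edge swaps to fill remaining free ports):

```python
import random

def jellyfish(n, k, seed=0):
    """Greedy Jellyfish sketch: n routers, at most k network ports each.
    Randomly add edges between routers with free ports until none remain."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    free = set(range(n))                 # routers with at least one free port
    while True:
        cands = [(u, v) for u in free for v in free
                 if u < v and v not in adj[u]]
        if not cands:
            break
        u, v = rng.choice(cands)
        adj[u].add(v); adj[v].add(u)
        for w in (u, v):                 # a full router leaves the pool
            if len(adj[w]) == k:
                free.discard(w)
    return adj

g = jellyfish(10, 3)
print(all(len(nb) <= 3 for nb in g.values()))   # True: degree never exceeds k
```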
a.4. Xpander
Xpander (valadarsky2015, ) networks resemble JF but have a deterministic variant. They are constructed by applying one or more so-called lifts to a clique. The k-lift of a graph G consists of k copies of G, where for each edge (u, v) in G, the k copies of that edge are replaced with a random matching (which can be derandomized): the i-th copy of u is connected to the π(i)-th copy of v for a random permutation π. This construction yields a regular graph of the same degree with good expansion properties; the randomized lifts ensure good properties in expectation. We select XP as it offers the advantages of JF in a deterministically constructed topology.
Associated Parameters. We create XP with a single lift of arbitrary size. Such an XP is flexible, although there are more constraints than in JF; thus, we cannot create matching instances for each topology. We select parameters comparable to the diameter-2 topologies. We also consider constructions with multiple lifts, as this ensures good properties (valadarsky2015, ), but we do not notice any additional speedup.
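The lift construction described above can be sketched as follows; the function name and the vertex naming (original vertex, copy index) are ours, and we lift the complete graph on d + 1 vertices so that the result stays d-regular:

```python
import random
from itertools import combinations

def xpander_lift(d, k, seed=0):
    """Single random k-lift of the complete graph K_{d+1}.
    For each clique edge (u, v), connect copy (u, i) to (v, perm[i])."""
    rng = random.Random(seed)
    edges = []
    for u, v in combinations(range(d + 1), 2):   # edges of K_{d+1}
        perm = list(range(k))
        rng.shuffle(perm)                        # random matching of copies
        edges += [((u, i), (v, perm[i])) for i in range(k)]
    return edges

edges = xpander_lift(4, 3)   # 10 clique edges, each lifted into 3 matched edges
print(len(edges))            # 30; every vertex (u, i) keeps degree d = 4
```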
a.5. HyperX
HyperX (ahn2009hyperx, ) is formed by arranging vertices in a multidimensional array and forming a clique along each 1-dimensional row. Several topologies are special cases of HX, including complete graphs, hypercubes (HCs) (bondy1976graph, ), and Flattened Butterflies (FBF) (kim2007flattened, ). HX is a generic design that represents a wide range of networks.
Associated Parameters. An HX is defined by a 4-tuple (L, S, K, T): L is the number of dimensions, S and K are L-dimensional vectors (they respectively denote the array size in each dimension and the relative capacity of links along each dimension), and T is the concentration. Networks with uniform S and K (for all dimensions) are called regular. We only use regular networks. An HX with diameter two is about a factor of two away from the MB, resulting in more edges than in the other topologies; thus, we also include higher-diameter variants with a radix similar to that of the other networks. Now, for full bisection bandwidth (BB), one should set the relative link capacities K accordingly. Yet, since HX already has the highest radix and hardware cost (for a fixed size) among the considered topologies, we use a higher oversubscription (2×), as with the other topologies, to reduce the amount of used hardware. As we do not consider worst-case bisections, we still expect HX to perform well.

a.6. Fat Tree
Fat tree (leiserson1996cm5, ) is based on the Clos network (clos1953study, ) with disjoint inputs and outputs and unidirectional links. By "folding" inputs onto outputs, a multistage fat tree that connects any two ports with bidirectional links is constructed. We use three-stage FTs (diameter four); fewer stages reduce scalability, while more stages lead to a higher diameter. FT represents designs that are in widespread use and feature excellent performance properties such as full BB and non-blocking routing.
Associated Parameters. A three-stage FT with full BB can be constructed from routers with uniform radix k: it connects k³/4 endpoints using five groups of k²/4 routers each. Two of these groups (k²/2 routers) form the edge layer, with k/2 endpoints per edge router. Another two groups form an aggregation layer: each of the edge groups forms a complete bipartite graph with one of the aggregation groups using the remaining k/2 ports, which are called upstream. Finally, the remaining group is called the core: each of the two aggregation groups forms a fully connected bipartite graph with the core, again using the remaining upstream ports. This also uses all ports of the core routers. Now, for FT, it is not always possible to construct a matching JF, as k/2 can be fractional. In this case, we select the JF parameters to match the FT as closely as possible, which potentially changes the concentration. Note also that for FT, the concentration is the number of endpoints per edge router, while in the other topologies, all routers are edge routers.
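Assuming the standard construction with uniform even radix k sketched above, the sizes follow directly; the helper name is ours:

```python
def fat_tree_sizes(k):
    """Router and endpoint counts of a full-bisection three-stage fat tree
    built from radix-k routers: five groups of k^2/4 routers, k^3/4 endpoints."""
    assert k % 2 == 0, "radix must be even"
    group = k * k // 4
    routers = 5 * group          # 2 edge + 2 aggregation + 1 core group
    endpoints = k ** 3 // 4      # k/2 endpoints per edge router
    return routers, endpoints

print(fat_tree_sizes(4))   # (20, 16): the classic k = 4 example
```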
a.7. Fully-Connected Graphs
We also consider fully-connected graphs. They represent interesting corner cases, offer lower bounds on various metrics such as minimal path length, and can be used for validation purposes.
Associated Parameters. A clique is defined by a single parameter, the number of routers, leading to diameter one. We use 2× oversubscription, with the same rationale as for the HyperX topologies.
Appendix B Efficient Path Counting
Some measures of path diversity are computationally hard to derive for large graphs. Algorithms for all-pairs shortest-path analysis based on adjacency matrices are well known, and we reintroduce one such method here for the purpose of reproducibility. For the disjoint-paths analysis, however, all-pairs algorithms exist but are not commonly known. We introduce a method by Cheung et al. (cheung2013, ), which we adapt for length-limited edge connectivity computation.
b.1. Matrix Multiplication for Path Counting
It is well known that for a graph represented as an adjacency matrix, matrix multiplication (MM) can be used to obtain information about paths in that graph. Variations of this include the Floyd-Warshall algorithm (floyd1962algorithm, ) for transitive closure and all-pairs shortest paths (seidel1995all, ), which use different semirings to aggregate the respective quantities. To recapitulate how these algorithms work, consider standard MM using the + and × operators on non-negative integers, which computes the number of paths between each pair of vertices.
Theorem 1.
If A is the adjacency matrix of a directed graph G = (V, E), with A_{uv} = 1 iff (u, v) ∈ E and A_{uv} = 0 otherwise, then each cell (A^k)_{uv} contains the number of paths from u to v with exactly k steps in G.
Proof.
By induction on the path length k: For k = 1, A^1 = A, and the adjacency matrix contains a 1 in cell (u, v) iff (u, v) ∈ E, else a 0. Since length-1 paths consist of exactly one edge, this satisfies the theorem. Now consider the matrix A^{k−1}, for which the theorem holds by the induction hypothesis. We prove that the theorem also holds for A^k = A^{k−1} · A. Matrix multiplication is defined as

(1)   (A^k)_{uv} = Σ_{w ∈ V} (A^{k−1})_{uw} · A_{wv}.

According to the theorem, (A^{k−1})_{uw} is the number of length-(k−1) paths from u to some vertex w, and A_{wv} is the number of length-1 paths from said vertex to v. To reach v from u via w, we can choose any path from u to w and any edge from w to v, giving (A^{k−1})_{uw} · A_{wv} options. As we regard all paths from u to v, we consider all intermediate vertices w and count the total number (sum) of paths. This is exactly the count of length-k paths demanded by the theorem, as each length-k path can be uniquely split into a length-(k−1) and a length-1 segment. ∎
In the proof we ignored a few details caused by the adjacency matrix representation. First, the adjacency matrix models a directed graph. We can also use this representation for undirected graphs by making sure A is symmetric (then A^k is also symmetric). Adjacency matrices contain the entry 1 to indicate (u, v) ∈ E and 0 for (u, v) ∉ E. By generalizing A_{uv} to be the number of length-1 paths (i.e., the number of edges) from u to v as in the theorem, we can also represent multi-edges; the proof still holds.
Finally, the diagonal entries represent self-loops in the graph, which need to be explicitly modeled. Note that the intermediate vertex w is also allowed to be equal to u and/or v. Usually, self-loops should be avoided by setting A_{uu} = 0. Then (A^k)_{uu} will be the number of cycles of length k passing through u, and the paths counted in (A^k)_{uv} will include paths containing cycles. These cannot easily be avoided in this scheme (setting the diagonal to zero before or after each step does not prevent cycles, since a path from u to v might revisit an intermediate vertex, and we cannot tell without actually recording the path). For most measures, e.g., shortest paths or disjoint paths, this is not a problem, since paths containing cycles will naturally never affect these metrics.
On general graphs, the algorithms outlined here are not attractive, since it might take a number of iterations equal to the maximum shortest path length to reach a fixed point; however, since we are interested in low-diameter graphs, they are practical and easier to reason about than the Floyd-Warshall algorithm.
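A tiny worked example of Theorem 1, using plain integer matrix multiplication on a directed 4-cycle:

```python
def matmul(X, Y):
    """Integer matrix product over the (+, *) semiring: counts paths."""
    n = len(X)
    return [[sum(X[u][w] * Y[w][v] for w in range(n)) for v in range(n)]
            for u in range(n)]

# Directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0 as an adjacency matrix.
A = [[0] * 4 for _ in range(4)]
for u, v in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    A[u][v] = 1

A2 = matmul(A, A)
print(A2[0][2])        # 1: the single 2-step path 0 -> 1 -> 2
A4 = matmul(A2, A2)
print(A4[0][0])        # 1: the only 4-step walk from 0 is the full cycle
```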
b.1.1. Matrix Multiplication for Routing Tables
As another example, we will later use a variation of this algorithm to compute next-hop tables that encode, for each source u and each destination v, which out-edge of u should be used to reach v. In this algorithm, the matrix entries are sets of possible next hops. The initial adjacency matrix contains, for each edge of the graph, a set with the out-edge index of this edge, and empty sets otherwise. Instead of summing up path counts, we union the next-hop sets; and instead of multiplying with zero or one for each additional step, depending on whether there is an edge, we retain the set only if there is an edge for the next step. Since this procedure is not associative, it cannot be used to form longer paths from shorter segments, but it works as long as we always use the original adjacency matrix on the right side of the multiplication. The correctness proof is analogous to the path counting procedure.
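One step of this set-valued procedure can be sketched as follows (a simplification: next hops are stored as neighbor vertices rather than out-edge indices, and the function name is ours):

```python
def nexthop_step(H, adj):
    """Extend next-hop sets H (for paths of length k-1) to length k,
    always using the original adjacency structure on the right side."""
    n = len(adj)
    out = [[set() for _ in range(n)] for _ in range(n)]
    for u in range(n):
        for w in range(n):
            if not H[u][w]:
                continue
            for v in adj[w]:          # retain sets only where an edge continues
                out[u][v] |= H[u][w]  # union replaces the sum of path counts
    return out

adj = [{1}, {2}, {3}, set()]          # path graph 0 -> 1 -> 2 -> 3
H = [[{v} if v in adj[u] else set() for v in range(4)] for u in range(4)]
H2 = nexthop_step(H, adj)
print(H2[0][2])                       # {1}: to reach 2 in two hops, first go to 1
```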
b.2. Counting Disjoint Paths
The problem of counting all-pairs disjoint paths per pair is equivalent to the all-pairs edge connectivity problem, which is a special case of the all-pairs max-flow problem with uniform edge capacities. It can be solved using a spanning tree (the Gomory-Hu tree (panigrahi2008gomory, )) whose edges are annotated with the minimum cut values for the respective partitions. The minimum cut for each pair is then the minimum edge weight on the path between the pair in this tree, which can be computed cheaply for all pairs. The construction of the tree requires n − 1 minimum cuts, each of which costs one max-flow computation (e.g., using the push-relabel scheme (cherkassky1997implementing, )).
Since we are more interested in the max-flow values than in the min-cut partitions, a simplified approach can be used: while the Gomory-Hu tree has max-flow values and min-cut partitions equivalent to the original graph, an equivalent flow tree (gusfield1990, ) only preserves the max-flow values. While constructing it needs the same number of max-flow computations, these can be performed on the original input graph rather than on the contracted graphs of Gomory-Hu, which makes the implementation much easier.
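This simplification can be sketched as follows, under the assumptions above (undirected graph, unit edge capacities); the function names are ours, and `max_flow` is a plain shortest-augmenting-path routine:

```python
from collections import deque, defaultdict

def max_flow(cap, s, t):
    """Unit-capacity max flow via shortest augmenting paths.
    `cap` maps directed arcs (u, v) to residual capacity; mutated in place."""
    adj = defaultdict(set)
    for u, v in list(cap):
        adj[u].add(v); adj[v].add(u)
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:   # augment one unit along the path
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1

def equivalent_flow_tree(adj):
    """Gusfield-style equivalent flow tree: n - 1 max-flow computations,
    all on the original (uncontracted) graph. Returns, per vertex, its
    tree parent and the max-flow value to that parent."""
    n = len(adj)
    parent, fl = [0] * n, [0] * n
    for s in range(1, n):
        t = parent[s]
        cap = defaultdict(int)
        for u in adj:
            for v in adj[u]:
                cap[(u, v)] = 1
        fl[s] = max_flow(cap, s, t)
        reach, q = {s}, deque([s])     # source side of a min cut
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in reach and cap[(u, v)] > 0:
                    reach.add(v); q.append(v)
        for v in range(s + 1, n):      # re-hang vertices on the source side
            if parent[v] == t and v in reach:
                parent[v] = s
    return parent, fl

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
tree_parent, tree_flow = equivalent_flow_tree(ring)
print(tree_flow[1:])   # [2, 2, 2]: every pair in a 4-cycle has edge connectivity 2
```

The max-flow value between any pair is then the minimum edge value on the tree path between them.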
For length-restricted connectivity, common max-flow algorithms have to be adapted to respect the path-length constraint. The Gomory-Hu approach does not work, since it is based on the principle that the distances in the original graph do not need to be respected. We implemented an algorithm based on the Ford-Fulkerson method (ford1956maximal, ) using BFS (cormen2009introduction, ), which is not suitable for an all-pairs analysis, but can provide results for small sets of samples.
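A sketch of this length-limited Ford-Fulkerson approach (unit edge capacities; an undirected graph is given as symmetric adjacency sets; the function name is ours, and shortest-path augmentation is a heuristic under strict length limits, matching the sampling use described above):

```python
from collections import deque

def limited_edge_connectivity(adj, s, t, max_len):
    """Repeatedly augment along a shortest residual path from s to t,
    stopping once no augmenting path of length <= max_len remains."""
    residual = {u: set(vs) for u, vs in adj.items()}
    flow = 0
    while True:
        parent = {s: None}
        q = deque([(s, 0)])
        found = False
        while q:
            u, d = q.popleft()
            if d == max_len:              # do not extend paths beyond the limit
                continue
            for v in residual[u]:
                if v not in parent:
                    parent[v] = u
                    if v == t:
                        found = True
                        q.clear()
                        break
                    q.append((v, d + 1))
        if not found:
            return flow
        v = t
        while parent[v] is not None:      # augment: flip residual edges
            u = parent[v]
            residual[u].discard(v)
            residual[v].add(u)
            v = u
        flow += 1

ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(limited_edge_connectivity(ring, 0, 2, 2))   # 2: two disjoint length-2 paths
print(limited_edge_connectivity(ring, 0, 2, 1))   # 0: no direct edge 0 -> 2
```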
The spanning-tree-based approaches only work for undirected graphs and solve the more general max-flow problem. There are also algorithms that solve only the edge-connectivity problem, using completely different approaches. Cheung et al. (cheung2013, ) propose an algorithm based on linear algebra which can compute all-pairs connectivity in O(m^ω) time, where m is the number of edges and ω is the exponent of matrix-matrix multiplication. For our case, with naive matrix inversion, this comes with massive space use, but there are many options to use sparse representations and iterative solvers, which might lower this cost. Due to their construction, these algorithms also allow a limitation of the maximum path length (with a corresponding complexity reduction), and the heavy computations are built on well-known primitives with low constant overhead and good parallel scaling, compared to classical graph schemes.
b.3. Deriving Edge Connectivity
This scheme is based on the ideas of Cheung et al. (cheung2013, ). First, we adapt the algorithm for vertex connectivity, which allows lower space and time complexity than the original algorithm and might also be easier to understand. The original edge-connectivity algorithm is obtained by applying it to a transformed graph. (Vertex connectivity, defined as the minimum size of a cut set of vertices that have to be removed to make two vertices disconnected, is not well defined for neighbors in the graph; the edge-connectivity algorithm avoids this problem, but its approach cannot be generalized for vertex connectivity.)
We then introduce the path-length constraint by replacing the exact solution obtained by matrix inversion with an approximated one based on iterations, which correspond to incrementally adding steps. The algorithm is randomized in the same way as the original; we ignore the probability analysis for now, as the randomization is only required to avoid degenerate matrices in the process and to allow the use of a finite domain. The domain is defined to be a finite field F of sufficient size to make the analysis work and allow a real-world implementation; for the algorithm itself, we can assume F = GF(p) for a sufficiently large prime p.

First, we consider a connection matrix K, which is just the adjacency matrix with random coefficients c_{uv} ∈ F for the edges:

(2)   K_{uv} = c_{uv} if (u, v) ∈ E, and K_{uv} = 0 otherwise.
In the edge-connectivity algorithm, we instead use the much larger connection matrix of a transformed graph whose vertices are the directed edges of the original graph (empty rows and columns could be dropped, leaving an m × m matrix, but our implementation does not do this, since empty rows and columns are free in a sparse matrix representation):

(3)   K'_{ef} = c_{ef} if edge e ends at the vertex where edge f starts, and K'_{ef} = 0 otherwise.
Now, we assign to each vertex a vector from F^d, where d is the maximum vertex degree, and consider the system of equations defined by the graph: the value of each vertex shall be the linear combination of the values of its neighbors, weighted by the edge coefficients in K. To force a non-trivial solution, we designate a source vertex s and add pairwise orthogonal vectors to each of its neighbors. For simplicity, we use unit vectors, placed in the respective columns of a matrix S (of the same shape as the matrix F that collects all vertex values as columns). So, we get the condition

(4)   F = F · K + S.
This can be solved as

(5)   F = S · (I − K)^{−1}.
The work-intensive part is inverting (I − K), which can be done explicitly and independently of S, to get a computationally inexpensive all-pairs solution, or implicitly only for the vectors in S, for a computationally inexpensive single-source solution. To compute connectivity, we use the following theorem. The scheme outlined in the following proof counts vertex-disjoint paths of any length.
Theorem 2.
The size of the vertex cut set between s and t equals rank(F · Q_t), where F = S · (I − K)^{−1} and Q_t is a permutation (selection) matrix selecting the columns of the incoming neighbors of t.
Proof.
First, rank(F · Q_t) ≤ c (the cut size), because all nonzero vectors were injected around s, and all vectors propagated through the cut set of c vertices on their way to t, so there cannot be more than c linearly independent vectors near t. Second, rank(F · Q_t) ≥ c, because there are c vertex-disjoint paths from s to t. Each passes through one of the outgoing neighbors of s, which has one of the linearly independent vectors of S assigned (combined with potentially other components). As there is a path from s to t through this vertex, on each edge of this path the component of S will be propagated to the next vertex, multiplied by the respective coefficient in K. So, at t, each of the c paths will contribute one orthogonal component. ∎
To count length-limited paths instead, we simply use an iterative approximation of the fixed point instead of the explicit solution. Since we are only interested in the ranks of submatrices, it is not necessary to actually find a precise solution; rather, following the argument of the proof above, we want to follow the propagation of linearly independent components through the network. The first approach is simply iterating Equation 4 from some initial guess. For this guess, we use zero vectors: due to the constant term S in Equation 4, we still get non-trivial solutions, but we can be certain not to introduce additional linearly dependent vectors:

(6)   F^{(0)} = 0,   F^{(i+1)} = F^{(i)} · K + S.
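A small sketch of this iterative, length-limited variant over GF(p); all names are ours, and correctness holds only with high probability due to the random coefficients:

```python
import random

P = 2_147_483_647  # a large prime; all arithmetic is over GF(P)

def rank_mod_p(rows, p=P):
    """Rank of a matrix (list of rows) over GF(p) via Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        piv = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        inv = pow(rows[rank][col], p - 2, p)       # modular inverse (Fermat)
        rows[rank] = [(x * inv) % p for x in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                f = rows[i][col]
                rows[i] = [(a - f * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def vertex_connectivity(adj, s, t, max_len, seed=0):
    """Iterate F <- F K + S for max_len steps (length-limited propagation),
    then return the rank of the columns at t's in-neighbors (Theorem 2)."""
    rng = random.Random(seed)
    n = len(adj)
    d = max(len(vs) for vs in adj.values())        # vector dimension
    K = [[rng.randrange(1, P) if v in adj[u] else 0 for v in range(n)]
         for u in range(n)]
    S = [[0] * n for _ in range(d)]                # unit vectors at s's neighbors
    for i, u in enumerate(sorted(adj[s])):
        S[i][u] = 1
    F = [[0] * n for _ in range(d)]                # initial guess: zero vectors
    for _ in range(max_len):
        F = [[(sum(Fr[u] * K[u][v] for u in range(n)) + Sr[v]) % P
              for v in range(n)] for Fr, Sr in zip(F, S)]
    preds = [u for u in range(n) if t in adj[u]]   # in-neighbors of t
    return rank_mod_p([[Fr[u] for u in preds] for Fr in F])

diamond = {0: {1, 2}, 1: {3}, 2: {3}, 3: set()}
print(vertex_connectivity(diamond, 0, 3, 3))       # 2: two vertex-disjoint paths
```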