A perennial question in computer networks is where to place routing functionality among components of a distributed computer system: whether it be at the end hosts or in the network itself. In data centers in particular, the research community has presented compelling arguments for route control and visibility at both locations. Their work has shown the importance of placement to critical network properties like fault tolerance and load balancing.
Link/switch failure handling is a good example of the complexities of this decision. One broad class of proposals argues for giving the network the ability to detect and route around failures [45, 28, 29]—in essence, they argue for a smart network supporting a simple edge. When working as intended, these systems are both fast and efficient at handling failures; however, for a broad class of failures (e.g., “silent” failures), the detection methods themselves can fail, leaving end hosts with little-to-no visibility or control of how their packets are handled in the network.
The other class argues for moving all failure handling to the edge of the network, essentially relegating switches and middleboxes to simple static routing policies, i.e., a simple network controlled by a smart edge. In this approach, fate sharing guarantees that no packet losses will go unnoticed; however, this comes at the cost of the ability to react quickly and locally to easily detectable failures, a feature that is essential to high network availability.
This paper explores a third option: the co-existence of an intelligent network with an intelligent edge. Even constraining ourselves to fault tolerance and load balancing, certain problems within those domains are best solved in the former, while others are best solved at the latter; an ideal network architecture would allow for both. To that end, we present a new data center network architecture, Volur, that facilitates this interaction. In the end, however, it is not possible to allow all features at all locations in the network (fine-grained load balancing, for instance, relies on transient information that is typically not externally visible). Instead, the goal of Volur is to provide clear guidelines for where to implement features, to detail the restrictions on those features, and to present a framework to implement them in a conforming way.
The key architectural requirement of Volur is predictability of the network. Specifically, switch routing behavior should be externally predictable by the trusted network-layer software running on the endpoints. Predictability forms the contract between the network and its end hosts; as long as routing decisions are predictable and/or infrequent, switches are free to do as they wish. In return, end hosts must allow for transient inaccuracies in prediction. As an example, switches are allowed to locally detect and reroute around failures as long as end hosts eventually become informed of the new network state.
Our system, Volur, presents a prototype implementation of predictable networking that is composed of three components: (1) switches that route using predictable functions of the packet’s header and switch configuration state, (2) a network state service that disseminates any required information to end hosts, and (3) a per-end host path choice module that models network behavior. To demonstrate its flexibility, we implement two end host applications on top of Volur that utilize the predictability of the network to locate failures and balance load in spite of concurrent in-network functionality.
To demonstrate its feasibility in practice, we built Volur network predictors for two different deployments: a large production data center at Facebook with upwards of one hundred thousand devices, and a smaller testbed. These deployments span multiple switch ASICs in switches from multiple vendors, and show that while switch functionality can be complicated, it does not need to be complicated. While we understand that not every network operator has the necessary information to implement this approach today, it is our hope that the benefits we show provide sufficient incentives for future development in this direction.
Finally, to evaluate our architecture, we utilized the aforementioned large production data center deployment; a second, modestly-sized Cloudlab testbed; and an ns-3 packet-level simulated network. Using these testing environments, we show: (1) that our predictability-supported end host failure handling system can locate non-fail-stop failures with over 0.95 precision and 0.85 recall even if there are multiple, diverse failures and even if not all hosts participate in localization, and (2) that hosts can route around those failures within a fraction of a second. We also show that, for at least one network feature that is disallowed in our architecture (fine-grained load balancing), an end-host-based design on our architecture approaches state-of-the-art in-network approaches like CONGA.
More specifically, we make the following contributions:
We present the design and implementation of a novel data center architecture, Volur, that facilitates the co-existence of an intelligent network with an intelligent edge through the contract of predictability.
We introduce a system, Volur-FL, that leverages predictability to implement extremely accurate and fine-grained failure localization and rerouting in the presence of relatively-complex network features.
We also introduce a second system, Volur-LB, that demonstrates both the flexibility of Volur and how to emulate in-network features predictably.
Finally, we demonstrate the practicality of Volur by implementing a prototype that can accurately predict and control path choice on unmodified switches in a large-scale production data center.
Today’s data center networks typically take the form of a multi-rooted tree of switches like the one shown in Fig. 1. One natural property of these tree topologies is the presence of many paths between any two end hosts. Routing protocols, which select the path to use for any particular packet, are essential to both maintaining network reliability and ensuring balanced load. These protocols are often complex and opaque to end hosts.
Central to this ecosystem is ECMP, a switch-level mechanism that randomly chooses a next hop for each flow from among several options. Other examples include Link Aggregation Groups (LAGs), which operate similarly to ECMP, but among point-to-point links rather than paths; Resilient Hashing, which dictates ECMP behavior so that flow assignment is stable even as links are brought up or taken down; and more recently, in-network load balancing like LocalFlow, DRILL, DLB, and CONGA, which route based on transient workload statistics. Recent proposals for increased programmability of networks only increase the potential for complexity and opacity of routing.
2.1 The Case for End Host Control
In contrast to the current state of the network, the research community has, over the years, made many compelling arguments for end host visibility and/or control of routing. Some have noted that, for some features, end hosts are uniquely suited to solving a particular problem such as failure localization or performance isolation. Others have noted that end host changes are easier to implement and deploy compared to changes to the network [22, 40].
A classic example (and part of the original inspiration for the end-to-end argument) is the handling of packet drops and their associated network failures. The opacity of today’s routing protocols presents a significant challenge to the identification and mitigation of these failures, particularly when they evade traditional debugging tools such as heartbeats (e.g., BFD and BGP keepalives) and switch drop counters. Examples of failures that are not easily caught by traditional network features are:
Partial failures: Some failures are stochastic in nature. Switches may not detect these as their occasional heartbeats/keepalives can get (un)lucky and miss the problem while application traffic will continue to experience packet drops.
Silent failures: Finally, switch counters are sometimes unreliable, leading to cases of silent failures that are not reflected in any network statistic. For failures that are not noticed by the switch itself (e.g., partial or input-dependent) and are also not reflected in counters, detection is extremely difficult. These are known to occur and cause significant headache in practice [26, 45, 1].
It is because of the above classes of failures that end hosts are often seen as an attractive location in which to detect and handle failures. Researchers have also presented many other use cases for visibility and control of routing at the end host. Unfortunately, the opacity of today’s data center networks prevents this, limiting the scope/deployment of such approaches.
2.2 The Case for Network Control
A natural reaction to the desire for end host control over routing is to migrate the complexity of the network to the edge. This approach is explored by several recent data center networking proposals [19, 24, 40, 36]. Some of these proposals let networks handle routing, but give end hosts the ability to change paths on demand through an IPv6 Flow Label or similar mechanism [40, 43]—a useful workaround, but limited in the features it can support.
On the more extreme end of this spectrum are source routing approaches like those proposed in prior work. These proposals successfully enable end hosts to perform fine-grained failure detection and rerouting, but a naive application of source routing to data centers surrenders at least two crucial features:
Fast failover: Easily-detectable failures like signal loss on a link are more simply and quickly handled in the network. In these cases, switches adjacent to the failure are able to detect/reroute at timescales orders of magnitude faster than the edge. This fast failover is essential for achieving high network availability.
Backward compatibility: Most current applications and operating systems are designed to be agnostic to the routing decisions of the network. While it is possible to change all applications, OSes, and/or hypervisors, an ideal solution would permit the use of legacy software.
3 Volur: A Predictable Network
Table 1: Challenges in making a predictable network practical, and the sections that address them.
|Challenge|Solution|Sections|
|---|---|---|
|Is it possible and/or practical to predict network behavior?|Some networks are already predictable. More generally, we anticipate that the OpenFlow model may also apply here: if customers value predictability, vendors will provide it as a feature.|3.1, 5.1|
|Is predictability compatible with dynamic switch behavior?|For infrequently-changing behavior (e.g., failover), Volur disseminates versioned network state to end hosts.|3.2|
| |For frequently-changing behavior (e.g., load balancing), end hosts can approximate current switch features.|4.2|
|How do we deal with unpredictable failures and other inaccuracies in prediction?|End hosts use versioned topology to sieve out reliable drop statistics. Common-case consistent hashing limits routing changes.|4.1.1, 5.2.4|
|How do we defend against DDoS attacks that might be enabled by end host path prediction/control?|Only convey topology to the end host trusted computing base. A NAT can be used if extra protection is needed.|3.3|
In this paper, we explore the design of an architecture for the peaceful co-existence of an intelligent network with intelligent end hosts. The key architectural principle of our work is predictability of the network as a method for interoperation. In this model, switches are free to implement a wide range of routing techniques as long as they are externally predictable. End hosts are then free to implement any functionality they wish on top of the predictable network.
More specifically, our requirement is that switches route based only on the packet header and infrequently changing configuration state. Compared to pure source routing, prediction in this model is not always accurate. The point of fast failover, for instance, is that the switches know about and can react to failures faster than end hosts. Immediately after a failure, the network may not operate as end hosts expect. Instead, end hosts must tolerate a small amount of inaccuracy in return for these features.
There are several challenges in making such a system practical, which we list in Tab. 1. Thus, the primary contribution of our work is to characterize what it takes to design and implement a predictable network and to detail its benefits/limitations. Volur consists of three primary components: switches that are predictable, a Volur service that gathers and distributes the current state of switches, and end hosts that use that state to predict routes. The overall architecture is illustrated in Fig. 2.
3.1 A Predictable Switch
As mentioned, our primary design principle is simple to state: switches should route only on the packet’s header and infrequently changing configuration state. Note that this restriction only applies to functions that affect the packet’s path—features such as management, monitoring, QoS, and queuing are all orthogonal.
3.1.1 A Simple Predictable Packet Pipeline
To see why our design principle is congruent with common-case network features, we describe the implementation of a predictable network router. Due to space constraints, we limit our exposition to the subset of the pipeline necessary for forwarding a simple Ethernet and IPv4 packet without VLANs or tunneling. The switch we describe allows for both fast failover and backward compatibility.
In Sec. 5.1, we go on to show that, not only can we build a predictable switch, we can also configure some existing switches to be predictable.
L2 processing. L2 processing is typically based purely on table lookups. For instance, if the destination MAC of the packet matches the switch’s MAC address, the packet will continue to L3 processing. Otherwise, it will be switched as a raw Ethernet frame (we omit those details).
Depends on: packet header and switch’s MAC address.
L3 processing. L3 processing is also based on table lookups, but may require other features as well. The switch begins by looking up the destination IP in its forwarding table. The resulting entry may either point to an egress port, multiple egress ports, or indicate that the packet should be dropped. When there are multiple possible next hops, as is often the case in Clos networks, the switch will calculate a hash function over several subfields of each packet’s header. The result of the hash function (modulo the number of possible next hops) is used as an index into the next hop table to determine the egress port.
ECMP has traditionally been considered unpredictable, but for efficiency, modern ECMP implementations are typically deterministic. For instance, a switch might hash over a packet’s 5-tuple using simple functions like XOR, CRC, or table lookups [23, 18, 9]. In practice, these hash functions can be combined with hash seeds, preprocessing, bit shifting, masks, and resilient hashing techniques to improve results in various situations, but all of the above functions are predictable as long as changes are relatively infrequent.
Depends on: packet header, L3 forwarding table, multipath table, and ECMP hash configuration.
Egress modifications. Finally, before the packet is sent back out on the wire, the switch will update the src and dst MAC addresses to correspond to the next L2 hop. In addition, it will recalculate the TTL field and IP and Ethernet checksums.
Depends on: switch’s MAC address, neighbor’s MAC address, and packet header.
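As a rough illustration of the pipeline above, the following Python sketch forwards a packet using only the header and a snapshot of switch state. The `SwitchState` fields, the dictionary-based header, and the CRC32-based hash are assumptions made for illustration; real switches use vendor-specific hash functions and table layouts.

```python
import ipaddress
import zlib
from dataclasses import dataclass, field


@dataclass
class SwitchState:
    """Hypothetical snapshot of the state that forwarding depends on."""
    mac: str
    routes: dict = field(default_factory=dict)  # prefix -> list of egress ports
    hash_seed: int = 0                          # ECMP hash configuration


def ecmp_index(header: dict, seed: int, n: int) -> int:
    """Deterministic stand-in for an ECMP hash: CRC32 over the 5-tuple."""
    key = "|".join(str(header[k]) for k in
                   ("src_ip", "dst_ip", "proto", "src_port", "dst_port"))
    return zlib.crc32(key.encode(), seed) % n


def forward(state: SwitchState, header: dict):
    """Return the egress port, or None if the packet is not L3-forwarded."""
    # L2: only frames addressed to the switch's MAC continue to L3.
    if header["dst_mac"] != state.mac:
        return None  # raw Ethernet switching is omitted, as in the text
    # L3: longest-prefix match over the forwarding table.
    dst = ipaddress.ip_address(header["dst_ip"])
    best_len, ports = -1, None
    for prefix, egress in state.routes.items():
        net = ipaddress.ip_network(prefix)
        if dst in net and net.prefixlen > best_len:
            best_len, ports = net.prefixlen, egress
    if not ports:
        return None  # no route, or an explicit drop entry
    # Multipath: hash modulo the number of next hops indexes the group.
    return ports[ecmp_index(header, state.hash_seed, len(ports))]
```

Because `forward` reads nothing but its two arguments, an end host holding the same `SwitchState` computes the same egress port as the switch would.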
3.1.2 Other Network Routing Functions
The above discussion focuses on L3 forwarding in a predictable switch: how to implement it and how to predict it given the current state of the switch. ECMP is included in the set of functions that can be made predictable in this fashion. The same is true of most other forwarding functionality, e.g., encapsulation, VLANs, and QoS. There are, however, some switch routing features that cannot be made predictable. These typically involve fine-grained load balancing, e.g., CONGA and DRILL. We can classify functions into these two categories based on their inputs:
Infrequently changing state: If, in addition to the packet header, the switch routing algorithm depends only on infrequently-changing state, it is considered to be predictable in our model. As an example, in L3 routing, failures and routing updates can cause unpredictable changes in the network, but as long as the changes are infrequent, end host network applications are expected to handle those inaccuracies. The aforementioned forwarding functionality falls squarely into this category, as do many recent data center routing proposals including WCMP, F10, B4, and SWAN.
Frequently changing state: If, on the other hand, the switch routing algorithm depends on frequently changing state such as instantaneous queue length or utilization, the function is considered unpredictable. The cutoff for frequency is determined by the operator and is a function of the accuracy requirements of route prediction. Examples of algorithms in this category include DLB, CONGA, DRILL, and LocalFlow, all proposals for fine-grained, in-network load balancing. They also include certain counter-based ACL, QoS, and packet processing policies found on modern switches. These functions are disallowed in Volur switches, but in Sec. 4.2 we explore the efficient and accurate end host emulation of this class of proposals.
3.2 The Volur Service
Predicting the route of a packet requires both the packet’s header and elements of the switch’s current state. For the sender of the packet, obtaining its header is simple. For the other piece of information—switch state—we introduce an aggregation service that gathers up-to-date state from every switch and disseminates it to every end host. This dissemination must be performed on any switch state change including link failures/recoveries and control plane routing updates. Replication and sharding of such a service is straightforward; for ease of explanation, we assume a logically centralized Volur service.
The primary goal of the Volur service is to disseminate switch updates as quickly as possible. There are two steps:
Switches to the Volur service. As state updates may occur at irregular intervals and must be disseminated quickly, switches mostly operate on a push model. When a state change occurs (e.g., a BGP update or link failure), switches will immediately send a diff of their state to the Volur service. The service also periodically pings each switch for a hash of their current state to ensure that it is still alive and correctly synchronized. In systems with an existing centralized SDN infrastructure, the Volur service is a natural extension to the SDN controller.
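The push-plus-ping pattern described here can be sketched as follows. This is only an illustration under assumed data structures: the flat dictionary state, the diff convention (a `None` value deletes a key), and the `VolurServiceView` class are hypothetical and not taken from the Volur implementation.

```python
import hashlib
import json


def state_digest(state: dict) -> str:
    """Hash a switch state dictionary independently of key order."""
    blob = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def apply_diff(state: dict, diff: dict) -> dict:
    """Apply a pushed diff: keys mapped to None are deleted, others are set."""
    new_state = dict(state)
    for key, value in diff.items():
        if value is None:
            new_state.pop(key, None)
        else:
            new_state[key] = value
    return new_state


class VolurServiceView:
    """Hypothetical per-switch record kept by the aggregation service."""

    def __init__(self, initial: dict):
        self.state = dict(initial)
        self.version = 0

    def on_push(self, diff: dict) -> int:
        """Handle a diff pushed by the switch; returns the new version."""
        self.state = apply_diff(self.state, diff)
        self.version += 1
        return self.version

    def in_sync(self, reported_digest: str) -> bool:
        """Compare against the digest returned by the periodic ping."""
        return state_digest(self.state) == reported_digest
```

If `in_sync` returns false, the service has missed a diff and would need to fetch a full copy of the switch state; that recovery path is left out of the sketch.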
Volur service to end hosts. The second step is to disseminate the state changes to end hosts. There are two channels for state dissemination in our system. The end hosts periodically pull a full snapshot of the current network state. In addition, the Volur service broadcasts versioned, perishable state updates to end hosts. These updates are sent using UDP to ensure time bounds.
When a switch updates its state, it sends a diff of the state change to the Volur service. Let the maximum propagation delay of this message be $d_1$.
Upon receipt of the update, Volur increments its version number and distributes the update to all affected hosts. Let the maximum delay of this message be $d_2$.
Upon receipt of an update from the Volur service, hosts send back an acknowledgment.
If the ACK is not received after some predetermined timeout, the Volur service informs the end host during its next checkpoint.
Given the above protocol, we can quantify the length of three distinct phases of prediction accuracy. Assume that the end host receives an update from version $v$ to version $v+1$ at time $t$, as shown in Fig. 3. Pre-change predictions (before $t - d_1 - d_2$) are correct. Mid-update predictions are slightly uncertain in that they can follow either state $v$ or $v+1$. This period lasts from $t - d_1 - d_2$ to $t$. Post-update predictions starting from time $t$ should all follow version $v+1$. If the end host, during a checkpoint, finds out that it failed to acknowledge an update, all predictions between the current checkpoint and the last one are potentially inaccurate. All inaccuracies are handled by higher-level network applications.
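The three phases can be expressed as a small helper. The sketch below reads the protocol as follows, which is an interpretation rather than a statement of the authors' exact definitions: `d1` and `d2` are upper bounds on the switch-to-service and service-to-host delays, and an update received at `update_time` moves the host from `old_version` to `old_version + 1`.

```python
def possible_versions(send_time, update_time, d1, d2, old_version):
    """Return the set of state versions a packet may have been routed under.

    A change seen by the host at update_time happened at the switch no
    earlier than update_time - (d1 + d2), so packets sent before that
    instant used the old state, and packets sent at or after update_time
    are predicted with the new one. In between, either may apply.
    """
    new_version = old_version + 1
    if send_time < update_time - (d1 + d2):
        return {old_version}               # pre-change
    if send_time < update_time:
        return {old_version, new_version}  # mid-update
    return {new_version}                   # post-update
```

An application such as a drop-attribution module can use the size of the returned set to decide whether a measurement is unambiguous.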
3.3 End Hosts
Given a predictable network, end host operation is relatively straightforward. We provide to each end host a switch predictor that takes a switch state object and an input packet header. The output of the predictor is a next hop and output packet header.
Predicting a packet’s path. For every packet, path prediction is just a matter of iteratively chaining the next hop and packet header predictions of each intervening switch. The switch state object is obtained from the Volur service as described above; the end hosts already have the initial packet header.
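A minimal sketch of this chaining is shown below. The `predictors` and `states` dictionaries and the convention that an unknown next hop is the destination host are assumptions for illustration; the per-switch predictor signature follows the description above (state and header in, next hop and header out).

```python
def predict_path(predictors, states, first_hop, header, max_hops=16):
    """Chain per-switch predictions until the packet leaves the fabric.

    predictors: switch name -> function(state, header) -> (next_hop, header)
    states:     switch name -> switch state object from the state snapshot
    Returns (switches traversed, final next hop or None on drop, header).
    """
    path, hop = [], first_hop
    for _ in range(max_hops):
        if hop not in predictors:
            # Not a switch we model: treat it as the destination host.
            return path, hop, header
        path.append(hop)
        hop, header = predictors[hop](states[hop], header)
        if hop is None:
            return path, None, header  # predicted drop
    raise RuntimeError("hop limit exceeded; possible routing loop")
```

The hop limit guards against a stale or inconsistent snapshot that would otherwise make the prediction loop forever.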
Controlling a packet’s path. To control a packet’s path, end hosts only need to find a packet header that maps onto a target/acceptable round-trip path. As switch operations are typically not cryptographically secure, it is often possible to create an efficient inverse for them. In Sec. 5.1, we show that such techniques can be used to generate headers for specific paths in our large production data center network in under 12 s. Solutions are not guaranteed to exist, but operators work hard to ensure that hash functions cover the entire network evenly.
End hosts have at least a few options when trying to craft a header to hit one of the available paths. In general, they need bits that are otherwise unused by the network. For instance: IPv6 flow labels (20 bits) are intended for purposes like ours; port numbers (up to 32 bits) can also be used, but in the case of source ports, this may require minor OS changes in the way ports are allocated; and finally, IP addresses (up to 64/256 bits) are possible as well, for instance by giving each server a /24 or /120. End hosts can also combine bit regions to obtain a larger ‘address space’.
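When no efficient inverse of the hash is available, a simple fallback is to enumerate the free header bits until the predicted path matches the target. The sketch below does this over a 20-bit flow label; the SHA-256-based `ecmp_choice` is only a stand-in for a real switch hash, and the `fanouts` representation (next-hop count and seed per multipath hop) is an assumption made for illustration.

```python
import hashlib


def ecmp_choice(header, n, seed):
    """Stand-in deterministic hash; real switches use vendor-specific ones."""
    key = f'{seed}|{header["src"]}|{header["dst"]}|{header["flow_label"]}'
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n


def find_flow_label(src, dst, fanouts, target, label_bits=20):
    """Search flow labels until the predicted per-hop choices equal target.

    fanouts: list of (number of next hops, hash seed), one per multipath hop
    target:  desired next-hop index at each of those hops
    Returns a matching label, or None if no label in the space works.
    """
    for label in range(1 << label_bits):
        header = {"src": src, "dst": dst, "flow_label": label}
        choices = [ecmp_choice(header, n, seed) for n, seed in fanouts]
        if choices == list(target):
            return label
    return None
```

With well-mixed hashes, the expected number of trials is roughly the number of distinct paths, so the search is short for typical fan-outs. It can also return `None`, which mirrors the caveat that solutions are not guaranteed to exist.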
Legacy hosts can continue to send packets without modification, and those packets will be load balanced with ECMP just as they are today.
Preventing malicious control of paths. As a corollary, our architecture allows for efficient defenses against DDoS attacks. A potential concern with our system is that it may allow malicious users in multi-tenant data centers to launch a targeted DDoS attack against individual network elements. To that end, we note that without up-to-date switch state, the network is not more predictable than it is today—the configuration state space is very large and constantly changing. Further, because cluster and fabric switches and links have extremely high capacity, it would be difficult for the attacker to determine whether any particular trial succeeded at steering to a particular path without access to data center internal traceroute. Thus, the Volur service only distributes state to the trusted computing base, and not untrusted applications/VMs.
Even so, if more security is necessary, the VM layer can pick a random source port or flow label for each connection, similar to the NATs that many VMMs already use. If even a small part of the header is randomized, steering is difficult.
Changing paths mid-connection. Beyond controlling a single packet’s path is controlling the path of an entire TCP connection. For new connections, this is just a matter of choosing a suitable 5-tuple for the connection. To reroute existing connections to avoid a failed or congested network component, Volur must change the packet headers without disrupting TCP’s ability to demultiplex traffic. IPv6 flow labels are a good candidate for this. Otherwise, e.g., in the case of TCP source ports, the VMM/OS may need to rewrite the packet headers.
To be more concrete about this second option, when Alice wishes to change the path of a connection with Bob, she might decide on a TCP source port, $p'$, that results in the target forward and reverse paths. Alice will send the new source port to Bob asynchronously in a separate connection. This must be done out-of-band because in the case of a failed path, the original connection may not be usable. When Bob acknowledges the new source port, Alice installs packet rewrite rules as follows:
Egress: Alice overwrites the src port number with $p'$.
Ingress: Alice remembers the original src port number in a hash table so that when a response comes in, she can insert the original port transparently.
Bob installs similar rewrite rules:
Egress: Bob overwrites the dst port number with $p'$.
Ingress: Bob remembers the original dst port number in a hash table so that when a response comes in, he can insert the original port transparently.
Both ends of the connection can initiate such a path change, but to prevent flapping, we designate the client that called connect() to be responsible for most path changes.
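The rewrite rules on the initiating side (Alice) can be sketched as a pair of hash tables. The `PortRewriter` class, the dictionary packet representation, and the simplified keys (peer address and port only) are assumptions for illustration; Bob's side is symmetric, with the roles of the source and destination port fields swapped.

```python
class PortRewriter:
    """Hypothetical rewrite table for the host that initiates a path change."""

    def __init__(self):
        self.egress = {}   # (peer ip, original src port) -> new src port
        self.ingress = {}  # (peer ip, new src port) -> original src port

    def install(self, peer, original_port, new_port):
        """Install both directions once the peer has acknowledged new_port."""
        self.egress[(peer, original_port)] = new_port
        self.ingress[(peer, new_port)] = original_port

    def on_egress(self, pkt):
        """Overwrite the source port on outgoing packets of the connection."""
        new = self.egress.get((pkt["dst_ip"], pkt["src_port"]))
        return {**pkt, "src_port": new} if new is not None else pkt

    def on_ingress(self, pkt):
        """Restore the original port on responses before TCP demultiplexing."""
        orig = self.ingress.get((pkt["src_ip"], pkt["dst_port"]))
        return {**pkt, "dst_port": orig} if orig is not None else pkt
```

Because the original port is restored before the packet reaches the TCP stack, the connection's 5-tuple as seen by the application never changes.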
4 Case Studies
Predictability provides a rich interface for end hosts, and we show two uses of that predictability. The first is a fault localization service that showcases the flexibility of end hosts in Volur despite static load balancing and failure reaction in the network. The second is a load balancing mechanism that simultaneously demonstrates the power of our approach and shows how to emulate state-of-the-art dynamic network behavior predictably.
4.1 Volur-FL: Fault Localization
The goal of Volur-FL is to attribute packet drops to specific components. At a high level, we model the fault localization problem as an optimization problem [27, 32]. The intuition is that if many lossy paths from many vantage points across the data center converge at a single component, we can implicate the component as possibly faulty. We show how careful accounting and analysis can overcome any inaccuracies that may arise from concurrent network changes.
4.1.1 Collecting Drop Statistics
Volur-FL first collects drop statistics for each path. For TCP traffic (the majority of data center traffic), drop information is already readily available in the form of retransmission statistics. Volur-FL uses Linux eBPF (Extended Berkeley Packet Filters) to gather these statistics on a per-connection basis.
Specifically, we track two TCP variables: pktsSent, the number of packets sent, and pktsRetrans, the number of packets retransmitted. Hosts poll these statistics every 10 s. Note that these variables track control packets that are ACKed (e.g., SYN/FIN packets) in addition to data packets. It is important to track control packets since, for black holes, no data packets will be sent, only SYNs. These statistics are approximations of the ground truth, as in-flight packets, cumulative ACKs, and spurious retransmits can affect these numbers; however, our evaluations show that this approximation is effective in practice.
Non-TCP traffic is slightly more complex as not all protocols acknowledge packet receipt (e.g., UDP). For them to be used in fault localization, they must be extended with simple ACK packets or some other type of coordination to detect when traffic is dropped; the ACKs do not need to be used for any other purpose. The rest of this paper assumes TCP traffic.
End hosts attribute the drops to paths as described in the preceding section. To handle uncertainty during the mid-update period, they evenly attribute drops to all applicable predictions. For example, suppose that a single TCP connection has 100 packets. If there are two possible versions, we attribute 50 packets to each path. If there are four, we attribute 25.
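The even split can be written in a few lines. In this sketch the per-path statistics table and the tuple representation of a path are assumptions; the counters correspond to pktsSent and pktsRetrans.

```python
from collections import defaultdict


def new_stats():
    """Per-path counters; paths are keyed by a tuple of component names."""
    return defaultdict(lambda: {"sent": 0.0, "retrans": 0.0})


def attribute(stats, sent, retrans, candidate_paths):
    """Split one connection's counters evenly across its candidate paths.

    candidate_paths holds one predicted path per network-state version
    that may have applied while the packets were in flight.
    """
    share = 1.0 / len(candidate_paths)
    for path in candidate_paths:
        stats[path]["sent"] += sent * share
        stats[path]["retrans"] += retrans * share
```

With 100 packets and two candidate versions, each predicted path is credited with 50 packets, matching the example in the text.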
4.1.2 Implicating Components
Volur-FL uses path drop statistics to then implicate faulty components. This step can be viewed as a classic inference problem: given observed drops, we infer drop rate for each component and then flag them as faulty if the rate is high.
System model. We model the impact of failures with a directed bipartite graph (Fig. 4). The top partition consists of various network components. We focus on links, switches, and routing table entries (RTEs), which are among the most common network failure granularities. The bottom partition consists of paths and their drop statistics. A component has an edge to a path if the path contains the component.
In this model, each path $p$ is observed to have transmitted $n_p$ and dropped $d_p$ packets, and each component $c$ has an unknown loss rate $x_c$ we want to infer. We assume drops are independent for simplicity, so the number of drops predicted on path $p$ is $n_p \bigl(1 - \prod_{c \in p} (1 - x_c)\bigr)$. Our goal is to find loss rates $x_c$ for all components such that the sum of mis-predicted drops over all paths is minimized:

$$E(x) = \sum_{p} \Bigl| d_p - n_p \Bigl(1 - \prod_{c \in p} (1 - x_c)\Bigr) \Bigr|$$

This formulation results in an estimated loss rate for every component, $x_c$, rather than a binary up/down determination. If any loss rate is above an operator-specified threshold, it is flagged as a possible failure. Though our model is relatively simple, it can be extended to handle additional component types or failure patterns through the same inference framework.
Localization Algorithm. Finding component loss rates that minimize $E$ is a multivariate optimization problem. We solve it using ideas from coordinate descent. Specifically, we initialize all components with zero loss rate, and greedily find the component loss rate that minimizes $E$ in isolation. This process is iterative.
There are a few properties that our algorithm must satisfy in order to be practical. First, it must handle the fact that retransmits can be caused by drops on both the forward and reverse path, with no reliable way to differentiate between the two. This is complicated by the fact that cumulative ACKs mean that drops on the reverse path are less likely to cause retransmissions than drops on the forward path. Second, even if ACKs were accurate, drops may occur due to congestion and attribution can be inaccurate. Finally, the algorithm must be able to compute the component loss rates very efficiently.
To address the difference between the forward and reverse path, we consider the two halves separately. Note that the reverse path can be predicted from headers by swapping the source and destination addresses/ports. When calculating the optimal drop rate for a particular component, we conservatively consider only forward paths as congestion statistics on them are much more accurate. However, after we greedily choose the component that minimizes $E$, its drop rate can be used to explain drops of flows that cross it in either direction. Assuming sufficiently diverse traffic, all paths should be covered by some connection’s forward path.
The other challenges are handled by the following procedure:
1. Initialize $C$ to be the set of all components, and set the drop rate $x_c = 0$ for every $c \in C$.
2. For each $c$ in $C$, consider all forward paths it touches, $P_c$. Find the loss rate for the component, $\hat{x}_c$, that minimizes $E$ assuming all other $x_{c'}$ are unchanged:

$$\hat{x}_c = \arg\min_{x_c \in [0,1]} \sum_{p \in P_c} \Bigl| d_p - n_p \Bigl(1 - \prod_{c' \in p \cap C} (1 - x_{c'})\Bigr) \Bigr|$$

Note that computing this step is very efficient. Because $E$ is piecewise linear in $x_c$, we only need to check values of $x_c$ that make one of the terms inside the summation zero. Further, the function is convex, implying that a binary search can find the global minimum.
3. Given the candidate $\hat{x}_c$ for all $c \in C$, pick the component $c^\ast$ that minimizes $E$ and fix its drop rate $x_{c^\ast} = \hat{x}_{c^\ast}$.
4. Remove all drops explained by $c^\ast$ from the $d_p$ of the paths that cross it, and remove $c^\ast$ from $C$.
5. If some paths have unexplained drops above the threshold and the maximum number of iterations has not been reached, repeat from step 2.
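The greedy procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it works on residual (not yet explained) drops with a linear per-path model $|r_p - x \cdot n_p|$, considers forward paths only, and checks every breakpoint rather than using a binary search. The names `best_rate` and `localize`, the threshold, and the iteration limit are all assumptions.

```python
def best_rate(paths, residual, sent):
    """Minimize sum |residual[p] - x * sent[p]| over x in [0, 1].

    The objective is piecewise linear and convex, so its minimum lies at
    a breakpoint where one of the terms is zero (or at x = 0).
    """
    candidates = {0.0} | {min(1.0, residual[p] / sent[p])
                          for p in paths if sent[p] > 0}

    def cost(x):
        return sum(abs(residual[p] - x * sent[p]) for p in paths)

    x = min(sorted(candidates), key=cost)
    return x, cost(x)


def localize(forward_paths, sent, dropped, threshold=0.01, max_iter=10):
    """Return {component: estimated loss rate} for implicated components.

    forward_paths: path id -> set of components on the forward path
    sent, dropped: path id -> packet counts
    """
    residual = dict(dropped)
    components = set().union(*forward_paths.values())
    rates = {}
    for _ in range(max_iter):
        if all(residual[p] <= threshold * sent[p] for p in forward_paths):
            break  # every path is explained to within the threshold
        best = None
        for c in sorted(components):
            touched = [p for p, comps in forward_paths.items() if c in comps]
            x, cost = best_rate(touched, residual, sent)
            # Reduction in total error if c alone is assigned rate x.
            gain = sum(residual[p] for p in touched) - cost
            if best is None or gain > best[0]:
                best = (gain, c, x, touched)
        if best is None or best[0] <= 0:
            break  # no remaining component explains any drops
        _, c, x, touched = best
        rates[c] = x
        components.discard(c)
        for p in touched:  # remove the drops now explained by c
            residual[p] = max(0.0, residual[p] - x * sent[p])
    return rates
```

Selecting the component with the largest error reduction is equivalent to selecting the one that leaves the smallest total error, since the error on untouched paths is the same for every candidate.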
4.2 Volur-LB: Load Balancing
The second application we explore, Volur-LB, demonstrates the emulation of in-network load balancing (specifically, CONGA) in a Volur architecture. We note that dynamic load balancing is a well-studied area with many other proposals implemented at both the end host and in the network. The choice of protocols is therefore not an endorsement, but instead an opportunity to study the Volur-friendly emulation of a routing algorithm with “frequently changing inputs” as defined in Sec. 3.1.2.
At a high level, CONGA switches perform two functions. First, they tag passing packets with congestion metrics and feed that information back to the source ToR. Second, the source ToR waits for a sufficiently long inter-packet gap, rerouting each flowlet toward the least-congested path. In Volur-LB, we separate these two functions explicitly and offload the second (flowlet rerouting) to end hosts.
4.2.1 Collecting Congestion Metrics
Like CONGA, Volur-LB gathers congestion metrics via in-band feedback. As a packet travels from the source to the destination ToR, switches tag the packet with their current load if it is larger than previously tagged values (see [2] for details). The destination ToR then feeds these path-level congestion metrics back to the source ToR by piggybacking the information on normal traffic. For every feedback-carrying packet, the destination ToR sends a single path-level metric, choosing amongst them in round-robin fashion.
At the end of the above process, each source ToR knows the lowest-utilized path toward every destination (multiple paths in the case of ties). In addition, as none of these operations affects routing, they can all be done without losing predictability.
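As a minimal sketch of the metric collection described above: each switch overwrites the tag only when its own load is larger, so the tag that reaches the destination ToR is the maximum load along the path, and the source ToR then keeps the path or paths with the lowest such value. The function names and the use of plain floats for load are assumptions made for illustration.

```python
def tag_along_path(link_loads):
    """Tag carried by a packet after crossing switches with the given loads."""
    tag = 0.0
    for load in link_loads:
        if load > tag:      # a switch only overwrites a smaller tag
            tag = load
    return tag


def best_paths(path_metrics):
    """All path IDs tied for the lowest path-level congestion metric."""
    lowest = min(path_metrics.values())
    return sorted(p for p, m in path_metrics.items() if m == lowest)


# Hypothetical feedback for three paths toward one destination rack.
metrics = {
    0: tag_along_path([0.2, 0.7, 0.4]),
    1: tag_along_path([0.3, 0.1, 0.3]),
    2: tag_along_path([0.1, 0.3, 0.2]),
}
candidates = best_paths(metrics)
```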
Where we begin to differ from CONGA is with an extra step to transfer the congestion metrics to servers in the source rack. Volur-LB uses two mechanisms. First, the ToR switch uses incoming traffic to the rack to opportunistically piggyback the congestion metrics to its member servers. For every packet sent to a member server, the ToR switch tags it in its egress pipeline with a (destination rack, best path to the rack) tuple. The destination rack is chosen in a round-robin fashion, and if there are multiple best paths, a hash of the packet header is used to break ties. In theory, congestion metrics kept at servers would be less up-to-date than those maintained by their ToR switches. However, servers that communicate often with others have their congestion metrics refreshed in a timely manner by incoming ACKs or data packets. The second mechanism allows servers to query their ToR for the best path to a destination leaf using UDP packets. Servers send those requests to ToRs at connection setup, in parallel with their SYN packets. The on-demand query allows servers to steer to good default paths after a long silence.
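The first mechanism can be sketched as follows. This is an illustrative model rather than switch code: the class name, the use of CRC32 as the tie-breaking hash, and the dictionary a server uses to remember tags are all assumptions.

```python
import zlib
from itertools import cycle


class TorPiggyback:
    """Tags rack-bound packets with a (destination rack, best path) tuple."""

    def __init__(self, best_paths_by_rack):
        self.best = best_paths_by_rack              # rack -> tied best path IDs
        self._racks = cycle(sorted(best_paths_by_rack))

    def tag(self, header: bytes):
        rack = next(self._racks)                    # round-robin over racks
        ties = self.best[rack]
        path = ties[zlib.crc32(header) % len(ties)]  # header hash breaks ties
        return rack, path


tor = TorPiggyback({"r1": [3], "r2": [1, 4]})
server_view = {}                                    # a server's cached best paths
for i in range(4):
    rack, path = tor.tag(b"pkt%d" % i)
    server_view[rack] = path                        # refreshed by incoming traffic
```

A server that receives little traffic would see these entries go stale, which is the case the on-demand UDP query is meant to cover.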
4.2.2 Flowlet Steering
In parallel with congestion metric collection, end hosts in Volur-LB monitor inter-packet spacing to detect flowlets. For every new flowlet, the server steers the flowlet toward the destination’s last ‘best path’. Since end hosts know when and where flowlets are rerouted, predictability is maintained. Extension of Volur-FL to flowlets instead of flows is straightforward.
Our approach maintains the metrics and features of CONGA with minimal extra overhead (some additional header data on ToR-server packets). Pushing the decision to servers increases the latency of feedback and decreases the rate at which feedback arrives at the decision point, but per our evaluation in Sec. 5.3, the effects are negligible.
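A minimal sketch of end-host flowlet steering, assuming a per-flow record of the last send time and the current path. The class name, the timeout value, and the `best_path` table (standing in for the metrics received from the ToR) are hypothetical.

```python
class FlowletSteerer:
    def __init__(self, timeout_s, best_path):
        self.timeout_s = timeout_s
        self.best_path = best_path   # destination rack -> last known best path
        self.state = {}              # flow -> (time of last packet, path in use)

    def path_for(self, flow, dest_rack, now):
        last = self.state.get(flow)
        if last is None or now - last[0] > self.timeout_s:
            path = self.best_path[dest_rack]   # gap seen: new flowlet, re-steer
        else:
            path = last[1]                     # same flowlet: keep the path
        self.state[flow] = (now, path)
        return path


steerer = FlowletSteerer(timeout_s=0.0005, best_path={"r2": 1})
first = steerer.path_for("flowA", "r2", now=0.0)
steerer.best_path["r2"] = 4                    # a fresher metric arrives
same_flowlet = steerer.path_for("flowA", "r2", now=0.0001)
new_flowlet = steerer.path_for("flowA", "r2", now=0.001)
```

Because the host itself decides when a flowlet starts and which path it takes, it always knows the path of every packet it sends.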
5 Evaluation
We leverage a few evaluation platforms. To evaluate the feasibility of predictable networks, we implement one on top of a large data center at Facebook. To test the performance of failure localization in a more controllable environment, we use an 80-machine Cloudlab testbed. Finally, to test the relative performance of Volur-LB and CONGA, we simulate the necessary hardware changes in ns-3. We show that:
Some of today’s networks are already predictable without modifying hardware or nonparticipating end hosts.
Volur-FL is effective in locating a diverse set of failures and is robust to topology updates.
Volur-LB can closely approximate the performance of state-of-the-art in-network approaches.
5.1 Feasibility of Volur
We begin by demonstrating the feasibility of prediction using a prototype implementation of Volur on a large production data center at Facebook. The data center has upwards of one hundred thousand devices and hosts a variety of applications, from frontend web servers and caching to backend storage and data analytics. For the most part, servers are connected to the network with a single 10 Gbps link, while interconnect switches use 40 Gbps links.
All of the switches are based on chipsets from one of the biggest manufacturers of switch ASICs, but span multiple vendors. These switches support a diverse set of configurations for routing. Just for ECMP, the options included flexible field selection, hash seeds, pre- and post-processing steps, and many possible hash functions. Our predictor faithfully reproduced the path computation pipeline of these switches along with the effects of all of these configuration options. It gathered the options from switches in order to perform predictions.
Our prototype did not require any modifications to either switch configurations or OS configurations; the network, as configured, already approximated a predictable network. We also verified the feasibility of our system on top of a testbed of Cavium switches and ASICs, but we omit those results here due to space constraints.
5.1.1 Predicting Paths
To test the accuracy of our predictor, we ran UDP traceroutes between servers in the data center and compared the ground truth with the output of our path prediction engine. The ToR switch was already configured with an ECMP group. After replicating the relevant configuration options within our path deduction engine, we were able to reproduce 100% of the results recorded by the UDP traceroute experiment. We also built the predictor’s inverse for the purpose of efficiently generating headers for a target path.
Overhead of prediction. In addition to verifying that our prediction engine can accurately predict paths, we also tested the efficiency of the engine when trying to find a header for a particular network path. For this experiment, we chose a fixed source and destination server in the data center. There were hundreds of potential paths between the two machines through a topology similar to the one described in [4]. As our inverse function is only able to reverse a single switch’s routing function at a time, generating a header for a specific multi-hop path involved a multi-step process. The first step is to use the inverse function to generate a valid header for one of the switches (with a preference for the switch with the largest ECMP group). We then use our predictor to check the generated header on the second switch. If the header maps to the correct routing choice and does not require a reserved port, we accept the header; otherwise, we start again.
Fig. 5 shows a CDF of the execution time of the above algorithm on a server with a 2.60 GHz Intel Xeon E5-2670. For the test, we picked a specific path and gave the generator control over the UDP source port of the header. We then tracked the time it took to find a target header for the given path. The algorithm was always able to find a valid header with a median execution time of 3.4 s.
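The generate-then-check loop can be sketched as below. The real predictor reproduces the switch ASIC's hash pipeline, which is not shown in this section, so `predict` here is an assumed stand-in (a seeded CRC32 of the flow tuple modulo the ECMP group size), and a linear scan over source ports stands in for the inverse function. All names and parameters are hypothetical.

```python
import socket
import struct
import zlib


def predict(seed, group_size, src_ip, dst_ip, sport, dport):
    """Stand-in ECMP predictor (assumed hash, not a real ASIC function)."""
    key = struct.pack("!I4s4sHH", seed, src_ip, dst_ip, sport, dport)
    return zlib.crc32(key) % group_size


def find_source_port(target, switches, src_ip, dst_ip, dport,
                     reserved=range(0, 1024)):
    """switches: (seed, group_size) per hop; target: wanted next hop per hop."""
    # Solve for the switch with the largest ECMP group first.
    first = max(range(len(switches)), key=lambda i: switches[i][1])
    seed, size = switches[first]
    for sport in range(1, 65536):
        if sport in reserved:
            continue                      # never hand out a reserved port
        if predict(seed, size, src_ip, dst_ip, sport, dport) != target[first]:
            continue                      # scan stands in for the inverse function
        # Check the candidate header against every switch on the path.
        if all(predict(s, n, src_ip, dst_ip, sport, dport) == target[i]
               for i, (s, n) in enumerate(switches)):
            return sport
    return None


src, dst = socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.1.9")
switches = [(1, 4), (2, 8)]
# Use the path taken by an arbitrary port as the target so a solution exists.
target = tuple(predict(s, n, src, dst, 40000, 443) for s, n in switches)
port = find_source_port(target, switches, src, dst, 443)
```

With a true inverse, the first check is replaced by directly enumerating headers that satisfy the first switch, which is what keeps the search cheap.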
5.1.2 Controlling Paths
To demonstrate path control, we implemented ‘ECMP-interpose’, an iptables-based user-space application, and deployed it to a small set of nodes in the aforementioned data center. ECMP-interpose automatically and transparently modifies connection parameters when a TCP timeout occurs. Modifications are as described in Sec. 3.3.
More concretely, ECMP-interpose installs rules into iptables that match on relevant incoming and outgoing TCP traffic and relay packets via the NFQUEUE target to a user-level packet queue. For each connection, we install rules matching the TCP source port into the INPUT and OUTPUT chains in the filter table on both end hosts as described previously. iptables allows rules for several connections to be consolidated via range and set matches for performance. After modifying a packet, ECMP-interpose computes the new TCP checksum and relays the packet back to the kernel.
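The port rewrite and checksum step might look like the following sketch. It is not the ECMP-interpose code: it assumes IPv4, operates on a raw TCP segment held in memory, and recomputes the standard Internet checksum over the pseudo-header and segment from scratch. The function names are hypothetical.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by TCP."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole 16-bit word
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_source_port(segment: bytes, src_ip: str, dst_ip: str,
                        new_port: int) -> bytes:
    seg = bytearray(segment)
    struct.pack_into("!H", seg, 0, new_port)  # source port is bytes 0-1
    struct.pack_into("!H", seg, 16, 0)        # zero the checksum field
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(seg)))   # protocol 6 = TCP
    struct.pack_into("!H", seg, 16, internet_checksum(pseudo + bytes(seg)))
    return bytes(seg)


# Hypothetical 20-byte TCP header (ports 40000 -> 80, ACK flag) plus payload.
segment = struct.pack("!HHIIBBHHH", 40000, 80, 1, 0, 5 << 4, 0x10,
                      65535, 0, 0) + b"hi"
rewritten = rewrite_source_port(segment, "10.0.0.1", "10.0.0.2", 40001)
```

A receiver validates the result by summing the pseudo-header and the segment, checksum included, and expecting the complemented sum to be zero.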
Effect of rerouting. We evaluate our prototype with a simple rerouting experiment. We run a constant-rate TCP connection from a source to a destination in a different cluster in the same data center. At 2 seconds, we fail the connection. When the sender gets a timeout (via tcp_retransmit_timer() in the Linux networking stack), ECMP-interpose automatically switches to an alternate path. While the timeout took 500 ms, ECMP-interpose’s switchover was nearly instantaneous; a more aggressive failure detection method would have minimized the interruption of connectivity. We conclude that we can successfully, transparently, and selectively change the ECMP routes of live connections by interposing on these connections and modifying their port numbers. Route changes are instant and stable.
5.2 Volur-FL Evaluation
We evaluate Volur-FL by asking several questions:
Can we localize different types of failures and how sensitive is our approach to the failure’s drop rate?
How much does the aggregation period length matter?
What about multiple, potentially heterogeneous faults?
How much does a stale view of topology affect results?
Testbed. We answer these questions using an 80-machine Cloudlab [34] testbed. The machines were interconnected via a 10 Gbps network. Each physical machine emulates either a server or a predictable software switch. We use GRE tunnels [20] to implement an overlay Clos network. The resulting topology has 12 racks with 4 servers each. The racks’ 12 ToRs are split into 3 clusters with 4 aggregation switches in each. Each aggregation switch connects to two core switches, for a total of 8 core switches. We use Linux tc to limit link bandwidth to 1 Gbps, and emulate RED queues with an ECN marking threshold of 30 KB. We configure Linux to use DCTCP.
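As a quick consistency check, the machine count follows from the topology figures given above. The variable names are illustrative.

```python
racks, servers_per_rack = 12, 4
clusters, aggs_per_cluster, cores = 3, 4, 8

servers = racks * servers_per_rack        # 48 emulated servers
tors = racks                              # one ToR per rack
aggs = clusters * aggs_per_cluster        # 12 aggregation switches
machines = servers + tors + aggs + cores  # one physical machine per node

agg_core_links = aggs * 2                 # each aggregation switch has 2 core uplinks
links_per_core = agg_core_links // cores  # evenly spread over the 8 cores
```

The total matches the 80 machines of the testbed, with each core switch terminating three aggregation uplinks if they are spread evenly.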
To collect drop statistics, we use Linux eBPF [8] with bcc [13]. Unless stated otherwise, drop statistics were polled every 10 seconds. We use recall and precision, averaged over 50 runs, as the metrics for fault localization. Recall is the percentage of faults that have been predicted, and precision is the percentage of predictions that are correct.
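The two metrics can be computed from the set of predicted components and the set of injected faults, as in this short sketch. Returning 1.0 when a set is empty is an assumed convention, and the component names are made up.

```python
def precision_recall(predicted, actual):
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    # Precision: share of predictions that are real faults.
    precision = hits / len(predicted) if predicted else 1.0
    # Recall: share of real faults that were predicted.
    recall = hits / len(actual) if actual else 1.0
    return precision, recall


p, r = precision_recall(predicted={"link-3", "switch-7"},
                        actual={"link-3", "link-9"})
```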
Workload. We generate traffic according to a realistic workload based on empirically observed traffic patterns in deployed data centers. The web search workload is heavy-tailed: a small fraction of flows contribute most of the traffic. Flows arrive according to a Poisson process and are spread evenly across server pairs. We inject an offered load of of total host access link bandwidth.
Failures. We injected failures into the network at random times while running our failure localization application in the background. The set of failures we tested was drawn from those emphasized by recent literature [29, 45, 42], and they cover the range of failure behaviors listed in Sec. 2.1. In particular, several types of components can fail silently in our testbed: links, switches, and individual routing table entries. Failures can either be fail-stop or stochastic with some drop rate.
5.2.1 Localization of a Single Failure
We first evaluate our localization precision and recall for a single failure. We inject failures at either a link, switch, or routing table entry, at a random location. We tested various drop rates ranging from 1% to 100%. The mean time from failure to end host notification was seconds (much of this was due to our use of a 10 s aggregation period).
Fig. 7 shows that, for most cases, our algorithm has perfect precision and recall. This is because greedy is optimal when there is only a single instance of failure.
5.2.2 Effect of Aggregation Period Length
Lower-rate failures can be detected by increasing the aggregation period. This represents a tradeoff. A longer aggregation period can filter out transient noise from the data, making our predictions more accurate, and also decrease the overhead of collection. This, however, increases failure detection latency.
Up until now, we have been using a 10-second aggregation period. In this experiment, we test how long the aggregation period needs to be to detect a link failure with a 0.1% loss rate. In principle, any persistent failure with a loss rate greater than the steady-state congestion loss rate of the network can be located with a long enough aggregation period.
In Fig. 8, we show the precision and recall for windows ranging from 20 s to 60 s. As we aggregate over longer periods, detection becomes more accurate, reaching 90% precision and recall for a 0.1% loss rate with a 60-second aggregation period.
5.2.3 Multiple, Simultaneous Failures
Volur-FL also extends to multiple simultaneous, possibly heterogeneous, failures. We injected a random mix of failures and looked at the precision and recall of our algorithm. The failures are randomly chosen: they can be link failures, switch failures, or routing table corruptions. Their drop rates are sampled uniformly between 1% and 100%.
Fig. 9 shows the average precision/recall for up to 10 simultaneous failures. Across the experiments, our system maintains a precision above 0.95 and a recall above 0.85, even when failure count is high. As the failure count increases, recall decreases. This is expected as our algorithm is greedy and assumes that a few larger failures are more likely than many smaller failures.
5.2.4 Impact of Stale State
Part of Volur’s design is that switches can make routing changes on-the-fly, as long as those changes are infrequent. In this subsection, we evaluate the impact of stale state on failure localization. More specifically, we try to locate a single 10% drop rate failure in the presence of a topology-changing switch reboot.
We first configure a random aggregation or core switch to silently drop 10% of packets. Later, we reboot a random aggregation switch, which is properly detected/disseminated via the Volur service.
We evaluated two different switch failover policies, each with and without state dissemination. The first policy remaps all flows using a simple modulo function. The result is that most flows change paths after a failure. The second, resilient hashing (‘resHash’), uses a simple, predictable function that limits the number of flows that need to change paths after a single failure.
Fig. 10 shows that without resilient hashing or topology dissemination, precision and recall fall to around 0.5, with successes limited to cases where the failures are in separate subtrees and therefore most traffic is predicted correctly. With resilient hashing, both numbers rise to 0.84, as resilient hashing avoids remapping every flow. Thus, a large number of path predictions are still correct even with stale network state. For both failover strategies, adding topology dissemination brings precision and recall back above 0.95.
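The difference between the two failover policies can be illustrated with a small sketch. It assumes a four-member ECMP group, a 12-entry bucket table for resilient hashing, and consecutive integers standing in for flow hash values so that the counts are reproducible. None of these parameters come from the evaluation.

```python
def modulo_next_hop(flow_hash, live):
    """Modulo policy: index the list of live members by hash."""
    return live[flow_hash % len(live)]


class ResilientGroup:
    """Fixed-size bucket table; a failure rewrites only the dead member's buckets."""

    def __init__(self, members, buckets=12):
        self.table = [members[i % len(members)] for i in range(buckets)]

    def fail(self, dead, live):
        k = 0
        for i, member in enumerate(self.table):
            if member == dead:
                self.table[i] = live[k % len(live)]
                k += 1

    def next_hop(self, flow_hash):
        return self.table[flow_hash % len(self.table)]


members, live = ["a", "b", "c", "d"], ["a", "b", "c"]   # member "d" fails
hashes = range(1200)

moved_modulo = sum(modulo_next_hop(h, members) != modulo_next_hop(h, live)
                   for h in hashes)

group = ResilientGroup(members)
before = [group.next_hop(h) for h in hashes]
group.fail("d", live)
after = [group.next_hop(h) for h in hashes]
moved_resilient = sum(b != a for b, a in zip(before, after))
```

In this toy setup the modulo policy moves three quarters of the flows, while resilient hashing moves only the quarter that was on the failed member, which is why stale predictions stay mostly correct under the second policy.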
5.3 Volur-LB Evaluation
We evaluate the performance of Volur-LB with a 12-switch, 72-host ns-3 simulation. We show that Volur-LB achieves an average flow completion time (FCT) within 1.05x of CONGA at low to moderate load, and within 1.1x at very high load for both symmetric and asymmetric topologies.
Architecture. We used a 6-leaf, 6-spine topology with 10 Gbps links and a 2:1 leaf oversubscription ratio. In the symmetric topology, all links have 10 Gbps capacity. In the asymmetric topology, each leaf has 2 of its 6 uplinks, picked at random, reduced to half capacity. In the worst case, 4 of the 6 paths between a pair of leaves have only 5 Gbps capacity, so the degree of asymmetry is high.
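The topology figures can be checked with a few lines of arithmetic. This sketch assumes the 72 hosts are spread evenly over the leaves and that there is one leaf-to-leaf path per spine. The variable names are illustrative.

```python
leaves, spines, hosts = 6, 6, 72
link_gbps = 10

hosts_per_leaf = hosts // leaves                    # 12 hosts per leaf
downlink_gbps = hosts_per_leaf * link_gbps          # host-facing capacity
uplink_gbps = spines * link_gbps                    # spine-facing capacity
oversubscription = downlink_gbps / uplink_gbps

degraded = 2                                        # half-capacity uplinks per leaf
asym_uplink_gbps = ((spines - degraded) * link_gbps
                    + degraded * link_gbps / 2)

# A path is slow if either leaf's link to its spine is degraded; the worst
# case is when the two leaves degrade links to disjoint spines.
worst_case_slow_paths = min(spines, 2 * degraded)
```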
End hosts use DCTCP and queues use RED with ECN marking, with a threshold of 65 MTU and 700 KB (467 MTU) buffers.
Workload. We generated flows according to the enterprise workload in [2, 40], with arrival rates chosen to match different offered traffic loads. Traffic was generated using a simple client-server program at each host. All traffic went through the spine to stress the load balancing properties of the fabric. Each client established 6 persistent TCP connections with every server.
We use flow completion time (FCT) as the evaluation metric, averaged over 5 runs. We compare:
ECMP: Our baseline is ECMP, in which each switch makes local, uniform-random load balancing decisions.
CONGA: We use the default parameters: , s, and flowlet timeout of s. We validated our implementation with the testbed results from .
Volur-LB: Finally, we implement Volur-LB as described in Sec. 4.2. Where applicable, we use the same configuration parameters as our CONGA implementation.
6 Related Work
Over the years, the research community has pointed out deficiencies in data center network routing, both in load balancing and fault tolerance. Broadly speaking, proposed solutions fall into one of three categories:
Network control. The network is a straightforward place to address these deficiencies. For load balancing, systems like CONGA [2] and DRILL [12] add functionality to switches so that they can react to traffic bursts at very short time scales. For failure localization, a similar trend has been to augment in-network monitoring [7, 33, 10, 15, 45, 28]. Although these approaches are powerful, when the mechanisms themselves fail, end hosts are left with no recourse.
End host control. Another class of prior work attempts to address deficiencies at the end host. A few of these propose to either work around the network’s opacity or move bits of routing functionality to the end hosts. In particular, LetFlow [40] and CLOVE [24] both make a case for end host load balancing, as do several other approaches [22, 16, 43]. The same is true of failure detection [1, 14]. Our work is complementary to these systems, as we seek to support future routing innovation so that new proposals are not hamstrung by ECMP’s interface.
More extreme are proposals like XPATH [19], which argue for source routing in the data center. These give full route control and visibility to the end hosts, but they come at the cost of essential features like fast in-network failover.
Network and end host coordination. The idea of an intelligent network assisting intelligent hosts has been explored in other areas. For example, ECN-based transport [3, 11] provides end hosts with information about the utilization of the network.
In a similar vein, other proposals have sought to increase visibility into the network by tagging packets with their path as they pass switches [36, 39]. The caveat with this approach is that locating failures with tags requires successful delivery; if none of the target packets make it through to the destination, the route will remain hidden.
In comparison, Volur provides end hosts with a complete and up-to-date view of routing in the network, greatly expanding the options for and efficacy of end host functions.
7 Conclusion
This paper presents an architecture that facilitates the co-existence of route control both in the network and at end hosts. Our results show that the architecture is both feasible and flexible. Using it, we demonstrate a failure handling mechanism that is both accurate and responsive. We also demonstrate an end host load balancing mechanism that emulates state-of-the-art in-network approaches predictably. Finally, we verify the feasibility of our approach by building a test deployment on a large, otherwise unmodified production data center network.
References
[1] A. Adams, P. Lapukhov, and H. Zeng. NetNORAD: Troubleshooting networks via end-to-end probing, 2016. https://code.facebook.com/posts/1534350660228025/netnorad-troubleshooting-networks-via-end-to-end-probing/.
[2] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V. T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. CONGA: Distributed congestion-aware load balancing for datacenters. In SIGCOMM, 2014.
[3] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In SIGCOMM, 2010.
[4] A. Andreyev. Introducing data center fabric, the next-generation Facebook data center network. https://code.facebook.com, Nov. 2014.
[5] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, and D. Walker. P4: Programming protocol-independent packet processors. SIGCOMM CCR, 44(3):87–95, July 2014.
[6] P. Bratach and P. Lumbis. Equal Cost Multipath Load Sharing - Hardware ECMP, 2017. https://docs.cumulusnetworks.com/display/DOCS/Equal+Cost+Multipath+Load+Sharing+-+Hardware+ECMP.
[7] J. Case, R. Mundy, D. Partain, and B. Stewart. Introduction and applicability statements for internet standard management framework, 2002. https://tools.ietf.org/html/rfc3410.
[8] J. Corbet. Extending extended BPF, 2014. https://lwn.net/Articles/603983/.
[9] M. Davies. Traffic distribution techniques utilizing initial and scrambled hash values, Oct. 26 2010. US Patent 7,821,925.
[10] C. Estan, K. Keys, D. Moore, and G. Varghese. Building a better NetFlow. In SIGCOMM, 2004.
[11] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Trans. Netw., 1:397–413, 1993.
[12] S. Ghorbani, Z. Yang, B. Godfrey, Y. Ganjali, and A. Firoozshahian. DRILL: Micro load balancing for low-latency data center networks. In SIGCOMM, 2017.
[13] B. Gregg. BCC: Dynamic tracing tools for Linux, 2017. https://iovisor.github.io/bcc/.
[14] C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. Maltz, Z. Liu, V. Wang, B. Pang, H. Chen, Z.-W. Lin, and V. Kurien. Pingmesh: A large-scale system for data center network latency measurement and analysis. In SIGCOMM, 2015.
[15] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown. I know what your packet did last hop: Using packet histories to troubleshoot networks. In NSDI, 2014.
[16] K. He, E. Rozner, K. Agarwal, W. Felter, J. Carter, and A. Akella. Presto: Edge-based load balancing for fast datacenter networks. In SIGCOMM, 2015.
[17] C.-Y. Hong, S. Kandula, R. Mahajan, M. Zhang, V. Gill, M. Nanduri, and R. Wattenhofer. Achieving high utilization with software-driven WAN. In SIGCOMM, 2013.
[18] C. Hopps. Analysis of an equal-cost multi-path algorithm. RFC 2992 (Informational), 2000.
[19] S. Hu, K. Chen, H. Wu, W. Bai, C. Lan, H. Wang, H. Zhao, and C. Guo. Explicit path control in commodity data centers: Design and applications. In NSDI, 2015.
[20] B. Hubert. GRE and other tunnels, 2017. http://lartc.org/howto/lartc.tunnel.gre.html.
[21] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, and A. Vahdat. B4: Experience with a globally-deployed software defined WAN. In SIGCOMM, 2013.
[22] A. Kabbani, B. Vamanan, J. Hasan, and F. Duchene. FlowBender: Flow-level adaptive routing for improved latency and throughput in datacenter networks. In CoNEXT, 2014.
[23] M. Kalkunte. High speed trunking in a network device, Mar. 16 2010. US Patent 7,680,107.
[24] N. Katta, M. Hira, A. Ghag, C. Kim, I. Keslassy, and J. Rexford. CLOVE: How I learned to stop worrying about the core and love the edge. In HotNets. ACM, 2016.
[25] D. Katz and D. Ward. Bidirectional forwarding detection (BFD), 2010. https://tools.ietf.org/html/rfc5880.
[26] A. Lê-Quôc. Learning from AWS’ gray failures. https://www.datadoghq.com/blog/gray-aws-failures/, October 2013.
[27] M. Steinder and A. S. Sethi. A survey of fault localization techniques in computer networks. Science of Computer Programming, 53(2):165–194, 2004.
[28] Y. Li, R. Miao, C. Kim, and M. Yu. FlowRadar: A better NetFlow for data centers. In NSDI, 2016.
[29] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A fault-tolerant engineered network. In NSDI, 2013.
[30] B. Matthews, B. Kwan, and P. Agarwal. Dynamic load balancing, Jan. 15 2013. US Patent 8,355,328.
[31] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. In SIGCOMM, 2008.
[32] R. N. Mysore, R. Mahajan, A. Vahdat, and G. Varghese. Gestalt: Fast, unified fault localization for networked systems. In USENIX ATC, pages 255–267, Philadelphia, PA, June 2014. USENIX Association.
[33] P. Phaal, S. Panchen, and N. McKee. InMon corporation’s sFlow: A method for monitoring traffic in switched and routed networks. RFC 3176 (Informational), 2001.
[34] R. Ricci, E. Eide, and The CloudLab Team. Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications. USENIX ;login:, 39(6), Dec. 2014.
[35] E. Rosen, A. Viswanathan, and R. Callon. Multiprotocol label switching architecture. RFC 3031, 2001.
[36] A. Roy, J. Bagga, H. Zeng, and A. C. Snoeren. Passive realtime datacenter fault detection and localization. In NSDI, 2017.
[37] J. H. Saltzer, D. P. Reed, and D. D. Clark. End-to-end arguments in system design. ACM Trans. Comput. Syst., 2:277–288, 1981.
[38] S. Sen, D. Shue, S. Ihm, and M. J. Freedman. Scalable, optimal flow routing in datacenters via local link balancing. In CoNEXT, 2013.
[39] P. Tammana, R. Agarwal, and M. Lee. Simplifying datacenter network debugging with PathDump. In OSDI, 2016.
[40] E. Vanini, R. Pan, M. Alizadeh, T. Edsall, and P. Taheri. Let it flow: Resilient asymmetric load balancing with flowlet switching. In NSDI, 2017.
[41] S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
[42] X. Wu, D. Turner, G. Chen, D. Maltz, X. Yang, L. Yuan, and M. Zhang. NetPilot: Automating datacenter network failure mitigation. In SIGCOMM, 2012.
[43] H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury. Resilient datacenter load balancing in the wild. In SIGCOMM, 2017.
[44] J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh, and A. Vahdat. WCMP: Weighted cost multipathing for improved fairness in data centers. In EuroSys, 2014.
[45] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz, L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng. Packet-level telemetry in large datacenter networks. In SIGCOMM, 2015.