007: Democratically Finding The Cause of Packet Drops

02/20/2018 · by Behnaz Arzani, et al.

Network failures continue to plague datacenter operators as their symptoms may not have direct correlation with where or why they occur. We introduce 007, a lightweight, always-on diagnosis application that can find problematic links and also pinpoint problems for each TCP connection. 007 is completely contained within the end host. During its two month deployment in a tier-1 datacenter, it detected every problem found by previously deployed monitoring tools while also finding the sources of other problems previously undetected.


1 Introduction

007 has an ambitious goal: for every packet drop on a TCP flow in a datacenter, find the link that dropped the packet and do so with negligible overhead and no changes to the network infrastructure.

This goal may sound like overkill—after all, TCP is supposed to be able to deal with a few packet losses. Moreover, packet losses might occur due to congestion instead of network equipment failures. Even network failures might be transient. Above all, there is a danger of drowning in a sea of data without generating any actionable intelligence.

These objections are valid, but so is the need to diagnose “failures” that can result in severe problems for applications. For example, in our datacenters, VM images are stored in a storage service. When a VM boots, the image is mounted over the network. Even a small network outage or a few lossy links can cause the VM to “panic” and reboot. In fact, 17% of our VM reboots are due to network issues, and in many of these none of our monitoring tools were able to find the links that caused the problem.

VM reboots affect customers and we need to understand their root cause. Any persistent pattern in such transient failures is a cause for concern and is potentially actionable. One example is silent packet drops [pingmesh]. These types of problems are nearly impossible to detect with traditional monitoring tools (e.g., SNMP). If a switch is experiencing these problems, we may want to reboot or replace it. These interventions are “costly” as they affect a large number of flows/VMs. Therefore, careful blame assignment is necessary. Naturally, this is only one example that would benefit from such a detection system.

There is a lot of prior work on network failure diagnosis, though none of the existing systems meets our ambitious goal. Pingmesh [pingmesh] sends periodic probes to detect failures and can leave “gaps” in coverage, as it must manage the overhead of probing. Also, since it uses out-of-band probes, it cannot detect failures that affect only in-band data. Roy et al. [alex] monitor all paths to detect failures but require modifications to routers and special features in the switch (§10). Everflow [everflow] can be used to find the location of packet drops, but it would require capturing all traffic and is not scalable. We asked our operators what would be the most useful solution for them. Responses included: “In a network of links it’s a reasonable assumption that there is a non-zero chance that a number of these links are bad (due to device, port, or cable issues, etc.) and we cannot fix them simultaneously. Therefore, fixes need to be prioritized based on customer impact. However, currently we do not have a direct way to correlate customer impact with bad links”. This shows that current systems do not satisfy operator needs, as they do not provide application- and connection-level context.

To address these limitations, we propose 007, a simple, lightweight, always-on monitoring tool. 007 records the path of TCP connections (flows) suffering from one or more retransmissions and assigns proportional “blame” to each link on the path. It then provides a ranking of links that represents their relative drop rates. Using this ranking, it can find the most likely cause of drops in each TCP flow.

007 has several noteworthy properties. First, it does not require any changes to the existing networking infrastructure. Second, it does not require changes to the client software—the monitoring agent is an independent entity that sits on the side. Third, it detects in-band failures. Fourth, it continues to perform well in the presence of noise (e.g., lone packet drops). Finally, its overhead is negligible.

While the high-level design of 007 appears simple, the practical challenges of making 007 work and the theoretical challenge of proving it works are non-trivial. For example, its path discovery is based on a traceroute-like approach. Due to the use of ECMP, traceroute packets have to be carefully crafted to ensure that they follow the same path as the TCP flow. Also, we must ensure that we do not overwhelm routers by sending too many traceroutes (traceroute responses are handled by control-plane CPUs of routers, which are quite puny). Thus, we need to ensure that our sampling strikes the right balance between accuracy and the overhead on the switches. On the theoretical side, we are able to show that 007’s simple blame assignment scheme is highly accurate even in the presence of noise.

We make the following contributions: (i) we design 007, a simple, lightweight, and yet accurate fault localization system for datacenter networks; (ii) we prove that 007 is accurate without imposing excessive burden on the switches; (iii) we prove that its blame assignment scheme correctly finds the failed links with high probability; and (iv) we show how to tackle numerous practical challenges involved in deploying 007 in a real datacenter.

Our results from a two month deployment of 007 in a datacenter show that it finds all problems found by other previously deployed monitoring tools while also finding the sources of problems for which information is not provided by these monitoring tools.

2 Motivation

007 aims to identify the cause of retransmissions with high probability. It is driven by two practical requirements: (i) it should scale to datacenter-size networks and (ii) it should be deployable in a running datacenter with as little change to the infrastructure as possible. Our current focus is mainly on analyzing infrastructure traffic, especially connections to services such as storage, as failures there can have severe consequences (see §1 and [Netpoirot]). Nevertheless, the same mechanisms can be used in other contexts as well (see §9). We deliberately include congestion-induced retransmissions: if episodes of congestion, however short-lived, are common on a link, we want to be able to flag them. Of course, in practice, any such system needs to deal with a certain amount of noise, a concept we formalize later.

There are a number of ways to find the cause of packet drops. One can monitor switch counters, but these are inherently unreliable [netpilot] and monitoring thousands of switches at a fine time granularity is not scalable. One can use new hardware capabilities to gather more useful information [mohammad], but correlating this data with each retransmission reliably is difficult; furthermore, time is needed until such hardware is production-ready and switches are upgraded, and operators may be unwilling to incur the expense and overhead of such changes [Netpoirot]. One can use PingMesh [pingmesh] to send probe packets and monitor link status, but such systems suffer from a probing-rate trade-off: sending too many probes creates unacceptable overhead, whereas reducing the probing rate leaves temporal and spatial gaps in coverage. More importantly, the probe traffic does not capture what the end user and TCP flows see. Instead, we choose to use the data traffic itself as probe traffic, which has the advantage of introducing little to no monitoring overhead.

As one might expect, almost all traffic in our datacenters is TCP traffic. One way to monitor TCP traffic is to use a system like Everflow. Everflow inserts a special tag in every packet and has the switches mirror tagged packets to special collection servers. Thus, if a tagged packet is dropped, we can determine the link on which it happened. Unfortunately, there is no way to know in advance which packet is going to be dropped, so we would have to tag and mirror every TCP packet. This is clearly infeasible. We could tag only a fraction of packets, but doing so would result in another sampling rate trade-off. Hence, we choose to rely on some form of network tomography [tomo1, tomo2, tomo3]. We can take advantage of the fact that TCP is a connection-oriented, reliable delivery protocol so that any packet loss results in retransmissions that are easy to detect.

If we knew the path of all flows, we could set up an optimization to find which link dropped the packets. Such an optimization would minimize the number of “blamed” links while simultaneously explaining the cause of all drops. Indeed, past approaches such as MAX COVERAGE and Tomo [netdiagnoser, risk] aim to approximate the solution of such an optimization (see Appendix B for an example). There are problems with this approach: (i) the optimization is NP-hard [bertsimas], and solving it at datacenter scale is infeasible; (ii) tracking the path of every flow in the datacenter is not scalable in our setting. We could use alternative solutions such as Everflow or the approach of [alex] to track the path of SYN packets, but both rely on making changes to the switches. The only way to find the path of a flow without any special infrastructure support is to employ something like a traceroute. Traceroute relies on getting ICMP TTL-exceeded messages back from the switches. These messages are generated by the control plane, i.e., the switch CPU. To avoid overloading the CPU, our administrators have capped the rate of ICMP responses to 100 per second. This severely limits the number of flows we can track.

Figure 1: Observations from a production network: (a) CDF of the number of flows with at least one retransmission; (b) CDF of the fraction of drops belonging to each flow in each second interval.

Given these limitations, what can we do? We analyzed the drop patterns in two of our datacenters and found: typically when there are packet drops, multiple flows experience drops. We show this in Figure 1a for TCP flows in production datacenters. The figure shows the number of flows experiencing drops in the datacenter conditioned on the total number of packets dropped in that datacenter in second intervals. The data spans one day. We see that the more packets are dropped in the datacenter, the more flows experience drops and of the time, at least flows see drops when we condition on total drops. We focus on the case because lower values mostly capture noisy drops due to one-off packet drops by healthy links. In most cases drops are distributed across flows and no single flow sees more than of the total packet drops. This is shown in Figure 1b (we have discarded all flows with drops and cases where the total number of drops was less than 10). We see that in of cases, no single flow captures more than of all drops.

Based on these observations and the high path diversity in datacenter networks [f10], we show that if we (a) only track the path of those flows that have retransmissions, (b) assign each link on the path of such a flow a vote of 1/h, where h is the path length, and (c) sum up the votes during a given period, then the top-voted links are almost always the ones dropping packets (see §5)! Unlike the optimization, our scheme provides a ranking of the links in terms of their drop rates: if one link has a higher vote than another, it is also dropping more packets (with high probability). This gives us a heat map of our network that highlights the links with the most impact on a given application/customer (because we know which links affect a particular flow).
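To make the tallying step concrete, the vote accumulated by a link over an epoch can be written as follows; the symbols $V_\ell$, $F_{\mathrm{retx}}$, and $h_f$ are introduced here purely for illustration and do not appear in the original notation.

\[
V_\ell \;=\; \sum_{\substack{f \,\in\, F_{\mathrm{retx}} \\ \ell \,\in\, \mathrm{path}(f)}} \frac{1}{h_f},
\]

where $F_{\mathrm{retx}}$ is the set of flows that saw at least one retransmission during the epoch and $h_f$ is the number of links on flow $f$'s path. Ranking links by $V_\ell$ at the end of each epoch yields the heat map described above.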

3 Design Overview

Figure 2: Overview of 007 architecture

Figure 2 shows the overall architecture of 007. It is deployed alongside other applications on each end-host as a user-level process running in the host OS. 007 consists of three agents responsible for TCP monitoring, path discovery, and analysis.

The TCP monitoring agent detects retransmissions at each end-host. The Event Tracing for Windows (ETW) [etw] framework (similar functionality exists in Linux) notifies the agent as soon as an active flow suffers a retransmission.

Upon a retransmission, the monitoring agent triggers the path discovery agent (§4) which identifies the flow’s path to the destination IP (DIP).

At the end-hosts, a voting scheme (§5) is used based on the paths of flows that had retransmissions. At regular intervals (epochs), the votes are tallied by a centralized analysis agent to find the top-voted links. Although votes are aggregated per epoch, failures do not have to last an entire epoch to be detected.

007’s implementation consists of 6000 lines of C++ code. Its memory usage remains small on all of our production hosts, its CPU utilization is minimal, and its bandwidth utilization due to traceroute is negligible. 007 is proven to be accurate (§5) in typical datacenter conditions (a full description of the assumed conditions can be found in §9).

4 The Path Discovery Agent

The path discovery agent uses traceroute packets to find the path of flows that suffer retransmissions. These packets are used solely to identify the path of a flow. They do not need to be dropped for 007 to operate. We first ensure that the number of traceroutes sent by the agent does not overload our switches (§4.1). Then, we briefly describe the key engineering issues and how we solve them (§4.2).

4.1 ICMP Rate Limiting

Generating ICMP packets in response to traceroute consumes switch CPU, which is a valuable resource. In our network, there is a cap of 100 per second on the number of ICMP messages a switch can send (§2). To ensure that the traceroute load does not exceed this cap, we start by noticing that only a small fraction of flows go through tier-3 switches. Indeed, after monitoring all TCP flows in our network for one hour, we found that only a small fraction went through a tier-3 switch. Thus we can ignore tier-3 switches in our analysis. Given that our network is a Clos topology and assuming that hosts under a top of the rack switch (ToR) communicate with hosts under a different ToR uniformly at random (see §6 for when this is not the case):

Theorem 1.

The rate of ICMP packets sent by any switch due to traceroute stays below the switch ICMP cap if the rate at which hosts send traceroutes is upper bounded as

(1)

where the quantities appearing in (1) are the numbers of ToR, tier-1, and tier-2 switches, the number of pods, and the number of hosts under each ToR.

See Appendix C for the proof. The value of this upper bound in our datacenters gives each host a per-second traceroute budget: as long as a host does not have more flows with retransmissions per second than this budget, we can guarantee that the traceroutes sent by 007 will not push any switch above its ICMP cap. We use this budget as a threshold to limit the traceroute rate of each host. Note that there are two independent rate limits, one set at the host by 007 and the other set by the network operators on the switch. Additionally, the agent triggers path discovery for a given connection no more than once every epoch to further limit the number of traceroutes. We will show in §5 that this number is sufficient to ensure high accuracy.
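The two host-side limits just described (a per-host traceroute budget per unit time and at most one path discovery per connection per epoch) amount to a small amount of bookkeeping in the agent. The sketch below is illustrative only: the budget value, epoch length, and flow-key representation are placeholder assumptions, not the values or types used in 007's implementation.

```cpp
#include <chrono>
#include <string>
#include <unordered_set>

// Sketch of host-side traceroute limiting: (1) a per-epoch cap on the number
// of traceroutes a host may launch, and (2) at most one path discovery per
// connection (five-tuple) per epoch. Budget and epoch length are placeholders.
class TracerouteLimiter {
 public:
  TracerouteLimiter(int per_epoch_budget, std::chrono::seconds epoch)
      : budget_(per_epoch_budget), epoch_(epoch),
        epoch_start_(std::chrono::steady_clock::now()) {}

  // Returns true if a traceroute may be launched for this flow right now.
  bool Allow(const std::string& five_tuple_key) {
    RollEpochIfNeeded();
    if (sent_this_epoch_ >= budget_) return false;    // host budget exhausted
    if (!traced_flows_.insert(five_tuple_key).second)
      return false;                                   // already traced this epoch
    ++sent_this_epoch_;
    return true;
  }

 private:
  void RollEpochIfNeeded() {
    const auto now = std::chrono::steady_clock::now();
    if (now - epoch_start_ >= epoch_) {
      epoch_start_ = now;
      sent_this_epoch_ = 0;
      traced_flows_.clear();  // the same flow may be traced again next epoch
    }
  }

  int budget_;
  std::chrono::seconds epoch_;
  std::chrono::steady_clock::time_point epoch_start_;
  int sent_this_epoch_ = 0;
  std::unordered_set<std::string> traced_flows_;
};
```

The switch-side ICMP cap set by the operators remains a separate, independent safeguard; the class above only enforces the host-side share of the budget.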

4.2 Engineering Challenges

Using the correct five-tuple. As in most datacenters, our network also uses ECMP. All packets of a given flow, defined by the five-tuple, follow the same path [ECMP]. Thus, traceroute packets must have the same five-tuple as the flow we want to trace. To ensure this, we must account for load balancers.

TCP connections are initiated in our datacenter in a way similar to that described in [ananta]. The connection is first established to a virtual IP (VIP), and the SYN packet (containing the VIP as destination) goes to a software load balancer (SLB), which assigns that flow to a physical destination IP (DIP) and a service port associated with that VIP. The SLB then sends a configuration message to the virtual switch (vSwitch) in the hypervisor of the source machine that registers that DIP with that vSwitch. All subsequent packets in that flow carry the DIP as their destination and do not go through the SLB. For the path of the traceroute packets to match that of the data packets, their headers must therefore contain the DIP and not the VIP. Thus, before tracing the path of a flow, the path discovery agent first queries the SLB for the VIP-to-DIP mapping of that flow. An alternative is to query the vSwitch, but in instances where the failure also terminates the connection, the mapping may already have been removed from the vSwitch table; it is therefore more reliable to query the SLB. Note that there are cases where the TCP connection establishment itself may fail due to packet loss. Path discovery is not triggered for such connections. It is also not triggered when the query to the SLB fails, to avoid tracerouting the internet.

Re-routing and packet drops. Traceroute itself may fail. This may happen if the link drop rate is high or due to a blackhole. This actually helps us, as it directly pinpoints the faulty link and our analysis engine (§5) is able to use such partial traceroutes.

A more insidious possibility is that routing may change by the time traceroute starts. We use BGP in our datacenter and a lossy link may cause one or more BGP sessions to fail, triggering rerouting. Then, the traceroute packets may take a different path than the original connection. However, RTTs in a datacenter are typically less than 1 or 2 ms, so TCP retransmits a dropped packet quickly. The ETW framework notifies the monitoring agent immediately, which invokes the path discovery agent. The only additional delay is the time required to query the SLB to obtain the VIP-to-DIP mapping, which is typically less than a millisecond. Thus, as long as paths are stable for a few milliseconds after a packet drop, the traceroute packets will follow the same path as the flow and the probability of error is low. Past work has shown this to be usually the case [ffc].

Our network also makes use of link aggregation (LAG) [linkagg]. However, unless all the links in the aggregation group fail, the L3 path is not affected.

Router aliasing [aliasing]. This problem is easily solved in a datacenter, as we know the topology, names, and IPs of all routers and interfaces. We can simply map the IPs from the traceroutes to the switch names.

To summarize, 007’s path discovery implementation is as follows. Once the TCP monitoring agent notifies the path discovery agent that a flow has suffered a retransmission, the path discovery agent checks its cache of discovered paths for that epoch and, if need be, queries the SLB for the DIP. It then sends a series of appropriately crafted TCP packets with increasing TTL values. In order to disambiguate the responses, the TTL value is also encoded in the IP ID field [ip]. This allows for concurrent traceroutes to multiple destinations. The TCP packets deliberately carry a bad checksum so that they do not interfere with the ongoing connection.
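As a rough illustration of the probe construction just described, the sketch below builds one TCP probe per TTL value on a Linux raw socket: it copies the flow's five-tuple, writes the TTL into the IP ID field so ICMP replies can be matched to hops, and leaves a deliberately invalid TCP checksum so the probes cannot be confused with real segments. This is a simplified Linux-only sketch requiring CAP_NET_RAW; the production agent is a Windows service, and the maximum TTL and header values here are assumptions rather than 007's actual parameters.

```cpp
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// Send crafted TCP probes that reuse a flow's five-tuple so ECMP hashes them
// onto the same path as the monitored connection. The TTL is also encoded in
// the IP ID field so that ICMP "TTL exceeded" replies (received elsewhere)
// can be matched to the hop that generated them. The TCP checksum is left
// intentionally wrong so the probes do not interfere with the connection.
bool SendPathProbes(const char* src_ip, const char* dst_ip,
                    uint16_t src_port, uint16_t dst_port, int max_ttl) {
  int sock = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);  // IP_HDRINCL is implied
  if (sock < 0) { perror("socket"); return false; }

  for (int ttl = 1; ttl <= max_ttl; ++ttl) {
    alignas(4) unsigned char packet[sizeof(iphdr) + sizeof(tcphdr)] = {0};
    auto* ip = reinterpret_cast<iphdr*>(packet);
    auto* tcp = reinterpret_cast<tcphdr*>(packet + sizeof(iphdr));

    ip->version = 4;
    ip->ihl = 5;
    ip->tot_len = htons(sizeof(packet));
    ip->id = htons(static_cast<uint16_t>(ttl));  // TTL encoded for matching
    ip->ttl = static_cast<uint8_t>(ttl);
    ip->protocol = IPPROTO_TCP;
    ip->check = 0;  // kernel fills in the IP header checksum
    inet_pton(AF_INET, src_ip, &ip->saddr);
    inet_pton(AF_INET, dst_ip, &ip->daddr);

    tcp->source = htons(src_port);  // same five-tuple as the monitored flow
    tcp->dest = htons(dst_port);
    tcp->seq = htonl(0);
    tcp->doff = 5;
    tcp->ack = 1;
    tcp->window = htons(1);
    tcp->check = htons(0xBAD);      // deliberately invalid TCP checksum

    sockaddr_in dst = {};
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, dst_ip, &dst.sin_addr);
    if (sendto(sock, packet, sizeof(packet), 0,
               reinterpret_cast<sockaddr*>(&dst), sizeof(dst)) < 0) {
      perror("sendto");
    }
  }
  close(sock);
  return true;
}
```

A real agent would additionally listen for the ICMP TTL-exceeded responses and map the responding interface IPs back to switch names using the known topology, as described above.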

5 The Analysis Agent

Here, we describe 007’s analysis agent focusing on its voting-based scheme. We also present alternative NP-hard optimization solutions for comparison.

5.1 Voting-Based Scheme

007’s analysis agent uses a simple voting scheme. If a flow sees a retransmission, 007 votes its links as bad. Each vote has a value that is tallied at the end of every epoch, providing a natural ranking of the links. We set the value of good votes to 0 (if a flow has no retransmission, no traceroute is needed). Bad votes are assigned a value of 1/h, where h is the number of hops on the path, since each link on the path is equally likely to be responsible for the drop.
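A minimal sketch of this tallying step follows; representing links as strings and paths as vectors of link names is an illustrative choice, not 007's internal data model.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Tally 007-style votes: every flow that saw at least one retransmission
// contributes 1/h to each of the h links on its traceroute-discovered path.
// Flows without retransmissions contribute nothing (their vote value is 0).
std::unordered_map<std::string, double> TallyVotes(
    const std::vector<std::vector<std::string>>& paths_of_flows_with_retx) {
  std::unordered_map<std::string, double> votes;
  for (const auto& path : paths_of_flows_with_retx) {
    if (path.empty()) continue;  // e.g., path discovery failed entirely
    const double vote = 1.0 / static_cast<double>(path.size());
    for (const auto& link : path) votes[link] += vote;
  }
  return votes;  // links can now be ranked by their tally for the epoch
}
```

Sorting the resulting map by value gives the per-epoch ranking used throughout the rest of this section.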

The ranking obtained after compiling the votes allows us to identify the most likely cause of drops on each flow: links ranked higher have higher drop rates (Theorem 2). To further guard against high levels of noise, we can use our knowledge of the topology to adjust the links’ votes. Namely, we iteratively pick the most-voted link and estimate the portion of votes obtained by all other links due to failures on that link. This estimate is obtained for each other link by (i) assuming that all flows having retransmissions and going through the most-voted link had drops due to it and (ii) finding what fraction of these flows go through the other link by assuming ECMP distributes flows uniformly at random. Our evaluations showed that this adjustment reduces false positives.

1:   L ← set of all links
2:   P ← set of all possible paths
3:   v(l) ← number of votes for link l
4:   B ← set of most problematic links
5:   B ← ∅
6:   while max over l in L of v(l) ≥ threshold do
7:       l* ← argmax over l in L of v(l)
8:       B ← B ∪ {l*}
9:       for l in L \ {l*} do
10:          if l lies on the path of a flow that traverses l* then
11:              Adjust the score of l
12:          end if
13:      end for
14:  end while
15:  return B
Algorithm 1 Finding the most problematic links in the network.

007 can also be used to detect failed links using Algorithm 1. The algorithm sorts the links based on their votes and uses a threshold to determine if there are problematic links. If so, it adjusts the votes of all other links and repeats until no link has votes above the threshold. In Algorithm 1, the threshold is a fixed fraction of the total votes cast, chosen via a parameter sweep in which we found that it provides a reasonable trade-off between precision and recall. Higher values reduce false positives but increase false negatives.
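A simplified rendering of Algorithm 1 is sketched below. Two caveats: the score adjustment here simply removes the flows explained by the most-voted link and re-tallies, which is a coarser stand-in for the ECMP-based vote adjustment described earlier, and threshold_fraction is a placeholder parameter rather than the value used in the deployment.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified Algorithm 1: repeatedly take the most-voted link, declare it
// problematic, discount the votes its flows contributed to other links
// (here: by removing those flows and re-tallying), and stop once no link's
// tally exceeds a threshold fraction of the total votes cast.
std::vector<std::string> FindProblematicLinks(
    std::vector<std::vector<std::string>> paths_of_flows_with_retx,
    double threshold_fraction) {
  std::vector<std::string> blamed;
  while (true) {
    // Re-tally votes over the flows that are not yet explained.
    std::unordered_map<std::string, double> votes;
    double total_votes = 0.0;
    for (const auto& path : paths_of_flows_with_retx) {
      if (path.empty()) continue;
      const double v = 1.0 / static_cast<double>(path.size());
      for (const auto& link : path) votes[link] += v;
      total_votes += 1.0;  // each bad flow casts one unit of vote in total
    }
    if (votes.empty()) break;

    auto top = std::max_element(
        votes.begin(), votes.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if (top->second < threshold_fraction * total_votes) break;

    const std::string blamed_link = top->first;
    blamed.push_back(blamed_link);

    // Attribute to the blamed link all flows that traverse it, removing their
    // contribution to every other link's score before the next iteration.
    paths_of_flows_with_retx.erase(
        std::remove_if(paths_of_flows_with_retx.begin(),
                       paths_of_flows_with_retx.end(),
                       [&](const std::vector<std::string>& path) {
                         return std::find(path.begin(), path.end(),
                                          blamed_link) != path.end();
                       }),
        paths_of_flows_with_retx.end());
  }
  return blamed;
}
```

In a deployment, the input paths would come from the path discovery agent's traceroutes and the threshold from the parameter sweep mentioned above.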

Here we have focused on detecting link failures. 007 can also be used to detect switch failures in a similar fashion by applying votes to switches instead of links. This is beyond the scope of this work.

5.2 Voting Scheme Analysis

Can 007 deliver on its promise of finding the most probable cause of packet drops on each flow? This is not trivial. In its voting scheme, flows with retransmissions add votes to both good and bad links on their paths. Moreover, in a large datacenter such as ours, occasional, lone, and sporadic drops can and will happen on good links. These drops are akin to noise and can cause severe inaccuracies in any detection system [gestalt], 007 included. We show that the likelihood of 007 making these errors is small. Given our topology (Clos):

Theorem 2.

For a sufficiently large number of flows, 007 will find, with high probability, the bad links, which drop packets at a higher rate, among the good links, which drop packets at a much lower (noise) rate, provided that the number of flows between hosts and the lower and upper bounds on the number of packets per connection satisfy the condition

(2)

The proof is deferred to the appendices due to space constraints. Theorem 2 states that under mild conditions, links with higher drop rates are ranked higher by 007. Since a single flow is unlikely to go through more than one failed link in a network with thousands of links, it allows 007 to find the most likely cause of packet drops on each flow.

A corollary of Theorem 2 is that in the absence of noise (no drops on good links), 007 can find all bad links with high probability. In the presence of noise, 007 can still identify the bad links as long as the probability of dropping packets on non-failed links is low enough (the signal-to-noise ratio is large enough). This requirement is compatible with values found in practice: taking the lower and upper bounds on packets per connection to be low and high percentiles of the number of packets sent by TCP flows across all hosts over a multi-hour period, the drop rate that 007 can tolerate on good links is higher than the drop rates typically observed in production datacenters [RAIL].

Another important consequence of Theorem 2 is that it establishes that the probability of errors in 007’s results diminishes exponentially fast, so that even with the limits imposed by Theorem 1 we can accurately identify the failed links. The conditions in Theorem 2 are sufficient but not necessary. In fact, §6 shows how well 007 performs even when the conditions in Theorem 2 do not hold.

5.3 Optimization-Based Solutions

One of the advantages of 007’s voting scheme is its simplicity. Given additional time and resources we may consider searching for the optimal sets of failed links by finding the most likely cause of drops given the available evidence. For instance, we can find the least number of links that explain all failures as we know the flows that had packet drops and their path. This can be written as an optimization problem we call the binary program. Explicitly,

\[
\min_{x \,\in\, \{0,1\}^{m}} \; \|x\|_0 \quad \text{subject to} \quad R\,x \succeq y \tag{3}
\]

where R is the n × m routing matrix whose (i, j) entry is 1 if flow i traverses link j and 0 otherwise; y is a vector that collects the status of each flow during an epoch (each element of y is 1 if the connection experienced at least one retransmission and 0 otherwise); m is the number of links; n is the number of connections in an epoch; and \|x\|_0 denotes the number of nonzero entries of the vector x. Indeed, if the solution of (3) is x*, then the l-th element of x* indicates whether the binary program estimates that link l failed.

Problem (3) is the NP-hard minimum set covering problem [combinatorial] and is intractable. Its solutions can be approximated greedily as in MAX COVERAGE or Tomo [netdiagnoser, risk] (see appendix). For benchmarking, we compare 007 to the true solution of (3) obtained by a mixed-integer linear program (MILP) solver [mosek]. Our evaluations showed that 007 (Algorithm 1) significantly outperforms this binary optimization, especially in the presence of noise. We illustrate this point in Figures 4 and 10, but otherwise omit results for this optimization in §6 for clarity.

The binary program (3) does not provide a ranking of links. We also consider a solution in which we determine the number of packets dropped by each link, thus creating a natural ranking. The integer program can be written as

\[
\min_{x \,\in\, \mathbb{N}^{m}} \; \|x\|_0 \quad \text{subject to} \quad R\,x = \bar{y} \tag{4}
\]

where \(\mathbb{N}\) is the set of natural numbers and \(\bar{y}\) is an n × 1 vector that collects the number of retransmissions suffered by each flow during an epoch. The solution of (4) represents the number of packets dropped by each link, which provides a ranking. The constraint \(R x = \bar{y}\) ensures each failure is explained only once. As with (3), this problem is NP-hard [bertsimas] and is only used as a benchmark. As it uses more information than the binary program (the number of failures), (4) performs better (see §6).

In the next three sections, we present our evaluation of 007 in simulations (§6), in a test cluster (§7), and in one of our production datacenters (§8).

6 Evaluations: Simulations

We start by evaluating 007 in simulations where we know the ground truth. 007 first finds flows whose drops were due to noise and marks them as “noise drops”. It then finds the link most likely responsible for drops on the remaining set of flows (“failure drops”). A noisy drop is defined as one where the corresponding link only dropped a single packet. In our simulations, 007 never incorrectly marked a connection as a noisy drop. We therefore focus on its accuracy for connections that it puts into the failure-drop class.

Performance metrics. Our measure of 007’s per-flow performance is accuracy, the proportion of correctly identified drop causes. For evaluating Algorithm 1, we use recall and precision. Recall is a measure of reliability: the fraction of actual failures that 007 detects (it captures false negatives). Precision is a measure of trustworthiness: the fraction of links flagged by 007 that actually failed (it captures false positives).

Simulation setup. We use a flow level simulator [sims] implemented in MATLAB. Our topology consists of links, pods, and  ToRs per pod. Each host establishes connections per second to a random ToR outside of its rack. The simulator has two types of links. For good links, packets are dropped at a very low rate chosen uniformly from to simulate noise. On the other hand, failed links have a higher drop rate to simulate failures. By default, drop rates on failed links are set to vary uniformly from % to %, though to study the impact of drop rates we do allow this rate to vary as an input parameter. The number of good and failed links is also tunable. Every seconds of simulation time, we send up to packets per flow and drop them based on the rates above as they traverse links along the path. The simulator records all flows with at least one drop and for each such flow, the link with the most drops.
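For reference, a stripped-down flow-level simulation in the spirit of the setup above might look like the sketch below. The toy path model, link count, drop rates, and flow counts are all placeholder assumptions and do not reproduce the MATLAB simulator's Clos topology or parameters.

```cpp
#include <algorithm>
#include <map>
#include <random>
#include <vector>

// Toy flow-level simulation: each flow traverses a few links, every packet is
// dropped independently per link, and for each flow with at least one drop we
// record the link that dropped the most packets (the per-flow ground truth
// against which 007's verdicts can be scored).
struct FlowOutcome {
  std::vector<int> path;  // link ids on the flow's path
  int drops = 0;          // total packets dropped for this flow
  int worst_link = -1;    // link responsible for the most drops
};

std::vector<FlowOutcome> SimulateEpoch(
    const std::vector<double>& link_drop_rate, int num_flows,
    int packets_per_flow, int hops_per_flow, std::mt19937& rng) {
  std::uniform_int_distribution<int> pick_link(
      0, static_cast<int>(link_drop_rate.size()) - 1);
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  std::vector<FlowOutcome> flows_with_drops;

  for (int f = 0; f < num_flows; ++f) {
    FlowOutcome flow;
    for (int h = 0; h < hops_per_flow; ++h) flow.path.push_back(pick_link(rng));
    std::map<int, int> drops_per_link;
    for (int p = 0; p < packets_per_flow; ++p) {
      for (int link : flow.path) {
        if (coin(rng) < link_drop_rate[link]) {  // packet dropped at this hop
          ++flow.drops;
          ++drops_per_link[link];
          break;                                 // the packet goes no further
        }
      }
    }
    if (flow.drops > 0) {
      flow.worst_link =
          std::max_element(drops_per_link.begin(), drops_per_link.end(),
                           [](const auto& a, const auto& b) {
                             return a.second < b.second;
                           })
              ->first;
      flows_with_drops.push_back(flow);
    }
  }
  return flows_with_drops;
}
```

The paths of the returned flows are exactly the input needed by the tallying and detection sketches in §5, and the recorded worst_link provides the ground truth for computing per-flow accuracy.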

We compare 007 against the solutions described in §5.3. We only show results for the binary program (3) in Figures 4 and 10 since its performance is typically inferior to 007 and the integer program (4) due to noise. This also applies to MAX COVERAGE or Tomo  [netdiagnoser, risk, rachit] as they are approximations of the binary program (see [tr]).

6.1 In The Optimal Case

The bounds of Theorem 2 are sufficient (not necessary) conditions for accuracy. We first validate that 007 achieves high accuracy, as expected, when these bounds hold, by setting the drop rates on the failed links to values for which the bounds are satisfied. We refer the reader to [alex] for why these drop rates are reasonable.

Accuracy. Figure 3 shows that 007 achieves high average accuracy in almost all cases. Due to its robustness to noise, it also outperforms the optimization algorithm (§5.3) in most cases.

Recall & precision. Figure 4 shows that even when failed links have low packet drop rates, 007 detects them with high recall/precision.

We proceed to evaluate 007’s accuracy when the bounds in Theorem 2 do not hold. This shows these conditions are not necessary for good performance.

Figure 3: When Theorem 2 holds.
Figure 4: Algorithm 1 when Theorem 2 holds.

6.2 Varying Drop Rates

Our next experiment aims to push the boundaries of Theorem 2 by varying the failed links’ drop rates below the conservative bounds of Theorem 2.

Single Failure. Figure 5a shows results for different drop rates on a single failed link. It shows that 007 can find the cause of drops on each flow with high accuracy. Even as the drop rate decreases below the bounds of Theorem 2, we see that 007 can maintain accuracy on par with the optimization.

Multiple Failures. Figure 5b shows that 007 is successful at finding the link responsible for a drop even when links have very different drop rates. Prior work has reported the difficulty of detecting such cases [alex]. However, 007’s accuracy remains high.

Figure 5: 007’s accuracy for varying drop rates.

6.3 Impact of Noise

Single Failure. We vary noise levels by changing the drop rate of good links. We see that higher noise levels have little impact on 007’s ability to find the cause of drops on individual flows (Figure 6a).

Multiple Failures. We repeat this experiment for the case of multiple failed links. Figure 6b shows the results. 007 shows little sensitivity to the increase in noise when finding the cause of per-flow drops. Note that the large confidence intervals of the optimization are a result of its high sensitivity to noise.

Figure 6: 007’s accuracy for varying noise levels.

6.4 Varying Number of Connections

In previous experiments, hosts opened a fixed number of connections per epoch. Here, we allow each host to choose the number of connections it creates per epoch uniformly at random. Recall from Theorem 2 that a larger number of connections from each host helps 007 improve its accuracy.

Single Failure. Figure 7a shows the results. 007 accurately finds the cause of packet drops on each connection. It also outperforms the optimization when the failed link has a low drop rate. This is because the optimization has multiple optimal points and is not sufficiently constrained.

Multiple Failures. Figure 7b shows the results for multiple failures. The optimization suffers from a lack of information to constrain the set of results and therefore has large variance (wide confidence intervals). 007, on the other hand, maintains a high probability of detection no matter the number of failures.

Figure 7: Varying the number of connections.

6.5 Impact of Traffic Skews

Single Failure. We next demonstrate 007’s ability to detect the cause of drops even under heavily skewed traffic. We pick a small fraction of the ToRs at random and skew the traffic so that 80% of the flows have destinations set to hosts under these ToRs. The remaining flows are routed to randomly chosen hosts. Figure 8a shows that the optimization is much more heavily impacted by the skew than 007, which continues to detect the cause of drops with high probability for all but the lowest drop rates.

Figure 8: 007’s accuracy under skewed traffic.

Multiple Failures. We repeated the above for multiple failures. Figure 8b shows that the optimization’s accuracy suffers: it consistently shows a low detection rate, as its constraints are not sufficient to guide the optimizer to the right solution. 007 maintains a high detection rate at all times.

Hot ToR. A special instance of traffic skew occurs in the presence of a single hot ToR which acts as a sink for a large number of flows. Figure 9 shows how 007 performs in these situations. 007 can tolerate substantial skew toward the hot ToR with negligible accuracy degradation. However, extreme skews negatively impact its accuracy in the presence of a large number of failures. Such scenarios are unlikely, as datacenter load balancing mitigates such extreme situations.

Figure 9: Impact of a hot ToR on 007’s accuracy.

6.6 Detecting Bad Links

In our previous experiments, we focused on 007’s accuracy on a per connection basis. In our next experiment, we evaluate its ability to detect bad links.

Single Failure. Figure 10 shows the results. 007 outperforms the optimization as it does not require a fully specified set of equations to provide a best guess as to which links failed. We also evaluate the impact of failure location on our results (Figure 11).

Figure 10: Algorithm 1 with single failure.

Multiple Failures. We heavily skew the drop rates on the failed links: at least one failed link has a much higher drop rate than all the others. This scenario is one that past approaches have reported as hard to detect [alex]. Figure 12 shows that 007 can detect multiple such failures with high accuracy. Its recall drops as the number of failed links increases. This is because the increase in the number of failures drives up the votes of all other links, increasing the cutoff threshold and thus the likelihood of false negatives. In fact, had the top-voted links simply been selected, 007’s recall would have remained high.

Figure 11: Impact of link location on Algorithm 1.
Figure 12: Algorithm 1 with multiple failures. The drop rates on the links are heavily skewed.

6.7 Effects of Network Size

Finally, we evaluate 007 in larger networks. Its average accuracy when finding a single failure remains high as the number of pods grows, and in all cases exceeds the average accuracy of the optimization. Algorithm 1 continues to have high recall up to the largest pod counts we evaluated, with a drop only at the largest size, and its precision remains unchanged across all pod sizes.

We also evaluate both 007’s and the optimization’s ability to find the cause of per-flow drops when the number of failed links is large. We observe that both approaches’ performance remains largely unchanged; for example, 007’s accuracy stays high even with many failed links.

7 Evaluations: Test Cluster

We next evaluate 007 in the more realistic environment of a test cluster with a small number of ToRs and links. We control a subset of the hosts in the cluster, while the others are production machines; the switches therefore see real production traffic. We recorded several hours of traffic from a host in production and replayed it from our hosts in the cluster (with different starting times). Using Everflow-like functionality [everflow] on the ToR switches, we induced different rates of drops on links to the ToR switches. Our goal is to find the cause of packet drops on each flow (§7.2) and to validate whether Algorithm 1 works in practice (§7.3).

7.1 Clean Testbed Validation

We first validate 007 in a clean testbed environment. We repave the cluster by setting all devices to a clean state and then run 007 without injecting any failures. We see that even in the newly repaved cluster, links arriving at a particular ToR switch had abnormally high votes on average. We thus suspected that this ToR was experiencing problems. After rebooting it, the total votes of those links dropped back to normal levels, validating our suspicion. This exercise also provides one example of when 007 is extremely effective at identifying links with low drop rates.

7.2 Per-connection Failure Analysis

Can 007 identify the cause of drops when links have very different drop rates? To find out, we induce two different drop rates on two different links for an hour. We only know the ground truth when a flow goes through at least one of the two failed links, so we only consider such flows. For most of these, 007 was able to attribute the packet drop to the correct link (the one with the higher drop rate).

Figure 13: Distribution of the difference between votes on bad links and the maximum vote on good links for different bad link drop rates.

7.3 Identifying Failed Links

We next validate Algorithm 1 and its ability to detect failed links. We inject different drop rates on a chosen link and determine whether there is a correlation between total votes and drop rates. Specifically, we look at the difference between the vote tally on the bad link and that of the most-voted good link. We induced three different packet drop rates, from low to high, on a single link to a ToR in the test cluster.

Figure 13 shows the distribution for the various drop rates. The failed link has the highest vote of all links at the two higher drop rates. When the drop rate is lowered to the smallest value, the failed link becomes harder to detect due to the smaller gap between its drop rate and that of the normal links. Indeed, the bad link then has the maximum score in only some of the instances (mostly due to occasional lone drops on healthy links). However, it is always among the links with the highest votes.

Figure 13 also shows the high correlation between the probability of packet drop on a link and its vote tally. This trivially shows that 007 is accurate in finding the cause of packet drops on each flow given a single link failure: the failed link has the highest votes among all links. We compare 007 with the optimization problem in (4). We find that the latter also returns the correct result every time, albeit at the cost of a large number of false positives. To illustrate this point: the number of links marked as bad by (4) is, on average, several times higher than the number given by 007 across all three drop rates.

What about multiple failures? This is a harder experiment to configure due to the smaller number of links in this test cluster and its lower path diversity. We induce two different drop rates on two links in the cluster. The link with the higher drop rate is almost always the most voted. The second link is usually ranked second and occasionally third, and it always remains among the three most-voted links. This shows that, by allowing a single false positive (identifying three links instead of two), 007 can detect all failed links even in a setup where the traffic distribution is highly skewed. This is something past approaches [alex] could not achieve. In this example, 007 identifies the true cause of packet drops on each connection in the vast majority of cases.

8 Evaluations: Production

We have deployed 007 in one of our datacenters (the monitoring agent itself has been deployed across all our datacenters for years). Notable examples of problems 007 found include: power supply undervoltages [EOS], FCS errors [monia], switch reconfigurations, continuous BGP state changes, link flaps, and software bugs [cisco]. Also, 007 found every problem that was caught by our previously deployed diagnosis tools.

8.1 Validating Theorem 1

Table 1 shows the distribution of the number of ICMP messages sent by each switch in each epoch over a week. The number of ICMP messages generated by 007 never exceeds the cap of Theorem 1.

Table 1: Number of ICMP messages per second per switch.

8.2 TCP Connection Diagnosis

In addition to finding problematic links, 007 identifies the most likely cause of drops on each flow. Knowing when each individual packet is dropped in production is hard, so we perform a semi-controlled experiment to test 007’s accuracy. Our environment consists of thousands of hosts/links. To establish the “ground truth”, we compare its results to those obtained by EverFlow. EverFlow captures all packets going through each switch on which it was enabled, but it is expensive to run for extended periods of time. We thus run EverFlow for only a limited number of hours and configure it to capture all outgoing IP traffic from a set of randomly chosen hosts. The captures for each host were conducted on different days. We filter all flows that were detected to have at least one retransmission during this time and, using EverFlow, find where their packets were dropped. We then check whether the detected link matches that found by 007. We found that 007 was accurate in every single case. In this test we also verified that each path recorded by 007 matches exactly the path taken by that flow’s packets as captured by EverFlow. This confirms that it is unlikely for paths to change fast enough to cause errors in 007’s path discovery.

8.3 VM Reboot Diagnosis

During our deployment, there were a number of VM reboots in the datacenter for which there was no explanation. 007 found a link as the cause of the problem in each case. Upon further investigation of the SNMP and system logs, we observed that in many of the cases there were transient drops on the host-to-ToR link, a number of which were correlated with high CPU usage on the host. Two were due to high drop rates on the ToR. In several others, the endpoints of the links found were undergoing configuration updates, and in the remaining instances the link was flapping.

Finally, we looked at our data for one cluster for one day. 007 identifies a small number of links as dropping packets per epoch on average. Of the links found to be dropping packets, some are server-to-ToR links (a number of which were due to a single ToR switch that was eventually taken out for repair), some are links between tier-1 switches and ToRs, and the rest were due to failures on links between higher tiers.

9 Discussion

007 is highly effective in finding the cause of packet drops on individual flows. By doing so, it provides flow-level context which is useful in finding the cause of problems for specific applications. In this section we discuss a number of other factors we considered in its design.

9.1 007’s Main Assumptions

The proofs of Theorems 1 and 2 and the design of the path discovery agent (§4) are based on a number of assumptions:

ACK loss on reverse path. It is possible that packet loss on the reverse path is so severe that loss of ACK packets triggers timeout at the sender. If this happens, the traceroute would not be going over any link that triggered the packet drop. Since TCP ACKs are cumulative, this is typically not a problem and 007 assumes retransmissions in such cases are unlikely. This is true unless loss rates are very high, in which case the severity of the problem is such that the cause is apparent. Spurious retransmissions triggered by timeouts may also occur if there is sudden increased delay on forward or reverse paths. This can happen due to rerouting, or large queue buildups. 007 treats these retransmissions like any other.

Source NATs. Source network address translators (SNATs) change the source IP of a packet before it is sent out to a VIP. Our current implementation of 007 assumes connections are SNAT bypassed. However, if flows are SNATed, the ICMP messages will not have the right source address for 007 to get the response to its traceroutes. This can be fixed by a query to the SLB. Details are omitted.

L2 networks. Traceroute is not a viable option for finding paths when datacenters operate using L2 routing. In such cases we recommend one of the following: (a) if access to the destination is not a problem and switches can be upgraded, one can use the path discovery methods of [pathdump, alex]; 007 is still useful as it allows for finding the cause of failures when multiple failures are present and for individual flows. (b) Alternatively, EverFlow can be used to find paths. 007’s sampling is necessary here as EverFlow does not scale to capture the path of all flows.

Network topology. The calculations in §5 assume a known topology (Clos). The same calculations can be carried out for any known topology by updating the values used for ECMP. The accuracy of 007 is tied to the degree of path diversity and to multiple paths being available at each hop: the higher the degree of path diversity, the better 007 performs. This is also a desired property of any datacenter topology, most of which follow the Clos design [googletopo, microsofttopo, f10].

ICMP rate limit. In rare instances, the severity of a failure or the number of flows impacted by it may be such that 007 hits its ICMP rate limit and stops sending traceroutes for the rest of that epoch. This does not impact the accuracy of Algorithm 1: by the time 007 reaches its rate limit, it has enough data to localize the problematic links. However, it does limit 007’s ability to find the cause of drops on flows for which it did not identify the path. We accept this trade-off in accuracy for the simplicity and lower overhead of 007.

Unpredictability of ECMP. If the topology and the ECMP functions on all the routers are known, the path of a packet can be found by inspecting its header. However, ECMP functions are typically proprietary and have initialization “seeds” that change with every reboot of the switch. Furthermore, ECMP functions change after link failures and recoveries. Tracking all link failures/recoveries in real time is not feasible at a datacenter scale.

9.2 Other Factors To Consider

007 has been designed for a specific use case, namely finding the cause of packet drops on individual connections in order to provide application context. This resulted in a number of design choices:

Detecting congestion. 007 should not avoid detecting major congestion events, as they signal severe traffic imbalance and/or incast and are actionable. However, the more prevalent forms of congestion have low drop rates [monia]. 007 treats these as noise and does not detect them; standard congestion control protocols can effectively react to them.

007’s ranking. 007’s ranking approach will naturally bias towards the detection of failed links that are frequently used. This is an intentional design choice, as the goal of 007 is to identify high-impact failures that affect many connections.

Finding the cause of other problems. 007’s goal is to identify the cause of every packet drop, but other problems may also be of interest. 007 can be extended to identify the cause of many such problems. For example, for latency, ETW provides TCP’s smoothed RTT estimates upon each received ACK. Thresholding on these values allows for identifying “failed” flows, and 007’s voting scheme can be used to provide a ranked list of suspects. Proving the accuracy of 007 for such problems requires an extension of the analysis presented in this paper.

VM traffic problems. 007’s goal is to find the cause of drops on infrastructure connections and through those, find the failed links in the network. In principle, we can build a 007-like system to diagnose TCP failures for connections established by customer VMs as well. For example, we can update the monitoring agent to capture VM TCP statistics through a VFP-like system [vfp]. However, such a system raises a number of new issues, chief among them being security. This is part of our future work.

In conclusion, we stress that the purpose of 007 is to explain the cause of drops when they occur. Many of these are not actionable and do not require operator intervention. The tally of votes on a given link provides a starting point for deciding when such intervention is needed.

10 Related Work

Finding the source of failures in distributed systems, specifically networks, is a mature topic. We outline some of the key differences of 007 with these works.

The most closely related work to ours is perhaps [alex], which requires modifications to routers and to both endpoints, a limitation that 007 does not have. Often, services (e.g., storage) are unwilling to incur the additional overhead of new monitoring software on their machines, and in many instances the two endpoints are in separate organizations [Netpoirot]. Moreover, in order to apply their approach to our datacenter, a number of engineering problems need to be overcome, including finding a substitute for their use of the DSCP bit, which is used for other purposes in our datacenter. Lastly, while the statistical testing methods used in [alex] (as well as others) are useful when the paths of both failed and non-failed flows are available, they cannot be used in our setting, as the limited number of traceroutes 007 can send prevents it from tracking the path of all flows. In addition, 007 allows for the diagnosis of individual connections and works well in the presence of multiple simultaneous failures, features that [alex] does not provide. Indeed, finding paths only when they are needed is one of the most attractive features of 007, as it minimizes its overhead on the system. Maximum cover algorithms [maxcover, pathdump] suffer from many of the same limitations described earlier for the binary optimization, since MAX COVERAGE and Tomo are approximations of (3). Other related work can be loosely categorized as follows:

Inference and Trace-Based Algorithms [sherlock, netmedic, everflow, pingmesh, rinc, tulip, alex, lossradar, flowradar, verification] use anomaly detection and trace-based algorithms to find sources of failures. They require knowledge/inference of the location of logical devices, e.g., load balancers, in the connection path. While this information is available to the network operators, it is not clear which instance of these entities a flow will go over. This reduces the accuracy of the results.

Everflow [everflow] aims to accurately identify the path of packets of interest. However, it does not scale to be used as an always-on diagnosis system, and it requires additional features to be enabled in the switch. Similarly, [rachit, pathdump] provide other means of path discovery, but such approaches require deploying new applications to the remote endpoints, which we want to avoid (for reasons described in [Netpoirot]). Also, they depend on SDN-enabled networks and are not applicable to our setting, where routing is based on BGP-enabled switches.

Some inference approaches aim at covering the full topology, e.g., [pingmesh]. While this is useful, they typically provide only a sampled view of connection liveness and do not achieve the type of always-on monitoring that 007 provides. The time between probes for [pingmesh], for example, is currently 5 minutes; failures that happen at finer time scales are likely to slip through the cracks of its monitoring probes.

Other such works, e.g., [tulip, alex, lossradar, flowradar], require access to both endpoints and/or switches. Such access may not always be possible. Finally, NetPoirot [Netpoirot] can only identify the general type of a problem (network, client, server) rather than the responsible device.

Network tomography [tomo1, tomo2, tomo3, tomo4, risk, shrink, tomo5, tomo6, tomo7, tomo8, duffield2, gestalt, geoff] typically consists of two aspects: (i) the gathering and filtering of network traffic data to be used for identifying the points of failure [tomo5, tomo1] and (ii) using the information found in the previous step to identify where/why failures occurred [tomo2, tomo3, tomo4, gestalt, netdiagnoser, duffield2, newtomo]. 007 utilizes ongoing traffic to detect problems, unlike these approaches, which require the much heavier-weight operation of gathering large volumes of data. Tomography-based approaches are also better suited for non-transient failures, while 007 can handle both transient and persistent errors. 007 also has coverage that extends to the entire network infrastructure, and does not limit coverage to only paths between designated monitors as some such approaches do. Work on analyzing failures [tomo1, tomo4, tomo5, gestalt] is complementary and can be applied to 007 to improve its accuracy.

Anomaly detection [anom1, anom3, assaf, anom4, anomalbert, crovella2014, anomallydetection] finds when a failure has occurred, using machine learning [assaf, anom1] and Fourier transforms [anomalbert]. 007 goes a step further by finding the device responsible.

Fault Localization by Consensus [netprofiler] assumes that a failure on a node common to the path used by a subset of clients will result in failures on a significant number of them. NetPoirot [Netpoirot] illustrates why this approach fails in the face of a subset of problems that are common to datacenters. While our work builds on this idea, it provides a confidence measure that identifies how reliable a diagnosis report is.

Fault Localization using TCP statistics [tcp1, tcp2, trat, snap, alex] uses TCP metrics for diagnosis. [tcp1] requires heavyweight active probing, while [tcp2] uses learning techniques. Both [tcp2] and T-Rat [trat] rely on continuous packet captures, which does not scale. SNAP [snap] identifies performance problems and their causes for connections by acquiring TCP information, gathered by querying socket options, and combining it with routing and topology data to compare the TCP statistics of flows that share the same host, link, ToR, or aggregator switch. Given their lack of continuous monitoring, all of these approaches fail to detect the type of problems 007 is designed to detect. Furthermore, the goal of 007 is more ambitious, namely to find the link that causes packet drops for each TCP connection.

Learning Based Approaches [dtree1, conext15, dtree2, Netpoirot] do failure detection in home and mobile networks. Our application domain is different.

Application diagnosis [app1, mogul] aims at identifying the cause of problems in a distributed application’s execution path. The limitations of diagnosing network-level paths and the complexities associated with this task are different. Obtaining all execution paths seen by an application is plausible in such systems but is not an option in ours.

Failure resilience in datacenters [f10, ankit, conga, mptcp, datacenterresilience, datacenter2, fattire, datacenter3, bodik, olaf] target resilience to failures in datacenters. 007 can be helpful to a number of these algorithms as it can find problematic areas which these tools can then help avoid.

Understanding datacenter failures [RAIL, gill2011] aims to find the various types of failures in datacenters. They are useful in understanding the types of problems that arise in practice and to ensure that our diagnosis engines are well equipped to find them. 007’s analysis agent uses the findings of [RAIL].

11 Conclusion

We introduced 007, an always-on and scalable monitoring/diagnosis system for datacenters. By monitoring the status of ongoing TCP flows, 007 can accurately identify links with even very low drop rates in datacenters with thousands of links.

12 Acknowledgements

This work was supported by grants NSF CNS-1513679 and DARPA/I2O HR0011-15-C-0098. The authors would like to thank T. Adams, D. Dhariwal, A. Aditya, M. Ghobadi, O. Alipourfard, A. Haeberlen, J. Cao, I. Menache, S. Saroiu, and our shepherd H. Madhyastha for their help.

References

Appendix A Application example: VM reboots

In the introduction (§1), we describe an instance in which the failure detection capabilities of 007 can be useful: pinpointing the cause of VM reboots. Indeed, in our datacenters, VM images are stored in a storage service. When a customer boots a VM, the image is mounted over the network. Thus, even a small network outage can cause the host kernel to “panic” and reboot the guest VM. We mentioned that a large fraction of the VM reboots caused by network issues in our datacenters cannot be explained using currently deployed monitoring systems. To further illustrate how important this issue can be, Figure 14 shows the number of unexplained VM reboots due to network problems in one day of operation: there was a steady stream of such reboots every hour of the day.

Figure 14: Number of network-related reboots in a day.

Appendix B Network tomography example

Knowing the path of all flows, it is possible to find with confidence which link dropped a packet. To do so, consider the example network in Figure 15. Suppose that the link between nodes 2 and 4 drops packets. Flows 1–2 and 3–2 suffer from drops, but flow 1–3 does not. A set cover optimization that minimizes the number of “blamed” links, such as the one used by MAX COVERAGE and Tomo [netdiagnoser, risk], will correctly find the cause of the drops. This is, however, a set covering optimization problem, which is known to be NP-complete [combinatorial].

Figure 15: Simple tomography example.
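To make the example concrete, the sketch below searches exhaustively for the smallest set of links that explains the failed flows of Figure 15. The node and link names are our reading of the figure, and the brute-force search is only meant for a toy network; the general problem is NP-complete, as noted above.

from itertools import chain, combinations

# Hypothetical paths for the three flows in the Figure 15 example: nodes 1, 2,
# and 3 communicate through node 4, and link 4-2 is the one dropping packets.
paths = {
    "1->2": ["1-4", "4-2"],
    "3->2": ["3-4", "4-2"],
    "1->3": ["1-4", "4-3"],
}
failed_flows = {"1->2", "3->2"}   # flows that experienced drops
links = set(chain.from_iterable(paths.values()))

def explains_all(candidate, failed):
    """True if every failed flow crosses at least one link in the candidate set."""
    return all(any(l in candidate for l in paths[f]) for f in failed)

# Brute-force minimum set cover: try all link subsets in order of size.
blamed = next(
    set(cand)
    for r in range(1, len(links) + 1)
    for cand in combinations(sorted(links), r)
    if explains_all(set(cand), failed_flows)
)
print(blamed)  # {'4-2'}: the only single link shared by both failed flows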

Appendix C Proofs

Number of pods
Number of top of the rack (ToR) switches per pod
Number of tier-1 switches per pod
Number of tier-2 switches
Level 1 link: link between a ToR and a tier-1 switch
Level 2 link: link between a tier-1 and a tier-2 switch
Set of all ToR switches in a pod
Set of all tier-1 switches in a pod
Set of all ToR switches
Set of all tier-1 switches
Set of all tier-2 switches
Number of failed links in the network
Upper bound on the number of packets per connection
Lower bound on the number of packets per connection
Probability that a good link drops a packet
Probability that a failed link drops a packet
Probability that a good link receives a vote
Probability that a bad link receives a vote
Probability that a good link causes a retransmission (drops at least one packet)
Probability that a bad link causes a retransmission (drops at least one packet)
Table 2: Notation and nomenclature
Definition 1 (Clos topology).

A Clos topology has  pods each with  top of the rack (ToR) switches under which lie  hosts. The ToR switches are connected to  tier-1 switches by a complete network ( links). Links between tier-0 and tier-1 switches are referred to as level 1 links. The tier-1 switches within each pod are connected to  tier-2 switches by another complete network ( links). Links between these switches are called level 2 links. This notation is illustrated in Figure 16.
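As a quick aid for the counting arguments that follow, the sketch below computes the number of level 1 and level 2 links implied by the complete bipartite wiring of Definition 1. The function and parameter names are ours; the symbols lost from the definition above are not reconstructed here.

# Back-of-the-envelope link counts for the Clos topology of Definition 1.
def clos_link_counts(n_pods: int, n_tor: int, n_t1: int, n_t2: int):
    """Return (level 1 links, level 2 links) under complete bipartite wiring."""
    level1 = n_pods * n_tor * n_t1   # each pod: complete ToR <-> tier-1 mesh
    level2 = n_pods * n_t1 * n_t2    # each tier-1 switch connects to every tier-2 switch
    return level1, level2

# Example: 4 pods, 48 ToRs and 8 tier-1 switches per pod, 16 tier-2 switches.
print(clos_link_counts(4, 48, 8, 16))  # (1536, 512)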

Remark 1 (Communication and failure model).

Assume that connections occur uniformly at random between hosts under different ToR switches. Since the number of hosts under each ToR switch is the same, this is equivalent to saying that connections occur uniformly at random directly between ToR switches. Also, assume that link failures and connection routing are independent and that links drop packets independently across links and across packets.

Remark 2 (Notation).

We use calligraphic letters to denote sets and boldface font to denote random variables. Also, we write  to mean the set of integers between  and .

C.1 Proof of Theorem 1

Proof.

Start by noticing that the number of hosts below each ToR switch is the same, so that we can consider that traceroutes are sent on flows uniformly at random between ToR switches at a rate . Moreover, note that routing probabilities are the same for links on the same level, so that the traceroute rate depends only on whether the link is on level 1 or level 2.

Since the probability of a switch routing a connection through any link is uniform, the traceroute rate of a level 1 link is given by

(5)

Similarly for a level 2 link:

(6)

where the second fraction represents the probability of a host connecting to another host outside its own pod, i.e., of going through a level 2 link. Since  links are connected to a tier-1 switch and  links are connected to a tier-2 switch, the rate of ICMP packets at any link is bounded by . Taking  yields (1). ∎
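The per-link rates in this proof can also be checked numerically. The following sketch is a rough Monte Carlo estimate of the share of traceroutes crossing the busiest level 1 and level 2 links under the uniform-routing model of Remark 1; the topology parameters, link naming, and routing details are our assumptions, not the paper’s measurement code.

import random
from collections import Counter

N_PODS, N_TOR, N_T1, N_T2 = 4, 12, 4, 8      # small illustrative topology
N_TRACEROUTES = 200_000

rng = random.Random(0)
TORS = [(p, t) for p in range(N_PODS) for t in range(N_TOR)]

def sample_path():
    """Links traversed by one ToR-to-ToR traceroute under uniform routing."""
    (ps, s), (pd, d) = rng.sample(TORS, 2)
    a = rng.randrange(N_T1)                       # tier-1 switch in source pod
    links = [("L1", ps, s, a)]
    if ps == pd:
        links.append(("L1", pd, d, a))            # connection stays inside the pod
    else:
        c, b = rng.randrange(N_T2), rng.randrange(N_T1)
        links += [("L2", ps, a, c), ("L2", pd, b, c), ("L1", pd, d, b)]
    return links

load = Counter()
for _ in range(N_TRACEROUTES):
    load.update(sample_path())

worst_l1 = max(v for k, v in load.items() if k[0] == "L1") / N_TRACEROUTES
worst_l2 = max(v for k, v in load.items() if k[0] == "L2") / N_TRACEROUTES
print(f"busiest level 1 link sees {worst_l1:.4f} of all traceroutes")
print(f"busiest level 2 link sees {worst_l2:.4f} of all traceroutes")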

C.2 Proof of Theorem 2

We prove the following more precise statement of Theorem 2.

Theorem 3.

In a Clos topology with  and , 007 will rank, with probability , the  bad links that drop packets with probability  above all good links that drop packets with probability , as long as

(7)

where  and  are lower and upper bounds, respectively, on the number of packets per connection,

(8)

and

(9)

with  and  being the probabilities of a good and a bad link receiving a vote, respectively,  being the total number of connections between hosts, and  denoting the Kullback-Leibler divergence between two Bernoulli distributions with probabilities of success  and .

Before proceeding, note that in the typical scenario in which  and , as in our datacenter, the condition on the number of pods from Theorem 3 reduces to .

Proof.

The proof proceeds as follows. First, we show that if a link has a higher probability of receiving a vote, then it receives more votes once a large enough number of connections () is established. We do so using large deviation theory [ldp], which shows that the probability that this does not happen decreases exponentially in .

Lemma 1.

If , 007 will rank bad links above good links with probability  for  as in (9).

With Lemma 1 in hand, we then need to relate the probabilities of a link receiving a vote to the link drop rates. This will allow us to derive the signal-to-noise ratio condition in (7). Note that the probability of a link receiving a vote is the probability that a flow goes through the link and that a retransmission occurs (i.e., some link in the flow’s path drops at least one packet). Hence, we relate these probabilities by exploiting the combinatorial structure of ECMP in the Clos topology.

Lemma 2.

In a Clos topology with  and , it holds that for  bad links

(10a)
(10b)

where  and  are the probabilities of a retransmission occurring due to a bad and a good link, respectively.

Before proving these lemmata, let us see how they imply Theorem 3. From (10) in Lemma 2, it holds that

(11)

for . Thus, in a Clos topology, if the probability of retransmission due to a bad link is large enough compared to a good link, i.e.,  for  as in (8), then we have that the probability of a bad link receiving a vote is larger than that of a good link.

Still, (11) gives a relation in terms of the probabilities of retransmission instead of the packet drop rates as in (7). To obtain (7), note that the probability of a retransmission during a connection with n packets due to a link that drops packets with probability p is 1 - (1 - p)^n. Since this expression is monotonically increasing in n, the retransmission probability of a bad link is bounded from below by using the lower bound on the number of packets per connection, while that of a good link is bounded from above by using the upper bound. Substituting these bounds into the condition in (8) yields (7). ∎
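For concreteness, the monotonicity step can be written out explicitly. The symbols p_g, p_b, l_min, and l_max below are our placeholders for the good and bad per-packet drop rates and the bounds on the number of packets per connection:

P_retx(p, l) = 1 - (1 - p)^l,

which is increasing in both p and l, so that

P_g <= 1 - (1 - p_g)^{l_max}   and   P_b >= 1 - (1 - p_b)^{l_min}.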

We now proceed with the proofs of Lemmata 1 and 2.

Proof of Lemma 1.

We start by noting that in a datacenter-sized Clos network, almost every connection has a hop count of . In our datacenter, this is the case for  of connections. Therefore, we can approximate link votes by assuming all bad votes have the same value. Thus, it suffices to determine how many votes each link has.

Since links cause retransmissions independently across connections (see Remark 1), the number of votes received by a bad link is a binomial random variable  with parameters , the total number of connections, and , the probability of a bad link receiving a vote. Similarly, let  be the number of votes on a good link, a binomial random variable with parameters  and . 007 will correctly rank the bad links if , i.e., when bad links receive more votes than good links. This event contains the event  for . Using the union bound  [probBook], the probability of 007 identifying the correct links is therefore bounded by

(12)

To proceed, note that the probabilities in (12) can be bounded using the large deviation principle [ldp]. Indeed, let  be a binomial random variable with parameters  and . For  it holds that

(13a)
(13b)

where D_KL(a ‖ b) is the Kullback-Leibler divergence between two Bernoulli distributions with probabilities of success a and b [infotheory]. Explicitly,

D_KL(a ‖ b) = a log(a / b) + (1 - a) log((1 - a) / (1 - b)).
Substituting the inequalities (13) into (12) yields (9). ∎
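The exponential decay promised by (13) is easy to evaluate numerically. The sketch below computes the Bernoulli Kullback-Leibler divergence and the resulting Chernoff-style tail bound for a hypothetical pair of vote probabilities; the numbers are illustrative and not taken from the paper.

import math

def kl_bernoulli(a: float, b: float) -> float:
    """Kullback-Leibler divergence D(Ber(a) || Ber(b))."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def chernoff_upper_tail(c: int, p: float, t: float) -> float:
    """Upper bound on P[Binomial(c, p) >= c * t] for a threshold t > p."""
    return math.exp(-c * kl_bernoulli(t, p))

p_good, p_bad = 0.01, 0.05          # hypothetical per-connection vote probabilities
threshold = (p_good + p_bad) / 2    # ranking threshold between the two
for c in (1_000, 10_000, 100_000):  # number of connections in an epoch
    print(c, chernoff_upper_tail(c, p_good, threshold))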

Figure 16: Illustration of notation for Clos topology used in the proof of Lemma 2
Proof of Lemma 2.

Before proceeding, let , , and  denote the set of ToR, tier-1, and tier-2 switches respectively (Figure 16). Also let  and , , denote the tier-0 and tier-1 switches in pod  respectively. Note that  and . Note that we use subscripts to denote the switch tier and superscripts to denote its pod. To clarify the derivations, we maintain this notation for indices. For instance, is the -th tier-0 switch from pod , i.e., , and  is the -th tier-2 switch. Note that tier-2 switches do not belong to specific pods. We write  to denote the level 1 link that connects  to  (as in Figure 16) and use  to refer to the probability of link  causing a retransmission. Note that is also a function of the number of packets in a connection, but we omit this dependence for clarity.

The bounds in (10) are obtained by decomposing the events that 007 votes for a level 1 or level 2 link into a union of simpler events. Before proceeding, note that each connection only goes through one link in each level and in each direction, so that events such as “going through a ToR to tier-1 link” are disjoint.

Starting with level 1, let  be the event that a connection goes through link , i.e., a link that connects a ToR to a tier-1 switch in any pod. This event happens with probability

(14a)

given that there are  level 1 links and that connections occur uniformly at random. The link  will get a vote if one of five things occurs: (i) it causes a retransmission; (ii) the connection stays within the pod and some other link causes a retransmission; (iii) the connection leaves the pod and a link between a tier-1 and a tier-2 switch causes the retransmission; (iv) the connection leaves the pod and a link between a tier-2 and a tier-1 switch causes the retransmission; or (v) the connection leaves the pod and a link between a tier-1 and a ToR switch in the other pod causes the retransmission. Formally, the link  receives a vote if a connection goes through it (event ) and one of the following occurs:

  • event : causes a retransmission, i.e.,

    (14b)
  • event : the connection also goes through some , , and  causes a retransmission. Therefore,

    (14c)
  • event : the connection also goes through some and  causes a retransmission, which occurs with probability

    (14d)
  • event : the connection also goes through some , , and  causes a retransmission, so that

    (14e)
  • event : the connection also goes through some , , and  causes a retransmission. Thus,

    (14f)

Similarly for level 2, let  be the event that a connection goes through link , so that its probability is

(15a)

For this link to receive a vote, either (i) it causes a retransmission; (ii) a level 1 link from the origin pod causes a retransmission; (iii) a link between a tier-2 and a tier-1 switch causes the retransmission; or (iv) a level 1 link in the destination pod causes the retransmission. Thus, link  gets a vote if a connection goes through  (event ) and one of the following occurs:

  • event : causes a retransmission, i.e.,

    (15b)
  • event : the connection also goes through some and  causes a retransmission. Then,

    (15c)
  • event : the connection also goes through some , , and  causes a retransmission, which yields

    (15d)
  • event : the connection also goes through some , , and  causes a retransmission. Therefore,

    (15e)

To obtain the lower bound in (10a), note that a bad link receives at least as many votes as retransmissions it causes. Therefore, the probability of 007 voting for a bad link is larger than the probability of that link causing a retransmission. Explicitly, using the fact that failure and routing are independent and , (14) and (15) give

The assumption that  makes the first term smaller than the second and yields (10a).

In contrast, the upper bound in (10b) is obtained by applying the union bound [probBook] to (14) and (15). Indeed, this leads to the following inequalities for the probability of 007 voting for a good level 1 and level 2 link:

(16a)
(16b)

where  and  denote the probability of a good level 1 and level 2 link being voted bad, respectively. Note that we once again used the independence between failures and routing. From (16), it is straightforward to see that .

To obtain (10b), we first bound (16) by assuming that all  bad links belong to the event  and , , that maximize  and . For a good level 1 link, it is straightforward to see from (14) that since , event  has the largest coefficient. Thus, taking all links to be good except for  bad links satisfying  one has

(17)

which holds for . Similarly for a good level 2 link, since  it holds from (15) that event  has the largest coefficient. Therefore,

(18)

which holds for . Straightforward algebra shows that for , , from which (10b) follows. ∎
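As a sanity check on the ranking claim that Lemmata 1 and 2 combine to give, the following sketch simulates 007’s voting scheme on a toy Clos topology with a single bad link. All parameters, the equal-weight votes (the approximation used in the proof of Lemma 1), and the routing details are our illustrative assumptions.

import random
from collections import Counter

N_PODS, N_TOR, N_T1, N_T2 = 2, 6, 3, 4      # toy topology
PKTS_PER_CONN = 100
P_GOOD, P_BAD = 1e-4, 1e-2                  # hypothetical per-packet drop rates
N_CONNECTIONS = 20_000

rng = random.Random(1)
TORS = [(p, t) for p in range(N_PODS) for t in range(N_TOR)]
BAD_LINK = ("L1", 0, 0, 0)                  # one arbitrary level 1 link is bad

def sample_path():
    """Links traversed by one ToR-to-ToR connection under uniform routing."""
    (ps, s), (pd, d) = rng.sample(TORS, 2)
    a = rng.randrange(N_T1)
    path = [("L1", ps, s, a)]
    if ps == pd:
        path.append(("L1", pd, d, a))
    else:
        c, b = rng.randrange(N_T2), rng.randrange(N_T1)
        path += [("L2", ps, a, c), ("L2", pd, b, c), ("L1", pd, d, b)]
    return path

def drops_somewhere(path):
    """True if any link on the path drops at least one of the connection's packets."""
    for link in path:
        p = P_BAD if link == BAD_LINK else P_GOOD
        if any(rng.random() < p for _ in range(PKTS_PER_CONN)):
            return True
    return False

votes = Counter()
for _ in range(N_CONNECTIONS):
    path = sample_path()
    if drops_somewhere(path):    # connection saw at least one retransmission
        votes.update(path)       # every link on its path gets one (equal) vote

print(votes.most_common(3))      # the bad link should collect the most votes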

Appendix D Greedy solution of the binary program

F: set of failed links
S: set of failed connections
F ← ∅
while S ≠ ∅ do
    ℓ ← the link that explains the largest number of additional failures
    E ← the failures explained by ℓ
    F ← F ∪ {ℓ}
    S ← S \ E
end while
return F
Algorithm 2 Finding the most problematic links in the network.

When discussing optimization-based alternatives to 007’s voting scheme, we presented the following problem, which we dubbed the binary program:

minimize_{x ∈ {0,1}^L}  ‖x‖_0    subject to    Ax ≥ b        (19)

where A is a C × L routing matrix (entry (i, j) is 1 if connection i traverses link j and 0 otherwise); b is a C × 1 vector that collects the status of each flow during an epoch (each element of b is 1 if the connection experienced at least one retransmission and 0 otherwise); L is the number of links; C is the number of connections in an epoch; and ‖x‖_0 denotes the number of nonzero entries of the vector x.

Problem (19) can be described as looking for the smallest number of links that explains all failures. To see this, start by noting that x is an L × 1 binary vector whose j-th entry indicates whether link j is believed to have failed. Thus, since A is the routing matrix, the C × 1 vector Ax describes whether x explains a possible failure in each connection: if its i-th entry is zero, then x does not explain a possible failure in the i-th connection; if it is at least one, then x does. Hence, the constraint Ax ≥ b can be read as “explain each failure at least once”. Note that, in an attempt to explain all failures, x may explain failures that did not occur; this is why we used the term “possible failure” earlier. Recalling that the objective of (19) is to minimize the number of ones in x, i.e., the number of links marked as “failed”, gives the interpretation from the beginning of the paragraph.

As we noted before, the binary program is NP-hard in general. Its solution is therefore typically approximated using a greedy procedure. Given the interpretation from the last paragraph, we can describe the greedy solution of (19) as in Algorithm 2 [combinatorial]. The algorithm proceeds as follows. Start with an empty set of failed links F and the set S of unexplained failures. At each step, find the single link ℓ that explains the largest number of unexplained failures, add it to F, and remove from S all the failures it explains. We then iterate until S is empty. Note that this is the procedure followed by MAX COVERAGE and Tomo [netdiagnoser, risk]; both therefore approximate the solution of (19).
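For completeness, here is a compact sketch of Algorithm 2 in code. The flow-to-path mapping stands in for the routing information encoded by A in (19), and all names are ours; the toy usage reuses the Figure 15 example, where the shared link collects the blame.

from typing import Dict, Hashable, List, Set

def greedy_blame(paths: Dict[Hashable, List[Hashable]],
                 failed_flows: Set[Hashable]) -> Set[Hashable]:
    """Greedily pick links until every failed flow is explained by one of them."""
    unexplained = set(failed_flows)
    blamed: Set[Hashable] = set()
    while unexplained:
        # Link that appears in the most still-unexplained failed flows.
        best_link = max(
            {l for f in unexplained for l in paths[f]},
            key=lambda l: sum(l in paths[f] for f in unexplained),
        )
        blamed.add(best_link)
        unexplained = {f for f in unexplained if best_link not in paths[f]}
    return blamed

# Toy usage with the Figure 15 example: the shared link "4-2" gets the blame.
paths = {"f1": ["1-4", "4-2"], "f2": ["3-4", "4-2"], "f3": ["1-4", "4-3"]}
print(greedy_blame(paths, {"f1", "f2"}))  # {'4-2'}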