Detecting Network Disruptions At Colocation Facilities

Colocation facilities and Internet eXchange Points (IXPs) provide neutral places for competing networks to exchange terabytes of data traffic daily. Although very reliable, these facilities are not immune to failure and may experience difficulties that can have significant impacts on exchanged traffic. In this paper we devise a methodology to identify colocation facilities in traceroute data and to monitor delay and routing patterns between facilities. We also present an anomaly detection technique to report abnormal traffic changes, which are usually due to facility outages. We evaluate this method with eight months of traceroute data from the RIPE Atlas measurement platform and manually inspect the most prominent events: an IXP outage, a DDoS attack, and a power failure in a facility. These case studies validate the benefits of the proposed system to detect real world outages from traceroute data. We also investigate the impact of anomalies at the metropolitan level and identify outages that span up to eight facilities.




I Introduction

Internet eXchange Points (IXPs), usually hosting access switches in colocation facilities [ref:9], have played an important role in the evolution of the Internet [ref:15, ref:101]. By joining IXPs, lower tier networks are able to exchange traffic at a much lower cost than using traditional transit networks [ref:13], and at the same time experience better network performance [ref:17]. Lower tier networks located in different geographic locations are able to interconnect directly with each other by establishing peering links over the IXP infrastructure. Consequently, researchers, governments and intelligence agencies [ref:81, ref:82] now consider IXPs critical parts of the global Internet [ref:35, ref:38, ref:26], where terabytes of traffic are exchanged daily [ref:16]. The success of IXPs is a double-edged sword, as it also means that a disruptive event at a single IXP, or at one of its colocation facilities, can have an impact across numerous networks. In one example, a power outage on April 8, 2018, at a facility of one of the largest IXPs in the world, DE-CIX Frankfurt, broadly affected Internet connectivity across Germany [thousandeyes:decix-outage]. Heavy traffic anomalies due to misconfigurations, power failures, DoS attacks, etc., can also take place between ASes over facility peering links. The increasing criticality of IXPs provides a clear motivation for the development of techniques to detect disruptions and performance degradations at colocation facilities.

This work investigates the use of data plane information to detect anomalies at IXP facilities. We leverage existing large scale traceroute results collected by the RIPE Atlas measurement platform between May and December 2015 to model usual traffic patterns at colocation facilities and detect abnormal changes. The proposed method is composed of three steps: (i) using the IP addresses reported in traceroutes, we infer border routers connected to IXPs; (ii) we detect the facility-crossing links adjacent to the IXP hops; and (iii) we identify abnormal delay and routing changes between facilities using non-parametric statistics [ref:10].

Both colocation facility detection [ref:9] and abnormal delay/routing detection [cing:infocom03, mahajan:sosp03, deng:icnp08, ref:10] have been separately addressed in the literature. To our knowledge, this work is the first to study both topics jointly, and in the process we improve the existing methods. This work complements the literature with several contributions:

  • We design a unified method to detect facility outages, outages between ASes peering at an IXP, and facility or IXP maintenance.

  • This includes a method, called the Rule-based Constrained Facility Search (RCFS), for the detection of colocation facilities in traceroute data.

  • We also adapt existing delay and routing monitoring methods to the particular case of colocation facilities.

  • The implemented system provides a unique view on inter- and intra-facility delays and packet forwarding patterns.

  • Reported alarms are ranked by order of importance so network operators can focus only on important anomalies.

Our system is evaluated with eight months of traceroute data from the RIPE Atlas platform. We manually check the most prominent detected outages and cross-validate our results with information made publicly available by the community or IXP operators. We also study the geographical span of outages and find events affecting multiple facilities in the same metropolitan area.

The rest of this paper is organised as follows: First, we define the necessary background (Section II) and our datasets (Section III). Then, we describe the proposed method (Section IV), evaluate it (Section V) and report the existing limitations (Section VI). Finally, we close with the conclusions (Section VII).

II Background

First, we introduce the peering infrastructure terminology and provide background on infrastructure outages.

II-A Internet eXchange Point (IXP)

An IXP is a network infrastructure that facilitates public peering among participant ASes [ref:13]. IXPs usually operate at layer 2 and provide low latency and high throughput solutions to their customers.

Their infrastructure can range from a single switch with a few interconnected members to a distributed system spanning multiple continents. Large IXPs usually install switches in various locations (e.g., colocation centers) to provide their services to numerous networks. Each IXP member brings its own router and connects one port to the IXP switch and the other to the media leading back to the member AS. Upon agreement of the member to the IXP terms and conditions and the successful assignment of an IXP address, the AS is ready to exchange traffic with other IXP participating networks [ref:13].

II-B Colocation Facility (Colo)

Colos are buildings that provide secure places for networks to bring in their equipment and interconnect. Cooling, fire protection, stable power, backup generators and high bandwidth cables are only a few of the services they offer to attract customers and satisfy their needs [ref:9]. Big IXPs usually install access switches in multiple colos in the city where they operate. Colo customers may then utilize the IXP infrastructure to exchange traffic with members of remote colos [ref:9]. Large companies may also operate multiple colos in the same city and connect them via cross-connect links [ref:9]. This distributed interconnection offers better protection and stability during an outage [ref:49].

The peering options usually available at colos are: 1) public peering, where the IXP infrastructure is utilized for the communication (Fig. 1A&B); 2) private peering, where the two peers directly exchange data either via private IXP links (tethering [ref:9]) or by directly connecting with each other (cross-connect [ref:41], Fig. 1C, D and E); and 3) remote peering, where a network not present in the facility remotely connects to the IXP [ref:27].

In this paper, we identify public peering links and, under certain conditions, also private peering links. The monitoring of remote peering links is not part of this work.

II-C Peering Infrastructure Outages

Colos and IXPs play an important role in interconnecting thousands of peers around the world [ref:49]. Power failures, human errors, attacks and natural disasters affecting these infrastructures may be critical for the Internet connectivity of thousands of users. Yet, only the most severe outages are publicly reported via mailing lists (e.g., NANOG, outages mailing lists), news websites, and local operators. Consequently, evaluating a facility outage detection tool is challenging due to the lack of ground truth data.

To the best of our knowledge, [ref:49] was the first study that built an automatic tool for detecting peering infrastructure outages in near real time. The authors proposed the use of location information conveyed by BGP communities [ref:45] and by IXP and colocation websites to geolocate the source of BGP update messages. Observing multiple BGP updates in a short period triggered their investigation module to examine whether the updates involved the same colocation facility or IXP. Finally, they used active traceroutes as a means to validate the disruption in the data plane.

Compared to this past study, our work has a more limited scope since it uses only data plane measurements, yet for the observed facilities we expect better monitoring results. BGP is a control plane protocol that reveals some inter-domain connections, but depending on their routing policies networks usually avoid announcing all their peerings on BGP (e.g., private peerings). Using data plane information and the large scale deployment of RIPE Atlas, we strive to monitor more peering links and focus only on those that are actually in use. In addition, traceroute provides RTT data that allows us to identify detrimental delay increases within and between facilities, for example those caused by DDoS attacks (see Section V-C2). This type of event has no impact on the control plane, so it goes undetected by methods using BGP data.

III Datasets

This work aims to detect network disruptions at colos using traceroute data. First we seek to identify routers located in colos, and then we monitor unusual routing and delay patterns for the intra- and inter-facility links (Fig. 1). To achieve this, we leverage multiple datasets:

RIPE Atlas built-in and user IPv4 Paris traceroute measurements from May until December 2015. The built-in measurements consist of traceroutes done every 30 minutes from all Atlas probes (about 10k probes) towards all DNS root servers and a few servers operated by RIPE NCC [ref:2]. We focus on the IPv4 traceroutes that target instances of the DNS root servers at 30-minute intervals. In order to be closer to end-users and achieve lower latencies, numerous DNS root server instances are deployed at IXPs. Consequently, the root DNS servers make excellent traceroute targets to monitor colos over time. The user measurements, in contrast, are performed on demand and consume RIPE Atlas credits. Because they are initiated on demand and usually last for a short period of time, we do not use the user measurements for anomaly detection but only to detect additional peering relationships between colo members (Section IV-B).

PeeringDB [ref:3] provides diverse details about IXPs, for example the IP prefixes used for peering LANs, the facilities where IXPs are present, and the ASNs of member networks. This database hosts independent entries maintained by IXP and network operators. We query PeeringDB to first identify IXP addresses in the traceroute path and then extract candidate facilities. Although our closest available snapshot (24/09/2016) was taken one year after the traceroute measurements, we assume it is still accurate.

CAIDA’s Internet Topology Data Kit (ITDK [ref:4]). Our facility detection algorithm searches for IXP addresses in traceroutes, but IXP connected routers do not always answer with their IXP interface [ref:11]. We utilize ITDK’s IP alias resolution dataset to identify alias IXP interfaces in the traceroute path. We use the aliases resolved by MIDAR [ref:5] and iffinder [ref:6], which yield the highest confidence with very few false positives. Since ITDK becomes available every 6 months, we use the closest snapshot, produced in August 2015.

Routeviews prefix-to-AS map. PeeringDB provides only AS to facility mappings. To retrieve the ASN of IP addresses found in traceroutes, we use daily dumps derived from Routeviews between May and December 2015. We make such conversions from traceroutes only when it is necessary to identify the facility (Rules 3 & 5 of Section IV-B).

All datasets are also available for IPv6, which is not considered in this study. Networks may make different peering decisions over IPv6 paths [ref:77, ref:78]. Furthermore, IPv6 may be affected differently by RTT delays [ref:31]. A comparison between IPv4 and IPv6 anomalies would be interesting but is left for future work.

IV Methodology

Using traceroute data we aim to detect delay and forwarding anomalies at colos. To achieve this goal we model the usual delay and forwarding patterns observed between colos and detect deviant patterns. Specifically, the proposed system performs the following steps every hour: 1) it examines the built-in measurements to detect IXP connected routers (Section IV-A), then 2) identifies the colos involved in the peering communication (Sections IV-B & IV-C), and finally 3) computes delay and forwarding patterns and detects anomalies (Section IV-D).

Notation    Meaning
Alias(X)    The alias interfaces of IPX (includes IPX)
AS(X)       The ASN of IPX
F(X)        The set of facilities where the network X is present
TABLE I: Notations used in the methodology section.

IV-A IXP Identification

The goal of this step is to isolate IP addresses related to colos. Since IXPs are usually located at colos and peering LANs are easily identifiable in traceroutes, we use IXP addresses to find traceroutes traversing colos. Naturally, some colos do not host an IXP. Identifying addresses related to those may be possible [ref:9]; however, we ignore them for accuracy reasons.

We begin by parsing the built-in traceroutes and extracting the IP path. We sanitize it by removing hops with errors or invalid IPs (i.e., *). For the remaining clean path we query PeeringDB to check if any of the observed IP addresses belong to IXP peering LANs. If such an IP is found, we extract the IXP hop and the previous IP hop (e.g., those of IPA and IPB in Fig. 2). Since an IXP appears in the path, we conclude that the traceroute traversed a public peering link.

Private peering is harder to identify because no IXP address appears in the path (as in Fig. 1C, D, E). Instead, we check whether the path contains alias IPs assigned to routers with IXP interfaces. Upon finding two sequential ones, we extract the corresponding hops (e.g., those of IPA and IPC in Fig. 1C).

At the end of this step, we obtain IP hops related to colos and their corresponding RTTs. In the next step we use the IP of the first hop and the IP of the second hop to detect the corresponding near-end and far-end colos. We call them near-end and far-end with respect to their order in the traceroute path.
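The public peering case of this step can be illustrated with a minimal sketch. It assumes that the peering LAN prefixes have already been loaded from PeeringDB into a dictionary; the `IXP_PEERING_LANS` table, the `EXAMPLE-IX` name, the function names and the documentation-range addresses are all hypothetical and only serve the illustration.

```python
import ipaddress

# Hypothetical peering LAN prefixes; the paper obtains them from PeeringDB.
IXP_PEERING_LANS = {
    "EXAMPLE-IX": ipaddress.ip_network("198.51.100.0/24"),
}

def sanitize(hops):
    """Keep only hops that parse as IP addresses (drops '*' and errors)."""
    clean = []
    for hop in hops:
        try:
            clean.append(ipaddress.ip_address(hop))
        except ValueError:
            continue
    return clean

def ixp_of(address):
    """Return the name of the IXP whose peering LAN contains the address."""
    for name, lan in IXP_PEERING_LANS.items():
        if address in lan:
            return name
    return None

def extract_ixp_hops(hops):
    """Return (previous hop, IXP hop, IXP name) for each IXP address found."""
    path = sanitize(hops)
    found = []
    for previous, current in zip(path, path[1:]):
        ixp = ixp_of(current)
        if ixp is not None:
            found.append((str(previous), str(current), ixp))
    return found

example_path = ["192.0.2.1", "*", "203.0.113.7", "198.51.100.20", "192.0.2.99"]
print(extract_ixp_hops(example_path))
```

A linear scan over the prefixes keeps the sketch short; with the full set of PeeringDB prefixes a prefix tree would likely be a better fit.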

IV-B Facility Detection Phase (RCFS)

To identify the colos, we improve the constrained facility search (CFS) method proposed in [ref:9] to better exploit the information provided by PeeringDB. The original algorithm combined IXP information from multiple sources, but in doing so it ignored useful mappings between the IXP address and the AS of the customer (refer to Rule 2).

In our experiments with PeeringDB we observed that the IXP to facility mappings are more reliable than the AS to facility mappings. This is mainly because IXPs tend to carefully maintain their PeeringDB entries, which serve as a front-end of their customer list to attract new customers [ref:13]. However, AS entries may be outdated or contain limited information, e.g., because of security concerns. Based on these observations, we propose a rule-based model (RCFS) which strategically constrains the facility search for the specific case of PeeringDB.

Following the example of Fig. 2, for each extracted hop we first consider the IXPs in the alias list (Rules 1 & 2), then the remaining AS information of this list (Rule 3) and finally, if we could not identify the colo, the IXP and AS information of the next hops (Rules 4 & 5). If possible, we avoid IP to AS conversions for border routers as they are prone to errors.

Fig. 1: Connectivity links between facilities. Public peering inter (A) and intra link (B), private peering intra (C) and inter links (D,E).
Fig. 2: Rule example of a path whose extracted hops are those of IPA and IPB. IPB and IPC are IXP addresses.

The detection of the near-end colo is described below using IPA. The same method can be applied for the detection of the far-end using IPB. Each rule receives as input the candidate colos of the previous rule, constrains them, and forwards them to the next rule. The algorithm stops when either a single candidate remains or none remains due to conflicts between rules. Note that rules can be skipped. For example, if there is no alias for an IP, then the output of Rule 3 equals its input.
Rule 0: User-provided information. We allow the user to specify IP interfaces that are known to belong to a colo router. If such an IP is found, we conclude that the traceroute traversed the specified colo. We use this rule to map a few known IPs close to the DNS root servers [ref:10] that were missing from PeeringDB.
Rule 1: Facilities of the IXP. We begin by looking for IXP addresses in the Alias(IPA) list. If such addresses are found, we fetch from PeeringDB all the colos of the identified IXP. Notice that a router may be connected to multiple IXPs, which allows us to further constrain the candidate colos to those common to all of them. In Fig. 2, the router of IPA is connected to IXPAC. Since we obtained several colos, we proceed to Rule 2.
Rule 2: Facilities of the IXP address. IXP operators report on PeeringDB the addresses they assign to their customer networks. If in Rule 1 we observe such an IXP address, we retrieve the customer’s ASN from PeeringDB’s IXP page and then the colos where the customer is present. We intersect these results with the ones of Rule 1. If the intersection is empty, we stop and return an error that signals conflicting information. In our example, this identifies the owner of the router. This rule was not useful for the near-end. It is important, though, for the far-end, as it reveals the AS of the far-end connected peer, which is not always visible in the traceroute.
Rule 3: Facilities of the alias ASNs. When the IXP data are not sufficient, we instead focus on networks peering inside the colo. First, using Routeviews we convert each non-IXP address of the Alias(IPA) list to the corresponding ASN. Then for each AS we fetch the candidate colos from the ASN pages in PeeringDB. The final results of this rule are the colos where all the ASes and the IXP (if any from Rules 1 and 2) are present. In our example, if ASB used addresses of its own domain to establish the peering interconnection, it is visible in the alias list, and we can constrain the facility search using the facilities where ASB is present.

We only use networks that we can identify in this rule. We ignore IP addresses missing from Routeviews and ASNs which neither contain an entry nor report their colos in PeeringDB. Sometimes, incomplete AS/IXP entries may return no candidate colos; we ignore those. If Rules 1, 2 and 3 are all skipped, we stop the facility identification here. The following rules focus on the next hops. Although useful, we avoid using them sooner because of traceroute problems, e.g., routers answering from another interface [ref:86] or load balancing not mitigated by Paris traceroute [ref:80].
Rule 4: Facilities of next hop (IXP). MIDAR’s alias resolution has a small false positive rate, yet false negatives are possible. This means that IXP addresses may be missing from the alias list in Rule 1. Assuming that routers answer with their inbound interface, the IXP of the near-end is surely observed after crossing the IXP link (IPBC in Fig. 2). Rule 4 takes advantage of this observation to constrain the candidates with the colos of IXPs that remained undetected in the alias list but appeared in the next hop. We consider as next hops all those that appeared in the built-in and user defined measurements during the same day.
Rule 5: Facilities of next hop (AS). As a last resort, following the idea of Rule 4, we pick all the aliases of IPA and for each IP we resolve the next hop ASes. Our goal is to reveal additional peerings with networks that were not observed in Rule 3. We then utilize the colos of these ASes to constrain our facility search. However, for colos that are geographically close, cross-connect links may cause wrong inferences because the other peer may be a member of another colo (e.g., Fig. 1D&E). In order to mitigate this problem, we independently intersect each new peer’s colos with the results of Rule 4. Among the independent intersections, we pick as candidate the colo which appeared in the majority, provided it accounts for at least 75% of those intersections.

The 75% threshold is empirically found with our traceroute dataset. Compared to the 100% threshold, which discards 32% of the IPs, the 75% threshold discards 26.7%, as depicted in Fig. 3 (left). This allows identification of a few additional colos, 5.3% in total. Although there is the possibility of some incorrect inferences, we consider this threshold beneficial for identifying routers used to establish multiple peering relationships.

Fig. 3: Dropped IPs based on the sliding threshold. With a 50% threshold every IP is accepted, while with a 100% threshold only consistent IPs are accepted; 32% (left) and 6.3% (right) are lost.

At the end of this step, we temporarily store the near-end and far-end identifications. In the next step we make the final colo decision. Note that in the case of intra-colo links the near-end and far-end colos are the same (as in Fig. 1B&C).
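The way the rules successively constrain the candidate colos can be sketched as a chain of set intersections followed by the Rule 5 vote. This is only an illustration under stated assumptions: the function names, the colo labels and the convention of passing `None` for a skipped rule are hypothetical, and returning no result when several colos pass the vote is our own choice for ambiguous cases, not something the text specifies.

```python
from collections import Counter

def constrain(candidates, rule_colos):
    """Apply one rule; None means the rule is skipped (output equals input)."""
    if rule_colos is None:
        return candidates
    if candidates is None:
        return set(rule_colos)
    return candidates & set(rule_colos)

def rcfs(rule_outputs):
    """Chain the rules in order, stopping at a single candidate or a conflict."""
    candidates = None
    for rule_colos in rule_outputs:
        candidates = constrain(candidates, rule_colos)
        if candidates is not None and len(candidates) <= 1:
            break
    return candidates

def rule5_vote(candidates, peer_colo_sets, threshold=0.75):
    """Intersect each next-hop peer's colos with the candidates independently
    and keep the colo present in at least `threshold` of the intersections."""
    intersections = [candidates & set(colos) for colos in peer_colo_sets]
    if not intersections:
        return None
    votes = Counter(colo for inter in intersections for colo in inter)
    winners = [c for c, n in votes.items() if n / len(intersections) >= threshold]
    # Assumption: several colos passing the threshold is treated as ambiguous.
    return winners[0] if len(winners) == 1 else None

# Hypothetical colo sets returned by Rules 1 to 3 (Rule 3 skipped).
remaining = rcfs([{"colo1", "colo2", "colo3"}, {"colo2", "colo3", "colo9"}, None])
peers = [{"colo2"}, {"colo2", "colo3"}, {"colo2", "colo7"}, {"colo3"}]
print(remaining, rule5_vote(remaining, peers))
```

In this example colo2 appears in three of the four independent intersections, which meets the 75% threshold, while colo3 appears in only two.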

IV-C Temporal Consistency

Some of our datasets are updated daily, hence outlier values may temporarily appear and occasionally impact the facility identification results. To address this issue we check the stability of the IP to facility mappings across time and clean aberrant results.

Each day we load a different next hop database and Routeviews dataset. This may result in slightly different information, causing some IP addresses either to fail the detection or to match a different colo.

We therefore parse the output of the previous step an additional time. For each IP, we pick as correct the colo to which it was matched on the majority of days. We use an 87% threshold below which we remove the IP from the detected dataset. Fig. 3 (right) shows that only a few IPs (6.3%) are unstable. The 87% threshold allows us to map 1.5% of those to their colo. More conservative studies can set the thresholds to higher values.
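A minimal sketch of this majority check is given below. It assumes that the daily results for one IP are available as a list of colo labels, with `None` marking a day on which the detection failed, and that failed days count in the denominator; both the data layout and that counting choice are assumptions made for the illustration.

```python
from collections import Counter

def stable_colo(daily_matches, threshold=0.87):
    """Return the colo matched on at least `threshold` of the days, else None.

    daily_matches holds one entry per day: a colo label, or None when the
    detection failed that day (assumption: failed days still count).
    """
    if not daily_matches:
        return None
    colo, count = Counter(daily_matches).most_common(1)[0]
    if colo is None:
        return None
    return colo if count / len(daily_matches) >= threshold else None

print(stable_colo(["colo2"] * 9 + ["colo7"]))      # kept: 90% of the days agree
print(stable_colo(["colo2"] * 8 + ["colo7"] * 2))  # dropped: only 80% agree
```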

IV-D Anomaly Detection

Links between colos are critical. Disruptions on these links may cause connectivity problems to thousands of Internet users. We adjust the techniques of [ref:10] and build a simple tool to detect abnormal patterns for the specific case of colos. First, we compute the forwarding pattern of each colo (Section IV-E), then the delay of each near-end link towards the far-end (Section IV-F) and, lastly, we compare those two patterns to computed references to detect anomalies (Section IV-G).

IV-E Forwarding Model

We collect traceroutes for each Atlas probe, extract the colos, and count the number of times links between colos are traversed. We then aggregate those counters across all probes to produce the forwarding pattern of each near-end colo, that is, the number of packets passing from the near-end colo to each of its neighboring colos. We compare it to a computed reference that represents usual patterns (see Section IV-G).
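The counting and aggregation can be sketched as follows, assuming that the previous steps yield one `(probe_id, near_colo, far_colo)` tuple per traversed link in the current time bin. The tuple layout and the function name are hypothetical.

```python
from collections import Counter, defaultdict

def forwarding_patterns(crossings):
    """Build the forwarding pattern of each near-end colo for one time bin.

    crossings: iterable of (probe_id, near_colo, far_colo) tuples.
    Returns {near_colo: Counter({far_colo: packets})}.
    """
    # Count link traversals separately for each probe.
    per_probe = defaultdict(Counter)
    for probe_id, near, far in crossings:
        per_probe[probe_id][(near, far)] += 1
    # Aggregate the per-probe counters into one pattern per near-end colo.
    patterns = defaultdict(Counter)
    for counts in per_probe.values():
        for (near, far), packets in counts.items():
            patterns[near][far] += packets
    return patterns

crossings = [(1, "A", "B"), (1, "A", "B"), (2, "A", "C"), (3, "A", "B"), (3, "D", "A")]
patterns = forwarding_patterns(crossings)
print(dict(patterns["A"]))
```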

Figure 6 illustrates an example of this pattern for the near-end (A) towards each far-end (B, C, D, E). The usual forwarding pattern computed at hour t-1 is shown in Fig. 6(a). In the next hour (Fig. 6(b)), we notice an unusual decrease in the packets towards B and C, accompanied by a similar increase towards facility D.
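The counting step can be sketched as follows. The sketch assumes that each traceroute has already been reduced to the ordered list of colos it traverses (hops without an identified colo removed), and that consecutive hops inside the same colo do not contribute to the pattern. The name `forwarding_patterns` and the input layout are illustrative.

```python
from collections import Counter, defaultdict

def forwarding_patterns(colo_paths):
    """Count traversals of each (near-end, far-end) colo link.

    colo_paths: iterable of colo sequences, one per traceroute, e.g.
    ["A", "B"] (hypothetical layout; unidentified hops already dropped).
    """
    patterns = defaultdict(Counter)
    for path in colo_paths:
        for near, far in zip(path, path[1:]):
            # Assumption: hops staying inside one colo are not counted here.
            if near != far:
                patterns[near][far] += 1
    return patterns
```

For colo A of Fig. 6, `forwarding_patterns(paths)["A"]` would then hold one counter per far-end colo B, C, D and E.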

To detect anomalous patterns, we test for homogeneity with the chi-squared test and the following null hypothesis:

Null hypothesis: The observed data for an hour is consistent with the normal reference computed from the previous hours.
Alternative: The observed data is not consistent with the normal reference.

Fig. 6: Usual (a, time t-1) and anomalous (b, time t) forwarding patterns for colo A towards the far-end colos B, C, D and E.

Our goal is to search for time periods where the null hypothesis is rejected in favor of the alternative. Since Atlas probes traceroute the same destination every 30 minutes, under normal network conditions we expect the current pattern to be consistent with the distribution of the reference. For the chi-squared test we set a significance level of 0.01, below which we reject the null hypothesis and report an alarm. Since this type of test does not work properly for small expected values (fewer than 5), we sum counts from far-end colos with less than 5 packets into one variable.
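A minimal sketch of this test is given below. It is one plausible reading of the procedure rather than our exact implementation: the reference is rescaled to the observed total, far-end colos whose expected count is below 5 are pooled, and far-ends absent from the reference are ignored. The names `chi2_sf` and `forwarding_alarm` are illustrative, and the tail probability is computed with a plain series expansion to keep the example free of external dependencies.

```python
import math

def chi2_sf(x, dof):
    """Upper tail probability of a chi-squared variable (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    if z > 700:  # far in the tail for the few far-ends considered here
        return 0.0
    term = total = 1.0 / a
    n = 1
    while term > total * 1e-16:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))

def forwarding_alarm(observed, reference, alpha=0.01, min_expected=5.0):
    """Return (alarm, p_value) for one near-end colo and one hour."""
    total_obs = sum(observed.values())
    total_ref = sum(reference.values())
    if total_obs == 0 or total_ref == 0:
        return False, 1.0
    scale = total_obs / total_ref  # assumption: compare shapes, not volumes
    obs, exp = [], []
    pooled_obs = pooled_exp = 0.0
    for far_end in set(observed) | set(reference):
        o = observed.get(far_end, 0)
        e = reference.get(far_end, 0.0) * scale
        if e < min_expected:  # pool far-ends with small expected counts
            pooled_obs += o
            pooled_exp += e
        else:
            obs.append(o)
            exp.append(e)
    if pooled_exp > 0:
        obs.append(pooled_obs)
        exp.append(pooled_exp)
    if len(exp) < 2:  # not enough categories to run the test
        return False, 1.0
    stat = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    p_value = chi2_sf(stat, len(exp) - 1)
    return p_value < alpha, p_value
```

With a reference of 40, 35, 20 and 5 packets towards B, C, D and E, an hour with 10, 10, 80 and 0 packets is rejected at the 0.01 level, whereas an hour that reproduces the reference is not.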

Usually only a few paths are responsible for a forwarding anomaly. Given the anomalous pattern and the computed reference, we reuse the responsibility metric defined in [ref:10] to detect which path caused an anomaly.

The responsibility metric takes values in [-1, 1]. Negative values stand for paths with an unusually low number of packets, positive values for paths with an unusually high number of packets, and values close to zero for normal situations.
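The exact definition of the metric is given in [ref:10]. As an illustration only, the sketch below assumes a normalized signed difference between observed and reference counts, which reproduces the range and the sign convention described above. The function name and the dictionary layout are hypothetical.

```python
def responsibility(observed, reference):
    """Signed share of the total deviation carried by each far-end colo.

    Assumed form: (observed - reference) / sum of absolute differences.
    """
    far_ends = set(observed) | set(reference)
    diffs = {f: observed.get(f, 0) - reference.get(f, 0) for f in far_ends}
    total = sum(abs(d) for d in diffs.values())
    if total == 0:
        return {f: 0.0 for f in far_ends}
    return {f: d / total for f, d in diffs.items()}
```

Under this assumption, a far-end that absorbs all the traffic lost by the others receives a score of 0.5, and the far-ends that lost traffic share a total score of -0.5.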

IV-F Delay Change Detection

Estimating delays for intra and inter facility links is not a trivial task because of traceroute limitations, such as path asymmetry and RTT variability [vries:aims15, schwartz:infocom10, ref:10]. The proposed detection model monitors an estimator of the delay required for traceroute packets to traverse intra and inter facility links.

In [ref:10] a technique is proposed to address these challenges when a link is observed by a sufficient number of probes with different return paths. That technique monitors shifts in the distribution of the median differential RTT and distinguishes strong alarms. Since colo links are usually monitored by multiple probes from different ASes with disparate return paths, we implement that monitoring technique.

In Section IV-A we extracted the IP hops and the RTT values, and in Section IV-B we identified the facilities, as for the traceroute of Fig. 2. We group all differential RTT values from the same near-end towards the same far-end colo and calculate their median, so that an anomaly triggers only if the majority of RTT values between the two facilities are affected. Note that multiple routers of colo A and colo B may be involved in this grouping. Like in [ref:10], we calculate confidence intervals for both the observed and the reference median to detect statistically significant changes. If those confidence intervals stop overlapping, we report an alarm like those of Fig. 11.
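The grouping and comparison can be sketched as below. The sketch assumes that the confidence interval of the median is taken from order statistics with Wilson score bounds at the 95% level. This is an assumption made for illustration, and the details of the method we follow are in [ref:10]. The function names are hypothetical.

```python
import math
import statistics

def median_with_ci(samples, z=1.96):
    """Median and a Wilson-score confidence interval from order statistics."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("no samples")
    # Wilson score half-width for a proportion of 0.5 (the median).
    half = z * math.sqrt(0.25 / n + z * z / (4 * n * n)) / (1 + z * z / n)
    low = ordered[max(0, int((0.5 - half) * n))]
    high = ordered[min(n - 1, int((0.5 + half) * n))]
    return statistics.median(ordered), low, high

def delay_alarm(observed_ci, reference_ci):
    """Alarm when the two (low, high) intervals do not overlap."""
    return observed_ci[1] < reference_ci[0] or observed_ci[0] > reference_ci[1]
```

Because the interval is built around the median, a handful of outlying RTT samples between two colos moves neither bound, which matches the requirement that the majority of values be affected before an alarm is raised.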

IV-G Reference Computation

The normal reference used to detect anomalies is computed every hour using exponential smoothing, for both the forwarding and the delay patterns:

$$\bar{\Delta}_t = \alpha \Delta_t + (1 - \alpha)\,\bar{\Delta}_{t-1}, \qquad \bar{F}_t = \alpha F_t + (1 - \alpha)\,\bar{F}_{t-1}$$

where $\bar{\Delta}_t$ is the reference delay of the monitored link and $\bar{F}_t$ the reference forwarding pattern of the facility. Likewise, $\bar{\Delta}_{t-1}$ and $\bar{F}_{t-1}$ are the reference patterns of the previous hour, and $\Delta_t$ and $F_t$ are the current observed values. The exponential smoothing parameter $\alpha$ controls the importance of new measures as opposed to previously observed ones. In our case, we set $\alpha = 0.03$ to limit the impact of sudden anomalous bursts on the reference.

The initial reference values are quite important when the smoothing parameter is small. To solve the cold start problem, we compute them as the median of the values observed during the first day of our analysis. We maintain a different reference for each facility and update it at each one-hour time bin.
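A minimal sketch of the reference computation follows, using the smoothing parameter of 0.03 and the median-based initialization described above. The function names and the dictionary layout for forwarding patterns are illustrative, and treating a far-end that is missing on one side as zero is an assumption.

```python
import statistics

ALPHA = 0.03  # smoothing parameter used in our experiments

def initial_reference(first_day_values):
    """Cold start: median of the values observed during the first day."""
    return statistics.median(first_day_values)

def update_reference(previous, observed, alpha=ALPHA):
    """One exponential smoothing step for a scalar (e.g. a link delay)."""
    return alpha * observed + (1 - alpha) * previous

def update_pattern(previous, observed, alpha=ALPHA):
    """Same step applied per far-end colo of a forwarding pattern."""
    far_ends = set(previous) | set(observed)
    # Assumption: a far-end missing on one side contributes zero.
    return {f: update_reference(previous.get(f, 0.0), observed.get(f, 0), alpha)
            for f in far_ends}
```

With such a small parameter, a single anomalous hour shifts the reference by only 3% of the deviation, so the reference needs many consecutive hours to converge to a new routing state.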

V System Evaluation

We first evaluate our proposed rules and how well they identify facilities. Then, we discuss the anomaly detection results and present the most significant detected disruptions. Anomaly detection systems are usually evaluated in terms of true positives and false positives. However, in our case such validation is challenging, since confidential information is needed both for the facility detection and for the alarm validation.

We verify the correctness of our anomaly detection system by checking our top reported alarms against the public reports of major outages that took place in 2015. In our results we report anomalies caused by an IXP outage in May, a power failure in a colocation facility in mid-November, and a DDoS attack at the end of November.

V-A RCFS Evaluation

We evaluate our facility identification method, RCFS, with the built-in measurements from May until December 2015. On average, each day we analyze 12 million traceroutes and extract 14,500 unique router interfaces potentially located in colos, forming 14 million IP interconnections. This large number of interconnections is due to the rich inter-IXP connectivity [ref:44]. Recall that for each IXP address in the traceroute path, we extract two pairs (e.g. (IPA, IPIXP) and (IPIXP, IPB)). We use IPA of the first pair to identify the near-end facility and IPIXP of the second pair to identify the far-end. The second IP in each pair assists only in the facility discovery, i.e., in Rules 4 and 5.

To understand the contribution of each RCFS rule, we picked a smaller dataset, from May to August, and inspected in detail the identification results per rule. For this smaller dataset RCFS consistently identifies facilities for 4000 of the 14,500 interfaces (28%). Similar results should apply for the whole 2015 period.

Table II and Fig. 7 depict the contribution of each rule to the facility identification. We observe that the first three rules are responsible for 39% of the detected facilities. For the remaining 61%, next hop information is required (Rules 4 and 5). Rule 4 accounts for the largest share of identifications (46.7% on average), since identifying the near-end colo requires knowledge of the IXP, whose IP interface usually appears in the next hop, e.g. like in Fig. 1A&B. In our experiments we also used Rule 0 to map a few IPs (about 0.1%, see Table II) not listed in PeeringDB but listed on the IXP websites.

Average %      Rule 0   Rule 1   Rule 2   Rule 3   Rule 4   Rule 5
Successes %     0.097    5.391   18.188   15.289   46.731   14.301
Failures %      0        0.604    2.803   23.823   19.224   53.543
Unique router interfaces observed per day: 14,500
TABLE II: Average facility identification successes and failures per rule.
Fig. 7: Daily percentage of unique IPs whose facility was identified by each rule (left) or for which the identification failed at each rule (right).

Similarly, the right-hand side plot of Fig. 7 reveals the rules where the identification failed. RCFS fails to identify the colo when the search space becomes empty. Failures at the last rule, Rule 5, usually mean that we are unable to detect sufficient peering relationships to constrain the candidate colos. As shown in Fig. V-A, this is the case for about 85% of the IPs. For these interfaces we would require additional active traceroutes, either from RIPE Atlas or from other sources (e.g. CAIDA Ark).

From the above figures, we clearly observe that our results are consistent and stable over time. This is required for the anomaly detection to safely identify pattern discrepancies.

Figure: Convergence vs. threshold failure for the IPs of Rule 5.
Figure: Duration of the alarms of the detection modules.

V-B Anomaly Detection Evaluation

From the 4000 daily interfaces we observe in total 264 facilities where both the near-end and the far-end facility were identified. Identification of the far-end is challenging since the far-end router typically replies with the IXP interface. The interface connected to the AS usually does not appear in the path, and the next hop interface might belong to a different AS (Section VI-A).

Between May and December 2015 we monitor links between these 264 facilities, of which 156 were reported as anomalous at least once. In total, the observed patterns deviate from the computed references 13135 times for the forwarding analysis and 19850 times for the delay analysis. 61% of the forwarding alarms last less than 1 hour, while 81.6% and 90.8% last less than 3 and 8 hours respectively (Fig. V-A). Similarly, 59% of the differential RTT alarms last less than 1 hour, while roughly 81% and 90.3% last less than 3 and 6 hours respectively.

The cause of network disruptions in colos is usually short-lived, yet traffic patterns can be affected for multiple hours. For instance, for the AMS-IX outage described in Section V-C1, a 10 minute disruption in the IXP affected the forwarding patterns of colos for 2 hours. Longer lasting alarms indicate either permanent routing changes (Fig. 15) or a strong alarm corrupting the reference value (Fig. 11). In both cases, the alarm continues to be reported until the reference converges to the new observations.

Our system is sensitive to small pattern changes, but, as described below, we can focus only on significant anomalies by ranking alarms based on their deviation from the reference.

V-B1 Ranking Delay Anomalies

When we detect a differential RTT anomaly we calculate the deviation between the reference and the observed confidence intervals (Eq. 6 in [ref:10]). We use this metric to rank the differential RTT anomalies after we remove those where both a delay change and a forwarding anomaly occurred. This is because a change in the forwarding pattern is likely to affect the median RTT and thus to create a false alarm. By ignoring such anomalies, only clean delay data remain, with deviations that can be ranked. Among the top-80 alarms in Fig. 8 we observe 3 outstanding cases with otherwise stable differential RTT patterns.

Fig. 8: Top-80 delay (left) and forwarding (right) alarms.

The first case affected Equinix London LD5 on 2015-9-4 22:00 UTC and lasted for 2 hours. We observe both intra and inter facility links getting congested (Fig. 11). The second was due to a DDoS attack at the end of November (described in detail in Section V-C2). The third corresponds to delay changes on a link that connects Interxion Frankfurt (FRA1-12) with Speedbone Berlin on 2015-6-8 19:00 UTC. Validation from public sources was only possible for the DDoS attack.

Fig. 11: Top-1 delay alarm affecting intra links of Equinix LD5 (top) and inter links towards Equinix LD8 (bottom).

V-B2 Ranking Forwarding Anomalies

To rank a forwarding anomaly, we use the mean squared error to quantify the change in the forwarding pattern of each near-end facility:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(F_i - \bar{F}_i\right)^2$$

where $n$ is the number of far-end colos, and $F_i$ and $\bar{F}_i$ are the observed and the reference forwarding patterns towards far-end colo $i$.
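The ranking can be sketched in a few lines. The sketch assumes that patterns are stored as dictionaries keyed by far-end colo and that a far-end missing on one side counts as zero. The names `forwarding_mse` and `rank_alarms`, as well as the alarm record layout, are hypothetical.

```python
def forwarding_mse(observed, reference):
    """Mean squared error between two forwarding patterns of one near-end."""
    far_ends = set(observed) | set(reference)
    if not far_ends:
        return 0.0
    squared = ((observed.get(f, 0) - reference.get(f, 0)) ** 2 for f in far_ends)
    return sum(squared) / len(far_ends)

def rank_alarms(alarms):
    """Sort alarm records (dicts with 'observed' and 'reference') by MSE."""
    return sorted(alarms,
                  key=lambda a: forwarding_mse(a["observed"], a["reference"]),
                  reverse=True)
```

For the pattern of Fig. 6 with, say, 40, 35, 20 and 5 reference packets and 10, 10, 80 and 0 observed packets, the error is (900 + 625 + 3600 + 25) / 4 = 1287.5.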

All top alarms (Fig. 8, right plot) report significant changes in the number of traceroutes passing through a link, like the example illustrated in Fig. V-B2. Such spikes may happen due to inter-facility link failures or routing changes. We hypothesize that the two alarms of Fig. V-B2, which are one month apart, are due to maintenance work.

Figure: Top-1 and top-4 forwarding alarms from Equinix Frankfurt (FR7) to Equinix Amsterdam (AM7) at 10-6 22:00 and 11-5 22:00 UTC.
Figure: Usual number of affected facilities in the top-10 cities. * annotates unusual outages.

V-B3 Detecting Metropolitan Outages

To further validate our system, we quantify the impact of detected anomalies on colos of the same metropolitan area. We first extract each facility's address and use the Google Maps API to obtain the local city's GPS coordinates. Then, using those coordinates, we calculate the Vincenty distance between each pair of colos and map the ones closer than 50 km to the same metropolitan area. We ignore periods of time during which we know there is a problem with the Atlas controller or with our data collection.
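The grouping step can be sketched as follows. To keep the example self-contained, it uses the haversine great-circle distance instead of the Vincenty distance used in our analysis. The two differ by well under one percent at these scales, but this substitution is an assumption of the sketch. The function names and the coordinates in the usage note are illustrative.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def metro_areas(colos, max_km=50.0):
    """Group colos so that any two closer than max_km share an area.

    colos: dict mapping a colo name to its (lat, lon) coordinates.
    """
    areas = []
    for name, coord in colos.items():
        # Existing areas containing at least one colo within range.
        linked = [s for s in areas
                  if any(haversine_km(coord, colos[m]) < max_km for m in s)]
        merged = {name}.union(*linked)
        areas = [s for s in areas if s not in linked] + [merged]
    return areas
```

For example, two colos located a few kilometers apart in Amsterdam end up in the same area, while a colo in Frankfurt, more than 300 km away, forms its own.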

Atlas probe controller failures can cause an incorrect metropolitan alarm. We filter out such cases using the responsibility metric. Our expectation is that during a large anomaly, the links closer to the source lose traffic, while other nearby links see an increasing load from the backup paths between colos. We therefore keep only the metropolitan alarms that include links with both an unusually high and an unusually low number of packets, and filter out all other forwarding metropolitan alarms.

Table III lists the top-10 metropolitan areas based on the number of such observed alarms. Usually most alarms affect only a few facilities in those cities. We found a few instances where the alarms spanned across multiple facilities (Fig. V-B2). As an example, the AMS-IX outage [ref:54] caused forwarding anomalies on links between 8 local facilities (see Section V-C1).

Cities   Frankfurt     Amsterdam   London   Vienna   Tokyo
Alarms   459           399         216      125      77
Cities   Los Angeles   New York    Milano   Munich   Hong Kong
Alarms   65            59          57       53       52
TABLE III: Top-10 cities by number of observed alarms.

V-C Case Studies

To validate our system we looked for events in 2015 that were publicly disclosed, reported either on mailing lists [ref:52, ref:53] or by network operators. Usually the source of the outage and the entities affected are not made publicly available. In some cases, even the exact duration of the event is not reported. However, local news websites and reports can provide good hints for an estimation. We discuss three major events that occurred in May and November 2015 and that our system reports.

V-C1 Amsterdam Exchange Point (AMS-IX) Outage

On May 13, between 10:00 and 12:30, an outage at Amsterdam's core Internet switch infrastructure [ref:55] caused online problems in several parts of the Netherlands. According to news websites, the cause was a technical fault inside the IXP [ref:54]. Using our system we report up to 8 local facilities with unusually low and unusually high forwarding patterns, like those of Fig. 15. The outage caused sharp decreases in the number of traceroutes towards Interxion Science Park (Fig. 15 (a)). Simultaneously, Science Park members seem to have used backup paths leading to facilities both inside (Fig. 15 (c)) and outside of the country (Fig. 15 (b)). We observe that the paths leading inside the country, towards Equinix AM7, did not revert back to their usual values, probably because of new routes selected after the outage (Fig. 15 (c)).

Fig. 15: Routing changes observed during the AMS-IX outage: (a) Global Switch Amsterdam to Interxion Science Park, (b) Interxion Science Park to Interxion Frankfurt, (c) Interxion Science Park to Equinix AMS South East (AM7).

V-C2 DDoS attack against DNS Root Servers

On November 30, 2015 from 06:50 to 09:30 UTC, and on December 1 from 05:10 to 06:10 UTC, the DNS root servers received an unusually high number of well-formed queries with spoofed source IPs, making it unclear where the traffic originated [ref:56]. Each root server was affected differently by this malicious traffic [ref:58], but overall the DNS root infrastructure stayed operational during the attack.

Our system reports both delay and forwarding anomalies during the attack, mostly for links in Amsterdam (Fig. 21A-D) and London facilities (Fig. 21E). Fig. 21A indicates that the Global Switch links toward Digital Realty (Wenckebachweg) Amsterdam were affected during both attacks by a severe forwarding anomaly, possibly due to high rates of packet loss from overloaded routers. At the same time, between 06:00 and 09:00 UTC, we report links of the same facility towards Equinix Science Park (AM3) (Fig. 21B) with an unusual differential RTT but without an obvious change in the forwarding pattern.

Fig. 21: DDoS attack affecting links of Global Switch towards (A) Digital Realty (Wenckebachweg) and (B) Equinix Science Park (AM3), links of (C, D) Equinix AMS South East (AM7) towards Equinix Science Park (AM3), and links of (E) Equinix London Docklands (LD8) towards Telehouse Docklands North.

Links from Equinix Amsterdam South East (AM7) towards Equinix Science Park (AM3) also experienced high delays, possibly due to the same congested router in the far-end facility (Fig. 21C). The increased traffic pattern during the same hour in Fig. 21D also confirms that the DNS service hosted near the Science Park handled queries of many other unresponsive services. These results corroborate those of [ref:58], which reports that many of those services were stressed by sustained traffic during the attack period. It is also important to note that, although the two attacks were chronologically close, the impact of the first one was much stronger for the Amsterdam facilities. During the second attack no RTT change was observed and only routing anomalies were reported.

V-C3 Telecity Sovereign House outage

On November 17, 2015, a power outage affected London's Sovereign House facility, where both its primary and secondary supplies failed to start up. No official announcement was made, but reports from network operators appeared around 2 PM local time [ref:95] and continued until the night of the 18th [ref:96]. Although the visibility of this facility is limited in our datasets, during the outage our system reports a clear drop in the number of traceroutes between London's Sovereign House facility and Telehouse London (Docklands East) (Fig. 22). This illustrates the benefits of our system in detecting outages even when the number of traceroutes crossing the facility is low.

Fig. 22: Outage affecting the links of Sovereign House towards Telehouse-London(Docklands East) on 2015-11-17 from 14:00 to 15:00 UTC.

Using the proposed colocation facility detection and anomaly detection algorithms, we can evaluate the impact of these events on the physical infrastructure. These insights are not available to previous anomaly detection work [ref:10], as it operates only at the IP layer.

VI Limitations

VI-A Facility Identification

In large cities multiple facilities may be connected via cross connect links. When constraining the facility search through alias resolution, the alias ASN may be a member of another facility. For example, in Fig. 1D, if the interface observed in traceroutes is not the one that establishes the peering interconnection, the identification of the near-end may fail because the set of candidate facilities becomes empty.

When searching for the far-end facility, the AS connected interface does not appear in the path, so we assume that the AS connected to the IXP is given by the second IP in the far-end IP pair, since the IXP router is an ingress router to its network. Our current implementation maps IP addresses to ASNs by simply matching the longest prefix from Routeviews data. But if this IP belongs to an inter-AS link we might infer a wrong ASN. This issue could be addressed by methods like MAP-IT [mapit:imc16], which we plan to investigate in future work. MPLS, possibly used in the IXP border routers, may also affect the facility decision, since in that case a different AS may appear in the next hop instead [vanaubel2017through].
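The longest prefix matching mentioned above can be sketched with the Python standard library. The prefix table here is a plain dictionary with documentation prefixes and private-use ASNs, standing in for a real Routeviews snapshot. A linear scan is used for clarity, whereas an actual implementation would more likely rely on a radix tree.

```python
import ipaddress

def longest_prefix_asn(ip, prefix_to_asn):
    """Return the ASN of the most specific prefix covering `ip`, or None.

    prefix_to_asn: dict mapping CIDR strings to ASNs (hypothetical layout).
    """
    addr = ipaddress.ip_address(ip)
    best = None  # (prefix length, ASN) of the best match so far
    for prefix, asn in prefix_to_asn.items():
        net = ipaddress.ip_network(prefix)
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, asn)
    return best[1] if best else None
```

An address on an inter-AS link is typically numbered from the address space of only one of the two neighbors, so this lookup returns that neighbor's ASN regardless of which router owns the interface. This is the source of the wrong inference described above.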

VI-B Live Monitoring Limitations

Our system requires multiple vantage points and a common repository to collect all traceroute results (i.e. Atlas probes and controllers for our current implementation). Outages appearing close to the data collection may prevent timely access to the data and thus impair the performance of our system. This happened during the AMS-IX outage, as the RIPE infrastructure is mainly located in Amsterdam [ref:55]. In our implementation, we monitored facilities at 1 hour intervals. Although it is possible to monitor at smaller intervals, RIPE Atlas suffers from limitations when real-time measurements are involved. When an Atlas probe connects to the Internet, a registration server assigns a “probe-controller” to the candidate probe [ref:70]. The controller reports the probe’s measurement results back to the server, so that they become publicly available. In the case of a controller disruption, the probe might have performed the measurements but be unable to make them public, as in the case of the AMS-IX outage [ref:55]. Although this was an actual outage, it is unclear whether the probe should be considered connected or disconnected during that time. A live system monitoring crucial destinations may incorrectly infer an anomaly while, in reality, this is just a problem between the probe and the controller.

VII Conclusions

This work leverages large scale traceroute measurements to monitor the intricate peering world of colocation facilities. We devised a system that maps colocation facilities to traceroute data and monitors delay and forwarding anomalies at the facility level. The proposed system enables us to go beyond the usual IP-level monitoring, as it offers unique inter and intra facility monitoring capabilities. To demonstrate its benefits we analyzed eight months of data from the RIPE Atlas measurement platform and reported important outages caused by DoS attacks, power outages, and mis-operations. We found outages that can span across up to eight facilities and last several hours. These results provide new insights about the physical locations of facility outages, which are crucial for a better understanding of the peering ecosystem and for reliable connectivity. Furthermore, they provide yet another input for operators to configure their traffic and avoid outages.

In the future, a combination of a data-plane focused and a control-plane focused system can yield more complete results from the visibility perspective.