With a plethora of performance-sensitive distributed applications running on datacenter and wide-area networks, the requirements on the underlying network fabric have become extremely stringent [timegone]. In particular, because the fabric interconnects applications’ end-points, there is a push toward making it highly available.
A key factor that impacts fabric availability from the perspective of applications is failures. Even in the most well-managed networks, link/switch failures are common [datacenterfailures]. A variety of factors, including device crashes and reboots, cabling problems, buggy hardware or firmware, and power supply issues, can conspire to constantly induce link/switch failures.
The fabric’s behavior under failures critically determines its perceived availability. When a failure occurs in today’s data center and wide-area network fabrics, network forwarding attempts to reconverge to re-establish paths. When the network is still in an unconverged state, traffic destined to certain endpoints will have no valid route and will be dropped. This leads to a precipitous performance degradation for critical applications. Unfortunately, networks can remain in unconverged states for unreasonable amounts of time; this is true even if state-of-the-art approaches to route around failures are employed [frr, ddc].
Our primary goal is to design a fabric that provides the illusion of being always available. We define this as: if, under a failure scenario, there exist active paths between a source-destination pair, then the fabric must route packets through some such path without inducing any drops.
The other major consideration for networks is policy compliance. For example, Network Function Virtualization (NFV) [opennf, e2], a popular use-case, allows tenants and operators to specify middlebox chains that traffic between a set of endpoints must traverse for security and performance considerations. Because a non-trivial fraction of such middleboxes are now part of network fabrics [andromeda, azurefpga], the network is also tasked with ensuring correct middlebox traversal. As another example, operators may desire to employ various network load-balancing schemes—e.g., WCMP [wcmp]—so that the network can effectively spread load across multiple available paths to avoid queuing and congestion drops.
While operators can use various frameworks for policy compliance [simple, flowtags, merlin, propane, genesis, zeppelin, netkat], ensuring that policies always hold, especially when reacting to failures, is something that no state-of-the-art approach achieves. This is our second goal.
We observe that the main obstacle in realizing an “always available, policy-compliant” network is that recomputing new policy-compliant routes under failures is unreasonably slow. Traditionally, recomputation is performed by a centralized or distributed control plane; in both cases, the computation is off the fast path of packet forwarding, and therefore slow. The data plane, which lies on the fast path, is only equipped to perform forwarding based on the control plane route computations.
Thus, to meet our goals, we argue for a refactoring of responsibilities across control and data planes. Specifically, we argue for performing all route computation entirely in the data plane for the fastest reaction. Our paper shows that, with technology available today, it is possible to realize such data plane-only routing that can instantaneously react to failures in a policy-compliant manner.
Our network architecture, D2R (pronounced “detour”), leverages recent programmable data planes to this end. Given a view of the network topology and current state of the links, D2R implements graph traversal algorithms (e.g., Breadth-first Search and Iterative-Deepening Depth-first Search) completely in the data plane; our implementation can compute paths to any destination at near line rates. To propagate failure information for computing active routes without imposing reconvergence issues, we use the Failure Carrying Packets protocol [fcp] to tag each packet header with the failures it has encountered along its route. D2R switches use the failure information to guide the graph traversal algorithms and compute active policy-compliant routes.
Because programmable switches today have limited processing stages, they may not be able to compute the route to the destination in one pass through the switch. We address this limitation using the recirculation capabilities of modern switches, which allow packets to be fed back to the switch for additional processing. Using a state-of-the-art hardware switch, we show that latency and throughput degradation remain minimal as long as the number of recirculations is low. We therefore propose a hierarchical dataplane routing scheme that nearly eliminates recirculations and runs at line rate.
We make the following contributions:
D2R, a new network architecture that can provide the illusion of a fabric that is always available and policy-compliant even under failures, by performing routing in the data plane (§3).
A P4 implementation of Breadth-first Search and Iterative-Deepening Depth-first Search that can run on software and hardware switches, coupled with an implementation of the Failure Carrying Packets protocol for routing under failures (§4).
A hierarchical routing scheme in the data plane, which decreases the processing requirements on each switch by splitting route computation across switches (§5).
An implementation of the data plane augmentations necessary to compute policy compliant paths (§6).
An evaluation of D2R’s end-to-end routing scheme for different topologies and failure scenarios on software and hardware platforms (§7).
2 Routing under Failures
In this section, we present the challenges faced today in guaranteeing connectivity without packet loss while complying with high-level policies. We examine the role of the control plane and data plane in both distributed and SDN architectures and argue that a refactoring of roles is needed to address the challenges.
2.1 Delay in Reconvergence
A key goal of our work is to ensure that, even when multiple failures occur, packets are delivered without drops as long as network paths exist to the packets’ destinations. We examine if and how this is possible today.
Distributed control planes fall short. Many networks use distributed routing protocols that rely on routers exchanging protocol messages to convey changes in the network topology, for instance, when link failures occur. Each router uses these messages to recompute new forwarding tables to react to its perceived new state of the network. Until the information about failures propagates to all routers in the network, and the network has become quiescent, forwarding tables may not be consistent across routers. During this convergence period—which can last very long [blink]—severe packet losses occur when routes become unavailable [convergenceloss].
Furthermore, information about failures is passed via advertisements that are generated and processed by router software control planes. Francois et al. [igpconvergence] study the behavior of IS-IS protocol convergence times based on different parameters: failure detection, link-state packet (LSP) generation to notify routers of failures, the overhead of flooding LSPs and processing advertisements at each router’s control plane, and updating the RIB and FIB for each LSP. For a 21-node topology geo-distributed across Europe and the USA, they observe high convergence times of over 200-1000ms depending on different control plane parameters; for instance, how the control plane updates the FIB can vastly change convergence times. In other words, switch/router control plane software design can further delay convergence, leading to higher loss rates.
The state-of-the-art approaches to reduce the impact of convergence are (1) designing loop-free convergence protocols [loopfree1, loopfree2, loopfree3], and (2) using pre-computed backup paths to route around failures, e.g., LFA-FRR [frr], DDC [ddc], etc. The former approach uses provably correct mechanisms to break loops during convergence, but can only provide guarantees for a subset of network failures; it still incurs convergence delays due to the switch software control plane processing. Local fast failover mechanisms in the data plane (the latter approach) are widely deployed, but they only provide guarantees for certain failure scenarios (generally single-link failures). Pre-computing backup paths for multiple-failure scenarios leads to an exponential increase in switch memory usage, and any failure scenario that the backup mechanisms do not protect still suffers from convergence problems. We performed an empirical analysis of using LFA-FRR (Loop-Free Alternate Fast Reroute) for protection of every 1-link failure, and we observe that 2-20% of traffic classes are disconnected until convergence under 2- and 3-link failure scenarios.
Modern SDNs also fall short. Another approach to mitigate the impact of convergence is to leverage SDNs. In existing SDNs, a logically central controller manages a network of programmable switches. The controller detects failures, centrally computes forwarding rule changes, and pushes new rules to switches. However, this approach cannot be used to build a fabric which is always available. First, the controller must learn about the failure from network switches, which can incur high latency depending on the placement of the controller in the network. This can be a factor in software-defined wide-area networks (SDWAN) [b4, swan]. Second, after the controller has been notified of a failure scenario and has computed global rule changes for multiple traffic classes in response, implementing these changes in a running network is challenging. The controller may have to update the rules of multiple switches using complex update schedules so that intermediate network states do not lead to inconsistencies like packet loops and drops [decentralizedupdate, kinetic, updatesynthesis]. State-of-the-art SDN update mechanisms can take from around 300ms to the order of minutes to compute and install the update across a network. Further, He et al. [controlplanelatency] measure the latency for programming rules in OpenFlow switches: it can take 10-100ms to add/modify/delete a single rule in the OpenFlow switch tables, and such switch rule delays, mainly due to inefficient switch control plane software, make consistent updates even slower. Overall, even with SDNs, packets encountering failed links may be dropped for extended periods of time until failure notification and consistent update installation have completed.
2.2 Policy Compliance Challenges
In addition to ensuring availability under failures, we desire that packets always obey policies: routing around failures to reach a destination should not violate policies that pertain either to communicating with that destination or to network resource management.
Local failover mechanisms like Fast-Reroute, which are coupled with distributed routing protocols, cannot guarantee that failover paths comply with policies.
Even without failures, implementing network-wide policy-compliant routing in the network control plane is difficult (the underlying problem is NP-complete in general [genesis]). In SDN, the centralized controller has to compute new sets of policy-compliant paths for different flows (identified by packet headers), which can take on the order of seconds to hours. It has to update the network in a manner that even the intermediate steps comply with the policies; this complicates an already arduous process of generating consistent updates. Determining the appropriate distributed control plane configurations where high-level policies are met (using techniques such as [fibbing, propane, synet, zeppelin]) is even harder [zeppelin].
2.3 Our Position
In sum, policy-compliance while rerouting to ensure always-availability is difficult to achieve today. An always available fabric in itself is also hard to achieve.
We argue that the main underlying problem is that a bloated and/or remote control plane is involved in route computations and/or forwarding rule installations under failures. We argue for stripping the control plane of all failure reaction duties and refactoring said duties. In particular, we advocate: (1) pushing recomputation of, and forwarding along, policy-compliant alternate routes entirely to switch dataplanes, where the fastest failure reaction can happen, and (2) leveraging a logically central policy plane to allow programming policies and informing switches of what policies to adhere to for different flows.
Table 1: Policies supported by D2R.
| Policy | Description |
| addMboxChain(flow f, switch m1, switch m2 …) | Chain of middlebox arrays where one middlebox is traversed in each array. Can be coupled with BFS/IDDFS. |
| addPreference(flow f, switch n1, switch n2) | From switch n1, prefer next hop n2 if the link n1→n2 is active. Can be coupled only with IDDFS. |
| addWeightedLB(flow f, switch n, switch next, int weights) | …, choose next-hop next with probability… Can be coupled only with IDDFS. |
3 D2R Architecture
D2R solves the challenges mentioned in §2 and provides always availability and policy-compliance. Key to D2R is that it leverages programmable switch technologies to refactor the distribution of responsibilities amongst different components of the network. We illustrate the D2R architecture in Figure 1. In D2R, the network is divided into three components:
Policy Plane: The centralized policy controller is used by operators to specify the network topology and policy requirements for different flows. The policy plane sends data plane rules to the switch control planes, where the rules encode the network’s topological structure. The policy plane also sends the policies to the end-hosts, which include them in the packet headers.
Control Plane: Our switch control plane is wafer-thin [waferthincontrolplane] and is mainly responsible for programming the data plane with the rules reflecting topology sent by the policy plane. The control plane is also responsible for monitoring link-up events in the switch.
Data Plane: The data plane uses programmable ASICs to run graph traversal algorithms, atop the network topology encoded in the dataplane rules, to compute a policy-compliant route for packets entirely in the data plane. Routing does not rely on the control plane or policy plane on the critical path. For routing under failures, the data planes encode link failure information in the packet header, which is used for traversal. The data plane does not store global link failure state.
We describe the flow of a packet in Figure 1: The packet is tagged at the end-host with the policy header specified by the policy plane. When the packet enters the switch, the switch data plane computes a route through the network taking into account the failed links and the policy, and stores the route as a source route in the packet. The failure information is also included in the packet. The switch parses the route in the header and forwards along the route. The packet is forwarded to the destination. We outline the roles played by each plane in realizing this forwarding behavior in the rest of this section.
3.1 D2R Data Plane
Modern programmable switching ASICs let developers write complex packet processing pipelines that can run at very high speeds. For instance, the state-of-the-art Barefoot Tofino switch can process packets at an aggregate line rate of 6.5Tbps. Thus, in D2R, we move away from the conventional model where the data plane just forwards packets and the control plane runs sophisticated routing algorithms. Instead, the data plane runs traversal algorithms like breadth-first search (BFS) and iterative-deepening depth-first search (IDDFS) to compute a route from the switch to the destination. Thus, when a packet arrives at a switch, the data plane computes a route to the destination and stores the route in the packet header. Subsequent switches use the route information in the packet to forward the packet to the destination. We describe our P4 [p4] implementation of the data plane in §4.
Modern programmable ASICs can detect when a connected link is down and trigger a special packet indicating that the link/port is down. As soon as the failure is detected, the D2R data plane stores this information in a register. When a packet arrives, the data plane uses this updated local link-state and computes a route which does not use the failed link, avoiding any packet drops. This approach solves the problem faced by SDNs, in which failures cause the centralized controller to react and add new forwarding rules in a consistent manner. Our approach is more general than static local fast failover mechanisms, as the data plane can compute a valid next hop dynamically based on the current state of the links connected to the switch.
For correct routing in the network, we need to know the state of links in the entire network. However, we cannot resort to distributed link-state advertisements because this leads to convergence issues and packet losses. We eliminate routing convergence periods by using the Failure Carrying Packets (FCP) protocol [fcp]. In FCP, each packet carries information about all the link failures it has encountered in its path. The switch data plane uses this information to find a route that avoids failed links without actually storing the current global link-failure state. FCP provides guarantees of connectivity under failures without the need for a distributed routing protocol. We describe the FCP protocol and implementation in §A.
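The failure-carrying idea can be illustrated with a small simulation. The following is a minimal Python sketch, not the paper's P4 implementation: the function names (`shortest_active_path`, `forward_fcp`), the adjacency-list topology format, and the representation of a link as a `frozenset` of its two endpoints are all assumptions made for illustration. Each switch routes using only the failures already recorded in the packet, and adds a failure to the packet when it finds that its chosen outgoing link is down.

```python
from collections import deque

def shortest_active_path(adj, src, dst, failed):
    """BFS over links that are not in `failed` (a set of frozenset({u, v}))."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) not in failed:
                parent[v] = u
                queue.append(v)
    return None

def forward_fcp(adj, src, dst, down_links):
    """Forward one packet hop by hop.

    Returns (delivered, switches visited, failures carried in the packet).
    `down_links` is the ground truth; a switch only learns about a failed
    link when it tries to use one of its own adjacent links.
    """
    carried = set()              # hypothetical failure list in the packet header
    at, trace = src, [src]
    while at != dst:
        path = shortest_active_path(adj, at, dst, carried)
        if path is None:
            return False, trace, carried   # no active path is known
        link = frozenset((at, path[1]))
        if link in down_links:
            carried.add(link)    # record the failure, then recompute locally
            continue
        at = path[1]
        trace.append(at)
    return True, trace, carried
```

In this sketch no switch ever waits for a network-wide update: the packet itself carries the failures it has encountered, which is the property the text relies on to avoid convergence periods.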
3.2 D2R Control Plane
The control plane in D2R is distributed across switches, and it has a minimal role. Most importantly, the switch control plane plays no part in the critical path for end-to-end forwarding, and thus, it is not a bottleneck for always availability and is not responsible for policy-compliance under failures. The switch control plane simply programs the data plane with rules provided by the policy plane. These are not forwarding rules; instead, they encode the network topology and any changes that occur to it in the long term, such as planned maintenance, or links/switches getting (de)commissioned. The policy plane, described below, tracks these aspects of the network topology.
Some modern ASICs may not generate a packet when a link comes back up. For such scenarios, the switch control plane uses mechanisms like BFD [bfd] to monitor the status of links and notify the switch data plane of link-up events.
3.3 D2R Policy Plane
D2R provides support for switch and network-wide policies under different failure scenarios. We restrict our support to per-packet policies in the data plane, i.e., computing a packet’s route is independent from other packet routes. To support hyperproperties (a policy constraining the routing behavior of two or more flows), we would need to store routing state of different flows in the data plane, which would consume scarce switch memory resources.
Even for per-packet policies, we need to store the policy information for different flows. We could store the policy state in the switches, but if we needed to change the policies, we would need to reprogram switches, which can lead to downtime (§2). Moreover, unlike planned maintenance and link/switch additions/removals that induce slow topology churn, policy churn is significantly higher and can trigger frequent expensive network updates. Instead, we develop a policy plane which sends the policy information to end-hosts, which are responsible for adding the policy to the packet header. The data plane uses the policy header to generate policy-compliant paths.
The policy plane can also request the current state of network links from switch control planes to generate new policies. Crucially, for the policies D2R supports, policy updates will not trigger reprogramming of the data plane. We describe D2R’s policy support in Table 1 and the data plane implementation in §6.
4 Data plane routing
Given the FCP header in the packet, the D2R data plane computes an active path without going to the control plane. In this section, we first present a primer on programmable switches and P4, the state-of-the-art language used to program these switches. We then present two graph traversal algorithms we implement in D2R: breadth-first search (BFS) and iterative-deepening depth-first search (IDDFS).
4.1 Programmable Switches and P4
Modern programmable switching ASICs [rmt] contain three main components: the ingress pipeline, the traffic manager, and the egress pipeline. A switch can have multiple ingress and egress pipelines serving multiple ingress and egress ports. Packet processing is performed primarily at the ingress pipelines (Figure 2), each of which comprises three programmable components: a parser, a match-action pipeline, and a deparser. To support complex packet processing, each pipeline has multiple stages which process packets in a sequential fashion. Each stage contains dedicated resources (e.g., match-action tables and registers) to process packets at high rates. For instance, the state-of-the-art Barefoot Tofino switch can process packets at an aggregate line rate of 6.5Tbps.
Packet processing can be abstracted as a control flow graph of match-action tables, where each table matches a set of header fields and performs actions based on the match results. While processing a packet, the stages of the ASIC share the packet header and metadata fields (which can be thought of as global memory), and stages can pass information down the pipeline by modifying these headers. The number of stages in programmable switches is limited, and the packet processing logic may not finish within a single pass through the pipeline. In such scenarios, the packet can be recirculated back into the ingress pipeline with updated headers for further processing. Recirculating a packet multiple times consumes switch bandwidth resources (ports are set up in loopback mode for recirculations and cannot be used for physical links) and results in increased latency. Thus, our data plane algorithms must minimize the number of recirculations required for packet processing.
P4 [p4] is the most widely used domain-specific language to program these ASICs. Figure 3 illustrates a simple P4 program that defines an IPv4 routing table and how the table is invoked in the ingress pipeline. While P4 is a programming language, it closely mimics the architecture of programmable ASICs; that is, we cannot express arbitrary algorithms as P4 programs. Thus, we need to take the P4 semantics into account when designing our graph traversal algorithms, and express the steps of the routing algorithms as match-action tables.
4.2 Breadth First Search
We now present the algorithm and P4 implementation for performing breadth-first search (BFS) in the network. BFS has the advantage of finding paths with the least number of hops. Traditionally, BFS explores the switches of the graph using a first-in-first-out (FIFO) queue. However, since currently P4 only supports stack data structures, we implement a modified BFS algorithm in P4 which uses only stacks and preserves the following invariant: a switch at a lower depth (number of hops from the source) is explored before any switch at a higher depth. The only difference from a queue-based implementation will be the relative ordering of explored switches at each depth. We present our stack-based BFS algorithm in Algorithm 1 and in the rest of the section.
In programmable switches, the amount of memory to store packet headers and metadata is limited. Since the BFS stacks need to be processed by every stage, we need to store them as header fields (we do not emit these stacks in the deparser, as they are not required for correct forwarding in the network); thus, we must limit the number of stacks used. Our BFS algorithm uses two stacks, one for odd-depth switches and one for even-depth switches: when we are exploring switches of odd depth, we push their neighbors, which lie at the next (even) depth, onto the even-depth stack, and vice versa, eliminating the need for more than two stacks.
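The two-stack traversal can be sketched in ordinary Python before it is mapped to match-action tables. This is an illustrative model under stated assumptions, not Algorithm 1 itself: the function name `two_stack_bfs`, the `parent` map used to recover the path, and the `frozenset` link representation are all hypothetical. The sketch keeps the invariant from the text, namely that every switch at a lower depth is explored before any switch at a higher depth, while only ever pushing and popping.

```python
def two_stack_bfs(adj, src, dst, failed=frozenset()):
    """Return a minimum-hop path from src to dst using only two stacks.

    stacks[p] holds discovered switches whose depth has parity p.
    `failed` is a set of frozenset({u, v}) links that must be avoided.
    """
    stacks = ([src], [])         # the source sits at depth 0 (even)
    parent = {src: None}         # doubles as the visited set
    parity = 0
    while stacks[parity]:
        curr = stacks[parity].pop()
        if curr == dst:
            path = []
            while curr is not None:
                path.append(curr)
                curr = parent[curr]
            return path[::-1]
        for nxt in adj[curr]:
            if nxt not in parent and frozenset((curr, nxt)) not in failed:
                parent[nxt] = curr
                stacks[1 - parity].append(nxt)   # next depth, other stack
        if not stacks[parity]:
            parity = 1 - parity  # current depth exhausted: swap stacks
    return None
```

Within one depth the switches are explored in last-in-first-out order rather than first-in-first-out order, which is the only difference from a queue-based BFS that the text mentions; the hop count of the returned path is unaffected.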
We now break down how we translate Algorithm 1 to P4. The building block of our BFS algorithm is the following P4 match-action table:

    table bfs {
        key = {
            hdr.curr        : exact;
            hdr.visited_vec : ternary;
            hdr.stack       : exact;
        }
        actions = { push_neighbor; pop_stack; change_stack; }
    }
We initialize curr to the switch that is computing a path to the destination. visited_vec is a bitvector whose size is equal to the number of bidirectional links. For each link i, visited_vec[i] is set to 1 if link i has been visited or has failed, and to 0 otherwise (indices start from 1 at the rightmost bit of the vector). We set the bits of all failed links obtained from the FCP header to 1. Consider the example in Figure 4. If a link has failed, we set its bits in visited_vec; in the example, these are the 1st and 2nd bits, giving 0000 0011. We also set all incoming links to curr to 1, so that BFS does not visit curr later in the algorithm. For our 2-stack implementation, we denote switches at odd depth with stack = 1, and switches at even depth with stack = 0.
Let m be a switch at odd depth. BFS explores a neighbor n that is unvisited and connected to m by an active link (line 9), and puts n in the even-depth stack (line 9). To map our algorithm into the P4 programming model, we translate the if condition to a table match rule, and the code executed based on the if condition to one of the table actions.
If the link ID of the edge from m to n is id, then we explore n from m only if visited_vec[id] = 0. Consider the example in Figure 4. If m = 1, we explore the edge with link ID 1 only if visited_vec[1] = 0. P4 supports the ternary match kind for bitvectors, where we can specify an exact value (0/1) or a wildcard for each bit; we use the ternary match to check the relevant bit in visited_vec. Thus, the match fields for exploring this edge would be as follows (depending on whether m is at odd or even distance from the source):

    curr: m=1, visited_vec: *******0, stack: 0
    curr: m=1, visited_vec: *******0, stack: 1

If the above match succeeds, we need to push n onto the stack for the next depth. We also set all the bits corresponding to incoming edges of n to 1 in visited_vec; thus, BFS will not explore n again. For this, we define the following action, which implements lines 9-10. (P4 targets may not support specifying header fields as indices; we define two actions, push_neighbor_0 and push_neighbor_1, to push onto the two stacks respectively. We elide these details for simplicity.)

    action push_neighbor(n, n_visited) {
        Stack[~hdr.stack].push(n);
        hdr.visited_vec = hdr.visited_vec | n_visited;
    }

Figure 4 shows the action parameters when we explore this edge. Once all neighbors of m are explored, the BFS algorithm pops the next element from the stack of the current depth (odd or even) and repeats the process of exploring the neighbors (lines 13-15). To check whether all neighbors of m have been explored, we again use the ternary match to check whether all bits corresponding to outgoing links of m are 1. If so, we update curr to the top switch of the stack. For example, if m = 1, the links with ID 1 and 5 must be explored:

    match
        curr: m=1, visited_vec: ***1***1, stack: 0
        curr: m=1, visited_vec: ***1***1, stack: 1
    action pop_stack() {
        hdr.curr = Stack[hdr.stack].pop();
    }
Finally, once we have explored all switches in the stack, we need to proceed to the switches at the next level. To match for this condition, we place a special switch "0" at the bottom of the stack and swap stacks once curr = 0. After switching stacks, we pop the top element of the new stack to start exploring its neighbors.

    match
        curr: 0, visited_vec: ********, stack: 0
        curr: 0, visited_vec: ********, stack: 1
    action change_stack() {
        hdr.stack = ~hdr.stack;
        hdr.curr = Stack[hdr.stack].pop();
    }
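The ternary patterns used in these rules can be modelled as a (value, mask) pair applied to the bitvector. The following Python sketch is an illustration only: the helper names (`parse_ternary`, `ternary_match`, `lookup`), the rule list, and the action labels are assumptions, and the rule for link 5 is inferred from the statement that switch 1 has outgoing links with IDs 1 and 5. Rules are tried in order to mimic rule priority, so the first matching rule wins.

```python
def parse_ternary(pattern):
    """Convert a pattern such as '***1***1' into (value, mask) integers.

    The leftmost character is the most significant bit; '*' is a wildcard.
    """
    value = mask = 0
    for ch in pattern:
        value <<= 1
        mask <<= 1
        if ch != "*":
            mask |= 1
            value |= int(ch)
    return value, mask

def ternary_match(bits, pattern):
    """True if every non-wildcard bit of `pattern` equals the bit in `bits`."""
    value, mask = parse_ternary(pattern)
    return (bits & mask) == value

# Hypothetical rules for curr = 1, whose outgoing links have IDs 1 and 5.
BFS_RULES_CURR_1 = [
    ("*******0", "push_neighbor via link 1"),   # bit 1 clear: explore link 1
    ("***0****", "push_neighbor via link 5"),   # bit 5 clear: explore link 5
    ("***1***1", "pop_stack"),                  # both links done: move on
]

def lookup(visited_vec, rules=BFS_RULES_CURR_1):
    """Return the action of the first rule that matches, or None."""
    for pattern, action in rules:
        if ternary_match(visited_vec, pattern):
            return action
    return None
```

For instance, with the failed-link vector 0000 0011 from the example, `lookup(0b00000011)` skips the first rule because bit 1 is already set and selects the rule for link 5.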
According to the P4 semantics, only one match-action rule will be triggered per table application. The match condition depends on the current packet headers, priorities and ordering of rules in the table. In the ingress pipeline, when a table is applied, the switch will execute the action code corresponding to the matched rule. Thus, a single bfs table application cannot perform the entire traversal. We apply multiple bfs tables to perform BFS from the source till curr is the destination switch.
    control ingress {
        bfs_1.apply();
        if (hdr.curr != hdr.dst) bfs_2.apply();
        else forwarding.apply();
        if (hdr.curr != hdr.dst) bfs_3.apply();
        else forwarding.apply();
        …
        // End of pipeline
        if (hdr.curr == hdr.dst) forwarding.apply();
        else recirculate();
    }
Each bfs table reads and writes the packet headers (curr, visited_vec, stack) which are passed down in the pipeline to the next bfs table. Therefore, there is a Read-After-Write (RAW) dependency between the bfs tables. Thus, they cannot be placed in the same stage [compiling-dependencies]. Switches only have a bounded number of stages (10) in the ingress pipeline. Therefore, our BFS algorithm may not reach the destination in those stages. To overcome the limitation of bounded number of stages, we can repeatedly recirculate the packet back into the ingress pipeline with the headers at the end of the pipeline. This effectively resumes the BFS algorithm, and we keep applying the bfs table till we find the destination in the algorithm. To avoid recirculations, we implement a source routing flavor of BFS— the route is stored in the packet headers and downstream switches can use the source route and avoid route recomputations (thus, recirculations). We also propose a hierarchical routing scheme in §5 to further reduce recirculations by splitting route computations across switches.
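The relationship between traversal steps, pipeline stages, and recirculations can be made concrete with a back-of-the-envelope model. The sketch below assumes, for illustration only, that each pass through the ingress pipeline applies a fixed number of bfs tables and that each table application performs exactly one traversal step (one push, pop, or stack change); the function name `recirculations_needed` is hypothetical.

```python
import math

def recirculations_needed(table_steps, tables_per_pass):
    """Recirculations required to execute `table_steps` table applications
    when one pass through the pipeline applies `tables_per_pass` tables."""
    if tables_per_pass <= 0:
        raise ValueError("tables_per_pass must be positive")
    if table_steps <= 0:
        return 0
    passes = math.ceil(table_steps / tables_per_pass)
    return passes - 1            # the first pass is not a recirculation
```

Under this model, a traversal that needs 25 steps on a pipeline applying 10 tables per pass takes three passes, that is, two recirculations, which is why reducing the number of traversal steps per switch matters.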
4.3 Iterative Deepening Depth First Search
Another form of graph traversal that can yield paths while exploring fewer switches (thus, fewer recirculations) is Depth-first Search (DFS). However, without bounds on the path length, DFS can produce very long paths compared to BFS. This is not ideal, especially in wide-area settings. We implement a variant of DFS called Iterative Deepening DFS (IDDFS), which explores switches in a manner similar to DFS while imposing bounds on the length of the discovered paths, and iteratively increases the bound when needed. We present our IDDFS algorithm in Algorithm 2.
IDDFS works similarly to DFS with one major modification: we keep track of the length of the current path from src (len) and will not explore neighbors if the length of the path exceeds the max length path. Thus, IDDFS provides bounds on the path length and will eventually find a path if one exists within the bound. If a path within the bound does not exist, we perform a new DFS with an increased bound. IDDFS is linear in complexity. In the worst case, it explores switches.
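The bounded search and the iterative increase of the bound can be sketched as follows. This is the textbook recursive formulation of iterative deepening, written in Python for illustration, and it is not Algorithm 2: it avoids revisiting switches on the current path rather than tracking visited links in a bitvector, so its cost differs from the data plane version. The names `iddfs` and `depth_limited`, and the `frozenset` link representation, are assumptions.

```python
def iddfs(adj, src, dst, failed=frozenset(), max_bound=None):
    """Return a path from src to dst found by depth-limited DFS with an
    increasing bound, or None if no path exists within `max_bound` hops.

    `failed` is a set of frozenset({u, v}) links that must be avoided.
    """
    def depth_limited(path, bound):
        curr = path[-1]
        if curr == dst:
            return path
        if len(path) - 1 >= bound:       # len(path) - 1 is the hop count so far
            return None                  # bound reached: backtrack
        for nxt in adj[curr]:
            if nxt in path or frozenset((curr, nxt)) in failed:
                continue
            found = depth_limited(path + [nxt], bound)
            if found is not None:
                return found
        return None                      # no neighbor left: backtrack

    # A simple path visits each switch at most once, so it has at most
    # len(adj) - 1 hops; use that as the default upper bound.
    limit = len(adj) - 1 if max_bound is None else max_bound
    for bound in range(1, limit + 1):    # increase the bound iteratively
        result = depth_limited([src], bound)
        if result is not None:
            return result
    return None
```

Because the bound grows one hop at a time, the first path returned has the minimum hop count in this formulation, at the price of repeating the shallow part of the search in every iteration.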
Similar to BFS, we create a P4 table which acts as the building block of our IDDFS algorithm.

    table iddfs {
        key = {
            hdr.curr        : exact;
            hdr.visited_vec : ternary;
            hdr.len         : exact;
            hdr.max_len     : exact;
        }
        actions = { goto_neighbor; backtrack; increase_length; }
        default_action = backtrack();
    }

Similar to BFS, we add table rules to check if certain edges are visited/failed (using ternary match) and explore neighbors. Backtracking occurs when we have no neighbor to visit from a switch. Finally, we increase the maximum path length when the stack is empty, i.e., when we have explored all switches at the specified maximum length but did not reach the destination. The P4 implementation details are in §B.
As with BFS, each invocation of the iddfs table can lead to one action execution. Thus, we add tables staged one after the other (due to the RAW dependency). At the last stage, if we have not found the destination, we recirculate the packet again. Similar to BFS, we implement source routing for IDDFS. IDDFS requires source routing for correctness, as it does not compute the shortest path to the destination. Consider the topology in Figure 4. Suppose switch 1 uses IDDFS to compute a route whose next hop is switch 2 (but does not store the route in the packet) and sends the packet to switch 2. Switch 2 now performs its own IDDFS, computes a route that leads back through switch 1, and sends the packet back to 1; thus, the packet will keep oscillating. Oscillation is circumvented by source routes: switch 2 will simply use the source route to send the packet to 4.
5 Hierarchical Routing
In the BFS and IDDFS algorithms we presented in §4, the computation and memory requirements on each switch increase as the network size increases. First, increased memory requirements lead to complex resource fitting problems on the switch. Second, and most importantly, an increased number of traversal computations leads to more recirculations that consume precious switch capacity. In §7.1, we measure the recirculation incurred by IDDFS for a network with 126 links and observe that IDDFS routing can incur up to 10 recirculations to compute routes. This limitation of BFS and IDDFS routing raises the question:
Can we avoid computing paths for the entire network and decrease the number of recirculations?
Our approach is inspired by OSPF’s idea of dividing the network into areas to avoid large link-state databases on routers. In the rest of the section, we present D2R hierarchical routing, a routing mechanism that reduces recirculation overhead.
Routing Across Domains.
We divide the network into domains and construct a domain graph based on the domain adjacencies, i.e., if a switch in one domain is connected to a switch in another domain, we add an edge between the two domains in the domain graph. Hierarchical routing works as follows: (1) The source switch computes the domain path to the destination domain and stores the path in the packet header. (2) The source switch then computes the intra-domain network path to a switch that belongs to the next domain in the domain path, and sends the packet to that domain, and so on until we reach the destination domain. (3) The switch in the destination domain finds a path to the destination. In summary, instead of finding the complete network path in a single switch, we split the computation across multiple domains, and the first switch is also responsible for finding a path in the domain graph. We implement the switching logic in our ingress table actions (details omitted for brevity).
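As a rough illustration of step (1), the following Python sketch builds a domain graph from link adjacencies and finds a domain path with BFS. The topology, the `domain_of` mapping, and the function names are hypothetical and only loosely modeled on the two-domain example used later in this section; the data plane performs this traversal with the BFS/IDDFS tables, not with Python.

```python
from collections import deque


def build_domain_graph(links, domain_of):
    """Add a domain-level edge for every link whose endpoints differ in domain."""
    graph = {d: set() for d in set(domain_of.values())}
    for u, v in links:
        du, dv = domain_of[u], domain_of[v]
        if du != dv:
            graph[du].add(dv)
            graph[dv].add(du)
    return graph


def domain_path(graph, src_domain, dst_domain):
    """BFS over the domain graph; returns the list of domains to traverse."""
    parent = {src_domain: None}
    queue = deque([src_domain])
    while queue:
        d = queue.popleft()
        if d == dst_domain:
            path = []
            while d is not None:
                path.append(d)
                d = parent[d]
            return path[::-1]
        for nd in sorted(graph[d]):
            if nd not in parent:
                parent[nd] = d
                queue.append(nd)
    return None


# Hypothetical topology: switches 1-3 in domain 128, switches 4-6 in domain 129
links = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6)]
domain_of = {1: 128, 2: 128, 3: 128, 4: 129, 5: 129, 6: 129}
graph = build_domain_graph(links, domain_of)
print(domain_path(graph, 128, 129))
```

The resulting domain path is what the source switch would store in the packet header before starting intra-domain routing.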
We extend our graph algorithms to perform a traversal over the domain graph and store the domain path in the header, which is then used by the switches to find a path through each of the domains. We use the BFS/IDDFS tables defined in §4 for finding both the domain path and the network path, and differentiate between the two modes using a header field in the table match conditions: hdr.hierarchy = 1 means we are finding a domain path, and hdr.hierarchy = 0 means we are finding a path inside the domain.
Routing Inside Domains.
A switch has to find a route to one of the switches in the next domain. We modify the topology of each domain to add a special switch for each neighboring domain. We take all inter-domain links and connect them to the special domain switch. We illustrate this augmentation in Figure 5. Thus, to find a path to the next domain d, we set hdr.destination = d and perform BFS/IDDFS on the modified topology, thereby finding a valid path to the next domain. Consider a packet from S to T in Figure 5. Switch 1 will first compute the domain path to T, whose next domain is 129. Then, it will perform intra-domain BFS/IDDFS to 129 in the augmented intra-domain graph and will reach either switch 4 or 5 (depending on whether the route is computed through 2 or 3, respectively). Switches 2 and 3 will have forwarding rules to send the packet to 4 and 5, respectively. Once the packet has reached a switch in domain 129, the switch can perform intra-domain routing to reach the destination.
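The augmentation can be sketched as follows, assuming (as in the example above) that the special switch for a neighboring domain is identified by that domain's number and that switch and domain identifiers do not collide. The helper names and the topology are hypothetical.

```python
from collections import deque


def augment_domain(links, domain_of, domain):
    """Intra-domain adjacency plus one special switch per neighboring domain."""
    adj = {s: set() for s, d in domain_of.items() if d == domain}
    for u, v in links:
        du, dv = domain_of[u], domain_of[v]
        if du == domain and dv == domain:
            adj[u].add(v)
            adj[v].add(u)
        elif du == domain:          # inter-domain link: attach u to special switch dv
            adj[u].add(dv)
            adj.setdefault(dv, set()).add(u)
        elif dv == domain:          # inter-domain link: attach v to special switch du
            adj[v].add(du)
            adj.setdefault(du, set()).add(v)
    return adj


def bfs_path(adj, src, dst):
    """Plain BFS used here only to exercise the augmented graph."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


links = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6)]
domain_of = {1: 128, 2: 128, 3: 128, 4: 129, 5: 129, 6: 129}
aug = augment_domain(links, domain_of, 128)
print(bfs_path(aug, 1, 129))   # last real switch on the path is the exit switch
```

The last real switch on the returned path is the one whose forwarding rule hands the packet to the next domain.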
Hierarchical Routing under failures.
We modify the FCP failure vectors to account for inter-domain link failures. Consider the example in Figure 5: the domain link 128 - 129 can be marked as failed only if both the 2 - 4 and 3 - 5 links have failed. Thus, we can create a mapping from the network failure vector to a domain failure vector (implemented using a match-action table); the domain failure vector can then be used to perform traversal on the domain graph. However, unlike normal FCP routing, hierarchical routing does not provide strict guarantees of reachability: if a domain becomes internally disconnected, we may not find a route to the destination even if one exists. To provide strict routing guarantees, we fall back to single-domain routing (the whole network is a single domain) whenever a switch is unable to find a route using the hierarchical routing rules.
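The mapping from the network failure vector to the domain failure vector can be illustrated with bit masks. In this sketch the bit assignment of links and the function name are assumptions; in the data plane the same check would be expressed as match-action rules rather than a loop.

```python
def domain_fail_vector(link_fail_vec, domain_links):
    """Mark domain link i as failed only if all of its network links failed.

    link_fail_vec: int bit-vector, bit k set means network link k is down
    domain_links:  list where entry i is the bitmask of network links
                   that realize domain link i
    """
    vec = 0
    for i, mask in enumerate(domain_links):
        if link_fail_vec & mask == mask:
            vec |= 1 << i
    return vec


# Assumed bit layout: bit 2 and bit 3 are the two inter-domain links
# that together realize domain link 0.
domain_links = [0b1100]
print(domain_fail_vector(0b0100, domain_links))  # one inter-domain link down
print(domain_fail_vector(0b1100, domain_links))  # both inter-domain links down
```

Only the second call marks the domain link as failed, mirroring the rule that a domain link fails when every underlying link has failed.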
6 Policy Implementation
The D2R data plane can find compliant routes for the packet policies listed in Table 1, even under failures. The operator specifies the policies to the policy plane using the API, and the policy plane specifies the policy information which must be sent on the packet, which is used to guide the traversal. In this section, we present the modifications to our IDDFS implementation to support policies. D2R has the following failure semantics: if there exists a policy-compliant route in the network, the data plane algorithm will find it.
6.1 Middlebox Chaining
With the emergence of NFV [opennf, e2], operators can place middleboxes at different locations in the network to perform different network functions, e.g., firewalls, intrusion detection, traffic optimizers, etc. With the middlebox chaining policy, operators can specify a chain of middleboxes m1 → m2, and the data plane must compute a path from src to m1, then to m2, and then to the destination. The middleboxes and destination are encoded in the packet header, and the data plane sets hdr.dst = m1, so that IDDFS will find a route to m1. Once the path to m1 is found, we set hdr.destination = m2 and restart traversal from m1, and so on until the switch computes a route to the destination. The other switches can use the computed route for forwarding. Under failures, a switch will be able to compute a new route through the middleboxes.
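The chaining logic amounts to running one traversal per waypoint and concatenating the segments. The sketch below uses a plain BFS (`find_path`) in place of IDDFS purely for brevity; both helper names and the topology are hypothetical.

```python
from collections import deque


def find_path(adj, failed, src, dst):
    """BFS over active links; stands in for the data-plane traversal."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) not in failed:
                parent[v] = u
                queue.append(v)
    return None


def chain_route(adj, failed, src, chain, dst):
    """Route src -> m1 -> m2 -> ... -> dst, restarting traversal at each waypoint."""
    path, curr = [src], src
    for target in list(chain) + [dst]:
        segment = find_path(adj, failed, curr, target)
        if segment is None:
            return None            # no compliant route exists
        path += segment[1:]
        curr = target
    return path


adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(chain_route(adj, set(), 1, [3], 4))
print(chain_route(adj, {frozenset((3, 4))}, 1, [3], 4))
```

In the second call the link between the middlebox (switch 3) and the destination is down, so the compliant route returns through switch 1 after visiting the middlebox.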
We also support specifying middlebox replicas. With support for disjunctions in P4 conditional statements, we can end IDDFS when hdr.curr is equal to any of the middlebox instances (equivalently, IDDFS keeps running only while hdr.curr differs from all of them). So, we add multiple fields in the header to store the replicas and modify our ingress pipeline as follows:

    control ingress {
        …
        if (hdr.curr != hdr.dst and hdr.curr != hdr.dst…) {
            // apply iddfs
        } else {
            // switch to next middleboxes/destination
            // or forward to next-hop
        }
    }

Note that enforcing the middlebox policy will not incur any additional rules or per-flow state; the policy in the packet header will specify the middlebox chain and replicas, which will be read by the data plane.
6.2 Next-hop Preferences
Operators may designate a most-preferred path among the multiple paths available to a destination, so that the fabric prefers or avoids certain paths for cost or performance reasons. Preferences can be used by the operator to send a particular class of traffic through a geographical domain which has higher bandwidth or is less prone to malicious entities. D2R supports next-hop preferences (akin to BGP local preferences), which specify, at a given switch, the best next-hop for the packet. To enforce this policy in the data plane, we need to ensure that when our traversal reaches that switch, it chooses the preferred next-hop if the link to it is active and it routes to the destination. For next-hop preference, we use the IDDFS traversal to find a route. In IDDFS, the hop which is explored first is the most preferred hop (as IDDFS will move to that hop and continue from there until it finds the route to the destination); thus, we need to enforce that the rule for the preferred hop is matched first in IDDFS. We cannot use rule priorities as they will require control plane intervention for different policies.
We add a new longest prefix match (lpm) field to the iddfs table: hdr.pref. For each switch and next-hop, the policy plane decides the pref value to guide IDDFS towards the most preferred hop. We illustrate the preferences using an example in Figure 6. Suppose the policy specifies that 4 is the most preferred hop from 1, for which the pref value is set to 11. By virtue of the lpm match, the rule for switch 4 will be the most preferred rule and IDDFS will explore 4. Similarly, if we set pref = 10, the switch will match the rule for switch 3, and switch 3 will be the most preferred hop. Finally, if we set pref = 00, all rules are valid matches with equal-length prefixes (**). According to the P4 switch semantics, the first rule will be matched, and IDDFS will explore switch 2. The policy plane is responsible for specifying the right preference value in the packet depending on the policy, and the data plane will explore the appropriate hop if it is active. We do not support backup preferences in the data plane (prefer b1, then b2, etc.). However, if the preferred link is down, we ensure we pick an active route (to ensure high availability).
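The effect of the lpm field can be mimicked in a few lines of Python. The rule layout below (one wildcard rule per neighbor plus a longer prefix for neighbors 3 and 4) is an assumption chosen to reproduce the behavior described for the example; the actual rules are those shown in Figure 6.

```python
def pick_next_hop(rules, pref, active):
    """Longest-prefix match on pref among rules whose next-hop is active.

    rules:  list of (prefix, next_hop) in table order; "" is the wildcard (**)
    pref:   preference bits carried in the packet, e.g. "11"
    active: set of next-hops whose links are currently up
    Ties between equal-length prefixes go to the earliest rule.
    """
    best = None
    for prefix, hop in rules:
        if hop in active and pref.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, hop)
    return None if best is None else best[1]


# Assumed rules at switch 1 with neighbors 2, 3 and 4
rules = [("", 2), ("", 3), ("", 4), ("10", 3), ("11", 4)]
for pref in ("11", "10", "00"):
    print(pref, pick_next_hop(rules, pref, {2, 3, 4}))
print("11 with link to 4 down:", pick_next_hop(rules, "11", {2, 3}))
```

When the preferred link is down, the longer prefix no longer applies and the match falls back to the first active wildcard rule, which is how the sketch models picking an active route.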
6.3 Flexible Weighted Load-Balancing
One of the key responsibilities of network routing is load-balancing: sending different flows on different paths to manage network capacity. D2R supports flexible WCMP [wcmp] in the data plane, i.e., the packet will carry the WCMP weights for a switch, and the switch's data plane will find a route by picking a next-hop with a probability determined by the weights specified in the packet. The data plane logic does not depend on any particular set of weights. Thus, we can simply change the weights in the packet and the data plane will perform load-balancing according to the new weights. In current networks, the control plane needs to add a set of rules based on fixed WCMP weights; if one needs to change weights, the control plane needs to modify the data plane, and if one of the next-hop links has failed, the switch will drop packets.
We illustrate how D2R avoids this problem. Consider the switch in Figure 6. Assume the policy in the packet specifies load-balancing weights as 1:2:1. We use the preferences presented in §6.2 to load-balance flows according to the weights in the packet. The data plane should set hdr.pref = 00 with probability 1/4 for switch 2, 10 with probability 2/4 for switch 3, and finally, 11 with probability 1/4 for switch 4. Thus, flows will be load-balanced at switch 1 with weights 1:2:1. P4 switches have support for generating hashes from the packet header fields, which D2R uses to decide the next-hop preference in a probabilistic manner. To support flexible WCMP, we use Boolean operations in a preprocessing table to map the random hash to a preference value based on the input weights. In the face of failures, we prefer a next-hop from the active next-hops with the same relative weights. For example, if the policy for switch 1 in Figure 6 is 1:2:1 and the link to switch 3 is down, the links to switches 2 and 4 will be preferred in a 1:1 ratio. For brevity, we elide the P4 implementation details.
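One way to picture the hash-to-preference mapping is as a walk over cumulative weights. The sketch below is not the elided P4 preprocessing table: the function names, the use of CRC32 as the flow hash, and the flow key format are assumptions made for illustration.

```python
import zlib


def pref_for_bucket(bucket, entries):
    """Walk cumulative weights; entries are (pref, next_hop, weight)."""
    for pref, _hop, weight in entries:
        if bucket < weight:
            return pref
        bucket -= weight
    return None


def wcmp_pref(flow_key, entries, active):
    """Pick a preference value for a flow among the active next-hops."""
    live = [e for e in entries if e[1] in active and e[2] > 0]
    total = sum(w for _p, _h, w in live)
    if total == 0:
        return None
    bucket = zlib.crc32(flow_key) % total   # stand-in for the switch hash unit
    return pref_for_bucket(bucket, live)


# Weights 1:2:1 for next-hops 2, 3 and 4 at switch 1
entries = [("00", 2, 1), ("10", 3, 2), ("11", 4, 1)]
print([pref_for_bucket(b, entries) for b in range(4)])
live = [e for e in entries if e[1] != 3]     # link to switch 3 is down
print([pref_for_bucket(b, live) for b in range(2)])
print(wcmp_pref(b"10.0.0.1:5000->10.0.0.2:80", entries, {2, 3, 4}))
```

Of the four buckets, one maps to switch 2, two to switch 3, and one to switch 4; once switch 3 is excluded, the two remaining buckets split 1:1.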
Policy Support Limitations.
We currently do not support next-hop preferences and weighted load balancing policies with BFS traversal. BFS explores multiple routes simultaneously, so choosing one of the BFS routes which comply with the policy requires more complicated processing in the tables and increased header state, thus, inflating the number of stages required to find the path (thus, more recirculations). BFS works in conjunction with middlebox policies.
We can only support limited policies with hierarchical routing. For example, middlebox traversals cannot be completely enforced, as a switch only computes a path within a domain; we can only specify intra-domain middleboxes. Path preferences and load balancing can be supported with hierarchical routing as they only guide traversal to pick a particular next-hop. These limitations are subject to future work.
7 Implementation and Evaluation
The implementation of the D2R data plane consists of 2000 lines of P4 targeting the P4 software behavioral model [bmv2], which emulates the behavior of programmable switch architectures. The implementation of the policy plane consists of 3000 lines of Python that use the topology specification to generate the table rules for each P4 switch in the topology. The policy plane hands off these rules to the switch control plane, which uses the switch APIs to install the table rules in the software bmv2 and hardware switch.
Both of our graph traversal algorithms (i.e., BFS and IDDFS) use 10 stages to run the tables outlined in §4. Note that the number of stages is configurable based on the switch resource requirements and other potential applications running on the switch (e.g., firewalls, ACLs etc.). In our experiments, we store 8 hops in the source route of the packet.
We evaluate the effectiveness of D2R routing using the Internet Zoo topologies [zoo] (5-70 switches, 10-150 links) and use failure scenarios varying between 1 and 3 links. We ask the following questions:
Can D2R find paths with low path stretch and few recirculations for different topologies under different failure scenarios? (§7.1)
Can D2R hierarchical routing reduce the number of recirculations? How do domain sizes affect the number of recirculations and the path stretch? (§7.2)
Can D2R provide end-to-end connectivity in an emulated network and on real hardware? (§7.3)
In hardware, switches generate a packet to indicate that a link is down, and the delay between the actual link failure and packet generation is a few microseconds. At 10Gbps, the data loss occurring between the actual failure event and the data plane reacting to the failure will be on the order of kilobits, i.e., 1-2 packets (thus, nearly zero drops). For the rest of the section, we assume that failure detection is instantaneous.
7.1 Routing Effectiveness
In this section, we evaluate D2R's ability to find routes using BFS and IDDFS, and measure the number of recirculations and path stretch incurred by both techniques. For these experiments, we generate packets for all pairs of endpoints in the network and emulate the data plane behavior using bmv2. We simulate the network by analyzing the output packet from bmv2 and "forwarding" it to the next switch. To evaluate D2R under failures, we generate 20 failure scenarios for each number of failed links (1 to 3), and observe the routing behavior for all-to-all traffic.
Even with source routing, packets could undergo route computations at multiple switches (either due to failures or partially stored paths), thus, we report the total network recirculation in Figure 11. We define the path stretch as the ratio of the actual path taken by the packet in the D2R network compared to the shortest active path in the network (computed by an oracle using BFS). We report the stretch for the networks in Figure 16.
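For concreteness, the stretch metric can be computed as in the sketch below, assuming that path length is measured in hops and that the oracle is a BFS over the links that are actually up. The function names and the example topology are hypothetical.

```python
from collections import deque


def shortest_hops(adj, failed, src, dst):
    """Oracle: hop count of the shortest active path, or None if unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist and frozenset((u, v)) not in failed:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


def path_stretch(taken, adj, failed):
    """Hops actually taken divided by hops on the shortest active path."""
    best = shortest_hops(adj, failed, taken[0], taken[-1])
    if not best:                   # unreachable, or source equals destination
        return None
    return (len(taken) - 1) / best


adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(path_stretch([1, 2, 4], adj, set()))
print(path_stretch([1, 2, 1, 3, 4], adj, {frozenset((2, 4))}))
```

A packet that first walks into a failed link and then detours takes four hops where the oracle needs two, giving a stretch of 2.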
In the absence of failures, D2R can find routes using few recirculations (average <7) for the different networks, and we observe more recirculations as the network size increases. This increase is expected; we need more table invocations to explore the switches and links to find a route. Recirculations are mostly confined to the source switch that stores the path (8 hops); the remaining switches can forward the packet to the destination based on the source route. We observe a stretch of 1 for BFS because it finds the shortest path. For our experiments, we start IDDFS with a maximum length of 4 and increase it by a factor of 2 at each step. Thus, IDDFS incurs a higher stretch as it does not always find the shortest paths (though the stretch will be bounded). By configuring the starting depth and increments, we can achieve stretch comparable to BFS.
In the presence of failures, the FCP algorithm kicks in and D2R needs to recompute paths on multiple switches as packets learn about new link failures. Moreover, a switch does not have the full view of the current topology and it may compute a route through a failed link which the packet has not seen yet, resulting in higher path stretch. As expected, we observe more recirculations and higher stretch when failures occur. We also observe higher variance in recirculation compared to the failure-free case because routing using FCP highly depends on the topology and failure scenario. We do not see a significant increase in recirculations and stretch as the number of failed links increases, as the number of packets traversing a route that encounters all link failures will be low.
We evaluate the effect of policies on recirculation and stretch. Note that preferences and weighted load-balancing policies do not incur any additional recirculation or increased stretch as they simply change the order in which the next-hop is explored, thus, traversal will not use additional switch stages. We evaluate the number of recirculations and stretch for middlebox policies in Figure 19 for different topologies under no failures. We generate 100 policies for random endpoints which traverse one middlebox (chosen at random), and the data plane computes a route from source to destination through the middlebox. We observe higher recirculations as a middlebox policy is effectively two traversals in the switch. Path stretch trends are similar to routing without policies, as the shortest compliant path also becomes longer.
In summary, we show that both BFS and IDDFS can perform routing with recirculations on average, even under failures. Path stretch under failures is less than 2, thus, FCP does not incur very high stretch despite having a partial view of failure information.
7.2 Hierarchical Routing
We now evaluate whether hierarchical routing reduces the number of recirculations. For clarity of exposition, we focus on two topologies from Internet Zoo: NetworkUsa with 35 switches and 78 links, and Cesnet (201006) with 52 switches and 126 links. For both networks, we partition the network randomly into 3 domains of roughly equal size. For endpoints residing in the same domain, D2R will not perform hierarchical routing. We consider all pairs of switches as endpoints and plot the cumulative distribution of the total network recirculation in Figure 24(a,b) and path stretch in Figure 24(c,d) for all routing strategies.
Hierarchical routing results in a significant reduction in recirculation for both networks and for both BFS and IDDFS: the maximum recirculation suffered in Cesnet with hierarchical routing is 1, compared to recirculations without hierarchies. Remarkably, hierarchical routing achieves 0 recirculations for the majority of endpoints in both networks (50-60% in NetworkUsa, 75-80% in Cesnet). Finally, most of the traffic suffers low stretch even with hierarchical routing (80% traffic have a stretch with hierarchies).
We conclude this section by evaluating the effect that varying the number of domains has on the number of recirculations and path stretch (Figure 27). As the number of domains increases, the average recirculations decreases (more domains implies each domain is smaller, thus less computation). The effect on stretch with changing domains is harder to analyze, as the stretch depends on the topology structure and how the domains are assigned. However, we note that generally higher stretch is incurred when the number of domains is because a switch computes routes on a partial topology.
In summary, hierarchical BFS and IDDFS are able to eliminate recirculations for the majority of the endpoints without a significant increase in average path stretch (1.2-1.6).
7.3 D2R in the Real World
We run D2R on a Stordis BF606X switch which can run P4 Tofino programs. The first aspect of running D2R on hardware is compiling to Tofino. We use Barefoot P4 Studio [p4studio] to compile a version of D2R that adheres to the resource constraints of the switch. Since we only possess a single hardware switch, we could not perform an end-to-end routing demonstration using D2R. To understand the viability of D2R, we study the effects of recirculation on Tofino. By configuring adequate ports in loopback for recirculation, we are able to run D2R on the switch and can perform 7 recirculations with minimal degradation in throughput (< 1%), and additional latency in the order of microseconds between two hosts connected to the switch. We do not report actual numbers due to a confidentiality agreement with Barefoot.
End-to-end connectivity using Mininet.
We demonstrate end-to-end routing of D2R using an emulated Mininet [mininet] network of four P4 switches and two hosts (Figure 4). The P4 switches run the P4 software behavioral model [bmv2]. The bmv2 CLI is used to program the rules of each switch for routing and forwarding packets in the network. We send UDP traffic between the two hosts with the D2R headers as payload. We disable a link (using a link failure status register in the switch) and bring it back up after some time.
Initially, the switch 1 data plane finds a path using IDDFS and the rest of the switches forward along the path stored in the packet header. When the link has failed, switch 1 successfully finds an alternate path without any packet drops (assuming instantaneous failure detection). Finally, when the link is back up and the switch failure state is updated, packets once again switch to the original path. We also verify that switches 2 and 3 do not compute paths and instead use the route stored in the packet header.
8 Related Work
Blink [blink] is a state-of-the-art data-driven data plane solution for connectivity recovery. Blink analyzes TCP-induced signals to detect remote link failures that disrupt end-to-end connectivity. Once Blink has detected a remote link failure, it uses a data-driven fast reroute mechanism: it probes all next hops for availability and chooses a working one. However, without any topology information, Blink cannot fundamentally prevent forwarding issues like blackholes. D2R, on the other hand, does not actively detect remote failures and instead uses FCP for failure propagation. D2R could potentially be used as Blink's reroute mechanism.
Sedar et al. [p4frr] propose a local fast reroute mechanism to deal with link failures. In the data-plane FRR primitive, the packet keeps track of all ports it has used in an attempt to reach the destination, and the data plane sends the packet on the next available port. To re-establish connectivity, they implement multiple mechanisms that leverage the FRR primitive to explore paths in the network using different strategies, e.g., Rotor-Router, DFS and BFS. The authors advocate for FRR as it does not incur recirculations (FRR is implemented using one table) and uses fewer resources. We argue that D2R can be used to implement routing itself in the data plane without incurring considerable overheads, thus eliminating the need for local fast reroute mechanisms, which can consume network bandwidth to explore paths.
Molero et al. [hwpathvector] propose a path vector protocol using programmable switches, offloading key control plane functionalities to the data plane, in the same vein as our vision. However, a distributed path vector protocol, even one accelerated by hardware, will suffer losses during routing convergence periods. Similarly, a path vector protocol cannot easily guarantee policy compliance under failures and will require control plane intervention.
Finally, one of the major avenues of research orthogonal to our work is leveraging programmable data planes to perform various in-network computing tasks efficiently: key-value stores [netcache], scale-free coordination for distributed systems [netchain], stateful load balancers [silkroad], network ordering for consensus [nopaxos], heavy hitter detection [heavyhitter], and distributed aggregation for machine learning [p4ml]. We could potentially run D2R and these applications in parallel in the same data plane, with D2R performing routing while the applications act on other packet headers.
We present D2R, a new network architecture that leverages programmable switching technologies to perform routing completely in the data plane using P4. D2R is able to provide always-availability and policy compliance under failures. Our work opens up a vast avenue of interesting open problems: With current programmable switch architectures, can we implement a shortest-path algorithm for weighted graphs (to mimic OSPF/BGP configurations)? Can we increase the coverage of policies we can implement in the data plane? Can we design hardware optimized for graph traversal to perform routing efficiently?
Appendix A Failure Carrying Packets
Failure Carrying Packets (FCP) [fcp] is a distributed routing paradigm designed to eliminate convergence periods altogether: a packet is guaranteed to reach the destination if a path to the destination exists in the network. FCP takes advantage of the fact that permanent network topology change (in terms of provisioning/de-provisioning links and switches) happens at the timescales of weeks/months and is well-planned. The only changes for which operators are not prepared are links and routers failing and coming back up at smaller timescales [datacenterfailures]. Thus, each router has a consistent topology description which indicates all switches and adjacencies between them.
The intuition behind FCP is that if the switch knows the list of failed links in the network, it can successfully route a packet to the destination using the network topology and failure information. However, knowledge of all link failures will require a link-state advertisement protocol, which can lead to convergence issues. Instead, in FCP, each packet header carries information about all failed links it has encountered, and the switch simply uses the topology and failure information to route the packet to the destination. The packet may again encounter a failed link to the next-hop along the route; in that case, the failed link is added to the packet header and the route is recomputed at the new router, and so on. We illustrate an example of the FCP protocol in Figure 28. Switch 1 computes the route () to the destination as it does not have any information about the failed link. When the packet reaches switch 2, the failure information in the packet is updated and switch 2 computes the new route to the destination (). Switch 1 receives the packet once again, but it will not send the packet to 2 as the switch knows that is failed, and thus, sends it to 3 and so on.
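The following Python sketch simulates this forwarding loop on a small hypothetical topology (it is not the topology of Figure 28). Each switch consults only the failures recorded in the packet plus the state of its own links; the function names and the use of BFS for route computation are assumptions made for the sketch.

```python
from collections import deque


def shortest_path(adj, known_failed, src, dst):
    """BFS over the static topology minus the failures carried in the packet."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v not in parent and frozenset((u, v)) not in known_failed:
                parent[v] = u
                queue.append(v)
    return None


def fcp_forward(adj, actual_failed, src, dst):
    """Return the sequence of switches visited, or None if dst is unreachable."""
    pkt_failed = set()             # failure list carried in the packet header
    curr, trace = src, [src]
    while curr != dst:
        route = shortest_path(adj, pkt_failed, curr, dst)
        if route is None:
            return None
        link = frozenset((curr, route[1]))
        if link in actual_failed:  # local link is down: record it and recompute
            pkt_failed.add(link)
            continue
        curr = route[1]
        trace.append(curr)
    return trace


adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(fcp_forward(adj, {frozenset((2, 4))}, 1, 4))
```

The packet first travels to switch 2, learns there that the link to 4 is down, returns through switch 1, and is then sent via switch 3 because the failure is now recorded in its header.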
FCP is able to guarantee reachability if a path exists by the following intuition: at every switch in the network, the packet will monotonically increase the set of failed links in the packet. (FCP does not consider link flapping, i.e., the case where the packet encountered a failure and updated its header, but the link came back up before the packet reached the destination.) Thus, eventually, the packet would get information about all failed links in the network, and any router would be able to route the packet to the destination if a path exists. The only failure state maintained by an FCP router is the failure state of links connected to the router. FCP learns about the state of remote links solely from the packet headers, and importantly, it does not store this information. Thus, FCP routers do not need to advertise failures, unlike OSPF. Thus, while FCP can incur additional stretch, we can avoid the link-state flooding overhead during failures.
With programmable switch architectures, realizing an FCP-like protocol is more practical than when FCP was originally introduced. One of the major deployment challenges for FCP was changing the router hardware to support a new protocol header to incorporate information about link failures. With P4, we can easily define our custom protocol header and parsers, which can be efficiently run on hardware at line rates. We store the failure information in the header as a bit-vector where each bit represents the state of a particular link in the topology.
Appendix B DFS P4 Implementation
Let's consider a switch m whose current path length is len. IDDFS will explore a neighbor n if n is not visited/failed and len < max_len. If n is valid, IDDFS pushes m (curr) onto the stack for backtracking purposes. IDDFS will also mark the incoming edges to n as visited and update the current path length (lines 10-15). We add the following table rule(s) for m to implement the above logic:

    match:
        curr:m=1, visited_vec:*******0, len:0, max_len:4
        …
        curr:m=1, visited_vec:*******0, len:3, max_len:4

    action goto_neighbor(n, n_visited) {
        Stack.push(hdr.curr);
        hdr.curr = n;
        hdr.visited_vec = hdr.visited_vec | n_visited;
        hdr.len++;
    }

The match condition for visited_vec and the action parameters (n, n_visited) are the same for BFS and IDDFS (shown in Figure 4).
At any switch m, we backtrack (lines 18, 22) when either all outgoing edges of m are visited/failed, or the current path length exceeds the max length. We set backtrack as the default action of our IDDFS table; it is executed whenever the current header values do not match any valid match condition corresponding to the neighbors.

    default action backtrack() {
        hdr.curr = Stack.pop();
        hdr.len--;
    }
Finally, once we have explored all switches at max_len distance, IDDFS increases the max length by a factor of 2 and resets curr and visited_vec to the initial state. To match this component of the algorithm (lines 24-29), we check for a special bottom-of-stack switch (0):

    match:
        curr:0, visited_vec:********, len:-1, max_len:4

    action increase_length() {
        hdr.max_len = hdr.max_len << 1; // *2
        hdr.curr = curr_init;
        hdr.visited_vec = visited_init;
        hdr.len = 0;
    }
Appendix C Source Routing Implementation
While each D2R data plane is capable of computing a route to the destination, recomputing the path at each switch will incur additional recirculations. To prevent unnecessary recomputations, we augment our graph traversal algorithms to store the computed route in the packet. Downstream switches can use the path in the header and forward to the next-hop without any recomputation, except in the scenario that the next-hop in the packet is not reachable (due to a link failure). If the next-hop is not reachable, the switch will compute a new route and store it in the packet header. In D2R, we keep track of 8 hops in the packet header (configurable parameter), and the last switch in the path can recompute the path to the destination and load it into the header.
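The per-switch forwarding decision described above can be sketched as follows. The packet representation (a dict with `dst` and `path` fields), the callbacks `link_up` and `compute_route`, and the `recomputed` counter are hypothetical names introduced for illustration; `compute_route` stands in for the BFS/IDDFS traversal.

```python
MAX_HOPS = 8   # number of hops kept in the packet header (configurable)


def next_hop(switch, pkt, link_up, compute_route):
    """Follow the stored source route; recompute only when it cannot be used.

    pkt:           dict with "dst" and "path" (remaining hops, next hop first)
    link_up:       callable (u, v) -> bool, local link state at this switch
    compute_route: callable (switch, dst) -> full path starting at switch, or None
    """
    path = pkt["path"]
    if not path or not link_up(switch, path[0]):
        route = compute_route(switch, pkt["dst"])
        if route is None:
            return None
        path = route[1:1 + MAX_HOPS]          # keep at most MAX_HOPS hops
        pkt["recomputed"] = pkt.get("recomputed", 0) + 1
    if not path:
        return None                            # already at the destination
    pkt["path"] = path[1:]
    return path[0]


# Hypothetical routes: switch 1 computes 1 -> 2 -> 4, switch 2 reuses it
routes = {1: [1, 2, 4], 2: [2, 4]}
pkt = {"dst": 4, "path": []}
hop1 = next_hop(1, pkt, lambda u, v: True, lambda s, d: routes[s])
hop2 = next_hop(2, pkt, lambda u, v: True, lambda s, d: routes[s])
print(hop1, hop2, pkt["recomputed"])
```

Only the first switch runs a traversal in this example; the second one consumes the stored route, so the recomputation counter stays at one.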
The BFS algorithm explores multiple paths from the source until it finds the destination. Thus, when we add each switch to the stack, we also need to add the current path computed to the switch. To do this, we define our BFS stack in P4 as follows:

    header bfs_stack_entry {
        bit<8> switch;
        bit<64> path; // store max 8 hops
    }
    bfs_stack_entry Stack;

We keep track of the current path and length in hdr.path and hdr.len, respectively, and update the bfs actions to keep track of the current computed path.

    table bfs {
        key = {
            …
            hdr.len : exact;
        }
    }

    action push_neighbor(n_2) {
        newPath = hdr.path;
        newPath[hdr.len + 1] = n_2;
        Stack[hdr.stack].push(n_2, newPath);
        …
    }

While popping elements from the stack in actions pop_stack and change_stack, we update curr and path from the top of the BFS stack. Finally, once we have reached the destination in the algorithm, hdr.path will reflect the path from source to destination and is emitted in the deparser for use by downstream switches.
The IDDFS algorithm explores along a single path, backtracking until the algorithm reaches the destination. Thus, we can store the current explored path in hdr.path and do not need to store paths in the stack like BFS. Since we already track the current path length, we only need to modify the iddfs actions to store the path in the packet header, which can then be transmitted.

    action goto_neighbor(n_2) {
        hdr.curr = n_2;
        hdr.len++;
        hdr.path[hdr.len] = n_2;
    }

    action backtrack() {
        hdr.curr = Stack.pop();
        hdr.path[hdr.len] = 0; // erase len index
        hdr.len--;
    }

    action increase_length() {
        …
        hdr.curr = curr_init;
        hdr.path = 0; // erasing all
        hdr.len = 0;
    }

Source routing is necessary for IDDFS for correctness of routing, while BFS can operate without storing paths. This is because IDDFS does not always find the shortest path (in terms of next-hops).