Deploying Near-Optimal Delay-Constrained Paths with Segment Routing in Massive-Scale Networks

With a growing demand for quasi-instantaneous communication services such as real-time video streaming, cloud gaming, and Industry 4.0 applications, multi-constraint Traffic Engineering (TE) becomes increasingly important. While legacy TE management planes have proven laborious to deploy, Segment Routing (SR) drastically eases the deployment of TE paths and is thus increasingly adopted by Internet Service Providers (ISPs). There is a clear need for computing and deploying Delay-Constrained Least-Cost (DCLC) paths with SR for real-time interactive services. However, most current DCLC solutions are not tailored for SR, and they often lack either efficiency or guarantees. Similarly to approximation schemes, we argue that the challenge is to design an algorithm providing both performance and guarantees. However, conversely to most of these schemes, we also consider operational constraints to provide a practical, high-performance implementation. We leverage the inherent limitations of delay measurements and account for the operational constraint added by SR to design a new algorithm, BEST2COP, providing guarantees and performance in all cases. BEST2COP outperforms a state-of-the-art algorithm on both random and real networks of up to 1000 nodes. Relying on commodity hardware with a single thread, our algorithm retrieves all non-superfluous 3-dimensional routes in only 250ms and 100ms respectively. This execution time is further reduced using multiple threads, as the design of BEST2COP enables a speedup almost linear in the number of cores. Finally, we extend BEST2COP to deal with massive-scale ISPs by leveraging the multi-area partitioning of these deployments. Thanks to our new topology generator, specifically designed to model the realistic patterns of such massive IP networks, we show that BEST2COP solves DCLC-SR in approximately 1 second even for ISPs having more than 100,000 routers.


1 Introduction

Latency is critical in modern networks for various applications. The constraints on the delay are indeed increasingly stringent. For example, in financial networks, vast amounts of money depend on the ability to receive information in real-time. Likewise, technologies such as 5G slicing, in addition to requiring significant bandwidth availability, demand strong end-to-end delay guarantees depending on the service they aim to provide, e.g., less than 15ms for low latency applications such as motion control for industry 4.0, VR or video games 5GSlicing. For such interactive applications, the latency is as critical as the IGP cost.

This Interior Gateway Protocol cost is defined as an additive metric that usually reflects both the link’s bandwidth and the operator’s load distribution choices on the topology. Paths within an IGP are computed by minimizing this cost. Thus, although delay constraints are increasingly important, they should not be enforced to the detriment of the IGP cost. With minimal IGP distances, the traffic benefits from high-bandwidth links and follows the operator’s intent in managing the network and its load. With bounded delays, the traffic can benefit from paths allowing for sufficient interactivity. It is thus relevant to minimize the IGP cost while enforcing an upper constraint on the latency. Computing such paths requires solving DCLC (Delay-Constrained Least-Cost), an NP-hard problem.
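Using the path-distance notations introduced later in Sec. 3.1 (d_1(p) for the delay of a path p and d_2(p) for its IGP cost), the problem can be stated compactly as follows:

DCLC(s, d, c_1):   minimize d_2(p)   subject to   d_1(p) ≤ c_1,   over all paths p from s to d.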

Figure 1: Practical relevance of DCLC in the GEANT network. IGP costs are deduced from the bandwidth of each link. Depending on their needs (in terms of delay and bandwidth), applications can opt for three non-comparable paths between Frankfurt and Vienna.

DCLC: a relevant issue. The practical relevance of DCLC can be illustrated on real networks, as displayed in Fig. 1. This map is a sample of the GEANT transit network GeantTopo. As fibers often follow major roads, we rely on real road distances to infer the propagation delay of each link, while the bandwidth, and so the estimated IGP cost, matches the indications provided by GEANT. A green link has an IGP cost of 1, while it is 2 and 10 respectively for the yellow and pink ones. When considering the couple Frankfurt-Vienna, this network exhibits three non-comparable (or non-dominated) paths, forming the Pareto front of the paths between the two cities. These paths, as well as their delays and IGP distances, are provided in the legend.

Such paths offer diverse options: either solely the delay matters and the direct link (in pink) should be preferred, or the ISP prefers to favor high-capacity links, and the green path, minimizing the IGP cost, should be used. The yellow path, however, offers an interesting compromise. Out of all paths offering a latency well below 10ms, it is the one minimizing the IGP cost. Thus, it allows to provide a strict Service-Level Agreement (here, a delay below 10ms) while still considering the IGP cost. These kinds of paths, retrieved by solving DCLC, provide more options by enabling tradeoffs between the two most important networking metrics. Applications such as videoconferences, for example, can then benefit both from real-time interactive voice exchange (delay) and high video quality (bandwidth). In addition, IGP costs are also tuned to represent the operational costs. Any deviation from the shortest IGP paths thus results in additional costs for the operator. For all these reasons, there exists a clear interest in algorithms able to compute and deploy DCLC paths SegmentR71:online. However, while this problem has received a lot of attention from the network research community in the last decades survey2018; survey2010, no technology was so far available for an efficient deployment of such paths.

Segment Routing & DCLC’s rebirth. Segment Routing (SR) is a vibrant technology gathering traction from router vendors, network operators, and academic communities I-D.matsushima-spring-srv6-deployment-status; ventre_segment_2020. Relying on a combination of strict and loose source routing, SR makes it possible to deviate traffic from the shortest IGP paths through a selected set of nodes and/or links. Such deviations may, for example, allow to route traffic through a path with lower latency. These deviations are encoded in the form of segments within the packet itself. To prevent any packet forwarding degradation, the number of deviations one can encode is limited and should be taken into consideration when computing paths. While this technology is adequate to support a variety of services, operators mainly deploy SR in the hope of performing fine-grained and ECMP-friendly tactical Traffic Engineering (TE) SR4TE, due to its far lower overhead compared to RSVP-TE filsfils2017segment. Our discussions with network vendors further reveal a clear desire from operators to efficiently compute DCLC paths deployable with Segment Routing SegmentR71:online. Such a solution should thus not only encompass Segment Routing, with its constraint on the number of segments that can be pushed at line rate, but also fare well on the large-sized networks observable in the near future.

Challenges. In this paper, we propose a simple and efficient algorithm, called BEST2COP, that makes DCLC-TE possible on (very) large-scale networks. We leverage SR to reach computation times suitable for real-time routing. To achieve our goal, we solve several challenges: (i) cap, or even minimize, the number of segments to respect the packet manipulation overhead supported by routers, (ii) scale with very large modern networks, and (iii) provide near-exact algorithms with a bounded error margin and strong guarantees despite the difficulty induced by considering three metrics (cost, delay, and number of segments). While difficult instances are unlikely to occur in practice, our algorithm can tackle even such corner cases: our proposal is not only efficient in practice but provably correct and efficient even in worst theoretical cases.

Efficiently encompassing the Maximum Segment Depth constraint (MSD). While SR can add instructions within the packets to guide them through the desired paths, the number of instructions is limited to MSD (around 10 with good hardware). Fortunately, this constraint does not prevent SR from being used to deploy DCLC paths. Indeed, according to our study (Fig. 2), the number of required segments does not exceed MSD for all constraints between 1 and 100ms, in massive-scale transit networks whose nodes are scattered in multiple areas, with realistic IGP costs. Nevertheless, the number of segments required to encode paths should still be taken into account, to avoid computing non-deployable paths exceeding MSD segments. Moreover, in difficult cases, where computation times would increase significantly due to the difficulty of finding a suitable path for the constraints, the limit on MSD serves as a safeguard to cut the computation short; it also prevents the (rare) use of paths that are inefficient with regard to the IGP cost. While MSD adds an additive metric to consider (the number of segments), BEST2COP manages, through adequate data structures and graph exploration, to natively manipulate segment lists instead of having to convert physical paths into segment lists a posteriori, and to control their size on the fly, ensuring the computation of paths requiring at most MSD segments.

Towards massive-scale networks. While we already showed in nca2020 that computing DCLC paths for Segment Routing (DCLC-SR) is possible in far less than a second on networks of up to 1000 nodes, scaling to ten or a hundred times more routers remains an open issue. Some current SR deployments exhibit more than ten thousand nodes and continue to grow, even in medium-sized countries I-D.matsushima-spring-srv6-deployment-status. Given that most operators deploy SR for TE purposes SR4TE, it becomes clear that TE algorithms must adapt to massive-scale networks. Our new proposal, an extension of BEST2COP, aims to deal with such cases efficiently using a divide-and-conquer approach. Indeed, massive networks usually rely on both a physical and logical partitioning, as IGP protocols like OSPF or IS-IS do not scale well as is. By leveraging this decomposition into areas, as well as re-designing BEST2COP to benefit from multi-threaded architectures, it becomes possible to solve DCLC-SR in a time suited for real-time routing. To evaluate our contribution, we create a topology generator, YARGG, able to construct realistic massive-scale, multi-metric, and multi-area topologies based solely on geographical data. In such cases, our extension solves DCLC-SR in about 1 second for networks of more than 100,000 nodes.

Bounded error margin and practical delay measurements. DCLC is a well-known NP-hard problem 536364. While there exist several ways to solve DCLC survey2018, they usually do not consider the underlying deployment technologies, and often do not offer any guarantees regarding either their computation time or optimality, both seemingly vital for a real-life deployment. However, it is worth noting that delay constraints are usually formulated at the millisecond granularity and are somewhat arbitrary. Besides, the real end-to-end experienced delay is often not measured. In practice, end-to-end latency is induced by the (variable) queueing delay and the (stable) propagation delay. Both delays may play an important part in the overall latency, though neither can be stated to be the main factor end2end. When measuring delay for TE, it is strongly recommended to avoid measuring the queuing delay (through the use of a priority queue), so that the measurement remains stable RFC7471; RFC8570. Varying delay estimations may indeed lead to frequent re-computations, control-plane message exchanges, and fluctuating traffic distribution. The propagation delay, deduced from the minimum observation within a large enough sampling window perfmes, is thus used as a stable lower-bound approximation of the end-to-end latency. This is a pertinent estimation as, in practice, flows steered onto DCLC paths will be served by a high-priority queue and experience negligible queueing delays.
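As an illustration, the short sketch below (illustrative Python, not part of BEST2COP; the estimator class and its parameters are hypothetical) derives such a stable propagation-delay estimate by keeping the minimum one-way delay observed over a sliding sampling window, thereby filtering out the variable queueing component:

from collections import deque

class PropagationDelayEstimator:
    # Keeps the minimum delay observed over a sliding sampling window.
    def __init__(self, window_size=100):
        self.samples = deque(maxlen=window_size)

    def add_sample(self, delay_ms):
        self.samples.append(delay_ms)

    def estimate(self):
        # The minimum over the window filters out the (variable) queueing delay
        # and approximates the (stable) propagation delay.
        return min(self.samples) if self.samples else None

est = PropagationDelayEstimator(window_size=50)
for s in (5.4, 5.9, 7.3, 5.2, 6.1):   # hypothetical delay samples, in ms
    est.add_sample(s)
print(est.estimate())                  # -> 5.2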

Because of the arbitrary and imprecise nature of the constraints and the sole use of the propagation delay (whose measurement itself exhibits inherent limits in terms of trueness and accuracy), a bounded error margin is acceptable. BEST2COP is designed to take advantage, if needed, of this acceptable margin. While BEST2COP can return exact solutions, it can very easily be tuned to return approximated ones whose distances to the optimal solution are bounded (within any desired error margin).

Summary and Contributions. By taking into consideration the operational deployment of constrained paths with SR, the current scale and structure of modern networks, as well as the practicality of delay measurements and constraints, we designed BEST2COP, a simple and efficient algorithm able to solve DCLC-SR in about 1s even in large networks of more than 100,000 nodes.

The main achievements of our proposal follow the organization of this paper:

  • In Section 2, we present SR in further details, discuss the context and works related to DCLC and then evaluate the relevance of SR for deploying DCLC paths;

  • In Section 3, we formalize DCLC-SR and its generalization (2COP). In particular, we describe the network characteristics we leverage and define the construct we use to encompass SR (in Sec. 3.1 and Sec. 3.2 respectively). These allow us to propose in Sec. 3.3 a parallelizable version of BEST2COP (initially introduced in nca2020) to benefit from multi-threaded architectures;

  • In Sec. 3.4, we extend BEST2COP to deal with massive-scale networks relying on area partitioning (as with OSPF with a single metric), making it able to solve DCLC efficiently in graphs of more than 100,000 nodes;

  • In Sec. 3.5, we formally define the guarantees brought by BEST2COP and its versatility to solve several TE optimization problems at once (i.e., the sub-problems of 2COP), as well as its polynomial complexity;

  • Finally, in Section 4, we present our large-scale topology generator, YARGG, evaluate BEST2COP on the resulting topologies and compare our proposal to the relevant state-of-the-art path computation algorithm.

2 Back to the Future: DCLC vs SR

2.1 Segment Routing Background and Practical Usages

Segment Routing implements source routing by prepending packets with a stack of up to MSD segments. Segments are checkpoints the packet has to go through, and may either specify a node (such segments are called node segments), or an interface and its link (adjacency segments). Routers forward packets according to the topmost segment, which is removed from the stack when the packet reaches the associated intermediate destination.

A node segment indicates that the packet should (first) be forwarded to a given node v with ECMP (instead of its final IP destination). Flows are then load-balanced among the best IGP next hops for destination v. Adjacency segments indicate that the packet should be forwarded through a specific interface. Adjacency segments may be globally advertised, and thus be used the same way as node segments, or they may only have a local scope and, as such, can only be interpreted by the router possessing said interface. In this case, the packet should first be guided to the corresponding router, by prepending the associated node segment. In the following, as a worst operational case, we consider the latter scenario. Indeed, the resulting number of segments can be seen as an upper bound, as node segments preceding adjacency ones could be removed depending on the control-plane in use (i.e., if adjacency segments are globally advertised).

Segment Routing has attracted a lot of interest from the research community. A table referencing most SR-related work can be found in Ventre et al. ventre_segment_2020. While some SR-TE works are related to tactical TE problems (like minimizing the maximum link utilization), indirectly taking some delay concerns into account swarm; 10.1007/978-3-319-23219-5_41, most of the works related to SR do not focus on DCLC, but rather on bandwidth optimization 7218434; Gay2017ExpectTU; Loops, network resiliency 8406885; 7524551, monitoring SRv6PM; 7524410, limiting energy consumption 7057272, or path encoding (the translation of paths to segment lists) guedrez2016label; 7417097. Aubry AubryPhd proposes a way to compute paths requiring less than MSD segments while optimizing an additive metric in polynomial time. The number of segments required is then evaluated. This work, however, considers only a single metric in addition to the operational constraints. The problem we tackle (i.e., DCLC paths for Segment Routing) deals with two metrics (in addition to the operational constraints). This additional dimension drastically changes the problem, which then becomes NP-hard. Some works, Lazzeri2015EfficientLE in particular, use a construct similar to ours (presented in detail in Sec. 3.2.1) in order to avoid converting network paths into segment lists. However, the authors of Lazzeri2015EfficientLE do not aim to solve DCLC and, as such, do not tune the structure the same way (i.e., they do not remove dominated segments, as explained later on), and simply use their construct to sort paths lexicographically.

As aforementioned, while operators seem to mainly deploy SR to perform fine-grained TE, to the best of our knowledge, no DCLC variant exists that specifically tackles SR characteristics and constraints (except for our contribution). Using segments to steer particular flows allows, however, to deviate some TE traffic from the best IGP paths in order to achieve, for example, a lower latency (and, by extension, to solve DCLC). A realistic example is shown in Fig. 1, where the node segment Vienna, as well as considering Vienna as the destination itself, would result in the packets following the best IGP path from Frankfurt to Vienna, i.e., the green dashed path. To use the direct link instead (in plain pink) and so minimize the delay between the two nodes of this example, the associated adjacency segment would have to be used, as it enforces a single-link path having a smaller delay than the best IGP one (which includes here two intermediary routers). Finally, the yellow path, offering a non-dominated compromise between both metrics (and being the best option when considering a delay constraint of 8ms), requires the use of the node segment Budapest to force the traffic to deviate from its best IGP path in green. Before converting the paths to segment lists (and actually deploying them with SR), such non-dominated paths need first to be explored. Computing these paths while ensuring that the number of segments necessary to encode them remains under MSD is at least as difficult as solving the standard DCLC problem, since an additional constraint now applies.

2.2 DCLC (Delay-Constrained, Least-Cost), a Well-known Difficult Problem having many Solutions?

DCLC belongs to the set of NP-hard problems (as do most related multi-constrained path problems). Intuitively, solely extending the least-cost path is not sufficient, as the latter may exceed the delay constraint. Thus, paths with greater cost but lower delay must be memorized and extended as well. These non-dominated paths form the Pareto front of the solution, whose size may grow exponentially with the size of the graph. However, DCLC in particular, and its variants and extensions in general, possess several interesting applications, such as mapping specific flows to their appropriate paths (in terms of interactive quality). Thus, these problems have been extensively studied in the past decades. Many solutions have been proposed so far, as summarized in these surveys survey02; survey2010; survey2018: they range from heuristics and approximations to exact algorithms, or even genetic approaches.
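To make dominance concrete, the short sketch below (illustrative Python, with hypothetical path weights) filters a set of (delay, cost) path distances down to its Pareto front:

def pareto_front(distances):
    # A pair a dominates b if a is no worse on both metrics and differs from b.
    front = []
    for d in distances:
        dominated = any(o[0] <= d[0] and o[1] <= d[1] and o != d for o in distances)
        if not dominated:
            front.append(d)
    return front

# Hypothetical (delay in ms, IGP cost) distances of candidate paths between two nodes:
paths = [(2.0, 10), (6.7, 4), (9.5, 1), (8.0, 12)]
print(pareto_front(paths))   # -> [(2.0, 10), (6.7, 4), (9.5, 1)]; (8.0, 12) is dominated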

Heuristics. Because DCLC is NP-hard, several polynomial-time heuristics have been designed to limit the worst-case computing time, but to the detriment of any guarantees. For example, Fallback only returns the least-cost or least-delay path if one is feasible (i.e., respects the constraint). More advanced proposals try to explore the delay and cost space simultaneously, by either combining, in a distributed manner, the least-cost and least-delay subpaths DCUR; DCR; SFDCLC, or by aggregating both metrics into one in a more or less intricate manner.

Aggregating metrics in a linear fashion Jaffe; Aneja_Nair_1978 preserves the subpath optimality principle (isotonicity of best single-metric paths) and therefore allows the use of standard shortest path algorithms. However, it leads to a loss of relevant information regarding the quality and feasibility of the computed paths 536364, in particular if the hull of the Pareto front is not convex (see the toy example below). Some methods try to mitigate this effect by using a k-shortest path approach to possibly find more feasible paths DCBFKLARAC; larac, but such an extension may result in a large increase of the execution time and may not provide more guarantees. Other heuristics rely on non-linear metric aggregation. While this seems, at first glance, to prevent the loss of relevant information, such algorithms then have to maintain all non-dominated paths (towards all nodes), as isotonicity no longer holds (while it does with linear aggregation). Since the Pareto front may be exponential with respect to the size of the graph, those algorithms either simply impose a hard limit on the number of paths that can be maintained (e.g., TAMCRA TAMCRA and LPH EBFA01), or specifically choose the ones to maintain through previously acquired knowledge (HMCOP Korkmaz_Krunz_2001). Finally, other works like Feng_Makki_Pissinou_Douligeris_2002; Guo_Matta_1999 rely on heuristics designed to solve a variant of DCLC, the MCP problem (Multi-Constrained Paths, the underlying NP-complete decision version of DCLC, with no optimization objective). Their approach mainly consists of sequential MCP runs using a conservative cost constraint that is iteratively refined.
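The loss of information incurred by linear aggregation can be illustrated with a toy example (illustrative Python, hypothetical weights): when the Pareto front is not convex, a feasible compromise path can never be ranked first by the aggregated weight, whatever coefficient is chosen:

# Three candidate paths as (delay, cost), with a delay constraint c1 = 8 (hypothetical values).
paths = {"low-delay": (2, 10), "compromise": (7, 6), "low-cost": (10, 1)}

def best_by_linear_aggregation(paths, alpha):
    # Aggregated weight: alpha * delay + (1 - alpha) * cost.
    return min(paths, key=lambda p: alpha * paths[p][0] + (1 - alpha) * paths[p][1])

# Whatever alpha in [0, 1], the aggregated optimum is never the "compromise" path,
# even though it is the least-cost path satisfying the delay constraint c1 = 8.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, best_by_linear_aggregation(paths, alpha))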

Relying on heuristics is tempting, but their lack of guarantees can prevent enforcing strict SLAs even when a suitable path actually exists. One can argue that this is particularly unfortunate, as DCLC is only weakly NP-hard: it can be solved exactly in pseudo-polynomial time, i.e., polynomial in the numerical value of the input gareycomputers. Said otherwise, DCLC is polynomial in the smaller of the two metrics' largest weights once translated to integers. Consequently, it is possible to design FPTAS (Fully Polynomial Time Approximation Schemes) solving DCLC while offering strong guarantees Papadimitriou_Yannakakis_2000.

Approximations. The common principle behind these schemes is to reduce the precision (and/or magnitude) of the considered metrics. This can be performed either directly, by scaling and rounding the weights of each link, or indirectly, by dividing the solution space into intervals and only maintaining paths belonging to different intervals (interval partitioning Sahni_1977). Scaling methods usually consider either a high-level dynamic programming scheme or a low-level practical Dijkstra/Bellman-Ford core with pseudo-polynomial complexity, and round the link costs to turn their algorithms into an FPTAS (see for example the methods of Hassin Hassin_1992, Ergun et al. Ergun_Sinha_Zhang_2002, or Lorenz and Raz Lorenz_Raz_2001). Goel et al. goel, in particular, choose to round the delay instead of the cost and can consider multiple destinations (as does our own algorithm).

Most interval partitioning solutions explore the graph through a Bellman-Ford approach. The costs of the paths are mapped to intervals, and only the path with the lowest delay within a given interval is kept. The size of the intervals thus introduces a bounded error factor Hassin_1992; Tsaggouris_Zaroliagis_2009. In particular, HIPH HIPH offers a dynamic approach between an approximation and an exact scheme. It proposes to maintain up to a given number of non-dominated paths for each node and stores eventual additional paths using an interval partitioning strategy. This allows the algorithm to be exact on simple instances (i.e., when the Pareto front is limited, in particular when its size is polynomial in the number of nodes and bounded by this threshold) and to offer strong guarantees on more complex ones. This versatility is an interesting feature, as most real-life cases are expected to be simple instances with a bounded Pareto front size, in particular because one of the metrics may be coarse by nature. For these reasons, not only approximation schemes can offer practical solutions (with a bounded error margin), but also exact algorithms (with controlled performance), as they may be viable in terms of computing time for simple real-life IP networking instances.

Exact methods. Numerous exact methods have indeed also been studied extensively to solve DCLC. Some methods simply use a k-shortest path approach to list all paths within the Pareto front Namorado_Climaco_Queiros_Vieira_Martins_1982; Paixao_Santos_2008. On the other hand, Constrained Bellman-Ford CBF94 (ironically, also called Constrained Dijkstra, as it uses a priority queue, denoted PQ in the following) explores paths by increasing delay and lists all non-dominated paths towards each node. Several algorithms use the same principle but order the paths differently within the queue, relying either on a lexicographical ordering, ordered aggregated sums, or a simple FIFO/LIFO ordering Martins_1984; Martins_Santos_1999; Brumbaugh-Smith_Shier_1989. Most notably, A* Prune Liu_Ramakrishnan_2001 is a multi-metric adaptation of A* relying on a PQ where paths known to be unfeasible are pruned (this adaptation is exact, i.e., not a heuristic, as the estimated cost underestimates the actual distance towards the destination). Two-phase methods Raith_Ehrgott_2009 first find the paths lying on the convex hull of the Pareto front through multiple Dijkstra runs, before finding the remaining non-dominated paths through implicit enumerations.

Finally, SAMCRA SAMCRA is a popular and well-known multi-constrained path algorithm. Similarly to other Dijkstra-based algorithms, SAMCRA relies on a PQ to explore the graph, but instead of the traditional lexicographical ordering, it relies on a non-linear cost aggregation. Among feasible paths (others are natively ignored), it first considers the one that minimizes its maximum distance to the multiple constraints. Such a ranking within the PQ is meant to increase performance with respect to other PQ organizations.

[Table 1: rows list the compared algorithms: LARAC larac, LPH EBFA01, HMCOP Korkmaz_Krunz_2001, HIPH HIPH, Hassin Hassin_1992, Tsaggouris et al. Tsaggouris_Zaroliagis_2009, Raith et al. Raith_Ehrgott_2009, A* Prune Liu_Ramakrishnan_2001, SAMCRA SAMCRA, and BEST2COP. Columns cover Practicality (Multi-Dest Single Run, SR Ready, Multi-thread Ready) and Exactitude vs Performance (Bounded Pareto Front, Coarse Metric, All Cases), each of the latter split into an exactitude and a performance subcolumn.]
Table 1: Qualitative summary of a representative subset of DCLC-compatible algorithms showcasing their practicality, exactitude, and performance. In the Practicality column, a green check-mark indicates that the algorithm supports the corresponding feature (while a red cross denotes the opposite). In the Exactitude vs Performance column, the two subcolumns associated with each of the three scenarios show how the latter impact the exactitude (exact, strong guarantees, no guarantees) and the performance of the algorithm (polynomial time or not). While an orange tilde denotes strong guarantees in terms of exactitude, green check-marks (and red crosses respectively) indicate either exact results (resp. no guarantees) or polynomial-time execution (resp. exponential at worst) for performance. For both the Bounded Pareto Front and Coarse Metric subcolumns, we consider the case where their spread is polynomial with respect to the number of vertices in the input graph (and as such predictable in the design/calibration of the algorithm).

As we have seen so far, while many solutions exist, most possess certain drawbacks or lack certain features needed to reconcile practice and theory. Heuristics do not always allow to retrieve the existing paths enforcing strict SLAs, while exact solutions are not able to guarantee a reasonable maximum running time when difficult instances arise, although both features are essential for real-life deployment. On the other hand, FPTAS can provide both strong guarantees and a polynomial execution time. However, they often originate from the field of operational research where, at best, possible networking applications and assumptions are discussed, but not investigated. Because of this, the deployment of the computed paths, with SR and its MSD in particular, is not taken into consideration. It is worth noting that the number of segments is not a standard metric, as it is not simply a weight assigned to each edge in the original graph (that is, without a specific construct, it must be computed on the fly for each visited path). Considering the latter can have a drastic impact on the performance of algorithms not designed with this additional metric in mind (see Section 4.4 for more details). In addition, not all the algorithms presented here and in Table 1 are single-source multiple-destination. Finally, none of these algorithms evoke the possibility of leveraging multi-threaded architectures, an increasingly important feature as such computations now tend to be performed by dedicated Path Computation Elements or even in the cloud.

Our contribution, BEST2COP, aims to close this gap by combining the best existing features (such as providing both a limited execution time and strong guarantees in terms of precision in all cases) and adapting them for a practical modern usage in IP networks deploying SR. Table 1 summarizes some key features of a representative subset of the related work. Similarly to FPTAS, BEST2COP rounds one of the metrics of the graph. However, conversely to most algorithms, BEST2COP does not sacrifice the accuracy of the cost metric, but that of the measured delay. Because of the native inaccuracy of delay measurements (and the arbitrary nature of its constraint), this does not prevent BEST2COP from being technically exact in most practical cases. In addition (and akin to HIPH), BEST2COP can easily be tuned to remain exact on all simple instances with a bounded Pareto front, regardless of the accuracy of the metrics. Thus, BEST2COP can claim to return exact solutions in most scenarios and, at worst, ensure strict guarantees in others (for theoretical exponential instances). In all cases, BEST2COP possesses a pseudo-polynomial worst-case time complexity. BEST2COP was designed while keeping the path deployment aspect of the problem in mind. A single run allows to find all DCLC paths (and many variants, as we will see later) to all destinations. The MSD constraint related to SR is taken into account natively. As a result, paths requiring more than MSD segments are excluded from the exploration space. The outer loop of BEST2COP can be easily parallelized, leading to a non-negligible reduction in the execution time. In Sec. 4, relying on a performance comparison between BEST2COP and SAMCRA, we will show that BEST2COP does not even need to rely on multi-threading to provide lower computing times than SAMCRA while returning the exact same solutions (even though we make SAMCRA benefit from the same advantageous methods explained thoroughly in the remainder of this paper). This result is particularly interesting as it remains true even for simple IP network instances. This comparison also enables us to evaluate Dijkstra-oriented solutions (SAMCRA) with respect to Bellman-Ford-oriented ones (BEST2COP).

Last but not least, BEST2COP has been adapted for multi-area networks and leverages the structure of the latter, allowing it to solve DCLC on very large (more than 100,000 nodes) domains in about one second. To the best of our knowledge, such large-scale experiments and results have neither been conducted nor achieved within SR domains. Some elementary algorithms (such as multi-constrained Dijkstra) and more intricate solutions exhibit impressive computing times on even more massive road networks Hanusse_Ilcinkas_Lentz_2020; however, such networks are less dense than IP ones (and with metrics that are even more correlated).

2.3 SR is Relevant for DCLC: MSD is not a Limit

One can question the choice of SR for deploying DCLC paths in practice. Indeed, in some cases, in particular if the metrics are not aligned (the delay and the IGP cost in particular; since node segments represent best IGP paths, the IGP cost and the number of segments will most likely be aligned by design), constrained paths may require more than MSD detours to satisfy a stringent latency constraint.

Figure 2: Required number of segments for all DCLC solutions, in a massive-scale multi-area network generated by YARGG, with delay constraints of up to 100ms.

While it has been shown that few segments are required for most current SR usages (e.g., for TI-LFA or when considering only one metric) TheseClarenceFils; AubryPhd, to the best of our knowledge, there is no similar study for our specific use-case, i.e., massive-scale networks with two valuation functions (delay and IGP cost). This is probably one of the most exciting challenges for SR, as DCLC is a complex application. However, since such massive-scale computer network topologies are not available publicly, we rely on our own topology generator, whose detailed description is available in Section 4.2. Our aim is to obtain realistic networks and patterns benefiting from both the physical hierarchy and the metric alignment observed in real cases like in Fig. 1. These topologies follow a standard OSPF-like area division. For this analysis, we opt for a worst-case graph whose nodes and edges are scattered in multiple areas. Both metrics (delay and IGP cost) follow the realistic pattern described in Section 4.2.

For this analysis, we keep track, for each destination, of all the solutions solving DCLC for all delay constraints up to 100ms, and extract the necessary number of segments. In other words, we show the number of segments required to encode all non-dominated (and thus practically relevant for some given constraint) paths, considering all delay constraints up to 100ms. The results are shown in Fig. 2.

One can see that the MSD limit does not seem to be an issue for most configurations, as most paths require less than 10 segments. Said differently, we have DCLC-SR = DCLC in most cases. However, even when considering realistic IGP weights and delays, there exist some corner cases which require more than 10 segments. These cases probably arise from stringent delay constraints (recall that we consider all constraints up to 100ms), for which adequate paths may require a larger number of segments than usual. In addition, the MSD limit depends on the hardware. Thus, for some routers, some DCLC paths may not be deployable.

In practice, this MSD limitation can be mitigated, if needed, using a flexible algorithm I-D.ietf-lsr-flex-algo. Flexible Algorithm (Flex-Algo) allows the IGP to compute shortest paths based on metrics different from the IGP cost, e.g., the delay. These can in turn also be translated into segments that can be encoded within the packet. Thus, one can not only use IGP-oriented segments (representing best IGP paths) but also delay-oriented segments (representing paths of lowest delay). These new node segments (i.e., new edges in our multi-metric SR graph) can help shift the general distribution observed in Fig. 2 to the left. Furthermore, outliers can be dealt with through binding segments RFC8402, which create a mapping between a segment and a list of node or adjacency segments. Upon reading such a binding segment, a router replaces it with the corresponding segment list. This allows to decrease the number of segments to stack at the edge router.

In summary, while in some rare cases the MSD constraint may prevent the deployment of non-dominated paths, this is not a real issue in practice. Nearly all DCLC solutions do not require more than 9 segments. For some corner cases, SR can mitigate this theoretical limit by relying on Flex-Algo or binding segments.

3 BEST2COP(E): Efficient Data Structures and Algorithms for Massive Scale Networks

This section presents our contributions. We introduce and define preliminary notations and concepts used to design BEST2COP, before describing the data structures used by our algorithm. In Section 3.3, we describe our algorithm, BEST2COP, and show in Section 3.4 how we extend it to massive-scale networks divided into several areas.

We have shown that SR seems indeed appropriate (as desired) for fine-grained delay-based TE. We thus aim to solve DCLC in the context of an ISP deploying SR, leading to the DCLC-SR problem that considers the IGP cost, the propagation delay, and the number of segments.

For readability purposes, we denote:

  • M_0, the metric referring to the number of segments, with the constraint c_0 = MSD applied to it;

  • M_1, the delay metric, with a constraint c_1;

  • M_2, the IGP metric being optimized.

Given a source s, DCLC-SR consists in finding, for all destinations, a segment list verifying two constraints, c_0 and c_1, respectively on the number of segments (M_0) and the delay (M_1), while optimizing the IGP distance (M_2). We denote this problem DCLC-SR(s, c_0, c_1). On Fig. 1, we would have, for example, DCLC-SR(Frankfurt, 3, 8). This DCLC path (shown in yellow in Fig. 1) is indeed the best option to reach Vienna when considering an arbitrary delay constraint of 8ms. Since the best IGP path from Frankfurt to Vienna (the green one) does not go through Budapest, encoding this DCLC path requires at least one detour, i.e., one segment (here, a node segment instructing the packet to go through Budapest first).

To solve such a challenging problem, efficient data structures are required. In the following, we first introduce the constructs we leverage, which benefit in particular from the inaccuracy of real delay measurements.

3.1 DCLC and True Measured Delays

As mentioned, DCLC is weakly NP-hard and can be solved exactly in pseudo-polynomial time. In other words, as long as either the cost or the delay possesses only a limited number of distinct values (i.e., paths can only take a limited number of distinct distances), the Pareto front of the paths’ distances is naturally bounded in size as well, making DCLC tractable and efficiently solvable (metric M_0 is omitted for now, as this trivial distance is only required for SR and discussed in detail later; while dealing with a three-dimensional Pareto front seems more complex at first glance, we will show that SR eventually reduces the exploration space, because its operational constraint is very tight in practice and easy to handle efficiently). Such a metric thus has to be bounded and possess a coarse accuracy (i.e., be discrete). Although this has little impact when solving DCLC in a theoretical context, it can be strongly leveraged to solve DCLC efficiently thanks to the characteristics of real ISP networks.

We argue that the metrics of real ISP networks do indeed possess a limited number of distinct values. Although BEST2COP can be adapted to fit any metric, we argue that M_1, the propagation delay, is the most appropriate one. Indeed, IGP costs depend on each operator's configuration. For example, while some may rely on a few spaced weights, others may possess intricate weight systems where small differences in weights matter. Thus, bounding the size of the Pareto front based on the IGP costs is not only operator-dependent, but might still result in a very large front.

On the other hand, the delay (i) is likely strongly bounded, and (ii) can be handled as if having a coarse accuracy in practice. First, for TE paths, the delay constraint is likely to be very strict (10ms or less). Second, while the delay of a path is generally represented by a precise number in memory, the actual accuracy, i.e., the trueness, of the measured delay is much coarser due to technical challenges RFC7679; RFC2681. In addition, delay constraints are usually formulated at the millisecond granularity with a tolerance margin, meaning that some loss of information is acceptable.

Thus, floating-point numbers representing the delays can be truncated to integers, e.g., taking 0.1ms as the unit. This allows to easily bound the number of possible non-dominated distances to c_1 × γ, with γ being the desired level of accuracy of M_1 (the inverse of the delay grain, here 0.1ms, i.e., γ = 10 per ms). For example, with c_1 = 100ms and a delay grain of 0.1ms, we have only 1000 distinct (truncated) non-dominated pairs of distances to track at worst. This leads to a predictable and bounded Pareto front. One can then store non-dominated distances within a static array, indexed on the M_1-distance (as there can only be one non-dominated couple of distances for a given M_1-distance).
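A minimal sketch of this idea is given below (illustrative Python; the constants and helper names are assumptions, with a delay grain of 0.1ms and a 100ms constraint). Delays are truncated to integer multiples of the grain, and the front towards a node is a static array indexed by the truncated delay, each cell keeping the best known cost:

GRAIN_MS = 0.1                        # delay grain: 0.1 ms
C1_MS = 100                           # delay constraint: 100 ms
GAMMA = round(C1_MS / GRAIN_MS)       # at most 1000 distinct truncated delays

def discretize(delay_ms):
    # Truncate to the delay grain; the small epsilon guards against float rounding.
    return int(delay_ms / GRAIN_MS + 1e-9)

# dist[d1] = best known cost among paths whose truncated delay equals d1.
dist = [None] * (GAMMA + 1)

def relax(delay_ms, cost):
    # Insert a candidate (delay, cost) pair; keep only the best cost per delay index.
    d1 = discretize(delay_ms)
    if d1 > GAMMA:
        return False                  # violates the delay constraint
    if dist[d1] is None or cost < dist[d1]:
        dist[d1] = cost
        return True
    return False                      # dominated by an equal-delay, cheaper path

relax(6.74, 4)    # stored at index 67
relax(6.70, 5)    # ignored: same truncated delay, higher cost
relax(2.30, 10)   # stored at index 23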

In the remainder of this paper, Γ denotes the size allocated in memory for this Pareto front array (i.e., Γ = c_1 × γ). When the real level of accuracy of the measured delay is lower than (or equal to) γ, the stored delay can be considered to be exact. More precisely, it is discretized, but with no loss of relevant information. When the real accuracy is higher, one can choose a smaller γ, to keep Γ at a manageable value. In this case, some relevant information can be lost, as the discretization is too coarse. While this sacrifices the exactitude of the solution (to the advantage of computation time), our algorithm is still able to provide predictable guarantees in such cases (i.e., a bounded error margin on the delay constraint). This is further discussed in Section 3.5.2. These discretized delays, enabling us to bound the number of non-dominated distances, are then used within the structure we rely on to encompass Segment Routing natively: the SR graph.

3.2 The SR Graph and 2COP

To solve DCLC-SR efficiently, as well as its comprehensive generalization, 2COP, we rely on a specific construct used to encompass SR, the delay, and the IGP cost: the multi-metric SR graph.

3.2.1 Turning the Physical Graph into a Native SR Representation

This construct represents the segments as edges to natively deal with the M_0 metric and its constraint, c_0 = MSD. The valuation of each edge depends on the distance of the path encoded by each segment. While the weights of an adjacency segment are the weights of its associated local link, the weights of a node segment are the distances of the ECMP paths it encodes: the common IGP cost, and the lowest guaranteed delay, i.e., the worst delay among all ECMP paths. Hence, computing paths on the SR graph is equivalent to combining stacks of segments (and the physical paths they encode), as stacks requiring i segments are represented as paths of i edges in the SR graph (agnostically to their actual length in the raw graph). The SR graph can be built for all sources and destinations thanks to an All Pair Shortest Path (APSP) algorithm. Note that this transformation is inherent to SR and leads to a complexity of O(|V| (|E| + |V| log |V|)) for a raw graph having |V| nodes and |E| edges, with the best known algorithms and data structures. This transformation being required for any network deploying TE with SR (the complexity added by our multi-metric extension being negligible), we do not consider it as part of our algorithm presented later.

This transformation is shown in Fig. 3, which shows the SR counterpart of the raw graph provided in Fig. 1. To describe this transformation more formally, let us denote G = (V, E) the original graph, where V and E respectively refer to the set of vertices and edges. As G can have multiple parallel links between a pair of nodes (u, v), we use E(u, v) to denote the set of all direct links between nodes u and v. Each link l possesses two weights, its delay w_1(l) and its IGP cost w_2(l). The delay and the IGP cost being additive metrics, the M_1 and M_2 distances of a path p (denoted d_1(p) and d_2(p) respectively) are the sums of the weights of its edges.

From G, we create a transformed multigraph, the SR graph, denoted G' = (V, E'). While the set of nodes in G' is the same as in G, the set of edges differs, because E' encodes segments as edges representing either adjacency or node segments, encoding respectively local physical links or sets of best IGP paths (with ECMP). The M_j-weight of an edge e in G' is denoted w'_j(e). However, to alleviate further notations, we simply denote d_j(p) the distance of a path p in G'. Note that if G is connected, then G' is a complete graph thanks to node segments.

A node segment, encoding the whole set ECMP(u, v) of ECMP best paths between two nodes u and v, is represented by exactly one edge (u, v) in G'. The M_2-weight of a node segment is the common M_2-distance of the paths in ECMP(u, v). Since, when using a node segment, packets may follow any of the ECMP paths, we can only guarantee that the delay of the path will not exceed the maximal delay out of all ECMP paths. Consequently, its M_1-weight is defined as the maximum M_1-distance among all the paths in ECMP(u, v). Links representing node segments in G' thus verify the following:

w'_1(u, v) = max { d_1(p) | p ∈ ECMP(u, v) }   and   w'_2(u, v) = d_2(p), for any p ∈ ECMP(u, v)

An adjacency segment corresponds to a link l ∈ E(u, v) in the raw graph and is represented by an edge in G', whose weights are the ones of its corresponding link in G, only if it is not dominated by the node segment for the same pair of nodes, i.e., if w_1(l) < w'_1(u, v), nor by any other non-dominated adjacency segment l' of the same pair, i.e., if w_1(l) < w_1(l') or w_2(l) < w_2(l'), where l and l' are two different outgoing links of u in E(u, v) (if two links have exactly the same weights, we only add one adjacency segment in G'). Fig. 3 illustrates the result of such a transformation: one can easily identify the three non-dominated paths between Frankfurt and Vienna, bearing the same colors as in Fig. 1. The green path (i.e., the best M_2 path) is encoded by a single node segment. The pink, direct path (i.e., the best M_1 path) is encoded by an adjacency segment (the double line in Fig. 3). The yellow path (the solution of DCLC-SR(Frankfurt, 3, 8) and an interesting tradeoff between M_1 and M_2) requires an additional segment, in order to be routed through Budapest. Note that in practice, the last segment is unnecessary if it is a node segment, as the packet will natively be routed towards its final IP destination through the best IGP paths.

Figure 3: The SR graph encodes segments as edges. Plain edges represent node segments, i.e., sets of ECMP paths. Double lines are adjacency segments (here, only the one encoding the direct Frankfurt-Vienna link), which are visible only if they are not dominated by other segments. Colored edges refer to the paths highlighted in Fig. 1.
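A simplified sketch of this transformation is given below (illustrative Python; it assumes the ECMP best IGP paths between every pair of nodes have already been computed by an APSP run, and the input and helper names are ours, not the paper's actual implementation):

def build_sr_graph(nodes, ecmp_paths, adjacencies):
    # ecmp_paths[(u, v)]  : list of (delay, cost) distances of the ECMP best IGP paths u -> v
    # adjacencies[(u, v)] : list of (delay, cost) weights of the direct links u -> v
    # Returns sr_edges[(u, v)] : list of (delay, cost) edges, one per kept segment.
    sr_edges = {}
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            edges = []
            node_seg = None
            paths = ecmp_paths.get((u, v), [])
            if paths:
                # Node segment: worst (guaranteed) delay among ECMP paths, common IGP cost.
                node_seg = (max(d for d, _ in paths), paths[0][1])
                edges.append(node_seg)
            links = set(adjacencies.get((u, v), []))      # identical-weight links kept once
            for link in links:
                others = ([node_seg] if node_seg is not None else []) + [l for l in links if l != link]
                dominated = any(o[0] <= link[0] and o[1] <= link[1] for o in others)
                if not dominated:
                    edges.append(link)                    # adjacency segment worth keeping
            if edges:
                sr_edges[(u, v)] = edges
    return sr_edges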

The main feature of the multi-metric SR graph is that paths encodable in i segments in the raw graph are represented as paths of i edges within the SR graph (in practice, such a path could even be encodable in fewer segments, depending on how the network is configured). By exploring this graph in a fashion akin to the Bellman-Ford algorithm, the path exploration naturally iterates over the number of segments. The SR graph allows to explore the DCLC-SR solution space in an efficient manner, by natively ignoring solutions not encodable in (less than) MSD segments. As a consequence, we only need to manage a 2D Pareto front within our structures. This construct not only allows us to solve DCLC-SR, but also the problem we refer to as 2COP, a general and practically relevant problem regarding the computation of constrained paths in an SR domain.

3.2.2 The 2COP Problem(s)

Solving DCLC-SR exactly requires, by definition, visiting the entirety of the Pareto front for all destinations. However, although only some of these paths are DCLC-SR solutions for a given delay constraint, all paths visited during this exploration may be of some practical interest. In particular, some of them solve problems similar to DCLC but with different optimization strategies and constraints. By simply memorizing the explored paths (i.e., storing the whole Pareto front within an efficient structure), one can solve a collection of practically relevant problems. For instance, one may want to obtain a segment path that minimizes the delay, another the IGP cost, or the number of segments. Solving 2COP consists in finding, for all destinations, paths optimizing all three metrics independently, while respecting the given constraints. We formalize this collection of problems as 2COP. Solving 2COP enables more versatility in terms of optimization strategies and handles heterogeneous constraints for different destinations. Simply put, while DCLC-SR is a one-to-many DCLC variant taking MSD into account, 2COP is more general as it includes all optimization variants.

With 2COP, one is highly flexible regarding constraints (and optimization strategies). Besides the three optimization variants, other practically relevant heterogeneous problems can be solved. In practice, a customer may change its needs after the execution of our algorithm. In particular, one may wish for a smaller constraint if some requirements become stricter. Finally, ISPs could be interested in finding destination-specific constraints and optimizations, i.e., different constraints and strategies for each destination.

While algorithms returning only DCLC-SR solutions would have to be run multiple times and re-calibrated to handle the situations mentioned above (different optimization strategies, stricter constraints, or destination-specific constraints), our algorithm returns the whole Pareto front. With initial constraints c_0 = MSD, c_1 = Γ and c_2 = ∞, BEST2COP solves 2COP(s, c_0, c_1, c_2), i.e., returns in a single run paths that satisfy any smaller constraints c'_j ≤ c_j, for j ∈ {0, 1, 2}.

Definition. 2-Constrained Optimal Paths (2COP)
Let f(M_j, c_0, c_1, c_2, s, d) be a function that returns a feasible segment path from s to d (if it exists), verifying all constraints c_0, c_1, c_2 and optimizing M_j. For a given source s and given upper constraints c_0, c_1, c_2, we have

2COP(s, c_0, c_1, c_2) = { f(M_j, c'_0, c'_1, c'_2, s, d) | ∀d ∈ V, ∀j ∈ {0,1,2}, ∀c'_j ≤ c_j }

Observe that, for any destination, DCLC-SR(s, c_0, c_1) consists of the paths in 2COP(s, c_0, c_1, ∞) minimizing M_2. Looking at Fig. 3, we have two interesting examples (we rely on the first capital letter of the cities):

f(M_2, 3, 70, ∞, F, V) = ⟨(F,B), (B,V)⟩, with distances (67, 4)
f(M_1, 3, Γ, ∞, G, B) = ⟨(G,M), (M,B)⟩, with distances (77, 4)

In the second example, recall that the M_1-distances are truncated to obtain integer values, and that Γ is the maximum constraint we consider (multiplied by γ).

When the delay accuracy allows to reduce the problem's complexity sufficiently, BEST2COP can solve exactly any of the variants within 2COP, regardless of the constrained metrics or the one to optimize. Our algorithm, BEST2COP, takes the constraints and the SR graph as inputs and builds in polynomial time an array that can return any desired output of the image of f given stricter sub-constraints, instead of relaunching the algorithm. Retrieving the desired output can be performed in constant (or at worst sub-linear) time.
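For instance, once the Pareto fronts are stored per number of segments and per destination (mirroring the pfront structure described in Sec. 3.3), answering a 2COP query with stricter sub-constraints reduces to a simple scan of the relevant entries. The sketch below is illustrative Python with hypothetical values; the variant shown minimizes the IGP cost (i.e., DCLC-SR), and the other variants only change the comparison key:

def query_2cop(pfront, dest, c0, c1, c2):
    # pfront[i][dest]: non-dominated (delay, cost) pairs reachable with i segments.
    # Returns the best (segments, delay, cost) verifying segments <= c0, delay <= c1,
    # cost <= c2, while minimizing the cost.
    best = None
    for i in range(0, min(c0, len(pfront) - 1) + 1):
        for delay, cost in pfront[i][dest]:
            if delay <= c1 and cost <= c2 and (best is None or cost < best[2]):
                best = (i, delay, cost)
    return best

# Hypothetical fronts towards node "V" (delays in 0.1 ms units):
pfront = [{"V": []}, {"V": [(95, 1), (20, 10)]}, {"V": [(67, 4)]}]
print(query_2cop(pfront, "V", c0=3, c1=80, c2=float("inf")))   # -> (2, 67, 4)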

In Sec. 3.5.2, we detail how we can handle each 2COP variant with guarantees when the delay accuracy is too high to provide exact solutions while remaining efficient. Solving 2COP can be implemented as efficiently as solving only DCLC-SR.

3.3 Our Core Algorithm for Flat Networks

In this section, we describe BEST2COP, our algorithm efficiently solving 2COP (and thus DCLC-SR). Its implementation, as well as a complete and detailed algorithmic description, is available online (https://github.com/talfroy/BEST2COP). Akin to the SR graph computation, BEST2COP can be run on a centralized controller, but also by each router. Its design is centered around two properties. First, the graph exploration is performed so that paths requiring i node segments are found at the i-th iteration (note that each adjacency segment translates to at least one necessary segment, two if they are not globally advertised and not subsequent), to natively tackle the MSD constraint. Second, BEST2COP's structure is easily parallelizable, allowing to benefit from multi-core architectures with low overhead.

Simply put, at each iteration, BEST2COP starts by extending the known paths by one segment (one edge in the SR graph) in a Bellman-Ford fashion (a not-in-place version, to be accurate), with the main difference being that we remember all non-dominated paths. This extension is performed in a parallel-friendly fashion that prevents data-races. At the end of an iteration, newly found extended paths are filtered to reflect the new Pareto front (updated in a relaxed manner). The remaining paths are in turn extended at the next iteration. These steps only need to be performed MSD times, as the ignored paths are not SR-feasible.

The good performance of BEST2COP does not only result from such a cut in the exploration space, but also from well-chosen data structures benefiting from the limited accuracy of the delay measurements. This limited accuracy allows us to manipulate arrays of fixed size, because the Pareto front of distances towards each node is limited to Γ entries at each step. These simple data structures, frequently used for read/write operations, allow for updating Pareto fronts very efficiently, with an amortized relaxation cost (conversely to priority queues in Dijkstra-like algorithms). Using a Bellman-Ford approach thus comes with two advantages: straightforward parallelism and lazy updates of the Pareto fronts.

BEST2COP’s main procedure is shown in Alg. 1. The variable pfront, the end result returned by our algorithm, contains, for each iteration i, the Pareto front of the distances towards each node n. In other words, pfront [i] contains, at the end of the i-th iteration, all non-dominated distances of feasible paths (of at most i segments) towards each node n.

The variable dist is used to store, for each vertex, the best M_2-distance found for each M_1-distance. Since the M_1-distance of any feasible path in G' is bounded by Γ, we can store these distances in a static array dist [v] of size Γ. Note that during iteration i, dist will contain the Pareto front of the current iteration (non-dominated distances of paths of at most i segments) in addition to distances that may be dominated. Keeping such paths in dist allows us to pre-filter paths before ultimately extracting the Pareto front of the current iteration from dist later on. This variable is used in conjunction with pf_cand, a boolean array used to remember which distances within dist were found at the current iteration.

The variable extendable is a simple list that contains, at iteration i, all non-dominated distances discovered at iteration i-1. More precisely, extendable is a list of tuples (v, L_v), where L_v is the list of the best known paths towards v. The variable nextextendable is a temporary variable allowing to construct extendable.

After the initialization of the required data structures, the main loop starts. This loop is performed MSD times, or until no feasible paths are left to extend. For each node v, we extend the non-dominated distances found during the previous iteration towards v (originally, (0,0) towards src). Extending paths in this fashion allows to easily parallelize the main for loop (e.g., through the single OpenMP pragma of Alg. 1, Line 14). Indeed, each thread can manage a different node v towards which to extend the non-dominated paths contained within extendable. As threads will discover distances towards different nodes (written in turn in structures indexed on v), this prevents data-races. Note that in raw graphs, this method may lead to uneven workloads, as not all paths may be extendable towards any node v. However, since an SR graph is (at least) complete, any path may be extended towards any node v, leading to similar workloads among threads.

1  pfront := Array of size MSD+1
2  forall i ∈ [0..MSD] do
3      pfront [i] := Array of size |V| of Empty Lists
4  add (pfront [0][src], (0,0))
5  dist := Array of size |V|
6  forall n ∈ V do
7      dist [n] := Array of size Γ
8  dist [src][0] = (0,0)
9  extendable := Empty List (of Empty Lists)
10 add (extendable, (src, [ (0,0) ]))
11 nextextendable := Array of size |V| of Empty Lists
12 i := 1, max_d1 := 0
13 while extendable ≠ [] and i ≤ MSD do
14     #pragma omp parallel for
15     forall v ∈ V do
16         pf_cand := Array of size Γ
17         nb, imax := ExtendPaths (v, extendable, pf_cand, dist [v])
18         max_d1 = max (imax, max_d1)
           // How to iterate on dist to get the new PF
19         if nb·log(nb) + nb + |pfront [i-1][v]| ≤ max_d1 then
20             d1_it := mergesortd1 (pfront [i-1][v], pf_cand)
21         else
22             d1_it := [0…max_d1]
23         nextextendable [v] = [ ]
           // Extract the new PF from dist
24         CptExtendablePaths (nextextendable [v], pfront [i][v], pf_cand, d1_it, dist [v])
       // Once each thread is done, gather the extendable paths
25     extendable = []
26     forall v such that nextextendable [v] ≠ [ ] do
27         add (extendable, (v, nextextendable [v]))
28     i = i + 1
return pfront
Algorithm 1 BEST2COP(G', src)

The routine ExtendPaths, detailed in Alg. 2, takes the list of extendable paths, i.e., the non-dominated paths discovered at the previous iteration, and a node v. It then extends the extendable paths further towards v. The goal is to update dist[v] with new distances that may belong to the Pareto front. Before being added to dist[v], extended distances go through a pre-filtering. Indeed, a newly found distance to v may be dominated or may be part of the Pareto front. While this check is performed thoroughly later, we can already easily prune some paths: if a new path to v violates either constraint, there is no point in considering it. Furthermore, recall that dist stores, for each d1-distance towards a node, the best respective d2-distance currently known. Thus, if the new d2-distance is worse than the one previously stored in dist at the same index, this path is necessarily dominated and can be ignored. Otherwise, we add the distances to dist and update pf_cand to remember that a new distance, which may be non-dominated, was added during the current iteration. Note that ExtendPaths returns the number of distances updated within dist, as well as the highest d1-distance found. This information is used for efficiency reasons detailed hereafter.

Once ExtendPaths returns, dist contains distances that are either dominated or not. We thus need to extract the Pareto front of the current iteration. This operation is performed in a lazy fashion, once for all new distances (and not for each edge extension). Since this Pareto front lies within dist, one can simply walk through dist by order of increasing d1-distance, from 0 to the highest d1-distance found yet, and filter all stored distances to get the Pareto front of the current iteration. This may not be efficient, however, as most of the entries of dist may be empty.

However, the precise indexes of all active distances that need to be examined (to skip empty entries) can be constructed by merging and filtering the union of the current Pareto front and the new distances (pf_cand). Thus, if the sorting and merging of the corresponding distance indexes is less costly than walking through dist, the former method is performed in order to skip empty or useless entries. Otherwise, a simple walk-through is preferred. The merging of the distances of the Pareto front with the new distances is showcased here at a high level (Line 20, Alg. 1). The use of more subtle data structures in practice allows this operation to be performed at the cost of a simple mergesort.
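A small Python sketch of this trade-off (illustrative only; names are assumptions), mirroring the test on Line 19 of Alg. 1:

import math

def d1_iterator(prev_front_d1, cand_d1, max_d1):
    # prev_front_d1: sorted d1 indexes of the previous Pareto front
    # cand_d1: d1 indexes updated at this iteration (pf_cand set to True)
    # returns the sequence of indexes to scan, choosing the cheaper strategy
    nb = len(cand_d1)
    merge_cost = (nb * math.log2(nb) if nb else 0) + nb + len(prev_front_d1)
    if merge_cost <= max_d1:
        # sorting/merging the active indexes lets us skip empty cells
        return sorted(set(prev_front_d1) | set(cand_d1))
    # otherwise a plain walk of the array up to max_d1 is cheaper
    return range(max_d1 + 1)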

After the list of distance indexes to check and filter is computed, the actual Pareto front is extracted by the CptExtendablePaths procedure, shown in Alg. 3. This routine checks whether paths of increasing d1-distance possess a better d2-distance than the one before them. If so, the path is non-dominated and is added to the Pareto front, as well as to the paths that are to be extended at the next iteration. Finally, once each thread has terminated, nextextendable contains the lists of non-dominated distances towards each node. These lists are merged within extendable, to be extended at the next iteration.

Note that most approximation algorithms relying on interval partitioning or rounding do not bother with dominance checks. In other words, the structure they maintain is similar to our dist: the best cost for each delay value to a given node. The latter may thus contain dominated paths which are considered and extended in future iterations. In contrast, by maintaining the Pareto front efficiently, we ensure to consider the minimum set of paths required to remain exact, and thus benefit greatly from small Pareto fronts.

1 imax = 0, nb = 0
2 forall (u, d_list) ∈ extendable do
3        forall l ∈ E'(u,v) do
4               forall (d1u, d2u) ∈ d_list do
5                      d1v = d1u + d1(l)
6                      d2v = d2u + d2(l)
                      // Filters: constraints and dist
7                      if d1v ≤ c1 and d2v ≤ c2
8                      and d2v < dist_v[d1v] then
9                             dist_v[d1v] = d2v
10                            if not pf_cand[d1v] then
11                                  nb++
12                                  pf_cand[d1v] = True
13                            if d1v > imax then
14                                  imax = d1v
15 return nb, imax
Algorithm 2 ExtendPaths(v, extendable, pf_cand, dist_v)
1 last_d2 = ∞
2 forall d1 ∈ d1_it do
3        if dist_v[d1] < last_d2 then
4               add(pfront_iv, (d1, dist_v[d1]))
5               last_d2 = dist_v[d1]
6               if pf_cand[d1] then
7                      add(nextextendable_v, (d1, dist_v[d1]))
Algorithm 3 CptExtendablePaths(nextextendable_v, pfront_iv, pf_cand, d1_it, dist_v)

The output of BEST2COP. When our algorithm terminates, the pfront array contains, for each number of segments i, all the distances of non-dominated paths from the source towards each destination d. To answer the 2COP problem, for each destination d and for all (stricter sub-)constraints c'0 ≤ MSD on the number of segments, c'1 ≤ c1 on the delay, and c'2 ≤ c2 on the IGP cost, we can proceed as follows in practice:

  • to minimize the delay, i.e., to retrieve the distance from the source to d that verifies constraints c'0 and c'2 while minimizing d1, we look for the first element in pfront[c'0][d] verifying constraint c'2 (the first feasible distance is also the one minimizing d1, because the entries are indexed on the latter metric);

  • to minimize the IGP cost (e.g., for DCLC-like queries), we look for the last element in pfront[c'0][d] verifying constraint c'1; the path minimizing d2 being, by design, the last element of the list;

  • to compute the minimal number of segments, let us first denote by j the smallest integer such that pfront[j][d] contains an element verifying constraints c'1 and c'2. The resulting distance is then any of such elements in pfront[j][d].

As one might notice, these queries cannot always be achieved in constant time (for stricter sub-constraints in particular). Indeed, we favor a simple data structure. A search in an ordered list (of the size of the Pareto front) is needed for stricter constraints, and may be performed up to MSD times when minimizing the number of segments. To improve the time efficiency of our solution, each such list may be defined as, or converted into, a static array in the implementation.
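To make these three queries concrete, here is a minimal Python sketch (names and the exact layout of pfront are assumptions; it merely illustrates the lookups described above, assuming pfront[i][d] is sorted by increasing d1 and thus decreasing d2):

def min_cost(pfront, d, c0, c1):
    # DCLC-like query: best IGP cost towards d within c0 segments and delay <= c1
    best = None
    for (d1, d2) in pfront[c0][d]:
        if d1 > c1:
            break
        best = (d1, d2)      # the last feasible entry has the smallest d2
    return best

def min_delay(pfront, d, c0, c2):
    # best delay towards d within c0 segments and IGP cost <= c2
    for (d1, d2) in pfront[c0][d]:
        if d2 <= c2:
            return (d1, d2)  # the first feasible entry minimizes the delay
    return None

def min_segments(pfront, d, c1, c2):
    # smallest number of segments allowing both constraints to be met
    for i, front in enumerate(pfront):
        for (d1, d2) in front[d]:
            if d1 <= c1 and d2 <= c2:
                return i, (d1, d2)
    return None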

Finally, for simplicity, we did not show in our pseudo-code the structure and operations that store and extend the lists of segments. In practice, we store one representative of the best predecessors and a posteriori retrieve the lists using a backward induction for each destination.
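A possible way to implement this backward retrieval (purely illustrative; the pred structure and its layout are assumptions, not the authors' exact implementation) is sketched below:

def rebuild_segments(pred, src, dst, d1_cell):
    # pred[(v, d1_cell)] = (u, d1_cell_at_u): one representative best predecessor
    # returns the sequence of nodes (node segments) from src to dst
    segments = [dst]
    v, cell = dst, d1_cell
    while v != src:
        v, cell = pred[(v, cell)]
        segments.append(v)
    segments.reverse()
    return segments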

3.4 For Massive Scale, Multi-Area Networks

As shown in nca2020, BEST2COP exhibits great performance on large-scale networks of up to a thousand nodes (a few hundreds of milliseconds at most). However, since the design of BEST2COP implies a dominant quadratic factor in the number of nodes in terms of time complexity (the detailed complexity is given in Section 3.5.1; the SR graph being complete), recent SR deployments with far more nodes would not scale well enough. The sheer scale of such networks, coupled with the inherent complexity of TE-related problems, makes 2COP very challenging, if not impossible, to practically compute at first glance. In fact, even BEST2COP originally exceeds 20s when dealing with such massive networks. As we will see in the evaluations, this is much worse with concurrent options.

In this section, we extend BEST2COP in order to deal efficiently with massive-scale networks. By leveraging the physical and logical partitioning usually performed in such networks, we manage to solve 2COP in about one second even in networks of more than 100000 nodes.

3.4.1 Scalability in Massive Networks & Area Decomposition

The scalability issues in large-scale networks do not arise solely when dealing with TE-related problems. Standard intra-domain routing protocols encounter issues past several thousands of nodes. Naive network design creates a large, unique failure domain resulting in numerous computations and message exchanges, as well as tedious management. Consequently, networks are usually divided, both logically and physically, in areas. This notion exists in both major intra-domain routing protocols (OSPF and IS-IS). Although they exhibit slight differences, our solution can be adapted to fit any one of them. In the following, we consider the standard OSPF architecture and terminology.

Areas can be seen as small, independent sub-networks (usually of around 100 to 1000 nodes at most). Routers within an area maintain a comprehensive topological database of their own area only. Stub areas are centered around the backbone, or area 0. Area Border Routers, or ABRs, possess an interface in both the backbone area and a stub area. Being at the intersection of two areas, they are in charge of sending a summary of the topological database (the best distance to each node) of one area to the other. There are usually at least two ABRs between two areas. We here (and in the evaluation) consider two ABRs, but the computations performed can easily be extended to manage more ABRs. Summaries of a non-backbone area are sent through the backbone. Upon reception, ABRs inject the summary within their own area. In the end, all routers possess a detailed topological database of their own area and the best distances towards destinations outside of their own area. Not only does this reduce the computation cost induced in each area, but only specific destinations may be advertised between areas (e.g., stub sub-networks but not transit ones) to mitigate the overall churn and advertisement overhead.

3.4.2 Leveraging Area Decomposition

This partitioning creates obvious separators within the graph, the ABRs. We leverage this native partition to use a divide-and-conquer approach, running BEST2COP at the scale of the areas before exchanging and combining the results. We do not only aim to reduce computation time, but also to keep the number and size of the exchanged messages manageable. We here chose to detail a simple distributed variant of our solution, where each router performs its own computations within its area and combines it with the distances advertised by the two ABRs of its area. However, our solution may well be deployed in other ways, e.g. relying on controllers, or even a single one. In such cases, the computation could be parallelized per area if needed. Such discussion is left for future work. Before discussing our design in detail, we start with a high-level description of the way we extend BEST2COP:

  1. Routers compute the multi-metric SR graph of their own areas, before running BEST2COP on the resulting area-SR-graph. Note that ABRs, being adjacent to two areas, should run BEST2COP twice;

  2. ABRs share their pfront structures described in Alg 1, that is all non-dominated distances, and their segment lists, towards destinations within their own area;

  3. ABRs now possess all non-dominated paths to other ABRs (from running BEST2COP on area 0) and all non-dominated paths from said distant ABRs to the destinations in the corresponding areas (through message exchange). By combining both through a simple cartesian product and a light post-processing, one can get all 2COP paths to the destinations within distant areas. These paths may then be sent to routers within the local area, which in turn perform a similar processing.

Let us now explain each of these steps in further detail. For readability purposes, we rely on the following notations: A_x denotes area x. ABR_x denotes an ABR between the backbone and A_x. When necessary, we may distinguish the two ABRs of an area as ABR_x and ABR'_x. Finally, b2cop(s, d, A_x) denotes the results (the non-dominated paths) from s to d within A_x. When d is omitted, we consider all routers within A_x as destinations. Figure 4 illustrates a network with three areas, A_1, A_2 and 0, the backbone area. We can see that a non-dominated path from a source in A_1 to a destination in A_2 is obtained by concatenating a non-dominated path in each area. Thus, the set of solutions is the cartesian product of the sets of solutions in each area.

Figure 4: The set of solutions across areas is obtained from the cartesian product of the solutions in each area.

Working at area scale. To limit the computations and the volume of information transiting between areas, routers do not possess the topological information to compute a full, complete SR graph of the whole network: routers only compute the SR graph of their own area(s). Once computed, it is tempting to make routers exchange the SR graph of their area, and run BEST2COP on the union of all area-SR-Graphs. However, it systematically implies a large volume of information to share between areas.

Another method is for the ABRs to exchange their 2COP paths (i.e., the non-dominated paths to all destinations of their areas). We claim that only sharing the Pareto fronts wastes much less computing and message resources in practice, since their sizes are bounded (by the number of distinct discretized delay values) at worst. This exchange still provides enough information for all routers to compute all 2COP paths for every destination within the network.

More formally, each ABR of an area A_x computes b2cop(ABR_x, A_x) and exchanges the results with the other ABRs. Areas being limited to a few hundred routers on average, this computation is very efficient. Note that ABRs also compute b2cop(ABR_x, area 0), but need not exchange it, as all ABRs perform this computation. Exchanging the computed 3D Pareto fronts has a bounded message complexity in theory. In practice, we expect both the size of the Pareto fronts and the number of relevant destinations to consider to be fairly low. In the case of non-scalable Pareto fronts, one can opt for sending only part of them, but at the cost of relaxing the guarantees brought by BEST2COP.

After exchanging these messages, any ABR knows the non-dominated paths from itself to the other ABRs, and the non-dominated paths from those distant ABRs to all nodes within their respective areas. By combining this information, one can compute the non-dominated paths from the local ABR to all nodes within distant areas, as we will now detail.

Cartesian product. Since ABRs act as separators within the graph, to reach a node within a given area A_x, it is necessary to go through one of the corresponding ABRs. It thus implies that non-dominated paths from a source s to nodes within A_x can be found by combining b2cop(s, ABR_x) with b2cop(ABR_x, A_x). In other words, by combining, with a simple cartesian product, the local non-dominated paths towards the ABRs of a given area with the non-dominated paths from said ABRs to nodes within the corresponding distant area, one obtains a superset of the non-dominated paths towards the destinations of the distant area. In practice, since several ABRs can co-exist, it is necessary to handle the respective non-dominated paths (through ABR_x and through ABR'_x) with careful comparisons to avoid incorrect combinations.
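The following Python sketch (illustrative only; the triple layout and helper names are assumptions) shows the combination through one ABR and the final re-filtering; the segment-list correction discussed below is omitted here. The supersets computed for each ABR of the distant area would then be merged and filtered together.

def combine(to_abr, from_abr, c0, c1, c2):
    # to_abr: non-dominated (nb_seg, d1, d2) triples from the source to an ABR
    # from_abr: non-dominated triples from that ABR to a destination of its area
    # returns the feasible concatenations (a superset of the Pareto front)
    out = []
    for (s1, a1, b1) in to_abr:
        for (s2, a2, b2) in from_abr:
            cand = (s1 + s2, a1 + a2, b1 + b2)
            if cand[0] <= c0 and cand[1] <= c1 and cand[2] <= c2:
                out.append(cand)
    return out

def pareto(cands):
    # keep only non-dominated triples (simple quadratic filter)
    return [p for p in cands
            if not any(q != p and all(qi <= pi for qi, pi in zip(q, p))
                       for q in cands)]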

Post-processing. To ensure that the results obtained through the aforementioned cartesian product are correct, some post-processing is required. More precisely, this post-processing removes unnecessary segments from the segment lists, and removes dominated paths from the computed cartesian product. First, the segment list associated with each path should be corrected. When combining paths, the segment lists are simply concatenated, the first sub-path thus always reaching an ABR. More precisely, the segment lists necessarily possess the following structure: (u_0,u_1), …, (u_i, A), (A, v_0), …, (v_j-1, v_j), with A denoting an ABR. However, A being a separator, it is likely that the best IGP path from u_i to v_0 natively goes through A without the need of an intermediary segment. Thus, segments of the form (u_i, A), (A, v_0) can often be replaced by a single segment (u_i, v_0). Such anomalies should be corrected, as the MSD constraint may be tight on some hardware. An additional useless segment may thus render the path falsely unfeasible, even though it actually fits the MSD constraint. This correction can be performed easily. Let A be the separator: if (u_i, A) and (A, v_0) are node segments, and all best IGP paths from u_i to v_0 go through A (or possess the same cost and delay as the best IGP ones going through A), the two node segments can be replaced by a single one.
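A possible sketch of this correction (purely illustrative; igp_shortest() is a hypothetical helper assumed to answer from the local SR graph and the received summaries):

def maybe_merge_around_abr(seg_list, abr, igp_shortest):
    # seg_list: node segments, e.g. [u0, ..., ui, ABR, v0, ..., vj]
    # igp_shortest(a, b) -> (cost, delay) of the best IGP path from a to b
    try:
        i = next(k for k in range(1, len(seg_list) - 1) if seg_list[k] == abr)
    except StopIteration:
        return seg_list                      # the ABR is not an interior segment
    u, v = seg_list[i - 1], seg_list[i + 1]
    cost_uv, delay_uv = igp_shortest(u, v)
    cost_ua, delay_ua = igp_shortest(u, abr)
    cost_av, delay_av = igp_shortest(abr, v)
    # the detour through the ABR is "free": the direct segment (u, v) has the
    # same cost and delay, so the two node segments can be merged into one
    if cost_uv == cost_ua + cost_av and delay_uv == delay_ua + delay_av:
        return seg_list[:i] + seg_list[i + 1:]
    return seg_list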

This correction is performed quickly and relies solely on information available to the router (the local SR graph and the received distances summary). Finally, after having performed and corrected the cartesian products for all the ABRs of the area, the latter are merged in a single Pareto front.

Once this is performed for all areas, an ABR possesses all 2COP paths to all considered destinations within the network. These can then be sent to the routers within its own area, which will in turn perform similar computations to obtain the non-dominated paths to all routers of the other areas. Note that the 2COP paths for each destination can be sent as they are computed, so that routers can process such paths progressively (and in parallel) if needed.

Summary. By running BEST2COP within each area, before exchanging and combining the results, one can find all non-dominated paths to each destination within massive-scale networks (of the order of 100000 nodes) in less than 900ms. The induced message complexity is manageable in practice and can be further tuned if required. Our method can also be adapted for controller-oriented deployments.

3.5 A Limited Complexity with Strong Guarantees

3.5.1 An Efficient Polynomial-Time Algorithm

The flat BEST2COP. In the worst case, for a given node v, there are up to Γ × |V'| distances that can be extended towards it, denoting by Γ the number of distinct discretized delay values below the constraint (i.e., the size of the dist arrays) and by |V'| the number of nodes: each of the |V'| possible predecessors holds at most Γ non-dominated distances. Observe that the in-degree of v is at least |V'| (because G' is complete) and depends on how many parallel links v has with its neighbors. With L being the average number of parallel links between two nodes in G', on average we thus have L × Γ × |V'| extensions to perform towards a given node, at worst. These extensions are performed for each node and up to MSD times, leading to a time complexity of O(MSD × L × Γ × |V'|²). Using up to |V'| threads (one per destination node), one can greatly decrease the associated computation time.

The Cartesian Product. Its complexity is simply the size of the 2COP solution space squared, for each destination, thus at worst quadratic in the maximal Pareto front size per destination. Note that this computation time can again be reduced with the use of multiple threads, since each product is independent. This worst case is not expected in practice, as metrics are usually mostly aligned, resulting in Pareto fronts whose maximal size is much smaller than Γ.

Overall, BEST2COPE (multi-area) exhibits a complexity driven by the per-area runs of BEST2COP (quadratic in the size of the largest area rather than in the size of the whole network) plus the cost of the Cartesian products, assuming the use of enough threads (ideally, one per node). Note that the Cartesian product may dominate this worst-case analysis as long as the areas remain small enough. However, with realistically weighted networks, we argue that the contribution of the Cartesian product is negligible in practice, so that BEST2COPE is very scalable for real networking cases.

To conduct our evaluations, we consider that:

  • MSD = 10, as it is close to the best current hardware limit;

  • Γ = 1000: although this value is tunable to reflect the expected product between the trueness and the constraint on the delay, we consider here a fixed delay grain of 0.1ms (so 1000 delay cells) regarding a maximal constraint of 100ms.

This latter limitation is realistic in practice and guarantees the efficiency of BEST2COP even for large complex networks, as the resulting discretization error remains negligible with respect to such a delay constraint.
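As a quick sanity check of these (assumed) parameters, the resulting array sizes are easy to derive:

# delay grain of 0.1 ms under a 100 ms constraint -> 1000 delay cells per node
delay_grain_ms = 0.1
c1_ms = 100
nb_cells = int(c1_ms / delay_grain_ms)    # Γ = 1000 entries in each dist array
msd = 10                                   # at most 10 extension iterations
nodes = 1000                               # e.g., a large single area (assumed)
total_cells = nb_cells * nodes             # 1,000,000 cost entries overall
print(nb_cells, msd, total_cells)          # 1000 10 1000000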

3.5.2 What are the Guarantees one can Expect when the Trueness Exceeds the Accuracy (i.e., when the Measured Delay Grain is Finer than the Grain of our Arrays)?


If propagation delays are measured with a really high trueness (e.g., with a delay grain of 1µs or less), BEST2COP can either remain exact but slower, or, on the contrary, rapidly produce approximated results. In practice, if one prefers to favor performance by choosing a fixed discretization of the propagation delay (to keep the computing time reasonable rather than returning truly exact solutions), this may result in an array not accurate enough to store all non-dominated delay values, i.e., two solutions might end up in the same cell of such an array even though they are truly distinguishable. Nevertheless, we can still bound the margin of error, relatively or in absolute terms, regarding the constraints or the optimization objective of the 2COP variant one aims to solve.

In theory, note that while no exact solution remains tractable if the trueness of measured delays is arbitrarily high (for worst-case DCLC instances), it is possible to set these error margins to extremely small values with enough CPU power. When the trueness exceeds the accuracy, each iteration of our algorithm introduces an absolute error of at most the delay grain on the delay metric, i.e., the size of one cell in our array (recall that the accuracy level is the inverse of the delay grain of the static array used by BEST2COP). So our algorithm may miss an optimal constrained solution p* (for a destination d) only if there exists another solution p whose delay differs from that of p* but whose distance is associated to the same integer (that is, stored in the same cell of the dist array), i.e., only if the two true delays fall within the same delay grain. In this case, we have d2(p) ≤ d2(p*), because otherwise p* would have been stored instead of p. From this observation, depending on the minimized metric, BEST2COP ensures the following guarantees.

If one aims to minimize the IGP cost or the number of segments (e.g., when solving DCLC), then BEST2COP guarantees a solution that optimizes the given metric, but this solution might not strictly satisfy the given delay constraint c1. As an example, for DCLC-SR (optimizing the IGP cost), the returned solution p verifies d2(p) ≤ d2(p*), while its delay may exceed c1 by at most MSD times the delay grain, with p* denoting the optimal constrained solution. When minimizing the delay, the solution returned by BEST2COP for a given destination d will indeed verify the constraints on the number of segments and the IGP cost, and its delay exceeds the optimal one by at most MSD times the delay grain. The induced absolute error regarding the delay of paths becomes negligible as the delay constraint increases: with, e.g., a grain of 0.1ms and MSD = 10, it translates to a small relative error of 1% for a 100ms constraint. Conversely, it becomes significant if the constraint is of the same order of magnitude as this error. When minimizing the IGP cost or the number of segments, it is thus recommended to set the delay grain as low as possible regarding the relevant sub-constraint(s) if necessary. Similarly, to guarantee a limited relative error when minimizing the delay, it is worth running our algorithm with a small enough grain. However, note that this latter and specific objective (in practice less interesting than DCLC) requires some a priori knowledge: either considering the best delay path without any cost and segment constraints, or running BEST2COP twice to get a first approximation of the best achievable delay, so as to avoid dimensioning the array blindly (here the delay bound is not a real constraint, only the cost and segment constraints apply as bounds of the problem; the delay bound just represents the absolute size of our array and, as such, the accuracy one can achieve).

Even though BEST2COP exhibits strong and tunable guarantees, it may not return exact solutions once two paths end up in the same delay cell, which may happen even with simple instances exhibiting a limited Pareto front. Fortunately, a slight tweak in the implementation is sufficient to ensure exact solutions for such instances. Keeping the original accuracy of the measured distances, one can rely on the truncated delays only to find the cell of each distance. One possible option then consists in storing up to a bounded number B of distinct distances in each cell (in practice, several implementation variants are possible, one of which consists in using the array only when the stored Pareto front exceeds a certain threshold; moreover, B can be set at a global scale, shared by all cells or even all destinations, instead of being a static value per cell, to support heterogeneous cases more dynamically; these approaches were also evoked in HIPH). Thus, some cells would form a miniature, undiscretized Pareto front of bounded size when required. This trivial modification allows the complexity to remain bounded and predictable: as long as there exist fewer than B distances within a cell, the returned solution is exact. Otherwise, the algorithm still enforces the aforementioned guarantees. While this modification increases the number of paths we have to extend (by a factor of B at worst), such cases are very unlikely to occur on average. Notably, our experiments show that the 3D Pareto fronts for each destination usually contain only a handful of elements on realistic topologies, meaning that a small B would be sufficient in practice. In summary, BEST2COP is efficient and exact for simple instances and/or when the accuracy matches the trueness, while it provides approximated but bounded solutions for difficult instances, in order to remain efficient and thus scalable even within massive-scale IP networks.
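The per-cell tweak can be sketched as follows (purely illustrative Python; the bound B and the cell layout are assumptions):

B = 4   # assumed per-cell budget; a small value should suffice in practice

def cell_insert(cell, d1_exact, d2):
    # cell: list of (exact delay, cost) pairs kept non-dominated, capped at B
    kept = [(a, b) for (a, b) in cell if not (d1_exact <= a and d2 <= b)]
    if any(a <= d1_exact and b <= d2 for (a, b) in kept):
        return kept                     # the new distance is dominated
    kept.append((d1_exact, d2))
    kept.sort()                         # ordered by exact delay
    # beyond B entries, exactness is no longer guaranteed for this cell and the
    # general (bounded) approximation guarantees apply instead
    return kept[:B]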

4 Performance Evaluation

In this section, we evaluate the computation time of our solution. We start by evaluating BEST2COP on various flat network instances, ranging from worst-case scenarios to real topologies, and compare it to an existing approach based on the Dijkstra algorithm, SAMCRA SAMCRA. Then, after having introduced our multi-area topology generator, we evaluate the extended variant of our solution, BEST2COPE, on massive-scale networks. In the following, we consider our discretization to be exact (i.e., the accuracy is high enough to prevent any loss of relevant information). Note that the delays fed to SAMCRA are thus also discretized in the same fashion, and that we also provide SAMCRA with the SR graph as input, to make the comparison as fair as possible. While it increases the graph density, it eliminates the need to deal with a third dimension when updating the Pareto front and managing the priority queue. SAMCRA can thus ignore paths of more than 10 edges. Note that while BEST2COP returns the whole 2COP solutions, our implementation of SAMCRA only returns DCLC-SR in this comparison. In other words, BEST2COP returns a whole 3D Pareto front, while SAMCRA is implemented to return a constrained 2D Pareto front. All our experiments are performed on an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz with 8 cores.

4.1 Computing Time & Comparisons for Flat Networks

This section illustrates the performance of our algorithm using three flat network scenarios. In particular, we do not take advantage of any area decomposition to mitigate the computing time. First, we get a strict upper bound on its execution time for the worst-case scenario. Second, we show that BEST2COP is far from reaching this upper bound when executed on actual networks, even with random weight assignments. Finally, we show that BEST2COP performs even better on real networks, where weight assignments are partially aligned (conversely to fully uncorrelated ones, which may lead to large Pareto fronts). We will see that BEST2COP outperforms SAMCRA even without the use of multi-threading.

Figure 5: Upper bound of BEST2COP execution time regarding the number of nodes and threads.

Upper bound. Here we force BEST2COP to explore its full iteration space (i.e., behaving as if all discretized distances had to be extended towards every node at each iteration). The results are shown in Fig. 5 for an increasing number of nodes and threads. BEST2COP does not exceed 84s when using a single thread for the largest instances, given the average number of parallel links in the transformed graph. This time, reasonable given the unrealistic nature of the experiment, is significantly reduced when relying on multi-threading. Using 8 threads, BEST2COP's execution time decreases to around 10s, highlighting both the parallelizable nature of our algorithm and its inherently good performance (past this number of threads, there is no more speedup, as we exceed the number of physical cores of our machine). Additional experiments conducted on a high-performance grid show that BEST2COP reaches a speedup factor of 23 when run on 30 cores.

Overall, the extreme execution times presented here are far from BEST2COP's real performance: its data structures were artificially filled up to push it to its maximum exploration limits. In practice, when considering concrete underlying raw networks and their SR transformation, BEST2COP requires less than half a second even with random topologies and a limited number of threads, as we will now see.

Random networks. We continue our evaluation with random scenarios and compare our execution times with SAMCRA. We first generate raw connected graphs using the Erdos-Rényi model with a fixed density. IGP weights are picked at random using a bounded Zipf distribution, up to one more bit than the maximum possible IGP cost in current IGPs. Likewise, propagation delays are picked at random using the same distribution. Here, we only show the results for a maximum delay value of 100ms, as this value leads to the highest computation time out of all the tests performed. Finally, we apply the SR graph conversion on these random graphs.

We perform these tests for a number of nodes ranging from 100 to 1000 (with steps of 100). To account for the randomness of both valuation functions, we generate 30 differently weighted distinct topologies for each number of nodes. We run BEST2COP from several nodes selected as representative sources (randomly picked uniformly). The number of threads is set to 1 and 8, as shown in Fig. 6. While SAMCRA is run from the same amount of sources in the same conditions, it cannot benefit as-is from multi-threaded architectures, as it cannot be parallelized as easily. Note that SAMCRA is also run on the SR graph, to spare it the management of a 3D Pareto front (i.e., on-the-fly conversions of the discovered paths into segment lists to perform continuous Pareto front updates).

The resulting computing times are shown in Fig. 6: both axes follow a linear scale up to 1000 nodes, and switch to a logarithmic scale afterward for better readability.

Figure 6: Computation time taken by Best2cop (1 and 8 threads) and SAMCRA to solve 2COP on random topologies with random weights. The scale of both axes switches from linear to logarithmic (the grey area). A confidence interval is shown, but it is very tight and thus hardly noticeable.

Note that such an evaluation is not advantageous and not yet representative of the times reached on real instances, as we do not benefit from the patterns present in realistic networks, such as metric alignment. Such random valuations exhibit the efficiency of BEST2COP in moderately challenging conditions. When relying on a single thread, BEST2COP reaches an execution time of about 250ms for the largest instances (1000 nodes). Once again, this time can be greatly reduced when relying on multiple threads. With 8 threads, BEST2COP remains under 100ms in all of its runs (below this node threshold), exhibiting an average performance of about 50ms for the largest instances. Overall, BEST2COP scales well enough with the dimension of the network, which is the critical performance parameter (the complexity being quadratic in the number of nodes).

SAMCRA showcases a higher execution time, reaching about 2s for the largest random instances. It suffers far more from the density of the SR graph (used to natively manage the third dimension of the Pareto front), and mostly from the random nature of the weights that inflates the size of the Pareto front. Indeed, while BEST2COP performs updates of the Pareto front in a relaxed, amortized manner at the end of each iteration, SAMCRA updates it for each newly discovered path, resulting in a higher management cost when the front size becomes significant.

Beyond 1000 nodes, Fig. 6 solely showcases the execution time of the most effective approach, i.e., BEST2COP with 8 threads. Although BEST2COP seemed to scale very well with the dimensions of the graph, its execution time increases drastically past several thousands of nodes, reaching about 10s for the largest instances. Thus, even BEST2COP (in its flat variant, not relying on area decomposition) cannot scale with SR graphs of tens of thousands of nodes to enable real-time routing.

In summary, although the flat variant of BEST2COP also shows its limits in massive-scale networks, BEST2COP performs well for large-scale networks (i.e., a single large area) and showcases better performance than SAMCRA (here with unfavorable, randomly weighted SR graphs of up to 1000 nodes). SR graphs translated from real topologies are likely to produce simpler instances of the problem, in particular with favorable valuations mitigating the Pareto front size. Thus, we now analyze the performance of BEST2COP and SAMCRA on a more realistic case.

Figure 7: Computation time taken by Best2cop (1 and 8 threads) and SAMCRA to solve 2COP on a real topology with real weights (more than 1100 nodes).

Real network. Let us now consider a real IP network topology having neither a random structure nor random valuations. We use our largest available ISP topology, consisting of more than 1100 nodes and 4000 edges. While the IGP costs of each link were available, we do not have their respective (real) measured delays. We thus infer delays thanks to the available geographical locations: we set the propagation delays as the orthodromic distances between the connected nodes divided by the speed of light. The execution times are shown in Fig. 7, where BEST2COP and SAMCRA are run for every node as a potential source (resulting in distributions represented as boxplots).

One can see that the two algorithms indeed benefit from interesting properties in deployed networks. In particular, the execution time of BEST2COP rarely exceeds 25ms (resp. 100ms) when relying on 8 threads (resp. 1 thread). Performance is greatly enhanced compared to the previous cases thanks to the real underlying network structure and weight alignment: few distances dominate all the others, leading to small Pareto fronts. A similar trend can be observed for SAMCRA, whose average execution time decreases drastically compared to the previous evaluation, to reach a few hundreds of milliseconds, although some outliers took noticeably longer. These results are not surprising, as SAMCRA, as well as other Dijkstra-like approaches, relies on the fact that simple practical cases result in small Pareto fronts, as metrics are often somewhat aligned.

Nevertheless, BEST2COP still shows better performance (be it with a single or multiple threads), in particular because of the density of the SR graph input that impacts both approaches. Indeed, SAMCRA, being implemented with a binary heap as priority queue, is more likely to suffer from a large number of edges than BEST2COP.

In realistic cases, BEST2COP can thus afford a finer accuracy, down to a micro-second delay grain for small enough delay constraints (a few milliseconds), while keeping the execution time in the hundreds of milliseconds. One may notice that (almost) perfectly aligned metrics reduce the usefulness of any DCLC-like algorithm, but such metrics are not always aligned for all couples in practice (even with realistic cases, we observe that the average size of the 3D Pareto front is strictly greater than one, with typically a few elements). Our algorithm deals efficiently with easy cases and remains exact (or at least near-exact for difficult instances having both high trueness and exponentially increasing Pareto fronts) and efficient for more complex cases, e.g., with random graphs.

The networks used so far were flat networks, with sizes typically not exceeding a thousand nodes. However, some recent IP network deployments exceed 100000 nodes even in medium-sized countries. We thus now aim to evaluate the execution time of BEST2COPE, the extended variant of BEST2COP which supports and leverages OSPF-like area division. This version is adapted to tackle TE problems in massive-scale, hierarchical networks. In the following, we only consider our approach, as it already exhibited better performance than SAMCRA even for small, simple instances. However, before analyzing the computing time results, we first introduce our generator of massive-scale, multi-area, realistic networks having two valuations.

4.2 Massive Scale Topology Generation

To the best of our knowledge, there are no massive scale topologies made publicly available which exhibit IGP costs, delays, and area subdivision. For example, the graphs available in the topology zoo (or sndlib) datasets do not exceed 700 nodes in general. Moreover, the ones for which the two metrics can be extracted, or at least inferred, are limited to less than 100 nodes. Thus, at first glance, performing a practical massive-scale performance evaluation of BEST2COPE is highly challenging if not impossible. There exist a few topology generators IGEN; BRITE able to generate networks of arbitrary size with realistic networking patterns, but specific requirements must be met to generate topologies onto which BEST2COPE can be evaluated, in particular the need for two metrics and the area decomposition.

Topology generation requirements. First, the experimental topologies must be large, typically of several tens of thousands up to more than 100000 nodes. Second, they must possess two valuation functions as realistic as possible, one for the IGP cost and the other modeling the delay. Third, since the specific patterns exhibited by real networks impact the complexity of TE-related problems, the generated topologies must possess realistic structures (e.g., with respect to redundancy in the face of failures in particular). Finally, for our purposes, the topology must be composed of different areas centered around a core backbone, typically with two ABRs between each to avoid a single point of failure. Since we do not know of any generator addressing such requirements, we developed YARGG (Yet Another Realistic Graph Generator), a Python tool (code available online: https://github.com/JroLuttringer/YARGG) which allows one to evaluate algorithms on massive-scale realistic IP networks. In the following, we describe the generation methods used to enforce the required characteristics.

High-level structure. One popular ISP structure is the three-layer architecture CampusNe63:online, illustrated in Fig. 9. The access layer provides end-users access to the communication service. Traffic is then aggregated in the aggregation layer. Aggregation routers are connected to the core routers forming the last layer. The aggregation and access layers form an area, and usually cover a specific geographical location. The core routers, i.e., the ABRs connecting the backbone to the other areas, and their links, form the backbone area that interconnects the stub areas (the aggregation and access layers of the different geographical locations). Core routers are thus ABRs and also belong to a stub area, grouped in couples of two for redundancy. While the access and aggregation layers usually follow standard structures and weight systems recommended by different network vendors, the backbone can vastly differ among operators, depending on geographical constraints, population distribution, and pre-existing infrastructure. Taking these factors into account, YARGG generates large networks by following this 3-layer model, given a specific geographical location (e.g., a given country).

Figure 8: Core network (before step 5) generated by YARGG in France. While we consider the road distances, we represent the links in an abstract fashion for readability purposes. The color and width of the links represent their bandwidth (and thus their IGP costs).

Generating the core network and the areas. YARGG generates the core network by taking the aforementioned considerations into account: existing infrastructures, population, and geographical constraints. An example of a core network as generated by YARGG may be seen in Fig. 8. At a very high level, given a geographical location (e.g., a country or a continent), YARGG builds the structure of the core network by

  1. Extracting the most populated cities in the area (close cities are merged in a single entity);

  2. Constructing a minimum Spanning-Tree covering all cities of the area, considering road distances and population. Links between highly populated cities are prioritized.

  3. Removing articulation-points, thus creating a bi-connected graph;

  4. Adding links increasing the connectivity and resilience for a limited cost;

  5. Doubling the obtained topology (making the graph tri-connected) by systematically adding links between the two ABR routers of the same city, and replicating the existing links in this second plane (dual data plane);

  6. Associating each couple of ABRs with the last two layers of the topology, forming an area per city.

The couple of routers located at each city within this generated backbone area become the ABRs between the backbone and their area, which is generated next.

Figure 9: Weights and structures of an area generated by YARGG.

Access & aggregation layers. These last two layers make up a non-backbone area and span a reduced geographical area. Thus, one access and one aggregation layer are located in each city considered by YARGG. Several network equipment vendors recommend a hierarchical topology, such as the three-layer hierarchical model Hierarch47:online. An illustration can be seen in Fig. 9. Simply put, there should be two core routers (the ABRs) at the given location (a city in YARGG’s case). Each core router is connected to all aggregation routers. For better resiliency, the aggregation layer is divided into aggregation groups, composed of two connected routers. Finally, routers within an aggregation group are connected to access-layer routers. To achieve areas of 300 nodes, we consider 30 access routers per aggregation group. This results in a large, dense and realistic graph.

Weights. In the backbone, the weights generated by YARGG are straightforward. The delays are extracted from the road distances between the cities, divided by the speed of light (close to the best performing fiber optic). The IGP cost is 1 for links between large cities since these links usually have a high bandwidth (in black in Fig. 8), 2 for standard links, necessary to construct a tri-connected graph (added at step 3, in red in Fig. 8), and 5 for links that are not mandatory, but that increase the overall connectivity (added at step 4, in orange in Fig. 8).

Within an area, the IGP costs follow a set of realistic constraints, according to two main principles: (i) access routers should not be used to route transit traffic (except for the networks they serve), (ii) links between routers of the same hierarchical level (e.g., between the two core routers or the two aggregation routers of a given aggregation group) should not be used unless necessary (e.g., upon multiple link or node failures). These simple principles lead to the IGP costs exhibited in Fig. 9. The delays are then chosen uniformly at random. Since access routers and aggregation routers are geographically close, the delay of their links is chosen within a small range (below a few milliseconds). The delay between aggregation routers and core routers is chosen between this value and the lowest backbone link delay. Thus, links within an area necessarily possess a lower delay than core links.

Summary. YARGG computes a large, realistic and multi-area topology. The backbone spans a given geographical location and possesses simple IGP weights and realistic delays. Other areas follow a standard three-layer hierarchical model. Weights within a stub area are chosen according to a realistic set of usual ISP constraints. Delays, while chosen at random within such area, remain consistent with what should be observed in practice.

4.3 Computing Time for Massive Scale Multi-Areas Networks

Figure 10: Best2copE computation time on 5 continent-wide topologies generated by YARGG.

Using YARGG, we generate five massive-scale, continent-wide topologies, and run BEST2COPE on each one of them. The topologies range from several tens of thousands up to more than 100000 nodes. Each non-backbone area possesses around 300 nodes. The topologies, their geographical representations and some of the associated network characteristics can be found online topologies.

We run BEST2COPE using each ABR as a source (one source per ABR). The time corresponding to the message exchange of the computed Pareto fronts (step 2 of BEST2COPE) is not taken into consideration. Thus, the experiment is comprised of the following steps:

  1. All ABRs run BEST2COP for their areas and convert the found distances into segment lists;

  2. The current source ABR performs the Cartesian product of distances as described in Sec. 3.4, to find all non-dominated distances to all destinations.

The computation time showcased is thus the sum of the average time taken by ABRs to perform the preliminary intra-area BEST2COP (and the distances to segment lists conversions) plus the time taken to perform the Cartesian products (for all other ABRs of all other areas).

Note that we consider an ABR as a source and not an intra-area destination. In practice, the ABR would send the computed distances to the intra-area nodes, who in turn would have to perform a Cartesian product of these distances with their own distances to said ABR. However, both the ABR and the intra-area nodes have to consider the same number of destinations, and the results computed by the ABR can be sent as they are generated (destination per destination), allowing both the ABR and the intra-area nodes to perform their Cartesian products at the same time. In addition, intra-area nodes may benefit from several optimizations regarding their Cartesian product, if the constraints of the desired paths are known (these optimizations will not be used nor detailed in this paper). For these reasons, we argue that the time measured here, using an ABR as a source, is representative of the total actual time required, i.e., the overall worst time for the last treated destination at each source.

The results of this experiment are shown in the violin plot of Fig. 10. By leveraging the network structure, BEST2COPE exhibits very good performance despite the scale of the graph, with computing times comparable to the ones taken by its flat variant on networks an order of magnitude smaller. Furthermore, BEST2COPE seems to scale linearly with the number of nodes, remaining under one second for most instances. Even once the network reaches its largest size, BEST2COPE is able to solve 2COP in less than one second for a non-negligible fraction of the sources, and never exceeds 1.5s.

Note that the times showcased here rely on a single thread. While BEST2COPE's Cartesian product can be parallelized locally (both at the area and the destination scale), this parallelization hardly has any effect. This is explained by the fact that these individual computations are in fact fairly efficient, hence the overhead induced by the creation and management of threads is heavier than their workload. In addition, since BEST2COPE deals with very large topologies, some complex memory-related effects might be at play. Indeed, we noticed that these results surprisingly vary depending on the underlying system, operating system and architecture, due to differences in terms of memory management.

Thus, while massive-scale deployments seem to a priori prevent the usage of fine-grained TE, their structure can be leveraged, making complex TE problems solvable in less than one second even for networks reaching 100000 nodes. The computations performed for each area can also be distributed among different containers within the cloud, if handled by a controller.

4.4 Discussions and Ongoing Works

The absolute times exhibited here may surprise some readers. Well-known algorithms, such as MC Dijkstra (similar to SAMCRA, except for a different cost function), have been shown to solve DCLC exactly in about 300ms on very large road networks Hanusse_Ilcinkas_Lentz_2020. In our evaluation, SAMCRA reaches this execution time for one thousand nodes only.

However, such experiments only consider two dimensions (cost and delay), and do not consider Segment Routing (neither as a third dimension nor through an SR graph). Consequently, these best-time DCLC results are usually obtained on very sparse networks (usually real or random ones, with a density that may be as low as 0.0001) with aligned metrics. In our study, algorithms are run on very dense SR graphs (density of 1 for each flat network or area), to natively consider the number of segments and fully solve 2COP (or at least DCLC-SR for SAMCRA in our comparisons).

As one may recall, considering the fully meshed SR graph has two main advantages. First, it allows the number of segments and the associated MSD constraint to be considered easily. Maintaining a 2D Pareto front is then sufficient, as the third dimension can be handled natively. Second, it avoids the need to convert all discovered paths to segment lists on the fly. Indeed, the number of segments characterizes a path, not an edge. In other words, this metric cannot be added to the weight vector of an edge. Rather, all discovered paths would have to be converted to segment lists to maintain a correct 3D Pareto front respecting the MSD constraint. Although the information required to perform this conversion in polynomial time is available, it would require many additional computations.

For these reasons, we considered here that running the algorithms on the SR graph itself is beneficial. Indeed, BEST2COP has the ability to handle path updates without the use of any priority queue (PQ), such that it is less sensitive to this graph parameter than Dijkstra-related approaches, which may suffer from dense graphs such as SR ones if relying on a naive PQ implementation. While most algorithms maintain the Pareto front up to date as soon as a path is visited, BEST2COP only performs this operation once per iteration. Thus, the maintenance cost is lower and less dependent on the number of edges. For Dijkstra-based approaches such as SAMCRA, the use of a relaxed PQ (like the Fibonacci heap) may look useful to decrease this sensitivity. However, in our evaluations, we consider a binary heap to implement the PQ of SAMCRA, as it is the one providing the best performance in the scope of our experiments. Indeed, just marking a path as deprecated in the binary heap is enough to relax the updates (i.e., there is no need to re-organize the heap at a logarithmic cost).

It is thus worth investigating whether using the SR graph is actually detrimental to other algorithms, including SAMCRA. For some, relying on the raw graph, as well as on-the-fly conversions and the maintenance of a 3D Pareto front, might be more efficient. This model might also be more advantageous to BEST2COP itself in some cases. This study is left for future work. Such investigations may also call forth the following question: is considering the number of segments of each path necessary? One may, for example, run Dijkstra-based algorithms on the raw graph (relying on a PQ), and convert only the desired paths to segment lists afterward. Paths may turn out to exceed the MSD constraint, but could still be deployed relying on binding segments. This solution is however fairly risky. Binding segments should indeed be avoided, as they result in routing overhead and additional states to maintain in each routing node. Depending on the MSD constraint of the underlying hardware, and the nature of the network, this solution might lead to the use of too many binding segments.

Overall, BEST2COP is the only algorithm to provide safe guarantees regarding both its results and execution times when considering 2COP in all possible cases. By natively considering the number of segments, BEST2COP ensures that the paths found are feasible without relying on the management of a 3D Pareto front, on-the-fly conversions, or binding segments, which could be very costly in worst cases. This is possible because BEST2COP deals more efficiently with dense networks than other state-of-the-art algorithms, as shown in this section: it does not rely on any PQ and performs efficient amortized Pareto front updates. Finally, BEST2COP solves relevant TE problems exactly in simple scenarios, or returns approximated solutions exhibiting strong, predictable and straightforward guarantees. Because of its ability to provide safeguards in all possible cases (regarding both its computation times and results), as well as its ability to deal with massive-scale multi-area networks, BEST2COP is an ideal candidate to be deployed for such TE flavors in IP networks.

5 Conclusion

While the overhead of MPLS-based solutions led to a TE winter in the past decade, Segment Routing marked its rebirth. In particular, SR enables the deployment of a practical solution to the well-known DCLC problem. In this paper, we proposed a multi-metric SR construct onto which our algorithm, BEST2COP, iterates to natively and efficiently solve DCLC in SR domains. BEST2COP leverages both SR properties and the inherently limited accuracy of measured delays to efficiently handle all scenarios: either with exactitude for simple instances (i.e., with Pareto fronts limited in size and/or weak delay trueness) or, at worst for difficult instances, with strong guarantees, e.g., a bounded distance regarding the delay constraint. Indeed, BEST2COP not only handles SR as a third constraint almost for free, but also relies on the most simple and efficient data structures and concepts available in the DCLC literature.

In this paper, we went several steps further with the following achievements:

  • we started by experimentally demonstrating that SR is a relevant technology to deploy DCLC paths, as the vast majority of solutions do not exceed its MSD limit (i.e., the required number of segments is limited to 10 in practice);

  • the versatile design of BEST2COP allows ISP to solve the most practical category of optimization variants and heterogeneous constraints for each destination within SR domains, i.e., it solves 2COP taking into account the propagation delay, the IGP cost and the number of necessary segments;

  • for massive scale ISPs relying on area-subdivision, we extend BEST2COP, partitioning 2COP into smaller sub-problems, to further reduce its overall complexity (time, memory and churn);

  • through extensive evaluations and comparisons, in particular relying on multi-threading and our own multi-metric/multi-areas network generator, we have shown that BEST2COP is very efficient in practice.

To the best of our knowledge, BEST2COP is the first practically exact and efficient solution for 2COP within SR domains, making it the most practical candidate to be deployed for such a TE flavor in today's ISPs. It is able to solve 2COP on massive-scale realistic networks having more than 100000 nodes in less than a second. For large areas having thousands of routing devices, we have shown that BEST2COP easily deals with random topologies while its competitors do not scale. Finally, more advanced and flexible structures can be envisioned to deal with really high trueness requirements, while deploying novel flex-algo strategies can help mitigate the rare drawbacks of SR limits.

Acknowledgment

This work was partially supported by the French National Research Agency (ANR) project Nano-Net under contract ANR-18-CE25-0003.

References