Optimizing Inter-Datacenter Tail Flow Completion Times using Best Worst-case Routing

08/24/2019 ∙ by Max Noormohammadpour, et al. ∙ University of Southern California

Flow routing over inter-datacenter networks is a well-known problem where the network assigns a path to a newly arriving flow, potentially according to the network conditions and the properties of the new flow. An essential system-wide performance metric for a routing algorithm is flow completion time, which affects the performance of applications running across multiple datacenters. Current static and dynamic routing approaches do not take advantage of flow size information in routing, although doing so is practical in a controlled environment such as inter-datacenter networks that are managed by the datacenter operators. In this paper, we discuss Best Worst-case Routing (BWR), which aims at optimizing the tail completion times of long-running flows over inter-datacenter networks with non-uniform link capacities. Since finding the path with the best worst-case completion time for a new flow is NP-Hard, we investigate two heuristics, BWRH and BWRHF, which use two different upper bounds on the worst-case completion times for routing. We evaluate BWRH and BWRHF against several real WAN topologies and multiple traffic patterns. Although BWRH better models the BWR problem, BWRH and BWRHF show negligible difference across various system-wide performance metrics, while BWRHF is significantly faster. Furthermore, we show that compared to other popular routing heuristics, BWRHF can reduce the mean and tail flow completion times by over 1.5× and 2×, respectively.




I Introduction

We revisit the well-known flow routing problem over the inter-datacenter networks where one operator manages the network routing as well as the datacenters. Examples of such networks are Google B4 [b4, b4andafter], Facebook Express Backbone [facebook-express-backbone], and Microsoft Global WAN [swan-backbone]. Traffic forwarding over these networks is controlled in a logically centralized fashion which brings about new opportunities. That is, routing algorithms can use a global view of network status and end-point demands to optimize the routing of inter-datacenter traffic.

A growing fraction of traffic over the inter-datacenter networks is generated by cross-datacenter flows that replicate content and data across many datacenters [facebook-express-backbone]. These flows are mostly throughput-oriented, carry large volumes of data, and are resilient to initial routing and scheduling latency. Optimizing the completion times of cross-datacenter flows, which is the focus of this paper, can improve the performance of geographically distributed datacenter applications and increase the overall network utility.

An inter-datacenter flow can be specified by a source datacenter, a receiver datacenter, and the volume of data to be moved. We consider an online scenario without knowledge of future flow arrivals. Upon arrival, the routing algorithm with a global network view will select a path for the new flow. Similar to the standard routing techniques, once a path is selected for a flow, it cannot be changed unless there is a network failure. The objective is to minimize the tail completion time of flows over a given period, where a flow's completion time is defined as the time from its arrival to its completion. Since the tail-optimal flow completion times can only be computed given the knowledge of future flow arrivals, we apply an online-greedy approach to minimizing the tail flow completion times named Best Worst-case Routing (BWR). (Footnote 1: An earlier version of BWR was proposed and analyzed in [bwr]. In this paper, we have extended BWR to networks with non-uniform link capacities and have also proposed and evaluated an additional heuristic with guaranteed polynomial running time.)

We briefly go over current online path selection techniques. Multiple static and dynamic routing techniques have been proposed for this purpose, which consider link capacity information, the number of hops, link propagation latency, and instantaneous link utilization for routing [tvlakshman, routing-metric]. For example, the shortest widest path approach [tvlakshman] selects the path that offers the maximum available bandwidth at the current time with the minimum number of hops from the source to the destination. Also, the min-max utilization approach selects the path with the minimum number of hops over which the maximum utilization is minimum from the source to the destination. In addition, a popular static routing approach is to assign to every edge a cost equal to its inverse capacity and then select one shortest path from the source to the destination [routing-metric]. We will consider these approaches later in the evaluations.

The BWR technique presented in this work takes advantage of the remaining flow size information for long-running flows to further improve on the current online and dynamic routing approaches. In a network that is managed in a logically centralized fashion, such information is either known by the traffic scheduler (e.g., [tempus, amoeba, owan]) or can be obtained from the end-points periodically. In summary, our paper makes the following contributions:

  • We propose Best Worst-case Routing (BWR), which aims to minimize the worst-case completion time of every incoming flow given the network topology, the currently ongoing flows’ paths, and their remaining volume of data. We make no assumptions on the future flow arrivals and impose no constraints on the traffic scheduling policy.

  • Since BWR is NP-Hard [bwr], we propose two heuristics, BWRH and BWRHF, that aim to select a path that minimizes an upper bound on the worst-case completion time of the arriving flow. We discuss why computing the exact worst-case per path can be computationally intensive as it may require building complex dependency graphs.

  • We run extensive simulations to compare the performance of BWRH and BWRHF with that of popular static and dynamic routing approaches currently used. We first find that BWRH and BWRHF offer almost identical performance across various topologies and traffic patterns. Next, we show that BWRHF can improve the mean and tail completion times by over 1.5× and 2×, respectively, given various flow size distributions and scheduling policies.

II System Model

Let us consider an inter-datacenter network with non-uniform link capacity distribution across the edges. We assume an online scenario without the knowledge of future flow arrivals. Every flow is assigned a fixed path when it arrives which is computed using the information available at a logically centralized network controller on the network topology and the other currently ongoing flows. This assumption is based on many recent related works on management of inter-datacenter WAN [b4, facebook-express-backbone, swan-backbone]. We also assume the availability of flow size information for the new flow and the remaining flow size for ongoing flows. Please find the definition of variables used in this paper in Table I.

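Variable              Definition
$F$                   A data flow
$s_F$                 Source node of $F$
$t_F$                 Destination node of $F$
$a_F$                 The arrival time of $F$
$f_F$                 Worst-case finish time of $F$
$c_F$                 Worst-case completion time of $F$, i.e., $f_F - a_F$
$\mathcal{V}_F$       Total volume of $F$ in bytes
$\mathcal{V}^r_F$     Remaining volume of $F$ in bytes, i.e., $\mathcal{V}^r_F \le \mathcal{V}_F$
$G(V, E)$             The inter-datacenter network graph
$e$                   An edge where $e \in E$ given $G(V, E)$
$C_e$                 Capacity of edge $e$ in bytes per second
$P$                   A path on the inter-datacenter network
$P_F$                 Path selected for $F$
$W_P$                 Worst-case completion time of the new flow $F_{new}$ on path $P$
$E_P$                 Edges of path $P$
$\mathcal{F}_e$       All flows whose paths go over $e$
TABLE I: Definition of Variables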

We focus on the routing of long flows for which the additional latency due to centralized control is acceptable. Also, by aiming at sufficiently large flows, the queuing and propagation latency become negligible and so can be omitted from the calculations. The routing for short flows can be done in a distributed fashion without the involvement of the logically centralized controller, e.g., short flows can be routed on the paths with minimum propagation latency. This approach will make the system more scalable as the number of large flows is usually a small fraction of the total number of flows although they carry significant volumes of traffic [social_inside, facebook-express-backbone]. (Footnote 2: A flow is large if its completion time is orders of magnitude larger than the propagation latency or if it is orders of magnitude larger than the median flow size.)

Depending on how network queues prioritize the transmission of data packets and how the end-points transmit traffic, a variety of network scheduling policies can be realized. We consider the three popular scheduling policies of First Come First Serve (FCFS), Shortest Remaining Processing Time (SRPT), and fair sharing based on max-min fairness [max-min-fairness]. In evaluations, we will focus mainly on the fair sharing policy as that is the ideal outcome when multiple TCP flows share network links.

Finally, we assume that the network controller is capable of dynamically updating the network forwarding state to install custom routes for the long incoming flows. Such forwarding is achievable with the application of Software Defined Networking (SDN) [sdn] which is supported by several current inter-datacenter networks [b4, swan-backbone, facebook-express-backbone]. This goal is obtainable over MPLS networks as well via techniques such as Segment Routing (SR) [filsfils2018segment].

III Best Worst-case Routing (BWR)

BWR aims to minimize the worst-case (i.e., tail) completion time of an incoming flow without any knowledge of future flow arrivals. The worst-case completion time of a flow is independent of the network scheduling policy and is only a function of network topology, links’ capacities, and the remaining volumes of traffic for the current flows. The BWR problem can be stated as follows (variables in Table I):
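To make the policy-independence claim concrete, the following minimal sketch considers a deliberately simplified setting: a single shared link and no future arrivals. The function names and the sample numbers are hypothetical and are not taken from the paper. It computes when a new flow would finish under FCFS, SRPT, and equal sharing, and compares each result with the bound obtained when every current flow is served first.

```python
def new_flow_completion_time(policy, remaining, v_new, capacity):
    """Completion time of a newly arrived flow on ONE shared link.

    Simplifying assumptions (not from the paper): a single link,
    no future arrivals, and idealized schedulers.
    remaining: remaining volumes of the flows already on the link
    v_new:     volume of the new flow
    capacity:  link capacity in volume units per second
    """
    if policy == "fcfs":
        # Every earlier flow is served to completion before the new one.
        work = sum(remaining) + v_new
    elif policy == "srpt":
        # Only flows that are not larger than the new flow go first
        # (ties are resolved against the new flow, the pessimistic choice).
        work = sum(r for r in remaining if r <= v_new) + v_new
    elif policy == "fair":
        # Equal sharing: by the time the new flow has sent v_new,
        # every other flow has sent min(r, v_new).
        work = sum(min(r, v_new) for r in remaining) + v_new
    else:
        raise ValueError(f"unknown policy: {policy}")
    return work / capacity


def worst_case_completion_time(remaining, v_new, capacity):
    """All current flows finish before the new flow starts."""
    return (sum(remaining) + v_new) / capacity


# Hypothetical numbers: two current flows, one new flow, capacity 2.
for policy in ("fcfs", "srpt", "fair"):
    print(policy, new_flow_completion_time(policy, [4, 10], 6, 2))
print("worst case", worst_case_completion_time([4, 10], 6, 2))
```

In this toy setting the three policies give different completion times, but none exceeds the worst-case value, which depends only on the capacity and the remaining volumes.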

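Best Worst-case Routing Problem: Given an inter-datacenter wide area network $G(V, E)$, capacity $C_e$ for every edge $e \in E$, and the current flows with their paths and remaining volumes, we want to assign a path $P_{F_{new}}$ to the new flow $F_{new}$ so that its worst-case completion time $c_{F_{new}}$ is minimized.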

This is a greedy online approach to optimizing the tail completion times of flows. This problem is highly complex for two reasons. First, in a general setting, computing the exact worst-case completion time of the new flow $F_{new}$ on a given path may be computationally intensive due to the interdependence of current flows. (Footnote 3: Please see the appendix for an example and discussion.) Second, even assuming that there is an efficient algorithm to compute the worst-case completion time of $F_{new}$ on any given path $P$, finding the best path $P_{F_{new}}$ is NP-Hard [bwr].

Our Approach: Given that BWR is NP-Hard, we propose two heuristics in the following sections for finding an approximate solution. Instead of computing the exact worst-case completion time of the new flow $F_{new}$ for a given path $P$ (i.e., exact $W_P$), we use two different upper bounds. Depending on the upper bound used, the heuristics developed will have different properties.

IV BWRH

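We compute an upper bound for the worst-case completion time of a new flow $F_{new}$ given the following assumptions. First, we assume no knowledge of future arrivals, and so only the current flows are considered. Second, the worst case occurs when all the current flows that have a common edge with a candidate path $P$ are sent before $F_{new}$ and their bottleneck is the edge with minimum capacity that they have in common with $P$. Next, the worst case occurs when all the current flows that have a common edge with $P$ are transmitted sequentially, and so the time to complete their remaining volumes is accumulated. Finally, after all the current flows that have a common edge with $P$ are finished, $F_{new}$ will take a time equal to its volume divided by the bottleneck capacity on $P$ to complete. Therefore, given the ongoing flows $F$ and their paths $P_F$, we approximate $W_P$ for any given path $P$ using the aforementioned upper bound:

$$W_P \approx \sum_{\{F \,\mid\, E_{P_F} \cap E_P \neq \emptyset\}} \frac{\mathcal{V}^r_F}{\min_{e \in E_{P_F} \cap E_P} C_e} + \frac{\mathcal{V}_{F_{new}}}{\min_{e \in E_P} C_e} \qquad (1)$$

Here $E_P$ denotes the edges of path $P$, $\mathcal{V}^r_F$ the remaining volume of a current flow $F$, $\mathcal{V}_{F_{new}}$ the volume of the new flow, and $C_e$ the capacity of edge $e$.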
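The bound described above can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the function name, the representation of a path as a list of `(u, v)` edge tuples, and the sample numbers are all hypothetical and not part of the original work.

```python
def bwrh_upper_bound(path_edges, v_new, capacity, flows):
    """Upper bound on the worst-case completion time of a new flow on a path.

    path_edges: edges of the candidate path, e.g. [("a", "b"), ("b", "c")]
    v_new:      volume of the new flow
    capacity:   dict mapping each edge to its capacity
    flows:      list of (remaining_volume, edges_of_flow_path) for current flows
    """
    path_set = set(path_edges)
    bound = 0.0
    for remaining, flow_edges in flows:
        common = path_set & set(flow_edges)
        if common:
            # A current flow sharing an edge with the path is assumed to finish
            # first, limited by the slowest edge it has in common with the path.
            bound += remaining / min(capacity[e] for e in common)
    # The new flow then needs its volume over the path's bottleneck capacity.
    bound += v_new / min(capacity[e] for e in path_edges)
    return bound


# Hypothetical example: a two-edge path and three current flows,
# one of which does not touch the path and is therefore ignored.
capacity = {("a", "b"): 1.0, ("b", "c"): 0.5, ("x", "y"): 1.0}
flows = [
    (3.0, [("a", "b")]),
    (1.0, [("a", "b"), ("b", "c")]),
    (7.0, [("x", "y")]),
]
print(bwrh_upper_bound([("a", "b"), ("b", "c")], 5.0, capacity, flows))  # 15.0
```

For the sample data the bound is 3/1 + 1/0.5 + 5/0.5 = 15 time units.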


A straightforward approach to computing the path with the best worst-case completion time for is using exhaustive search. However, since the number of paths between two end-points grows exponentially with the network size, exhaustive search is generally inefficient and slow. In Algorithm 1, we have augmented exhaustive search with a straightforward termination condition that speeds up the process significantly for the majority of flows.

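Input: $F_{new}$, $G(V, E)$, and $P_F$ and $\mathcal{V}^r_F$ for every current flow $F$
$h \leftarrow$ #hops on the minimum hop path from $s_{F_{new}}$ to $t_{F_{new}}$;
$P_h \leftarrow$ a path with at most $h$ hops from $s_{F_{new}}$ to $t_{F_{new}}$ for which $W_P$ is the minimum possible, found by scanning all such paths;
repeat
       $h \leftarrow h + 1$;
       Find $P_h$ as above and compute $W_{P_h}$;
until $W_{P_h} \ge W_{P_{h-1}}$;
$P_{F_{new}} \leftarrow P_{h-1}$;
Algorithm 1 BWRH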

Algorithm 1 finds a path for the new flow $F_{new}$ using the approximation in Eq. 1. At every iteration, the algorithm finds the path $P_h$ with at most $h$ hops from the source $s_{F_{new}}$ to the destination $t_{F_{new}}$ with the minimum value of $W_P$ by finding and examining all such paths. Starting with $h$ equal to the number of hops on the minimum hop path from the source to the destination, the algorithm finds $P_h$ and $W_{P_h}$. It then increases the maximum number of hops allowed by one, i.e., $h \leftarrow h + 1$, extending the search space to more paths. This process continues until increasing $h$ yields no improvement in the worst-case completion time of the best path, i.e., the path for which the worst-case completion time is minimum.

The termination condition in BWRH favors shorter paths. If the best path is much longer than the minimum hop path, it is possible that the algorithm terminates before it is found. However, it is unlikely, in general, for the best path to be long as having more edges increases the likelihood of having common edges with more current flows, which increases the worst-case completion time on a path.
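The hop-limited search and its termination rule can be sketched as follows. This is a plain-Python illustration, not the authors' Java/JGraphT implementation: all function names, the adjacency-dictionary graph format, and the sample topology are assumptions made for the example, and the sketch assumes the source and destination differ.

```python
from collections import deque


def min_hop_count(adj, src, dst):
    """Breadth-first search: hops on a minimum hop path, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


def paths_up_to(adj, src, dst, max_hops):
    """Yield every simple path (as a list of edges) with at most max_hops hops."""
    stack = [[src]]
    while stack:
        nodes = stack.pop()
        if nodes[-1] == dst:
            yield list(zip(nodes, nodes[1:]))
        elif len(nodes) - 1 < max_hops:
            for nxt in adj.get(nodes[-1], ()):
                if nxt not in nodes:
                    stack.append(nodes + [nxt])


def upper_bound(path, v_new, capacity, flows):
    """Sequential-transmission bound on the new flow's completion time."""
    path_set = set(path)
    bound = v_new / min(capacity[e] for e in path)
    for remaining, flow_edges in flows:
        common = path_set & set(flow_edges)
        if common:
            bound += remaining / min(capacity[e] for e in common)
    return bound


def bwrh(adj, capacity, flows, src, dst, v_new):
    """Grow the hop limit until the best bound stops improving."""
    h = min_hop_count(adj, src, dst)
    if h is None:
        raise ValueError("destination is unreachable")

    def best(max_hops):
        candidates = ((upper_bound(p, v_new, capacity, flows), p)
                      for p in paths_up_to(adj, src, dst, max_hops))
        return min(candidates, key=lambda item: item[0])

    best_cost, best_path = best(h)
    while True:
        h += 1
        cost, path = best(h)
        if cost >= best_cost:  # no improvement: keep the shorter search result
            return best_path, best_cost
        best_cost, best_path = cost, path


# Hypothetical topology: a slow direct link and two longer alternatives.
adj = {"s": ["t", "a", "b"], "a": ["t"], "b": ["c"], "c": ["t"]}
capacity = {("s", "t"): 0.2, ("s", "a"): 1.0, ("a", "t"): 1.0,
            ("s", "b"): 1.0, ("b", "c"): 1.0, ("c", "t"): 1.0}
flows = [(5.0, [("s", "t")]), (3.0, [("b", "c")])]
print(bwrh(adj, capacity, flows, "s", "t", 2.0))
```

The sketch re-enumerates all paths at every hop limit, which keeps it short but makes the exponential worst case easy to see; the search always stops because no simple path has more hops than there are nodes.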

V BWRHF

The worst-case running time of Algorithm 1 is exponential as it might search all existing paths from the source to the destination of the new flow. To improve on this, we propose a new heuristic that runs in guaranteed polynomial time. We do that by using a new upper bound on the worst-case completion time of a new flow which is based on the following inequality:

$$\frac{1}{\min(a_1, \dots, a_n)} \le \sum_{i=1}^{n} \frac{1}{a_i} \qquad (2)$$


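where $a_i$, $1 \le i \le n$, are positive real numbers. Accordingly, we can arrive at the following upper bound from the one in Eq. 1. That is, given the ongoing flows, we approximate $W_P$ for any given path $P$ with:

$$W_P \approx \sum_{e \in E_P} \frac{\mathcal{V}_{F_{new}} + \sum_{F \in \mathcal{F}_e} \mathcal{V}^r_F}{C_e} \qquad (3)$$

Here $\mathcal{F}_e$ denotes the current flows whose paths go over edge $e$, $\mathcal{V}^r_F$ the remaining volume of flow $F$, $\mathcal{V}_{F_{new}}$ the volume of the new flow, and $C_e$ the capacity of edge $e$.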
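The step from the first bound to the edge-decomposable one can be spelled out as follows. The notation is assumed here for illustration: $E_P$ is the edge set of path $P$, $P_F$ the path of a current flow $F$, $\mathcal{V}^r_F$ its remaining volume, $\mathcal{V}_{F_{new}}$ the volume of the new flow, $C_e$ the capacity of edge $e$, and $\mathcal{F}_e$ the set of current flows crossing $e$. Applying the inequality $1/\min_i a_i \le \sum_i 1/a_i$ to every minimum and then regrouping the terms by edge gives:

```latex
\begin{align*}
W_P
 &\le \sum_{\{F \,\mid\, E_{P_F} \cap E_P \neq \emptyset\}}
      \frac{\mathcal{V}^r_F}{\min_{e \in E_{P_F} \cap E_P} C_e}
      + \frac{\mathcal{V}_{F_{new}}}{\min_{e \in E_P} C_e} \\
 &\le \sum_{\{F \,\mid\, E_{P_F} \cap E_P \neq \emptyset\}}
      \mathcal{V}^r_F \sum_{e \in E_{P_F} \cap E_P} \frac{1}{C_e}
      + \mathcal{V}_{F_{new}} \sum_{e \in E_P} \frac{1}{C_e} \\
 &=   \sum_{e \in E_P}
      \frac{\mathcal{V}_{F_{new}} + \sum_{F \in \mathcal{F}_e} \mathcal{V}^r_F}{C_e}
\end{align*}
```

The last line is a sum of per-edge terms, which is what allows each term to be used as an edge weight in a shortest path computation.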


Algorithm 2 shows how BWRHF finds a path for the new flow $F_{new}$ using this new upper bound. In contrast with the upper bound used in BWRH, the one used in Algorithm 2 is edge-decomposable, i.e., it is a sum of per-edge terms. Therefore, BWRHF can be implemented using any standard shortest path algorithm. This algorithm offers guaranteed polynomial running time, which is a highly desired property for online flow routing.

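Input: $F_{new}$, $G(V, E)$, and $P_F$ and $\mathcal{V}^r_F$ for every current flow $F$
To every edge $e \in E$, assign the weight $\left(\mathcal{V}_{F_{new}} + \sum_{F \in \mathcal{F}_e} \mathcal{V}^r_F\right) / C_e$;
$P_{F_{new}} \leftarrow$ a minimum weight path from $s_{F_{new}}$ to $t_{F_{new}}$ found by a standard shortest path algorithm (e.g., Dijkstra’s algorithm);
Algorithm 2 BWRHF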
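As a rough illustration of the two steps of Algorithm 2, the sketch below assigns each edge a weight equal to the new flow's volume plus the remaining volume already routed over that edge, divided by the edge capacity, and then runs a textbook Dijkstra search. It is not the authors' implementation: the function name, the graph representation, and the sample numbers are hypothetical.

```python
import heapq


def bwrhf(adj, capacity, flows, src, dst, v_new):
    """Pick a minimum weight path using per-edge worst-case weights.

    adj:      dict node -> list of neighbor nodes (directed edges)
    capacity: dict (u, v) -> capacity of that edge
    flows:    list of (remaining_volume, edges_of_flow_path) for current flows
    """
    # Remaining volume of current flows crossing each edge.
    load = {edge: 0.0 for edge in capacity}
    for remaining, flow_edges in flows:
        for edge in flow_edges:
            load[edge] += remaining
    weight = {edge: (v_new + load[edge]) / capacity[edge] for edge in capacity}

    # Standard Dijkstra with a binary heap.
    dist, prev, done = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v in adj.get(u, ()):
            nd = d + weight[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError("destination is unreachable")

    # Walk the predecessor links back from the destination.
    nodes = [dst]
    while nodes[-1] != src:
        nodes.append(prev[nodes[-1]])
    nodes.reverse()
    return list(zip(nodes, nodes[1:])), dist[dst]


# Hypothetical example: the two-hop route is busy, the direct link is slow.
adj = {"s": ["a", "t"], "a": ["t"]}
capacity = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "t"): 0.25}
flows = [(6.0, [("s", "a")])]
print(bwrhf(adj, capacity, flows, "s", "t", 2.0))
```

With these numbers the direct link has weight 2/0.25 = 8, while the two-hop route costs (2+6)/1 + 2/1 = 10, so the direct link is selected.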
Fig. 1: Example used to demonstrate the two upper bounds for $W_P$.
Fig. 2: Comparison of flow completion times and running time for the two heuristics of BWRHF and BWRH over three different traffic patterns (i.e., light-tailed, heavy-tailed, and cache-follower) and three scheduling policies (i.e., FCFS, SRPT, and fair sharing). Panels: (a) average performance gap between BWRHF and BWRH over five WAN topologies; (b) worst-case running times.

VI An Example

Consider the scenario shown in Figure 1. A new flow with a volume of 8 GB has arrived for which we are considering the candidate path . We will compute the two upper bounds on as discussed in the two previous sections. The approximation in Eq. 1 is computed as follows:


The approximation of Eq. 3 can be computed as follows:


The second upper bound is never smaller than the first one, as follows from Eq. 2; the two coincide only in special cases such as a single-edge path.
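As a separate and purely hypothetical illustration (the topology and numbers below are assumptions, not those of Figure 1), consider a candidate path with two edges $e_1$ and $e_2$ of capacities 1 GB/s and 0.5 GB/s, a new flow of 8 GB, a current flow with 4 GB remaining that crosses only $e_1$, and another current flow with 2 GB remaining that crosses both edges. The two bounds then evaluate to:

```latex
\begin{align*}
\text{Eq.~1:}\quad
  & \frac{4}{1} + \frac{2}{\min(1,\,0.5)} + \frac{8}{\min(1,\,0.5)}
    = 4 + 4 + 16 = 24 \text{ seconds} \\
\text{Eq.~3:}\quad
  & \frac{8 + 4 + 2}{1} + \frac{8 + 2}{0.5}
    = 14 + 20 = 34 \text{ seconds}
\end{align*}
```

The edge-decomposable bound is looser because each volume is charged once per edge it crosses on the path rather than once at its bottleneck.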

Fig. 3: Flow completion times reported for 500 flow arrivals given and for various traffic patterns, (a) by scheduling policy given the AT&T North America topology with randomly generated variable link capacities and (b) by network topology given the fair sharing policy based on max-min fairness. The capacity of every link was randomly generated using a uniform distribution from 0.2 to 1, and each scenario was repeated 20 times to consider many capacity distribution possibilities. The results were averaged across the 20 runs, and the error bars show the standard deviation over the runs. The Facebook patterns were inferred from the CDF curves reported in [social_inside].

VII Evaluations

We considered both synthetic flow size distributions of light-tailed (Exponential) and heavy-tailed (Pareto) as well as the Cache-follower flow size distribution reported by Facebook [social_inside]. We also considered Poisson flow arrivals with the rate of . We assumed an average flow size of units with a maximum of 500 units along with a minimum size of 2 units for the heavy-tailed distribution.
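A traffic generator in the spirit of this setup might look like the sketch below. It is only an illustration: the function name, the arrival rate and mean size used in the example call, and the way the Pareto shape is derived from the mean are all assumptions, and truncating sizes shifts the realized mean away from the nominal one.

```python
import random


def generate_flows(num_flows, rate, mean_size, min_size=2.0, max_size=500.0,
                   heavy_tailed=False, seed=0):
    """Return a list of (arrival_time, size) pairs.

    Arrivals follow a Poisson process (exponential inter-arrival times).
    Sizes are Exponential (light-tailed) or Pareto (heavy-tailed) and are
    capped at max_size; min_size is the Pareto scale and is only used in
    the heavy-tailed case. Requires mean_size > min_size when heavy_tailed.
    """
    rng = random.Random(seed)
    now = 0.0
    flows = []
    for _ in range(num_flows):
        now += rng.expovariate(rate)
        if heavy_tailed:
            # Shape chosen so that the untruncated Pareto mean equals mean_size.
            shape = mean_size / (mean_size - min_size)
            size = min_size * rng.paretovariate(shape)
        else:
            size = rng.expovariate(1.0 / mean_size)
        flows.append((now, min(size, max_size)))
    return flows


# Hypothetical parameters: 500 arrivals, rate 1 per time unit, mean size 20 units.
light = generate_flows(500, rate=1.0, mean_size=20.0)
heavy = generate_flows(500, rate=1.0, mean_size=20.0, heavy_tailed=True)
print(len(light), len(heavy))
```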

Topologies: We considered AT&T North America [att] with 25 nodes and 57 edges, Cogent [cogent] with 197 nodes and 243 edges, GScale [b4] with 12 nodes and 19 edges, AGIS [agis] with 25 nodes and 30 edges, and ANS [ans] with 18 nodes and 25 edges. We assumed bidirectional edges with the capacity of every link in each direction randomly generated using a uniform distribution from 0.2 to 1.

Schemes: We considered three schemes besides BWRH and BWRHF. The Inverse Capacity Shortest Path assigns to every link a cost equal to the reciprocal of its capacity and then selects a minimum-cost path from the source to the destination for the new flow. The Min-Max Utilization approach chooses a path that has the minimum value of maximum utilization across all paths going from the source to the destination. This approach has been used extensively in the traffic engineering literature [tvlakshman, tempus]. The Shortest Widest Path selects the path with the minimum number of hops over which the available bandwidth across the whole path is maximum. The available bandwidth across a path is equal to the minimum of available bandwidth across all its edges.
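One common way to realize a shortest widest path search is in two passes: first find the largest achievable bottleneck bandwidth, then take a minimum hop path that uses only edges with at least that bandwidth. The sketch below follows this two-pass idea; it is an assumption about how such a baseline could be coded, not the implementation used in the evaluations, and all names and numbers are hypothetical.

```python
import heapq
from collections import deque


def max_bottleneck(adj, avail, src, dst):
    """Largest bottleneck bandwidth over all src-dst paths, or None."""
    best = {src: float("inf")}
    heap = [(-best[src], src)]
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == dst:
            return width
        if width < best[node]:
            continue  # stale heap entry
        for nxt in adj.get(node, ()):
            cand = min(width, avail[(node, nxt)])
            if cand > best.get(nxt, 0.0):
                best[nxt] = cand
                heapq.heappush(heap, (-cand, nxt))
    return None


def shortest_widest_path(adj, avail, src, dst):
    """Return (list of nodes, bottleneck bandwidth) or (None, 0.0)."""
    width = max_bottleneck(adj, avail, src, dst)
    if width is None:
        return None, 0.0
    # Breadth-first search restricted to edges that are wide enough.
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj.get(node, ()):
            if nxt not in prev and avail[(node, nxt)] >= width:
                prev[nxt] = node
                queue.append(nxt)
    nodes, node = [], dst
    while node is not None:
        nodes.append(node)
        node = prev[node]
    return nodes[::-1], width


# Hypothetical available bandwidths on a small directed graph.
adj = {"s": ["t", "a", "b"], "a": ["t"], "b": ["c"], "c": ["t"]}
avail = {("s", "t"): 0.3, ("s", "a"): 0.8, ("a", "t"): 0.6,
         ("s", "b"): 0.9, ("b", "c"): 0.9, ("c", "t"): 0.6}
print(shortest_widest_path(adj, avail, "s", "t"))
```

Here two routes share the widest bottleneck of 0.6, and the one with fewer hops is returned.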

Implementation: We implemented Algorithm 1 in Java using the JGraphT library. To exhaustively find all paths with at most $h$ hops, we used the class AllDirectedPaths in JGraphT. Algorithm 2 was also implemented in Java using the class DijkstraShortestPath from the JGraphT library. (Footnote 4: We have not made the Java source code publicly available. However, a C++ implementation of BWRHF and the exact BWR algorithm can be found in the following Git repository: https://github.com/noormoha/bwr_routing)

VII-A Comparison of BWRH and BWRHF

In Figure 2, we compare BWRH and BWRHF. We initially compare the mean and tail flow completion times for these two heuristics per topology across various scheduling policies and traffic patterns as shown in Figure 2(a). A positive number shows that BWRH performed better than BWRHF on average. As can be seen, the performance difference is a function of topology. Data points for completion times are averaged across all scenarios per topology. Each scenario was repeated ten times, and for every topology per scenario, the capacity of every link was randomly generated using a uniform distribution from 0.2 to 1. Each sample for computation of the difference was calculated as where is either the tail or mean completion times for a simulation instance. Increasing will increase the average network load. The maximum difference in both mean and tail completion times is measured as less than 15%.

Next, we observe the running time of these two heuristics in Figure 2(b). For the running time, we have reported the worst-case across all runs across all topologies, traffic patterns, and scheduling policies. We can see that the worst-case running time of BWRHF remains within the realm of milliseconds while BWRH shows a worst-case of tens of seconds. Given that the running time of the routing algorithm is added to the completion time of flows and that these two schemes show little performance difference with regards to completion times of flows, we will focus on BWRHF and its further evaluation from here on.

VII-B Effect of Scheduling Policies

In Figure 3(a), we study the effect of various scheduling policies on the performance of BWRHF. We see considerable and consistent gains with regard to tail completion times that range from to across various scheduling policies, traffic patterns, and schemes. We also see significant gains in mean completion times that range from to . Interestingly, we also see that the error bars for BWRHF are smaller than or equal to those of other schemes per metric and per scenario, which shows that it offers a more stable performance gain across various scenarios. (Footnote 5: Please note that link capacities are randomly generated per instance ranging from 0.2 to 1 and uniformly distributed. This allows us to examine the schemes over highly varying scenarios given a single connectivity topology.)

VII-C Effect of Network Topologies

In Figure 3(b), we study the effect of various network topologies on the performance of BWRHF. We see that it can improve the mean flow completion times compared to other schemes across different topologies and traffic patterns by up to . It can also reduce the tail completion times by to . We also see fewer fluctuations in the performance of BWRHF, i.e., smaller error bars compared to other schemes. We also see a similar pattern of completion times across different traffic patterns for a given topology, which shows the significant effect of network topology on performance.

VIII Conclusions and Future Directions

We discussed Best Worst-case Routing (BWR) as an approach to use the remaining volumes of current flows and their paths for improving the completion times of new flows in an online scenario. BWR is a greedy online approach to selecting a path for an incoming flow that optimizes its worst-case completion time. We then noted that this problem is NP-Hard. We also realized that computing the exact worst-case completion time of a new flow on a given path can be computationally intensive due to the interdependence of current flows. We then aimed at developing two heuristics that use two upper bounds on the worst-case completion time for a given path instead of using an exact value. BWRH uses a tighter bound on the worst-case completion times but has an exponential worst-case running time, while BWRHF uses a looser upper bound with a guaranteed polynomial running time by extending standard shortest path algorithms. We performed extensive evaluations given various topologies, traffic patterns, and scheduling policies and found that BWRHF can significantly improve the tail and mean flow completion times. Future directions include the extension of BWRHF to multipath routing for further increasing network throughput and a study of how potentially inaccurate flow size information can affect the performance.