Dedicated inter-datacenter networks connect dozens of geographically dispersed datacenters [2, 3, 4] whose traffic can generally be categorized as either user-generated or internal. User-generated traffic is in the critical path of users’ quality of experience and is generated as a result of direct interaction with users. Internal traffic flows across servers that host applications in the back-end and is a product of staging data and content that will later be used to offer services to users. Compared to user-generated traffic, internal traffic is more resilient to scheduling and routing latency and is usually orders of magnitude larger in volume [2, 5, 3]. Internal data transfers over inter-datacenter networks can potentially take a long time to complete, that is, up to hours.
A prevalent form of internal traffic is the replication of data and content from one datacenter to multiple other datacenters, which accounts for tremendous volumes of traffic [2, 3, 6]. Examples include the distribution of numerous copies of voluminous configuration files, multimedia content served to regional users by CDNs, and search index updates. Such replication generates bulk multicast transfers with a predetermined set of receivers and known transfer volume, which are the focus of this paper.
As bandwidth over dedicated inter-datacenter networks is managed by one organization that also operates the datacenters, it is possible to coordinate data transmissions across the end-points to avoid congestion and optimize network-wide performance metrics such as mean or tail completion times of receivers. We focus on minimizing the mean completion times of receivers while performing concurrent bulk multicast transfers, assuming that receivers of a transfer can complete at different times. Speeding up several receivers per transfer can translate to improved end-user quality of experience and increased availability. For example, faster replication of video content to regional datacenters enhances the average user’s experience in social media applications, and making a newly trained model available at regional datacenters sooner allows speedier access to new application features for millions of users.
Several recent works focus on improving the performance of unicast transfers over dedicated inter-datacenter networks [2, 9, 5, 10, 11]. Performing bulk multicast transfers as many separate unicast transfers can lead to excessive bandwidth usage and increase completion times. Although there exists extensive work on multicasting, it is not possible to apply those solutions to our problem as existing research has focused on different goals and considers different constraints. For example, earlier research in multicasting aims at dynamically building and pruning multicast trees as receivers join or leave, building multicast overlays that reduce control traffic overhead and improve scalability, or choosing multicast trees that satisfy a fixed available bandwidth across all edges as requested by applications [14, 15], minimize congestion within datacenters [16, 17], reduce data recovery costs assuming some recovery nodes, or maximize the throughput of a single multicast flow [19, 20]. To our knowledge, none of the related research efforts aimed at minimizing the mean completion times of receivers for concurrent bulk multicast transfers while considering the overall bandwidth usage, which is the focus of this work.
Motivating Example: Figure 1 shows an example of delivering a large object X from source to destinations which has a volume of 100 units. We have two types of links with capacities of 1 and 10 units of traffic per time unit. We can use a single multicast tree to connect the sender to all receivers which will allow us to transmit at the bottleneck rate of 1 to all receivers. However, one can group receivers into two partitions of and and attach each partition with a separate multicast tree. Then we can select transmission rates so that we minimize the mean completion times. In this case, assigning a rate of 1 to the tree attached to and a rate of 9 to the tree attached to will attain this goal while respecting link capacity over all links (the link attached to is the bottleneck). As another possibility, we could have assigned a rate of 10 to the tree attached to , allowing to finish in 10 units of time, while suspending the tree attached to until time 11. As a result, the tree attached to would have started at 11 allowing to finish at 110. In this paper, we aim to improve the speed of several receivers per bulk multicast transfer without hurting the completion times of the slow receivers. In computing the completion times, we ignore the propagation and queuing latencies as the focus of this paper is on delivering bulk objects for which the transmission time dominates the propagation or queuing latency along the trees.
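The arithmetic behind this example can be sketched in a few lines. The split of receivers used here (one receiver behind the capacity-1 link and two receivers reachable over capacity-10 links) is an assumption made for illustration; the names `SLOW`, `FAST`, and `mean_completion_time` are hypothetical and not part of QuickCast.

```python
def mean_completion_time(finish_times):
    """Average finish time over all receivers."""
    return sum(finish_times) / len(finish_times)

VOLUME = 100.0  # size of the object in the example

# Assumed split (not taken from the figure): one slow receiver, two fast ones.
SLOW, FAST = 1, 2

# Single tree: every receiver is held to the bottleneck rate of 1.
single_tree = [VOLUME / 1.0] * (SLOW + FAST)

# Two concurrent trees at rates 1 (slow partition) and 9 (fast partition).
concurrent = [VOLUME / 1.0] * SLOW + [VOLUME / 9.0] * FAST

# Fast tree first at rate 10, slow tree suspended until the fast one is done.
fast_done = VOLUME / 10.0
sequential = [fast_done + VOLUME / 1.0] * SLOW + [fast_done] * FAST

for name, times in [("single", single_tree),
                    ("concurrent", concurrent),
                    ("sequential", sequential)]:
    print(name, round(mean_completion_time(times), 2), "slowest:", max(times))
```

Under these assumptions the mean completion time is 100 for the single tree, about 40.7 for the two concurrent trees, and about 43.3 for the sequential schedule, while only the sequential schedule delays the slow receiver (110 instead of 100).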
We break the bulk multicast transfer routing and scheduling problem, with the objective of minimizing mean completion times of receivers, into three sub-problems: receiver set partitioning, multicast forwarding tree selection per receiver partition, and rate allocation per forwarding tree. We propose QuickCast, which offers an elegant solution to each one of these three sub-problems. (Footnote 1: Compared to the earlier version of QuickCast, we have extended it by considering actual WAN topologies with non-uniform link capacity and by eliminating the constraint on the number of partitions. We have performed additional empirical evaluations on multiple tree selection techniques and several rate allocation policies.)
Receiver Set Partitioning: As different receivers can have different completion times, a natural way to improve completion times is to partition receivers into multiple sets with each receiver set having a separate tree. This reduces the effect of slow receivers on faster ones. We employ a partitioning technique that groups receivers of every bulk multicast transfer into multiple partitions according to their mutual distance (in hops) on the inter-datacenter graph. With this approach, the partitioning of receivers into any given number of partitions consumes minimal additional bandwidth on average. We also offer a configuration parameter called the partitioning factor that is used to decide on the number of partitions that balances improvements in receiver completion times against the total bandwidth consumption.
Forwarding Tree Selection: To avoid heavily loaded routes, multicast trees should be chosen dynamically per partition according to the receivers in that partition and the distribution of traffic load across network edges. We utilize a computationally efficient approach for forwarding tree selection that connects a sender to a partition of its receivers by assigning weights to edges of the inter-datacenter graph, and using a minimum weight Steiner tree heuristic. We define a weight assignment according to the traffic load scheduled on edges and their capacity and empirically show that this weight assignment offers improved receiver completion times at minimal bandwidth consumption.
Rate Allocation: Given the receiver partitions and their forwarding trees, formulating the rate allocation for minimizing mean completion times of receivers leads to a hard problem. We consider the popular scheduling policies of fair sharing, Shortest Remaining Processing Time (SRPT), and First Come First Serve (FCFS). We explain why fair sharing is preferred over policies that strictly prioritize transfers (e.g., SRPT and FCFS) for network throughput maximization when focusing on bulk multicast transfers, especially ones with many receivers per transfer. We empirically show that using max-min fairness, which is a form of fair sharing, we can considerably improve the average network throughput, which in turn reduces receiver completion times. In QuickCast, we apply max-min fairness for rate allocation across the multicast forwarding trees.
QuickCast assumes a logically centralized setting, communicates with the end-points that transmit traffic for rate limiting, and with the inter-datacenter network elements that perform traffic forwarding for managing multicast forwarding trees. We evaluate QuickCast against a variety of synthetic and real traffic patterns as well as real WAN topologies. Compared to performing bulk multicast transfers as separate unicast transfers, QuickCast achieves up to reduction in mean receiver completion times while at the same time using the bandwidth. Also, QuickCast allows the top 50% of receivers to complete between to faster on average compared with when a single forwarding multicast tree is used for data delivery. We also show that on a real WAN topology, fair sharing offers up to higher throughput with 16 receivers per bulk multicast transfer compared to other scheduling policies, i.e., SRPT and FCFS.
II Related Work
Internet Multicasting: A large body of general multicasting approaches has been proposed where receivers can join multicast groups anytime to receive required data, and multicast trees are incrementally built and pruned as nodes join or leave a multicast session, such as IP multicasting, TCP-SMO, and NORM. These solutions focus on building and maintaining multicast trees, and do not consider link capacity and other ongoing multicast flows while building the trees.
Multicast Traffic Engineering: One related work considers the online arrival of multicast requests with a specified bandwidth requirement. The authors provide an elegant solution to find a minimum weight Steiner tree for an arriving request with all edges having the requested available bandwidth. This work assumes a fixed transmission rate per multicast tree, dynamic multicast receivers, and unknown termination time for multicast sessions, whereas we consider variable transmission rates over timeslots, fixed multicast receivers, and deem a multicast tree completed when all its receivers download a specific volume of data. MTRSA considers a similar problem, but in an offline scenario where all multicast requests are known beforehand, while taking into account the number of available forwarding rules per switch. MPMC [19, 20] maximizes the throughput for a single multicast transfer by using multiple parallel multicast trees and coding techniques. None of these works aims to minimize the completion times of receivers while considering the total bandwidth consumption.
Datacenter Multicasting: A variety of solutions have been proposed for minimizing congestion across the intra-datacenter network by selecting multicast trees according to link utilization. Datacast sends data over edge-disjoint Steiner trees found by pruning spanning trees over various topologies of FatTree, BCube, and Torus. AvRA focuses on tree and FatTree topologies and builds minimum edge Steiner trees that connect the sender to all receivers as they join. MCTCP reactively schedules flows according to link utilization. These works do not aim at minimizing the completion times of receivers and ignore the total bandwidth consumption.
Overlay Multicasting: With overlay networks, end-hosts can form a multicast forwarding tree in the application layer. RDCM populates backup overlay networks as nodes join and transmits lost packets in a peer-to-peer fashion over them. NICE creates hierarchical clusters of multicast peers and aims to minimize control traffic overhead. AMMO allows applications to specify performance constraints for selection of multi-metric overlay trees. DC2 is a hierarchy-aware group communication technique to minimize cross-hierarchy communication. SplitStream builds forests of multicast trees to distribute load across many machines. BDS generates an application-level multicast overlay network, creates chunks of data, and transmits them in parallel over bottleneck-disjoint overlay paths to the receivers. Due to limited knowledge of underlying physical network topology and condition (e.g., utilization, congestion or even failures), and limited or no control over how the underlying network routes traffic, overlay routing has limited capability in managing the total bandwidth usage and distribution of traffic to minimize completion times of receivers. In case such control and information are provided, for example by using a cross-layer approach, overlay multicasting can be used to realize solutions such as QuickCast.
Reliable Multicasting: Various techniques have been proposed to make multicasting reliable, including the use of coding and receiver (negative or positive) acknowledgments. Experiments have shown that using positive ACKs does not lead to ACK implosion for medium scale (sub-thousand) receiver groups. TCP-XM allows reliable delivery by using a combination of IP multicast and unicast for data delivery and re-transmissions. MCTCP applies standard TCP mechanisms for reliability. Another approach is for receivers to send NAKs upon expiration of some inactivity timer. NAK suppression, which can be applied by routers, has been proposed to address implosion. Forward Error Correction (FEC) has been used to reduce re-transmissions and improve the completion times, examples of which include Raptor Codes and Tornado Codes. These techniques can be applied complementary to QuickCast.
Multicast Congestion Control: Existing approaches track the slowest receiver. PGMCC, MCTCP, and TCP-SMO use window-based TCP-like congestion control to compete fairly with other flows. NORM uses an equation-based rate control scheme. With rate allocation and end-host based rate limiting applied over inter-datacenter networks, the need for distributed congestion control becomes minimal; however, such techniques can still be used as a backup.
Other Related Work: CastFlow precalculates multicast spanning trees which can then be used at request arrival time for fast rule installation. ODPA presents algorithms for dynamic adjustment of multicast spanning trees according to specific metrics. BIER has been recently proposed to improve the scalability and allow frequent dynamic manipulation of multicast forwarding state in the network and can be applied complementary to our solutions in this paper. Peer-to-peer approaches [38, 39, 40] aim to maximize throughput per receiver without considering physical network topology, link capacity, or total network bandwidth consumption. Store-and-Forward (SnF) approaches [41, 42, 43, 44] focus on minimizing transit bandwidth costs, which does not apply to dedicated inter-datacenter networks. However, SnF can still be used to improve overall network utilization in the presence of diurnal link utilization patterns, transient bottleneck links, or for application layer multicasting. BDS uses many parallel overlay paths from a multicast source to its destinations, storing and forwarding data from one destination to the next. Application of SnF for bulk multicast transfers considering the physical topology is complementary to our work in this paper and is a future direction. Recent research [46, 47, 48] also considers bulk multicast transfers with deadlines with the objective of maximizing the number of transfers completed before the deadlines. We consider reducing completion times to be a more general objective: for most applications, completing transfers is valuable and required even when it is not possible to meet all the deadlines.
III Problem Statement and Challenges
We consider a scenario where bulk multicast transfers arrive at the inter-datacenter network in an online fashion. Every bulk multicast transfer is specified with a source , set of destinations , and volume in bytes (unicast and broadcast can be considered as special cases with one receiver or all other nodes as receivers). In general, no form of synchronization is required across receivers of a bulk multicast transfer and therefore, receivers are allowed to complete at different times as long as they all receive the multicast object in whole. Incoming requests are processed as they come by a traffic engineering server that manages the forwarding state of the whole network in a logically centralized manner for installation and eviction of multicast trees. Upon arrival of a request, this server decides on the number of partitions and receivers that are grouped per partition and a multicast tree per partition.
We consider a slotted timeline with a timeslot duration of . Periodically, the traffic engineering server computes the transmission rates for all multicast trees at the beginning of every timeslot and dispatches them to senders for rate limiting. This allows for a congestion free network since the rates are computed according to link capacity constraints and other ongoing transfers. To minimize control plane overhead, partitions and forwarding trees are fixed once they are established for an incoming transfer. In this context, the bulk multicast transfer routing and scheduling problem can be formally stated as follows. A summary of our notations is present in Table I.
|The current timeslot|
|A directed edge|
|Edge ’s bandwidth utilization|
|A directed inter-datacenter network graph|
|A directed Steiner tree|
|Duration of a timeslot|
|A bulk multicast transfer request|
|Source datacenter of|
|Set of destinations of request|
|Original volume of|
|Set of partitions of request|
|Set of partitions of all transfers in the system|
|Forwarding tree of some partition|
|The transmission rate over forwarding tree of some partition at timeslot|
|Residual volume of some partition . Therefore, where|
|Edge ’s total traffic load at time , i.e., total outstanding bytes scaled by ’s inverse capacity|
|A configuration parameter that determines a partitioning cost threshold|
|A configuration parameter that determines the maximum number of partitions per transfer|
Problem Statement: Given an inter-datacenter network with the edge capacity and the set of all partitions , for a newly arriving bulk multicast transfer , the traffic engineering server needs to compute a set of receiver partitions each with one or more receivers, and select a forwarding tree for every partition . In addition, per timeslot , the traffic engineering server needs to compute the rates . The objective is to minimize the average time for a receiver to complete data reception while keeping the total bandwidth consumption below a certain threshold compared to the minimum possible, i.e., a minimum edge Steiner tree per transfer.
Challenges: Both the number of ways to partition receivers into subsets and the number of candidate forwarding trees per subset grow exponentially with the problem size. It is, in general, not clear how partitioning and selection of forwarding trees correlate with both receiver completion times and total bandwidth usage. Even the simple objective of minimizing the total bandwidth usage is a hard problem. Also, assuming known forwarding trees, selecting transmission rates per timeslot per tree for minimization of mean receiver completion times is a hard problem. Finally, this is an online problem with unknown future arrivals which adds to the complexity.
As stated earlier, we need to address the three sub-problems of receiver set partitioning, tree selection, and rate allocation. Since the partitioning sub-problem uses the tree selection sub-problem, we first discuss tree selection in the following. As the last problem, we will address rate allocation. Since the total bandwidth usage is a function of transfer properties (number of receivers, transfer volume, and the location of sender and receivers) and the network topology, it is highly sophisticated to design a solution that guarantees a limit on the total bandwidth usage. Instead, we aim to reduce the completion times while minimally increasing bandwidth usage.
IV-A Forwarding Tree Selection
The tree selection problem asks how, given a network topology with known link capacities, to choose a Steiner tree that connects a sender to all of its receivers. The objective is to minimize the completion time of receivers (all receivers on a tree complete at the same time) while minimally increasing the total bandwidth usage. Since the total bandwidth usage is directly proportional to the number of edges on selected trees, we would want to keep trees as small as possible. Reduction in completion times can be achieved by avoiding edges that have a large outstanding traffic load. In general, this would mean selecting potentially larger trees to go around such edges, if necessary. This effect can be accounted for by assigning proper weights to the edges of the inter-datacenter graph and choosing a minimum weight Steiner tree that connects the sender to a partition of receivers for some bulk multicast transfer. Finding a minimum weight Steiner tree is a hard problem for which many heuristics exist.
Weight Assignment: Since we focus on reducing the receiver completion times for bulk transfers where transmission time could be orders of magnitude larger than propagation or queuing delay, conventional routing metrics such as end-to-end latency are not effective. Also, we realized that instantaneous link utilization, which has been extensively used for traffic engineering over WAN, lacks stability over longer time periods which makes it hard to infer how it will change in the near future. Therefore, we will use a new metric we refer to as link load that is defined in Table I and can be computed as follows:
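One way to write this definition down is sketched here. The symbols are assumptions chosen for illustration and may differ from those in Table I: $L_e$ is the load of edge $e$, $C_e$ its capacity, $\mathbf{P}$ the set of partitions of all transfers in the system, $T_P$ the forwarding tree of partition $P$, and $V^{r}_{P}$ its residual volume.

```latex
% Link load: outstanding volume scheduled to cross edge e, scaled by 1 / capacity.
L_e \;=\; \frac{1}{C_e} \sum_{\substack{P \in \mathbf{P} \\ e \in T_P}} V^{r}_{P}
```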
A link’s load is the total outstanding volume of traffic allocated on that link (that we know will cross over that link in the future) divided by its capacity. We can compute this value since we know the volume of incoming transfers and the edges that they will be using. A link’s load is a measure of how busy it is expected to be shortly. It increases as new transfers are scheduled on a link, and diminishes as traffic flows through it. To select a forwarding tree from a source to a set of receivers, we use an edge weight equal to the link’s load plus the new request’s volume divided by the link’s capacity, and select a minimum weight Steiner tree. The selected tree will most likely exclude any links that are expected to be highly busy. Addition of the second element in the weight (new request’s volume divided by capacity) helps select smaller trees in case there is not much load on most edges.
Algorithm 1 applies the weight assignment approach mentioned above to select a forwarding tree that balances the traffic load across available trees and finds a minimum weight Steiner tree using the GreedyFLAC heuristic. In §V, we explore a variety of weights for forwarding tree selection as shown in Table IV and see that this weight assignment provides consistently close to minimum values for the three performance metrics of mean and tail receiver completion times as well as total bandwidth usage.
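The weight assignment and tree selection steps can be sketched as follows. This is a minimal illustration, not Algorithm 1 itself: in place of GreedyFLAC it uses a simpler, swapped-in heuristic that repeatedly attaches the closest unconnected receiver to the partial tree via a shortest path. The function names, the dictionary-based graph representation, and the toy topology are all assumptions.

```python
import heapq

def edge_weights(capacity, load, volume):
    """Weight of edge e = load(e) + volume / capacity(e), as described in the text."""
    return {e: load.get(e, 0.0) + volume / c for e, c in capacity.items()}

def select_tree(capacity, load, source, receivers, volume):
    """Greedy shortest-path Steiner heuristic (a stand-in for GreedyFLAC)."""
    adj = {}
    for (u, v), w in edge_weights(capacity, load, volume).items():
        adj.setdefault(u, []).append((v, w))
    tree_nodes, tree_edges = {source}, set()
    remaining = set(receivers) - {source}
    while remaining:
        # Multi-source Dijkstra started from every node already on the tree.
        dist = {n: 0.0 for n in tree_nodes}
        prev, target = {}, None
        heap = [(0.0, n) for n in tree_nodes]
        heapq.heapify(heap)
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist.get(u, float("inf")):
                continue  # stale heap entry
            if u in remaining:
                target = u  # closest receiver not yet connected
                break
            for v, w in adj.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        if target is None:
            raise ValueError("a receiver is unreachable from the source")
        remaining.discard(target)
        node = target
        while node not in tree_nodes:  # walk back and graft the path onto the tree
            tree_edges.add((prev[node], node))
            tree_nodes.add(node)
            node = prev[node]
    return tree_edges

# Toy directed topology (hypothetical): a 2-hop and a 3-hop route from s to r.
capacity = {("s", "a"): 1.0, ("a", "r"): 1.0,
            ("s", "b"): 1.0, ("b", "c"): 1.0, ("c", "r"): 1.0}
idle_tree = select_tree(capacity, {}, "s", {"r"}, volume=1.0)
busy_tree = select_tree(capacity, {("s", "a"): 5.0}, "s", {"r"}, volume=1.0)
```

With no outstanding load the shorter route is selected, because the volume-over-capacity term makes every extra edge cost something; once edge (s, a) carries a load of 5, the longer route becomes the lighter one and the tree goes around the busy edge.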
IV-B Receiver Set Partitioning
The maximum transmission rate on a tree is that of the link with minimum capacity. To improve bandwidth utilization of the inter-datacenter backbone, we can replace a large forwarding tree with multiple smaller trees, each connecting the source to a subset of receivers. By partitioning, we isolate some receivers from the bottlenecks, allowing them to receive data at a higher rate. We aim to find a set of partitions, each with at least one receiver, that allows for reducing the average receiver completion times while minimally increasing the bandwidth usage. Bottlenecks may appear either due to competing transfers or differences in link capacity. In the former case, some edges may be shared by multiple trees, which leads to lower available bandwidth per tree. Such conditions may arise more frequently under heavy load. In the latter case, differences in link capacity can increase completion times, especially in large networks and with many receivers per transfer.
Receiver set partitioning to minimize the impact of bottlenecks and reduce completion times is a sophisticated open problem. It is best if partitions are selected in a way that no additional bottlenecks are created. Also, increasing the number of partitions may in general increase bandwidth consumption (multiple smaller trees may have more edges in total compared to one large tree). Therefore, we need to come up with the right number of partitions and receivers that are grouped per partition. We propose a partitioning approach, called the hierarchical partitioning, that is computationally efficient and uses a partitioning factor to decide on the number of partitions and receivers that are grouped in those partitions.
Number of Partitions: Transfers may have a highly varying number of receivers. Generally, the number of partitions should be computed based on the number of receivers, where they are located in the network, and the network topology. Also, using more partitions can lead to the creation of unnecessary bottlenecks due to shared links. We compute the number of partitions per transfer according to the total traffic load on network edges and considering a threshold that limits the cost of additional bandwidth consumption.
Limitations of Partitioning: Partitioning, in general, cannot improve tail completion times of transfers as tail is usually driven by physical resource constraints, i.e., low capacity links or links with high contention.
Hierarchical Partitioning: We group receivers into partitions according to their mutual distance, which is defined as the number of hops on the shortest hop path that connects any two receivers. Hierarchical clustering approaches such as agglomerative clustering can be used to compute the groups by initially assuming that every receiver has its own partition and then merging the two closest partitions at each step, which generates a hierarchy of partitioning solutions. Each layer of the hierarchy then gives us one possible solution with a given number of partitions.
With this approach, the partitioning of receivers into any given number of partitions consumes minimal additional bandwidth on average compared to any other partitioning with the same number of partitions. That is because assigning a receiver to any other partition will likely increase the total number of edges needed to connect the source to all receivers; otherwise, that receiver would not have been grouped with the other receivers in its current partition in the first place. There is, however, no guarantee since hierarchical clustering works based on a greedy heuristic.
After building a partitioning hierarchy, the algorithm selects the layer with the maximum number of partitions whose total sum of tree weights stays below a threshold that can be configured as a system parameter. Choosing the maximum number of partitions allows us to minimize the effect of slow receivers given the threshold, which is a multiple of the weight of a single tree that would connect the sender to all receivers and can be looked at as a bandwidth budget. We call the multiplication coefficient the partitioning factor. Algorithm 2 shows this process in detail. The partitioning factor plays a vital role in the operation of QuickCast as it determines the extra cost we are willing to pay in bandwidth for improved completion times. In general, a partitioning factor greater than one but close to it should allow partitioning to separate very slow receivers from several other nodes. A partitioning factor that is considerably larger than one may generate too many partitions, potentially creating many shared links that reduce throughput and additional edges that increase bandwidth usage. If the partitioning factor is less than one, a single partition will be used.
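The hierarchy construction and layer selection described above can be sketched as follows. This is an illustrative outline rather than Algorithm 2: the clustering is a naive average-linkage implementation, `tree_weight` is a hypothetical callable standing in for the weight of the forwarding tree that the tree selection step would return for a partition, and `pf` denotes the partitioning factor.

```python
from collections import deque
from itertools import combinations

def hop_distances(adj, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def partition_hierarchy(adj, receivers):
    """Agglomerative clustering; layers[k] holds len(receivers) - k partitions."""
    dist = {r: hop_distances(adj, r) for r in receivers}

    def linkage(a, b):  # average pairwise hop distance between two groups
        return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))

    layer = [frozenset([r]) for r in sorted(receivers)]
    layers = [layer]
    while len(layer) > 1:
        a, b = min(combinations(layer, 2), key=lambda pair: linkage(*pair))
        layer = [p for p in layer if p not in (a, b)] + [a | b]
        layers.append(layer)
    return layers

def choose_partitions(layers, tree_weight, pf):
    """Pick the layer with the most partitions that fits the bandwidth budget."""
    budget = pf * tree_weight(layers[-1][0])  # pf times the single-tree weight
    for layer in layers:  # layers are ordered from most to fewest partitions
        if sum(tree_weight(p) for p in layer) <= budget:
            return layer
    return layers[-1]  # budget too small (e.g., pf < 1): use a single partition

# Hypothetical path topology r1 - r2 - x - y - z - r3 with three receivers.
adj = {"r1": ["r2"], "r2": ["r1", "x"], "x": ["r2", "y"],
       "y": ["x", "z"], "z": ["y", "r3"], "r3": ["z"]}
layers = partition_hierarchy(adj, ["r1", "r2", "r3"])
```

In this toy topology the two adjacent receivers are merged first, so the two-partition layer separates the distant receiver from the nearby pair. Which layer is finally used depends on the tree weights and on the budget set by the partitioning factor.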
Worst-case Complexity: Algorithm 2 performs multiple calls to the GreedyFLAC algorithm. It also uses the hierarchical clustering with average linkage which has a worst-case complexity of . To compute the pair-wise distances of receivers, we can use breadth first search which has a complexity of . Worst-case complexity of Algorithm 2 is .
IV-C Rate Allocation
To compute the transmission rates per tree per timeslot, one can formulate an optimization problem with the capacity and demand constraints, and consider minimizing the mean receiver completion times as the objective. This is, however, a hard problem and can be modeled using mixed-integer programming by assuming a binary variable per timeslot per tree that shows whether that tree has completed by that timeslot. One can come up with approximation algorithms to this problem which is considered part of the future work.
In this paper, we consider the three popular scheduling policies of FCFS, SRPT, and fair sharing according to max-min fairness, which have been extensively used for network scheduling. These policies can be applied independently of partitioning and forwarding tree selection techniques. Each one of these three policies has its unique features. FCFS and SRPT both prioritize transfers, the former according to arrival times and the latter according to transfer volumes, and so obtain a meager fairness score. SRPT has been extensively used for minimizing flow completion times within datacenters [52, 53, 54]. Strictly prioritizing transfers over forwarding trees (as done by SRPT and FCFS), however, can lead to low overall link utilization and increased completion times, especially when trees are large. This might happen due to bandwidth contention on shared edges, which can prevent some transfers from making progress. Fair sharing mitigates such contention by allowing all concurrent multicast transfers to make progress. In §V-C, we empirically compare the performance of these scheduling policies and show that fair sharing based on max-min fairness can significantly outperform both FCFS and SRPT in average network throughput, especially with a larger number of receivers per tree. As a result, we will use QuickCast along with the fair sharing policy based on max-min fairness.
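Max-min fair rates over a set of forwarding trees can be computed by progressive filling, sketched below under stated assumptions: each tree consumes its rate once on every edge it uses, every tree has at least one edge, and the optional `demand` argument caps a tree's rate (for instance at its residual volume divided by the timeslot duration). The function and argument names are hypothetical, and this is not necessarily how QuickCast implements the policy.

```python
EPS = 1e-9  # relative tolerance for "saturated" and "satisfied" checks

def max_min_rates(trees, capacity, demand=None):
    """trees: {tree_id: set of edges}; capacity: {edge: capacity}."""
    demand = dict(demand or {})
    rates = {t: 0.0 for t in trees}
    residual = dict(capacity)
    active = set(trees)
    while active:
        # Number of still-active trees crossing each edge.
        users = {}
        for t in active:
            for e in trees[t]:
                users[e] = users.get(e, 0) + 1
        # Largest equal increment that no edge and no demand cap forbids.
        inc = min(residual[e] / n for e, n in users.items())
        inc = min([inc] + [demand[t] - rates[t] for t in active if t in demand])
        for t in active:
            rates[t] += inc
            for e in trees[t]:
                residual[e] -= inc
        # Freeze trees that hit a saturated edge or reached their demand.
        frozen = {t for t in active
                  if any(residual[e] <= EPS * capacity[e] for e in trees[t])
                  or (t in demand and demand[t] - rates[t] <= EPS * max(demand[t], 1.0))}
        active -= frozen
    return rates

# Toy instance (hypothetical): A and C share the slow edge e1, A and B share e2.
trees = {"A": {"e1", "e2"}, "B": {"e2"}, "C": {"e1"}}
capacity = {"e1": 1.0, "e2": 10.0}
rates = max_min_rates(trees, capacity)
capped = max_min_rates(trees, capacity, demand={"B": 2.0})
```

In the toy instance, A and C split the slow edge equally at 0.5 each, and B then absorbs the capacity left on the fast edge (9.5), so no transfer is starved the way a strict priority order could starve it. With a demand cap of 2.0, B stops at its cap.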
The traffic engineering server periodically computes the transmission rates per multicast tree every timeslot to maximize utilization and cope with inaccurate inter-datacenter link capacity measurements, imprecise rate limiting, and dropped packets due to corruption. To account for inaccurate rate limiting, dropped packets and link capacity estimation errors, which all can lead to a difference between the actual volume of data delivered and the number of bytes transmitted, we propose that senders keep track of actual data delivered to their receivers per forwarding tree. At the end of every timeslot, every sender reports to the traffic engineering server how much data it was able to deliver allowing it to compute rates accordingly for the timeslot that follows. Newly arriving transfers will be assigned rates starting the next timeslot.
We considered various topologies and transfer size distributions as shown in Tables II and III. Also, for Algorithm 2, unless otherwise stated, we used which limits the overall bandwidth usage while offering significant gains. In the following subsections, we first evaluated a variety of weight assignments for multicast tree selection considering receiver completion times and bandwidth usage. We showed that the weight proposed in Algorithm 1 offers close to minimum completion times with minimal extra bandwidth consumption. Next, we evaluated the proposed partitioning technique and considered two cases of (as used in ) and . We measured the performance of QuickCast while varying the number of receivers and showed that it offers consistent gains. We also measured the speedup observed by different receivers ranked by their speed per multicast transfer, and the effect of partitioning factor on the gains in completion times as well as bandwidth usage. In addition, we evaluated the effect of different scheduling policies on average network throughput and showed that with increasing number of multicast receivers, fair sharing offers higher throughput compared to both FCFS and SRPT. Finally, we showed that QuickCast is computationally fast by measuring its running time and that the maximum number of group table forwarding entries it uses across all switches is only a fraction of what is usually available in a physical switch across the several considered scenarios.
Network Topologies: Table II shows the list of topologies we considered. These topologies provide capacity information for all links, which ranges from 45 Mbps to 10 Gbps. We normalized all link capacities by dividing them by the maximum link capacity. We also assumed that all links are bidirectional with equal capacity in either direction.
|ANS ||A medium-sized backbone and transit network that spans across the United States with nodes and links. All links have equal capacity of 45 Mbps.|
|GEANT ||A large-sized backbone and transit network that spans Europe with nodes and links. Link capacity ranges from 45 Mbps to 10 Gbps.|
|UNINETT ||A large-sized backbone that spans across Norway with nodes and links. Most links have a capacity of 1, 2.5 or 10 Gbps.|
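The capacity preprocessing described for these topologies (normalizing by the maximum link capacity and treating every link as bidirectional with equal capacity) can be sketched as below. The function name and the three-node example are hypothetical; only the two capacity values, 45 Mbps and 10 Gbps, come from the text.

```python
def normalize_capacities(links):
    """Normalize link capacities by the maximum and mirror each link.

    links: dict (u, v) -> capacity, one entry per undirected link
    Returns a dict with one entry per direction and values in (0, 1].
    """
    max_capacity = max(links.values())
    normalized = {}
    for (u, v), capacity in links.items():
        normalized[(u, v)] = capacity / max_capacity
        normalized[(v, u)] = capacity / max_capacity  # same capacity in reverse
    return normalized


# Hypothetical three-node example with capacities in Mbps.
links_mbps = {("a", "b"): 45.0, ("b", "c"): 10000.0}
capacity = normalize_capacities(links_mbps)
```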
Traffic Patterns: Table III shows the considered distributions for transfer volumes. Transfer arrival followed a Poisson distribution with rate. We considered no units for time or bandwidth. For all simulations, we assumed a timeslot length of . For Pareto distribution, we considered a minimum transfer volume equal to that of full timeslots and limited maximum transfer volume to that of full timeslots. Unless otherwise stated, we considered an average demand equal to volume of full timeslots per transfer for all traffic distributions (we fixed the mean values of all distributions to the same value). Per simulation instance, we assumed an equal number of transfers per sender, and for every transfer, we selected the receivers from all existing nodes according to the uniform distribution (with equal probability from all nodes).
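A workload of this shape could be generated roughly as follows. This is a sketch under stated assumptions, not the authors' Java simulator: the function names are hypothetical, every numeric parameter in the usage lines is a placeholder (the actual rate, timeslot length, and volume bounds are not reproduced here), Poisson arrivals are produced through exponential inter-arrival times, and the sender is excluded from the receiver candidates.

```python
import random


def poisson_arrival_times(rate, horizon, rng):
    """Arrival times of a Poisson process with the given rate on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)  # exponential gaps give Poisson arrivals
    return times


def light_tailed_volume(mean, rng):
    """Exponentially distributed transfer volume with the given mean."""
    return rng.expovariate(1.0 / mean)


def heavy_tailed_volume(minimum, maximum, shape, rng):
    """Pareto volume that starts at `minimum` and is capped at `maximum`."""
    return min(maximum, minimum * rng.paretovariate(shape))


def pick_receivers(nodes, sender, count, rng):
    """Choose `count` distinct receivers uniformly at random, excluding the sender."""
    candidates = [n for n in nodes if n != sender]
    return rng.sample(candidates, count)


# Placeholder parameters chosen only for illustration.
rng = random.Random(1)
arrivals = poisson_arrival_times(rate=1.0, horizon=100.0, rng=rng)
volumes = [heavy_tailed_volume(2.0, 2000.0, 1.1, rng) for _ in arrivals]
receivers = pick_receivers(list(range(10)), sender=0, count=4, rng=rng)
```

Capping the Pareto samples lowers their mean, so matching the mean of the light-tailed case would require tuning the shape parameter; that calibration is omitted here.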
Assumptions: We focused on computing gains and assumed accurate knowledge of inter-datacenter link capacity, and precise rate control at the end-points which together lead to a congestion free network. We also assumed no dropped packets due to corruption or errors, and no link failures.
Simulation Setup: We developed a simulator in Java (JDK 8). We performed all simulations on one machine (Core i7-6700 and 24 GB of RAM). We used the Java implementation of GreedyFLAC  for minimum weight Steiner trees.
|Light-tailed||Based on Exponential distribution.|
|Heavy-tailed||Based on Pareto distribution.|
|Facebook Cache-Follower||Generated by cache applications over Facebook inter-datacenter WAN .|
|Facebook Hadoop||Generated by geo-distributed analytics over Facebook inter-datacenter WAN .|
V-A Weight Assignment Techniques for Tree Selection
We empirically evaluated and analyzed several weights for the selection of forwarding trees. Table IV lists the weight assignment approaches considered for tree selection (please see Table I for the definition of variables). We considered three edge weight metrics: utilization (i.e., the fraction of a link’s bandwidth currently in use), load (i.e., the total volume of traffic that an edge will carry from the current time onward), and load plus the volume of the newly arriving transfer request. We also considered the weight of a tree to be either the weight of its edge with maximum weight or the sum of the weights of its edges. An exponential weight is used to approximate selection of trees with minimum highest weight, similar to the approach used in . The benefit of weight #6 over #5 is that when there is no load or minimal load on some edges, selecting the minimum weight tree leads to minimum edge trees that reduce bandwidth usage. Also, with this approach, we tend to avoid large trees for large transfers, which helps further reduce bandwidth usage.
|#||Properties of Selected Trees|
|1||A fixed minimum edge Steiner tree|
|2||Minimum highest utilization over edges|
|3||Minimum highest load over edges|
|4||Minimum sum of utilization over edges|
|5||Minimum sum of load over edges|
|6||Minimum final sum of load over edges|
|7||Minimum edges, min-max utilization|
|8||Minimum edges, min-max load|
|9||Minimum edges, min-sum of utilization|
|10||Minimum edges, min-sum of load|
[Figure 2 panels: Mean Receiver Completion Times; Tail Receiver Completion Times; Total Bandwidth Used. Legend: bins of distance from the minimum (10-, 20-, 30-, 40-, 50-, 50+).]
Figure 2 shows our simulation results of receiver completion times for bulk multicast transfers with receivers for a fixed arrival rate of . We considered both light-tailed and heavy-tailed transfer volume distributions. Techniques #1, #7, #8, #9 and #10 all use minimum edge Steiner trees and so offer minimum bandwidth usage. However, this comes at the cost of increased completion times, especially when edges have non-homogeneous capacity. Techniques #2 and #4 use utilization as the criterion for load balancing. Minimizing maximum link utilization has long been a popular objective for traffic engineering over WAN. As can be seen, they have the highest bandwidth usage compared to other techniques (up to above the minimum) for almost all scenarios while their completion times are at least worse than the minimum for several scenarios. Techniques #3, #5, and #6 operate based on link load (i.e., total outstanding volume of traffic per edge), among which technique #3 (minimizing maximum load) has the highest variation between best and worst case performance (up to worse than the minimum in mean completion times). Techniques #5 and #6 (minimizing the sum of load excluding and including the new multicast request, respectively), on the other hand, offer consistently good performance that is up to above the minimum (for all performance metrics) across all scheduling policies, topologies, and traffic patterns. These techniques offer lower completion times for the GEANT topology with non-uniform link capacity. Technique #6 also provides slightly better bandwidth usage and better completion times compared to #5 for the majority of scenarios (not shown). Our proposals rely on technique #6 for selection of load-aware forwarding trees, as shown in Algorithm 1.
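The difference between weights #5 and #6 can be illustrated with a small sketch. It is an assumption-laden simplification: instead of running a minimum weight Steiner tree solver, it picks the best tree from a hand-written list of candidates, and the function names, node labels, and load values are all hypothetical.

```python
def sum_of_load(tree, load):
    """Weight #5: sum of the outstanding load over the tree's edges."""
    return sum(load.get(edge, 0.0) for edge in tree)


def final_sum_of_load(tree, load, volume):
    """Weight #6: sum of edge loads after adding the new transfer's volume."""
    return sum(load.get(edge, 0.0) + volume for edge in tree)


def select_tree(candidates, weight):
    """Return the candidate tree with minimum weight (first one wins ties)."""
    return min(candidates, key=weight)


# Two hypothetical trees that reach the same receivers r1 and r2.
small = (("s", "a"), ("a", "r1"), ("a", "r2"))               # 3 edges
large = (("s", "b"), ("b", "c"), ("c", "r1"), ("c", "r2"))   # 4 edges
candidates = [large, small]

# Idle network: weight #5 is zero for both trees, so the choice is arbitrary,
# whereas weight #6 grows with the edge count and prefers the smaller tree.
idle = {}
pick5_idle = select_tree(candidates, lambda t: sum_of_load(t, idle))
pick6_idle = select_tree(candidates, lambda t: final_sum_of_load(t, idle, 1.0))

# Loaded network: the small tree is busier (3.0 per edge versus 1.0 per edge).
busy = {edge: 3.0 for edge in small}
busy.update({edge: 1.0 for edge in large})
pick6_small_transfer = select_tree(
    candidates, lambda t: final_sum_of_load(t, busy, 1.0))
pick6_large_transfer = select_tree(
    candidates, lambda t: final_sum_of_load(t, busy, 10.0))
```

With these numbers, a small transfer is steered to the lightly loaded larger tree, while a large transfer is kept on the smaller tree because its volume is charged once per edge.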
V-B Receiver Set Partitioning
Receiver set partitioning allows separation of faster receivers from the slowest (or slower ones). This is essential to improve network utilization and speed up transfers when there are competing transfers or physical bottlenecks. For example, both GEANT and UNINETT have edges that vary by at least a factor of in capacity. We evaluate QuickCast over a variety of scenarios.
V-B1 Effect of Number of Receivers
We provide an overall comparison of several schemes (QuickCast, Single Load-Aware Steiner Tree, and DCCast ) along with two basic solutions of using a minimum edge Steiner tree and unicast minimum hop path routing as shown in Figure 3. We also considered both light and heavy load regimes. We used real inter-datacenter traffic patterns reported by Facebook for two applications of Cache-Follower and Hadoop . Also, all schemes use the fair sharing rate allocation based on max-min fairness except DCCast which uses the FCFS policy.
The minimum edge Steiner tree leads to the minimum bandwidth consumption. The unicast minimum hop path routing approach separates all receivers per bulk multicast transfer. It, however, uses a significantly larger volume of traffic and also does not offer the best mean completion times for the following reasons. First, it exhausts network capacity quickly which increases tail completion times by a significant factor (not shown here). Second, it can lead to many additional shared links that increase contention across flows and reduce throughput per receiver. The significant increase in completion times of higher percentiles increases the average completion times of the unicast approach.
With , we see that QuickCast offers the best mean and median completion times, i.e., up to less compared to QuickCast with , up to less compared to unicast minimum hop routing, and up to less than single load-aware Steiner tree. To achieve this gain, QuickCast with uses at most more bandwidth compared to using minimum edge Steiner trees, which is still less than the bandwidth usage of unicast minimum hop routing. We also see that while increasing the number of receivers, QuickCast with offers consistently small median completion times by separating fast and slow receivers since the number of partitions is not limited. Overall, we see a higher gain under light load as there is more capacity available to utilize. We also observe that QuickCast with either or performs almost always better than unicast minimum hop routing in mean completion times.
V-B2 Speedup by Receiver Rank
Figure 4 shows how QuickCast can speed up multiple receivers per transfer by separating them from the slower receivers. The gains are normalized by the case where a single partition is used per bulk multicast transfer. When the number of partitions is limited to two, similar to , the highest gain is usually obtained by the first two to three receivers, whereas by allowing more partitions, we can get considerably higher gain for a significant fraction of receivers. Also, by not limiting the partitions to two, we see higher gains for all receiver ranks that is above for multiple receiver ranks. This comes at the cost of higher bandwidth consumption, which we saw in the previous experiment.
V-B3 Partitioning Factor ()
The performance of QuickCast as a function of the partitioning factor is shown in Figure 5, where gains are normalized by single load-aware Steiner tree which uses a single partition per bulk multicast transfer. We computed per receiver mean and 95th percentile completion times as well as bandwidth usage. As can be seen, bandwidth consumption increases with the partitioning factor as more requests’ receivers are partitioned into two or more groups. The gains in completion times keep increasing if is not limited as we increase . That, however, can ultimately lead to unicast delivery to all receivers (every receiver as a separate partition) and excessive bandwidth usage. We see a diminishing return type of curve as is increased, with the highest returns coming when we increase from 1 to 1.1 (marked with a green dashed line). That is because using too many partitions can saturate network capacity while not improving the separation of fast and slow nodes considerably. At , we see up to 10% additional bandwidth usage compared to single load-aware Steiner tree while mean completion times improve by 40% to 50%. According to other experiments not shown here, with large , it is even possible to see reductions in gain that come from excessive bandwidth consumption and increased contention over capacity. Note that this experiment was performed considering four receivers per bulk multicast transfer. Using more receivers can lead to more bandwidth usage with the same , an increased slope at values of close to 1, and faster saturation of network capacity as we increase . Therefore, using smaller is preferred with more receivers per transfer.
V-C Effect of Rate Allocation Policies
As explained earlier in §IV-C, when scheduling traffic over large forwarding trees, fair sharing can sometimes offer significantly higher throughput and hence better completion times. We performed an experiment over the ANS topology and with both light-tailed and heavy-tailed traffic distributions. ANS topology has uniform link capacity across all edges which helps us rule out the effect of capacity variations on throughput obtained via different scheduling policies. We also considered an increasing number of receivers from 4 to 8 and 16. Figure 6 shows the results. We see that fair sharing offers a higher average throughput across all ongoing transfers compared to FCFS and SRPT and that with more receivers, the benefit of using fair sharing increases to up to with 16 receivers per transfer.
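For reference, max-min fair sharing across forwarding trees can be computed with the standard progressive filling procedure, sketched below. Whether the simulator uses this exact procedure is not stated, so this is only an illustration: the function name, the edge labels, and the capacities are hypothetical, every tree is assumed to have at least one edge, and a tree's rate is charged once on each edge it uses.

```python
def max_min_fair_rates(trees, capacity):
    """Progressive filling for max-min fair rates over forwarding trees.

    trees:    dict tree_id -> set of edges used by the tree (non-empty)
    capacity: dict edge -> available capacity
    Returns a dict tree_id -> rate.
    """
    rates = {t: 0.0 for t in trees}
    remaining = dict(capacity)
    active = set(trees)
    while active:
        # Number of still-growing trees that cross each edge.
        users = {}
        for t in active:
            for e in trees[t]:
                users[e] = users.get(e, 0) + 1
        # Largest equal increment that every active tree can still receive.
        shares = {e: remaining[e] / n for e, n in users.items()}
        increment = min(shares.values())
        saturated = {e for e, s in shares.items() if s - increment <= 1e-12}
        for t in active:
            rates[t] += increment
        for e, n in users.items():
            remaining[e] -= increment * n
        # Trees crossing a saturated edge are frozen at their current rate.
        active = {t for t in active if not (trees[t] & saturated)}
    return rates


# Hypothetical example with normalized capacities.
xy, yz = ("x", "y"), ("y", "z")
rates = max_min_fair_rates(
    trees={"T1": {xy, yz}, "T2": {xy}, "T3": {yz}},
    capacity={xy: 1.0, yz: 0.5},
)
```

In this example the edge with capacity 0.5 is the first bottleneck, so T1 and T3 stop at 0.25 each and T2 picks up the capacity left on the other edge.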
V-D Running Time
To assess the scalability of the proposed algorithms, we measured their running time over various topologies (with different sizes) and with varying rates of arrival. We assumed two arrival rates of and which account for light and heavy load regimes. We also considered eight receivers per transfer and all three topologies of ANS, GEANT, and UNINETT. We saw that the running times of Algorithms 1 and 2 remained below one millisecond and 20 milliseconds, respectively, across all of these scenarios. These numbers are less than the propagation latency between the majority of senders and receivers over the considered topologies (a simple TCP handshake would take at least twice the propagation latency). A more efficient realization of these algorithms can further reduce their running time (e.g., implementation in C/C++ instead of Java).
V-E Forwarding Plane Resource Usage
QuickCast can be realized using software-defined networking and OpenFlow compatible switches. To forward packets to multiple outgoing ports on switches where trees branch out to numerous edges, we can use group tables, which have been supported by OpenFlow since early versions. Besides, an increasing number of physical switch makers have added support for group tables. To allow forwarding to multiple outgoing ports, the group table entries should be of type “ALL”, i.e., OFPGT_ALL in the OpenFlow specifications. Group table entries are scarce (compared to TCAM entries) and so should be used with care. Some new switches support 512 or 1024 entries per switch. Another critical parameter is the maximum number of action buckets per entry, which primarily determines the maximum possible branching degree for trees. Across the switches we looked at, we found that the minimum supported value was 8 action buckets, which should be enough for WAN topologies as most such topologies do not have any nodes with this connectivity degree.
In general, reasoning about the number of group table entries needed to realize different schemes is hard since it depends on how the trees are formed which is highly intertwined with edge weights that depend on the distribution of load. For example, consider a complete binary tree with 8 receivers as leaves and the sender at the root. This will require 6 group table entries to transmit to all receivers with two action buckets per each intermediate node on the tree (branching at the sender does not need a group table entry). If instead, we used an intermediate node to connect to all receivers with a branching degree of 8, we would only need one group table entry with eight action buckets.
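The counting argument in this example can be checked with a short sketch. It assumes that a switch needs one group table entry per tree whenever it forwards to two or more next hops, with one action bucket per next hop, and that branching at the sender needs no entry, as stated above. The function name and node labels are hypothetical.

```python
def group_table_entries(children, sender):
    """Return {switch: action bucket count} for nodes that need a group entry.

    children: dict node -> list of next hops on the forwarding tree
    sender:   root of the tree, which is excluded from the count
    """
    return {
        node: len(next_hops)
        for node, next_hops in children.items()
        if node != sender and len(next_hops) >= 2
    }


# Complete binary tree: sender s at the root, receivers r1..r8 as leaves.
binary_tree = {
    "s": ["a", "b"],
    "a": ["c", "d"], "b": ["e", "f"],
    "c": ["r1", "r2"], "d": ["r3", "r4"],
    "e": ["r5", "r6"], "f": ["r7", "r8"],
}
binary_entries = group_table_entries(binary_tree, sender="s")

# Star: one intermediate node m fans out to all eight receivers.
star_tree = {"s": ["m"], "m": ["r%d" % i for i in range(1, 9)]}
star_entries = group_table_entries(star_tree, sender="s")
```

The binary tree yields six entries with two action buckets each, and the star yields a single entry with eight action buckets, matching the counts in the text.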
We measured the number of group table entries needed to realize QuickCast. We computed the average of the maximum, and maximum of the maximum number of entries used per switch during the simulation for the topologies of ANS, GEANT, and UNINETT, with arrival rates of and , considering both light-tailed and heavy-tailed traffic patterns and assuming that each bulk multicast transfer had eight receivers. The experiment was terminated after 200 transfers had arrived. Looking at the maximum helps us see whether there are enough entries at all times to handle all concurrent transfers. Interestingly, we saw that by using multiple trees per transfer, both the average and maximum of the maximum number of group table entries used were less than when a single tree was used per transfer. One reason is that using a single tree slows down faster receivers, which may lead to more concurrent receivers that increase the number of group entries. Also, by partitioning receivers, we make subsequent trees smaller and allow them to branch out closer to their receivers, which balances the use of group table entries across the switches and reduces the maximum. Finally, by using more partitions, the maximum number of times a tree needs to branch to reach all of its receivers decreases. Across all the scenarios considered above, the maximum of maximum group table entries at any timeslot was 123, and the average of the maximum was at most 68 for QuickCast. Furthermore, by setting which allows for more partitions, the maximum of maximum group table entries decreased by up to 17% across all scenarios.
VI Conclusions and Future Work
A variety of applications running across datacenters replicate content between geographically dispersed sites for increased availability and reliability. Such data replication generates bulk multicast transfers with an a priori known sender and set of receivers per transfer, which can be efficiently performed using multicast forwarding trees. We introduced the bulk multicast routing and scheduling problem with the objective of minimizing mean completion times of receivers and decomposed it into three sub-problems of receiver set partitioning, tree selection, and rate allocation. We then proposed QuickCast, which applies three heuristic techniques to offer approximate solutions to these three hard sub-problems. For future research, we will consider parallel trees to increase throughput, which is especially helpful under light traffic load. Also, application of BIER, which allows dynamic and low-cost updates to the forwarding trees, opens up new research opportunities.
-  M. Noormohammadpour, C. S. Raghavendra, S. Kandula, and S. Rao, “QuickCast: Fast and Efficient Inter-Datacenter Transfers using Forwarding Tree Cohorts,” INFOCOM, 2018.
-  S. Jain, A. Kumar, S. Manda et al., “B4: Experience with a globally-deployed software defined wan,” SIGCOMM, vol. 43, no. 4, pp. 3–14, 2013.
-  “Building express backbone: Facebook’s new long-haul network,” https://code.facebook.com/posts/1782709872057497/building-express-backbone-facebook-s-new-long-haul-network/, visited on September 30, 2017.
-  “How microsoft builds its fast and reliable global network,” https://azure.microsoft.com/en-us/blog/how-microsoft-builds-its-fast-and-reliable-global-network/, visited on September 30, 2017.
-  S. Kandula, I. Menache, R. Schwartz, and S. R. Babbula, “Calendaring for wide area networks,” SIGCOMM, vol. 44, no. 4, pp. 515–526, 2015.
-  Y. Zhang, J. Jiang, K. Xu et al., “BDS: A Centralized Near-optimal Overlay Network for Inter-datacenter Data Replication,” in EuroSys, 2018, pp. 10:1–10:14.
-  C. Tang, T. Kooburat, P. Venkatachalam et al., “Holistic Configuration Management at Facebook,” in SOSP, 2015, pp. 328–343.
-  K. Florance. (2016) How netflix works with isps around the globe to deliver a great viewing experience. [Online]. Available: https://media.netflix.com/en/company-blog/how-netflix-works-with-isps-around-the-globe-to-deliver-a-great-viewing-experience
-  C.-Y. Hong, S. Kandula, R. Mahajan et al., “Achieving high utilization with software-driven wan,” in SIGCOMM. ACM, 2013, pp. 15–26.
-  H. Zhang, K. Chen, W. Bai et al., “Guaranteeing deadlines for inter-datacenter transfers,” in EuroSys. ACM, 2015, p. 20.
-  X. Jin, Y. Li, D. Wei, S. Li, J. Gao, L. Xu, G. Li, W. Xu, and J. Rexford, “Optimizing bulk transfers with software-defined optical wan,” in SIGCOMM. ACM, 2016, pp. 87–100.
-  M. Cotton, L. Vegoda, and D. Meyer, “IANA guidelines for IPv4 multicast address assignments,” Internet Requests for Comments, pp. 1–10, 2010. [Online]. Available: https://tools.ietf.org/html/rfc5771
-  S. Banerjee, B. Bhattacharjee, and C. Kommareddy, “Scalable application layer multicast,” in SIGCOMM. ACM, 2002, pp. 205–217.
-  M. Kodialam, T. V. Lakshman, and S. Sengupta, “Online multicast routing with bandwidth guarantees: a new approach using multicast network flow,” IEEE/ACM Transactions on Networking, vol. 11, no. 4, pp. 676–686, Aug 2003.
-  L. H. Huang, H. C. Hsu, S. H. Shen et al., “Multicast traffic engineering for software-defined networks,” in INFOCOM. IEEE, 2016, pp. 1–9.
-  A. Iyer, P. Kumar, and V. Mann, “Avalanche: Data center multicast using software defined networking,” in COMSNETS. IEEE, 2014, pp. 1–8.
-  J. Cao, C. Guo, G. Lu et al., “Datacast: A scalable and efficient reliable group data delivery service for data centers,” IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2632–2645, 2013.
-  S. H. Shen, L. H. Huang, D. N. Yang, and W. T. Chen, “Reliable multicast routing for software-defined networks,” in INFOCOM, April 2015, pp. 181–189.
-  A. Nagata, Y. Tsukiji, and M. Tsuru, “Delivering a file by multipath-multicast on openflow networks,” in International Conference on Intelligent Networking and Collaborative Systems, 2013, pp. 835–840.
-  K. Ogawa, T. Iwamoto, and M. Tsuru, “One-to-many file transfers using multipath-multicast with coding at source,” in HPCC, 2016, pp. 687–694.
-  D. Bertsekas and R. Gallager, “Data networks,” 1987.
-  S. Liang and D. Cheriton, “TCP-SMO: extending TCP to support medium-scale multicast applications,” in INFOCOM, vol. 3, 2002, pp. 1356–1365.
-  B. Adamson, C. Bormann, M. Handley, and J. Macker, “Nack-oriented reliable multicast (norm) transport protocol,” 2009.
-  T. Zhu, F. Wang, Y. Hua, D. Feng et al., “Mctcp: Congestion-aware and robust multicast tcp in software-defined networks,” in International Symposium on Quality of Service, June 2016, pp. 1–10.
-  D. Li, M. Xu, M. c. Zhao, C. Guo et al., “RDCM: Reliable data center multicast,” in INFOCOM, 2011, pp. 56–60.
-  A. Rodriguez, D. Kostic, and A. Vahdat, “Scalability in adaptive multi-metric overlays,” in International Conference on Distributed Computing Systems, 2004, pp. 112–121.
-  K. Nagaraj, H. Khandelwal, C. Killian, and R. R. Kompella, “Hierarchy-aware distributed overlays in data centers using dc2,” in COMSNETS. IEEE, 2012, pp. 1–10.
-  M. Castro, P. Druschel, A.-M. Kermarrec, A. Nandi, A. Rowstron, and A. Singh, “Splitstream: High-bandwidth multicast in cooperative environments,” in SOSP. ACM, 2003, pp. 298–313.
-  K. Jeacle and J. Crowcroft, “Tcp-xm: unicast-enabled reliable multicast,” in ICCCN, 2005, pp. 145–150.
-  L. H. Lehman, S. J. Garland, and D. L. Tennenhouse, “Active reliable multicast,” in INFOCOM, vol. 2, Mar 1998, pp. 581–589 vol.2.
-  C. Gkantsidis, J. Miller, and P. Rodriguez, “Comprehensive view of a live network coding p2p system,” in IMC. ACM, 2006, pp. 177–188.
-  A. Shokrollahi, “Raptor codes,” IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2551–2567, 2006.
-  J. W. Byers, M. Luby, M. Mitzenmacher, and A. Rege, “A digital fountain approach to reliable distribution of bulk data,” in SIGCOMM. ACM, 1998, pp. 56–67.
-  L. Rizzo, “Pgmcc: A tcp-friendly single-rate multicast congestion control scheme,” in SIGCOMM, 2000.
-  C. A. C. Marcondes, T. P. C. Santos, A. P. Godoy, C. C. Viel, and C. A. C. Teixeira, “Castflow: Clean-slate multicast approach using in-advance path processing in programmable networks,” in IEEE Symposium on Computers and Communications, 2012, pp. 94–101.
-  J. Ge, H. Shen, E. Yuepeng et al., “An openflow-based dynamic path adjustment algorithm for multicast spanning trees,” in IEEE TrustCom, 2013, pp. 1478–1483.
-  I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin, “Multicast Using Bit Index Explicit Replication (BIER),” RFC 8279, Nov. 2017. [Online]. Available: https://rfc-editor.org/rfc/rfc8279.txt
-  M. Hefeeda, A. Habib, B. Botev et al., “Promise: Peer-to-peer media streaming using collectcast,” in MULTIMEDIA. ACM, 2003, pp. 45–54.
-  J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, “The bittorrent p2p file-sharing system: Measurements and analysis,” in Proceedings of the 4th International Conference on Peer-to-Peer Systems, ser. IPTPS’05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 205–216.
-  R. Sherwood, R. Braud, and B. Bhattacharjee, “Slurpie: a cooperative bulk data transfer protocol,” in INFOCOM, vol. 2, 2004, pp. 941–951.
-  N. Laoutaris, M. Sirivianos, X. Yang, and P. Rodriguez, “Inter-datacenter bulk transfers with netstitcher,” in SIGCOMM. ACM, 2011, pp. 74–85.
-  S. Su, Y. Wang, S. Jiang, K. Shuang, and P. Xu, “Efficient algorithms for scheduling multiple bulk data transfers in inter-datacenter networks,” International Journal of Communication Systems, vol. 27, no. 12, 2014.
-  N. Laoutaris, G. Smaragdakis, R. Stanojevic, P. Rodriguez, and R. Sundaram, “Delay-tolerant bulk data transfers on the internet,” IEEE/ACM TON, vol. 21, no. 6, 2013.
-  Y. Wang, S. Su et al., “Multiple bulk data transfers scheduling among datacenters,” Computer Networks, vol. 68, pp. 123–137, 2014.
-  Y. Zhang, J. Jiang, K. Xu et al., “BDS: A Centralized Near-optimal Overlay Network for Inter-datacenter Data Replication,” in EuroSys, ser. EuroSys ’18, 2018, pp. 10:1–10:14.
-  M. Noormohammadpour and C. S. Raghavendra, “DDCCast: Meeting Point to Multipoint Transfer Deadlines Across Datacenters using ALAP Scheduling Policy,” arXiv preprint arXiv:1707.02027, 2017.
-  S. Ji, S. Liu, and B. Li, “Deadline-Aware Scheduling and Routing for Inter-Datacenter Multicast Transfers,” in 2018 IEEE International Conference on Cloud Engineering (IC2E), 2018, pp. 124–133.
-  L. Luo, H. Yu, and Z. Ye, “Deadline-guaranteed Point-to-Multipoint Bulk Transfers in Inter-Datacenter Networks,” ICC, 2018.
-  D. Watel and M.-A. Weisser, A Practical Greedy Approximation for the Directed Steiner Tree Problem. Cham: Springer International Publishing, 2014, pp. 200–215.
-  L. Rokach and O. Maimon, Clustering Methods. Springer US, 2005, pp. 321–352.
-  T. Lan, D. Kao, M. Chiang, and A. Sabharwal, “An Axiomatic Theory of Fairness in Network Resource Allocation,” in 2010 Proceedings IEEE INFOCOM, 2010, pp. 1–9.
-  M. Alizadeh, S. Yang, M. Sharif et al., “pFabric: Minimal Near-optimal Datacenter Transport,” SIGCOMM Comput. Commun. Rev., vol. 43, no. 4, pp. 435–446, August 2013.
-  W. Bai, L. Chen, K. Chen et al., “PIAS: Practical information-agnostic flow scheduling for data center networks,” Proceedings of the 13th ACM Workshop on Hot Topics in Networks, p. 25, 2014.
-  Y. Lu, G. Chen, L. Luo et al., “One more queue is enough: Minimizing flow completion time with explicit priority notification,” INFOCOM, pp. 1–9, 2017.
-  The Internet Topology Zoo (ANS). [Online]. Available: http://www.topology-zoo.org/files/Ans.gml
-  The Internet Topology Zoo (GEANT). [Online]. Available: http://www.topology-zoo.org/files/Geant2009.gml
-  The Internet Topology Zoo (UNINETT). [Online]. Available: http://www.topology-zoo.org/files/Uninett2011.gml
-  https://github.com/mouton5000/DSTAlgoEvaluation, visited on Apr 27, 2017.
-  A. Roy, H. Zeng, J. Bagga et al., “Inside the Social Network’s (Datacenter) Network,” in SIGCOMM. ACM, 2015, pp. 123–137.
-  M. Noormohammadpour, C. S. Raghavendra, S. Rao et al., “DCCast: Efficient Point to Multipoint Transfers Across Datacenters,” in HotCloud. USENIX Association, 2017.