Adaptive Tree Selection for Efficient Point to Multipoint Transfers Across Datacenters
Using multiple datacenters allows for higher availability, load balancing and reduced latency to customers of cloud services. To distribute multiple copies of data, cloud providers depend on inter-datacenter WANs that ought to be used efficiently considering their limited capacity and the ever-increasing data demands. In this paper, we focus on applications that transfer objects from one datacenter to several datacenters over dedicated inter-datacenter networks. We present DCCast, a centralized Point to Multi-Point (P2MP) algorithm that uses forwarding trees to efficiently deliver an object from a source datacenter to required destination datacenters. With low computational overhead, DCCast selects forwarding trees that minimize bandwidth usage and balance load across all links. With simulation experiments on Google's GScale network, we show that DCCast can reduce total bandwidth usage and tail Transfer Completion Times (TCT) by up to 50% compared to delivering the same objects via independent point-to-point (P2P) transfers.READ FULL TEXT VIEW PDF
Large inter-datacenter transfers are crucial for cloud service efficienc...
Several organizations have built multiple datacenters connected via dedi...
Bulk transfers from one to multiple datacenters can have many different
The Tor network uses a measurement system to estimate its relays' forwar...
As applications become more distributed to improve user experience and o...
An important concern for many Cloud customers is data confidentiality. O...
Adaptive Tree Selection for Efficient Point to Multipoint Transfers Across Datacenters
Increasingly, companies rely on multiple datacenters to improve quality of experience for their customers. The benefits of having multiple datacenters include reduced latency to customers by mapping users to datacenters according to location, increased failure resiliency and availability, load balancing by mapping users to different datacenters and respecting local data laws. Companies may either own the datacenters or depend on infrastructure and services from providers such as Microsoft Azure , Google Compute Engine  or Amazon Web Services . Large providers use geographically distributed dedicated networks to connect their datacenters [4, 5, 6, 7].
Nowadays, many services operate across multiple datacenters. Such services require efficient data transfers among datacenters including replication of objects from one datacenter to multiple datacenters which is referred to as geo-replication [8, 9, 10, 11, 12, 13, 14, 15, 7, 16, 4, 17]. Examples of such transfers include synchronizing search index information , replication of databases [18, 19], and distribution of high definition videos across CDNs [20, 21, 17, 22, 23]. Table 1 provides a brief list of how many replicas are made for some applications.
Across availability regions , 
, for various object types including large machine learning configs
|CloudBasic SQL Server||Up to secondary databases with active Geo-Replication (asynchronous) |
|Azure SQL Database||Up to secondary databases with active Geo-Replication (asynchronous) |
|Oracle Directory Server||Up to the number of datacenters owned by an enterprise for regional load balancing of directory servers [29, 30]|
|AWS Route GLB||Across multiple regions and availability zones for global load balancing |
|Youtube||Function of popularity, content potentially pushed to many locations (could be across datacenters )|
|Netflix||Across to availability regions , and up to cache locations |
In this paper, we focus on transfers that deliver an object from a source to multiple destinations and call them Point to Multipoint (P2MP) transfers. One solution to make such transfers is to initiate multiple independent point-to-point (P2P) transfers that are scheduled separately [35, 36, 37, 38, 14, 39, 40, 41, 5, 22, 4, 42]. There may however be more efficient ways, in terms of total bandwidth usage and transfer completion times, to perform P2MP transfers by sending at most one copy of the message across any link given that the source datacenter and destination datacenters are known apriori. We present an elegant solution using minimum weight Steiner Trees  for P2MP transfers that achieves reduced bandwidth usage and tail completion times.
Another approach would be to select trees that connect from sources to all destinations and complete transfers using store-and-forward [35, 36, 37, 38, 14]. This will impose the overhead of transferring and storing additional copies of objects on intermediate datacenters during the delivery process incurring storage costs as the number of transfers increase, and wasting intra-datacenter bandwidth since such objects have to be stored on a server inside the intermediate datacenters.
Alternatively one can use multicast protocols, which are designed to form group memberships, provide support for members to join or leave, and manage multicast trees as users come and go [44, 45]. These approaches involve complex management algorithms and protocols to cope with changes in multicast group membership [46, 47] which are unnecessary for our problem given that the participants of each P2MP transfer are fixed.
Application layer multicast techniques [48, 46] reduce the implementation and management complexity by creating an overlay network across the multicast group. However, since these methods are typically implemented in end-host applications, they lack full visibility to underlying network properties and status, such as topology and available bandwidth, and may still send multiple copies of packets along the same links wasting bandwidth.
How can one efficiently perform P2MP transfers? Objective is minimizing tail Transfer Completion Times (TCT) considering limited available bandwidth of networks connecting datacenters and the many transfers that share the network which arrive in an online manner. Also, the system does not have prior knowledge of incoming P2MP transfers; therefore, any solution has to be fast and efficient to deal with them as they arrive.
To perform a P2MP transfer, traffic can be concurrently sent to all destinations over a minimum weight Steiner Tree that connects the source and destinations with transfers’ demands as link weights. We refer to such trees as forwarding trees using which we can reduce total bandwidth usage. A central controller with global view of network topology and distribution of load [7, 49, 6, 5, 50, 51] can carefully weigh out various options and select forwarding trees for transfers.
We can implement forwarding trees using SDN  capable switches that support Group Tables [53, 54] such as [55, 56, 57, 58]. Using this feature, a packet can be replicated and supplied to multiple buckets each processing and forwarding it to a required output switch port. A group might have many buckets, for example one vendor supports up to . There is growing vendor support for group tables as later OpenFlow standards are adopted. In this paper, we use abstract simulations to verify our techniques. In the future, we plan on implementing forwarding trees using Group Tables.
Motivating Example: In Figure 1, an object is to be transferred from datacenter to two datacenters considering a link throughput of . In order to send to destinations, one could initiate individual transfers, but that wastes bandwidth and increases delivery time since the link attached to turns into a bottleneck.
In this paper, we present an efficient scheme for P2MP transfers called DCCast. It selects forwarding trees according to a weight assignment that tries to balance load across the network. In addition, it uses temporal planning  and schedules P2MP requests on a first come first serve (FCFS) basis to provide guarantees on completion times. FCFS is shown to reduce tail completion times under light-tailed job size distribution [59, 60] (optimal discipline depends on this distribution ).
A related concept is Coflows  that improve performance by jointly scheduling groups of flows with a collective objective. A P2MP transfer can be viewed as a coflow; however unlike other coflows, P2MP transfers present a better opportunity for optimization since the same data is being delivered to different destinations. Using this, DCCast provides the added benefit of reduced bandwidth usage.
We conducted extensive simulation experiments to evaluate DCCast compared with other strategies for picking forwarding trees and scheduling techniques. We performed simulations using synthetic traffic over the Google’s GScale topology  with nodes and edges and random topologies with nodes and
edges. Our evaluation metrics include bandwidth usage as well as mean and tail TCT. Our current solution assumes the same class of service for all transfers. In a general setting, transfers may have different priorities and should be allotted resources accordingly (e.g. near real-time video vs. cross-region backups).
We evaluated various forwarding tree selection methods. Clearly, there is benefit in carefully picking trees and we observed up to improvement in completion times while using DCCast compared to random tree selection and up to compared to selection of trees that minimize the maximum load over any edge.
We compared DCCast with P2P schemes by viewing each P2MP transfer as multiple independent transfers and using multipathing ( shortest paths) to spread the load. For GScale topology and with to destinations per transfer, DCCast reduced both bandwidth usage and tail TCT by over to , respectively, while providing guarantees to users on completion times.
In summary, our contributions are the following:
DCCast minimizes packet reordering and provides guarantees on completion times.
The list of variables and their definitions is provided on table 2. To allow for flexible bandwidth allocation, we consider a slotted timeline [5, 41, 42] where the transmission rate of senders is constant during each timeslot, but can vary from one timeslot to next. This can be achieved via rate-limiting at end-hosts [6, 49]. A central scheduler is assumed that receives transfer requests from end-points, calculates their temporal schedule, and informs the end-points of rate-allocations when a timeslot begins. We focus on scheduling large transfers that take more than a few timeslots to finish and therefore, the time to submit a transfer request, calculate the routes, and install forwarding rules is considered negligible in comparison. We assume equal capacity for all links in an online scenario where requests may arrive anytime.
|A P2MP transfer|
|Volume of in bytes|
|Source datacenter of|
|Set of destinations of|
|The inter-datacenter network graph|
|A Steiner Tree |
|Set of edges of|
|Set of edges of|
|Total load currently scheduled on edge|
|Available bandwidth on edge at timeslot|
|Width of a timeslot in seconds|
Forwarding Trees: Our proposed approach is, for each P2MP transfer, to jointly route traffic from source to all destinations over a forwarding tree to save bandwidth. Using a single forwarding tree for every transfer also minimizes packet reordering which is known to waste CPU and memory resources at the receiving ends especially at high rates [62, 63]. To perform a P2MP transfer with volume , the source transmits traffic over a Steiner Tree that spans across . At any timeslot, traffic for any transfer flows with the same rate over all links of a forwarding tree to reach all the destinations at the same time. The problem of scheduling a P2MP transfer then translates to finding a forwarding tree and a transmission schedule over such a tree for every arriving transfer in an online manner. A relevant problem is the Minimum Weight Steiner Tree 
that can help minimize total bandwidth usage with proper weight assignment. Although it is a hard problem, heuristic algorithms exist that often provide near optimal solutions[64, 65].
Scheduling Discipline: When forwarding trees are found, we schedule traffic over them according to First Come First Serve (FCFS) policy using all available residual bandwidth on links to minimize the completion times. This allows us to provide guarantees to users on when their transfers will complete upon their arrival. We do not use a preemptive scheme, such as Shortest Remaining Processing Time (SRPT), due to practical concerns: larger transfers might get postponed over and over which might lead to the starvation problem and it is not possible to make promises on exactly when a transfer would complete. Optimal scheduling discipline to minimize tail times rests on transfer size distribution .
Algorithms: DCCast is made of two algorithms111Please find an implementation of DCCast algorithms on Github: https://github.com/noormoha/DCCast. Update() is executed upon beginning of every timeslot. It simply dispatches the transmission schedule, that is the rate for each transfer, to all senders to adjust their rates via rate-limiting and adjusts by deducting the total traffic that was sent over during current timeslot.
Allocate() is run upon arrival of every request which finds a forwarding tree and schedules to finish as early as possible. Pseudo-code of this function has been shown in Algorithm 1. Statically calculating minimal forwarding trees might lead to creation of hot-spots, even if there exists one highly loaded edge that is shared by all of such trees. As a result, DCCast adaptively chooses a forwarding tree that reduces the tail transfer completion times while saving considerable bandwidth.
It is possible that larger trees provide higher available bandwidth by using longer paths through least loaded edges, but using which would consume more overall bandwidth since they send same traffic over more edges. To model this behavior, we use a weight assignment that allows balancing these two possibly conflicting objectives. The weights represent traffic load allocated on links. Selecting links with lower weights will improve load balancing that would be better for future requests. The tradeoff is in avoiding heavier links at the expense of getting larger trees for even distribution of load.
The forwarding tree selected by Algorithm 1 will have the total weight . This weight is essentially the total load over if request were to be put on it. Selecting trees with minimal total weight will most likely avoid highly loaded edges and larger trees. To find an approximate minimum weight Steiner Tree, we used GreedyFLAC [65, 66], which is quite fast and in practice provides results not far from optimal.
|MINMAX||Selects forwarding trees to minimize maximum load on any link. Schedules traffic using FCFS policy §3|
|RANDOM||Selects random forwarding trees. Schedules traffic using FCFS policy §3|
|BATCHING||Batches (enqueues) new requests arriving in time windows of . At the end of batching windows, jointly schedules all new requests according to Shortest Job First (SJF) policy and picks their forwarding trees using weight assignment of Algorithm 1|
|SRPT||Upon arrival of a new request, jointly reschedules all existing requests and the new request according to SRPT policy §3 and picks new forwarding trees for all requests using weight assignment of Algorithm 1|
Views each P2MP request as multiple independent point-to-point (P2P) requests. Uses a Linear Programming (LP) model along with SRPT policy §3 to (re)schedule each request over -Shortest Paths between its source and destination upon arrival of new requests
|P2P-FCFS-LP||Similar to above while using FCFS policy §3|
We evaluated DCCast using synthetic traffic. We assumed a total capacity of
for each timeslot over every link. The arrival of requests followed a Poisson distribution with rate. Demand of every request
was calculated using an exponential distribution with meanadded to a constant value of (fixing the minimum demand to ). All simulations were performed over as many timeslots as needed to finish all requests with arrival time of last request set to be or less. Presented results are normalized by minimum values in each chart.
We measure three different metrics: total bandwidth used and mean and tail TCT. The total bandwidth used is the sum of all traffic over all timeslots and all links. The completion time of a transfer is defined as its arrival time to the time its last bit is delivered to the destination(s). We performed simulations using Google’s GScale topology , with nodes and edges, on a single machine (Intel Core i7-6700T and 24 GBs of RAM). All simulations were coded in Java and used Gurobi Optimizer  to solve linear programs for P2P schemes. We increased the destinations (copies) for each object from to
picking recipients according to uniform distribution. Table3 shows list of considered schemes (first are P2MP schemes and last are P2P).
We tried various forwarding tree selection criteria over both GScale topology and a larger random topology with nodes and edges as shown in Figures 2 and 3, respectively. In case of GScale, DCCast performs slightly better than RANDOM and MINMAX in completion times while using equal overall bandwidth (not in figure). In case of larger random topologies, DCCast’s dominance is more obvious regarding completion times while using same or less bandwidth (not in figure).
We also experimented various scheduling disciplines over forwarding trees as shown in Figure 4. The SRPT discipline performs considerably better regarding mean completion times; it however may lead to starvation of larger transfers if smaller ones keep arriving. It has to compute and install new forwarding trees and recalculate the whole schedule, for all requests currently in the system with residual demands, upon arrival of every new request. This could impose significant rule installation overhead which is considered negligible in our evaluations. It might also lead to lots of packet loss and reordering. Batching improves performance marginally compared to DCCast and could be an alternate road to take. Generally, a smaller batch size results in a smaller initial scheduling latency while a larger batch size makes it possible to employ collective knowledge of many requests in a batch for optimized scheduling. Batching might be more effective for systems with bursty request arrival patterns. All schemes performed almost similarly regarding tail completion times and total bandwidth usage (not in figure).
In Figure 5 we compare DCCast with a Point-To-Point scheme (P2P-SRPT-LP) using SRPT scheduling policy which uses various number of shortest paths () and delivers each copy independently. The total bandwidth usage is close for all schemes when there is only one destination per request. Both bandwidth usage and tail completion times of DCCast are up to less than that of P2P-SRPT-LP as the number of destinations per transfer increases. Although DCCast follows the FCFS policy, its mean completion time is close to that of P2P-SRPT-LP and surpasses it for copies due to bandwidth savings which leave more headroom for new transfers.
In a different experiment, we compared DCCast with P2P-FCFS-LP. DCCast again saved up to bandwidth and reduced tail completion times by up to almost while increasing the number of destinations per transfer.
Computational Overhead: We used a network with nodes and edges and considered P2MP transfers with destinations per transfer. Transfers were generated according to Poisson distribution with arrival times ranging from to timeslots and the simulation ran until all transfers were completed. Mean processing time of a single timeslot increased from to per timeslot while increasing from to . Mean processing time of a single transfer (which accounts for finding a tree and scheduling the transfer) was and per transfer for equal to and , respectively. This is negligible compared to timeslot lengths of minutes in prior work .
In this paper, we presented DCCast, which aims to reduce the total network bandwidth usage while minimizing transfer completion times using point to multipoint (P2MP) transfers. To save bandwidth, DCCast uses forwarding trees that connect source of a P2MP transfer to its destinations. Selection of forwarding trees is performed in a way that attempts to balance load across the network by avoiding highly loaded links while reducing bandwidth usage by choosing smaller trees. To provide guarantees on transfer completion times, DCCast schedules new traffic to finish as early as possible while not changing the schedule of already allocated transfers. Our evaluations show that DCCast can significantly reduce bandwidth usage compared to viewing each P2MP transfer as multiple independent transfers.
In the future, we would like to perform testbed experiments using traces of real traffic. An alternate scheduling scheme to what we proposed would be Fair Sharing which we aim to study. Next, we would like to evaluate DCCast using a mix of P2MP transfers with different number of destinations to better understand how various applications might interact. It would be interesting to consider multiple classes of traffic with different priorities as well. Moreover, further investigation is needed to measure the fraction of traffic with multiple destinations which benefit from forwarding trees. Finally, it is necessary to study approaches for handling and recovering from failures.
We would like to thank our shepherd Sindhu Ghanta and the anonymous reviewers of the HotCloud workshop whose comments helped us greatly improve the quality of this paper.
Prior work on inter-datacenter transfers uses a combination of various techniques, such as rate-control [49, 6], temporal planning , store-and-forward , and topology changes , to improve the performance and efficiency of point-to-point (P2P) inter-datacenter transfers. In this paper, we focused on point-to-multipoint (P2MP) transfers and a new direction in which forwarding trees are used to deliver an object simultaneously to all destinations. We discussed the benefits of this approach and presented evaluations to back up our proposal.
The extent to which our approach can benefit operators as well as cloud and datacenter applications used in industry would be one possible discussion point. This depends partly on the type of applications, desired level of reliability, and customer SLOs such as user access latency. Finally, a crucial factor is the fraction of transfers with multiple destinations, which are ones that can actually benefit from use of forwarding trees.
There are several discussion points regarding implementation. In case of using tools, such as SDN, to cheaply setup and tear down forwarding trees, how can we meet the necessary performance metrics, such as delay, considering number of required forwarding rules, transfer arrival rate, and latency overhead of rule installations? What are possible tensions between performance efficiency and practical feasibility of techniques?
Another topic of discussion would be selection of the forwarding trees. We presented and evaluated three possible selection methods. However selection of forwarding trees that reduce usage of bandwidth while minimizing transfer completion times is an open problem. In addition, would it be better to select and setup trees dynamically, or statically build many trees and use them?
Scalability of our approach considering various network topologies, different network sizes as well as arrival rate of transfers, effectiveness of the scheduling discipline used in satisfying users and operators, and handling of network or end-point failures would be other possible discussion topics. Would it be possible to apply known methods of improving scalability if necessary?
Annals of Applied Probability, pages 1–48, 2001.