Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding

12/19/2017 ∙ by Zhuotao Liu, et al. ∙ University of Illinois at Urbana-Champaign and The Hong Kong University of Science and Technology

Today's cloud networks are shared among many tenants. Bandwidth guarantees and work conservation are two key properties for ensuring predictable performance for tenant applications and high network utilization for providers. Despite significant efforts, very little prior work achieves both properties simultaneously, even though some of it claims to. In this paper, we present QShare, an in-network solution that achieves bandwidth guarantees and work conservation simultaneously. QShare leverages weighted fair queuing on commodity switches to slice network bandwidth for tenants, and solves the challenge of queue scarcity through balanced tenant placement and dynamic tenant-queue binding. QShare is readily implementable with existing switching chips. We have implemented a QShare prototype and evaluated it via both testbed experiments and simulations. Our results show that QShare ensures bandwidth guarantees while driving network utilization to over 91%.


1. Introduction

Sharing the network of multi-tenant datacenters has been a critical theme for public clouds. The two primary objectives, among others, are bandwidth guarantees and work conservation. Bandwidth guarantees ensure a predictable lower bound on network performance for tenant applications. Recent studies show that, without bandwidth guarantees, network performance can experience 5x or more variation, leading to poor application performance (oktopus, ). Work conservation enables a tenant to use spare bandwidth beyond its minimum guarantee to further improve its application performance and to boost the provider's network utilization. Given that datacenter traffic is bursty in nature and that the average network utilization is low (facebook, ; dc_mesaure, ; imc2010, ), work conservation can deliver over 10x additional bandwidth to a tenant VM beyond its minimum guarantee (elasticswitch, ).

However, it is hard to achieve both bandwidth guarantees and work conservation simultaneously. Prior works such as Oktopus (oktopus, ) and SecondNet (secondnet, ) can achieve bandwidth guarantees, but they are not work-conserving. Seawall (seawall, ) and NetShare (netshare, ) achieve work conservation, but they do not provide bandwidth guarantees (more details in §8).

ElasticSwitch (elasticswitch, ) takes the first step toward achieving both properties at the same time. It is an endhost-based solution that first translates per-VM hose-model bandwidth guarantees into VM-to-VM pair rate limiters (referred to as Guarantee Partitioning, GP), and then dynamically allocates spare bandwidth to these VM pairs to achieve high utilization (referred to as Rate Allocation, RA). However, this approach confronts two challenges: (i) as tenant applications are typically opaque to network operators, it is difficult for GP to estimate each tenant's traffic matrix (including VM-to-VM communication patterns and per VM-pair demand), affecting both bandwidth guarantees and work conservation; (ii) to detect spare bandwidth, RA needs to probe the network by increasing rates, which creates a tradeoff between accurately providing bandwidth guarantees and being work-conserving (elasticswitch, ; trinity, ): a conservative RA sacrifices work conservation, while an aggressive RA affects other tenants' bandwidth guarantees (see our experiments in §10).

Trinity (trinity, ) moves one step further and complements ElasticSwitch with simple in-network support. It uses two priority queues in switches to segregate and prioritize bandwidth-guarantee traffic over work-conservation traffic, so that aggressive RA by one tenant does not affect the bandwidth guarantees of others. While Trinity solves the second challenge of ElasticSwitch, it still suffers from the more fundamental challenge of executing GP without prior knowledge of the tenant traffic matrix. Further, it introduces other issues, such as packet reordering and starvation, due to traffic segregation and priority queuing.

As a result, prior solutions essentially do not achieve both goals satisfactorily. To give a sense of the gap, our testbed experiments show that without prior knowledge of the tenant traffic matrix, state-of-the-art solutions relying on GP fail to achieve good work conservation (given that bandwidth guarantees are satisfied), which, for instance, causes 2x longer flow completion times (FCTs) for tenant applications compared with our proposed solution.

Motivated by this, in this paper we propose QShare, a comprehensive in-network solution that addresses the above challenges so as to achieve both goals in a sufficient manner. Instead of using two priority queues to segregate two different traffic types, QShare directly leverages multiple weighted fair queues (WFQs) to slice network bandwidth for tenants. This ensures that (i) bandwidth guarantees are achieved through proper queue weight configuration and tenant placement rather than endhost rate limiters, thus relieving us of GP; (ii) the network link is driven to full utilization instantly as long as one tenant has sufficient demand; (iii) no matter how aggressively a tenant transmits, the bandwidth guarantees of other tenants are not affected, as they are served in separate weighted queues; (iv) no packet reordering or starvation arises. While promising, QShare faces a practical challenge of queue scarcity: the number of queues on a commodity switch port (typically 8) can be less than the number of tenants served by this port (see §7.3.1 for a detailed analysis in large-scale datacenters).

To address this challenge, we make the following observation: although the total number of embedded tenants associated with a port may be large, during a short time interval (e.g., a few seconds), the number of concurrent tenants whose traffic demands exceed their bandwidth guarantees is small. This is also reflected by measurement results from production datacenters, where the average link utilization is low (facebook, ; dc_mesaure, ; imc2010, ). Thus, to support more tenants with limited queues, QShare dynamically assigns dedicated queues to tenants whose demands exceed their guarantees, while serving the tenants whose current demands are lower than their bandwidth guarantees together in a shared queue.

QShare mainly contains two modules: a balanced tenant placement module and a dynamic tenant-queue binding module (§3). The tenant placement module is responsible for allocating network resources to tenants to provide bandwidth guarantees. To facilitate dynamic queue allocation for embedded tenants, our placement module also aims to balance the usage of switch ports among tenants to avoid overwhelming certain ports. The tenant-queue binding module then takes into account the traffic demands of tenants and their payment factors to dynamically distribute queue resources among tenants.

We implement a prototype of QShare (C for the Linux kernel space and Python for user space), and perform extensive evaluations on a testbed and via simulations. Our evaluation results suggest that:

  • Without sacrificing bandwidth guarantees, QShare achieves (i) perfect work conservation given correct prediction of demand trends (not the exact traffic matrix), and (ii) over 91% link utilization given completely unpredictable demands.

  • Given the above desirable properties, QShare significantly benefits applications, for instance, by reducing their flow completion times (FCTs) compared with the state-of-the-art (elasticswitch, ; trinity, ).

  • With production datacenter settings, QShare can assign dedicated queues to a large majority of embedded tenants even when the datacenter is fully reserved, yielding tangible throughput gains over their bandwidth guarantees and better efficiency in link utilization.

2. Background and Motivation

2.1. Background

In multi-tenant datacenters (oktopus, ; secondnet, ; tag, ; netlord, ), a conceptually centralized tenant manager with a global view of the datacenter state is responsible for managing all tenants, including tenant embedding, routing updates, logging, failure handling & recovery, and so forth. By designing various components for the tenant manager, datacenter operators are able to achieve self-interested goals, such as accommodating more tenants and achieving efficient resource utilization. QShare can be viewed as a newly designed component of the tenant manager that simultaneously accomplishes the following two desirable properties: bandwidth guarantees and work conservation.

Network bandwidth guarantees are preferable properties in cloud computing to offer tenants predictable performance. A typical way to model bandwidth guarantees is the hose model (hose, ; oktopus, ; faircloud, ; eyeq, ; gatekeeper, ; elasticswitch, ; tag, ). As an illustrative example, Figure 1(a) shows a tenant A's bandwidth guarantees defined in a hose model, and Figure 1(b) illustrates the reserved bandwidth on each physical link to satisfy the bandwidth guarantees after tenant embedding. For simplicity, a symmetric hose model is plotted in Figure 1. Providing accurate bandwidth guarantees for VMs that can use multiple paths is an open problem since it requires a perfect load balancer to accurately distribute each VM's traffic over multiple paths such that the sum of the guarantees on each path equals the total amount of guaranteed bandwidth. As a result, prior proposals for providing bandwidth guarantees either assume a tree-based network topology (secondnet, ; oktopus, ; elasticswitch, ) or confine each tenant's traffic within a tree in multi-path network topologies (tag, ; silo, ; OpReduce, ). QShare belongs to the second category as typical datacenters (e.g., Clos (clos, ; fattree, )) are built with path redundancy. QShare, however, can still fully utilize the redundant network links via balanced tenant placement.

(a) Hose model for B.G.
(b) Bandwidth reservation
Figure 1. Figure 1(a) shows the tenant (VM) bandwidth guarantees defined in a symmetric hose model. Figure 1(b) shows the reserved bandwidth on each link after embedding the tenant.

Work conservation is desired for achieving efficient resource utilization. Formally, in the context of multi-tenant datacenters, work conservation is defined as follows: for any link in the network, as long as at least one tenant has packets to send along that link, the link cannot have spare bandwidth (faircloud, ). We note that work conservation does not guarantee that there are no idle links in the network. Idle links may exist due to the lack of traffic demand or because high-demand tenants are bottlenecked on other links.

2.2. State-of-the-Art Solutions

ElasticSwitch (elasticswitch, ) makes the first attempt to achieve work-conserving bandwidth guarantees. It is an end-host based solution composed of two modules: a Guarantee Partitioning (GP) module that divides each VM's hose-model guarantee into guarantees to/from every other VM it communicates with, and a Rate Allocation (RA) module that assigns spare bandwidth to these VM pairs to achieve high network utilization. However, it suffers from the following two key challenges.

First, since the traffic matrix (TM) among the VMs of a tenant is typically unknown to cloud providers, GP has to gradually learn each VM pair's demand via periodic source-destination VM coordination and throughput measurement. Whenever the TM changes, GP needs to re-estimate it even if per-VM demand remains the same (see the illustrative example in §10). Given the highly bursty and dynamic TMs in datacenters, it is challenging for GP to capture the real communication pattern and estimate the TM correctly, especially considering that tens of thousands of VMs can produce billions of VM pairs. Performing GP in a dynamic context without prior knowledge of the tenant TM is a fundamental challenge for ElasticSwitch since it affects both bandwidth guarantees and work conservation. Further, at such scale, the overhead of maintaining these VM-pair rate limiters at hypervisors is non-negligible (elasticswitch, ).

Second, RA in ElasticSwitch (elasticswitch, ) aims to grab available network bandwidth beyond the provided guarantees. It probes the network by increasing rates, detects congestion via packet losses or ECN, and then allocates the spare bandwidth to VM pairs in a max-min fashion following weighted TCP algorithms (seawall, ; seawall-like, ). As mentioned in (elasticswitch, ; trinity, ), this creates a tradeoff between accurately providing bandwidth guarantees and being work-conserving: aggressive RA could affect other tenants' guarantees, whereas conservative RA ends up wasting bandwidth. In practice, RA's performance depends on parameter choice and system tuning.

Trinity (trinity, ) moves one step further and complements the endhost-based ElasticSwitch with simple in-network support. It exploits two priority queues in switches to segregate and prioritize bandwidth-guarantee traffic over work-conservation traffic. As a result, VMs can send work-conservation traffic more aggressively than in ElasticSwitch with less effect on the bandwidth guarantees of other tenants. Thus, Trinity achieves work conservation in a static context, i.e., when the demand of each VM pair (i.e., the TM) is known a priori. However, it still suffers from the fundamental challenge of executing GP in a dynamic context since it still needs to translate per-VM hose-model bandwidth guarantees into VM-pair rate limiters on hypervisors. Further, since network traffic is segregated and served with strict priorities, Trinity raises packet reordering and starvation issues in practice.

We perform detailed experiments and analysis to quantify these limitations; see §10 for detailed results. Motivated by this, we propose QShare to address these challenges.

3. QShare Overview

QShare is a comprehensive in-network solution to address the above challenges. Instead of using two priority queues to segregate two different traffic types, QShare leverages multiple weighted fair queues (WFQs) (WFQ is emulated by WRR on some switches) to slice network bandwidth for tenants. This enables QShare to provide tenant-level bandwidth guarantees and work conservation (instead of the rigid VM-to-VM pair level used in both ElasticSwitch (elasticswitch, ) and Trinity (trinity, )), leaving tenant applications full flexibility to use their allocated bandwidth as needed. We note that such tenant-level bandwidth guarantees are also used in (end-to-end, ; oktopus, ), but those systems fail to achieve work conservation.

(a) In-network Support
(b) Pure endhost-based
Figure 2. Compared with pure endhost-based solution, QShare achieves perfect (no tradeoff) work-conserving bandwidth guarantees via in-network WFQ support.
(a) In-network support
(b) Balanced tenant placement
(c) Tenant-queue allocation
Figure 3. Three illustrative examples of QShare's design. Figure 3(a) shows that QShare incorporates in-network WFQ support: tenants A and B sharing the link are served in separate weighted queues. Figure 3(b) shows that QShare's tenant placement algorithm balances the usage of switch ports among the embedded tenants to avoid overwhelming certain ports. Figure 3(c) shows QShare's key design: assigning a dedicated queue to the high-demand tenant A achieves perfect work conservation without estimating A's TM, while tenants B and C, served in the shared queue, immediately receive their guaranteed bandwidth once they become active.

3.1. In-Network WFQ Support

WFQ on commodity switches offers desirable in-network support for achieving work-conserving bandwidth guarantees. We use the following toy experiment to demonstrate its benefit. We place two tenants A and B on our testbed, with each tenant's VMs evenly distributed across two racks connected by a core link. As A and B share the core link, their flows are served in two separate WFQ queues whose weights are configured in proportion to their guaranteed bandwidth on the link (Figure 3(a)).

Consider a case where both A and B adopt the same symmetric hose model, so each VM has the same per-VM guarantee and both tenants have equal guaranteed bandwidth on the core link. To generate traffic, each VM in one rack is configured to communicate with randomly selected VMs, using our client/server program described in §7. Each VM's demand and its communication pattern are completely random, and only intra-tenant communication is considered. We measure the amount of core-link bandwidth utilized by each tenant. As plotted in Figure 2(a), without relying on any TM estimation, WFQ enables QShare to achieve perfect work-conserving bandwidth guarantees without imposing packet reordering or starvation issues. We repeat the experiment using our self-implemented prototype of ElasticSwitch. As illustrated in Figure 2(b), we notice a significant gap between the aggregate bandwidth of A & B and the link capacity, i.e., over 60% of the unreserved bandwidth is wasted. We are aware that ElasticSwitch's performance depends on parameter settings; we consider different settings in §10.

3.2. Design Overview

The key challenge for QShare is queue scarcity: the number of queues on each switch port (typically 8) can be less than the number of tenants served by that port, so we cannot allocate a dedicated queue to every tenant. Thus, to retain the benefits of in-network WFQ support, QShare designs two modules: a balanced tenant placement module and a dynamic tenant-queue binding module.

The placement module first seeks to provision each tenant's virtual network so that its bandwidth guarantees are ensured. Further, it balances the usage of switch ports among tenants to reduce the stress of the dynamic queue allocation performed by the binding module. For instance, if both placements in Figure 3(b) satisfy bandwidth guarantees, QShare prefers the one on the right side since the switch ports (and their queues) are more evenly utilized by the tenants.

The tenant-queue binding module dynamically assigns dedicated queues to tenants whose demands are higher than their guaranteed bandwidth, while serving all low-demand tenants in a shared queue (they may employ ElasticSwitch-like rate allocation to improve the worst-case performance, as explained below). As a result, tenants in dedicated queues can burst their traffic with arbitrary communication patterns without affecting other tenants. This design is the key to avoiding the challenging GP and to eliminating the tradeoff between bandwidth guarantees and work conservation in endhost-based solutions (elasticswitch, ; trinity, ). We perform an experiment to demonstrate this. Consider three tenants A, B and C competing on a link, as shown in Figure 3(c), each with the same guaranteed bandwidth on that link, and assume for now that the link only has two WFQ queues. Suppose tenant A is high-demand, so QShare assigns it a dedicated queue while B and C share a common queue, with queue weights set in proportion to the bandwidth reserved in each queue. When only A is active, it can fully utilize the link capacity with arbitrary communication patterns, achieving work conservation. Further, tenants B and C are not overwhelmed by A, and immediately receive their guaranteed bandwidth once they become active.
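To make the sharing behavior concrete, the following sketch (not QShare code) models how a weighted fair queue divides a link among backlogged queues via water-filling; the weights and demands are hypothetical values matching the example above (tenant A in a dedicated queue, B and C together in a shared queue).

def wfq_shares(capacity_mbps, queues):
    """Water-filling model of WFQ: each backlogged queue gets at least its
    weighted share; capacity left by satisfied queues is redistributed."""
    active = {q for q, (w, d) in queues.items() if d > 0}
    alloc = {q: 0.0 for q in queues}
    remaining = capacity_mbps
    while active and remaining > 1e-9:
        total_w = sum(queues[q][0] for q in active)
        done = []
        for q in list(active):
            w, demand = queues[q]
            share = remaining * w / total_w      # weighted share of leftover
            give = min(share, demand - alloc[q]) # never exceed the demand
            alloc[q] += give
            if alloc[q] >= demand - 1e-9:
                done.append(q)
        remaining = capacity_mbps - sum(alloc.values())
        for q in done:
            active.discard(q)
    return alloc

# queue -> (weight, current demand in Mbps); illustrative values only.
link = {"A": (1, 1000), "B+C shared": (2, 0)}
print(wfq_shares(1000, link))   # only A active: A takes the full link
link = {"A": (1, 1000), "B+C shared": (2, 600)}
print(wfq_shares(1000, link))   # B and C wake up and get their share at once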

Figure 4. The architecture of QShare.

The key challenge of our dynamic binding mechanism is how to assign dedicated queues to the right tenants, since traffic demand is dynamic. QShare addresses the challenge as follows. First, rather than predicting the traffic matrix of each tenant as proposed in (elasticswitch, ; trinity, ), QShare's demand prediction relies on only a scalar metric per tenant (detailed in §5.1), which greatly reduces the stress of prediction. Second, to improve the worst-case performance when demand prediction is inaccurate and high-demand tenants are mistakenly placed in the shared queue, QShare can run ElasticSwitch-style rate allocation for tenants in the shared queue to achieve moderate work-conserving bandwidth guarantees. Finally, we perform testbed experiments (§7.1) to quantify the effects of the binding mechanism: (i) the average utilization deficit caused by binding errors is only a small fraction of the total capacity; (ii) to achieve good performance, it is sufficient to perform dynamic binding at a much coarser time granularity (e.g., a few seconds) than the millisecond-granularity traffic matrix estimation performed in (elasticswitch, ; trinity, ), which significantly reduces the stress of large-scale deployment in practice.

We plot the system architecture in Figure 4. Next we briefly discuss the components of each module, and defer their design details to §4 and §5, respectively.

3.2.1. Balanced Tenant Placement Module

The balanced tenant placement module has two components. In particular, given a tenant embedding request, the routing explorer (§4.1) outputs all Tenant Routing (TR) candidates that can accommodate the tenant. The placement algorithm (§4.2) evaluates each candidate to select the most desired one.

Tenant Routing (TR) Exploration. As explained in §2.1, a tenant's TR is a tree in the physical network topology that connects the servers/hypervisors hosting the tenant's VMs ("servers" and "hypervisors" are used interchangeably). Traffic generated by the tenant's VMs is confined within its TR. Thus, the TR must be provisioned with sufficient VM slots and network bandwidth to fulfill the tenant's requirements. TR exploration is essentially a topology search process that produces a set of virtual networks (i.e., overlay trees) that can accommodate the tenant. Admission rules are applied here to accept new tenants only if the datacenter has sufficient spare capacity.

TR Candidate Election. Each TR candidate is evaluated based on two criteria: bandwidth reservation cost and queue occupation cost. Reducing bandwidth reservation cost allows datacenters to accommodate more tenants, while the key reason for considering the queue occupation cost is to reduce the management stress for the dynamic tenant-queue binding module.

3.2.2. Dynamic Tenant-queue Binding Module

The dynamic binding module executes periodically to distribute queues among tenants. It is built on (i) tenant demand trend prediction (§5.1), (ii) the queue-to-tenant allocation algorithm (§5.3) and (iii) the policy enforcer enforcing allocation decisions inside the network (§5.4).

Traffic Demand Trend Prediction. Based on the usage measurement in the current control interval, QShare predicts the demand trend of each tenant in the next interval, i.e., whether the tenant tends to have higher demand than its bandwidth guarantee. Note that QShare's prediction relies on only a scalar metric rather than the per-VM-pair traffic matrix as proposed in (elasticswitch, ; trinity, ).

Queue-to-tenant Allocation. The allocation algorithm dynamically distributes queues among tenants. In case of queue scarcity, it ranks the competing tenants based on both their demands and their payment factors; considering payment factors mitigates the problem of tenants lying about their demands.

Policy Enforcer. To enforce bandwidth allocation decisions inside the network, the policy enforcer needs to perform a set of tasks, including network-related operations (e.g., switch configuration), tagging tenant packets with proper dscp values, and running ElasticSwitch-like rate allocation at hypervisors for tenants without dedicated queues.

4. Balanced Tenant Placement

The goals of tenant placement are (i) provisioning virtual networks for tenants to satisfy their computation and bandwidth guarantees and (ii) balancing the overall switch queue utilization among tenants. Prior placement algorithms (tag, ; oktopus, ) aim to maximize the number of accepted tenant requests, which is an NP-hard problem similar to (hose, ). However, different from prior algorithms that make greedy embedding decisions (i.e., embed a tenant immediately once a feasible option is found), balanced tenant placement requires global topology investigation, i.e., evaluating all feasible options before making embedding decisions. Towards this end, we design our own tenant placement algorithm. Formulated in Algorithm 1, our placement algorithm contains two major parts: (i) TR candidate exploration and (ii) TR candidate election.

4.1. TR Candidate Exploration

We first explain TR candidate exploration in the widely adopted multi-rooted tree datacenter topology (fattree, ; vl2, ; tag, ; silo, ; jupiter, ). Then, we discuss extending such exploration to support randomly connected topologies (jellyfish, ; small_dc, ).

Input: A tenant request with explicit bandwidth guarantees.
Output: The desired TR or an embedding error.

Main Procedure:
1   l <- 1                                        // start at the hypervisor layer
2   while True do
3       candidates <- {}
4       for T in get_TRs_at_layer(l) do
5           (feasible, cost) <- evaluate_TR(T)
6           if feasible then candidates.add((T, cost))
7       if candidates is empty then
8           l <- l + 1
9           if l exceeds the highest (core) layer then return False
10      else
11          return get_desired_TR(candidates)

Function evaluate_TR(T):
12  OA <- get_optimal_allocation(T)               // optimal VM allocation inside T
13  if OA is feasible then return (True, (bandwidth cost, queue occupation cost))
14  else return (False, none)
Algorithm 1 Balanced Tenant Placement

Given a tenant request, Algorithm 1 explores the topology from the lowest layer (the hypervisor layer) towards the highest layer (the core switch layer). At each layer, function get_TRs_at_layer (line 4) obtains all TR options at this layer. A layer-l TR option is a tree rooted at layer l; its leaves are the servers reachable from the root using only downward paths. The algorithm then evaluates these TR options to produce the feasible ones, called TR candidates (lines 5-6). Generally speaking, a TR option is feasible if it has enough capacity to accommodate the tenant; function evaluate_TR, detailed in §4.2, determines such feasibility. If no TR candidate can be found, the algorithm continues exploration at the next layer (line 8). Otherwise, it stops further exploration and returns the desired TR elected from all candidates (line 11) using the criteria described in §4.2. The early return confines tenants at the lowest possible layer to avoid unnecessary network usage at higher layers. If no TR candidate can be found after exploring the entire topology up to the core layer, the algorithm returns false (line 9), indicating an embedding error due to the lack of resources.

Random Topology. To support random topologies, Algorithm 1 adopts a k-shortest-path algorithm (shortest, ) to obtain a set of paths between each hypervisor pair and then combines them to produce TR options. The parameter k, similar to the layer bound in Algorithm 1, determines the TR exploration space.
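As a rough illustration of this step, the sketch below enumerates up to k loop-free paths between a hypervisor pair with networkx's shortest_simple_paths and merges chosen paths into one candidate overlay; the toy topology and helper names are ours, not the paper's.

from itertools import islice
import networkx as nx

def k_shortest_paths(G, src, dst, k):
    """Return up to k loop-free paths between two hypervisors, shortest first."""
    return list(islice(nx.shortest_simple_paths(G, src, dst), k))

def tr_option_from_paths(paths):
    """Combine the chosen per-pair paths into one overlay (a set of links)."""
    links = set()
    for path in paths:
        links.update(zip(path, path[1:]))
    return links

# Hypothetical 5-node randomly connected topology for illustration.
G = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (1, 3), (0, 4), (4, 2)])
paths = k_shortest_paths(G, 0, 2, k=3)
print(paths)
print(tr_option_from_paths(paths[:1]))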

4.2. TR Evaluation and Candidate Election

A TR option is feasible if (i) the total available VM slots across all its servers are enough to hold the tenant's VMs and (ii) each link of the TR has enough available capacity to satisfy the tenant's bandwidth guarantees. Although evaluating the first rule is straightforward, the second rule requires more investigation. In particular, given a TR option, the amount of bandwidth required on each of its links depends on the VM locations inside the TR. Specifically, consider a homogeneous hose model where all VMs have the same inbound and outbound bandwidth guarantee B. Given a link of the TR, removing that link breaks the TR into two disjoint components. If m VMs are in one component and n VMs are in the other, then the bandwidth required on the link is min(m, n) x B. Figure 5 plots a TR with two possible VM placements; the links near the root must reserve a different amount of bandwidth for the placement in Figure 5(a) than for the placement in Figure 5(b).

To reduce the total network bandwidth required for embedding the tenant, function get_optimal_allocation (line 12) produces the VM location that requires the least bandwidth reservation. For homogeneous hose models, the optimal allocation is produced as follows: (i) find the server in the TR with the most usable VM slots, (ii) allocate as many VMs as possible to that server, (iii) update the remaining network/server capacity after the allocation, and (iv) repeat from step one until either all VMs are allocated (indicating feasibility) or all servers in the TR have been investigated (indicating infeasibility). The usable VM slots of a server in the TR are restricted by both the available VM slots on the server and the available bandwidth on the path from the server to the TR's root. For instance, in Figure 5, if the available bandwidth on the link above a server is limited, the server's usable VM slots can be fewer than its available VM slots.
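The following sketch illustrates the greedy allocation loop under assumed Server data structures of our own (free slots plus per-server spare capacities toward the root); it charges n x B per traversed link, which over-approximates the min(m, N - m) x B reservation rule, and it models each server's path independently rather than sharing link objects, so it is a simplification rather than QShare's implementation.

from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    free_slots: int
    path_spare_mbps: list = field(default_factory=list)  # spare bw per link to the root

def usable_slots(s, b):
    """Slots limited by free slots and by the spare bandwidth toward the root."""
    path_cap = min(s.path_spare_mbps) if s.path_spare_mbps else float("inf")
    return min(s.free_slots, int(path_cap // b))

def get_optimal_allocation(servers, num_vms, b):
    remaining, placement = num_vms, {}
    candidates = list(servers)
    while remaining > 0 and candidates:
        s = max(candidates, key=lambda x: usable_slots(x, b))  # (i) best server
        n = min(usable_slots(s, b), remaining)                 # (ii) pack as many VMs as possible
        if n == 0:
            return None                                        # infeasible: out of capacity
        placement[s.name] = n
        remaining -= n
        # (iii) update capacities; in a real TR the links are shared objects.
        s.path_spare_mbps = [c - n * b for c in s.path_spare_mbps]
        candidates.remove(s)
    return placement if remaining == 0 else None

# Toy TR with three servers; all numbers are illustrative only.
servers = [Server("s1", 8, [900, 2000]), Server("s2", 4, [500, 2000]),
           Server("s3", 6, [200, 2000])]
print(get_optimal_allocation(servers, num_vms=10, b=100))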

If the TR’s optimal allocation is feasible, it becomes a candidate for embedding the tenant. Algorithm 1 then computes its bandwidth cost as the sum of reserved bandwidth for the tenant on each link of the TR, and the queue occupation cost as the largest number of tenants served by any of the TR’s links (line 1).

Candidate election is based on both costs. Each TR candidate is associated with an overall cost that combines its bandwidth cost and its queue occupation cost, and the desired TR is the one with the lowest overall cost. One strategy for computing the overall cost is to give more weight to the queue occupation cost when the datacenter load is light, preferring a more balanced placement, and more weight to the bandwidth cost when the network is heavily loaded, preferring the placement with lower bandwidth cost.

(a) One VM placement and its required reservation
(b) Another VM placement and its required reservation
Figure 5. Given a TR, the bandwidth required on its links depends on the VM locations inside the TR.

Supporting High Availability. Algorithm 1 can be extended to support high availability (high_ava, ). In (high_ava, ), the worst-case survival ratio (WCS) is defined as the smallest fraction of VMs remaining functional during a single point of failure. Consider a server as the fault domain. Given a tenant with N VMs and a WCS requirement of f, one server can host at most (1 - f) x N VMs for this tenant. By adding this constraint to function get_optimal_allocation, Algorithm 1 can produce TRs that satisfy the high-availability requirement.

Search Complexity. The search complexity for embedding a tenant depends on the layer at which Algorithm 1 returns. In a fattree topology (fattree, ), the worst-case complexity (i.e., the algorithm returns at the core switch layer) is polynomial in the number of nodes in the network. Thus, although our algorithm performs a comprehensive topology search, its time complexity is polynomial rather than exponential. For topologies built with higher over-subscription ratios than fattree, the search complexity is smaller since the number of TR options at each layer is smaller. Further, the topology search results can be cached to achieve long-term efficiency (OpReduce, ).

5. Dynamic Tenant-Queue Binding

To support more tenants with a limited number of queues, QShare's design is inspired by the observation that the working set of a process is often much smaller than the total memory it consumes. Similarly, only tenants whose traffic demands exceed their bandwidth guarantees need dedicated queues. Thus, there is an opportunity for QShare to dynamically allocate the limited number of queues to high-demand tenants. In particular, QShare periodically evaluates each tenant and allocates queues among tenants based on their scores. Each tenant's score encapsulates its usage factor (§5.1) and payment factor (§5.2) so as to prioritize high-demand and honest tenants.

5.1. Tenant Demand Trend Prediction

Because prior works (elasticswitch, ; trinity, ) rely on Guarantee Partitioning (GP) to achieve bandwidth guarantees, they need to predict each tenant's traffic matrix, i.e., per VM-pair traffic demand. However, since tenant applications are often opaque to cloud operators, it is challenging in practice to capture real-time communication patterns among VMs and to predict the traffic demand between each VM pair. Realizing that, QShare's tenant-queue binding module relies only on predicting whether a tenant tends to have higher demand than its guaranteed bandwidth. Thus, rather than predicting the traffic matrix, QShare uses a scalar metric, the usage factor (U-factor), to indicate a tenant's network utilization with respect to its guaranteed bandwidth. We do not claim that the U-factor is the optimal metric for demand prediction; however, it greatly reduces the stress of prediction by focusing on tenant-level demand trends rather than VM-level traffic matrices.

Each tenant’s U-factor is computed per control interval. Specifically, in each control interval, all hypervisors measure the bandwidth utilization of their hosted VMs. As VMs can have both inbound and outbound traffic, bi-directional bandwidth usage is considered. For instance, consider a hypervisor hosting VMs of a tenant . Then ’s inbound (outbound) bandwidth usage () measured by is the sum of inbound (outbound) bandwidth usage from all these VMs.

At the end of each control interval, QShare computes each tenant's U-factor. For tenant T, one way of computing its U-factor is as follows:

    U_T = min( 1, max_{h in H_T} [ max(I_h, O_h) / G_h ] ),        (1)

where H_T is the set of hypervisors hosting T's VMs and G_h is T's guaranteed bandwidth on h's network interface. If h hosts n_h VMs of T (which is provisioned with N VMs in total), then G_h = n_h x B for a symmetric and homogeneous hose model with per-VM guarantee B.

The design rationale of Equation (1) is as follows. The innermost max is necessary because a high-demand VM may either send or receive large volumes of traffic. The middle max is designed to handle many-to-one traffic patterns in which many source VMs on remote servers communicate with a few destination VMs on the local server: although the source hypervisors may measure small usage (the source VMs are bottlenecked by the destination VMs), T actually has large traffic demand at the receivers, and taking the largest usage among all hypervisors captures such a communication pattern. Finally, the outermost min caps the U-factor at 1. We leave the exploration of other U-factor definitions to future work.
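A compact sketch of Equation (1), assuming each hypervisor reports the tenant's aggregate inbound/outbound usage together with the tenant's guaranteed bandwidth on that hypervisor's interface; the numbers are illustrative only.

def u_factor(reports):
    """reports: list of (in_mbps, out_mbps, guaranteed_mbps), one per hypervisor."""
    peak = max(max(i, o) / g for i, o, g in reports)   # inner and middle max
    return min(1.0, peak)                              # outer min caps at 1

# A many-to-one pattern: the receiving hypervisor sees most of the traffic,
# while the sending hypervisors look underutilized.
reports = [(900.0, 20.0, 400.0),    # receiver side: 900 Mbps in vs. 400 guaranteed
           (10.0, 300.0, 400.0),
           (5.0, 310.0, 400.0)]
print(u_factor(reports))            # -> 1.0 (capped): the tenant is high-demand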

5.2. Tenant Intent Lying Mitigation

Merely using U-factors to allocate queues is problematic when tenants lie about their real bandwidth requirements: a tenant can deliberately request a smaller guarantee so as to obtain a high U-factor. Note that no work-conserving allocation policy can completely prevent tenants from gaining advantages via lying, i.e., be strategy-proof (faircloud, ). To mitigate the problem caused by lying, QShare considers payment factors, along with U-factors, when scoring tenants. Each tenant's payment factor and its guaranteed bandwidth are positively correlated, so deliberately requesting lower guarantees reduces a tenant's score whereas exaggerating guarantees requires higher payment. As designing a pricing model is not the focus of this paper, QShare assumes, for simplicity, that a tenant's payment factor is proportional to the total guaranteed bandwidth required by its hose model (payment for computation resources is not considered, as QShare focuses on network bandwidth management). Thus, given a tenant T with N VMs, each requesting guaranteed bandwidth B, its payment factor is P_T = alpha x N x B, where alpha is a constant depending on the pricing model.

Denoting the term inside the outermost min of Equation (1) as u_T, tenant T's score is computed as follows:

    S_T = P_T x min(1, u_T) = alpha x N x B x U_T,        (2)

where u_T is determined by the hypervisor that maximizes the middle max operation of Equation (1).

Using S_T as the criterion for queue allocation mitigates the problems caused by lying. On the one hand, since U_T is bounded by 1, deliberately requesting a smaller guarantee B results in a lower cap on S_T, which is disadvantageous when competing with other tenants. On the other hand, deliberately requesting a higher guarantee than the real demand also has problems: (i) the tenant has to pay more, and (ii) its score is determined by its real usage rather than its claimed guarantee whenever its demand is smaller than its guarantee (i.e., u_T < 1). Generally, tenants with higher traffic demands are preferred since S_T is non-decreasing as bandwidth usage increases. Such a property is desired for the queue allocation detailed below.
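A minimal sketch of Equation (2) under the assumptions above (payment factor alpha x N x B, U-factor capped at 1); the numbers are hypothetical and only illustrate why understating the guarantee does not raise a tenant's score.

ALPHA = 1.0          # pricing constant (an assumption)

def score(num_vms, per_vm_guarantee_mbps, u):
    payment_factor = ALPHA * num_vms * per_vm_guarantee_mbps
    return payment_factor * min(1.0, u)       # U-factor is capped at 1

# Honest tenant: 10 VMs, 100 Mbps each, demand at or above its guarantee.
print(score(10, 100, u=1.0))    # -> 1000
# Lying tenant with the same demand but half the claimed guarantee:
# its uncapped usage ratio doubles, but the cap holds it at 1 while the
# payment factor halves, so its score drops.
print(score(10, 50, u=2.0))     # -> 500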

5.3. Dynamic Queue Allocation

We present the queue allocation logic in Algorithm 2. A tenant is assigned a dedicated queue only if it is assigned a dedicated queue on every link of its TR; otherwise, the tenant is served in the shared queue on each link of its TR. To prioritize tenants with higher scores, Algorithm 2 starts queue assignment from the tenant with the highest score, breaking ties randomly (line 1).

If a tenant T already occupies a dedicated queue, it continues to hold the queue (line 3) for the next control interval. This indicates that T either maintains its high score or owns a dedicated queue on each link of its TR due to the lack of queue contention, which is possible thanks to the balanced placement (see the analysis in §7.3.1).

If tenant T is currently placed in the shared queue, Algorithm 2 determines whether allocating it a dedicated queue is possible. To satisfy the condition on line 4, each link of T's TR needs to have at least one spare queue. If so, function enqueue_tenant assigns T a queue on each link of its TR (line 5).

Input: The set of embedded tenants.
Output: Tenant-queue assignment.

1   Sort the tenants in decreasing order of their scores
2   for each tenant T in the sorted order do
3       if T has a dedicated queue then continue
4       else if every link of T's TR has a spare queue then
5           enqueue_tenant(T)
6       else opportunistically_enqueue(T)
7       Update the queue allocation state
8   Compute queue weights

Function enqueue_tenant(T):
9   for each link L of T's TR do
10      assign T a dedicated queue on L

Function opportunistically_enqueue(T):
11  for each link L of T's TR without a spare queue do
12      get_opportunistic_queues_from_LSTs(L)
13  if T's TR has an available queue on every link then
14      enqueue_tenant(T)
Algorithm 2 Queue Allocation Algorithm

Finally, if at least one link of T's TR runs out of queues, function opportunistically_enqueue (line 6) opportunistically finds queues for T by preempting queues from lower-scored tenants (LSTs). Specifically, on a link L without spare queues, the algorithm obtains an opportunistic queue occupied by a tenant T' such that (i) T''s score is less than T's score and (ii) T''s score is the smallest among all tenants owning a queue on L (line 12). If an available queue, either opportunistic or unoccupied, exists on each link of T's TR, we say that T's TR has an opportunistic queue (line 13), and T is then enqueued. The dequeued tenants are served in shared queues during the next control interval.

The queue allocation state is updated after handling each tenant (line 7). Once queue allocation for all tenants is finished, QShare computes the weight of each queue (line 8). For a queue on a link, its normalized weight is the ratio of the bandwidth reserved for the tenants it serves to the total reserved bandwidth on the link. In practice, these weights need to be proportionally translated into the values supported by commodity switches (a bounded integer range on our switches).
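A sketch of this weight computation; the integer weight range below (MAX_WEIGHT) is an assumption, since the supported range is switch-specific.

MAX_WEIGHT = 127   # hypothetical upper bound of supported WRR/WFQ weights

def queue_weights(reserved_mbps_per_queue):
    """Normalize reserved bandwidth per queue and scale to switch weights."""
    total = sum(reserved_mbps_per_queue.values())
    weights = {}
    for q, reserved in reserved_mbps_per_queue.items():
        normalized = reserved / total
        weights[q] = max(1, round(normalized * MAX_WEIGHT))
    return weights

# Example: a dedicated queue with 300 Mbps reserved and a shared queue
# serving tenants with 600 Mbps reserved in total on this link.
print(queue_weights({"A": 300, "shared": 600}))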

Tenant departures trigger a network allocation state update as well. Newly arrived tenants are first served in shared queues and are evaluated at the end of the current interval.

5.4. Policy Enforcer

To enforce queue allocation decisions inside the network, QShare needs to perform (i) packet tagging and (ii) network configuration. Packet tagging ensures that packets from different tenants can be correctly identified and therefore served in the correct queues; we use dscp tagging to achieve this. To avoid ambiguity, D-tenants (tenants with dedicated queues) whose TRs share at least one common link cannot use the same dscp value, whereas D-tenants whose TRs are non-overlapping can reuse the same value. Given that dscp values range from 0 to 63, finding the smallest number of dscp values in a legal assignment reduces to the k-coloring of a graph, which is NP-hard (graph_coloring, ).

To address the concern about dscp usage, we analyze the efficiency of a greedy assignment in large-scale datacenters based on production datacenter settings. The results show that the available dscp values are sufficient to avoid conflicts even when the datacenter is fully reserved (see details in §7.3.1). Further, it is technically possible to mutate dscp values on switch ports via a dscp-to-dscp mutation map (dscp_swap, ). Thus, based on a per-port dscp mapping, a tenant can use different dscp values on different ports, eliminating the static dscp value reservation required on each link of its TR, which in turn eliminates the possibility of dscp conflicts.
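A sketch of the greedy assignment analyzed here, treated as greedy graph coloring over TR overlap; reserving dscp value 0 for the shared queues is our assumption for illustration.

def greedy_dscp_assignment(tenants, tr_links):
    """tenants: iterable ordered by embedding time; tr_links: tenant -> set of links."""
    assignment = {}
    for t in tenants:
        # Values already taken by D-tenants whose TRs overlap with t's TR.
        used = {assignment[o] for o in assignment if tr_links[t] & tr_links[o]}
        assignment[t] = next(v for v in range(1, 64) if v not in used)
    return assignment

# Toy example: A and B overlap on link "core1"; C is disjoint and reuses 1.
links = {"A": {"core1", "agg1"}, "B": {"core1", "agg2"}, "C": {"agg3"}}
print(greedy_dscp_assignment(["A", "B", "C"], links))   # {'A': 1, 'B': 2, 'C': 1}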

Network configuration involves configuring the queues on each link with proper weights and dscp values, which requires WFQ configuration on both ports of the link. For edge links connecting servers and switches, software WFQ is required on the hypervisors. To support automation, QShare designs a network action container that performs configuration in batches: operations on different switches are parallelized via multi-threading so that the marginal configuration latency is negligible.

The final part of the policy enforcer is that QShare can run ElasticSwitch-like rate allocation mechanisms (seawall, ; seawall-like, ) for tenants without dedicated queues to provide them bandwidth guarantees and achieve moderate work conservation in the worst case when all D-tenants have insufficient demands. However, QShare imposes less overhead than ElasticSwitch (elasticswitch, ) since it only performs rate allocation for tenants without dedicated queues.

6. Implementation

In this section, we introduce the prototype implementation of QShare. The prototype contains both user-space and kernel-space programs, as plotted in Figure 6. The user-space programs, executed globally, are responsible for managing the whole datacenter, whereas the kernel-space program, running on each hypervisor, manages the local hypervisor. The two spaces interact with each other: queue allocation decisions are made based on the distributed measurements reported by all hypervisors, and the decisions are then pushed back to the kernel module for enforcement on each hypervisor. The current implementation is written in Python (user space) and C (kernel space).

The user-space programs include tenant placement, queue allocation and the network action container. The kernel-space module, built on NetFilter (netfilter, ), includes the tenant traffic monitor, rate allocation (for tenants in shared queues), software WFQ (for tenants with dedicated queues) and packet dscp tagging. On each hypervisor, a user-space daemon (not plotted) based on Netlink (netlink, ) interacts with the kernel module.

Note that implementing a hypervisor that can support all kinds of VM management is out of this paper’s scope. Our prototype builds a simple hypervisor that can support QShare-related operations, such as identifying the VMs of each tenant.

Figure 6. The software implementation of QShare.

7. Evaluation

(a) Per-tenant runtime bandwidth utilization given predictable demand trends.
(b) Aggregate core link utilization given unpredictable traffic trend.
(c) Avg core link utilization with varying control intervals.
Figure 7. Figure 7(a) plots the runtime bandwidth utilization of all tenants given correct demand trend prediction; QShare achieves perfect work-conserving bandwidth guarantees in this case. Figure 7(b) plots the total runtime utilization given completely unpredictable demands; only a few under-utilized cases are observed during the measurement period, yielding over 91% average utilization. Figure 7(c) shows the average link utilization for different lengths of the control interval.

Our evaluation centers around the following questions:

(i) How do traffic dynamics affect QShare's performance? With correct predictions of demand trends (not the exact traffic matrix), QShare achieves perfect work-conserving bandwidth guarantees: all bandwidth guarantees are satisfied while the bottleneck link is fully utilized (§7.1.1). Even when demand trends are completely unpredictable, QShare drives the bottleneck link to over 91% utilization (§7.1.2) without compromising bandwidth guarantees.

(ii) How well can QShare benefit applications? Given the above desirable properties, QShare significantly benefits applications, for instance, by reducing their flow completion times (FCTs) compared with the state-of-the-art solutions (elasticswitch, ; trinity, ) (§7.2).

(iii) How well can QShare manage large-scale datacenters? Based on observations from production datacenters, we analyze QShare in a large-scale datacenter. We show that QShare can assign dedicated queues to a large majority of the tenants in any control interval even when the datacenter is fully reserved. Thus, QShare produces tangible throughput gains over the guarantees and achieves higher efficiency in link utilization (§7.3).

(iv) How much overhead does QShare impose? QShare imposes small overhead for switch configuration, running rate allocations and embedding tenants (§7.4).

Testbed Experiment Setup. We build a physical testbed of servers, each provisioning a number of VM slots. Each server installs a Gigabit Ethernet NIC and runs the Linux kernel. We evenly distribute the servers into two racks inter-connected by two Pronto-3297 48-port Gigabit (ToR) switches. Thus, the topology is oversubscribed and the core link may be congested when VMs have sufficient demand. Each port supports up to 8 WFQ queues. We embed multiple tenants of random sizes in the testbed.

We develop a client/server program to generate traffic. The clients initiate long-lived TCP connections to randomly selected servers and request flow transmission. All VMs run both the client and server programs. Only intra-tenant communication is allowed.

7.1. Work-Conserving Bandwidth Guarantees

In this section, we consider how traffic dynamics may affect QShare's ability to provide work-conserving bandwidth guarantees. We consider the following two scenarios. In the first, a tenant's demand trend is predictable: once a tenant has high traffic demand, the trend continues for a few seconds. Trend predictability is not over-optimistic since hot spots in production datacenters can last tens of seconds (dc_mesaure, ). In the second, the demand trend is completely unpredictable: a tenant's future demands are independent of its current or previous demands. In both cases, QShare does not impose any constraint on VM communication patterns, i.e., one client can request flow transfers from arbitrary servers at any time.

To quantify the worst-case performance degradation caused by traffic unpredictability, we first disable the ElasticSwitch-like rate allocations for the tenants without dedicated queues, and allocate them at most their guaranteed bandwidth.

7.1.1. Predictable Demand Trend

In this experiment, we consider 10 tenants competing on the core link, each guaranteed an equal share of bandwidth on that link. To generate traffic, we randomly pick 5 tenants (referred to as T1 to T5) as high-demand tenants whose clients request sufficient flow transfers during the measurement period. The remaining tenants (referred to as T6 to T10) have insufficient demands during the measurement period; these low-demand tenants may initiate flow transfers at any time. In this experiment we first fix the length of the control interval; different settings are considered in §7.1.2.

Figure 7(a) plots the runtime core-link bandwidth obtained by each tenant over a 10-second measurement period. During this period, QShare's tenant-queue binding algorithm assigns each of T1 to T7 a dedicated queue on the core link; T8, T9 and T10 are served in a shared queue. When the low-demand tenants are inactive at the early stage, T1 through T5 fairly share the entire core-link capacity. Later on, low-demand tenants T6, T8, T9 and T10 become active. As T8, T9 and T10 are in the shared queue, they all obtain their guaranteed bandwidth, while T1 to T6, each exclusively occupying a queue, equally share the remaining capacity. Later, T7 becomes active and fairly shares the core link with T1 to T5. It is clear that all tenants receive at least their guaranteed bandwidth regardless of their communication patterns and other tenants' traffic demands. Meanwhile, the core link is always fully utilized. Thus, QShare achieves perfect work-conserving bandwidth guarantees.

(a) Overall average FCTs
(b) Small flows (KB)
(c) Medium flows
(d) Large flows (MB)
Figure 8. FCT statistics for varying fabric loads (part of the results for static reservation fall outside the plotted range). Despite its improvement over static reservation, ElasticSwitch (elasticswitch, ) has a large performance degradation compared with QShare, with substantially longer FCTs, even if it adopts aggressive RA at the expense of compromising bandwidth guarantees. In terms of bandwidth utilization, Trinity has roughly the same performance as ElasticSwitch with aggressive RA.

7.1.2. Unpredictable Demand Trend

In this section, we consider the case where tenant demand trends are unpredictable. Note that the predictability of traffic demand is only relevant to QShare's tenant-queue binding module, which affects work conservation. Thus, we mainly focus on evaluating work conservation when handling unpredictable demands.

We use the same set of tenants as in §7.1.1. To generate unpredictable traffic demands, each client requests flow transmissions from randomly selected servers. Flow sizes are sampled from empirical workloads observed in deployed datacenters (conga, ). When the current flow finishes, a client randomly switches between being active (i.e., requesting a new flow transmission) and dormant (i.e., sleeping for a random period of time between 0 and 1 second before requesting a new flow transfer).
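A sketch of the client-side on/off demand generator described above; the flow-size sampler is a heavy-tailed placeholder rather than the empirical CDF from (conga, ), and the transfer function is a dummy.

import random
import threading
import time

def sample_flow_size_bytes():
    # Placeholder sampler: a heavy-tailed stand-in for the empirical workload CDF.
    return int(random.paretovariate(1.2) * 10_000)

flows = 0
def dummy_transfer(nbytes):
    global flows
    flows += 1
    time.sleep(nbytes / 125_000_000)          # emulate a transfer at ~1 Gbps

def client_loop(request_flow, stop):
    """request_flow(nbytes) blocks until the transfer finishes; stop is an Event."""
    while not stop.is_set():
        request_flow(sample_flow_size_bytes())
        if random.random() < 0.5:                  # switch to dormant
            time.sleep(random.uniform(0.0, 1.0))   # sleep up to one second

stop = threading.Event()
threading.Timer(3.0, stop.set).start()             # run the generator for ~3 seconds
client_loop(dummy_transfer, stop)
print(f"generated {flows} flows")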

Figure 7(b) illustrates the runtime core-link utilization over a one-minute measurement period. We measure the aggregate link utilization of all tenants at a granularity of 0.1 second. As illustrated in Figure 7(b), in spite of unpredictable demands, under-utilized cases are rare, yielding over 91% average link utilization (plotted in Figure 7(c)). This is because QShare does not rely on good TM estimation to achieve work conservation. Instead, for any D-tenant (a tenant with a dedicated queue), its VMs can burst traffic with arbitrary communication patterns, allowing them to effectively grab any spare bandwidth. As long as one VM pair from the D-tenants is high-demand, it can drive the core link to full utilization. Mathematically, the probability that all VM pairs from D-tenants have insufficient demands is low. In particular, assuming each VM pair independently decides to be either active or dormant with equal probability during a small time interval (in a small interval, e.g., sub-millisecond, even a small flow transmission may be considered sufficient demand), the probability that the core link observes insufficient demand in that interval is (1/2)^n, where n is the number of VM pairs from all D-tenants; with n = 20 pairs this is already below one in a million. Thus, demand unpredictability has a minor effect on work conservation.

We further plot the average core-link utilization for different lengths of the control interval in Figure 7(c). For predictable demand trends, QShare achieves perfect work conservation as long as the length of the control interval is comparable to how long the trend lasts. For unpredictable trends, the utilization drops slightly as the length of the control interval increases.

The takeaway of this evaluation is that to achieve good work conservation, (i) QShare does not require perfect demand prediction and (ii) it is sufficient to perform tenant-queue allocation at a coarse time granularity (e.g., seconds). Thus, QShare's dynamic tenant-queue binding module does not need to react quickly enough to capture traffic bursts, which significantly reduces the stress of large-scale datacenter deployment in practice.

Fairness. We now consider the benefits of enabling ElasticSwitch-like rate allocation for tenants without dedicated queues. First, it improves fairness in sharing spare bandwidth, since both tenants in the shared queue and tenants with dedicated queues are able to utilize such bandwidth. Further, it slightly improves the overall link utilization by leveling up the under-utilized cases shown in Figure 7(b).

7.2. Tenant Application Benefits

Given the desirable properties shown in §7.1, QShare can benefit tenant applications by significantly reducing their flow completion times (FCTs). In this section, we demonstrate QShare's edge over ElasticSwitch (elasticswitch, ), Trinity (trinity, ) and static reservation in improving FCTs. Among all embedded tenants, we consider one tenant whose VMs are evenly distributed across the two racks and which has a bandwidth guarantee on the core link. We consider the shuffle phase of MapReduce jobs, where a client requests flow transfers from all servers (recall that a VM runs both the client and server programs). The flow sizes, illustrated in Figure 9, are sampled from empirically observed traffic patterns in two deployed datacenters (vl2, ) and (conga, ). Each client requests a new flow once the previous one finishes, so the tenant's demand stays high.

(a) Enterprise workload
(b) Data-mining workload
Figure 9. Empirical traffic distributions used for measuring FCTs. The Bytes CDF shows the distribution of traffic bytes across different flow sizes.

In the experiment, we create different datacenter fabric loads by varying the bandwidth guarantees required by the background tenants (i.e., the tenants competing with the evaluated tenant on the core link). The load is computed as the ratio of the total guaranteed bandwidth of the background tenants to the core-link capacity. The results for the enterprise datacenter workload (conga, ) are illustrated in Figure 8 (results for the data-mining workload (vl2, ) are similar and omitted for brevity). Because of its efficient resource utilization, QShare greatly reduces FCTs compared with both ElasticSwitch (elasticswitch, ) and static bandwidth reservation, and the improvement is even more significant at smaller fabric loads. In spite of its improvement over static reservation, ElasticSwitch (elasticswitch, ) has a non-trivial performance degradation relative to QShare, with substantially longer FCTs, even when it adopts very aggressive RA to probe available bandwidth (sacrificing bandwidth guarantees (elasticswitch, )). We are aware that ElasticSwitch's performance depends on parameter settings and system tuning; our self-implemented ElasticSwitch prototype uses the default parameter settings from its paper. We do not further plot the results of Trinity (trinity, ) since ElasticSwitch with aggressive RA has roughly the same performance as Trinity in terms of bandwidth utilization (§10), whereas Trinity additionally has reordering and starvation issues.

We note that QShare is orthogonal to approaches (dctcp, ; d2tcp, ; pdq, ; pfabric, ; pias, ) that reduce FCTs by creating more efficient transport protocols. Rather, QShare focuses on allocating network resources among tenants and is agnostic to both transport protocols and tenant applications.

7.3. QShare in Large Scale

In this section, we evaluate QShare at large scale. In particular, we shed light on the extent of switch queue scarcity (relative to the number of tenants) in large-scale datacenters. Further, we show QShare's benefits in providing tenants more bandwidth than their guarantees and in improving link utilization efficiency. We consider a three-layer multi-rooted tree topology with 1024 servers and 100 VMs per server, for a total of over 100 thousand VMs. The network interface of each server is Gbps and the switch port capacity is Gbps. The network topology is constructed based on the fattree (fattree, ) topology; by disabling certain links and switches, we can create topologies with different over-subscription ratios.

7.3.1. The Extent of Switch Queue Scarcity

To be consistent with production datacenters (seawall, ; oktopus, ), the number of VMs requested by each tenant follows an exponential distribution. The bandwidth guarantee of each VM is randomly sampled from five values (in Mbps) to represent the varied bandwidth requirements of tenants. In the experiment, we keep embedding tenants until either network resources or computation resources are fully reserved, i.e., the datacenter operates at full load. To stress-test queue scarcity, we bias the cost weighting in Algorithm 1 accordingly. We test three different over-subscription ratios.

The tenant placement results are tabulated in Table 1. Overall, the extent of queue scarcity is moderate, in contrast to the common assumption (faircloud, ). For instance, only a small number of switch ports are overloaded in the 1:1 over-subscribed topology, and among the over-utilized ports, the largest number of tenants handled by a single port is only slightly higher than the total number of queues. From the tenants’ perspective, two thirds of them are assigned dedicated queues throughout their lifetime because they face no queue contention, i.e., on every link of their TRs, the number of competing tenants is smaller than the number of available queues. An even larger fraction of all tenants can have dedicated queues, either permanently or opportunistically, in any control interval, indicating that only a small fraction of tenants need to run rate allocation at hypervisors. After tenant placement, we assign tenants dscp values to analyze the dscp usage concern mentioned in §5.4. One dscp value is reserved for tenants in shared queues; for each tenant with dedicated queues, we greedily assign the next non-conflicting dscp value. It turns out that the available dscp values are sufficient even for the fully reserved datacenter.
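The greedy dscp assignment described above can be sketched as follows; the conflict relation used here (two tenants conflict if their TRs overlap) is our simplifying assumption, not necessarily QShare’s exact rule from §5.4.

```python
def assign_dscp(tenants_with_dedicated_queues, conflicts, num_dscp=64,
                shared_queue_dscp=0):
    """Greedily give each tenant the smallest dscp value not used by any
    conflicting tenant. The value `shared_queue_dscp` is reserved for
    tenants mapped to shared queues.

    `conflicts[t]` is the set of tenants whose TRs overlap with t's TR
    (an assumed conflict relation for illustration)."""
    assignment = {}
    for t in tenants_with_dedicated_queues:
        taken = {assignment[o] for o in conflicts.get(t, ()) if o in assignment}
        taken.add(shared_queue_dscp)
        for dscp in range(num_dscp):
            if dscp not in taken:
                assignment[t] = dscp
                break
        else:
            raise RuntimeError("ran out of dscp values for tenant %r" % (t,))
    return assignment
```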

We further emulate the processes of tenant arrival and departure. Tenant arrival is modeled as a Poisson process and the lifetime of each tenant is a constant, similar to (tag, ). By varying the arrival rate, we tune the datacenter load. As the datacenter load drops, queue scarcity is mitigated as well; below a certain load, all tenants are permanently assigned dedicated queues. This demonstrates that our tenant placement module effectively spreads tenants across the available switch queues to relieve queue contention.
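A minimal sketch of this arrival/departure emulation, assuming the placement and cleanup hooks are provided by the rest of the system:

```python
import heapq
import random

def emulate(arrival_rate, lifetime, horizon, place_tenant, remove_tenant):
    """Poisson arrivals with rate `arrival_rate` and a constant `lifetime`
    per tenant. `place_tenant`/`remove_tenant` stand in for QShare's
    placement module and its cleanup path."""
    t, departures, tenant_id = 0.0, [], 0
    while t < horizon:
        t += random.expovariate(arrival_rate)          # next arrival time
        while departures and departures[0][0] <= t:    # handle departures first
            _, tid = heapq.heappop(departures)
            remove_tenant(tid)
        tenant_id += 1
        place_tenant(tenant_id)
        heapq.heappush(departures, (t + lifetime, tenant_id))
```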

The takeaway for this evaluation is that in reality, the problem of queue scarcity is moderate. By performing dynamic tenant-queue binding, QShare can effectively handle such scarcity in large scale datacenters.

7.3.2. QShare’s Performance in Large Scale

Table 1. Tenant placement results in a large scale datacenter. “O. R.” denotes the over-subscription ratio. The port-side columns report the percentage of ports serving fewer than a given number of tenants; the tenant-side columns report the percentage of tenants permanently assigned a dedicated queue and the percentage of tenants assigned a dedicated queue, either permanently or opportunistically, in any control interval.

In this section, we evaluate QShare’s performance in large-scale datacenters, showing that QShare produces significant throughput gains for tenants over their guaranteed bandwidth and achieves efficient link utilization. We develop a simulator incorporating QShare’s tenant placement module and dynamic queue allocation algorithm. Because accurately simulating detailed packet-level communications involving billions of VM pairs does not scale, our simulator does not further study ElasticSwitch (elasticswitch, ) or Trinity (trinity, ), since both require GP, which depends on accurately modeling packet-level communications. Instead, our simulator models tenant-level throughput, assuming tenant applications can use the available bandwidth with arbitrary communication patterns. The experiment is performed on the over-subscribed topology with the highest level of queue scarcity among the settings above, and we still consider the tough scenario where the datacenter operates at full load, i.e., resources are fully reserved. We define the inactive ratio as the percentage of low-demand tenants.
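As a rough illustration of the tenant-level model (not the simulator’s exact code), per-link bandwidth can be shared among active tenants with a standard weighted water-filling loop, using guarantees as weights:

```python
def weighted_fair_share(capacity, demands, weights):
    """Allocate `capacity` among tenants with the given `demands`
    (dicts keyed by tenant), giving unsatisfied tenants bandwidth in
    proportion to `weights` (their guarantees). A standard water-filling
    loop used here only as a modeling sketch."""
    alloc = {t: 0.0 for t in demands}
    active = {t for t, d in demands.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[t] for t in active)
        satisfied = set()
        for t in active:
            share = remaining * weights[t] / total_w
            if demands[t] - alloc[t] <= share:
                alloc[t] = demands[t]          # demand met; free its leftover share
                satisfied.add(t)
        if not satisfied:
            for t in active:                   # everyone is bottlenecked here
                alloc[t] += remaining * weights[t] / total_w
            break
        remaining = capacity - sum(alloc.values())
        active -= satisfied
    return alloc
```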

Throughput Gain. The throughput gain of a tenant is defined as the ratio of its actual achieved throughput to its guaranteed bandwidth. For simplicity, we assume the throughput gain of tenants in shared queues is 1 (no gain). For a tenant with dedicated queues, its bandwidth gain on different links of its TR may vary since the actual traffic demand on each link varies. We therefore quantify a tenant’s throughput gain as the smallest bandwidth gain it obtains on any link of its TR. Thus, our experiment reports the worst-case throughput gain, i.e., the gain when the link with the smallest bandwidth gain is the bottleneck.
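In simulator terms, this worst-case gain reduces to a minimum over the tenant’s TR links; a minimal sketch:

```python
def throughput_gain(achieved_on_link, guaranteed_on_link):
    """Worst-case throughput gain of a tenant: the smallest ratio of
    achieved throughput to guaranteed bandwidth across the links of its
    TR. Both arguments map link -> Mbps. Tenants in shared queues are
    conservatively assigned a gain of 1 elsewhere."""
    return min(achieved_on_link[l] / guaranteed_on_link[l]
               for l in guaranteed_on_link)
```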

  • SecondNet (secondnet, ), Oktopus (oktopus, ), TIVC (only, ): BG = Yes, WC = No, Multi-tenant isolation & placement = Yes, Others = None.
  • CloudMirror (tag, ): BG = Yes, WC = No, Multi-tenant isolation & placement = Yes, Others = Application driven.
  • ElasticSwitch (elasticswitch, ), Trinity (trinity, ): BG = Tradeoff (elasticswitch, ), WC = Tradeoff (elasticswitch, ), Multi-tenant isolation & placement = No, Others = TM estimation; starvation & reordering (trinity, ).
  • EyeQ (eyeq, ), GateKeeper (gatekeeper, ): BG = Yes, WC = Yes, Multi-tenant isolation & placement = Yes, Others = Non-congested network core.
  • Silo (silo, ): BG = Yes, WC = No, Multi-tenant isolation & placement = Yes, Others = None.
  • QJump (qjump, ): BG = Yes, WC = Yes, Multi-tenant isolation & placement = No, Others = None.
  • QShare: BG = Yes, WC = Yes, Multi-tenant isolation & placement = Yes, Others = None.
Table 2. Property comparison with closely related works. “BG” and “WC” mean bandwidth guarantee and work conservation.

Figure 10(a) illustrates the average throughput gain under varying inactive ratios. Overall, QShare produces significant throughput gains over the bandwidth guarantees for all inactive ratios, and the gain increases dramatically as the inactive ratio grows, demonstrating that QShare can effectively utilize spare bandwidth.

Figure 10. QShare’s performance in large scale datacenters: (a) the tenants’ average throughput gain over the static reservation for various inactive ratios; (b) the CDFs of normalized link utilization for QShare and static reservation.

Utilization Efficiency. A natural benefit of work conservation is that QShare improves link utilization efficiency, i.e., more links operate at high utilization, which ultimately reduces the cost of over-provisioning network bandwidth. Specifically, consider a tenant whose throughput gain allows it to receive extra bandwidth beyond its guarantee. This extra bandwidth is distributed among the links of the tenant’s TR, driving these links to higher utilization. Without loss of generality, we consider a communication pattern that spreads the throughput gain across the tenant’s links in proportion to its guaranteed bandwidth on these links. Since the throughput gain is the minimal bandwidth gain over all links of the tenant’s TR, this distribution does not drive any link to over-utilization.
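This proportional spreading can be written directly; a small sketch:

```python
def spread_gain(extra_mbps, guarantees_on_links):
    """Distribute a tenant's extra bandwidth across the links of its TR
    in proportion to its guaranteed bandwidth on each link. Because the
    throughput gain is the minimum gain over these links, no link is
    pushed past full utilization."""
    total = sum(guarantees_on_links.values())
    return {link: extra_mbps * g / total
            for link, g in guarantees_on_links.items()}
```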

Figure 10(b) plots the CDFs of link utilization (normalized to the link capacity) in the datacenter for a given inactive ratio. The results show that QShare achieves better link utilization efficiency than static reservation: the median link utilization under QShare is substantially higher, and a larger fraction of links are fully utilized. These fully utilized bottleneck links show that QShare drives the network to the maximum possible utilization, i.e., it achieves work conservation.

7.4. System Properties

In this section, we report the following system properties to demonstrate QShare’s scalability.

Switch Configuration. The network action container (§5.4) executes switch configuration commands in a batch. The latency for configuring queues on all 48 ports of our legacy switch is on the order of milliseconds, and configurations on different switches are parallelized via multi-threading. For OpenFlow switches, configuration can be finished almost in real time via SDN controllers such as OpenDayLight (opendaylight, ). Thus, even in a large-scale datacenter with thousands of switches, the overall configuration latency is negligible compared with the length of the control intervals (§7.1.2).
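As a hedged sketch of how such batching and parallelization might look (the `push_queue_config` helper is hypothetical; the real system drives the switch CLI or an SDN controller):

```python
from concurrent.futures import ThreadPoolExecutor

def push_queue_config(switch, commands):
    """Hypothetical helper: send one batch of queue-weight commands to a
    switch (e.g., through its CLI session or a controller API). Stubbed
    out here; a real deployment would talk to the device."""
    print(f"configuring {switch} with {len(commands)} commands")

def reconfigure(per_switch_commands, max_workers=32):
    """Apply each switch's batched commands in parallel so the end-to-end
    configuration latency stays close to that of a single switch rather
    than growing with the number of switches."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(push_queue_config, sw, cmds)
                   for sw, cmds in per_switch_commands.items()]
        for f in futures:
            f.result()   # surface any configuration error
```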

CPU Overhead. The major CPU overhead comes from the kernel module on hypervisors (§6) and grows with traffic volume. Even at full NIC speed, the measured CPU overhead on our servers (shipped with a quad-core Intel 2.8 GHz CPU) remains small.

Tenant Placement. In the large-scale network topology of §7.3, the average time to find the most desirable TR for a tenant request is on the order of milliseconds, with a similarly small worst case.

8. Related Work

Table 2 summarizes the properties of closely related work. SecondNet (secondnet, ), Oktopus (oktopus, ), and TIVC (only, ) provide static, non-work-conserving bandwidth guarantees. EyeQ (eyeq, ) and GateKeeper (gatekeeper, ) achieve work-conserving bandwidth guarantees only if the network core is congestion-free, which may not hold in many datacenters (high_ava, ; imc2010, ; dc_mesaure, ). ElasticSwitch (elasticswitch, ) relies on challenging traffic matrix estimation and faces a tradeoff between providing accurate bandwidth guarantees and being sufficiently work-conserving. Trinity (trinity, ) improves ElasticSwitch’s work conservation in static settings via in-network priority queuing, but it suffers from starvation and packet reordering issues. Although Silo (silo, ) and QJump (qjump, ) can provide both bandwidth and in-network latency guarantees, Silo is not work-conserving and QJump lacks tenant placement and isolation.

Using switch queues has been proposed before. For instance, vShaper (vshaper, ) virtualizes physical queues to mimic the traffic shaping behavior of more queues, but without considering bandwidth guarantees. pFabric (pfabric, ), QJump (qjump, ) and PIAS (pias, ) instead use priority queues to achieve low latency, although pFabric requires new hardware support, such as P4 (p4, ).

FairCloud (faircloud, ) proposes several models (or design principles) for sharing network resources in datacenters. QShare’s design follows the PS-P model, which, in theory, supports both work conservation and bandwidth guarantees simultaneously.

The bandwidth guarantees defined in the hose model can be enforced either at the per-tenant level (e.g., (end-to-end, ; oktopus, )) or at the level of VM pairs, as proposed in (elasticswitch, ; tag, ; trinity, ). QShare generally enforces per-tenant guarantees; however, for tenants without dedicated queues, their bandwidth guarantees are enforced through VM-pair guarantees.

9. Conclusion

This paper presented QShare, the first comprehensive in-network solution enabling work-conserving bandwidth guarantees in multi-tenant datacenters. At its core, QShare’s tenant placement module provides accurate bandwidth guarantees, and its tenant-queue binding module dynamically assigns high-demand tenants dedicated switch queues to achieve work conservation. We implemented a prototype of QShare and performed extensive evaluations on a physical testbed and via simulations to validate QShare’s design goals. The results show that QShare improves on state-of-the-art solutions in two aspects: (i) it does not rely on challenging traffic matrix prediction to achieve good performance, and (ii) it eliminates the tradeoff between providing accurate bandwidth guarantees and being work-conserving, without introducing starvation or packet reordering issues. Finally, QShare imposes small system overhead.

References

  • (1) Configuring the DSCP-to-DSCP-Mutation Map for Cisco Switches. http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst2960/software/release/12-2_55_se/configuration/guide/scg_2960/swqos.html#wp1028614.
  • (2) Netlink: Communication between Kernel and User Space. http://man7.org/linux/man-pages/man7/netlink.7.html.
  • (3) The Netfilter Project. http://www.netfilter.org.
  • (4) The OpenDayLight Project. https://www.opendaylight.org.
  • (5) Al-Fares, M., Loukissas, A., and Vahdat, A. A Scalable, Commodity Data Center Network Architecture. ACM SIGCOMM (2008).
  • (6) Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R., Chu, K., Fingerhut, A., Matus, F., Pan, R., Yadav, N., Varghese, G., et al. CONGA: Distributed Congestion-aware Load Balancing for Datacenters. In ACM SIGCOMM (2014).
  • (7) Alizadeh, M., Greenberg, A., Maltz, D. A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. Data Center TCP (DCTCP). ACM SIGCOMM CCR (2011).
  • (8) Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., and Shenker, S. pFabric: Minimal Near-optimal Datacenter Transport. ACM SIGCOMM CCR (2013).
  • (9) Angel, S., Ballani, H., Karagiannis, T., O’Shea, G., and Thereska, E. End-to-end Performance Isolation through Virtual Datacenters. In USENIX OSDI (2014).
  • (10) Bai, W., Chen, L., Chen, K., Han, D., Tian, C., and Wang, H. Information-Agnostic Flow Scheduling for Commodity Data Centers. In USENIX NSDI (2015).
  • (11) Ballani, H., Costa, P., Karagiannis, T., and Rowstron, A. Towards Predictable Datacenter Networks. In ACM SIGCOMM (2011).
  • (12) Benson, T., Akella, A., and Maltz, D. Network Traffic Characteristics of Data Centers in the Wild. In IMC (2010).
  • (13) Bodík, P., Menache, I., Chowdhury, M., Mani, P., Maltz, D. A., and Stoica, I. Surviving Failures in Bandwidth-constrained Datacenters. In ACM SIGCOMM (2012).
  • (14) Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rexford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G., et al. P4: Programming Protocol-independent Packet Processors. ACM SIGCOMM CCR (2014).
  • (15) Crowcroft, J., and Oechslin, P. Differentiated End-to-end Internet Services Using a Weighted Proportional Fair Sharing TCP. ACM SIGCOMM CCR (1998).
  • (16) Dally, W. J., and Towles, B. P. Principles and practices of interconnection networks. Elsevier, 2004.
  • (17) Duffield, N. G., Goyal, P., Greenberg, A., Mishra, P., Ramakrishnan, K. K., and van der Merive, J. E. A Flexible Model for Resource Management in Virtual Private Networks. In ACM SIGCOMM (1999).
  • (18) Eppstein, D. Finding the k Shortest Paths. In 35th Annual Symposium on Foundations of Computer Science (1994), IEEE.
  • (19) Garey, M. R., Johnson, D. S., and Stockmeyer, L. Some Simplified NP-complete Problems. In Proceedings of the Sixth Annual ACM Symposium on Theory of Computing (1974), ACM.
  • (20) Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: a Scalable and Flexible Data Center Network. In ACM SIGCOMM (2009).
  • (21) Grosvenor, M. P., Schwarzkopf, M., Gog, I., Watson, R. N., Moore, A. W., Hand, S., and Crowcroft, J. Queues Don’t Matter When You Can JUMP Them! In USENIX NSDI (2015).
  • (22) Guo, C., Lu, G., Wang, H. J., Yang, S., Kong, C., Sun, P., Wu, W., and Zhang, Y. Secondnet: A Data Center Network Virtualization Architecture with Bandwidth Guarantees. In ACM CoNEXT (2010).
  • (23) Hong, C.-Y., Caesar, M., and Godfrey, P. Finishing Flows Quickly With Preemptive Scheduling. ACM SIGCOMM CCR (2012).
  • (24) Hu, S., Bai, W., Chen, K., Tian, C., Zhang, Y., and Wu, H. Providing Bandwidth Guarantees, Work Conservation and Low Latency Simultaneously in the Cloud. In IEEE INFOCOM (2016).
  • (25) Jang, K., Sherry, J., Ballani, H., and Moncaster, T. Silo: Predictable Message Latency in the Cloud. In ACM SIGCOMM (2015).
  • (26) Jeyakumar, V., Alizadeh, M., Mazières, D., Prabhakar, B., Kim, C., and Greenberg, A. EyeQ: Practical Network Performance Isolation at the Edge. In USENIX NSDI (2013).
  • (27) Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. The Nature of Data Center Traffic: Measurements & Analysis. In ACM IMC (2009).
  • (28) Kumar, G., Kandula, S., Bodik, P., and Menache, I. Virtualizing Traffic Shapers for Practical Resource Allocation. In USENIX HotCloud (2013).
  • (29) Lam, T., Radhakrishnan, S., Vahdat, A., and Varghese, G. NetShare: Virtualizing Data Center Networks Across Services. Technical Report, Department of Computer Science and Engineering, University of California, San Diego, 2010.
  • (30) Lee, J., Turner, Y., Lee, M., Popa, L., Banerjee, S., Kang, J.-M., and Sharma, P. Application-driven Bandwidth Guarantees in Datacenters. In ACM SIGCOMM (2014).
  • (31) Liu, Z., Chen, K., Wu, H., Hu, S., Hu, Y.-C., Wang, Y., and Zhang, G. Enabling Work-conserving Bandwidth Guarantees for Multi-tenant Datacenters via Dynamic Tenant-Queue Binding. In IEEE INFOCOM (2018).
  • (32) Mudigonda, J., Yalagandula, P., Mogul, J., Stiekes, B., and Pouffary, Y. NetLord: a Scalable Multi-tenant Network Architecture for Virtualized Datacenters. In ACM SIGCOMM CCR (2011).
  • (33) Popa, L., Kumar, G., Chowdhury, M., Krishnamurthy, A., Ratnasamy, S., and Stoica, I. FairCloud: Sharing the Network in Cloud Computing. In ACM SIGCOMM (2012).
  • (34) Popa, L., Yalagandula, P., Banerjee, S., Mogul, J. C., Turner, Y., and Santos, J. R. ElasticSwitch: Practical Work-conserving Bandwidth Guarantees for Cloud Computing. In ACM SIGCOMM (2013).
  • (35) Rodrigues, H., Santos, J. R., Turner, Y., Soares, P., and Guedes, D. Gatekeeper: Supporting Bandwidth Guarantees for Multi-tenant Datacenter Networks. In WIOV (2011).
  • (36) Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. Inside the Social Network’s (Datacenter) Network. In ACM SIGCOMM (2015).
  • (37) Shieh, A., Kandula, S., Greenberg, A. G., Kim, C., and Saha, B. Sharing the Data Center Network. In USENIX NSDI (2011).
  • (38) Shin, J.-Y., Wong, B., and Sirer, E. G. Small-world Datacenters. In ACM SOCC (2011).
  • (39) Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A., Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P., et al. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network. In ACM SIGCOMM (2015).
  • (40) Singla, A., Hong, C.-Y., Popa, L., and Godfrey, P. B. Jellyfish: Networking Data Centers Randomly. In USENIX NSDI (2012).
  • (41) Vamanan, B., Hasan, J., and Vijaykumar, T. Deadline-aware Datacenter TCP (D2TCP). ACM SIGCOMM CCR (2012).
  • (42) Xie, D., Ding, N., Hu, Y. C., and Kompella, R. The Only Constant is Change: Incorporating Time-varying Network Reservations in Data Centers. ACM SIGCOMM CCR (2012).
  • (43) Xu, Q., Lumezanu, C., Liu, Z., Arora, N., Sharma, A., Zhang, H., and Jiang, G. Optimization Framework for Multi-tenant Data Centers, 2017. US Patent 9813301 B2.

10. Appendix

In this section, we revisit the experiments in §3.1 and share our experience of experimenting with ElasticSwitch (elasticswitch, ) and Trinity (trinity, ), which motivated us to propose QShare. We implemented prototypes of both ElasticSwitch and Trinity based on the designs in their papers.

We first quantify ElasticSwitch’s tradeoff between providing accurate bandwidth guarantees and being sufficiently work-conserving. Since the traffic matrix is unknown a priori, ElasticSwitch must allocate bandwidth to each VM pair based on probing of network conditions. As a result, conservative allocation may waste bandwidth, especially when the total guaranteed (reserved) bandwidth is smaller than the link capacity, whereas aggressive allocation may affect other tenants’ guarantees, especially when a large number of VM pairs compete for one congested link.

Conservative Allocation. For conservative allocation, ElasticSwitch (elasticswitch, ) uses the following three mechanisms: (i) Headroom: leaving a gap between the link capacity and the maximum offered guarantees on any link; (ii) Hold-Increase (HI): delaying the rate increase after each congestion event; and (iii) Rate-Caution (RC): increasing rates less aggressively once the current rates are above the guarantees. Please refer to (elasticswitch, ) for the details of each mechanism. We use the following experiment to show the bandwidth waste caused by conservative allocation. Consider the case where tenants A and B adopt the same symmetric hose model, in which each VM is guaranteed the same amount of bandwidth; both tenants thus have equal guarantees on the core link, and the link is half reserved. To generate traffic, each client requests flow transfers from randomly selected servers, with flow sizes sampled from the empirical datacenter workloads (conga, ). When the current flow finishes, a client randomly switches between being active (i.e., requesting a new flow transfer) and dormant (i.e., sleeping for a random period of time between zero and one second before requesting a new flow transfer). We measure the total amount of core-link bandwidth utilized by each tenant. As illustrated in Figure 11(a), we observe a significant gap between the aggregate bandwidth of A and B and the link capacity: over 60% of the unreserved bandwidth is wasted.

Aggressive Allocation. To achieve aggressive allocation, we disable all three mechanisms used in conservative allocation, so that the RA module immediately increases rates upon each positive feedback (lack of congestion). We consider a case where tenant A and tenant B each have a given bandwidth guarantee on the core link. As shown in Figure 11(b), ElasticSwitch fails to guarantee A’s bandwidth, although aggressive allocation drives the link to higher utilization. In fact, ElasticSwitch (elasticswitch, ) itself demonstrates that bandwidth guarantees are compromised once RC and HI are disabled.
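To make the tradeoff concrete, the following simplified sketch captures the spirit of a VM-pair rate-allocation loop with Headroom, Hold-Increase and Rate-Caution as toggles; the constants and decay rules are placeholders, not ElasticSwitch’s exact algorithm. Disabling the toggles yields the aggressive variant used here.

```python
import time

def rate_allocation_loop(guarantee_mbps, link_capacity_mbps, congested,
                         set_rate_limit, headroom=0.1, hold_increase=True,
                         rate_caution=True, interval_s=0.05):
    """Periodically adjust one VM-pair rate limit around its guarantee.
    `congested()` reports congestion feedback (e.g., ECN marks or losses)
    from the last interval; `set_rate_limit` installs the new limit.
    This is a sketch of the mechanisms, not the published algorithm."""
    rate = guarantee_mbps
    hold_until = 0.0
    cap = link_capacity_mbps * (1.0 - headroom)     # Headroom
    while True:
        now = time.time()
        if congested():
            rate = max(guarantee_mbps, rate / 2.0)  # back off toward the guarantee
            if hold_increase:
                hold_until = now + 4 * interval_s   # Hold-Increase: delay increases
        elif now >= hold_until:
            step = 0.1 * guarantee_mbps
            if rate_caution and rate > guarantee_mbps:
                step *= guarantee_mbps / rate       # Rate-Caution: gentler above guarantee
            rate = min(cap, rate + step)
        set_rate_limit(rate)
        time.sleep(interval_s)
```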

We are aware that ElasticSwitch’s performance depends on parameter choices, so in practice network operators can boost its performance via system tuning. Our implementation uses the parameters specified in its paper.

Analysis. The causes of the above performance degradation are twofold: (i) the challenge of guarantee partitioning (GP) without prior knowledge of communication patterns and VM-pair demands, and (ii) the challenge of relying on congestion feedback to learn the real-time spare bandwidth. Trinity (trinity, ) effectively resolves the second cause since it does not need to learn the spare network bandwidth: it serves bandwidth-guarantee traffic in a prioritized queue so that senders can aggressively send work-conservation traffic without worrying that such aggressiveness affects other tenants’ guarantees. Thus, in terms of achieving high link utilization, Trinity and ElasticSwitch with aggressive RA have roughly the same performance. Both our experiments and its paper show that Trinity achieves good work conservation in a static context (where the traffic matrix is known and stable). However, we also observed practical issues such as starvation and packet reordering in our experiments with Trinity.

Figure 11. Quantifying the tradeoff between providing accurate bandwidth guarantees and being sufficiently work-conserving for ElasticSwitch (elasticswitch, ): (a) conservative probing; (b) aggressive probing.

Figure 12. GP has to re-learn each VM pair’s guarantee every time the TM changes, even when both per-VM and tenant-level demands remain the same.

However, achieving good GP without prior knowledge remains an open problem. Since GP transforms each tenant’s per-VM bandwidth guarantee defined in the hose model into a traffic matrix (TM), it essentially complicates the model. Consider the illustrative example in Figure 12, where each TM entry represents one VM’s sending demand to another VM. When the actual TM changes from the left pattern to the right one, GP needs to gradually learn the update via probing. However, the demand of each VM remains the same despite the TM change. Further, given the VM placement in Figure 12 (three VMs placed on three different hypervisors), the amount of bandwidth required by the tenant on each link also remains the same. Therefore, GP must perform extra TM estimation even when per-VM and tenant-level demands remain unchanged, which increases the burden of traffic demand prediction in practice.
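The point of Figure 12 can be checked numerically: with three VMs on three hypervisors, two different TMs that preserve each VM’s aggregate demand induce identical per-link loads, yet GP must still re-learn every VM-pair guarantee. A small worked sketch with illustrative demand values:

```python
def per_vm_load(tm):
    """tm[(src, dst)] = demand in Mbps. With each VM on its own
    hypervisor, the load on a VM's access link is determined by its
    total sending and receiving demand."""
    vms = {v for pair in tm for v in pair}
    send = {v: sum(d for (s, _), d in tm.items() if s == v) for v in vms}
    recv = {v: sum(d for (_, r), d in tm.items() if r == v) for v in vms}
    return send, recv

# Two illustrative TMs: the VM-pair pattern changes, but every VM's
# aggregate send/receive demand (hence per-link load) is identical.
tm_left  = {("A", "B"): 100, ("B", "C"): 100, ("C", "A"): 100}
tm_right = {("A", "C"): 100, ("C", "B"): 100, ("B", "A"): 100}
assert per_vm_load(tm_left) == per_vm_load(tm_right)
```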

Unlike the state-of-the-art solutions ElasticSwitch and Trinity (elasticswitch, ; trinity, ), QShare does not rely on GP in its design. Although QShare’s tenant-queue binding module does require demand prediction, it is much more lightweight than TM estimation since QShare only predicts a scalar metric for each tenant. Additionally, as shown in our evaluations (§7.1), QShare does not require perfect prediction to achieve work-conserving bandwidth guarantees.
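For instance, such a scalar metric could be tracked with a simple per-tenant exponentially weighted moving average; this is only an illustration of how lightweight the prediction can be, not necessarily QShare’s exact estimator.

```python
class TenantDemandEstimator:
    """Track one scalar demand signal per tenant with an EWMA.
    Stands in for the lightweight per-tenant prediction mentioned above;
    the smoothing factor is a placeholder."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.estimate = {}

    def update(self, tenant, observed_mbps):
        prev = self.estimate.get(tenant, observed_mbps)
        self.estimate[tenant] = self.alpha * observed_mbps + (1 - self.alpha) * prev
        return self.estimate[tenant]
```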