Data Center Network (DCN) is critical in supporting cloud applications. Numerous works have been published in recent years to improve the performance of DCN by proposing new switch design (Alizadeh et al., 2014; Perry et al., 2014), new transport protocol (Alizadeh et al., 2010; Cardwell et al., 2016; Mittal et al., 2015; Gao et al., 2015; Lee et al., 2015), and the combination of both (Alizadeh et al., 2013; Wilson et al., 2011; Zhu et al., 2015; Handley et al., 2017).
Because commodity switch hardware is slow to pick up new algorithms and new architectures, a more appealing and effective option is to modify only the end-host transport protocol software for better performance, given that a data center is basically an autonomous system. Indeed, DCTCP (Alizadeh et al., 2010) has been included in the official Linux kernel release and adopted in many data centers as the default transport protocol. It also serves as the foundation for some newer algorithms (Zhu et al., 2015; Bai et al., 2015). In a multi-tenant data center where the guest’s transport protocol cannot be modified, software innovation at the edge is still preferred over installing new switch hardware. It has been shown that new transport algorithms can be emulated by host hypervisors without modifying guest OSes (He et al., 2016; Cronkite-Ratcliff et al., 2016).
The packet buffer is the main facility in switches that works in conjunction with end-host congestion control algorithms. It helps switches sustain bandwidth utilization and absorb traffic bursts. The buffer occupancy, which is reflected in the queue depth, is a direct signal of the network congestion status. While some chip vendors boast a bigger buffer at the cost of other resources, we argue that this can do more harm than good.
First, the buffer poses a serious design challenge for switches. Because off-chip memory is slow and demands a large number of I/Os, state-of-the-art switch chips have to use on-chip SRAM as the packet buffer. A chip can afford no more than a few tens of megabytes of memory due to the footprint of SRAM cells. In a typical switch chip with balanced resource distribution, roughly 1/3 of the chip area is dedicated to memory, 1/3 to I/Os, and the remaining 1/3 to logic. The memory is further partitioned to support lookup tables/queues and the packet buffer. As switch chips become more flexible and programmable (Bosshart et al., 2013), the demand for more logic and lookup tables escalates. Meanwhile, the throughput and port density of switch chips continue to increase. A 12.8Tbps switch chip, which includes 128x 100Gbps ports, is on the horizon. This trend implies that I/Os will claim more chip area and the average buffer size per port will shrink. If we are able to reduce the buffer size, we can make switch chips scale to higher throughput and port density, or reserve more resources for richer functions and greater flexibility.
Second, data center applications have strict latency and throughput requirements. Simply using a large buffer to absorb bursts is lazy and ineffective. An unattended large buffer can lead to poor performance (e.g., long queuing delay, high jitter, and excessive buffer bloat). For example, a filled 1MB buffer introduces an extra 200us of queuing delay at a 40Gbps port, while the RTT of a lightly loaded DCN is less than 250us (Alizadeh et al., 2010). In an extreme case, the 20MB fully shared buffer in a switch can be occupied by a single congested port, and the worst-case queuing delay will be up to 4ms.
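The two delay figures above follow from dividing the buffered bits by the line rate. The short sketch below reproduces that arithmetic; the function name is illustrative, and it assumes that 1MB means 10^6 bytes and that the queue drains at the full port speed.

```python
def queuing_delay_us(buffer_bytes: float, link_gbps: float) -> float:
    """Microseconds needed to drain `buffer_bytes` at `link_gbps` (assumes full line rate)."""
    return buffer_bytes * 8 / (link_gbps * 1e9) * 1e6


if __name__ == "__main__":
    # Buffer sizes taken from the text: a 1 MB per-port queue and a 20 MB shared buffer.
    for size_mb in (1, 20):
        delay = queuing_delay_us(size_mb * 1e6, 40)
        print(f"{size_mb} MB at 40 Gbps: {delay:.0f} us")
```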
The delay-sensitive applications in data centers expect lower RTT (e.g., <100us) and smaller RTT jitter (e.g., <50us). The only hope to achieve this performance target is to keep the switch buffer occupancy near zero and stable regardless of the network condition. We believe the end-host congestion control still has the potential to help achieve this target. Tiny Buffer TCP (TBTCP) is therefore designed to meet the modern DCN’s need: lowering the chip buffer requirement while improving the flow transport performance.
2. DCTCP’s Performance Issue
A DCN spans only a few hops in a limited geographic distance, so the packet propagation and processing delay is small. The queuing delay is the main contributor to packet transport latency, and the RTT is directly proportional to the queue depth. Many Active Queue Management (AQM) algorithms have been developed to drop or mark packets based on different queue conditions (Pan et al., 2013; Nichols and Jacobson, 2012; Shan and Ren, 2017; Bai et al., 2016). Since these algorithms are designed to interact with normal TCP, their performance cannot meet the data center requirements.
For data center applications, Flow Completion Time (FCT) is the most important criterion in measuring the data center network performance (Dukkipati and McKeown, 2006). The relationship between FCT and queue depth is subtle. For small flows, only a few RTTs are needed to finish the packet transmission, which prevents the flow window from growing large. In this case, FCT is sensitive to the queuing delay, making it desirable to keep the queue depth shallow. If the queue is deep, the queuing delay is prolonged and the probability of buffer overflow is increased. Packet drops for small flows are especially catastrophic because the RTO can be many times longer than the normal flow FCT. For large flows, while it is important to avoid buffer overflow, it is equally important to avoid buffer underflow in order to maintain full throughput.
In summary, the bottleneck queue depth should stay as low as possible, but not completely empty if active flows are present. If this is achieved under any traffic condition, the FCT for both large and small flows can profit. In data centers, DCTCP presents a better FCT performance than TCP (Alizadeh et al., 2010), which means that DCTCP’s buffer control is better than TCP. However, DCTCP is still far from ideal. Modern data center applications (e.g., big data analytics) raise the performance requirement bar higher than what DCTCP can offer, which begs for further improvements. Next, we examine the underlying relationship between DCTCP’s performance and the queue depth/buffer size in switches.
DCTCP uses a single queue threshold $K$ for ECN marking. If the queue depth exceeds $K$ packets when enqueuing a packet, the packet is marked. The ECN mark is piggybacked by a returning ACK packet from the receiver to the sender. The sender counts the ratio of marked packets in a flow and adjusts the flow’s congestion window accordingly. Specifically, the estimated ratio $\alpha$ is updated every RTT as follows, in which $g$ is the weight parameter and $F$ is the actual counted mark ratio in the last window:

$$\alpha \leftarrow (1 - g)\,\alpha + g\,F \tag{1}$$

In response to an ECN mark, the sender updates its congestion window as:

$$cwnd \leftarrow cwnd \times \left(1 - \frac{\alpha}{2}\right) \tag{2}$$
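The two DCTCP sender rules described above can be sketched as a pair of small functions. This is a minimal illustration, not the kernel implementation: the function names, the default weight `g = 1/16`, and the one-segment floor on the window are assumptions made for the example.

```python
def dctcp_update_alpha(alpha: float, marked: int, total: int, g: float = 1 / 16) -> float:
    """Once per RTT: alpha <- (1 - g) * alpha + g * F, with F the marked fraction."""
    fraction = marked / total if total else 0.0
    return (1 - g) * alpha + g * fraction


def dctcp_reduce_cwnd(cwnd: float, alpha: float) -> float:
    """On ECN feedback: cwnd <- cwnd * (1 - alpha / 2), floored at one segment (assumed)."""
    return max(1.0, cwnd * (1 - alpha / 2))


if __name__ == "__main__":
    alpha = 0.0
    for _ in range(5):                      # five RTTs with every packet marked
        alpha = dctcp_update_alpha(alpha, marked=10, total=10)
    print(alpha, dctcp_reduce_cwnd(100.0, alpha))
```

With full marking the estimate only creeps toward 1, so the window cut in the first few RTTs is much smaller than a classical halving.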
Theoretical analysis shows that the maximum queue depth (as well as the buffer size) should be at least $K + N$ (where $N$ is the number of concurrent flows) to avoid buffer overflow. Meanwhile, $K$ should be at least $C \times RTT / 7$ (where $C$ is the bottleneck link bandwidth) to avoid buffer underflow. Although $K$ is smaller than the BDP, it is proportional to it and can be large.
Consider a 40Gbps link and a 250us RTT. In this case, $K$ is 178KB, or 118 MTU-sized packets. Even worse, the actual queue depth swings over a wide range that peaks at $K + N$. In reality, $N$ can be large (e.g., 1000 concurrent flows are not uncommon in data center workloads). Therefore, the queuing delay and jitter can be large, and the switch needs a large buffer. In this example, the filled queue will introduce more than 300us of extra delay. Figure 1 illustrates the queue depth dynamics of DCTCP (as well as TBTCP for comparison).
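The numbers in this example can be checked with a few lines of arithmetic. The sketch below assumes the threshold is one seventh of the bandwidth-delay product and that the filled queue holds the threshold plus one packet per flow; the helper names and the 1500-byte packet size are illustrative choices.

```python
def dctcp_min_threshold_bytes(link_gbps: float, rtt_us: float) -> float:
    """Marking threshold taken as BDP / 7, in bytes (assumed form of the bound)."""
    bdp_bytes = link_gbps * 1e9 / 8 * rtt_us * 1e-6
    return bdp_bytes / 7


def drain_time_us(packets: int, link_gbps: float, packet_bytes: int = 1500) -> float:
    """Queuing delay of `packets` full-sized packets at the given line rate."""
    return packets * packet_bytes * 8 / (link_gbps * 1e9) * 1e6


if __name__ == "__main__":
    k_bytes = dctcp_min_threshold_bytes(40, 250)
    print(f"threshold: {k_bytes / 1e3:.1f} KB")
    # 118 threshold packets plus 1000 flows, as in the text.
    print(f"filled-queue delay: {drain_time_us(118 + 1000, 40):.0f} us")
```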
This wide queue-oscillating range leads to poor FCT for small flows. The large buffer required is also a cost burden for the switch. The buffer requirement of DCTCP is coupled with the BDP and the number of flows $N$, which seriously limits DCTCP’s scalability. An interesting question is: can we lower the buffer size requirement so that better FCT performance and lower switch cost can be realized simultaneously? We believe DCTCP’s issue is rooted in its coarse and imprecise congestion window adjustments. Next we study the optimal window adjustment strategy and show how a fine, agile, and precise window control can answer this question.
3. Fine Control the Buffer Size
3.1. Achieve Optimal BDP
According to Gail and Kleinrock’s analysis, the optimal network operating condition occurs at the inflection point in Figure 2, where the in-flight data equals the Bandwidth-Delay Product (BDP) and the RTT is at its minimum (Cardwell et al., 2016). At this point, the bottleneck link bandwidth is fully utilized but the bottleneck queue is empty. The current TCP algorithms, including DCTCP, tend to produce a larger amount of in-flight data, which deviates from the optimal point and causes long queues in switches. While BBR (Cardwell et al., 2016) uses direct network measurements to approach the optimal point, TBTCP strives to work as close to the optimal point as possible with a simpler scheme.
In DCTCP, the congestion control algorithm makes the queue depth vary between and with the center of . In contrast, TBTCP aims to reduce the steady-state queue depth to the range of , where .
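For convenience, we consider all the flows sharing the bottleneck link as one aggregated flow, whose window is the sum of the flow windows. The congestion avoidance algorithm tries to dynamically reduce the aggregated flow’s congestion window so as to eliminate the buffer bloat and maintain 100% bottleneck bandwidth utilization. Assume the congestion window size of the aggregated flow is $W$ and $W'$ before and after a window adjustment, with corresponding round-trip times $RTT$ and $RTT'$. The following equation should hold true:

$$\frac{W}{RTT} = \frac{W'}{RTT'} = C \tag{3}$$

In Equation 3, $RTT = 2d + Q/C$ and $RTT' = 2d$, where $d$ is the one-way propagation delay and $Q$ is the bottleneck queue depth. Hence, with $BDP = 2dC$,

$$W' = W \times \frac{2d}{2d + Q/C} = W \times \frac{BDP}{BDP + Q} \tag{4}$$

We can deduce that the right amount of window adjustment for the aggregated flow in one RTT, $\Delta W$, is:

$$\Delta W = W - W' = W \times \frac{Q}{BDP + Q} \tag{5}$$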
We draw two conclusions from Equation 5. First, the smaller the queue depth $Q$ is, the smaller the window adjustment $\Delta W$ needs to be. For example, if $Q = BDP$, then $\Delta W = W/2$, which agrees with classical TCP’s behavior; if $Q = BDP/7$, as suggested in DCTCP, then $\Delta W = W/8$. This implies that DCTCP needs a much smaller window adjustment. As shown in Equation 2, a factor of $\alpha/2$ is applied for the window adjustment. However, since $\alpha$ cannot track the queue depth well (recall that $\alpha$ only tracks the ECN mark ratio for a single queue threshold $K$), the adjustment is inaccurate, leading to a deep queue and high queue oscillation.
Second, if the queue depth is small enough, then $\Delta W \approx Q$. This is because when $Q \ll BDP$, we get $W \approx BDP$. Therefore, when the queue depth is small, we should try to reduce the window size by the exact amount that equals the queue depth in each RTT. This makes the network work closer to the optimal point and avoids buffer underflow, as shown in Figure 2.
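A small numeric check makes the two conclusions concrete. The sketch assumes the window adjustment has the form W·Q/(BDP + Q), which is consistent with the classical halving case when the queue equals the BDP; the function name and the sample values (in packets) are illustrative.

```python
def window_adjustment(w: float, q: float, bdp: float) -> float:
    """Window reduction that drains a queue of depth q while keeping the link full."""
    return w * q / (bdp + q)


if __name__ == "__main__":
    # Queue as deep as the BDP: the reduction is half the window.
    print(window_adjustment(w=100, q=50, bdp=50))
    # Queue at one seventh of the BDP: the reduction is one eighth of the window.
    print(window_adjustment(w=80, q=10, bdp=70))
    # Shallow queue with W = BDP + Q: the reduction equals the queue depth.
    print(window_adjustment(w=1010, q=10, bdp=1000))
```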
Based on these observations, we develop the first key feature of TBTCP: Queue Canceling Decrease (QCD). That is, in each RTT, we reduce the overall flow window by exactly the size of the bottleneck queue depth. Moreover, we distribute the window reductions to as many flows as possible so each affected flow only needs to reduce its window by one MSS. We defer the discussion on how QCD is realized to Section 4.
3.2. Smooth Queue Oscillation
Even though QCD can make precise window reduction, the following window recovery may be too fast, which can nullify the reduction effects and introduce drastic queue oscillations.
For example, the amplitude of DCTCP’s queue oscillation grows with the number of concurrent flows. We refer to this as the coupling issue between the buffer size and the concurrent flow count. The fundamental reason is the additive increase of congestion windows: in every RTT, a flow’s window increases by one Maximum Segment Size (MSS) after the slow start phase. The additive increase works in wide area networks, but appears to be excessive for DCN due to its short RTT. For example, if the RTT is 100us, increasing a flow’s cwnd by one MSS is equivalent to adding 120Mbps of network bandwidth consumption (for a 1500-byte MSS). For 100 concurrent flows, the instant bandwidth surge is up to 12Gbps. As a result, the bottleneck queue becomes volatile and the flow performance suffers.
To avoid the bandwidth surge, we would like to reduce the window recovery speed. Specifically, we take $1/\beta$ RTTs to increase a flow’s cwnd by one MSS, where $\beta < 1$. This is equivalent to reducing the amount of window size increase in each RTT to $\beta$ MSS. Hence we develop the second key feature of TBTCP: Reduced Additive Increase (RAI). That is, in each RTT, we virtually increase a flow’s cwnd by $\beta$ (in a real implementation, the window increases by 1 in $1/\beta$ RTTs). Note that the seemingly “slow” increase is offset by the fact that the reduction is small in the first place.
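The effect of a smaller per-RTT increment on the aggregate bandwidth surge can be estimated directly. In this sketch the function name, the 1500-byte MSS, and the example increment of 0.1 MSS per RTT are assumptions for illustration.

```python
def bandwidth_surge_gbps(n_flows: int, rtt_us: float,
                         increment_mss: float = 1.0, mss_bytes: int = 1500) -> float:
    """Extra sending rate when every flow grows its window by `increment_mss` in one RTT."""
    extra_bits = n_flows * increment_mss * mss_bytes * 8
    return extra_bits / (rtt_us * 1e-6) / 1e9


if __name__ == "__main__":
    # Plain additive increase: one MSS per flow per RTT.
    print(bandwidth_surge_gbps(100, 100))
    # Reduced additive increase with an example increment of 0.1 MSS per RTT.
    print(bandwidth_surge_gbps(100, 100, increment_mss=0.1))
```

For 100 flows and a 100us RTT, the first call gives the 12Gbps surge mentioned in the text, and the reduced increment cuts it tenfold.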
We have the following theorem:
Theorem 3.1.
Due to QCD and RAI, the bottleneck queue depth will be stable at $\beta N$ within one RTT and the overall cwnd approximates the bottleneck BDP, where $N$ is the number of flows and $\beta$ is the per-RTT window increase of each flow under RAI.
Assume that at time $t$ the queue depth is $Q$ and the total window size of the $N$ flows is $W$. At time $t + RTT$, $W$ is reduced by $Q$ due to QCD; meanwhile, $W$ is increased by $\beta N$ due to RAI. Hence, the new overall window size is $W' = W - Q + \beta N$. Since $W$ is increased by $\beta N - Q$, the new queue depth becomes $Q' = Q + (\beta N - Q) = \beta N$.
Following the same deduction process, at time $t + 2RTT$, $W'$ (as well as $Q'$) is reduced by $\beta N$ due to QCD and increased by $\beta N$ due to RAI, so $W'$ (as well as $Q'$) no longer changes (although the distribution of $W'$ among the flows is changing).
That is, the queue depth is stable at $\beta N$ and the overall cwnd size is stable at $W'$ after the first RTT, if the flow number remains unchanged.
Based on the analysis in Section 3.1, $W' \approx$ BDP due to the small $Q'$. ∎
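The argument in the proof can be replayed as a simple aggregate recurrence. The sketch below is an idealized fluid model, not the simulator used in the evaluation: it assumes synchronized adjustments at RTT boundaries, measures windows and queues in packets, and uses `beta` as an illustrative name for the per-RTT RAI increment.

```python
def simulate_queue(q0: float, n_flows: int, beta: float, bdp: float, rtts: int) -> list:
    """Return the queue depth at each RTT boundary under idealized QCD and RAI."""
    window = bdp + q0                      # in-flight data = BDP plus what sits in the queue
    history = [q0]
    for _ in range(rtts):
        queue = max(0.0, window - bdp)
        window -= queue                    # QCD: cancel exactly the current queue depth
        window += beta * n_flows           # RAI: each flow adds beta MSS per RTT
        history.append(max(0.0, window - bdp))
    return history


if __name__ == "__main__":
    # 100 flows and beta = 0.1 settle at a depth of about 10 packets after one RTT.
    print(simulate_queue(q0=200.0, n_flows=100, beta=0.1, bdp=800.0, rtts=4))
```

Starting from either a deep or an empty queue, the model reaches the same depth after the first adjustment and stays there as long as the flow count is unchanged.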
Since the queue depth can grow to at most , TBTCP’s queue oscillation amplitude is more than times smaller than DCTCP’s.
4. TBTCP Algorithm
Like DCTCP, the TBTCP algorithm contains three components which work in switch, receiver, and sender, respectively.
4.1. RED-based Marking on Switches
Random Early Detection (RED) is a commodity active queue management scheme in modern switches. It monitors the instantaneous queue size and marks packets with the CE codepoint based on statistical probabilities. Specifically, it sets two queue depth thresholds, t_min and t_max. The packets are randomly marked and the marking probability linearly increases from 0 to P_max as the queue depth grows from t_min to t_max.
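The linear RED profile described above is easy to express as a function of the instantaneous queue depth. This is a generic sketch of that profile, not a particular switch's implementation; marking every packet at or above `t_max` follows the usual RED rule, and the parameter values in the example are arbitrary.

```python
def red_mark_probability(q: float, t_min: float, t_max: float, p_max: float) -> float:
    """Marking probability of linear RED for an instantaneous queue depth q."""
    if q < t_min:
        return 0.0                          # below the lower threshold: never mark
    if q >= t_max:
        return 1.0                          # at or above the upper threshold: mark all
    return p_max * (q - t_min) / (t_max - t_min)


if __name__ == "__main__":
    for depth in (0, 25, 50, 75, 100):
        print(depth, red_mark_probability(depth, t_min=0, t_max=100, p_max=0.5))
```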
TBTCP takes advantage of the RED scheme. Recall that QCD requires the overall flow window to be reduced by the size of the bottleneck queue depth in each RTT. To realize QCD, we want to mark exactly $Q$ packets when the queue depth is $Q$.
From Equation 5, we can derive that $\Delta W / W$ is the ideal packet marking probability to ensure the proper window adjustment. That is:

$$p = \frac{Q}{BDP + Q} \tag{6}$$

This is because in one RTT, $W = BDP + Q$ packets are sent through the bottleneck buffer. With the marking probability $p$, exactly $Q$ packets will be marked. This simplified analysis assumes that in one RTT the queue depth remains unchanged and the synchronized window adjustments happen at each RTT boundary. In reality, the queue depth constantly changes. However, the number of marked packets in an RTT is bounded by the largest queue depth during this RTT.
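The relation between the marking probability and the number of marks per RTT can be checked numerically. The sketch assumes the ideal probability has the form Q/(BDP + Q) and that BDP + Q packets cross the bottleneck in one RTT; the function names and the sample values (in packets) are illustrative.

```python
def ideal_mark_probability(q: float, bdp: float) -> float:
    """Probability that marks, in expectation, q packets out of the bdp + q sent per RTT."""
    return q / (bdp + q) if q > 0 else 0.0


def expected_marks_per_rtt(q: float, bdp: float) -> float:
    """Expected number of marked packets in one RTT at a constant queue depth q."""
    return ideal_mark_probability(q, bdp) * (bdp + q)


if __name__ == "__main__":
    for depth in (1, 10, 90):
        print(depth, ideal_mark_probability(depth, 90), expected_marks_per_rtt(depth, 90))
```

The expected number of marks equals the queue depth, which is what lets each marked packet stand for a one-MSS window reduction.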
The ideal marking probability graph appears to be a curve, but RED can only provide a polyline graph. We need to fit the two graphs so that RED can approximate the ideal marking probability. As shown in Figure 3, we can set and , so the actual marking probability . By choosing the proper t_max, we can get a good approximation of the marking probability.
4.2. ECN Echo from Receivers
Once a marked packet is received at a receiver, the receiver needs to echo the information back to the sender by setting the ECN-Echo flag in an ACK packet. In case the delayed-ACK feature is preferred, we modify the delayed-ACK behavior to ensure timely feedback. Whenever a marked packet is received, an ACK with the ECN-Echo flag set is sent immediately for the packets accumulated so far; otherwise, a normal delayed ACK is sent for every fixed number of packets.
4.3. Congestion Control at Senders
Senders host TBTCP’s core congestion avoidance mechanisms, QCD and RAI. Other features, such as slow start and retransmission on packet loss, remain the same as DCTCP.
Queue Canceling Decrease: The RED-based ECN guarantees that in each RTT, exactly $Q$ packets are marked when the bottleneck queue depth is $Q$, so every time a flow sender receives an ECN, it can simply reduce its cwnd by one MSS.
The total window reduction is amortized over up to $Q$ randomly selected flows. The selected flows are referred to as victim flows. Clearly, since large flows send more packets in an RTT, they have a higher probability of becoming victims than small ones. This is in line with the idea that small flows should be less disturbed.
Reduced Additive Increase: After the slow start stage (i.e., when the flow receives its first ECN), a flow’s cwnd increases by $\beta$ every RTT. This means that after every $1/\beta$ RTTs, the flow can actually increase its cwnd by one MSS. For example, if $\beta$ is set to 0.1, then each flow’s cwnd will be increased by one for every 10 RTTs. By doing so, the bottleneck queue depth will become stable at $\beta N$, where $N$ is the number of flows.
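The sender-side behavior described above can be sketched as a small per-ACK state machine. This is a toy model under stated assumptions, not the kernel patch: the class and attribute names are hypothetical, `beta` is an illustrative name for the RAI increment, one ACK per packet is assumed so that the per-RTT increment is spread over about cwnd ACKs, and the floor of two segments reflects the remark later in the paper that QCD only applies to windows greater than 2.

```python
class TbtcpSenderSketch:
    """Toy per-flow congestion state; windows are measured in MSS units."""

    def __init__(self, cwnd: float, beta: float = 0.1, min_cwnd: float = 2.0):
        self.cwnd = cwnd
        self.beta = beta              # RAI increment per RTT (assumed name)
        self.min_cwnd = min_cwnd      # QCD is skipped at or below this window
        self.slow_start = True

    def on_ack(self, ecn_echo: bool) -> None:
        if ecn_echo:
            # The first ECN ends slow start; QCD removes exactly one MSS.
            self.slow_start = False
            if self.cwnd > self.min_cwnd:
                self.cwnd = max(self.min_cwnd, self.cwnd - 1.0)
        elif self.slow_start:
            self.cwnd += 1.0          # conventional slow start: one MSS per ACK
        else:
            # RAI: beta MSS per RTT, spread over roughly cwnd ACKs per RTT.
            self.cwnd += self.beta / self.cwnd


if __name__ == "__main__":
    flow = TbtcpSenderSketch(cwnd=10.0)
    for echo in (False, True, False, False):
        flow.on_ack(echo)
        print(round(flow.cwnd, 4), flow.slow_start)
```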
4.4. Analysis on Steady State Window
To analyze the steady state behavior of TBTCP, we assume that $N$ long-lived and synchronized flows, with the same RTT, share a bottleneck link of capacity $C$. In the steady state, the overall window size $W$ and each flow’s window size $W_i$ are:

$$W = BDP + Q, \qquad W_i = \frac{BDP + Q}{N}$$

In each RTT, a flow’s window is increased by $\beta$ due to RAI. Now we consider QCD’s effect.
Since each packet is marked with a probability of $p = Q/(BDP + Q)$, if a flow starts with window size $w$, the expected new window size is $p(w - 1) + (1 - p)w = w - p$. That is, each packet sent reduces the flow’s window size by $p$ in expectation. During one RTT, the flow sends $W_i$ packets. Therefore, due to QCD, the flow’s window is reduced by $p W_i = Q/N$.
To maintain the steady state, we need to make the window increase and decrease equal in each RTT. That is, $\beta = Q/N$.
This exactly reproduces the key observation of the TBTCP algorithm: the steady state bottleneck queue depth is stable at $Q = \beta N$. It suggests that (1) when $\beta$ is fixed, the queue depth becomes shallower as the number of flows is reduced; and (2) when the number of flows is fixed, a smaller $\beta$ leads to a shallower queue.
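The per-flow balance between increase and decrease can be checked numerically. The sketch below assumes a marking probability of Q/(BDP + Q), an equal window share of (BDP + Q)/N per flow, and a per-RTT increase named `beta` here for illustration; all quantities are in packets.

```python
def per_flow_balance(q: float, n_flows: int, beta: float, bdp: float) -> float:
    """Net expected window change of one flow per RTT (increase minus decrease)."""
    p = q / (bdp + q)                 # marking probability at queue depth q
    w_i = (bdp + q) / n_flows         # equal share of the aggregate window
    return beta - p * w_i             # RAI gain minus expected QCD loss


if __name__ == "__main__":
    # The balance crosses zero where the queue depth equals beta * n_flows.
    for depth in (5, 10, 20):
        print(depth, per_flow_balance(depth, n_flows=100, beta=0.1, bdp=800))
```

A positive balance means windows keep growing and the queue deepens, a negative one means the queue drains, so the depth settles where the balance is zero.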
5. Algorithm Evaluation
To validate TBTCP, we begin by evaluating the algorithm in simulation. The network topology is a simple dumbbell in which $N$ flows share a single bottleneck link with a bandwidth of $C$. We monitor the bottleneck queue depth $Q$. All the TCP flows last 10 to 30 seconds. We enable the switch’s RED-based ECN feature and use Equation 6 to calculate the ECN marking probability.
5.1. Queue Depth and Buffer Occupancy
TBTCP Queue Depth on $\beta$: First, we examine the effect of the RAI parameter $\beta$ on TBTCP’s performance. In this test, 100 flows share a 40Gbps link and the RTT is set to 160us. Figure 4 shows that the stable queue depth is close to $\beta N$ with a small swing. Even when $\beta$ is as small as 0.1, queue underflow does not occur, due to the small queue oscillation. In the following simulations, we set $\beta$ to 0.1 to maximize the buffer reduction potential.
In these simulations, the average queue depth is around 5 for 50 flows and 10 for 100 flows. In addition, the queue oscillation range is small, about twice the average queue depth. Queue underflow is rarely observed, which means the bottleneck bandwidth is fully utilized.
RAI and Buffer Size: To demonstrate that RAI can reduce the buffer size requirement even if the number of flows is large, we implement a DCTCP variant for which the window recovery algorithm is replaced by RAI (a.k.a. DCTCP with RAI). The queue depth distribution, for different algorithms (i.e., the original DCTCP, DCTCP with RAI, and TBTCP) and under different configurations, are shown in Figure 6.
The simulations conform to DCTCP’s analysis of the buffer size requirement (i.e., at least $K + N$ in order to avoid packet drops). Furthermore, the queue depth of DCTCP oscillates between 0 and $K + N$ with a nearly uniform distribution, which implies a volatile queuing delay as well as RTT. On the other hand, when DCTCP is implemented with RAI, the buffer size requirement is no longer related to $N$ but just slightly higher than $K$. This confirms that RAI helps decouple the buffer size requirements from the number of concurrent flows, which is critical to implement a small-buffer switch.
Regardless of various configurations, TBTCP constantly requires a small buffer with about 20 packets. The small buffer implies small jitter and latency for flow transport. We can also see that when the other conditions are constant, the different bottleneck bandwidth and the flow RTT have little effect on the queue depth for all three algorithms.
RAI effectively narrows the queue oscillating range and reduces the jitter of queuing delay. Since the minimum queue depth tends to be greater than 0, RAI also helps lower the value of $K$ for DCTCP without leading to buffer underflow and throughput loss. We show the results of another simulation in Figure 7. In this case, we assume that 100 flows share a 40Gbps link and the RTT is 160us. The theoretical $K$ value for DCTCP is 76 packets, whereas the simulation shows that $K$ must be increased to 90 to avoid buffer underflow. In contrast, if RAI is enabled, $K$ can be reduced to 38 packets without losing any link throughput.
5.2. Fairness Between TBTCP Flows
To test TBTCP’s fairness, we start a new flow every 10 seconds with each flow lasting 40 seconds. Four flows are started in total. We repeat the test on a 1Gbps bottleneck link and a 10Gbps bottleneck link. The CWND of each flow as a function of time is shown in Figure 8.
In both cases, the bottleneck link reaches 100% throughput. The CWND curve shows that each flow gains its fair share of bandwidth. When there is more than one concurrent flow, the oscillation range of the CWND value is larger than the case of a single flow. However, the range becomes insensitive to the number of flows past one. This is due to the complex interaction between flows: more flows increase the queue oscillation, so the ECN marking probability is higher and in turn the CWND adjustments are more frequent.
5.3. Convergence Speed
Since RAI recovers the CWND at a slower pace than normal TCP and DCTCP, the flow convergence time for gaining a fair share of bandwidth may seem concerning. Figure 9 shows the simulation results for TBTCP, DCTCP, and DCTCP with RAI enabled. Each case has two flows on a 10Gbps link with 80us RTT: one flow starts at 0 seconds and stops after 20 seconds, and the other flow starts at the 10th second and stops after 20 seconds. We examine the CWND transition of both flows at the 10th second and the 20th second. TBTCP takes 60ms to converge at both check points; DCTCP takes 40ms to converge at the first check point and 10ms at the second; DCTCP with RAI takes 250ms to converge at the first check point and 120ms at the second.
DCTCP with RAI is the worst in terms of convergence time, but TBTCP is just slightly worse than DCTCP. When more concurrent flows are involved, the convergence time difference will be further reduced for the following reasons. First, the slow-start phase of a flow still uses the normal window increase mechanism (RAI only takes effect in the congestion-avoidance phase), so a new flow can quickly converge to its fair share of bandwidth at the same speed that some existing flows give up their bandwidth share. In TBTCP, flows reduce speed based on QCD. The aggregated reduction is comparable to the window reduction in DCTCP. Second, although RAI slows down the window recovery, QCD amortizes the window reduction over multiple flows, with each flow only reducing its CWND by one each time during congestion control. The slow recovery is accompanied by a moderate reduction.
Convergence Speed on Different $\beta$: To examine how the RAI parameter $\beta$ affects the convergence speed, we repeat the above experiment for TBTCP with different $\beta$ values. As shown in Figure 10, a larger $\beta$ value can reduce the bandwidth convergence time. Clearly, we can use a larger $\beta$ value as a trade-off to accelerate the convergence speed at the cost of a larger buffer size.
We also note that when $\beta$ is equal to or greater than 0.5, TBTCP loses its micro adjustment efficacy. Experiments show that a new flow can never gain its fair share of bandwidth. This is because a large $\beta$ value leads to a large stable queue depth and marking probability, which in turn restrains the new flow’s window increase potential.
5.4. RTT Fairness
To test the flow behavior of TBTCP in contrast to DCTCP when experiencing different RTTs, we assume four long-lived flows from different sources compete for a 40Gbps bottleneck link. Each flow’s maximum bandwidth is 20Gbps. The RTTs of these flows are configured to be 20us, 60us, 100us, and 140us, respectively (assuming the bottleneck queue is empty). For DCTCP, we set $K$ to 60KB based on the theoretical formula $K = C \times RTT / 7$, in which RTT takes the average value of 80us. For TBTCP, the ECN marking probability is $p = Q/(BDP + Q)$, in which BDP is calculated with the RTT set to the minimum, average, and maximum values, respectively (i.e., 50us, 80us, and 110us). The experiments last 10 seconds each.
As shown in Figure 11(a), the RTT fairness of DCTCP in this case is poor. Jain’s fairness index is only 0.51. The two flows with smaller RTT seize almost all the bandwidth, and the other two flows with longer RTT are permanently starved. Meanwhile, the bottleneck queue depth also fluctuates violently. When we use smaller or larger $K$ values, the fairness starts getting better, at the cost of bandwidth loss or memory consumption.
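For reference, Jain's fairness index of a set of throughputs is the square of their sum divided by the count times the sum of their squares. The sketch below computes it; the throughput vectors in the example are hypothetical and only illustrate the two extremes, they are not measurements from the experiments.

```python
def jain_index(throughputs) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly fair."""
    n = len(throughputs)
    squares = sum(x * x for x in throughputs)
    if n == 0 or squares == 0:
        return 0.0
    total = sum(throughputs)
    return total * total / (n * squares)


if __name__ == "__main__":
    # Hypothetical shares of a 40 Gbps link among four flows.
    print(jain_index([10, 10, 10, 10]))   # equal shares
    print(jain_index([20, 20, 0, 0]))     # two flows take everything
```

When two of four flows take all the bandwidth the index is 0.5, which is close to the value reported for DCTCP above.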
On the other hand, TBTCP presents much better RTT fairness. The RTT value used to calculate the ECN marking probability can affect the RTT fairness. In the above three configurations, the Jain’s fairness indices are 0.98, 0.87, and 0.85, respectively. It appears that when the smallest RTT value is used for the BDP, the fairness is the best and the average bottleneck queue depth is the shallowest. However, in this case, we can occasionally observe queue underflow that hurts the throughput. As a trade-off, we consider the average RTT a good balance between fairness and queuing performance.
The simulation results also show a significant mismatch of the total window size between DCTCP and TBTCP. This is due to the buffer bloat introduced by the marking threshold $K$ in DCTCP. However, DCTCP’s large window does not translate into higher throughput. TBTCP achieves the full throughput with a near-optimal total window size and the minimum latency.
Note that the RTT imbalance in DCN is a real issue, given the queuing delay contribution has been significantly reduced. Assume the processing and transport delay of one switch is 15us. In a 3-tier DCN, the topology-induced RTT can range from the minimum value of 30us to the maximum value of 150us. TBTCP is preferred to DCTCP for a better balanced performance in such networks.
Flows may experience different RTTs due to path length variance. A TCP flow’s throughput is observed to be inversely proportional to its RTT because flows with smaller RTT increase their window size faster. DCTCP also exhibits a negative bias against flows with longer RTT (Alizadeh et al., 2011). This is a serious concern for DCN, where the longest RTT can be several times the shortest RTT.
However, the RTT fairness is not a big issue for TBTCP. It has been shown that RED can help improve RTT fairness (Altman et al., 2000). On the other hand, due to RAI, the flows with smaller RTT increase their window sizes just slightly faster than the other flows. They will also receive ECNs faster so their window size reduction is faster. The net effect is that the RTT unfairness is well controlled.
Small Flow Fairness. Since TBTCP’s window recovery speed is slow, small flows, if their windows are reduced, would suffer more than large flows. This is mitigated by several factors: 1) small flows have a smaller probability of being chosen for window reduction; 2) RAI only takes effect after the slow start stage; 3) QCD only takes effect when the flow’s window is greater than 2; and 4) the minimized RTT in TBTCP can compensate for the window size loss (i.e., halving the RTT is equivalent to doubling the flow throughput).
Summing up all these factors, TBTCP shows a surprisingly better small-flow FCT performance than DCTCP.
Mixed Flow Fairness. If TBTCP flows coexist with other types of flows such as TCP and DCTCP, the performance of TBTCP flows can be negatively affected. This means TBTCP should be deployed as the only congestion control algorithm in a data center.
Incast. Incast (Chen et al., 2009) is not much of a concern for TBTCP. Incast happens when a large number of small flows compete for the same queue. However, the incast bursts build up over a few RTTs, during which TBTCP can generate enough ECNs to curb a sufficient number of flows in time.
TBTCP requires a few simple modifications to DCTCP’s congestion control algorithm. It also takes advantage of the RED-based ECN feature, which is generally available in commodity switches.
6.1. Host TCP Stack Modification
Implementing QCD and RAI in existing TCP protocol code in host OS is straightforward. The modification is made based on the DCTCP implementation in linux-source-4.4.0. Only 10 lines of code were added or modified for the sender, and 30 for the receiver. The modifications are distributed in four functions: tcp_cong_avoid_ai(), dctcp_ssthresh(), tcp_enter_recovery(), and dctcp_cwnd_event().
6.2. Switch Configuration
Shaping the ECN Probability Curve. Figure 3 shows the ideal ECN marking probability curve for QCD. However, in real implementation, we need to consider two issues.
First, many factors may increase TCP burstiness (Alizadeh et al., 2010). For example, the NIC’s Large Send Offload (LSO) function would send a big chunk of data with multiple MTU-sized packets in a burst. Since such bursts are transient and do not reflect an overall sending window boost, it is best to avoid triggering QCD. The measure we take is similar to that in DCTCP. We shift the ideal curve to the right and make the curve start from a threshold. In our current implementation, we set this threshold to 138KB. This is because our NIC enables a 128KB segment size, which implies a burst size of 86 packets. Likewise, DCTCP’s $K$ should be adjusted to be at least 120 packets or 180KB to avoid triggering excessive ECNs. When the queue depth is below the threshold, no packet is ECN marked. The new ECN marking probability derived from Equation 6 is therefore:
Second, as the queue depth increases, the ECN marking probability will eventually approach 1. When the probability is high enough, there is a possibility that the sending window is overly reduced, causing buffer underflow. To counter this, the implementations of TCP and DCTCP allow the flow’s sending window to be reduced by a maximum of 50% in the case of packet drop or ECN. Similarly, our implementation limits the ECN marking probability to a maximum of 0.5. This means that no more than 50% of packets can be marked in an RTT, which is equivalent to reducing each flow’s cwnd by a maximum of 50% in an RTT when the network is heavily congested. Since the marking probability is a monotonically increasing function of the queue depth, it saturates at 0.5 once the queue is deep enough.
Hence, the corrected ECN marking probability is:
Fitting the Curve to Switch ECN Configuration. For convenience and simplicity, the RED function in commercial switch chips is actually implemented as a step function that approximates the theoretical linear function (Broadcom, 2012). Specifically, the range between t_min and t_max is divided into eight equal-sized intervals. Hence, the step size and the -th step interval . At interval , the step value . The actual RED function is:
In the above equation, is the indicator function of interval . Our goal is to find the most fitting for and set the parameters t_min, t_max, and P_max accordingly. For simplicity, we set . We also know that . Once the queue depth reaches t_max, we follow the RED rule to mark all packets in order to avoid buffer overflow.
With the above considerations, the fitting problem can be translated into an optimization one:
We depict and in Figure 12. The optimization goal is to find the optimal value of P_max that can minimize the square error between the two curves.
Numerical analysis shows that when t_min, t_max, and BDP are set to 138KB, 550KB and 180KB, respectively, the proper P_max is 0.7 and the resulting square error is 6.66.
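The fitting procedure can be sketched in Python as a grid search over P_max. This is only an illustration under stated assumptions: the step value rule (`p_max * i / 8` on the i-th interval), the sampling grid, and the function names are hypothetical, and the target curve must be supplied by the caller, so the sketch is not expected to reproduce the reported values of 0.7 and 6.66.

```python
from typing import Callable, Sequence, Tuple

STEPS = 8  # the chip divides [t_min, t_max) into eight equal-sized intervals


def step_red(q: float, t_min: float, t_max: float, p_max: float) -> float:
    """Stepwise RED marking probability at queue depth q."""
    if q < t_min:
        return 0.0
    if q >= t_max:
        return 1.0  # RED rule: mark every packet at or above t_max
    width = (t_max - t_min) / STEPS
    i = min(int((q - t_min) // width) + 1, STEPS)  # interval index, 1..8
    return p_max * i / STEPS  # ASSUMED step value for interval i


def fit_p_max(
    target: Callable[[float], float],
    t_min: float,
    t_max: float,
    samples: Sequence[float],
    candidates: Sequence[float],
) -> Tuple[float, float]:
    """Return the candidate p_max with the smallest sum of squared errors."""

    def sse(p_max: float) -> float:
        return sum(
            (step_red(q, t_min, t_max, p_max) - target(q)) ** 2 for q in samples
        )

    best = min(candidates, key=sse)
    return best, sse(best)
```

A call such as `fit_p_max(curve, 138.0, 550.0, samples, [k / 100 for k in range(1, 101)])` searches P_max in steps of 0.01; the resulting error depends on how densely `samples` covers the queue depth range.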
Scaling the Probability Curve. The above fitting result is not satisfactory. If we use this configuration directly, the results will significantly diverge from the ideal. However, if we scale the curve of by a factor of (i.e., the probability at each point of becomes ), the curve appears more flat and linear. Thus, we can find the best P_max for to fit with a reduced fitting error.
Table 1 shows the optimal fitting results for different values of , with the other parameters being the same as above.
The new curve reduces the ECN marking probability by a factor of , which is equivalent to reducing the number of ECN-marked packets by a factor of . To compensate for this, a victim flow needs to reduce its congestion window by instead of whenever an ECN is received. Similar to DCTCP, TBTCP does not allow the window to be reduced by more than 50% for each ECN. So, the actual QCD window adjustment for a flow is: .
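A minimal Python sketch of this capped window adjustment follows. It assumes the window is counted in packets, that the unscaled QCD step is one packet per ECN (as stated in the summary), and that the scaled step equals the scaling factor; the function and parameter names are hypothetical.

```python
def qcd_window_after_ecn(cwnd: float, scale: float) -> float:
    """Return the congestion window (in packets) after one ECN mark.

    The window shrinks by `scale` packets, but never by more than
    half of its current value.
    """
    reduction = min(scale, cwnd / 2.0)  # cap the cut at 50% per ECN
    return cwnd - reduction
```

For example, with a scaling factor of 4, a 100-packet window drops to 96 packets, while a 6-packet window drops only to 3 packets because the 50% cap applies.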
Although a larger can help reduce the fitting error, the larger window size reduction causes greater performance impact on victim flows. It is necessary to evaluate the different configurations of and come up with the best tradeoff.
7. Test Results
To test the performance of the TBTCP implementation, we set up a physical network which includes a few Ethernet switches and servers. Each switch has 24x 10GE ports and a Broadcom Trident switch chip. The chip has a 9MB shared buffer in total and each port can use up to 1MB. Each ESX server has two CPUs with each CPU containing six 2.4GHz cores. Each server is equipped with eight 10GE NICs, so we can start up to eight VMs, each with a dedicated NIC. The VM guest OS is Ubuntu 16.04. We use iPerf (Jon Dugan et al., 2017) for performance measurement. In each guest, we run either an iPerf client or an iPerf server. Each iPerf client can initiate multiple TCP flows towards an iPerf server.
The test bed configuration is shown in Figure 13. Multiple clients send multiple TCP flows to a single server. The bottleneck link bandwidth is 10Gbps and the measured end-to-end RTT is 150us on average, so the BDP is about 180KB and the theoretical K value for DCTCP is about 26KB.
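The arithmetic behind these two figures can be checked with a short Python sketch. It assumes the DCTCP guideline that the marking threshold (commonly written K) should exceed one seventh of the bandwidth-delay product, as given in the DCTCP paper (Alizadeh et al., 2010); the function names are hypothetical.

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes for a link rate and round-trip time."""
    return link_bps * rtt_s / 8.0


def dctcp_min_threshold_bytes(link_bps: float, rtt_s: float) -> float:
    """Lower bound on the DCTCP marking threshold: BDP / 7 (assumed guideline)."""
    return bdp_bytes(link_bps, rtt_s) / 7.0


bdp = bdp_bytes(10e9, 150e-6)                    # 10Gbps link, 150us RTT
k_min = dctcp_min_threshold_bytes(10e9, 150e-6)
print(f"BDP = {bdp / 1000:.1f} KB, K > {k_min / 1000:.1f} KB")
```

With 1KB taken as 1000 bytes this gives a BDP of roughly 187.5KB and a threshold of roughly 26.8KB, which is in line with the rounded values of about 180KB and 26KB quoted above.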
In the tests, the throughput is the sum of the measured throughputs for all the flows at the iPerf server. Due to the link layer overhead, the theoretical maximum throughput can only reach 95% of the bottleneck link bandwidth. Meanwhile, the iPerf software overhead also causes some throughput loss, especially when the number of flows is large. Since the real-time buffer occupancy in the switches is invisible, the amount of throughput loss caused by buffer underflow cannot be determined. In the following tests, we use DCTCP as the benchmark and compare it with TBTCP.
Choose the Curve Scaling Factor: The fitting error is inversely related with the scaling factor . We test the achievable throughput when 100 flows compete for the bottleneck link with different scaling factors and find that the throughput is 7.7Gbps when . A larger leads to higher throughput. For example, when , the throughput becomes 8.8Gbps. However, a larger also increases the QCD granularity, which violates our fine adjustment principle. Indeed, as shown in Figure 14, although the maximum queue depth does not increase by much when becomes larger, the queue depth oscillation becomes more intense. As a tradeoff, we choose to use and .
Bandwidth Utilization and Queue Depth: Figure 15 shows that, due to buffer underflow, the theoretical K value for DCTCP cannot guarantee full bottleneck bandwidth utilization when the number of flows is large. Increasing K helps improve the bandwidth utilization, but when the flow number is large, the improvement diminishes. This is because a larger K consumes more buffer and will eventually cause buffer overflow (see Figure 16(b)).
When the number of flows is large, TBTCP can improve the bottleneck link throughput by up to 15% compared with DCTCP. At the same time, TBTCP reduces the maximum queue depth by up to 80%, as shown in Figure 16.
In the following tests, we set K to 275KB for DCTCP as a tradeoff between throughput and queue depth.
Flow Completion Time (FCT): To emulate the real network traffic, we configure 80% of the current flows to be shorter than 100KB and the remaining 20% to be longer than 100KB. The flow size ranges from 4KB to 5MB. We run the test until we get 50 FCT values for each flow length. The average FCT values for TBTCP and DCTCP are shown in Figure 17(a).
The test shows that, compared with DCTCP, TBTCP improves FCT for all flow sizes. TBTCP reduces the average FCT of short flows (512KB) by about 200us (a 7% to 37% improvement) and the average FCT of long flows (1MB) by 2-10ms (a 29% to 39% improvement). The significant FCT improvement is attributed to TBTCP’s shallow and stable queue, as shown in Figure 17(b). This result also confirms that the slower convergence of TBTCP does not hurt the FCT performance.
Fairness and Convergence Time: We test bandwidth fairness and convergence time for both TBTCP and DCTCP. We start a new flow every 20 seconds until there are four flows. We then stop a flow every 20 seconds until only one flow is left. The test results are shown in Figure 18.
While both algorithms allow flows to share bandwidth fairly, TBTCP’s flow throughput fluctuates less than DCTCP’s. We compare the bandwidth convergence time at each transition point (i.e., when a new flow starts or an old flow stops) and find that TBTCP and DCTCP are comparable when there is more than one flow. The two exceptions are: when the second flow starts, it takes TBTCP 0.3 seconds to converge to its share of bandwidth while it takes DCTCP 0.2 seconds; when the second-to-last flow stops, the last flow takes TBTCP 1.1 seconds to converge to full bandwidth while it takes DCTCP 0.4 seconds.
Although DCTCP converges faster in general, the convergence time difference becomes negligible as the number of concurrent flows increases.
8. Summary and Discussion
TBTCP is a pure host TCP-based congestion control algorithm on a par with DCTCP. We did not compare TBTCP with other more recent data center congestion control algorithms because all those algorithms need network device modifications other than simple host OS updates. The design goal of TBTCP is to alleviate the buffer sizing challenge of high-density commodity switches while achieving better performance than DCTCP.
A large buffer leads to poor FCT and strains the hardware; an unattended small buffer suffers both underflow and overflow. Therefore, the ideal strategy is to prevent congestion in the first place with fine control over a small buffer. QCD duly eliminates the residual packets in buffer and RAI gently explores any available bandwidth. Jointly, QCD and RAI help TBTCP work close to the optimal point. TBTCP’s bottleneck queue depth is not coupled with BDP but is merely a small ratio of the flow count, which enables a big buffer size reduction.
We summarize the rationale behind TBTCP as follows: First, theoretical analysis tells us exactly how much we should adjust the window size to fit the network condition. The RED-based ECN is proactive through “early” notifications, which prevents the buildup of queues in advance. Meanwhile, the “random” notifications disperse the window adjustments to multiple flows. So, 1) the large flows are likely to reduce their windows more; 2) each notification only reduces the window of an affected flow by one; and 3) the aggregated adjustments exactly cancel the residual packets in buffer.
Second, the conventional additive window increase is an overreaction in DCN where RTT is small. It causes bandwidth overshoot and queue buildup, which in turn leads to more congestion notifications. The violent queue oscillation demands a larger buffer and fails to maintain predictable performance. RAI’s slower window recovery largely avoids this problem and keeps the queue depth stable. The smaller recovery step is compensated by the fact that the window reduction size is smaller and large flows are more likely to receive a window reduction.
Third, in DCN, FCT is the most important performance indicator for applications. As the link bandwidth is maximized, RTT becomes the most influential factor affecting FCT. Due to the small hop count in DCN, RTT is mainly determined by the queuing delay. TBTCP’s small buffer design therefore guarantees better FCT.
Fourth, the hierarchical topology of DCN makes a flow’s RTT value fall into a few distinct bins. For example, in a 3-tier fat tree, flows may have hop counts of 1, 3, or 5. Conceivably, one flow’s RTT can be several times another’s. The RED approach taken by TBTCP properly addresses the RTT fairness issue by introducing early and biased punishment to flows with smaller RTT when they try to grab more bandwidth.
Finally, QCD and RAI are both simple operations which demand less host computing power than DCTCP and other sophisticated congestion control algorithms. The required OS network protocol stack modification is minuscule.
Nevertheless, TBTCP does not gain these benefits without any cost. We acknowledge that TBTCP retains some issues of DCTCP. For example, when TBTCP flows coexist with flows running other congestion control algorithms, the conservative use of buffer may negatively affect TBTCP flows. However, TBTCP, just like DCTCP, is intended to be deployed in confined network environments with a single administrative entity. It would not be difficult to deploy TBTCP exclusively throughout a data center, as the success of DCTCP has already proved the feasibility.
We also admit that TBTCP’s flow bandwidth convergence speed is inferior to DCTCP’s. However, the convergence time is an indirect performance indicator. TBTCP outperforms DCTCP in almost every other direct performance indicator, showing that the advantages of TBTCP dominate.
Hardware Consideration: While TBTCP works on commodity hardware, we suggest two possible implementation improvements. First, if we can configure the probability of each step in the chip’s RED function, we can fit the ideal marking probability curve without resorting to the curve scaling approach. This way, we can stick to the original QCD design: a flow only reduces its window by one for each ECN. Since the chip already supports the step function, we speculate that the capability is ready and the vendor just needs to open the API to switch users. We will confirm this with the chip vendor and suggest this enhancement if the chip is incapable of such configuration.
Second, as in the case of DCTCP, we find the burstiness introduced by the LSO feature on NIC to be harmful to TBTCP’s performance. A better burst control on NIC with packet pacing will be very helpful. This can be done through a customized NIC with extra packet processing capability. The smart NIC with an FPGA can be a perfect platform (Greenberg, 2015).
Our future work will pursue these directions to see how close the real implementation can approach the ideal performance of the TBTCP algorithm.
9. Related Work
Many new DCN architectures and algorithms have been published in recent years. Unlike TBTCP, some schemes are network-based, which require new switch hardware to cope with congestion and make better use of network resource (e.g., NDP (Handley et al., 2017) assumes a specialized switch for packet trimming and queuing). These schemes often make little or no assumption on end-host’s transport protocols. For example, CONGA (Alizadeh et al., 2014) actively measures the congestion level of all paths for each flow and dispatches flowlets to less congested paths for load balancing; LetFlow (Vanini et al., 2017) shows that random flowlet-based load balancing performs surprisingly well in reality; RackCC (Zhuo et al., 2016) conducts rack-level congestion control directly on switches. A well-behaved TCP, such as TBTCP, will work with these algorithms to provide even better performance.
Data center operators have a strong incentive to keep complexity at the edge and make use of commodity switches as the network fabric (Casado et al., 2012), because low cost and instant deployability are essential for data center services. The most straightforward approach is to deploy a better TCP congestion control algorithm. Unfortunately, TCP variants such as New Reno (Henderson et al., 2012) and CUBIC (Rhee and Xu, 2008) cannot meet the stringent performance requirements in DCN, partially due to their design focus on Wide Area Network (WAN) environments. DCTCP (Alizadeh et al., 2010) uses ECN rather than packet drop as a congestion signal, and applies gentler window adjustment. The congestion sensing and reaction is much faster, and catastrophic packet drops are mostly avoided. As a result, the FCT performance is improved. Further optimizations (e.g., DPP (Cisco, 2015) and PIAS (Bai et al., 2015)) distinguish long and short flows by giving short flows higher forwarding priority at switches so as to improve FCT performance. Similarly, these optimizations can be applied on TBTCP to boost its performance.
Instead of adjusting the sending window based on congestion signals, some end-host congestion control algorithms determine the flow sending rates based on the RTT measurements. For example, BBR (Cardwell et al., 2016) uses RTT measurements to estimate the network BDP and pace packets at the highest possible sending rate. TIMELY (Mittal et al., 2015) applies a similar idea but bypasses the OS kernel and offloads the RTT measurement and traffic pacing to the NIC. QUIC (Jim Roskind, 2013) implements new flow control algorithms in UDP to surpass TCP’s performance for particular applications such as HTTP. Other new transport algorithms have been proposed to replace TCP completely (e.g., PCC (Dong et al., 2015), pHost (Gao et al., 2015), ExpressPass (Cho et al., 2016), and Homa (Montazeri et al., 2018)).
More flow-related information is useful for better transport performance. In order to gain better performance than TCP, some algorithms rely on application layer information, such as flow deadline or priority, for flow scheduling (e.g., D2TCP (Vamanan et al., 2012), D3 (Wilson et al., 2011), PDQ (Hong et al., 2012), and pFabric (Alizadeh et al., 2013)). In addition to assuming that the application layer information is available, some of these algorithms also need switch arbitration (e.g., D3 (Wilson et al., 2011) and PDQ (Hong et al., 2012)) for flow rate allocation.
Due to the dominant scale of TCP deployment and applications relying on TCP, it may be challenging to deploy such algorithms in reality. However, there is a noticeable exception. For High Performance Computing (HPC), Artificial Intelligence (AI), and storage disaggregation applications in data centers, Remote Direct Memory Access (RDMA), due to its low latency and low CPU overhead, is gaining momentum by providing a full-stack networking solution (Zhu et al., 2015). For a low system cost, it is desirable to avoid proprietary hardware and apply commodity Ethernet technologies to the DCN fabric. RDMA over Converged Ethernet (RoCE) has been an active research area. Similar to DCTCP, DCQCN uses ECN as the signal for flow rate control (Zhu et al., 2015). Since RDMA requires a lossless network fabric, the link-based flow control, PFC (IEEE, 2011), is used to prevent packet drop due to buffer overflow in networks. The derived issues such as deadlock and head-of-line blocking are addressed in (Guo et al., 2016; Hu et al., 2016). TBTCP can be used as an orthogonal optimization to mitigate these issues. PFC is less likely to be triggered when using TBTCP, since TBTCP applies more graceful window adjustments and maintains low and stable buffer occupancy.
Some other algorithms combine in-network scheduling/load balancing and end-host rate adjustment together to achieve better performance. PASE (Munir et al., 2014) proposes a distributed framework while Fastpass (Perry et al., 2014) takes a centralized approach. In either case, TBTCP can be used for end-host rate control, which is complementary to in-network optimizations.
In spite of the prosperity of DCN research, TCP remains the most popular data center transport protocol and it still has room to improve. Small and stable queue occupancy in switches is essential for flow performance and switch efficiency. The core design principle of TBTCP is to enable finer control of the TCP congestion window, which helps avoid drastic queue oscillation under dynamic traffic conditions. TBTCP is instantly deployable. It outperforms DCTCP in jitter, latency, throughput, FCT, and RTT fairness. Moreover, TBTCP’s small buffer requirement allows data center switches to continue scaling for higher throughput, port density, and flexibility. We believe TBTCP is a capable contender to DCTCP in data centers.
In our future work, in addition to exploring the hardware improvements, we will study TBTCP’s applicability in wide area networks, where router buffer is also facing the scalability issue (Vishwanath et al., 2009). We will also try to apply the underlying QCD and RAI mechanisms in the RDMA protocol stack and explore the possibility to improve the performance of RDMA-based DCNs.
- CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In ACM SIGCOMM.
- Data Center TCP (DCTCP). In ACM SIGCOMM.
- Analysis of DCTCP: stability, convergence, and fairness. In ACM SIGMETRICS.
- pFabric: Minimal Near-Optimal Datacenter Transport. In ACM SIGCOMM.
- Fairness analysis of TCP/IP. In 39th IEEE Conference on Decision and Control.
- Information-Agnostic Flow Scheduling for Commodity Data Centers. In USENIX NSDI.
- Enabling ECN in Multi-Service Multi-Queue Data Centers. In USENIX NSDI.
- BCM56840 Programmer’s Register Reference Guide.
- BBR: Congestion-Based Congestion Control. acmqueue 14.
- Fabric: A Retrospective on Evolving SDN. In ACM SIGCOMM HotSDN Workshop.
- Understanding TCP incast throughput collapse in datacenter networks. In 1st ACM Workshop on Research on Enterprise Networking (WREN).
- ExpressPass: End-to-End Credit-based Congestion Control for Datacenters. CoRR abs/1610.04688.
- Dynamic Packet Prioritization, Nexus 9000 Series Switches.
- Virtualized Congestion Control. In ACM SIGCOMM.
- PCC: Re-architecting Congestion Control for Consistent High Performance. In USENIX NSDI.
- Why Flow-Completion Time is the Right Metric for Congestion Control. ACM SIGCOMM CCR 36 (1).
- pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric. In ACM CoNEXT.
- SDN for the Cloud. In ACM SIGCOMM.
- RDMA over Commodity Ethernet at Scale. In ACM SIGCOMM.
- Re-architecting datacenter networks and stacks for low latency and high performance. In ACM SIGCOMM.
- AC/DC TCP: Virtual Congestion Control Enforcement for Datacenter Networks. In ACM SIGCOMM.
- IETF RFC 6582: The NewReno Modification to TCP’s Fast Recovery Algorithm.
- Finishing Flows Quickly with Preemptive Scheduling. In ACM SIGCOMM.
- Deadlocks in Datacenter Networks: Why Do They Form, and How to Avoid Them. In ACM HotNets-XV.
- 802.1Qbb: Priority-based Flow Control.
- QUIC: Multiplexed Stream Transport over UDP.
- iPerf: The ultimate speed test tool for TCP, UDP and SCTP.
- Accurate latency-based congestion feedback for datacenters. In USENIX Annual Technical Conference (USENIX ATC 15).
- TIMELY: RTT-based Congestion Control for the Datacenter. In ACM SIGCOMM.
- Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities. In review.
- Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks. In ACM SIGCOMM.
- Controlling Queue Delay. acmqueue 10.
- PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem. In IEEE HPSR.
- Forwarding Metamorphosis: Fast Programmable Match-action Processing in Hardware for SDN. In ACM SIGCOMM.
- Fastpass: A Centralized “Zero-Queue” Datacenter Network. In ACM SIGCOMM.
- CUBIC: A New TCP-Friendly High-Speed TCP Variant. ACM SIGOPS Operating System Review 42 (5).
- Improving ECN Marking Scheme with Micro-burst Traffic in DCN. In IEEE INFOCOM.
- Deadline-Aware Datacenter TCP (D2TCP). In ACM SIGCOMM.
- Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching. In 14th USENIX NSDI.
- Perspectives on router buffer sizing: recent results and open problems. SIGCOMM Comput. Commun. Rev. 39 (2).
- Better Never than Late: Meeting Deadlines in Datacenter Networks. In ACM SIGCOMM.
- Congestion Control for Large-Scale RDMA Deployments. In ACM SIGCOMM.
- RackCC: Rack-level Congestion Control. In ACM HotNets-XV.