To deal with the growing application demands of ultra-low latency (Kalia et al., 2014; Dragojević et al., 2014; Gu et al., 2017; Gao et al., 2016), high throughput (Kalia et al., 2014; Dragojević et al., 2014; Mitchell et al., 2013), and high bandwidth (Huang et al., 2012; Li et al., 2014; Abadi et al., 2016; Yu et al., 2014; Nelson et al., 2015), modern datacenters are aggressively deploying RDMA (Zhu et al., 2015; Guo et al., 2016; Mittal et al., 2015). The intuition is simple: RDMA can provide low latency, high throughput (measured in messages/second), and high bandwidth (measured in bytes/second) with low CPU overhead. Indeed, RDMA-based applications experience orders-of-magnitude improvements in latency (s) and message throughput (10s of millions operations/second) (Kalia et al., 2014; Dragojević et al., 2014). Similarly, bandwidth-sensitive applications have been scaled to many users without CPU becoming the bottleneck (Huang et al., 2012; Zhu et al., 2015; Guo et al., 2016).
Unfortunately, modern RDMA usages are often limited to optimizing individual applications with careful tuning of RDMA verbs and transport types – each combination with its own advantages and drawbacks (Kalia et al., 2016a, 2014, b; Mitchell et al., 2013; Dragojević et al., 2014). However, even in a private datacenter, it is reasonable to assume that diverse RDMA-enabled applications will coexist (Zhu et al., 2015; Guo et al., 2016). In this paper, we answer the question: What happens when multiple RDMA-enabled applications coexist?
To this end, we performed a series of experiments using two state-of-the-art RDMA-based systems, FaSST (Kalia et al., 2016b) and eRPC (Kalia et al., 2019), and three different commercial RDMA implementations: InfiniBand, RoCEv2, and iWARP (§2). From our measurements, we conclude that RDMA’s low latency, high throughput, and high bandwidth are not guaranteed when multiple applications compete. In fact, the throughput of FaSST and eRPC drops by 74% and 93%, respectively, and eRPC’s median (99th percentile) latency increases by () when competing with an RDMA-based storage application. The Achilles’ heel of these highly optimized systems is that they perform well only in fully isolated environments, which are rare in practice (Zhu et al., 2015; Guo et al., 2016).
Our flow-level analyses further justify our conclusion. The median (99th percentile) latency of a latency-sensitive flow – one that sends 16B messages – increases by () in InfiniBand, () in RoCEv2, and () in iWARP when running alongside a single 1MB bandwidth-sensitive flow. Similarly, throughput-sensitive flows also get throttled with throughput loss of 69.5% in InfiniBand, and worse in RoCEv2 and in iWARP. Surprisingly, even bandwidth-sensitive flows sending different sizes of messages do not compete fairly against each other, even though both can independently saturate line-rate.
Unfortunately, RDMA NICs (RNICs) have not been designed for multi-tenant use cases, and their isolation mechanisms are not sufficient. Although RDMA standards support up to 15 hardware virtual lanes (Association, 2015) for separating traffic classes, such a small number of hardware shapers and/or priority queues is rarely sufficient in shared environments (Kumar et al., 2013; Alizadeh et al., 2013). We have also confirmed that state-of-the-art congestion control protocols such as DCQCN (Zhu et al., 2015) do not mitigate these latency and throughput anomalies either.
RDMA performance isolation is further complicated by the multi-resource nature of RNICs. Each RNIC has two primary resources: link bandwidth (i.e., the number of bytes it can transfer each second) and execution unit throughput (i.e., the number of messages it can process each second). Bandwidth-sensitive flows send large volumes of data, throughput-sensitive ones send a large number of messages, and latency-sensitive ones care about individual message latencies – all three need both resources in different amounts.
Given that current RNIC implementations cannot provide performance isolation, we aim to answer the following simple yet fundamental question: Can we isolate applications and flows sharing an RNIC purely in software without compromising RDMA’s performance benefits?
An ideal solution should provide performance isolation without sacrificing RNIC utilization; it should do so in a scalable manner, with low CPU overhead, and without any hardware changes (§3). Note that we focus on cooperative datacenters in this paper, where the aforementioned RDMA performance anomalies arise due to RNIC implementations and not from users/tenants gaming the system.
However, simultaneously achieving performance isolation and work conservation involves a well-known tradeoff even in cooperative environments (Popa et al., 2012; Chowdhury et al., 2016). We address this by presenting Justitia (§4), a pragmatic alternative that guarantees sharing incentive (Jaffe, 1981; Chowdhury et al., 2016), wherein each of the $n$ flows competing on an RNIC receives at least $\frac{1}{n}$th of one of its two resources. We then maximize utilization as long as latency-sensitive flows are well isolated. To minimize application-level overhead, Justitia monitors system-wide latency characteristics by maintaining a reference flow of its own, and it arbitrates among throughput- and bandwidth-sensitive flows via multi-resource shaping. At the cost of a possible slight decrease in utilization, Justitia can effectively isolate latency-sensitive flows and ensure that throughput- and bandwidth-sensitive ones are not unfairly penalized either. The proposed solution requires no hardware changes, provides a non-invasive service interface, and is applicable to different RDMA implementations.
We have implemented (§5) and evaluated (§6) Justitia on InfiniBand and RoCEv2. It mitigates the performance isolation anomalies between different types of flows while guaranteeing sharing incentive within the confines of the tradeoff space, without compromising low CPU usage, introducing additional overhead, or modifying application code. Furthermore, it complements RDMA congestion control protocols such as DCQCN (Zhu et al., 2015) and hardware virtual lanes (Carlson, 2009; Data Center Bridging Task Group, [n. d.]) (when available). In a large-scale experiment, Justitia improved the median and 99th percentile latencies of latency-sensitive flows by and , respectively, when competing against large bandwidth-sensitive flows. It scales well, effectively handles remote READs, and works well in simple incast scenarios. Justitia also isolates the performance of real-world RDMA applications. Using Justitia, eRPC’s throughput and latency improve by and when sharing RNIC resources with another storage service application.
Our paper makes the following contributions:
To the best of our knowledge, we are the first to perform a comprehensive analysis on RDMA sharing characteristics across all three RDMA implementations.
We design and implement Justitia, a software-only, host-based, and easy-to-deploy performance isolation solution that supports a wide range of RNICs.
We demonstrate Justitia’s benefits on both microbenchmarks and using real-world RDMA applications.
2. Performance Isolation Anomalies in RDMA
This section establishes a baseline understanding of RDMA sharing characteristics and identifies common anomalies across different RDMA implementations (§2.1), followed by performance isolation analyses of highly optimized, state-of-the-art RDMA-based applications (§2.2). We then discuss the impact of RDMA congestion control on these anomalies (§2.3) and provide our hypothesis on the source of the anomalies (§2.4).
2.1. Flow-Level Analyses
We define a sequence of RDMA messages between the same pair of queue pairs (QPs) to be a flow. We focus on three primary types of flows and study how they affect each other.
Latency-Sensitive: Flows with small messages that care about individual message latencies.
Throughput-Sensitive: Flows with small messages trying to maximize the number of messages sent per second.
Bandwidth-Sensitive: Flows with large messages with high bandwidth requirements.
We performed microbenchmarks between two machines with the same type of RNIC, where both are connected to the same RDMA-enabled switch. For most of the experiments, we used 56 Gbps Mellanox ConnectX-3 Pro for InfiniBand, 40 Gbps Mellanox ConnectX-4 for RoCEv2, and 40 Gbps Chelsio T62100 for iWARP; 10 and 100 Gbps settings are described later. All of the switches provide non-blocking forwarding at line-rate between ports, and we use a single switch in each experiment to avoid issues caused by path length asymmetry (Zhu et al., 2015). Further details of our hardware setups can be found in Table 1 of Appendix A.
We used Mellanox perftest 4.2 (Technologies, 2017) as the benchmarking tool with minor modifications to enable latency and throughput logging and event-triggered polling in sending bandwidth-sensitive flows. Unless otherwise specified, latency-sensitive flows in our microbenchmarks send a continuous stream of 16B messages, throughput-sensitive ones send a continuous stream of batches with each batch having 64 16B messages, and bandwidth-sensitive flows send a continuous stream of either 1MB or 1GB messages. Latency- and throughput-sensitive flows use busy polling, whereas bandwidth-sensitive flows use event-triggered polling. Although all flows send data using RDMA WRITEs over reliable connection (RC) QPs in the observations below, other verbs show similar anomalies as well. Experiments on iWARP use RDMA Communication Manager to create and connect QPs. We do not enable hardware virtual lanes in these experiments.
2.1.1. Latency-Sensitive Flows are Unprotected
The biggest isolation issue appears to be the performance degradation of latency-sensitive flows in the presence of bandwidth-sensitive flows. The performance of the former deteriorates for all RDMA implementations (Figure 1). Out of the three implementations we benchmarked, InfiniBand and RoCEv2 observe and degradations in median latency and and at the 99th percentile. While iWARP performs well in terms of median latency, its tail latency degrades dramatically () in the presence of a bandwidth-sensitive flow. The background bandwidth-sensitive flows were not affected in any of the three implementations.
2.1.2. Throughput-Sensitive Flows Require Isolation
Throughput-sensitive flows also suffer. When a background bandwidth-sensitive flow is running, the throughput-sensitive ones observe a throughput drop of or more across all RDMA implementations (Figure 4).
2.1.3. Adding More Flows Exacerbates the Anomalies
The lack of protection for latency-sensitive flows is further exacerbated as more elephant flows (or equivalently more QPs) are created. We increase the number of bandwidth-sensitive flows in our experiment to simulate more realistic datacenter applications. Although InfiniBand performs relatively well in the presence of a single background bandwidth-sensitive flow (Figure 1), adding one more flow incurs an additional drop of and in median and 99th percentile latencies (Figure 4). With 16 or more bandwidth-sensitive flows, the latency-sensitive flow can barely make any progress. We observed a similar trend in other RDMA technologies.
Similarly, a throughput-sensitive flow experiences a continuous falloff in performance with the increasing number of background bandwidth-sensitive flows, losing 90% of its original throughput with 16 elephant flows (Figure 4).
These anomalies illustrate the RNIC’s inability to handle multiple types of flows, which could stem from the limited number of queues inside the RNIC hardware, which increases head-of-line (HOL) blocking of small flows.
2.1.4. Latency-Sensitive Flows Coexist Well; So Do Throughput-Sensitive Flows
We observe no obvious anomalies among latency- or throughput-sensitive flows, or a mix of the two. Detailed results can be found in Appendix B.
2.1.5. Bandwidth-Sensitive Flows Hurt Each Other
Unlike latency- and throughput-sensitive flows, bandwidth-sensitive flows with different message sizes do affect each other, especially when using event-triggered polling of completion events. Although busy-polling can mitigate the unfairness in some cases (Zhang et al., 2017), using busy-polling – especially for bandwidth-sensitive flows where throughput is not the primary issue – leads to unnecessary CPU waste. Figure (a)a shows that a bandwidth-sensitive flow using 1MB messages receives a smaller share than one using 1GB messages. The larger flow receives , and more bandwidth in InfiniBand, RoCEv2, and iWARP, respectively.
Moreover, the current RNIC allocates bandwidth resources on a per-QP basis without distinguishing which application those QPs come from (Figure (b)b). In other words, users can use more QPs (similar to multiple connections in TCP/IP) to gain more bandwidth. Although we assume a cooperative datacenter, it is hard to restrict users to a certain number of QPs in their applications, especially when an application indeed needs to establish connections to multiple receivers. Without proper control over the bandwidth share, multiple bandwidth-sensitive applications can end up with unexpected bandwidth shares.
2.1.6. Anomalies are Present in Faster Networks Too
We performed the same benchmarks on 100 Gbps InfiniBand, only to observe that most of the aforementioned anomalies are still present. Appendix C has the details.
2.2. Application-Level Analyses
In this section, we demonstrate how real RDMA-based systems fail to preserve their performance in the presence of the aforementioned anomalies.
2.2.1. RDMA-Based Blob Storage
To generate background traffic, we have implemented a simple RDMA-based blob storage backend across 16 machines. Users read/write data to this storage using a PUT/GET interface via frontend servers. Objects larger than 1MB are divided into 1MB splits and distributed across the backend servers. This generates a stream of 1MB transfers, and the following RDMA-optimized systems have to compete with them in our experimental setup.
FaSST (Kalia et al., 2016b) is an RDMA-based RPC system optimized for high message rate. We deploy FaSST in 2 nodes with message size of 32 bytes and a batch size of 8. We use 4 threads to saturate FaSST’s message rate at 9.8 Mrps. In the presence of the storage application, FaSST’s throughput experiences a 74% drop (Figure (a)a).
eRPC (Kalia et al., 2019) is a brand-new RPC system built on top of RDMA. We deploy eRPC in 2 nodes with message size of 32 bytes. We evaluate eRPC’s latency and throughput using the microbenchmark provided by its authors. For the throughput experiment, we use 2 worker threads with a batch size of 8 on each node because 2 threads are enough to saturate the message rate in our 2-node setting. In the presence of the storage application, eRPC’s throughput drops by 93% (Figure (b)b), and its median and tail latencies increase by and , respectively (Figure (c)c).
2.3. Congestion Control Does Not Fix It
To demonstrate that DCQCN (Zhu et al., 2015) and PFC do not fix these anomalies, we performed the benchmarks again with PFC enabled at both the NICs and switch ports, DCQCN (Zhu et al., 2015) enabled at the NICs, and ECN markings enabled on a Dell 10 Gbps Ethernet switch (S4048-ON). In these experiments, latency- and throughput-sensitive flows still suffer unpredictably (§6.4.1).
2.4. Source of RDMA Performance Anomalies
We perform all our flow-level analyses in a simple 1-switch 2-node setting. These anomalies occur even though the switch is non-blocking and there are only two active ports on the switch. This implies that the network is not the source of anomalies in these experiments, which explains why DCQCN does not fix them. Rather, at the end hosts, RNICs immediately process all ready-to-consume messages to achieve work conservation, which is very likely to cause head-of-line (HOL) blocking of the smaller messages by the larger ones. As a result, message latencies increase unpredictably, flows receive unfair bandwidth shares, and throughputs drop.
We summarize our key observations as follows:
If only latency- or throughput-sensitive flows (or a mix of the two) compete, they are isolated from each other (§2.1.4).
Multiple bandwidth-sensitive flows can lead to unfair bandwidth allocations depending on their message sizes or number of QPs in use (§2.1.5).
Highly optimized, state-of-the-art RDMA-based systems also suffer from the anomalies we discovered (§2.2).
The presence of a congestion control protocol is no panacea to isolate latency- or throughput-sensitive flows from the bandwidth-sensitive ones (§2.3).
The performance anomalies we discovered stem from end hosts and are very likely caused by HOL blocking in RNICs (§2.4).
Goals. An ideal RDMA performance isolation solution should satisfy the following goals:
Performance Isolation w/o Sacrificing Utilization:
Performance isolation and work conservation are known to be at odds in network-level scenarios (Chowdhury et al., 2016; Popa et al., 2012) even though max-min fairness (Jaffe, 1981; Demers et al., 1989; Bennett and Zhang, 1996) provides both on a single link. The latter, however, only holds when all flows are bandwidth-sensitive and have packets with bounded size differences (Shreedhar and Varghese, 1996); for latency-sensitive flows, one must plan for the worst case (Stoica et al., 1997). Given that RDMA messages can range from bytes to gigabytes, relying on max-min fairness is not enough. We should strive for increasing utilization without sacrificing isolation.
Traffic-Agnostic, Simple Service Interface: Applications cannot be expected to change the nature of their traffic. Hence, we must accommodate all three types of flows. Applications should not have to specify traffic volume either. It is thus preferable to provide a narrow interface – e.g., have applications choose one of the three classes of service when creating a flow.
No Changes to Applications or Hardware: Although an ongoing body of work focuses on programmable NICs and switches (Bosshart et al., 2014), large-scale deployments of these techniques are yet to happen. In a traditional life cycle, changes to the RNICs or switches are expensive, time-consuming, and hard to deploy. If possible, simple edge- and software-based solutions that are application- and hardware-independent are preferable.
Scalability w/ Low Resource Usage: The proposed software solution should scale to a large number of flows without large resource consumption to remain practical.
Non-Goals. Users/tenants/applications gaming the public cloud network is a well-studied topic (Popa et al., 2012; Mogul and Popa, 2012; Chowdhury et al., 2016), and RDMA will likely experience similar challenges in such an environment. Nonetheless, given the extent of RDMA performance isolation anomalies even in a controlled, non-adversarial environment (§2), we restrict our focus to a cooperative datacenter environment in this paper. We consider the need for strategyproofness (Ghodsi et al., 2011; Ghodsi et al., 2012; Popa et al., 2012) to mitigate adversarial/malicious behavior to be a non-goal.
Justitia provides performance isolation between latency-, throughput-, and bandwidth-sensitive flows while maximizing RNIC resource utilization. In this section, we first present Justitia’s design principles (§4.1). Next, we present Justitia’s overall architecture in terms of its two core components: the Justitia daemon (§4.2) and Justitia shapers (§4.3). Finally, we extend Justitia to handle remote READs via inter-machine coordination (§4.4) and to further increase utilization when latency-sensitive flows cannot be helped (§4.5).
4.1. Design Principles
Justitia’s design principles follow from its requirements.
Isolation via Sharing Incentive: Given the isolation-vs-utilization tradeoff, instead of picking either one, we opt for guaranteeing each of the $n$ flows at least $\frac{1}{n}$th of one of the two resources and then maximizing the total utilization until latencies may be affected. This ensures that we are not unfairly penalizing one specific type of flow.
Soft Admission Control of Latency-Sensitive Flows: An implication of enforcing a sharing incentive is that providing latency guarantees may become untenable (e.g., when the number of other flows is high). In such cases, Justitia informs an application that a new latency-sensitive flow will not meet its latency target.
Sender-Side Multi-Resource Shaping in Software: Instead of keeping separate queues for each bandwidth- or throughput-sensitive flow, Justitia relies on a host-wide daemon that arbitrates between all resource-hungry flows from the sender side. It splits large messages into roughly equal-sized chunks, which helps avoid HOL blocking. We do not use hardware rate limiters in the RNIC because they are limited in number and slow when setting new rates (2 milliseconds in our setup).
4.2. The Justitia Daemon
Figure 7 presents a high-level overview of Justitia. Each machine has a Justitia daemon that performs latency monitoring and proactive rate management, and applications create QPs using the existing API to perform RDMA communication. Justitia relies on applications to optionally identify the type of a flow when creating the corresponding QP. (Footnote 1: We implement this by passing an optional flag in the ibv_qp_init_attr structure in the ibv_create_qp() function, done in one line of code.) By default, flows are treated as bandwidth-sensitive. In the following, we first provide a high-level overview of how Justitia works and then elaborate on its different components.
To monitor latency, Justitia does not interact with latency-sensitive flows at all. They can send messages/data whenever they want because they cannot saturate either of the two RNIC resources. Instead, Justitia maintains a system-wide reference latency-sensitive flow to estimate the 99th percentile () latencies for small messages (§4.2.1). This works well in estimating the impact of resource-hungry (bandwidth- and throughput-sensitive) flows on the latency-sensitive ones because all latency-sensitive flows get affected when resource utilization is very high. Moreover, by monitoring its own reference flow instead of the flows from applications, Justitia does not need to wait on latency-sensitive applications to send a large enough number of sample messages for accurate tail latency estimation. It does not add additional delay into those applications by probing their flows either, which is significant in a microsecond-scale network.
Justitia performs proactive rate management of all bandwidth- and throughput-sensitive flows from the sender side. At its heart, the key idea is maximizing the safe total utilization () of all resource-hungry flows without violating system-wide latency target: , while guaranteeing sharing incentive. Using as a signal, Justitia uses an Additive Increase Multiplicative Decrease (AIMD) algorithm to maximize (§4.2.2).
Justitia enforces among bandwidth- and throughput-sensitive flows using multi-resource tokens. The Justitia daemon generates multi-resource tokens every interval to limit the total utilization of all resource-hungry flows to (§4.2.3). Each token corresponds to a fixed amount of bytes () and a fixed number of messages (). Because of the coupled nature of the two RNIC resources, a flow can completely spend only one resource of a token – a bandwidth-sensitive flow will exhaust its associated bytes, while a throughput-sensitive flow will exhaust the number of operations that the token allows it to send. Tokens are distributed in a fair fashion among the active resource-hungry flows by the Justitia daemon up to (§4.2.3).
Given the tokens, each flow shapes/paces itself (§4.3). A token is large enough for a flow not to bottleneck on token generation and distribution. Large messages from bandwidth-sensitive flows are divided into equal-sized chunks, which are then paced based on token availability. Splitting is necessary to avoid HOL blocking caused by bandwidth-sensitive flows. Batches of small messages from throughput-sensitive flows are paced by per-flow pacers too.
4.2.1. Handling Latency-Sensitive Flows
Justitia does not interrupt or interact with latency-sensitive flows. Instead, in the presence of at least one latency-sensitive flow, it runs a reference flow that keeps sending 10B messages to another machine in the cluster in periodic intervals (by default, =0.5 ms). is chosen to send the reference flow at a rate that adds no additional delay to other latency-sensitive flows, but still is frequent enough to monitor latency anomalies. Justitia then measures the latency between posting a message and when its work completion is generated.
Given the measurements, Justitia maintains a sliding window of the most recent (=10000) measurements, and it uses a count-min sketch (Cormode and Muthukrishnan, 2005) on that window to estimate . This is fed into the computation algorithm described below.
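As a rough illustration of the sliding-window estimate described above, the sketch below keeps the most recent measurements in a bounded deque and computes an exact tail percentile by sorting. It deliberately swaps the count-min sketch for exact sorting to keep the example short. The class name `TailLatencyWindow`, its methods, and the nearest-rank percentile rule are assumptions made for illustration, not Justitia's actual code.

```python
import math
from collections import deque


class TailLatencyWindow:
    """Sliding window of recent latency samples (hypothetical helper)."""

    def __init__(self, window_size=10000):
        # A deque with maxlen silently drops the oldest sample when full.
        self.samples = deque(maxlen=window_size)

    def record(self, latency_us):
        """Add one reference-flow latency measurement (microseconds assumed)."""
        self.samples.append(latency_us)

    def percentile(self, p=99.0):
        """Nearest-rank percentile over the current window; None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        rank = math.ceil(p * len(ordered) / 100.0)  # 1-based nearest rank
        return ordered[max(rank, 1) - 1]
```

Sorting the full window on every query costs O(W log W) for a window of W samples, which is one plausible reason to prefer an approximate structure in a microsecond-scale setting.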
If is higher than , Justitia can perform soft admission control (e.g., returning a warning code) when creating new latency-sensitive flows. If cannot be met at all, Justitia can opt for maximizing utilization (§4.5).
In the absence of latency-sensitive flows, is set to total RNIC bandwidth (), where is pre-determined on a per-RNIC basis using the benchmark flows from Section 2. Because the ratio between to (i.e., the total ops/second) is fixed for a given RNIC, calculating in terms of bandwidth is sufficient.
In the presence of latency-sensitive flows, the overarching goal of Justitia boils down to continuously maximizing based on the current estimation (Pseudocode 1). At the same time, it must ensure that each resource-hungry flow – assume there are latency-, bandwidth-, throughput-sensitive flows – receives at least th of the RNIC resources. Instead of attempting to achieve this on a per-flow basis, Justitia focuses on maximizing , where is or a higher fraction of .
To continuously update , Justitia uses a simple AIMD scheme that reacts to every interval as follows. If the estimation is above , Justitia decreases by half; is guaranteed to be at least of . If the estimation is below , Justitia slowly increases . Because ranges between to the total RNIC resources and latency-sensitive flows are highly sensitive to too high a utilization level, our conservative AIMD scheme, which drops utilization quickly to meet , works well in practice.
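The AIMD rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name, all parameter names, and the additive step size are hypothetical, and a tail estimate exactly equal to the target is treated as acceptable.

```python
def aimd_update(safe_util, tail_estimate, tail_target, min_util, max_util, step):
    """One AIMD step for the safe utilization of resource-hungry flows.

    All names are hypothetical. safe_util, min_util, max_util, and step are
    assumed to share one unit (for example, Gbps).
    """
    if tail_estimate > tail_target:
        # Multiplicative decrease: halve, but never go below the guaranteed floor.
        return max(safe_util / 2.0, min_util)
    # Additive increase: grow slowly, capped at the total RNIC capacity.
    return min(safe_util + step, max_util)
```

For example, with a floor of 16 and a cap of 48, a violation at 40 yields 20, a second violation yields 16 (the floor), and subsequent compliant intervals climb back up by `step` each time.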
4.2.3. Token Generation And Distribution
Justitia uses multi-resource tokens to enforce among the bandwidth- and throughput-sensitive flows in a fair manner. Each token represents a fixed amount of bytes () and a fixed number of messages (). In other words, the size of determines the chunk size a bandwidth-sensitive flow is split into. A token is generated every interval, but the value of depends on as well as on the size of each token. For example, given 48 Gbps application-level bandwidth and 30 million operations/sec on a 56 Gbps RNIC, if is set to 1MB, then we set =5000 operations and =167 microseconds.
The Justitia daemon continuously generates one token every interval and distributes it among the active resource-hungry flows in a round-robin fashion. Each flow independently enforces its rate using one of the shapers (§4.3). Note that introducing the notion of weighted round-robin is straightforward. If a flow’s weight is , Justitia can ensure it receives -proportional tokens during each round.
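The arithmetic behind the example above, and the round-robin hand-out, can be sketched as follows. The function names are hypothetical, and the calculation assumes 1MB means 10^6 bytes, which is the reading that reproduces the 5000-operation and 167-microsecond figures.

```python
from itertools import cycle


def token_parameters(bandwidth_bps, ops_per_sec, token_bytes):
    """Return (ops_per_token, interval_us) for a token worth token_bytes."""
    interval_s = token_bytes * 8 / bandwidth_bps   # time to send one token's bytes
    ops_per_token = ops_per_sec * interval_s       # messages the RNIC can process meanwhile
    return ops_per_token, interval_s * 1e6


def distribute_round_robin(flows, num_tokens):
    """Hand out num_tokens tokens to the active flows one at a time, in order."""
    counts = {flow: 0 for flow in flows}
    if not flows:
        return counts
    for _, flow in zip(range(num_tokens), cycle(flows)):
        counts[flow] += 1
    return counts
```

Calling `token_parameters(48e9, 30e6, 10**6)` gives roughly 5000 operations and 166.7 microseconds per token. A weighted variant could repeat each flow in the cycle according to its weight.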
4.3. Justitia Shapers
Justitia shapers – implemented in the RDMA driver – enforce utilization limits provided by the Justitia daemon-calculated tokens. There are two shapers in Justitia: one for bandwidth- and another for throughput-sensitive flows.
Shaping Bandwidth-Sensitive Flows. This involves two steps: splitting and pacing. For any bandwidth-sensitive flow, Justitia transparently divides any message larger than into -sized chunks to ensure that the RNIC only sees roughly equal-sized messages. Splitting messages for diverse RDMA communication verbs – e.g., one-sided vs. two-sided – requires careful designing (§5.2).
Given chunk(s) to send, the pacer requests token(s) from the Justitia daemon by marking itself as an active flow. Upon receiving a token, it transfers chunk(s) until that token is exhausted and repeats until there is nothing left to send.
The application is notified of the completion of a message only after all of its chunks have been successfully transferred.
Shaping Throughput-Sensitive Flows. These flows typically deal with (batches of) small messages. Consequently, there is no need for message splitting. Instead, a pacer ensures that the flow can send at most messages corresponding to each token. Each token is large enough so as not to bottleneck on token generation and distribution.
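The two shaping steps for bandwidth-sensitive flows, splitting and pacing, can be sketched as below. This is an illustrative model only: the function names are hypothetical, and token arrival is modeled as grouping chunks into per-token batches rather than as real waiting on the daemon.

```python
def split_into_chunks(message_bytes, chunk_bytes):
    """Split a message into chunk sizes; only the last chunk may be smaller."""
    if message_bytes <= 0 or chunk_bytes <= 0:
        raise ValueError("sizes must be positive")
    full, rest = divmod(message_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


def pace_chunks(chunks, token_bytes):
    """Group chunks into batches; each batch spends at most one token's bytes."""
    batches, current, budget = [], [], token_bytes
    for size in chunks:
        if size > token_bytes:
            raise ValueError("chunk larger than a token")
        if size > budget:
            # Token exhausted: close the batch and wait for the next token.
            batches.append(current)
            current, budget = [], token_bytes
        current.append(size)
        budget -= size
    if current:
        batches.append(current)
    return batches
```

The same `divmod`-based batching applies to throughput-sensitive flows if sizes are read as message counts and the token budget as the per-token message allowance.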
Mitigating Head-of-Line Blocking. One of the foremost goals of Justitia is to mitigate HOL blocking caused by the bandwidth-sensitive flows to provide good isolation. To achieve this goal, we need to split messages into smaller chunks and pace them at a certain rate (enforcing ) with enough spacing between them to minimize the blocking. However, this simple approach creates a dilemma. On the one hand, too large a chunk may not resolve HOL blocking. On the other hand, too small a chunk may not be able to reach . It also leads to increased CPU overhead from using a spin loop to fetch tokens generated in a very short period in which context switches are not affordable. Note that this is another manifestation of the performance isolation-work conservation tradeoff. We discuss how to pick the chunk size in Section 5.1 and how to reduce CPU overhead in Section 5.3.
4.4. Handling READs via Remote Control
So far we have discussed Justitia from a sender-side perspective. However, RDMA allows remote machines to read from a local machine using the RDMA READ verb. RDMA READ operations from machines to read data from compete with all sending operations (e.g., RDMA WRITE) from machine . Consequently, Justitia must consider remote READs as well.
One possible design to achieve this would be sending tokens from to so that ’s Justitia daemon can pace the READs. However, this requires tight coordination between many machines and is susceptible to latency variations in sending/receiving tokens. Instead, we opt for a simpler solution in Justitia, wherein sends the updated guaranteed utilization (, where is the updated count of including remote READ flows) to each remote flow after each update, and locally enforces that rate. Note that this can sometimes decrease utilization when remote READ flows do not completely use their assigned resources.
4.5. What If Is Unattainable?
A key consequence of the isolation-utilization tradeoff is that may sometimes be unattainable – e.g., when it is set too low or in the presence of too many resource-hungry flows. This can cause underutilization as Justitia continuously tries to reduce without success while limiting resource-hungry flows to th shares.
We address this issue by providing an option to the operator: if is higher than for period, Justitia assumes that is unattainable. It can then ignore latency-sensitive flows altogether and focus on equally sharing all resources among resource-hungry flows.
It may need to come out of this state only when the ratio changes. Specifically, when becomes even smaller – e.g., decreasing or increasing – it can stay in the same state. Only when increases, Justitia can go back to the original algorithm and try to attain again.
The cluster operator can decide whether to use this option based on their experience and application expectations.
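One way to picture the entry condition for this fallback is a counter over consecutive violating intervals. The sketch below is an assumption-laden illustration: the class name, the `patience_intervals` parameter, and the choice to count consecutive (rather than cumulative) violations are all hypothetical, and the exit condition is intentionally not modeled.

```python
class UnattainableDetector:
    """Flags a latency target as unattainable after sustained violations."""

    def __init__(self, patience_intervals):
        # Number of consecutive violating intervals tolerated (assumed knob).
        self.patience = patience_intervals
        self.violations = 0

    def observe(self, tail_estimate, tail_target):
        """Feed one interval's estimate; return True once the target looks unattainable."""
        if tail_estimate > tail_target:
            self.violations += 1
        else:
            self.violations = 0  # any compliant interval resets the streak
        return self.violations >= self.patience
```

Once `observe` returns True, the daemon in this sketch would stop reducing utilization and fall back to equal sharing among resource-hungry flows.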
We have implemented the Justitia daemon as a user-space process in 3,100 lines of C, and the shapers are implemented inside individual RDMA drivers with 5,200 lines of C code.
5.1. Determining Token Size for Bandwidth Target
One of the key steps in determining the bandwidth target is deciding the size of each token. Because the RNIC can become throughput-bound instead of bandwidth-bound for smaller messages, we cannot use arbitrarily small messages to resolve HOL blocking. At the same time, given a utilization target, we want to use the smallest token size that achieves that target, so as to reduce HOL blocking while maximizing utilization.
Instead of dynamically determining it using another AIMD-like process, we observe that (i) this is an RNIC-specific characteristic and (ii) the number of RNICs is small. With that in mind, we maintain a pre-populated dictionary; Justitia simply uses the mappings during runtime. When latency-sensitive flows are not present, a large token size (1MB) is used. Otherwise, Justitia switches to the smallest chunk size with which bandwidth-sensitive flows can saturate most of the line rate (to enforce the utilization target) when sending chunks in a batch (Figure 29 in the Appendix). To avoid the variation caused by chunk sizes on different hardware, we set the chunk size to 5 KB by default.
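The lookup described above might look like the following sketch. The dictionary contents, the RNIC model names, and the `per_rnic` switch are hypothetical placeholders; only the 1MB large token and the 5 KB default chunk size come from the text (interpreted here as binary units, which is also an assumption).

```python
from typing import Dict

LARGE_TOKEN_BYTES = 1 << 20      # "1MB" token when no latency-sensitive flow is present
DEFAULT_CHUNK_BYTES = 5 * 1024   # "5 KB" default chunk size

# Hypothetical pre-populated dictionary: RNIC model -> smallest chunk size that
# still saturates most of the line rate. The entries are made-up placeholders.
CHUNK_BYTES_BY_RNIC: Dict[str, int] = {
    "rnic-model-a": 4096,
    "rnic-model-b": 8192,
}


def token_size(rnic_model: str, latency_sensitive_present: bool,
               per_rnic: bool = False) -> int:
    """Pick the token size in bytes for the current mix of flows."""
    if not latency_sensitive_present:
        return LARGE_TOKEN_BYTES
    if per_rnic:
        # Fall back to the default when the model is not in the dictionary.
        return CHUNK_BYTES_BY_RNIC.get(rnic_model, DEFAULT_CHUNK_BYTES)
    return DEFAULT_CHUNK_BYTES
```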
5.2. Transparently Splitting RDMA Messages
Justitia’s splitter transparently divides large messages of bandwidth-sensitive flows into smaller chunks for pacing. The application posts to a QP exactly as before and does not notice any difference when posting a Work Queue Element (WQE) or polling for the Completion Queue Element (CQE) of that request from the Completion Queue (CQ) associated with that QP.
Our splitter uses a custom QP called a Split QP to handle message splitting, which is created when the original QP of a bandwidth-sensitive flow is created. A corresponding Split CQ is used to handle CQEs for the WQEs posted to a Split QP. A custom completion channel is used to poll those CQEs in an event-triggered fashion to preserve low CPU overhead of native RDMA.
To handle one-sided RDMA operations, when detecting a message larger than the chunk size, we divide the original message into chunks and only post the last chunk to the application’s QP (Figure 8). The rest of the chunks are posted to the Split QP. The Split QP ensures that all chunks have been successfully transferred before the last chunk is handled by the application’s QP. This ensures that the user cannot poll the CQE until the entire message has finished transferring. Two-sided RDMA operations such as SEND are handled in a similar way, with additional flow control messages for chunk size changes and receive requests pre-posted at the receiver side. The WRITE_WITH_IMM verb can be further simplified by using WRITE in the WQEs handled by the Split QP.
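The chunking rule for one-sided operations can be illustrated with a short sketch. It only computes which byte ranges would go to the Split QP and which single range would go to the application's QP; the function name and the `(offset, size)` representation are assumptions made for illustration, and no actual verbs are posted.

```python
from typing import List, Tuple

Chunk = Tuple[int, int]  # (offset, size) in bytes


def split_message(length: int, chunk_size: int) -> Tuple[List[Chunk], Chunk]:
    """Return (chunks for the Split QP, last chunk for the application's QP)."""
    if length <= 0 or chunk_size <= 0:
        raise ValueError("length and chunk_size must be positive")
    if length <= chunk_size:
        # Small messages are not split and go straight to the application's QP.
        return [], (0, length)
    pieces = [(off, min(chunk_size, length - off))
              for off in range(0, length, chunk_size)]
    # All but the last chunk are handled by the Split QP.
    return pieces[:-1], pieces[-1]
```

Because the application only ever sees the completion of the last chunk, it would observe a single CQE per original request, as described in the text.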
5.3. Reduce CPU Overhead From Using Small Tokens
As mentioned earlier, using small tokens leads to CPU overhead, mainly from busy spinning to fetch tokens that are generated at a short period (around 1us), which precludes any context switches. We solve this challenge by decoupling token generation (TG) from token enforcement (TE). We move the discussion to Appendix D due to limited space.
In this section, we evaluate Justitia’s effectiveness in providing performance isolation between latency-, throughput-, and bandwidth-sensitive flows on InfiniBand and RoCEv2.
Our key findings can be summarized as follows:
Unless specified, we do not use hardware virtual lanes.
To measure latency, we perform 5 consecutive runs and present their median. Most of our results are very stable; we do not show error bars when they are too close to the median.
Ethics. This work does not raise any ethical issues.
6.1. Preventing Isolation Anomalies
We start by revisiting the scenarios from Section 2 to understand how Justitia isolates different types of RDMA flows.
Experimental Setup. We use the same setups as those described in Section 2, and unless otherwise specified, we set the latency target to 2 microseconds on both InfiniBand and RoCEv2 for the latency-sensitive flows. Justitia works well in 100 Gbps networks too (Appendix C). Unless otherwise specified, sharing incentive is strictly enforced.
6.1.1. Predictable Latency
Recall that latency-sensitive flows are affected the most when they compete with a bandwidth-sensitive flow. In the presence of Justitia, both median and tail latencies improve significantly in both InfiniBand and RoCEv2 (Figure (a)). In this experiment, we set the latency target to the value observed when the latency-sensitive flow is running alone. By the sharing incentive requirement, the bandwidth-sensitive flow is limited to half of its original bandwidth (Figure (b)). In other words, Figure (a) shows the best latency isolation while maintaining sharing incentive. Because Justitia treats all bandwidth-sensitive flows from the same application as one and distributes tokens among them in a round-robin fashion, introducing more flows will not affect isolation.
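The round-robin token distribution among flows of one application can be sketched as follows. This is a simplified illustration with hypothetical names; it hands out whole tokens one at a time and ignores token sizes and timing.

```python
from collections import deque
from typing import Dict, Sequence


def distribute_tokens(flow_ids: Sequence[str], tokens: int) -> Dict[str, int]:
    """Hand out `tokens` tokens one at a time, cycling through the flows."""
    if not flow_ids:
        return {}
    grants = {flow: 0 for flow in flow_ids}
    queue = deque(flow_ids)
    for _ in range(tokens):
        flow = queue.popleft()   # next flow in round-robin order
        grants[flow] += 1
        queue.append(flow)       # the flow goes back to the end of the line
    return grants
```

Since the application's token budget is fixed regardless of how many flows share it, adding flows changes only how the budget is divided, which is consistent with the observation that more flows do not affect isolation.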
Maximizing Work Conservation. Next we evaluate how Justitia performs when the latency target is set to a large value (10 microseconds) that can always be met. Justitia keeps increasing the bandwidth given to bandwidth-sensitive flows toward the line rate until the target is violated. Figure 10 illustrates the latency bound that can be achieved in such a case.
For a slightly higher latency target, Justitia can provide bounded latency for applications sharing the same RNIC without compromising high bandwidth allocation. Note that as long as all applications go through Justitia, bandwidth-sensitive applications are all paced by Justitia with aggregate bandwidth set to line rate. Thus latency numbers in Figure 10 will not change regardless of the number of bandwidth-sensitive applications.
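The ramp toward line rate can be pictured with an AIMD-style update rule. This is a sketch under assumptions: the additive step, the multiplicative backoff factor, and the use of the guaranteed share as a floor are illustrative choices, and the function is not taken from Justitia's code.

```python
def next_utilization(current: float, latency_est: float, target: float,
                     floor: float, step: float = 0.05,
                     backoff: float = 0.5) -> float:
    """One update of the bandwidth fraction given to bandwidth-sensitive flows.

    current, floor: fractions of line rate in [0, 1]; floor is the guaranteed share.
    step, backoff: assumed AIMD parameters (additive increase, multiplicative decrease).
    """
    if latency_est > target:
        # Target violated: back off, but never below the guaranteed share.
        return max(floor, current * backoff)
    # Target met: probe for more bandwidth, capped at line rate.
    return min(1.0, current + step)
```

With a generous target, repeated calls simply climb to 1.0 (line rate) and stay there, which corresponds to the work-conserving behavior evaluated here.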
6.1.2. Fair Bandwidth and Throughput Sharing
Justitia ensures that bandwidth-sensitive flows receive equal shares regardless of their message sizes (Figure 11). To achieve fair sharing, Justitia introduces small bandwidth overhead (less than 6% on InfiniBand and 2% on RoCEv2).
Justitia’s benefits extend to the bandwidth- vs. throughput-sensitive flow scenario as well. In this case, it ensures that both receive roughly half of their resources. Figure 12 illustrates this behavior. In both InfiniBand and RoCEv2, the throughput-sensitive flow is able to achieve half of the message rate it achieves when running alone (Figure 12a). The bandwidth-sensitive flow, on the other hand, is limited to half its original bandwidth as expected (Figure 12b).
6.1.3. Throughput- vs. Latency-Sensitive Flow
6.2. Justitia and RDMA Applications
We now shift our attention to real applications (§2.2) and evaluate Justitia’s effectiveness at the application level. We observe that Justitia achieves better RNIC resource sharing when FaSST and the bandwidth-sensitive storage application coexist – FaSST’s throughput improves by with a decrease in storage application’s bandwidth (Figure 14). Justitia also improves eRPC’s median (tail) latency by () and its throughput by while still maintaining sharing incentive.
6.3. Justitia Deep Dive
6.3.1. Scalability and Rate Conformance
Figure (a) shows that as the number of bandwidth-sensitive flows increases, all flows receive the same amount of bandwidth using Justitia. The overall RNIC bandwidth utilization remains close to its maximum capacity.
The same holds for throughput-sensitive flows (Figure (b)), but with two caveats. First, a single throughput-sensitive flow cannot saturate the RNIC; it takes four or more (refer to Figure 27 in the Appendix). Hence, Justitia ensures that all throughput-sensitive flows send a roughly equal number of messages. Second, throughput-sensitive flows are CPU-hungry because they drive a large number of messages.
6.3.2. CPU and Memory Consumption
Justitia uses two dedicated CPU cores per machine: one to generate and distribute tokens and the other for the reference latency-sensitive flow. A detailed analysis on CPU overhead can be found in Appendix D. Its memory footprint is not significant.
6.3.3. Impact of Latency-Sensitive Flow’s Message Size
All our latency-sensitive experiments use small, 16B messages. Here, we vary the message size and observe that Justitia can still meet the median latency of the flow running alone, and its tail performance is still limited due to the isolation-utilization tradeoff (Figure 17). The bandwidth-sensitive flow receives half the bandwidth in all cases.
6.4. Justitia and Alternatives
6.4.1. Justitia + DCQCN
As discussed earlier (§2.4), the anomalies we discover in this paper do not stem from network congestion but rather occur at the end hosts. To further confirm our hypothesis, we deployed DCQCN (§2.3) and found that it indeed falls short for latency- and throughput-sensitive flows (Figures 20, 20, 20). Justitia mitigates them and complements DCQCN by improving latencies by up to and throughput by .
6.4.2. Justitia + Hardware Virtual Lanes
Hardware virtual lanes are limited in number (Kumar et al., 2013; Alizadeh et al., 2013; Technologies, 2018); e.g., our Ethernet switches support only two lossless traffic classes. In this experiment, we run three flows, one of each of the three types (Figure 21). Although the latency-sensitive flow remains isolated in its own class, the bandwidth- and throughput-sensitive flows compete in the same class. As a result, the latter suffers throughput loss (similar to Figure 12). Justitia can effectively provide performance isolation between bandwidth- and throughput-sensitive flows in the shared queue.
6.4.3. Justitia vs. LITE
6.5. Dynamic, Long-Running Scenarios
Here we extend our evaluation from microbenchmarks to two dynamic scenarios. Both use a latency target of 2 microseconds.
6.5.1. Sharing Incentive Enforcement
First, we focus on Justitia’s effectiveness in isolating many flows with different requirements and performance characteristics. Specifically, we consider 8 long-running bandwidth-sensitive flows – 2 each with message sizes: 1MB, 10MB, 100MB, and 1GB – that arrive over time in pairs. When all of the bandwidth-sensitive flows are active, we start 8 latency-sensitive flows that run for a relatively short period of time (20 million samples) and finish. Figure 22 shows the latency measurements.
In the absence of Justitia, latency-sensitive flows suffer large performance hits: individually, each flow had median and 99th percentile latencies of 1.3 and 1.4 microseconds (Figures 22a and 22b). With bandwidth-sensitive flows, they worsen by and . Justitia improves median and tail latencies of latency-sensitive flows by and while guaranteeing sharing incentive among all the flows.
6.5.2. When the Latency Target Is Unattainable
In this experiment, we focus on Justitia’s dynamic adjustments to use up resources when the latency target cannot be achieved (Figure 23). Justitia first tries to ensure sharing incentive when the ratio of active latency-sensitive flows increases. However, when it cannot meet the target for a long duration (in this case, 5 seconds), Justitia provides an option to opt for increasing utilization and equally shares bandwidth between the bandwidth-sensitive flows. Note that the operator can choose the opposite as well.
6.6. Handling Remote READs
Unlike TCP/IP, RDMA provides the READ verb, which allows a remote machine to read from a local machine, with data flowing from the local machine to the remote one. Consequently, READs compete with WRITEs and SENDs from the local machine to the remote machine. Figure 24 shows that, as expected (§4.4), Justitia can isolate latency-sensitive remote READs from local bandwidth-sensitive WRITEs and vice versa.
6.7. Justitia’s Impact on Incast Scenarios
So far, we have always focused on sender-side RNIC contention. In this experiment, we focus on Justitia’s impact on simple incast scenarios, where multiple senders continuously send messages of 1MB, 10MB, 100MB, and 1GB (two senders each) to a single receiver (Figure 25). Simultaneously, a latency-sensitive flow is sent to the receiver. We extended Justitia daemons to continuously exchange receiver-side views with the senders (similar to the RDMA READ case). We compare four cases where (1) neither DCQCN nor Justitia is applied, (2) only Justitia is applied, (3) only DCQCN is applied, and (4) both DCQCN and Justitia are applied.
We make two observations. First, DCQCN indeed greatly improves incast. However, using Justitia alone can achieve performance similar to DCQCN’s. Second, Justitia can complement DCQCN to further improve the incast scenario.
7. Related Work
Recently, large-scale RDMA deployments over RoCEv2 have received wide attention (Zhu et al., 2015; Guo et al., 2016; Mittal et al., 2015, 2018). However, the resulting RDMA congestion control algorithms (Mittal et al., 2015; Zhu et al., 2015; Le et al., 2018) primarily deal with Priority-based Flow Control (PFC) to provide fair sharing between bandwidth-sensitive flows inside the network. In contrast, Justitia focuses on RNIC isolation and is complementary to them (§6.4.1).
Similarly, Justitia is also complementary to FreeFlow (Kim et al., 2019), which solves a different problem: enabling untrusted containers to securely gain some of the performance benefits of RDMA. Because FreeFlow does not change how verbs are sent to queue pairs and only validates that it is secure to do so, it can still suffer from the performance isolation problems that Justitia addresses. It can also potentially benefit from employing an approach similar to Justitia. Further, in scenarios where applications are trusted, Justitia has the potential to achieve better performance than FreeFlow.
LITE (Tsai and Zhang, 2017) also addresses resource sharing and isolation issues in RNICs. However, through experiments (§E), we have found that LITE does not perform very well in the absence of hardware virtual lanes. In contrast, Justitia is a software-only solution that appreciates the isolation-vs-utilization tradeoff to mitigate RDMA performance isolation anomalies.
Justitia’s goal is to enable such diverse workloads to coexist. Although Justitia currently works at the level of flows, it can potentially be extended to handle application- and tenant-level isolation issues (§F).
Max-min fairness (Jaffe, 1981; Demers et al., 1989; Bennett and Zhang, 1996; Shreedhar and Varghese, 1996) is the well-established solution for link sharing that achieves both sharing incentive and work conservation, but it only considers bandwidth-sensitive flows. Latency-sensitive flows can rely on some form of prioritization for isolation (Alizadeh et al., 2013; Hong et al., 2012; Wilson et al., 2011).
Although DRFQ (Ghodsi et al., 2012) dealt with multiple resources, it considered cases where a packet sequentially accessed each resource, both link capacity and latency were significantly different than RDMA, and the end goal was equalizing utilization instead of performance isolation. Furthermore, implementing DRFQ required hardware changes.
Both Titan (Stephens et al., 2017b) and Loom (Stephens et al., 2017a) improve performance isolation on conventional NICs by programming on-NIC packet schedulers. However, this is not sufficient to address all RDMA performance isolation problems because it only schedules a single resource: the outgoing link. Further, Justitia works on existing RDMA NICs that do not have programmable packet schedulers.
Datacenter Network Sharing
With the advent of cloud computing, the focus on link sharing has expanded to network sharing between multiple tenants (Mogul and Popa, 2012; Popa et al., 2012; Chowdhury et al., 2016; Ballani et al., 2011; Shieh et al., 2011). Almost all of them – except for static allocation – deal with bandwidth isolation and ignore latency-sensitive flows.
Silo (Jang et al., 2015) dealt with datacenter-scale challenges in providing latency and bandwidth guarantees with burst allowances on Ethernet networks. In contrast, we focus on isolation anomalies in multi-resource RNICs between latency-, bandwidth-, and throughput-sensitive flows.
8. Concluding Remarks
We have demonstrated that performance isolation issues between bandwidth-, throughput-, and latency-sensitive RDMA flows are pervasive across InfiniBand, RoCEv2, and iWARP and in 10, 40, 56, and 100 Gbps RDMA networks. The root causes include the work-conserving nature of RDMA NICs (RNICs) and their multi-resource design. The overall impact is head-of-line (HOL) blocking when flows with diverse message sizes and performance requirements compete.
Justitia addresses these anomalies both at flow and application levels in two steps. First, it guarantees each flow at least an equal share of the two RNIC resources (bandwidth and execution unit throughput). Second, it maximizes RNIC utilization across both dimensions without violating that guarantee. Justitia is easily deployable, scales well, can handle remote READs, and performs well in simple incast scenarios. Justitia works well in isolating the performance of real-world RDMA applications such as FaSST and eRPC. Furthermore, it complements RDMA congestion control protocols such as DCQCN and hardware virtual lanes (when present) well.
Justitia is only a first step toward RNIC performance isolation and raises interesting research questions (Appendix F).
- Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In OSDI.
- Alizadeh et al. (2013) Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick Mckeown, Balaji Prabhakar, and Scott Shenker. 2013. pFabric: Minimal Near-Optimal Datacenter Transport. In SIGCOMM.
- Association (2015) Infiniband Trade Association. 2015. Infiniband architecture specification volume 1. https://cw.infinibandta.org/document/dl/7859. (2015).
- Ballani et al. (2011) Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. 2011. Towards predictable datacenter networks. In SIGCOMM.
- Bennett and Zhang (1996) J.C.R. Bennett and H. Zhang. 1996. WF²Q: Worst-case Fair Weighted Fair Queueing. In INFOCOM.
- Bosshart et al. (2014) Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, et al. 2014. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review 44, 3 (2014), 87–95.
- Carlson (2009) Craig Carlson. 2009. IEEE 802.1: 802.1Qaz - Enhanced Transmission Selection. http://www.ieee802.org/1/pages/802.1az.html. (2009).
- Chowdhury et al. (2016) M. Chowdhury, Z. Liu, A. Ghodsi, and I. Stoica. 2016. HUG: Multi-Resource Fairness for Correlated and Elastic Demands. In NSDI.
- Cormode and Muthukrishnan (2005) Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
- Cruz (1991a) RL Cruz. 1991a. A Calculus for Network Delay, Part I: Network Elements in Isolation. IEEE Transactions on Information Theory 37, 1 (1991), 114–131.
- Cruz (1991b) RL Cruz. 1991b. A Calculus for Network Delay, Part II: Network Analysis. IEEE Transactions on Information Theory 37, 1 (1991), 132–141.
- Data Center Bridging Task Group ([n. d.]) Data Center Bridging Task Group. [n. d.]. http://www.ieee802.org/1/pages/dcbridges.html. ([n. d.]).
- Demers et al. (1989) A. Demers, S. Keshav, and S. Shenker. 1989. Analysis and Simulation of a Fair Queueing Algorithm. In SIGCOMM.
- Dragojević et al. (2014) Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. FaRM: Fast Remote Memory. In NSDI.
- Gao et al. (2016) Peter X Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network requirements for resource disaggregation. In OSDI.
- Ghodsi et al. (2012) Ali Ghodsi, Vyas Sekar, Matei Zaharia, and Ion Stoica. 2012. Multi-resource fair queueing for packet processing. SIGCOMM.
- Ghodsi et al. (2011) Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. 2011. Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI.
- Gu et al. (2017) J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. 2017. Efficient Memory Disaggregation with Infiniswap. In NSDI.
- Guo et al. (2016) Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. RDMA over Commodity Ethernet at Scale. In SIGCOMM.
- Hong et al. (2012) Chi-Yao Hong, Matthew Caesar, and P. Brighten Godfrey. 2012. Finishing Flows Quickly with Preemptive Scheduling. In SIGCOMM.
- Huang et al. (2012) Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. 2012. Erasure Coding in Windows Azure Storage. In USENIX ATC.
- Intel (2003) Intel. 2003. HTB Home. http://luxik.cdi.cz/~devik/qos/htb/. (2003).
- Jaffe (1981) Jeffrey M Jaffe. 1981. Bottleneck flow control. IEEE Transactions on Communications 29, 7 (1981), 954–962.
- Jang et al. (2015) Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. 2015. Silo: Predictable message latency in the cloud. In SIGCOMM.
- Kalia et al. (2014) Anuj Kalia, Michael Kaminsky, and David G Andersen. 2014. Using RDMA efficiently for key-value services. In SIGCOMM.
- Kalia et al. (2016a) Anuj Kalia, Michael Kaminsky, and David G Andersen. 2016a. Design guidelines for high performance RDMA systems. In USENIX ATC.
- Kalia et al. (2016b) Anuj Kalia, Michael Kaminsky, and David G Andersen. 2016b. FaSST: fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In OSDI.
- Kalia et al. (2019) Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2019. Datacenter RPCs can be General and Fast. In NSDI.
- Kim et al. (2019) Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. 2019. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In NSDI.
- Kumar et al. (2013) Gautam Kumar, Srikanth Kandula, Peter Bodik, and Ishai Menache. 2013. Virtualizing Traffic Shapers for Practical Resource Allocation. In HotCloud.
- Le et al. (2018) Yanfang Le, Brent Stephens, Arjun Singhvi, Aditya Akella, and Michael M. Swift. 2018. RoGUE: RDMA over Generic Unconverged Ethernet. In SoCC.
- Li et al. (2014) Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In SoCC.
- Mitchell et al. (2013) Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX ATC.
- Mittal et al. (2015) Radhika Mittal, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. TIMELY: RTT-based Congestion Control for the Datacenter. In SIGCOMM.
- Mittal et al. (2018) Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. 2018. Revisiting Network Support for RDMA. In SIGCOMM.
- Mogul and Popa (2012) Jeffrey C Mogul and Lucian Popa. 2012. What we talk about when we talk about cloud network performance. SIGCOMM CCR 42, 5 (2012), 44–48.
- Nelson et al. (2015) Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2015. Latency-tolerant software distributed shared memory. In USENIX ATC.
- Perry et al. (2014) Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. 2014. Fastpass: A centralized zero-queue datacenter network. (2014).
- Popa et al. (2012) L. Popa, G. Kumar, M. Chowdhury, A. Krishnamurthy, S. Ratnasamy, and I. Stoica. 2012. FairCloud: Sharing the Network in Cloud Computing. In SIGCOMM.
- Shieh et al. (2011) Alan Shieh, Srikanth Kandula, Albert Greenberg, and Changhoon Kim. 2011. Sharing the Data Center Network. In NSDI.
- Shreedhar and Varghese (1996) Madhavapeddi Shreedhar and George Varghese. 1996. Efficient fair queuing using deficit round-robin. IEEE/ACM Transactions on Networking 4, 3 (1996), 375–385.
- Stephens et al. (2017a) Brent Stephens, Aditya Akella, and Michael Swift. 2017a. Loom: Flexible and Efficient NIC Packet Scheduling. In NSDI.
- Stephens et al. (2017b) Brent Stephens, Arjun Singhvi, Aditya Akella, and Michael Swift. 2017b. Titan: Fair Packet Scheduling for Commodity Multiqueue NICs. In USENIX ATC.
- Stoica et al. (1997) I. Stoica, H. Zhang, and T.S.E. Ng. 1997. A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time and Priority Service. In SIGCOMM.
- Technologies (2017) Mellanox Technologies. 2017. Mellanox Perftest Package. https://community.mellanox.com/docs/DOC-2802. (2017).
- Technologies (2018) Mellanox Technologies. 2018. Mellanox InfiniBand Switch Systems. http://www.mellanox.com/page/switch_systems_overview. (2018).
- Tsai and Zhang (2017) Shin-Yeh Tsai and Yiying Zhang. 2017. LITE Kernel RDMA Support for Datacenter Applications. In SOSP.
- Wilson et al. (2011) Christo Wilson, Hitesh Ballani, Thomas Karagiannis, and Ant Rowstron. 2011. Better never than late: Meeting deadlines in datacenter networks. In SIGCOMM.
- Yu et al. (2014) Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Zhiheng Huang, Brian Guenter, Huaming Wang, Jasha Droppo, Geoffrey Zweig, Chris Rossbach, Jie Gao, Andreas Stolcke, Jon Currey, Malcolm Slaney, Guoguo Chen, Amit Agarwal, Chris Basoglu, Marko Padmilac, Alexey Kamenev, Vladimir Ivanov, Scott Cypher, Hari Parthasarathi, Bhaskar Mitra, Baolin Peng, and Xuedong Huang. 2014. An Introduction to Computational Networks and the Computational Network Toolkit. Technical Report. Microsoft Research.
- Zhang et al. (2017) Yiwen Zhang, Juncheng Gu, Youngmoon Lee, Mosharaf Chowdhury, and Kang G. Shin. 2017. Performance Isolation Anomalies in RDMA. In KBNets.
- Zhu et al. (2015) Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion control for large-scale RDMA deployments. In SIGCOMM.
Appendix A Hardware Testbed Summary
Table 1 summarizes the hardware we use for different RDMA protocols in our experiments.
| Protocol | NIC | Switch | Link Speed |
|---|---|---|---|
| InfiniBand | ConnectX-3 Pro | Mellanox SX6036G | 56 Gbps |
| InfiniBand | ConnectX-4 | Mellanox SB7770 | 100 Gbps |
| RoCEv2 | ConnectX-4 | Mellanox SX6018F | 40 Gbps |
| RoCEv2 (DCQCN) | ConnectX-4 Lx | Dell S4048-ON | 10 Gbps |
| iWARP | T62100-LP-CR | Mellanox SX6018F | 40 Gbps |
Appendix B Characteristics of Latency- and Throughput-Sensitive Flows in the Absence of Bandwidth-Sensitive Flows
Multiple latency-sensitive flows can coexist without affecting each other (Figure 26). Although latencies increase, everyone suffers equally. All flows experience the same throughputs as well.
Similarly, multiple throughput-sensitive flows receive almost equal throughputs when competing with each other, as shown in Figure 27.
Appendix C 100 Gbps Results With/Without Justitia
Similar to the anomalies observed for 10, 40, and 56 Gbps RDMA networks (§2), Figure 30 and Figure 31 show that latency- and throughput-sensitive flows are not isolated from bandwidth-sensitive flows even in 100 Gbps networks. In these experiments, we use 5MB messages since 1MB messages are not large enough to saturate the 100 Gbps link. Justitia can effectively mitigate the challenges by enforcing performance isolation.
Appendix D Reducing CPU Overhead of Using Small Tokens
Using small tokens leads to CPU overhead, mainly from busy spinning to fetch tokens that are generated at a short period (around 1us), which precludes any context switches. We solve this challenge by decoupling token generation (TG) from token enforcement (TE).
To preserve low CPU overhead, tokens are generated in the Justitia daemon and distributed via IPC sockets using a large token size with a long generation period. Token enforcement happens in Justitia shapers: messages are split into smaller chunks, and a waiting interval is inserted before posting a work request for the next chunk. The longer the waiting interval, the higher the CPU overhead caused by longer busy waiting, and the better the isolation we achieve by allowing more small flows to sneak through during those intervals. For example, if we set the waiting interval to the time it takes to send out one small chunk at the current rate enforced by the pacer, the waiting intervals altogether will span the entire token generation period; this leads to 100% CPU usage. Any shorter interval leads to lower CPU usage at the cost of weaker isolation, and any longer interval fails to maintain the enforced rate. If we denote the waiting interval by $t_w$, the chunk size by $C$, and the rate enforced by the pacer by $R$, we get
\[ 0 \le t_w \le \frac{C}{R}, \]
where the shaper’s CPU overhead can be easily controlled by periodically following hints provided by the pacer via shared memory. The goal is to find the waiting interval that provides acceptable isolation while minimizing CPU cost. To dynamically adjust the waiting interval, Justitia increases the waiting interval from 0 and stops when a significant improvement in the tail latency estimate can no longer be seen.
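The tradeoff and the search procedure in this appendix can be illustrated with a small sketch. It assumes that the busy-wait share of a core grows linearly with the waiting interval, reaching 100% when the interval equals the time needed to send one chunk at the paced rate, and it treats "significant improvement" as a relative gain threshold. The function names, the `measure_tail` callback, and the `min_gain` parameter are all hypothetical.

```python
from typing import Callable


def busy_wait_fraction(wait_s: float, chunk_bytes: int,
                       rate_bytes_per_s: float) -> float:
    """Share of one core spent busy-waiting, assuming one wait per chunk."""
    chunk_time = chunk_bytes / rate_bytes_per_s
    if wait_s < 0 or wait_s > chunk_time:
        raise ValueError("interval must lie between 0 and the per-chunk send time")
    return wait_s / chunk_time


def tune_wait(measure_tail: Callable[[float], float], max_wait_s: float,
              step_s: float, min_gain: float = 0.05) -> float:
    """Grow the wait from 0 until the tail latency stops improving noticeably."""
    wait = 0.0
    best = measure_tail(wait)
    while wait + step_s <= max_wait_s:
        candidate = measure_tail(wait + step_s)
        # Stop once the relative improvement falls below the assumed threshold.
        if best - candidate < min_gain * best:
            break
        wait += step_s
        best = candidate
    return wait
```

In a real deployment the measurement callback would be replaced by the tail latency estimate that the pacer already maintains.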
Note that the above CPU overhead is caused by pacing small chunks in bandwidth-sensitive applications only. Justitia currently minimizes CPU overhead to half of a core (50%) per bandwidth-sensitive application, and adds no CPU overhead to other types of applications.
Appendix E Justitia vs. LITE
LITE (Tsai and Zhang, 2017) is a software-based RDMA implementation that adds a local indirection layer for RDMA in the Linux kernel to virtualize RDMA and enable resource sharing and performance isolation. It can use hardware virtual lanes and also includes a software-based prioritization scheme.
We found that, in the absence of hardware virtual lanes, LITE does not perform well in isolating a latency-sensitive flow from a bandwidth-sensitive one (Figure 32), with worse 99th percentile latency than Justitia. In terms of bandwidth-sensitive flows using different message sizes, LITE performs even worse than native InfiniBand (Figure 33). Justitia outperforms LITE’s software-level prioritization by being cognizant of the tradeoff between performance isolation and work conservation.
Appendix F Open Problems
Interesting short- and long-term future directions of this work include, among others, dynamically determining a flow’s performance requirements, handling multi-modal flows, handling in-network issues, extending to more complicated application- and/or tenant-level RDMA isolation issues, and implementing Justitia logic in programmable NICs.
We highlight a few immediate next steps in the following.
Co-Designing with Congestion Control. Although Justitia effectively complements DCQCN (§6.4.1) in simple scenarios, DCQCN considers only bandwidth-sensitive flows. A key future work would be a ground-up co-design of Justitia with DCQCN (Zhu et al., 2015) or TIMELY (Mittal et al., 2015) to handle all three traffic types for the entire fabric with sender- and receiver-side contentions (§6.7). While network calculus and service curves (Jang et al., 2015; Stoica et al., 1997; Cruz, 1991a, b) dealt with point-to-point bandwidth- and latency-sensitive flows, their straightforward applications can be limited by multi-resource RNICs and throughput-sensitive flows. At the fabric level, exploring a Fastpass-style centralized solution (Perry et al., 2014) can be another future work.
Justitia at Application and Tenant Levels. Currently, Justitia isolates applications/tenants by treating all flows from the same originator as one logical flow with a single type. This is an approximation of Seawall (Shieh et al., 2011). However, for an application with flows with different requirements, this straightforward approach is unlikely to work well.
A possible direction can be exploring Oktopus-style isolation schemes (Ballani et al., 2011), where we first isolate tenants and then apply Justitia inside each tenant. Similar to hierarchical token bucket (HTB) (Intel, 2003), a hierarchical instantiation of Justitia may be able to achieve this. However, unlike HTB, we must deal with conflicting performance requirements and multi-resource RNICs.
Strategyproof Justitia. Applications may not always correctly or truthfully identify their flow types. Augmenting Justitia with DRFQ (Ghodsi et al., 2012) while adding support for multiple parallel RNIC resources – DRFQ considers multiple resources in sequence – and all three traffic types can be interesting future work.