The tremendous growth in the amount of data that must be processed has resulted in a dramatic increase in the demand for memory and storage by data intensive applications such as relational databases, key-value stores, and machine learning model training. A cluster of machines is often required to satisfy the demand of such applications. RDMA is an interconnect technology that provides high throughput and low latency access to remote memory without CPU intervention. Many efforts have been made to fulfill such demands by providing remote memory access through RDMA [46, 44, 38, 39, 5, 4, 28, 40, 7, 8]. However, building such systems with a native RDMA library requires considerable low level knowledge of the RDMA NIC.
Several RDMA optimizations have been introduced through research efforts in many different systems such as key-value stores, transaction systems and distributed file systems [44, 38, 39, 5, 4, 28, 40, 7, 8, 3, 6, 30]. However, those techniques are not fully optimized at the networking stack level, and there is considerable room for improvement. Moreover, those techniques are applied in different systems and have not been compared with each other. This makes it hard for application and system developers to understand which optimization technique is suitable for which system design.
To further improve RDMA performance, we focus on two main problems. First, the zero copy capability of RDMA offloading is a double edged sword. While it makes initiation of I/O operations faster, it can also overwhelm the limited onboard resources in the NIC and lead to inefficient usage of the PCIe bus between the CPU and the NIC (section 4.1). Existing approaches such as Doorbell batching, discussed in recent research, chain multiple requests together and use only one memory mapped I/O (MMIO) for the first request while the remaining requests are transferred via DMA. However, Doorbell batching does not reduce the total number of RDMA operations sent to the NIC (section 5.1). RDMAbox introduces Load-aware Batching, which further improves batching efficiency over Doorbell batching (section 5.1). Load-aware Batching opportunistically looks for multiple adjacent requests that use contiguous memory addresses at the destination and merges them into a single request. Therefore, in addition to reducing the number of MMIOs, Load-aware Batching also reduces the total number of RDMA operations.
Even with batching, the NIC’s limited onboard resources can still be overwhelmed since the RDMA architecture lacks admission control (section 4.1). Recent network congestion control solutions such as Timely can detect NIC overload by including the delay in the NIC in the RTT calculation. But this incurs measurement overhead, and its expensive floating point calculation is not feasible in kernel space. RDMAbox implements a simple traffic regulator for admission control of RDMA I/O to the NIC by utilizing the I/O merge queue of Load-aware Batching (section 5.1). The traffic regulator stops the I/O flow when the merge queue is filled to a configurable amount, which prevents the NIC from being overwhelmed. We show that this simple yet effective traffic regulator provides 30% higher throughput than the case without admission control under heavy RDMA I/O load.
Second, few research efforts have been made on analysis of Work Completion (WC) handling mechanisms in RDMA systems. We first reveal limitations of existing approaches in terms of CPU usage, parallelism and scalability (section 4.2). We then propose a new polling scheme called Adaptive Polling (section 5.2). Adaptive Polling is triggered by completion events instead of running all the time, so it has lower CPU overhead than busy polling. Once triggered, Adaptive Polling will continue to poll the Completion Queue (CQ) up to a configured number of WCs or until no WC is left in the CQ. Therefore, Adaptive Polling has lower polling overhead than busy polling which only polls one WC at a time.
To make RDMAbox easy to use, we package the optimizations in kernel and userspace libraries and present them through simple node level abstractions. We demonstrate the flexibility and effectiveness of RDMAbox by implementing a kernel remote paging system and a userspace file system using RDMAbox. We also conduct extensive performance comparisons and analysis to help application and system developers better understand various tradeoffs and make system design decisions (section 6). Our experiments show that in both the kernel remote paging and userspace file system cases, RDMAbox based implementations significantly outperform their respective previous efforts (section 7). In particular, compared to nbdX, the RDMAbox based kernel remote paging system improves throughput by up to 6.48× and reduces average latency by up to 83% and 99th percentile tail latency by up to 98% in big data workloads, and reduces completion time by up to 83% in machine learning workloads. Our FUSE-based file system using RDMAbox achieves 1.7-6× higher throughput than Octopus, 1.2-2.2× higher than GlusterFS and 1.2-1.6× higher than a FUSE-based file system using Accelio.
The main contributions of this paper include:
low level RDMA optimizations that outperform previous solutions, packaged in easy-to-use APIs for kernel and userspace;
extensive performance test comparison with previous solutions and detailed analysis to help better understand various tradeoffs and make system design decisions;
demonstration of the flexibility and effectiveness of RDMAbox by implementing and providing a kernel remote paging system and a userspace file system as network node level abstraction.
The rest of the paper proceeds with background and related work, the problem statement (section 4), RDMAbox optimizations (section 5), the node level abstraction together with the performance impact and detailed analysis of RDMAbox optimizations under real world application workloads (section 6), evaluation with various applications and workload patterns (section 7) and the conclusion (section 8).
2 RDMA preliminary
Remote Direct Memory Access (RDMA) is the ability to access memory on a remote node without the intervention of its CPU. It provides high bandwidth and low latency. InfiniBand (IB), Internet Wide-Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE) are examples of RDMA implementations today. We describe the terminology and logic flow of RDMA I/O operations here.
A Memory Region (MR) must be registered in advance. The user either copies data to an existing MR or dynamically registers a user-provided buffer as an MR before initiating RDMA I/O operations. This must be done on both the sender and the receiver side, and the two sides must exchange the remote key and the MR address.
A Work Request (WR) is a data structure that describes an RDMA I/O. The sender fills in the destination information, such as the remote MR address and remote key, and the local MR address, which can be given as a starting address and size or as Scatter Gather Entries (SGEs).
A Queue Pair (QP) is a pair of a Send Queue and a Receive Queue. When the sender posts a WR, it must specify the QP connected to the desired destination. The RDMA library converts the WR to a Work Queue Entry (WQE) and puts it into the Send Queue. Then, the CPU writes the WQE into the NIC’s cache through Memory Mapped I/O (MMIO). When the NIC processes the WQE in its WQE cache, it issues a DMA read to get the data from the MR and initiates the transfer on the wire.
The local NIC also creates a Completion Queue Entry (CQE) inside the NIC to signal that message processing (either sending or receiving) has completed. The CQE is then converted to a Work Completion (WC) and put into the Completion Queue (CQ). The sender or receiver can be notified by an event or can poll the CQ voluntarily to get the WC.
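The flow described above can be illustrated with a toy model. This is only a sketch for orientation, not the verbs API: the class and field names (`WorkRequest`, `QueuePair`, `post_send`, `nic_process`, `poll_cq`, `mmio_writes`) are hypothetical, MR registration is omitted, and the NIC is simulated by a plain function call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    """Describes one RDMA I/O (hypothetical field names)."""
    wr_id: int
    local_addr: int
    length: int
    remote_addr: int
    rkey: int


class QueuePair:
    """Toy send side of a QP together with its completion queue."""

    def __init__(self):
        self.send_queue = deque()  # WQEs waiting to be processed by the NIC
        self.cq = deque()          # work completions (WCs)
        self.mmio_writes = 0       # how many times the CPU wrote to the NIC

    def post_send(self, wr):
        # The library turns the WR into a WQE and the CPU writes it via MMIO.
        self.send_queue.append(wr)
        self.mmio_writes += 1

    def nic_process(self):
        # Simulated NIC: consume every WQE and generate one completion each.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.cq.append({"wr_id": wqe.wr_id, "status": "success"})

    def poll_cq(self, max_entries):
        # Return up to max_entries completions, oldest first.
        completions = []
        while self.cq and len(completions) < max_entries:
            completions.append(self.cq.popleft())
        return completions
```

Posting three work requests in this model costs three MMIO writes and later yields three WCs, which is the baseline that the batching techniques discussed later try to improve.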
3 Related Work
RDMA optimizations have been proposed across the different types of systems mentioned in the introduction.
Physical address. FaRM reported that PTE cache misses increase and performance decreases as the amount of registered MR grows. The suggested solution was to use physical addresses to avoid PTE cache misses. LITE also uses physical addresses for MRs by implementing its RDMA abstraction in kernel space. A recent research effort, Storm, suggested Contiguous Memory Allocation (CMA) in user space.
Multi QP. FaRM shows the performance impact of multi-QP optimization by varying the number of QPs. The request rate increases as the number of QPs increases, but it decreases once the NIC runs out of space in its QP cache. Kalia et al. (2016) also use multi-QP optimization to engage multiple NIC Processing Units (PUs).
Selective signaling and inline operation. Unsignaled verbs are used to reduce NIC-initiated DMA (i.e., completion DMA writes), and RECV payloads are inlined in the CQE when the payload is small. Such efforts reduce DMA from the NIC and allow PCIe bandwidth to be used for other operations.
One sided vs two sided. One-sided versus two-sided verbs have been actively discussed for key-value stores, transaction systems and RPC systems [40, 5, 39, 4, 28]. Pilaf and FaRM use one-sided RDMA for GET and PUT in key-value stores. In contrast, HERD suggested using two-sided UD SEND together with the WRITE verb in key-value stores. Since UD is scalable and suits small messages, FaSST also uses two-sided UD SEND for RPC systems. DrTM+H compared one-sided and two-sided verbs in transaction systems through extensive experiments.
Doorbell batch. Doorbell batch is used in many research efforts [3, 28, 4] and is a well-known batching technique that RDMA NICs provide. It helps to reduce the total bandwidth consumption on PCIe by replacing MMIO with DMA reads, because MMIO uses more bandwidth than DMA. We discuss the pros and cons of Doorbell batch and propose Load-aware batching to further optimize the performance of the RDMA NIC for various purposes (section 5.1).
Memcpy vs MR registration. Frey et al. (2009) pointed out that MR registration for a small MR (<256KB in their report) has significant overhead compared to memcpy into a pre-allocated and registered MR. However, this holds only in user space with virtual addresses. We report different results with physical addresses in kernel space and provide this as an option in our design (section 5.1).
4 Software Challenges
4.1 I/O thrashing on NIC
Thanks to the asynchronous nature of RDMA, applications can maximize CPU efficiency by posting multiple I/Os in parallel. However, increasing the number of parallel I/Os can cause a bottleneck in the NIC due to limited resources such as the WQE cache [5, 3], and it leads to inefficient use of PCIe bandwidth.
When a WR is posted, the CPU writes the WQE (converted from the WR) to the NIC with MMIO. To post N WRs, the CPU needs to perform N MMIO writes to the NIC. Many single I/O postings lead to inefficient use of PCIe bandwidth between the NIC and the CPU. Another problem is that NIC resources are limited, such as the WQE cache and the Memory Protection Table (MPT), which stores permission information for each MR, so many parallel single I/O postings are likely to cause a NIC bottleneck (Figure 1).
To quantify the performance impact of many parallel single I/Os on the NIC, we build a remote memory system with a Linux block device and RDMA. This virtual block device is connected to remote nodes in the cluster through RDMA. One can mount this virtual block device and provide remote memory through a POSIX file interface. We run the FIO benchmark to measure IOPS while varying the number of FIO threads. At first, IOPS increases as the number of data request threads increases. Beyond a certain point, IOPS starts to drop as more threads are added (Figure 0(a)). This shows that posting many parallel single I/Os can increase performance at first, but it can also cause a NIC bottleneck once the number of I/Os exceeds the NIC’s capacity. Merging I/Os across data request threads can further reduce the total number of I/Os and improve performance, but enforcing cross-CPU/thread merging has significant overhead. For instance, this is why the Linux block layer does not provide cross-CPU I/O merging. The benefit of parallel processing would be offset by the latency of merge-checking and batching. Although we give a kernel space example, applications that directly use an RDMA library in user space face the same I/O thrashing issue on the NIC due to many parallel I/Os.
Lack of Admission Control for mitigating NIC bottleneck.
Without proper management of I/O thrashing on the NIC, negative factors such as WQE cache misses cause a NIC bottleneck. This can happen even when the network is not congested. We run the same FIO benchmark on our prototype implementation of the virtual block device and measure in-flight operations and RDMA I/O completion time while increasing the number of FIO threads (Figure 1). We use only a client and a server node connected to a single switch, so there is no network congestion. When FIO’s IOPS reaches its peak with 4 FIO threads, both the RDMA I/O completion time and the number of in-flight operations are still increasing. This shows that the NIC becomes the bottleneck and takes longer to process RDMA I/O requests as the user side issues more parallel I/Os. The lack of RDMA I/O level admission control in the kernel layer can easily make the NIC a bottleneck. Although Timely, unlike other network congestion control schemes, can detect a NIC bottleneck, it is not suitable for kernel space because of the expensive floating point operations in its gradient based rate calculation. Moreover, operations are simply blocked while the traffic is paced, so the waiting time is wasted.
4.2 Limitation in WC handling approaches
Existing approaches have limitations in terms of CPU usage, parallelism and scalability. We briefly explore existing Work Completion handling approaches in RDMA systems here.
Busy polling [3, 4, 5] is used to maximize performance at the cost of CPU overhead. Busy polling reduces polling latency and improves performance, but it burns CPU even when I/O is idle. As the number of remote connections increases, CPU overhead also increases linearly.
Event-triggered mode has no CPU overhead but has longer latency than busy polling due to context switches and interrupt delay. Unlike Busy polling, Event-triggered mode polls in interrupt context. One interrupt context is required to process one WC item.
Event batch extends event-triggered mode. It is similar to Linux NAPI, which uses a finite budget for batched processing. When an event is triggered, it polls up to N times per event and can get K WC items, where 0 ≤ K ≤ N. It polls in interrupt context just as Event-triggered mode does, but it reduces the number of interrupts needed to process K WCs from K to 1. In this way, it improves performance compared to Event-triggered mode. After processing K WCs, Event batch goes back to Event-triggered mode even if other WCs arrive in the CQ slightly later. Event batch therefore cannot catch them, and another interrupt context is consumed to process them.
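The difference in interrupt count between Event-triggered mode and Event batch can be made concrete with a small simulation. This is a sketch under a simplifying assumption that all WCs are already in the CQ when the first event fires; `count_interrupts` and `budget` are hypothetical names, with `budget` playing the role of N.

```python
from collections import deque


def count_interrupts(num_wcs, budget):
    """Count the interrupt contexts needed to drain num_wcs completions.

    Each interrupt context may process at most `budget` WCs. A budget of 1
    models Event-triggered mode; a larger budget models Event batch.
    """
    cq = deque(range(num_wcs))
    interrupts = 0
    while cq:
        interrupts += 1              # one event, one interrupt context
        for _ in range(budget):
            if not cq:
                break
            cq.popleft()             # process one WC
    return interrupts
```

With 10 pending WCs, `count_interrupts(10, 1)` returns 10 while `count_interrupts(10, 4)` returns 3. The model does not capture late arrivals, which is exactly the case where Event batch pays for an extra interrupt.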
One Shared CQ with busy polling is an extension of the Busy polling case. Since the Busy polling approach has too much CPU overhead as the number of remote connections increases, this approach uses only one shared CQ (SCQ for the rest of the paper) and one busy polling thread in the system. It reduces CPU overhead compared to N Busy polling threads. However, since it relies on one serialized shared CQ and one busy polling thread, it reduces parallelism (section 6.2).
Hybrid (event + busy) switches between event mode and busy polling using a timer of static length. It has two drawbacks, one for each of two WC load patterns. First, if WCs arrive with a time gap of at least Timer, it burns CPU needlessly just to process 1 WC. If the number of CQs increases with many node connections, the CPU overhead becomes significant. Second, even with a burst of WCs, busy polling must return to event mode after Timer and incurs an unnecessary context switch and interrupt.
5 RDMAbox optimizations
We provide an overview of our optimization techniques in this section. For more information about low level APIs, we share our open source project on GitHub.
5.1 Load-aware Batching
Avoiding unnecessary batching latency. RDMAbox introduces a single merge queue each for writes and reads to perform cross-CPU/thread I/O merging at the RDMA sending level, under two major rules. First, it merges adjacent requests that have the same destination. This helps to reduce the total number of RDMA I/Os to the NIC. Second, it batches differently based on the load on the merge queue. To avoid the additional latency of enforced cross-thread I/O merging, parallel data request threads put their requests into the merge queue and do merge-checking right away (Figure 2). The earliest arriving thread checks the merge queue first. If more than one request can be merged, the earliest arriving thread merges the data requests. The later arriving threads that originally brought those requests into the merge queue simply return since no work is left in the queue. If a request arrives alone later, its data request thread posts a single RDMA I/O immediately (Figure 3). Instead of enforcing merging, Load-aware batching allows merging to happen only when requests pile up in the merge queue because of many concurrent requests under heavy workload. Otherwise, each thread posts a single RDMA I/O to send its own request. In this way, Load-aware batching can merge requests across threads without hurting parallelism. It also avoids additional batching latency by batching only when requests are already waiting.
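The enqueue-then-check behavior described above can be sketched as follows. This is an illustrative model rather than the RDMAbox implementation: the class `MergeQueue`, its methods and the `post` callback are hypothetical names, and the actual merging of the drained requests is left to the callback.

```python
import threading


class MergeQueue:
    """Toy merge queue: whoever checks first takes everything that is pending."""

    def __init__(self, post):
        self._lock = threading.Lock()
        self._pending = []
        self._post = post  # hypothetical callback that posts one batch as RDMA I/O

    def enqueue(self, request):
        with self._lock:
            self._pending.append(request)

    def drain(self):
        # Take all pending requests at once. A thread that arrives after an
        # earlier thread has already drained the queue finds nothing and returns.
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._post(batch)

    def submit(self, request):
        # What each data request thread does: enqueue, then merge-check right away.
        self.enqueue(request)
        self.drain()
```

Under light load every `submit` drains a batch of one, so no thread ever waits for a batch to fill. Batches larger than one appear only when several requests are pending at the moment one thread drains the queue.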
Reducing cost of RDMA I/O to NIC. On top of Load-aware batching, RDMAbox supports Batching-on-MR to reduce the number of RDMA I/Os and Doorbell Batching to save PCIe bandwidth. Both techniques are implemented on top of Load-aware batching as a hybrid approach to reduce the I/O cost of RDMA. Batching-on-MR first tries to merge adjacent data requests that are virtually contiguous in remote memory. By merging multiple requests into one RDMA WQE, Load-aware batching saves PCIe bandwidth between the NIC and the CPU by reducing the number of MMIOs to the NIC. For instance, if N requests are merged into 1 WQE, we save the corresponding space in the NIC’s WQE cache and reduce N MMIOs to 1 MMIO on PCIe. If neighboring requests are not adjacent, they can be chained as a doorbell batch to save some PCIe bandwidth.
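A minimal sketch of the merging step follows, assuming each request to one destination is represented as a `(remote_addr, length)` pair. The function name `merge_contiguous` and the choice to sort by remote address before merging are assumptions made for illustration.

```python
def merge_contiguous(requests):
    """Merge requests whose remote ranges touch end to start.

    requests: list of (remote_addr, length) tuples for one destination.
    Returns one (remote_addr, length) tuple per resulting WQE.
    """
    merged = []
    for addr, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == addr:
            # Contiguous with the previous range: grow it instead of adding a WQE.
            prev_addr, prev_len = merged[-1]
            merged[-1] = (prev_addr, prev_len + length)
        else:
            merged.append((addr, length))
    return merged
```

For example, three 4KB requests at remote addresses 0, 4096 and 8192 plus one at 65536 collapse into two WQEs, `(0, 12288)` and `(65536, 4096)`. In the hybrid approach those two non-adjacent WQEs would then be chained as one doorbell batch.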
Comparison with Doorbell batching.
Doorbell batching does not reduce the total number of I/Os to the NIC. Doorbell batching connects multiple WRs in a linked list and posts the first WR with MMIO, so only the first WR is written to the NIC through MMIO. The rest of the chained WRs remain in memory and are fetched by the NIC through DMA reads. Posting N WRs with Doorbell batch requires one MMIO and N-1 DMA reads. This saves some bandwidth but does not reduce the total number of RDMA I/Os (WQEs) to the NIC. For instance, in Figure 3, if only doorbell batching is used, the first group is batched as a doorbell batch but its number of RDMA I/Os is the same as in the single I/O case. We provide performance measurements that compare the approaches with real world workloads in section 6.1.
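The accounting in this comparison can be written down directly. The sketch below only restates the counts given in the text for N requests; the function name `posting_cost` and the mode labels are hypothetical, and the `merged` mode assumes all N requests are contiguous and fit into a single WQE.

```python
def posting_cost(n, mode):
    """Return the WQE, MMIO and DMA-read counts for posting n requests."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if mode == "single":
        # Every request is its own WQE, written by its own MMIO.
        return {"wqes": n, "mmios": n, "dma_reads": 0}
    if mode == "doorbell":
        # One MMIO for the first WR; the NIC fetches the rest with DMA reads.
        return {"wqes": n, "mmios": 1, "dma_reads": n - 1}
    if mode == "merged":
        # Assumes all n requests are contiguous and merge into one WQE.
        return {"wqes": 1, "mmios": 1, "dma_reads": 0}
    raise ValueError(f"unknown mode: {mode}")
```

For eight requests, doorbell batching cuts MMIOs from 8 to 1 but still hands the NIC 8 WQEs, whereas merging hands it a single WQE.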
Pre-registered MR vs dynamic MR registration.
On top of Load-aware batching, there are two ways to batch on MR. One uses a pre-allocated and registered MR (preMR for the rest of the paper), and the other uses dynamic MR registration on the data buffer (dynMR for the rest of the paper) with Scatter Gather Entries (SGEs). PreMR avoids allocation and MR registration cost but entails a memcpy from the data buffer to the MR. DynMR has no allocation cost but incurs MR registration cost. Frey et al. (2009) reported that memcpy is faster than MR registration for small memory regions (<256KB) in user space. Since the virtual address of each page is used for MR registration in user space, the overhead of storing address translations and of the PTE cache in the NIC is larger than the cost of copying to a pre-registered MR for small memory regions. However, we found that the result is different for kernel space MR registration. Unlike in user space, copy cost dominates at all MR sizes in kernel space (Figure 3(a)), because physical addresses are used to register MRs in kernel space, which avoids the overhead of address translation and the PTE cache on the NIC. In user space, the threshold above which MR registration becomes better is 928KB in our measurement (Figure 3(b)). Therefore, we recommend dynMR for all message sizes in kernel space solutions and a mixed approach of dynMR and preMR based on the threshold in user space.
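The recommendation above amounts to a simple selection rule, sketched here. The function name, the parameter names and the choice to treat a size exactly equal to the threshold as dynMR are assumptions; the 928KB default is the measured user space threshold reported above and would need to be re-measured on other hardware.

```python
USER_SPACE_THRESHOLD_BYTES = 928 * 1024  # user space threshold measured in the text


def choose_mr_strategy(size_bytes, kernel_space,
                       threshold_bytes=USER_SPACE_THRESHOLD_BYTES):
    """Pick 'dynMR' or 'preMR' for one message."""
    if kernel_space:
        # Registration uses physical addresses, so the copy cost dominates
        # at every message size.
        return "dynMR"
    # User space: copy into a pre-registered MR below the threshold,
    # register the buffer dynamically at or above it.
    return "dynMR" if size_bytes >= threshold_bytes else "preMR"
```

A 128KB message is therefore sent with dynMR in a kernel space deployment and with preMR in a user space deployment.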
RDMA I/O level Admission Control.
Queueing I/O traffic when an I/O channel is congested is a simple and widely used design in many networking systems. A multi-queue based traffic pacer, however, has synchronization cost and fairness issues [41, 42]. Our rule is simple: we want a single-queue-based traffic regulator that does not add an extra layer of queues. By implementing the regulator on the single RDMA I/O merge queue, RDMAbox avoids both an extra queue layer and the overhead of a multi-queue design. RDMAbox uses a window based in-flight data limiter with page granularity. The window size can be set up to an upper limit given by the NIC’s capability. This upper limit is configurable at RDMAbox initialization time. The fragmentation size is also adjustable. In the remote paging system example, it is equal to the block I/O size.
The benefit of this design is that it puts the time spent waiting in a queue to use. When a traffic regulator blocks I/O traffic, requests normally just wait in a queue doing nothing, which is usually a necessary waste of time. Since our regulator is implemented on the RDMA I/O merge queue, it gets an extra chance to merge neighboring requests while pacing the traffic. Load-aware batching can combine requests while they are queued and reduce the number of RDMA I/Os and MMIOs, which in turn helps to relieve the NIC bottleneck. RDMAbox also provides a hook to implement a custom admission control policy. For now, we use a static window size for the traffic regulator in our prototype, since our goal in this paper is not to build a complete traffic shaping or network congestion control algorithm. Our simple implementation works well and serves the purpose (Figure 8). Further, the software hook allows RDMAbox to extend its capability by implementing existing network congestion control solutions [34, 37].
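A window based in-flight limiter with page granularity can be sketched as below. This is a simplified model with a static window, not the RDMAbox code: the class `TrafficRegulator`, its method names and the 4KB page size are assumptions, and a rejected request is assumed to stay in the merge queue where it may still be merged with its neighbors.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_for(nbytes):
    """Number of pages needed to hold nbytes (rounded up)."""
    return -(-nbytes // PAGE_SIZE)


class TrafficRegulator:
    """Static-window limiter on the amount of in-flight RDMA data."""

    def __init__(self, window_pages):
        self.window_pages = window_pages
        self.in_flight_pages = 0

    def try_admit(self, nbytes):
        pages = pages_for(nbytes)
        if self.in_flight_pages + pages > self.window_pages:
            return False  # window full: the request keeps waiting in the queue
        self.in_flight_pages += pages
        return True

    def complete(self, nbytes):
        # Called when the WC for an admitted request arrives.
        self.in_flight_pages -= pages_for(nbytes)
```

With a window of 4 pages, two 8KB requests are admitted and a third request is held back until one of the first two completes.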
5.2 Adaptive Polling
Adaptive Polling. We propose Adaptive polling and discuss the design decisions in this section. Adaptive polling has three main advantages compared to previous approaches.
First, it has better parallelism than SCQ. In Adaptive polling, RDMAbox has N CQs for N RDMA channels. This gives better parallelism than a single serialized SCQ processed by one busy polling thread (N CQs vs 1 CQ).
Second, Adaptive polling optimizes throughput with conditional batch polling. Unlike busy polling, which tries to poll one WC at a time, batch polling tries to get N WCs at a time, where N is a pre-defined value. Polling is triggered only when an event is detected and runs until there are no more WCs in the CQ. It then immediately goes back to event mode while I/O is idle and no WC items arrive in the CQ.
Third, Adaptive polling has less CPU overhead than Busy polling. By returning immediately to event mode when no WC items arrive, it causes less CPU overhead than typical Busy polling, which runs all the time regardless of whether any WC is in the queue. The difference in CPU overhead between Adaptive polling and Busy polling grows as the number of completion queues increases. In this sense, Adaptive polling is also suitable for receiver-side RDMA system design. In contrast, Busy polling and SCQ with busy polling are not suitable choices for a CPU bypassing design in a receiver-side RDMA system, because they always burn CPU on the receiver side.
Tunable behavior of Adaptive Polling. We implement a microbenchmark with Adaptive polling to check its adaptive behavior. We set up a one-to-one connection between 2 nodes. To isolate the impact of polling, we use synchronous I/O with one QP: the next RDMA I/O is posted when the WC arrives. We then measure bandwidth, CPU usage, interrupts and context switches during 1 million 4KB write operations (Figure 5). To measure CPU usage accurately, we pin the throughput calculation and log-printing jobs to different cores so that they do not affect the measured CPU usage.
Adaptive polling acts more like Busy polling as MAX_RETRY becomes larger and more like Event mode as MAX_RETRY becomes smaller. MAX_RETRY is a user-tunable parameter that decides how many times batch polling repeats to detect an incoming WC. As MAX_RETRY increases, CPU usage increases because the duration of busy polling also increases (Figure 4(b)). Interestingly, the CPU usage of Adaptive polling is lower than that of Busy polling even when its bandwidth reaches the same peak as Busy polling. This is because it quickly goes back to event mode when I/O is not busy and returns to the interrupt handler when the next WC is detected (note that interrupts still occur at MAX_RETRY=120 in Figure 4(a) and 4(b)). This means that Busy polling burns CPU needlessly to achieve this performance.
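One possible reading of this polling loop is sketched below. It is an interpretation, not the RDMAbox source: the function `adaptive_poll` and its parameters are hypothetical, the CQ is modeled as a `deque`, and MAX_RETRY is interpreted as the number of consecutive empty batch polls tolerated before returning to event mode.

```python
from collections import deque


def adaptive_poll(cq, batch_size, max_retry, handle):
    """Run once per completion event; returns the number of WCs handled.

    cq         : deque of pending work completions
    batch_size : maximum number of WCs taken per batch poll
    max_retry  : consecutive empty polls allowed before giving up (assumed meaning)
    handle     : callback invoked for every WC
    """
    handled = 0
    empty_polls = 0
    while empty_polls < max_retry:
        got = 0
        while cq and got < batch_size:
            handle(cq.popleft())
            got += 1
        if got:
            handled += got
            empty_polls = 0   # work was found, so keep polling
        else:
            empty_polls += 1  # nothing arrived during this poll
    # The caller would re-arm the completion event here and go back to event mode.
    return handled
```

A large `max_retry` keeps the loop spinning through idle gaps, as busy polling does, while a small value makes it fall back to event mode almost immediately.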
Dealing with intermittent and burst load. Adaptive Polling behaves well with both intermittent and burst load (section 4.2). First, with intermittent load, if the time gap between two WCs exceeds the polling duration allowed by MAX_RETRY, it behaves like event mode and incurs less CPU overhead. Since only a few WCs arrive, the accumulated latency difference between the event-like behavior of Adaptive Polling and busy polling is small. Second, with burst load, it does not return to event mode as long as the burst is dense enough for Adaptive Polling to keep detecting the following WCs.
6 RDMAbox: Node level abstraction
RDMAbox also provides a node level abstraction that gives user applications transparent access to remote memory. To achieve this, RDMAbox implements a virtual block device that is connected to remote nodes and manages remote resources, data distribution and tracking, and connections. For instance, a user can mount the RDMAbox block device on a directory and access remote memory through a POSIX file interface (Remote File System), or set it as swap space for a Remote Paging System. One of the main benefits of the node level abstraction is that it does not require modification of user applications or the OS. For the remote server daemons that provide and manage remote memory, we follow the design of a recent research effort.
We implement a Remote Paging System with the RDMAbox node level abstraction as an example of a kernel-space library. We show the performance impact of the optimizations and provide a detailed analysis using the Remote Paging System example.
6.1 Load-aware batching
Setup and methodology. We use a one-to-one connection to run VoltDB with YCSB. We create 20GB ETC (read heavy) and SYS (write heavy) Facebook workloads with Zipfian distribution. We set a container limit so that only 25% of the workload stays in local memory while the remaining 75% is swapped out and distributed. We use a 128KB block size.
Comparison of approaches.
We compare the performance of the approaches in Figure 6. In summary, Batching on MR shows 23.6-24.4% and 11.2-11.5% improvement over single I/O with preMR and dynMR, respectively, across the ETC and SYS workloads. Hybrid batch shows 22.2-47.7% and 15.7-40.5% improvement over single I/O with preMR and dynMR, and 10.8-22.2% and 7.5-13.4% over doorbell batch with preMR and dynMR, across ETC and SYS. We also provide our observations here.
(1) Single I/O vs Batching on MR. Batching on MR shows better performance than Single I/O (Figure 6). This is because the number of RDMA I/Os is reduced by batching. Table 1 reports the total accumulated number of RDMA I/Os measured while running the VoltDB ETC workload. It shows that batching on MR substantially reduces the number of RDMA I/Os. With fewer RDMA I/Os, fewer WQEs are sent to the NIC and fewer MMIOs are issued by the CPUs. N WQEs and N MMIOs can be reduced to one WQE and one MMIO by batching N requests into one.
(2) Doorbell vs Batching on MR. The performance of Doorbell batch is slightly lower than that of Batching on MR because it does not reduce the number of RDMA I/Os to the NIC. Under heavy workload, Doorbell batch can cause a bottleneck in the NIC just as single I/O does, since it sends the same number of RDMA I/Os.
(3) Hybrid approach. The advantage of the hybrid approach, which combines Load-aware batching on MR with doorbell batching, is that the two techniques optimize different things. Batching on MR reduces the number of RDMA I/Os to the NIC, and doorbell batch reduces bandwidth consumption. The conditions under which batching happens also differ: Batching on MR applies to adjacent requests, while doorbell batching can chain non-adjacent requests. The hybrid approach shows the highest performance of all approaches. RDMAbox uses this hybrid approach by default.
Batch: Batching-on-MR only, with preMR or dynMR.
Door: Doorbell batching only, with preMR or dynMR.
Hybrid: Batching-on-MR + Doorbell, with dynMR.
Table 1 (total number of RDMA I/Os), columns: Single preMR | Batch preMR | Single dynMR | Batch dynMR | Doorbell dynMR | Hybrid dynMR
Latency concerns about Batching. Batching might raise concerns about long tail latency. We measured the 99th percentile latency of VoltDB with the various batching approaches. Figure 7 shows that Load-aware batching has no negative impact on application latency because RDMAbox does not enforce complete batching. It also helps to alleviate the NIC bottleneck by reducing RDMA I/Os and saving PCIe bandwidth. DynMR batch shows shorter latency than preMR batch because it removes the copying cost from the critical path. The hybrid approach shows the shortest latency of all because both batching-on-MR and doorbell batching help to reduce the cost of RDMA I/O.
Multi-channel optimization. RDMAbox adopts multi-channel optimization to maximize parallelism. The number of channels per remote node is adjustable at initialization time. The number of QPs in the system is K × N, where K is the number of QPs per remote node and N is the number of connected remote nodes. Each channel has its own QP in a dedicated context to avoid false synchronization and the limited parallelism caused by sharing QPs [28, 4]. In our experiment, a setting of 4 channels per remote node gives the best result (see Figure 11).
First, parallelism is improved by the multi-QP optimization (4 QPs). The peak IOPS is now reached at 7 FIO threads, compared to 4 FIO threads in the 1 QP case in Figure 1. Multiple processing units in the NIC engage multiple QPs, which yields better parallelism. With this, the performance drop point moves to 7 FIO threads. Comparing peak performance, multi-QP improves IOPS by 63.8% over the case without it (Figure 1).
Second, we measure the in-flight byte size at the peak point (7 FIO threads) and use it as the window size of the traffic regulator to show the effectiveness of RDMA I/O level admission control. By pacing in-flight I/O traffic with a window of about 7MB (Figure 7(b)) in this experiment, IOPS keeps increasing even with more than 7 FIO threads (Figure 7(a)). In-flight bytes become stable with the traffic regulator, and the peak performance improves by 29.9%.
6.2 Adaptive Polling
(N is number of peer nodes.)
Event : Event-triggered mode with N CQs.
EventBatch : Batched Event-triggered mode with N CQs.
Busy : N busy polling threads with N CQs.
SCQ(M) : M busy polling thread(s) with M SCQ(s).
AdaptivePoll : Adaptive Polling with N CQs.
We use one local node and N remote nodes to run VoltDB with the SYS workload, which has both intermittent and burst I/O load (section 4.2). The rest of the setting is the same as in section 6.1, except that we use Single I/O with preMR for this experiment. We use the CPU-intensive VoltDB to evaluate the CPU usage impact of different polling approaches. The SYS workload has more write traffic, which causes more CPU activity in the OS block layer due to I/O merging. We use the run-to-completion thread model. Since preMR has more work to do in the WC handling context than the other options, we choose preMR to show clearer performance differences among the polling approaches. We set one channel (QP) per remote node, and each channel has its own CQ except in the SCQ (Shared CQ) case. Note that the Hybrid case is not included in the comparison because it has neither open source code nor an algorithm description in its paper.
Poor scalability of Busy polling due to CPU overhead. Busy polling shows the best result when it runs with few remote nodes(Figure 8(a)). CPU overhead until 4 busy polling threads does not affect the performance. Benefit of busy polling is larger than CPU overhead of 4 busy polling threads. When the number of peer nodes increases, throughput severely drops due to CPU overhead(Figure 8(b)). This CPU overhead affects application(VoltDB) performance.
Reasonable scalability of Event-triggered mode with increasing connections. Event-triggered mode shows higher throughput than Busy polling with many peer node connections, because it does not incur the CPU overhead that Busy polling does in that setting (Figure 8(b)).
Analysis of Shared CQ. A Shared CQ is shared by all remote connections in a host node. One busy polling thread with one Shared CQ, denoted SCQ(1), performs better than the Busy polling case with many peer node connections (peers8). Compared to N busy polling threads with N CQs, SCQ(1) has lower CPU overhead (Figure 8(b)). With a few peer connections, SCQ(1) performs better than Event mode. With many peer connections, however, SCQ(1) performs worse than Event mode. Although Event mode is slowed by interrupt latency and context switches, it has more CQs and more parallel processing contexts (N CPUs) than SCQ(1) with many peer connections (peers8). SCQ(1) can become a bottleneck because all WCs are enqueued into the one shared CQ, whereas in Event mode WCs are spread over N CQs. As a result, Event mode performs slightly better than SCQ(1) with many peer connections. The parallelism of SCQ is limited further with the run-to-completion thread model and/or replication, because run-to-completion increases the processing time of each WC and replication increases the number of WCs to process.
Number of polling threads on an SCQ vs. throughput performance. The performance difference between SCQ(1) and Event mode raises a question about parallelism: Event mode performs worse than SCQ(1) with a small number of connections, but SCQ(1) performs worse with many connections. We measure the performance of VoltDB (same setting as in Figure 9) while increasing the number of busy polling threads on M SCQs in Figure 11. Generally speaking, performance decreases as the number of busy polling threads on the SCQ(s) increases, due to CPU overhead. SCQ(1) with 2 busy polling threads shows slightly higher throughput than SCQ(1) with 1 busy polling thread, but CPU overhead dominates after 4 polling threads regardless of the number of shared CQs.
Parallelism vs. CPU overhead.
Would increasing the number of SCQs help? Would multiple SCQs remove the bottleneck and the limit on parallelism? To find out, we increase the number of SCQs in SCQ(M) (Figure 11). Although SCQ(2) performs slightly better than SCQ(1), it still performs worse than Event mode with many connections (peers8) in Figure 8(a), which means that SCQ(2) is still limited in parallelism. Adding busy polling threads on each SCQ decreases performance due to CPU overhead, as Figure 11 shows. Adding more SCQs does not help either, because it also increases the number of busy polling threads and therefore the CPU overhead (Figure 8(b)).
Advantages of Adaptive polling. Adaptive polling shows better parallelism than SCQ: Figure 8(a) shows that it gives higher throughput than SCQ with many connections. For the same reason, Event mode and EventBatch also perform better than SCQ(1) and SCQ(2) with many connections. Adaptive polling also shows better performance and lower CPU overhead than the N Busy polling case. Although Adaptive polling has slightly higher CPU overhead than the event-based approaches, we observe that this overhead does not affect application performance.
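This section reports results for Adaptive polling without restating its algorithm, so the following is only one plausible shape of a polling loop that sits between event-triggered and busy polling, offered as an assumption rather than the RDMAbox design. A `queue.Queue` stands in for a CQ, a blocking `get` stands in for waiting on a completion event, and the function name and the `max_empty_polls` budget are hypothetical.

```python
import queue
from typing import Callable


def adaptive_poll_once(cq: "queue.Queue[int]",
                       handle: Callable[[int], None],
                       max_empty_polls: int = 100,
                       event_timeout: float = 0.1) -> int:
    """Run one event-to-rearm cycle and return the number of completions handled."""
    handled = 0
    try:
        # Event-triggered phase: sleep until a completion arrives
        # (stand-in for waiting on a CQ event; no CPU is burned here).
        wc = cq.get(timeout=event_timeout)
    except queue.Empty:
        return handled
    handle(wc)
    handled += 1

    # Busy-polling phase: keep polling without sleeping while the burst lasts.
    empty_polls = 0
    while empty_polls < max_empty_polls:
        try:
            wc = cq.get_nowait()
        except queue.Empty:
            empty_polls += 1   # nothing there; count toward giving up
            continue
        handle(wc)
        handled += 1
        empty_polls = 0        # burst continues, so reset the idle budget

    # Idle budget exhausted: return so the caller can go back to event mode.
    return handled
```

Under this assumed design, a burst is drained at busy-polling speed, while an idle connection costs at most `max_empty_polls` empty polls before the thread sleeps again, which is consistent with CPU overhead slightly above the event-based modes and well below N busy polling threads.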
7 Evaluation with applications
To evaluate the flexibility and effectiveness of the RDMAbox optimizations, we implement a kernel remote paging system and a user-space network file system using the node-level abstraction. We report our empirical study and comparison results in this section. First, we show that the RDMAbox-based remote paging system outperforms nbdX, the most recent representative solution, with up to 6.48× throughput improvement and up to 83% decrease in average tail latency in bigdata workloads, and up to 83% reduction in completion time in machine learning workloads. Second, the RDMAbox-based user-space file system on top of FUSE achieves 1.6× to 6× throughput improvement over existing state-of-the-art solutions, represented by Octopus, GlusterFS and a FUSE-based file system using Accelio.
7.1 Remote Paging System with RDMAbox
We evaluate the impact of our optimizations on user-level applications using seven memory-intensive applications. We use three big-data applications (MongoDB, VoltDB and Redis) and four ML workloads (Logistic Regression, Gradient Boosting classification, K-means and TextRank). All experiments are performed on Cloudlab nodes with Xeon E5-2650v2 processors (32 virtual cores at 2.6GHz), 64GB of 1.86GHz DDR3 memory, 1TB SATA 3.5” rpm hard drives and Mellanox ConnectX-3 NICs. We run the applications in containers on the host to create swap traffic for our remote paging system. We first measure the peak memory of each application and then set the container memory limit so that 50% or 25% of the working set fits in memory. Each application runs with a total in-memory working set of 22GB up to 50GB. Finally, we compare our remote paging system on RDMAbox with the most performant existing remote paging system (nbdX with Accelio) using two block I/O sizes. The block I/O size was initially 128KB, but the latest version sets it to 512KB for better performance. We deploy the applications in a container on the host node and use 3 remote peer nodes as memory donors. For RDMAbox, we replicate to 2 remote nodes and to disk. Disk access occurs only when all remote replication fails.
7.1.1 BigData workload performance
MongoDB is a general-purpose, document-based, distributed NoSQL database. VoltDB is an ACID-compliant in-memory transactional database. We configure both VoltDB and MongoDB as in-memory-only databases. Redis is an in-memory distributed caching system with a key-value interface. We choose these applications because they are popular and use indexing strategies for efficient in-memory computing, which requires memory for indices as well as for the dataset and makes the workloads memory-intensive. These applications perform well when the working set is in memory but suffer performance degradation when the host node runs out of memory and starts to swap to disk. We use YCSB with a Zipfian distribution to create the simulated Facebook workloads ETC and SYS. ETC has 95% reads and 5% writes, and SYS has 75% reads and 25% writes. We first populate each application with 10 million records and then run 10 million queries to create a 15GB to 22GB working set.
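The two workload mixes can be written down as YCSB core-workload properties. The sketch below generates them from the numbers in the text; the property keys (`recordcount`, `operationcount`, `readproportion`, `updateproportion`, `requestdistribution`) are standard YCSB core properties, but treating every write as an update rather than an insert is an assumption, as are the helper names.

```python
from typing import Dict, Union

PropValue = Union[int, float, str]


def ycsb_properties(read_percent: int,
                    records: int = 10_000_000,
                    operations: int = 10_000_000) -> Dict[str, PropValue]:
    """Build YCSB core-workload properties for a read/write mix (hypothetical helper)."""
    if not 0 <= read_percent <= 100:
        raise ValueError("read_percent must be between 0 and 100")
    return {
        "recordcount": records,          # records loaded before the run
        "operationcount": operations,    # queries issued during the run
        "readproportion": read_percent / 100,
        # Assumption: all writes are modeled as updates of existing records.
        "updateproportion": (100 - read_percent) / 100,
        "requestdistribution": "zipfian",
    }


def to_properties_text(props: Dict[str, PropValue]) -> str:
    """Render the properties in key=value form, one per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Read/write mixes taken from the text: ETC is 95/5 and SYS is 75/25.
WORKLOADS = {"ETC": ycsb_properties(95), "SYS": ycsb_properties(75)}
```

Calling `to_properties_text(WORKLOADS["SYS"])` yields text that can be saved as a workload file and passed to YCSB together with the binding for the database under test.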
Figure 12 shows that RDMAbox outperforms nbdX+Accelio (with the 2 different block I/O sizes) by up to 3.87× and 4.74× in MongoDB throughput, 4.01× and 6.48× in VoltDB throughput, and 2.73× and 4.33× in Redis throughput. Importantly, the gap between RDMAbox and the other systems grows as they rely more on remote memory (more swapping out). RDMAbox also shows lower latency in both the average and the 99th percentile tail. Compared to RDMAbox, the latency of nbdX+Accelio increases by up to 5.24× and 6.12× on average, and by up to 45× and 66× at the 99th percentile tail.
7.1.2 ML workload performance
We evaluate RDMAbox with popular ML workloads on real datasets [23, 24, 25], which include 4 to 87 million samples. In Figure 13, the completion time of model training with nbdX+Accelio is higher than with RDMAbox by up to 2.83× and 2.73× in Logistic Regression, 1.5× and 1.54× in Gradient Boosting classification, 1.8× and 2.28× in K-means, and 4.62× and 6.08× in TextRank. This shows that memory-hungry workloads such as TextRank benefit more from RDMAbox. Compute-intensive workloads such as K-means and Gradient Boosting show a smaller performance gap than the other workloads.
7.2 Remote File System with RDMAbox
To provide user-transparent remote memory access, we build a network file system on the RDMAbox node-level abstraction. The Remote File System is mounted on a directory to give transparent access to remote memory in the cluster. RDMAbox manages RDMA networking, including remote resource management. A user application can then read or write files in this directory through the POSIX file interface.
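From the application's point of view, access through the POSIX file interface needs no RDMA-specific code. The sketch below writes and reads a file with ordinary `open`/`write`/`read` calls; the function name is hypothetical, and in a test a temporary directory stands in for the directory on which the Remote File System would be mounted. The 128KB read size mirrors the MAX_WRITE setting used later in the evaluation but is otherwise arbitrary.

```python
import os
import tempfile


def write_then_read(mount_dir: str, name: str, payload: bytes) -> bytes:
    """Write payload to a file under mount_dir with POSIX calls, then read it back."""
    path = os.path.join(mount_dir, name)

    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)

    chunks = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            chunk = os.read(fd, 128 * 1024)
            if not chunk:          # empty read means end of file
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)
```

With the file system mounted, `mount_dir` would simply be the mount point; the application code is identical for a local directory and for remote memory.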
We implement our network file system in user space with FUSE support as an example use of the RDMAbox user-space library. Note that, for better performance, this Remote File System can also be implemented on the RDMAbox remote block device that we introduced in section 6. Since the main advantage of FUSE is portability rather than performance, we do not compare the performance of FUSE-based file systems with kernel-based file systems.
Comparison Experiment Setup.
We first compare our Remote File System with RDMA-based FUSE file systems such as Octopus and GlusterFS. Octopus is a distributed file system designed for NVMe and RDMA, and it also provides a FUSE-based API. Its implementation uses DRAM to simulate NVMe, so we use Octopus as a RAM- and FUSE-based remote file system, a configuration that is not reported in the original paper. GlusterFS is also a FUSE-based distributed file system; it provides flexible and easy deployment of Gluster volume servers on an RDMA-based cluster. We set up GlusterFS on a ramdisk instead of a disk for a fair comparison.
We also compare with Accelio, an existing RDMA abstraction for both kernel and user space, by implementing a FUSE-based network file system with it. Since Accelio does not provide a node-level abstraction, we build the same Remote File System as our FUSE-based implementation but replace the network stack with Accelio. Remote Regions, X-RDMA and LITE are also RDMA abstractions for remote memory access, but Remote Regions and X-RDMA have no open-source code and LITE requires modification of the OS, so they do not fit our purpose. Note that we only compare raw I/O performance here because metadata management differs in each system. Octopus and GlusterFS distribute metadata to provide richer distributed metadata management, and GlusterFS also uses a local cache for metadata. For FUSE, we use the same version of the FUSE client and default options except MAX_WRITE=128KB. We run IOzone on a FUSE mount point. It opens one test file and issues reads and writes for a total of 10GB of data. We use the same hardware as in the Remote Paging System evaluation (section 7.1). We set up a cluster of 10 remote nodes to run the server daemons of each system. Data is distributed over the 10 remote nodes, and one client node issues reads and writes.
We report the performance comparison results in Figure 14. The Remote File System with RDMAbox has 1.7× to 6× higher throughput than Octopus, 1.2× to 2.2× higher than GlusterFS and 1.2× to 1.6× higher than Accelio.
For RDMA optimizations, Octopus uses Single I/O with preMR, busy polling, multi-QP optimization and one-sided verbs. GlusterFS uses Single I/O with dynMR, batch polling and two-sided verbs. For both writes and reads, Octopus performs slightly better than GlusterFS for small operations. Although preMR includes a copy cost, dynMR has higher registration overhead for small messages in user space (Figure 3(b)). The multi-QP optimization and one-sided verbs also contribute. For large data (above the 928KB threshold in our measurement), however, Octopus performs similarly to GlusterFS because the copy cost dominates in preMR for large sizes. GlusterFS does not outperform Octopus even with dynMR because it incurs an extra copy to storage on the receiver side (server node), as reported in prior work.
Accelio uses Doorbell batching with dynMR, EventBatch, multi-QP optimization and two-sided verbs. Since Doorbell batching with dynMR performs much better than Single I/O with either preMR or dynMR, Accelio outperforms both Octopus and GlusterFS for writes and reads at all data sizes. EventBatch also keeps CPU overhead low with many server node connections, whereas busy polling in Octopus creates CPU overhead on a client node that is connected to many server nodes. However, Accelio also uses a two-sided protocol and incurs an extra copy to storage on the receiver side.
RDMAbox uses Load-aware batching with dynamic switching between preMR and dynMR based on a threshold in user space: Load-aware batching with preMR for small data sizes and with dynMR for large data sizes. Adaptive polling provides performance comparable to busy polling while incurring less CPU overhead with many server nodes, on both the sender and the receiver node. Admission control and multi-QP optimization also contribute. The RDMAbox node-level abstraction also provides a one-sided verb design for the server daemon, which reduces the extra copy cost on the receiver side. This shows that the RDMAbox abstraction is flexible and effective for designing remote memory systems for large workloads.
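The combination of merging and threshold-based switching can be sketched as follows. The merge step follows the description of Load-aware batching given earlier in the paper (adjacent requests with contiguous destination addresses become one request), and the 928KB value is the crossover reported above. The names `Request`, `merge_contiguous` and `choose_mr_mode` are hypothetical, and applying the threshold to the merged size rather than to each original request is an assumption.

```python
from dataclasses import dataclass
from typing import List

# Crossover above which copy cost dominates preMR in the reported measurement.
MR_SWITCH_THRESHOLD = 928 * 1024  # bytes


@dataclass
class Request:
    remote_addr: int  # destination address on the remote node
    length: int       # payload size in bytes


def merge_contiguous(requests: List[Request]) -> List[Request]:
    """Merge queue-adjacent requests whose destination ranges are contiguous."""
    merged: List[Request] = []
    for req in requests:
        last = merged[-1] if merged else None
        if last is not None and last.remote_addr + last.length == req.remote_addr:
            last.length += req.length  # extend the previous request
        else:
            merged.append(Request(req.remote_addr, req.length))
    return merged


def choose_mr_mode(length: int, threshold: int = MR_SWITCH_THRESHOLD) -> str:
    """Pick pre-registered buffers (copy) for small data, dynamic registration for large."""
    return "dynMR" if length > threshold else "preMR"
```

For example, two queued 512KB writes to back-to-back remote addresses would be merged into one 1MB request and sent with dynMR, while an isolated 4KB write would be copied into a pre-registered buffer.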
We have presented RDMAbox, a suite of RDMA optimizations packaged in easy-to-use kernel and user-space libraries. We motivate the design of the RDMAbox optimizations with an empirical analysis of inherent problems in conventional RDMA I/O operations. To demonstrate the flexibility and effectiveness of RDMAbox, we also implement a kernel remote paging system and a user-space file system. Extensive experiments on big data and machine learning workloads show the effectiveness of the RDMAbox-optimized implementations over existing representative approaches. By reducing the cost of RDMA I/O with our optimizations, remote paging systems with RDMAbox achieve up to 6.48× higher throughput, with a reduction of up to 83% in 99th percentile tail latency and up to 78% in average latency in bigdata workloads, and up to 83% reduction in completion time of model training in machine learning workloads over existing representative solutions. User-space file systems based on RDMAbox achieve up to 6× higher throughput than existing representative solutions.
-  Jonathan Corbet "The multiqueue block layer" https://lwn.net/Articles/552904/ , June 5, 2013
-  Anuj Kalia, Michael Kaminsky, David G. Andersen. "Design Guidelines for High Performance RDMA Systems" USENIX Annual Technical Conference, 2016
-  Anuj Kalia, Michael Kaminsky, David G. Andersen. "FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs." 12th USENIX Symposium on Operating Systems Design and Implementation, 2016
-  A. Dragojevic, D. Narayanan, O. Hodson, and M.Castro. "FaRM: Fast remote memory" Proceedings of the 11th USENIX NSDI, Apr. 2014
-  Philip Werner Frey, Gustavo Alonso "Minimizing the Hidden Cost of RDMA" 29th IEEE International Conference on Distributed Computing Systems, 2009
-  Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin "Efficient Memory Disaggregation with INFINISWAP" USENIX NSDI, 2017
-  Mellanox Technology, https://github.com/accelio/NBDX "nbdX"
-  Mellanox Technology, http://www.accelio.org "Accelio"
-  Bae, Juhyun and Liu, Ling and Su, Gong and Iyengar, Arun and Wu, Yanzhao "Efficient Orchestration of Host and Remote Shared Memory for Memory Intensive Workloads" The International Symposium on Memory Systems, 2020
-  C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. "Heterogeneity and dynamicity of clouds at scale: Google trace analysis" In SoCC, 2012
-  C.A. Reiss. "Understanding Memory Configuration for In-Memory Analytics." PhD thesis, UC Berkeley, 2016
-  A. Samih, R. Wang, C. Maciocco, M. Kharbutli, and Y. Solihin. "Collaborate memories in clusters: Opportunities and challenges" Transactions on Computational Science XXII, Berlin, Germany:Springer, 2014, pp.17-41
-  "MongoDB, The database for modern applications" https://www.mongodb.com
-  "Redis, an in-memory data structure store" https://redis.io
-  "VoltDB, a translytical in-memory database" https://github.com/VoltDB/voltdb
-  B.F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. "Benchmarking cloud serving systems with YCSB" In SoCC, 2010
-  B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny. "Workload analysis of a large-scale key-value store" In SIGMETRICS, 2012
-  "Scikit-learn, a free software machine learning library" https://github.com/scikit-learn/scikit-learn
-  Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor "Caffe: Convolutional Architecture for Fast Feature Embedding" arXiv preprint arXiv:1408.5093, 2014
-  Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin and Joseph M. Hellerstein. "GraphLab: A New Parallel Framework for Machine Learning." Conference on Uncertainty in Artificial Intelligence (UAI), 2010
-  Rada Mihalcea and Paul Tarau. "TextRank: Bringing Order into Text." Empirical Methods in Natural Language Processing, 2004
-  "TextRank Dataset : http://mattmahoney.net/dc/textdata.html"
-  "ML Dataset : https://www.kaggle.com/c/outbrain-click-prediction/data"
-  "ML Dataset : https://www.kaggle.com/noaa/gsod"
-  W Cao, L Liu "Hierarchical Orchestration of Disaggregated Memory." IEEE Transactions on Computers, 2020.
-  Dmitry Duplyakin and Robert Ricci and Aleksander Maricq and Gary Wong and Jonathon Duerig and Eric Eide and Leigh Stoller and Mike Hibler and David Johnson and Kirk Webb and Aditya Akella and Kuangching Wang and Glenn Ricart and Larry Landweber and Chip Elliott and Michael Zink and Emmanuel Cecchet and Snigdhaswin Kar and Prabodh Mishra. "The Design and Operation of CloudLab." USENIX Annual Technical Conference (ATC),2019 https://www.flux.utah.edu/paper/duplyakin-atc19
-  Xingda Wei, Zhiyuan Dong, Rong Chen, and Haibo Chen "Deconstructing RDMA-enabled Distributed Transactions: Hybrid is Better!" USENIX OSDI, 2018.
-  Dotan Barak "Tips and tricks to optimize your RDMA code" https://www.rdmamojo.com/2013/06/08/tips-and-tricks-to-optimize-your-rdma-code/
-  S Tsai, Y Zhang "LITE Kernel RDMA Support for Datacenter Applications" SOSP, 2017.
-  Teng Ma, Tao Ma, Zhuo Song, Jingxuan Li, Huaixin Chang, Kang Chen, Hai Jiang, Yongwei Wu. "X-RDMA: Effective RDMA Middleware in Large-scale Production Environments" IEEE International Conference on Cluster Computing (CLUSTER), 2019
-  Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei. "Remote regions: a simple abstraction for remote memory" USENIX ATC, 2018
-  Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Christopher Lamb, Sameer Kumar, Craig Stunkel, George Bosilca, Aurelien Bouteiller. "UCX: An Open Source Framework for HPC Network APIs and Beyond" Hot Interconnects Symposium, 2015
-  Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, David Zats "TIMELY: RTT-based Congestion Control for the Datacenter" SIGCOMM, 2015
-  "http://lwn.net/2002/0321/a/napi-howto.php3"
-  Yibo Zhu, Yehonatan Liron, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, Ming Zhang. "Congestion Control for Large-Scale RDMA Deployments" SIGCOMM, 2015
-  Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, Minlan Yu. "HPCC: High Precision Congestion Control" SIGCOMM, 2019
-  Stanko Novakovic, Yizhou Shan, Aasheesh Kolli, Michael Cui, Yiying Zhang, Haggai Eran, Boris Pismenny, Liran Liss, Michael Wei, Dan Tsafrir, Marcos Aguilera. "Storm: a fast transactional dataplane for remote data structures" 12th ACM International Systems and Storage Conference (SYSTOR), 2019
-  Anuj Kalia, Michael Kaminsky, David G Andersen. "Using RDMA efficiently for key-value services" SIGCOMM, 2014
-  Christopher Mitchell, Yifeng Geng, Jinyang Li. "Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store" USENIX ATC, 2013
-  Ahmed Saeed, Nandita Dukkipati, Vytautas Valancius, Vinh The Lam, Carlo Contavalli, Amin M Vahdat. "Carousel: Scalable Traffic Shaping at End Hosts" SIGCOMM, 2017
-  Ahmed Saeed, Yimeng Zhao, Nandita Dukkipati, Mostafa Ammar, Ellen Zegura, Khaled Harras, Amin Vahdat "Eiffel: Efficient and Flexible Software Packet Scheduling" NSDI, 2019
-  https://fio.readthedocs.io/en/latest/
-  Youyou Lu, Jiwu Shu, and Youmin Chen, Tao Li "Octopus: an RDMA-enabled Distributed Persistent Memory File System" USENIX ATC, 2017
-  Marcos K. Aguilera, Nadav Amit, Irina Calciu, Xavier Deguillard, Jayneel Gandhi, Stanko Novakovic, Arun Ramanathan, Pratap Subrahmanyam, Lalith Suresh, Kiran Tati, Rajesh Venkatasubramanian, and Michael Wei "Remote regions: a simple abstraction for remote memory" USENIX ATC, 2018
-  https://www.gluster.org "GlusterFS"
-  http://www.iozone.org "IOzone Filesystem Benchmark"