The common wisdom is to avoid distributed transactions at almost all costs, as they represent the dominating bottleneck in distributed transactional database systems. As a result, many partitioning schemes have been proposed to partition data in a way such that the number of cross-partition transactions is minimized (Curino et al., 2010; Tran et al., 2014; Pavlo et al., 2012; Serafini et al., 2016; Taft et al., 2014; Zamanian et al., 2015). Yet, a recent result (Zamanian et al., 2017) has shown that with the advances of high-bandwidth RDMA-enabled networks, distributed transactions can actually scale. Neither the message overhead nor the network bandwidth is a limiting factor anymore.
This raises the fundamental question of how data should be partitioned across machines given high-bandwidth low-latency networks. In this paper, we argue that the new optimization goal should be to minimize contention rather than to minimize distributed transactions as is done in traditional schemes. While one might think that minimizing contention is the same as minimizing the number of distributed transactions, it turns out that they are different. Minimizing contention is about reducing the chance of conflicts, mostly for frequently accessed items, which in some cases can mean avoiding distributed transactions, but in many it does not.
In this paper, we present Chiller, a new partitioning scheme and execution model based on two-phase locking, which aims to minimize contention. Chiller is based on two ideas: (1) re-ordering the operations of a transaction so that the locks on the most contended records, if possible, are acquired last, and (2) contention-aware partitioning so that the most critical records can be updated without additional coordination. For example, let us assume a simplistic scenario with three servers in which each server can store up to two records, and a workload consisting of three transactions t1, t2, and t3. All transactions update r1. In addition, t1 updates r2, t2 updates r3 and r4, and t3 updates r5 and r6, as shown in Figure 2a. The common wisdom would dictate partitioning the data in a way that the number of cross-cutting transactions is minimized; in our example, this would mean co-locating all data for t1 on a single server as shown in Figure 2b, and having distributed transactions for t2 and t3.
However, we argue that we can achieve a better partitioning by first re-ordering the operations per transaction, so that the updates to the most contended items, here r1 and r3, are done last in the transactions, as shown in Figure 2a. This assumes that transactions are executed as compiled stored procedures, similar to H-Store (Kallman et al., 2008) and VoltDB (Stonebraker and Weisberg, 2013). Second, we argue that it is better to place r1 and r3 on the same machine, as in Figure 2c. At first this might seem counter-intuitive, as it increases the total number of distributed transactions from two to three (t1 is no longer a local transaction) and with it the overall transaction latencies. However, this partitioning scheme can decrease the number of aborts and therefore increase the total transaction throughput.
The idea is that re-ordering the transaction operations (in order to postpone acquiring the locks for the most contended records to the end of the transaction) minimizes the lock duration for the "hot" items and, consequently, the chance of aborting a concurrent conflicting transaction. More importantly, after the re-ordering, the success of a transaction relies entirely on the success of acquiring the locks for the most contended records. That is, if a distributed transaction has already acquired the necessary locks for all non-contended records (referred to as the outer region), the outcome of the transaction only depends on the success of updating the contended records (referred to as the inner region). This allows us to make all updates to the records in the inner region without any further coordination. Note that this partitioning technique primarily targets high-bandwidth low-latency networks, which mitigate the two most common bottlenecks for distributed transactions: message overhead and limited network bandwidth (see Section 2.2 in (Binnig et al., 2016) for a detailed discussion).
Obviously, many challenges need to be addressed to obtain such a contention-minimizing partitioning scheme. First, we need to determine what operations can actually be reordered. For example, primary key dependencies or value checks might make it impossible to re-order certain operations. Second, we need to decide what records should be accessed as part of an inner or outer region. Third, we need to partition the data in a way such that the lock duration for hot records is minimized. This is particularly challenging as we need to decide what records should be co-located on a single partition.
As we will discuss, this requires a new approach to data partitioning, which is quite different from existing partitioning algorithms that aim to minimize the number of distributed transactions, such as Schism (Curino et al., 2010). Chiller also goes beyond existing work on re-ordering operations to increase throughput (e.g., QURO (Yan and Cheung, 2016)): in addition to addressing distributed transaction processing, it re-orders operations based on the hotness of individual records. More importantly, as we will show, re-ordering operations without re-considering the partitioning scheme only leads to limited performance improvements; the challenge lies in optimizing both the operation order and the data partitioning at the same time. We further believe that our results will gain increasing importance as we observe more and more database vendors, most notably Microsoft and Oracle, moving towards high-bandwidth RDMA-enabled networks even for their cloud deployments (Li et al., 2016; Dragojević et al., 2015; Paper, 2012).
In summary, we make the following contributions:
We propose a new contention-centric partitioning scheme.
We present a new distributed transaction execution technique, which aims to update highly-contended records without additional coordination.
We show that our system Chiller, which uses our new partitioning scheme and transaction execution technique, can outperform alternative partitioning and execution techniques by up to a factor of 2 on TPC-C and a real-world workload.
2. Why Contention is the Problem
The throughput of distributed transactions is limited by three factors: (1) message overhead, (2) network bandwidth, and (3) increased contention (Binnig et al., 2016). The first two limitations are removed with the new generation of high-speed RDMA-enabled networks. RDMA largely avoids or significantly reduces the CPU and message processing overhead, and bandwidth has become abundant in these networks, to the point that the network bandwidth is competitive with that of main memory (Zamanian et al., 2017). However, what remains is the increased contention likelihood. In other words, with the new networks, distributed transactions can be scalable as they are neither limited by the network bandwidth nor CPU overhead, but they still intensify any contention-related problems within the workload as message delays are still significantly longer than local memory accesses.
2.1. Transaction Processing with 2PL & 2PC
To understand contention in transaction processing, let us consider a traditional distributed two-phase locking (2PL) with two-phase commit (2PC) protocol, as shown in Figure 2(a). Here, we use transaction t2 from Figure 2 as our example, and further assume that we have a transaction coordinator, which is co-located on Server 1. As part of phase one of 2PL, the coordinator first reads the records and at the same time acquires the appropriate shared/exclusive locks on the individual records per server. Once all locks are acquired and all items are read, the coordinator can enter the prepare phase of 2PC and prepare all servers to commit. Note that in this example, the prepare phase can be piggybacked onto the last step of the execution phase and thus requires no separate round trip. Finally, the transaction is committed as part of the second phase of 2PC and all locks are released. This implicitly assumes that the lock release phase of 2PC is done at the same time as the commit phase of 2PC.
The small green circle on each partition's timeline in Figure 2(a) shows the point at which the partition can release the locks and consider the transaction committed. We refer to the time span between acquisition and release of a record lock as the record's contention span, depicted by the thick blue line on each server's timeline. During a record's contention span, all concurrent transactions that access the record must either wait or abort. In this example, the contention span for all records is two message delays long even with piggybacking; and even with the next generation of low-latency networks, a message delay is at least an order of magnitude higher than a local memory access. The problem with this execution scheme is that regardless of whether a record is hot or not, its contention span remains the same.
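To make this concrete, the contention span under 2PL with 2PC can be modeled in units of one-way message delays. The following sketch is illustrative only; the function name and delay values are our own, not measurements from the paper:

```python
def contention_span_2pc(d_network_us, local_work_us=0.0):
    """Span between lock acquisition and lock release for a participant under
    2PL + 2PC with the prepare phase piggybacked: one message delay for the
    last response to reach the coordinator, one for the commit message back."""
    return 2 * d_network_us + local_work_us

# Even on a ~2 microsecond RDMA network, the span dwarfs a ~100 ns local
# memory access -- and it is the same whether the record is hot or cold.
span_us = contention_span_2pc(d_network_us=2.0)
assert span_us == 4.0
```

The model captures only the message-delay component, but it illustrates why shrinking the contention span of hot records, rather than eliminating messages, is the lever that matters here.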
Note that while our example used locking, other concurrency control methods that provide serializability suffer from the same issue, if not worse, even if they do not hold explicit locks on records (Harding et al., 2017). For example, in optimistic concurrency control, even though there is no lock manager that locks records, transactions have to pass a validation phase before being able to commit, which checks that the records they have accessed have not been changed by other concurrent transactions. If they have, the transaction has to abort, causing all of its work to be wasted (Dashti et al., 2017; Harding et al., 2017).
2.2. Contention-Aware Transaction Processing
As part of Chiller, we propose a new execution scheme that aims to minimize the contention span for contended records. The data partitioning layout shown in Figure 2c opens new possibilities to significantly reduce the contention (Section 4 describes in detail how we achieve such a partitioning). Here, the coordinator contacts all partitions involved in transaction t2 and requests locks for all records except for the contended ones (in this case, the hot records are r1 and r3, and so only the local record needs to be locked). If successful, it will send an RPC message to the partition that holds the hot data (i.e., Server 3) to perform the remaining part of the transaction, which is updating r1 and r3, and commit its changes. Hence, Server 3 will update its two records, commit if successful, and return to the coordinator. Depending on the response from Server 3, the coordinator then sends a message to the other partitions to either apply their updates and commit, or roll back their changes and abort.
The reason that Server 3 can unilaterally commit or abort is that it contains all the data necessary to finish its part. Therefore, the part of the transaction which involves the hottest records is treated as if it were an independent local transaction. This effectively makes the contention span of r1 and r3 much shorter, reducing the overall contention.
There are multiple assumptions in this scheme. First, after sending the request to Server 3, neither the coordinator nor the rest of the partitions can choose to abort the transaction for any reason. This is not a problem, as Chiller's transaction execution model, concurrency control, and replication mechanism together rule out any non-deterministic reason to abort a transaction which has already acquired all its locks and reached the agreement phase. While replicating the records in the outer region is straightforward, the replication of the contended records modified in the inner region is more challenging. The coordinator cannot simultaneously send messages to all replicas of these records (as it does with non-contended records), because they may make inconsistent decisions, some committing and some aborting. Any coordination between these replicas is undesirable too, as it would defeat the purpose of the proposed execution scheme: it would extend the contention span of the hot records by at least one network round trip. We will discuss in Section 5 how replication works for these records. Second, for a given transaction, at most one partition can host the inner region. Otherwise, multiple partitions could not commit independently without coordination. This is why executing transactions in this manner requires a new partitioning scheme which ensures that contended records that are likely to be accessed together are located in the same partition. We will formally explain our novel contention-centric execution model in Section 3 and the partitioning algorithm in Section 4.
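The flow described in this section can be sketched as a small, single-threaded simulation. All names (`Partition`, `run_two_region`, the key labels) are hypothetical and merely mirror the example above; Chiller's actual implementation is distributed and asynchronous:

```python
class Partition:
    """Toy lock table for one partition."""
    def __init__(self):
        self.locks = set()

    def try_lock(self, keys):
        if self.locks & set(keys):
            return False              # conflict with a concurrent transaction
        self.locks |= set(keys)
        return True

    def unlock(self, keys):
        self.locks -= set(keys)

def run_two_region(outer_keys_by_partition, inner_host, inner_keys, parts):
    # 1. Acquire locks for the outer region (all non-contended records).
    locked = []
    for pid, keys in outer_keys_by_partition.items():
        if not parts[pid].try_lock(keys):
            for p, k in locked:
                parts[p].unlock(k)
            return "abort"
        locked.append((pid, keys))
    # 2. Delegate the inner region: the inner host locks, updates, and commits
    #    the hot records unilaterally -- no further coordination needed.
    if parts[inner_host].try_lock(inner_keys):
        parts[inner_host].unlock(inner_keys)  # hot contention span: local only
        outcome = "commit"
    else:
        outcome = "abort"
    # 3. Tell the outer partitions to apply-and-unlock (or roll back).
    for pid, keys in locked:
        parts[pid].unlock(keys)
    return outcome

parts = {1: Partition(), 3: Partition()}
assert run_two_region({1: ["r4"]}, inner_host=3,
                      inner_keys=["r1", "r3"], parts=parts) == "commit"
```

Note how the hot keys are locked and released entirely within step 2, so their contention span never includes a network round trip.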
3. Two-region Transaction Execution
In this section, we present our two-region transaction processing technique from Section 2.2 in more detail. We will assume that we already have a partitioning of the records that avoids contention (we discuss how to do this in Section 4).
3.1. General Overview
Our method adopts two-phase locking (2PL) for concurrency control, and provides full serializability. Deadlocks are not possible, as we abort transactions as soon as any of their lock requests fail (Harding et al., 2017). The intuition behind our proposed execution model lies in two observations: (1) locks have to be held until the end of a transaction, but re-ordering the lock acquisitions can shorten the contention span for the most critical items, and, more importantly, (2) a single server can make the commit decision on its own if the locks for all other records have already been acquired. Traditional 2PC is needed to ensure fault tolerance when any of the servers may determine the outcome; however, if only one server is responsible for the decision, the protocol can be significantly simplified while still being fault-tolerant. In this section, we first discuss the protocol itself, and address fault tolerance later in Section 5.
The goal of our new two-region execution model is to minimize the duration of locks on contended records by postponing their lock acquisition until right before the end of the expanding phase of 2PL, and performing their lock release right after the contended records are read or modified. Transactions scheduled for this type of execution are executed in two stages, namely the outer and inner regions. More specifically, the execution engine re-orders operations such that accesses to cold records form the outer region and accesses to hot records form the inner region; the outer region is executed as normal, and any failure in acquiring a lock results in aborting the transaction (see Figure 2(b)). If the locks in the outer region are successfully acquired, the transaction enters the inner region. The records in the inner region are accessed and modified without any communication with other participants. The important point is that the inner region commits upon completion. That is, after the last update is applied, the transaction is effectively committed (note that we still need to replicate it before returning the commit decision to the client, as described in Section 5). We refer to this method of executing a transaction as two-region execution. Not all transactions are handled in this way. When a transaction deals only with cold data, it is executed normally, using 2PC at the end to ensure agreement among sites before committing.
In the two-region execution scheme, the inner region has to be executed after the outer region, and placing an operation in the inner region may amount to re-ordering operations inside transactions. Therefore we must determine which records in a transaction can be placed in the inner region. The next sub-section describes how we build the dependency graph, which models the constraints in re-ordering operations in transactions.
3.2. Static Analysis - Constructing a Dependency Graph
In order to describe our two-region execution model, we use an imaginary flight-booking transaction, shown in Figure 4. We construct a dependency graph, whose goal is to capture the constraints on re-ordering the operations of a stored procedure. This graph is built when registering a new stored procedure in the system. There may be constraints on data values that must hold true for the transaction to be able to commit (e.g., there must be an available seat in the flight for the purchase transaction). Furthermore, operations in a transaction may have dependencies among each other. The goal is to reflect such constraints in the dependency graph. Here, we distinguish between two types of dependencies. A primary key dependency (pk-dep) is when accessing a record r2 can happen only after accessing a record r1, as the primary key of r2 is only known after r1 is read. In a value dependency (v-dep), the new values for the update columns of a record r2 are known only after accessing r1. For re-ordering, we are only concerned about the pk-deps, and not the v-deps. This is because value dependencies do not restrict the lock acquisition order, while pk-deps do restrict what re-orderings are possible. Hence, our algorithm builds a dependency graph for each stored procedure that encapsulates all these constraints. Each operation of the stored procedure corresponds to a node in this graph, and there is an edge between two nodes if the operation of one depends on that of the other. For example, the insert on line 15 of Figure 4 has a pk-dep on the read operation in line 5 (because of seat_id), and a v-dep on the read operation in line 6 (because of c.name). This means that getting the lock for the insert query can only happen after the flight record is read (pk-dep), but it may happen before the customer is read (v-dep). The dependency graph is shown in Step 1 of Figure 4. Please refer to the figure's caption for the explanation of the color codes.
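The pk-dep reachability check at the heart of this analysis can be sketched as follows; `DepGraph` and the operation labels are our own illustrative names, chosen to mirror the flight-booking example:

```python
from collections import defaultdict

class DepGraph:
    """Operations are nodes; a pk-dep edge records that one operation's
    primary key is only known after another operation has executed."""
    def __init__(self):
        self.pk_deps = defaultdict(set)   # op -> set of ops it pk-depends on

    def add_pk_dep(self, op, on):
        self.pk_deps[op].add(on)

    def must_precede(self, a, b):
        """True if a's lock must be acquired before b's, i.e., b pk-depends
        on a, directly or transitively. v-deps impose no such constraint."""
        stack, seen = [b], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if a in self.pk_deps[n]:
                return True
            stack.extend(self.pk_deps[n])
        return False

# Flight-booking example: the seat insert pk-depends on the flight read
# (seat_id is derived from it), but only v-depends on the customer read.
g = DepGraph()
g.add_pk_dep("insert_seat", on="read_flight")
assert g.must_precede("read_flight", "insert_seat")        # cannot re-order
assert not g.must_precede("read_customer", "insert_seat")  # v-dep: re-orderable
```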
3.3. Run-Time Decision
Given the partitioning scheme and the dependency graph produced by the static analysis, the set of records that go into the outer and inner regions can be decided for every single transaction at run-time. Below, we describe the algorithm step-by-step.
1 - Decide on the execution model: The first step at execution time is to decide whether a transaction will be run as normal (i.e., with standard two-phase locking), or executed as a two-region transaction. We check the records in the transaction one-by-one against a lookup table that stores a list of hot records (a discussion of the lookup table is provided in Section 4.4). Afterwards, the algorithm checks the dependency graph and adds a given hot record r to the inner-region candidate list only if (a) no other record depends on r, or (b) all children of r are located on the same partition as r. In contrast, if r has any child whose primary key is still not known or which is hosted on a different partition, r cannot be moved to the inner region. This is because a record cannot be moved to the inner region unless the locks for all of the transaction's records hosted on other partitions have already been acquired. For the example in Figure 4, if the insert operation belongs to a different partition than the flight record, the flight record cannot be considered for the inner region, because there is a pk-dep between the flight record and the seat record.
2 - Select the host for the inner region (referred to as the inner host): If all candidate records for the inner region have the same host, then it is chosen as the inner host for that transaction. However, it is possible that there are multiple candidate hosts, but we can only select one; otherwise, no single server could make the commit decision without additional coordination. Currently, the algorithm chooses the candidate partition containing the largest number of the transaction's hot records as the final inner host.
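Steps 1 and 2 can be summarized in a few lines. The helper names here are hypothetical; `children` stands for the pk-dep children in the dependency graph and `partition_of` for the lookup table (records with unknown primary keys are simply absent from it):

```python
from collections import Counter

def inner_candidates(records, children, partition_of, hot):
    """A hot record qualifies for the inner region only if every record that
    depends on it has a known primary key (appears in partition_of) and
    lives on the same partition."""
    out = []
    for r in records:
        if r in hot and all(c in partition_of and
                            partition_of[c] == partition_of[r]
                            for c in children.get(r, [])):
            out.append(r)
    return out

def pick_inner_host(candidates, partition_of):
    """Choose the partition hosting the most hot candidate records."""
    if not candidates:
        return None
    return Counter(partition_of[r] for r in candidates).most_common(1)[0][0]

# Flight-booking example: the seat insert pk-depends on the flight record.
partition_of = {"flight": 3, "seat": 3, "customer": 1, "tax": 1}
children = {"flight": ["seat"]}
cands = inner_candidates(["flight", "customer", "tax"],
                         children, partition_of, hot={"flight"})
assert cands == ["flight"] and pick_inner_host(cands, partition_of) == 3

# If the seat record lived elsewhere, the flight record could not be inner.
partition_of["seat"] = 2
assert inner_candidates(["flight"], children, partition_of, {"flight"}) == []
```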
3 - Read and lock records in the outer region: The transaction begins acquiring locks and executing operations in the outer region. In our example, this includes acquiring a write lock for the customer record and a read lock for the tax record. If either of them fails, the transaction aborts.
4 - Execute and commit the inner region: Once all locks have been acquired for the records in the outer region, the coordinator delegates processing the inner region to the inner host by sending an RPC message with all information needed to execute and commit the transaction (e.g., transaction ID, all remaining operation IDs, input parameters, etc.). The dependency graph guarantees that all records needed for these operations will be found in the inner host, and so it executes the requested operations from beginning to end with no stall. Note that in the most general case locks are still acquired as part of the inner region execution. However, if it can be guaranteed through static analysis that none of the records of the inner region will be part of any outer region, locks in the inner region can be bypassed altogether, similar to H-store (Kallman et al., 2008).
Once all locks are successfully acquired and records are updated, the inner host commits and informs the coordinator about the outcome (through a proxy to ensure fault tolerance, as we will discuss in Section 5). In case any of the lock requests fails, the inner host aborts the transaction and directly informs the coordinator about its decision. In our example, the update to the flight record is made, a new record gets inserted into the seats table, the partial transaction commits, and the value for the cost variable is returned to the coordinator, as it will be needed to update the customer’s balance.
5 - Commit the outer region: If the inner region succeeds, the transaction is already effectively committed, and the coordinator can commit all pending changes in the outer region and unlock the records to make the changes visible. If the inner region fails, the coordinator has to roll back any pending changes. In our example, the customer's balance gets updated, and the locks are removed from the tax and customer records.
There are two main challenges to efficiently implementing this execution model. First, this technique will not be useful if the hot records of a transaction are scattered across different partitions. No matter which partition is chosen as the inner host of the transaction, the hot records on the other partitions will observe long contention spans, even longer than they would have been under normal execution, because a transaction often becomes lengthier when executed in the two-region model. Therefore, it is essential that hot records that are frequently accessed together are placed in the same partition. We therefore present a novel partitioning technique in Section 4, which is designed specifically to accomplish this goal.
The second challenge is fault tolerance. A transaction executed under the two-region model observes two different commit points: one when the inner host commits the changes to its hot records (Step 4), and another when the changes to the outer region are committed (Step 5). For this reason, replication requires a new technique, which will be presented in Section 5.
Furthermore, it should be noted that avoiding the locks within the inner region is only possible if it can be guaranteed that no transaction will touch the inner records as part of an outer region. While it is easy to observe such cases, it is often hard to guarantee this based only on the static analysis of stored procedures. Therefore, for our current implementation we only use the general execution model, which still acquires locks even for the inner region. While this imposes some overhead, it is negligible compared to the message delay. Furthermore, our partitioning scheme still minimizes the overall contention and thus also contention on the locks.
4. Contention-aware Partitioning
To fully unfold the potential of the two-region transaction execution model, the objective of our proposed partitioning algorithm is to find a horizontal partitioning of the data which minimizes contention rather than the number of distributed transactions. To better explain the idea, we will use the four transactions shown in Figure 5. The shade of red corresponds to the hotness of a record (darker is hotter), and the goal is to find two balanced partitions (to keep things simple in this example, we define "balanced" as a partitioning that splits the set of records in half; a formal definition of load balance follows in Section 4.3). Existing distributed partitioning schemes try to minimize the number of distributed transactions. For example, Schism (Curino et al., 2010) would create the partitioning shown in Figure 5(b). However, this partitioning would increase the contention span of the hot records, because any transaction accessing them across partitions would have to hold locks on them as part of an outer region.
A better partitioning scheme is shown in Figure 5(c), as a single inner region can cover all the contended records. The challenge in creating a contention-aware partitioning algorithm is that we not only need to determine which records should be co-located, but also what should go inside the inner region of each transaction. In addition, the algorithm should avoid overloading any partition by putting too many contended records in it; in other words, the load should be divided in a balanced way.
In the following sub-sections, we first describe how we collect the statistics that are used to model the contention for any individual record, and then describe our solution to partitioning based on contention.
4.1. Contention Likelihood
Each partition in Chiller has a partition manager, which randomly samples the running transactions and periodically sends statistics about the most frequently accessed records and their read- and write-sets to the global statistics service, where these statistics are aggregated over a given time-frame (usually a few minutes). As we only need such statistics to identify the hot records and their frequently co-accessed records, a light-weight sampling-based approach, which, for example, only collects these statistics for a small fraction of the transactions, is more than sufficient (cf. Section 7).
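A minimal version of such sampling-based collection might look as follows (illustrative names throughout; the sampling rate is a tunable parameter, not a value from the paper):

```python
import random
from collections import Counter

def sample_access_stats(txn_stream, rate=0.01, rng=None):
    """Record per-record read/write counts for roughly `rate` of the
    transactions. txn_stream yields (read_set, write_set) pairs."""
    rng = rng or random.Random(0)
    reads, writes = Counter(), Counter()
    for read_set, write_set in txn_stream:
        if rng.random() >= rate:
            continue                   # skip: this transaction is not sampled
        reads.update(read_set)
        writes.update(write_set)
    return reads, writes

# With rate=1.0 every transaction is counted.
reads, writes = sample_access_stats([(["r1"], ["r1"])] * 100, rate=1.0)
assert reads["r1"] == 100 and writes["r1"] == 100
```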
Once the statistics are aggregated, the total access frequencies of records are used to determine the conflict likelihood. The probability of a conflicting access for a given record can be formulated as:

$P(\text{conflict}) = P(X_w \geq 2) + P(X_w = 1) \cdot P(X_r \geq 1)$

$X_r$ and $X_w$ are random variables corresponding to the number of times a given record is read or written (i.e., modified) within the lock window, respectively (a lock window is defined as the average time a lock is held on a record, which we measure empirically in our experiments). The equation consists of two terms to account for the two possible conflict scenarios: (i) write-write conflicts, and (ii) read-write conflicts. Since (i) and (ii) are disjoint (because $X_w \geq 2$ in the first scenario and $X_w = 1$ in the second), we can simply add them together.
Similar to previous work (Kraska et al., 2009), we assume that we can model $X_r$ ($X_w$) using a Poisson process with a mean arrival rate $\lambda_r$ ($\lambda_w$), which is just the time-normalized access frequency. This allows us to rewrite the above equation as follows:

$P_c = \left(1 - e^{-\lambda_w} - \lambda_w e^{-\lambda_w}\right) + \lambda_w e^{-\lambda_w} \left(1 - e^{-\lambda_r}\right)$
Note that the two arrival rates for reads and writes, and thus the contention probability, are defined per record. We use $P_c(r)$ to refer to the contention likelihood of record $r$. In the equation above, when $\lambda_w$ is zero, meaning no write has been made to the record, $P_c$ will be zero. This makes sense because shared locks are compatible with each other and do not cause any conflicts. With a non-zero $\lambda_w$, higher values of $\lambda_r$ increase the contention likelihood, due to the conflict between read and write locks.
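As a concrete sketch, the contention likelihood under this Poisson model can be computed directly from the standard Poisson probabilities $P(X_w \geq 2) = 1 - e^{-\lambda_w} - \lambda_w e^{-\lambda_w}$ and $P(X_w = 1) \cdot P(X_r \geq 1) = \lambda_w e^{-\lambda_w}(1 - e^{-\lambda_r})$:

```python
from math import exp

def contention_likelihood(lam_w, lam_r):
    """Per-record conflict probability, given Poisson write/read arrival
    rates (lam_w, lam_r) per lock window."""
    p_ww = 1 - exp(-lam_w) - lam_w * exp(-lam_w)     # P(Xw >= 2)
    p_rw = lam_w * exp(-lam_w) * (1 - exp(-lam_r))   # P(Xw = 1) * P(Xr >= 1)
    return p_ww + p_rw

assert contention_likelihood(0.0, 5.0) == 0.0  # reads alone never conflict
assert contention_likelihood(0.5, 1.0) > contention_likelihood(0.5, 0.1)
```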
Even for a sample with one million records, we found that this calculation can be performed in a matter of a few seconds. For much bigger workloads, multiple instances of the statistics service can each calculate the contention likelihood for a subset of the records.
4.2. Graph Representation
There are three key properties that a graph representation of the workload should have to properly fit the context of our execution model. First, the contention of records must be captured in the graph, as this is the main objective. Second, the relationships between records must also be modeled: since there can be only one inner region per transaction, frequently co-accessed contended records should be located together. Third, the final partitioning should also determine, for each record, which transactions access it in their inner region and which access it in their outer region. For these reasons, Chiller models the workload quite differently from any existing partitioning algorithm. Using this representation, we can not only efficiently represent contention and use standard graph partitioning algorithms to find a good split, but also determine what should be the inner and outer region of each individual transaction. As shown in Figure 5(c), we model each transaction as a star: at the center is a dummy vertex (referred to as a t-vertex, and denoted by a square) with edges to all the records that are accessed by that transaction. Thus, the number of vertices in the graph is $|T| + |R|$, where $|T|$ is the number of transactions and $|R|$ is the number of records, and the number of edges is the sum of the number of records involved per transaction.
All edges connecting a given record-vertex (r-vertex) to its t-vertex neighbors have the same weight, which is proportional to the record's contention likelihood, as defined in Section 4.1. More contended records have edges with higher weights. In the context of the two-region execution model that we discussed in Section 3, the weight of the edge between an r-vertex and a connected t-vertex reflects how bad it would be if the record were not accessed in the inner region of that transaction.
Applying the contention likelihood formula to our running example and normalizing the weights produces the graph with the edge weights shown in Figure 5(c). Note that there is no edge between any two records; the co-access of two records is implied by the common t-vertex connecting them. Next, we describe how our partitioning algorithm takes this graph as input and generates a partitioning scheme with low contention.
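Constructing this star representation is straightforward; the following sketch (with hypothetical names) builds the weighted edge list from a workload sample and per-record contention likelihoods:

```python
def build_star_graph(transactions, contention):
    """transactions: txn id -> iterable of accessed record ids.
    contention: record id -> contention likelihood (edge weight).
    Returns {(t_vertex, r_vertex): weight}; records are never linked
    directly -- co-access is implied by a shared t-vertex."""
    edges = {}
    for t, records in transactions.items():
        for r in records:
            edges[(("t", t), ("r", r))] = contention[r]
    return edges

workload = {"t1": [1, 2, 3], "t2": [1, 2, 3]}
likelihood = {1: 0.05, 2: 0.10, 3: 0.90}
g = build_star_graph(workload, likelihood)
assert len(g) == 6                      # one edge per record per transaction
assert g[(("t", "t1"), ("r", 3))] == 0.90
```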
4.3. Partitioning Algorithm
As we are able to model the contention among records using a weighted graph, we can apply standard graph partitioning algorithms. More formally, our goal is to find a partitioning which minimizes the contention:

$\min_{P} \sum_{r \in R} P_c(r \mid P) \quad \text{subject to} \quad \forall i: L(p_i) \leq (1 + \epsilon) \cdot \bar{L}$

Here, $P$ is a partitioning of the set of records $R$ into $k$ partitions, $P_c(r \mid P)$ is the contention likelihood of record $r$ under partitioning $P$, $L(p_i)$ is the load on partition $p_i$, $\bar{L}$ is the average load per partition, and $\epsilon$ is a small constant that controls the tolerated degree of imbalance. The definition of load will be discussed shortly.
Chiller makes use of METIS (Karypis and Kumar, 1998), a graph partitioning tool which aims to find a high-quality partitioning of the input graph with a small cut, while at the same time respecting the constraint of approximately balanced load across partitions (note that finding the optimal solution is NP-hard). The resulting partitioning assigns each vertex to one partition. For our specific problem, the interpretation of the result is as follows: a cut edge connecting an r-vertex $r$ in one partition to a t-vertex $t$ in another partition (shown in green in Figure 5(c)) implies that the transaction of $t$ will have to access $r$ in its outer region, and thus observe a conflicting access with a probability proportional to the edge's weight. In other words, the partition to which a t-vertex is assigned determines the inner host of its transaction; all r-vertices that are assigned to the same partition as their connected t-vertex are executed in the inner region of that transaction, provided that their dependency requirements are met. As a result, finding the partitioning which minimizes the total weight of all cut edges also minimizes the contention.
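The quantity METIS minimizes here, the total weight of the cut edges, can be evaluated for any candidate assignment in a few lines (a sketch with hypothetical names, not Chiller's code):

```python
def cut_weight(edges, assignment):
    """edges: {(t_vertex, r_vertex): weight}; assignment: vertex -> partition.
    Sums the weights of edges whose endpoints land in different partitions."""
    return sum(w for (t, r), w in edges.items()
               if assignment[t] != assignment[r])

edges = {("t1", "r1"): 0.9, ("t1", "r2"): 0.05, ("t2", "r1"): 0.9}
# Keeping both t-vertices with the hot record r1 cuts only the cold edge.
good = {"t1": 0, "t2": 0, "r1": 0, "r2": 1}
bad = {"t1": 0, "t2": 1, "r1": 0, "r2": 0}
assert cut_weight(edges, good) == 0.05
assert cut_weight(edges, bad) == 0.9
```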
In our example, the sum of the weights of all cut edges (portrayed as green lines) is zero. The first transaction will access record 3 in its inner region, as its t-vertex is in the same partition as record 3, while it will access records 1 and 2 in its outer region. Similarly, the second transaction will access record 3 in its inner region, while records 1 and 2 will be accessed in its outer region. Compared to the partitioning in Figure 4(b), this layout incurs one more distributed transaction.
While the objective function minimizes contention, the load constraint ensures that the partitions are approximately balanced. The load for a partition can be defined in different ways, such as the number of executed transactions, the number of hosted records, or the number of record accesses. The weights of the vertices in the graph depend on the chosen load metric. For the number of executed transactions, t-vertices have a weight of 1 while r-vertices have a weight of 0; the weighting is reversed for the number of hosted records. Finally, for the number of record accesses, r-vertices are weighted proportionally to the sum of reads and writes to them. METIS then generates a partitioning such that the sum of vertex weights in each partition is approximately balanced.
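To make the objective concrete, the sketch below builds the star-shaped edge set and evaluates the cut weight of two candidate layouts for a made-up three-transaction workload. The transaction names, record IDs, and weights are illustrative, and the actual balanced search that METIS performs is elided; only the objective function is shown.

```python
def star_edges(transactions, contention):
    """Edges of the star-shaped workload graph: one t-vertex per transaction,
    one r-vertex per record, and one weighted edge per record a transaction
    accesses (weight = the record's contention likelihood). Note that no
    r-vertex is ever connected directly to another r-vertex."""
    return {(t, r): contention[r]
            for t, records in transactions.items() for r in records}

def cut_weight(transactions, contention, txn_part, rec_part):
    """Objective value of a partitioning: total weight of cut edges. An edge
    (t, r) is cut when the r-vertex lands in a different partition than its
    t-vertex, i.e., transaction t must access record r in its outer region."""
    return sum(w for (t, r), w in star_edges(transactions, contention).items()
               if txn_part[t] != rec_part[r])

# Hypothetical workload: record 3 is hot, records 1 and 2 are cold.
txns = {"t1": [1, 3], "t2": [2, 3], "t3": [1, 2, 3]}
heat = {1: 0.0, 2: 0.0, 3: 0.9}

# Contention-aware layout: all t-vertices share record 3's partition, so
# every transaction updates the hot record in its inner region (cut = 0).
good = cut_weight(txns, heat, {"t1": 0, "t2": 0, "t3": 0}, {1: 1, 2: 2, 3: 0})
# Layout that splits one t-vertex away from the hot record it updates.
bad = cut_weight(txns, heat, {"t1": 1, "t2": 0, "t3": 0}, {1: 1, 2: 2, 3: 0})
```

In this toy example the first layout has a cut weight of 0 even though every transaction is still distributed, which is exactly the point: the objective charges a layout only for contended accesses pushed into outer regions, not for distribution per se.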
4.4. Discussion and Optimizations
There are two important issues every partitioning scheme has to address: (1) the graph size and the cost of partitioning it, and (2) the amount of meta-data (i.e., lookup table) needed to determine which record is located where.
Even with the recent advances in graph partitioning algorithms, it is still expensive to partition a large graph. However, Chiller has a unique advantage here: compared to existing partitioning techniques such as Schism, it produces significantly smaller graphs, specifically graphs with fewer edges. For example, Schism (Curino et al., 2010) and many other automatic partitioning tools (Serafini et al., 2016; Tran et al., 2014) have to introduce one edge for every pair of co-accessed data items, for a total of n(n-1)/2 edges for a transaction with n records. Because of our star representation, we only introduce n edges per transaction: one to connect each r-vertex to the t-vertex of that transaction. This representation results in a much smaller graph than the existing tools produce. For example, we found that, on average, constructing the workload graph and applying the METIS partitioning tool takes up to 5 times longer for Schism than for Chiller on the datasets used in our experiments.
Furthermore, our approach also has a significant advantage in reducing the size of the lookup table. As we are only interested in the hot records, we can focus our attention on the records with a contention likelihood above a given threshold. Hence, the lookup table only needs to store where these hot records are located. All other records can be partitioned using an orthogonal scheme, for example hash- or range-partitioning, which requires almost no lookup-table space. To locate a record, we first check the lookup table to see whether the record is considered hot; if it is, we use its partition ID from the lookup table, and otherwise we use the default partitioner (ranges or hashes). Note that this technique might cause more transactions to be distributed, but as discussed earlier, minimizing the number of distributed transactions is not the primary goal in new RDMA-enabled databases (Zamanian et al., 2017). In our experiments, we found that this optimization significantly reduces the size of the lookup table without much negative impact on throughput.
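A minimal sketch of this hybrid lookup; the class and method names (`HybridLocator`, `partition_of`) are our own, and a plain modular hash stands in for the default partitioner.

```python
class HybridLocator:
    """Route record lookups: hot records go through an explicit lookup
    table, all other records through a default hash partitioner."""

    def __init__(self, hot_table, num_partitions):
        self.hot = hot_table          # key -> partition ID (hot records only)
        self.n = num_partitions

    def partition_of(self, key):
        # Hot record: use the explicit mapping produced by the partitioner.
        if key in self.hot:
            return self.hot[key]
        # Cold record: fall back to the default (hash) partitioner,
        # which needs no lookup-table space at all.
        return hash(key) % self.n

locator = HybridLocator({42: 3}, num_partitions=4)
```

The table's memory footprint is thus proportional to the number of hot records rather than to the database size, at the cost of occasionally sending a cold-record access to a "wrong" (hash-chosen) partition.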
Finally, a last optimization worth mentioning is that it is also possible to co-optimize for contention and minimizing distributed transactions using the same workload representation. One only needs to assign a minimum positive weight to all edges in the graph. The bigger the minimum weight, the stronger the objective to co-locate records from the same transaction.
5. Replication
Like many other in-memory databases, Chiller uses replication for both durability and high availability. Each partition has a user-specified number of replicas, one of which is designated the primary.
The replication of the records in the outer region of a transaction is simple: once the transaction has acquired all its locks, it replicates the updates to the replicas of its write-set. Replicating the updates to the records in the inner region, however, imposes a new challenge. This is because in main-memory databases, the changes to the replicas have to be applied before committing the transaction to ensure transactional consistency and atomicity. This is not possible for the inner region due to its earlier commit point (see Figure 2(b)) for two reasons. First, the replication of the updates in the inner region cannot be postponed to the end of the transaction (i.e., where the updates to the outer region get replicated), because the host of the inner region has already committed its part by this point. Second, its replication cannot be performed by its host either, because this would mean that the inner host has to postpone committing to after replicating the changes on all its replicas. This would defeat the purpose of having an inner region for contended records altogether, because locks would have to be held until the replicas send acknowledgements.
Chiller therefore uses a novel approach to solve this problem. As illustrated in Figure 6, when the host of the inner region successfully finishes its part and commits the changes locally, it sends an RPC message with the new values of the records modified during the processing of the inner region to its replicas, and without having to wait for the acknowledgments to return, it moves on to the next transaction. Upon receiving the replication message, the replica applies the updates to the records. Then, the replica notifies the original coordinator of the transaction (and not the inner host). The coordinator is allowed to continue the transaction only after it has received acknowledgements from all these replicas.
The assumption that the inner host is able to commit and start a new transaction without waiting for the acknowledgements requires in-order message delivery — a guarantee offered by RDMA’s queue-based communication model. Even in a network without such a guarantee, a simple message ordering technique (e.g., producing unique message IDs by concatenating the partition ID with a monotonically increasing local message ID) suffices. Therefore, if the inner host receives a new transaction and commits it, the updates for the new transaction will not be visible to other execution engines unless those transactions too go through the replication phase. Since the replicas receive messages from the inner host in the same order that they are sent by the primary, and apply those changes in the order they are received, it cannot happen that an update gets lost or overwritten while subsequent updates have been applied. Note that if the inner region aborts the transaction, the replication phase is not needed and the inner host can directly reply to the coordinator.
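The ordering argument can be illustrated with a small single-process sketch; the classes, fields, and message format below are our own simplification of the protocol, not Chiller's wire format.

```python
import itertools

class InnerHost:
    """Early-commit replication at the inner host (illustrative): the host
    commits locally, tags each replication message with a monotonically
    increasing sequence number, sends it to its replicas, and moves on
    without waiting for acknowledgements."""

    def __init__(self, replicas):
        self.seq = itertools.count()
        self.replicas = replicas

    def commit_inner(self, txn_id, updates):
        msg = (next(self.seq), txn_id, updates)
        for r in self.replicas:
            r.enqueue(msg)            # asynchronous send; no wait for acks

class Replica:
    """Applies replication messages strictly in sequence order and acks the
    transaction's coordinator (not the inner host)."""

    def __init__(self):
        self.pending = {}             # out-of-order messages held back
        self.next_seq = 0
        self.state = {}
        self.acked = []               # txn IDs whose ack went to the coordinator

    def enqueue(self, msg):
        seq, txn_id, updates = msg
        self.pending[seq] = (txn_id, updates)
        while self.next_seq in self.pending:      # apply strictly in order
            t, ups = self.pending.pop(self.next_seq)
            self.state.update(ups)
            self.acked.append(t)                  # ack to the coordinator
            self.next_seq += 1
```

Even if the replication message for a later transaction arrives first, the replica holds it back until every earlier message has been applied, so no update is lost or overwritten by a stale one.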
6. Implementation
Chiller builds on and extends the NAM-DB (network-attached memory) database architecture proposed in (Binnig et al., 2016). The NAM-DB architecture provides a new abstraction with the goal of decoupling computation and storage by fully exploiting RDMA capabilities in distributed systems. The storage layer provides a shared distributed memory pool, which is exposed in a fine-grained, byte-level fashion to all compute servers via RDMA. This logical decoupling does not prevent co-location of compute and storage servers: a compute server can access its co-located storage locally, but still has access to any other storage server via RDMA.
In the context of this work, a compute server acts as a transaction execution engine. We use one transaction engine per available hardware thread, and partition the database across storage servers using the partitioning scheme described before.
For executing the transactions using the two-region model, the execution engine that is the coordinator for the transaction looks up the primary keys of its records in the lookup table that stores the partition assignment for hot records. To execute the outer region, the coordinator can directly access the desired records using RDMA. For the inner region, however, the coordinator must send an RPC message, along with all necessary information and input parameters, to the remote execution engine assigned to the partition that holds the hot items.
As mentioned before, partitions in Chiller should be accessible from any transaction engine via RDMA. Chiller uses one-sided RDMA operations (READ, WRITE) to access tables on remote partitions. One-sided operations need the exact memory location of the records they read or modify. Chiller therefore splits partitions into smaller buckets: records within a partition are placed in buckets based on a hash, range, or user-defined function on their primary keys. Each bucket may host multiple records, and may point to an overflow bucket if it gets full.
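A possible in-memory layout of such a bucket structure, sketched below for the hash variant; the bucket capacity and class names are arbitrary choices for illustration, not Chiller's actual record format.

```python
class Bucket:
    """Fixed-capacity bucket that chains to an overflow bucket when full."""

    def __init__(self, capacity=4):
        self.records = {}             # primary key -> record
        self.capacity = capacity
        self.overflow = None          # chained overflow bucket, if any

    def put(self, key, rec):
        if key in self.records or len(self.records) < self.capacity:
            self.records[key] = rec
            return
        if self.overflow is None:
            self.overflow = Bucket(self.capacity)
        self.overflow.put(key, rec)

    def get(self, key):
        if key in self.records:
            return self.records[key]
        return self.overflow.get(key) if self.overflow else None

class Partition:
    """Hash-based placement of records into buckets within one partition."""

    def __init__(self, num_buckets=8, capacity=4):
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, rec):
        self._bucket(key).put(key, rec)

    def get(self, key):
        return self._bucket(key).get(key)
```

Because the bucket for a key is computed from the key itself, a remote engine can derive the target memory address for a one-sided READ or WRITE without consulting the partition's host.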
Chiller provides full serializability by adhering to two-phase locking (2PL) with the NO_WAIT policy. In this variant, a conflict immediately aborts the transaction, and therefore deadlocks are not possible. To achieve this, buckets are locked whenever any of their records are accessed, and the lock is held until the transaction commits or aborts. The two supported lock types are shared and exclusive. Instead of using a separate lock manager that keeps track of lock requests for each lock item (possibly using a queue), each bucket encapsulates its own lock. Aside from avoiding a single point of failure and the performance bottleneck it causes, the big advantage of this technique over a traditional centralized lock manager is that it enables remote execution engines to directly read and update a bucket's lock information via one-sided RDMA operations and RDMA atomics, instead of messaging a centralized lock manager, as discussed in (Zamanian et al., 2017).
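The per-bucket lock can be pictured as a single lock word that encodes both lock modes, which is what makes it amenable to remote RDMA reads and atomics. The encoding below (0 = free, -1 = exclusive, n > 0 = n shared holders) is an illustrative choice, not necessarily Chiller's exact layout.

```python
class BucketLock:
    """Lock word co-located with a bucket. Under NO_WAIT, a conflicting
    request fails immediately (the caller aborts), so deadlocks cannot
    form. In Chiller this word would be read/CAS'd via one-sided RDMA."""

    def __init__(self):
        self.word = 0                 # 0 = free, -1 = exclusive, n > 0 = shared

    def try_shared(self):
        if self.word == -1:
            return False              # exclusive holder present -> abort
        self.word += 1
        return True

    def try_exclusive(self):
        if self.word != 0:
            return False              # any holder present -> abort
        self.word = -1
        return True

    def release_shared(self):
        self.word -= 1

    def release_exclusive(self):
        self.word = 0
```

A remote engine only ever needs a single atomic compare-and-swap on this word to acquire or fail a lock, which is exactly the access pattern RDMA atomics support.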
Finally, each execution engine uses the idea of co-routines in order to hide network latency (inspired by (Kalia et al., 2016)). A co-routine is similar to a normal procedure except that it has the ability to suspend, yield, and continue execution without losing state. Unlike threads which are managed by the OS in a preemptive manner, the control flow of co-routines is completely handled by the program. In Chiller, each transaction is assigned to one of the available execution engine’s co-routines. Once the execution is handed over to a co-routine, it performs the operations in the transaction until it requests something over the network. It then passes the control to the next co-routine. This way, the CPU is always busy doing useful work even in the presence of distributed accesses and network stalls. Each execution engine has a master co-routine whose job is to orchestrate the execution of worker co-routines.
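A generator-based sketch of this cooperative scheduling: Python generators stand in for the co-routines, a simple round-robin loop stands in for the master co-routine, and the operation names are illustrative.

```python
def transaction(name, ops, log):
    """Worker co-routine: each network-bound operation yields control
    back to the scheduler while the (simulated) RDMA request is in flight."""
    for op in ops:
        log.append((name, op))
        yield                         # suspend; scheduler runs someone else

def master(coroutines):
    """Master co-routine: round-robin over workers, keeping the CPU busy
    while other transactions wait on the network."""
    while coroutines:
        co = coroutines.pop(0)
        try:
            next(co)                  # run until the next network operation
            coroutines.append(co)     # re-queue; its request is outstanding
        except StopIteration:
            pass                      # transaction finished

log = []
master([transaction("t1", ["read", "write"], log),
        transaction("t2", ["read"], log)])
```

In this run, t2's read is interleaved between t1's two network-bound operations, which is exactly the kind of overlap that keeps the core busy during network stalls.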
7. Evaluation
We evaluated our system to answer two main questions:
Is the contention-aware data partitioning effective in producing results that can efficiently benefit from the two-region execution model?
How does Chiller and its two-region execution model perform under various levels of contention compared to existing techniques?
7.1. Setup
The test bed we used for our experiments consists of 8 machines connected to a single InfiniBand EDR 4X switch using a Mellanox ConnectX-4 card. Each machine has 256GB RAM and two Intel Xeon E5-2660 v2 processors, each with 10 cores. The machines run Ubuntu 14.04 Server Edition as their OS and the Mellanox OFED 3.4-1 network driver. At start-up, each partition pins a dedicated thread for its execution engine. In all of these experiments, we set the replication degree to 2, meaning that each record has one copy on a different machine.
7.2. Comparison of Partitioning Methods
To evaluate Chiller’s contention-aware partitioning algorithm, we used a data set released by Instacart (Instacart, 2017), which is an online grocery delivery service. We compared Chiller’s partitioning against Schism (Curino et al., 2010) and simple hash partitioning.
7.2.1. Data set and Workload
The key challenge in evaluating any partitioning scheme is to find real data. While a lot of work in this area only modifies synthetic benchmarks (such as YCSB or modified TPC-C) to test their algorithms (Serafini et al., 2016; Tran et al., 2014; Quamar et al., 2013), such benchmarks have obvious limitations. To test our partitioning scheme, we therefore opted to simulate a real-world system by using the recently released Instacart data set (Instacart, 2017). The dataset contains over 3 million grocery orders for around 50,000 items from more than 200,000 customers. On average, each order contains 10 grocery products purchased in one transaction by a customer.
To model a transactional workload based on the Instacart data, we used a TPC-C-like workload. More specifically, we used TPC-C's NewOrder stored procedure, where each transaction reads the stock values of a number of items, subtracts each one by 1, and inserts a new record in the order table. However, instead of randomly selecting items according to the TPC-C specification, we used the actual Instacart data set to determine the contents of each transaction. Unlike the original TPC-C, this data set is difficult to partition due to the nature of grocery shopping, where items from different categories (e.g., dairy, produce, and meat) may be purchased together. Also, and more importantly, there is significant skew in the number of purchases of different products. For example, 15 and 8 percent of transactions contain bananas and strawberries, respectively. Such access skew translates into high data contention, and thus hot items in the database.
The results for comparing a (1) hash-based partitioning based on the primary key, (2) Schism (Curino et al., 2010), a state-of-the-art workload partitioning algorithm which aims to minimize the number of distributed transactions by co-locating records frequently accessed together, and (3) Chiller are shown in Figure 7. For this experiment, we increased the number of machines from 2 to 8, each hosting one partition, but kept the total data size constant. We measured the total throughput of the TPC-C-like NewOrder transactions, just with the aforementioned Instacart products, without any other running transactions. It is important to note that we restricted the execution to exactly one core per machine. We did this to emphasize the impact of the partitioning and avoid mixing it with local transactions (e.g., having two partitions on the same machine, which can potentially bypass the network).
Total Throughput: As Figure 7 shows, Schism achieves a significantly higher throughput than hash-partitioning (about 50 percent improvement), but neither of those techniques scale with the number of partitions. In contrast, Chiller achieves the highest throughput and scales almost linearly with the number of partitions.
Number of Distributed Transactions: To investigate what accounts for these results, we analyzed the percentage of distributed transactions for all three partitioning schemes (Figure 8). Schism is relatively successful in its goal of minimizing the number of distributed transactions, as it has the lowest ratio of such transactions compared to hashing and Chiller. For instance, partitioning using Chiller results in more distributed transactions than Schism for 2 partitions, although the gap narrows as the number of partitions increases. Yielding fewer distributed transactions, however, does not enable Schism to outperform Chiller, or even to scale.
This result emphasizes our claim that in modern distributed environments, where RDMA has dramatically reduced the cost of message processing, minimizing the number of distributed transactions should not be the primary goal of data partitioning. Instead, it is the contention that determines whether a workload can scale or not.
Lookup Table Size: Another observation that we made is about the size of the resulting lookup tables. As mentioned before, all partitioning techniques need to store their record-to-partition mapping in some sort of lookup table. Unlike the original TPC-C benchmark, for workloads such as Instacart, the optimal partitioning layout cannot simply be described in the form of ranges on the primary key, and thus in existing partitioning tools, the number of entries in the lookup table can be as large as the number of records in the database. In contrast, Chiller only needs to store lookup entries for the hot items. In this experiment, we observed that Schism’s lookup table was about 10 times larger than the one in Chiller.
7.3. The Advantages of Two-Region Execution
To assess the ability of our proposed two-region execution model in handling contention, we evaluate how it holds up against alternative commonly used concurrency control models, most importantly traditional 2PL and optimistic schemes (OCC). As described before, we used the NO_WAIT variant of 2PL for both Chiller and traditional 2PL. For OCC, we based our implementation on the MaaT protocol (Mahmoud et al., 2014), which is an efficient and scalable algorithm for OCC in distributed settings (Harding et al., 2017).
7.3.1. Dataset and workload
For this experiment, we used the standard full TPC-C mix without any modifications, and one warehouse per execution engine. Therefore, the partitioning layout is the same for all concurrency control schemes compared in this experiment, that is, partitioned by warehouse. This allows us to focus only on the differences in the execution models of Chiller, 2PL, and OCC. Furthermore, the shopping cart items are selected according to the TPC-C specification, which does not produce much contention on the item table.
We fixed the number of machines to 8 with one warehouse per available core (i.e., 80 warehouses in total). Therefore, each execution engine (which is a thread pinned to a core) handles one warehouse. In this experiment, we increased the number of transactions that can be pending in a partition at any given time (i.e., the number of possible concurrent transactions per server). For example, setting the number of concurrent transactions to 1 means that the execution engine cannot start a new transaction until it finishes the last one (similar to the H-store (Kallman et al., 2008) and HyPer (Kemper and Neumann, 2011) execution models, where a transaction is executed to completion). Setting it to 5 allows for 5 outstanding transactions (e.g., transactions waiting for messages to return). We therefore expect that regardless of the concurrency control technique, increasing the number of outstanding transactions should improve the overall throughput up to a point, as the CPU core is less often idle. Finally, as defined in TPC-C, 10% of NewOrder and 15% of Payment transactions cross warehouses (i.e., are distributed).
We measured the throughput, abort rate, and commit fairness of Chiller, 2PL, and OCC.
Throughput: As Figure 8(a) shows, when there is only one transaction running in each partition at any given time, 2PL and Chiller achieve the same throughput. However, as we increased the number of concurrent transactions per partition, we observed that only Chiller’s throughput increases.
The reason is that we use the TPC-C transactions as originally defined in the benchmark, which contain two severe contention points. First, every NewOrder transaction increments one of 10 records in the district table. Second, every Payment transaction has to update the total balance of its warehouse, creating an even more severe contention point. These contention points are also the reason why OCC performs even worse than the pessimistic concurrency control techniques: much of its work ends up being wasted, as a transaction eventually discovers that it has to abort, as also reported in (Harding et al., 2017). In contrast, Chiller automatically minimizes the lock duration for those two contention points and thus even benefits from increasing the number of concurrent transactions. In fact, it scales up to 4 concurrent transactions per warehouse before its throughput also starts to stagnate as the CPU core becomes saturated.
Abort Rate: To analyze this further, Figure 8(b) shows the abort rate for the three concurrency control techniques. The abort rate of Chiller remains much lower as the number of concurrent transactions grows, because Chiller puts the two contention points into the inner region of the transaction and thus minimizes the lock duration for them.
This figure also verifies that the reason for the drop in 2PL and OCC's throughput is their increasing abort rates, while Chiller's abort rate does not increase. Chiller's throughput peaks at 4 concurrent transactions because at that point it becomes CPU-bound.
Commit Fairness: As a side-effect, we noticed another advantage of our two-region execution technique: it largely avoids starvation because of the shorter duration of contended locks. In TPC-C, NewOrder transactions need a shared lock on the warehouse table, while Payment transactions need an exclusive lock on that table. Therefore, with 2PL the shared lock on the warehouse table gets passed around between different NewOrder transactions all the time, and incoming Payment transactions have no chance of acquiring an exclusive lock for their update. This is because shared locks are compatible with each other: when there is a shared lock on a record, further shared-lock requests can be granted, while exclusive-lock requests cannot. As a result, with just 4 concurrent transactions, the abort rate of the Payment transaction in TPC-C using the standard 2-phase locking protocol is close to 100%. In contrast, with Chiller's execution, the duration of shared locks is minimized, giving every transaction more of a chance to acquire its locks. Even though this is an artifact of our NO_WAIT locking method, any other method used to establish more fairness among different stored procedures would do so at the cost of further degrading performance (Harding et al., 2017).
7.4. Impact of Distributed Transactions
Finally, we evaluated the impact of distributed transactions on the various concurrency control techniques.
We used the TPC-C benchmark, but restricted the set of transactions to NewOrder and Payment, each making up 50% of the mix. For Payment, we varied the probability that the paying customer is located at a remote warehouse (the TPC-C default is 15%), and for NewOrder we varied the probability that at least one of the purchased items is located in a remote partition (the default is about 10%).
Figure 10 shows the total throughput with a varying fraction of distributed transactions. For both 2PL and OCC, we show the performance for running 1 or 5 concurrent transactions per warehouse at any given time. For Chiller, we used 5 concurrent transactions as it produced the best results, which is due to the way that it can serialize contention points. With 1 concurrent transaction per warehouse, there are few conflicts between transactions (transactions from different warehouses conflict only when accessing the same record from the item table). Therefore, the main reason for the significant performance degradation when more transactions span multiple partitions is the higher average latency of each transaction. On the other hand, with 5 open transactions per warehouse, the conflict rate is high even in the absence of any distributed transactions. As the percentage of distributed transactions increases, the already existing conflicts become more pronounced due to the prolonged duration of locks. This observation clearly shows why a good partitioning layout is a necessity for good performance in traditional systems, and why existing techniques aim to minimize the percentage of distributed transactions.
Chiller has the best performance compared to the alternative approaches, and also degrades the least when the fraction of distributed transactions increases. This is because the execution thread for a partition always has useful work to do: when a transaction is waiting for remote data, the next new or pending transaction can be processed. Since conflicts are most likely handled sequentially in the inner region, concurrent transactions have a very small likelihood of conflicting with each other. Therefore, an increase in the percentage of distributed transactions only means higher latency per transaction, but does not have a significant impact on throughput. This supports our claim that minimizing the number of multi-partition transactions should not be the primary goal in the next generation of OLTP systems that leverage fast networks; optimizing for contention should be.
8. Related Work
Data Partitioning: A large body of work exists for partitioning OLTP workloads with the ultimate goal of minimizing cross-partition transactions (Curino et al., 2010; Tran et al., 2014). Most notably, Schism (Curino et al., 2010) is an automatic partitioning and replication tool that uses a trace of the workload to model the relationship between the database records as a graph, and then applies METIS (Karypis and Kumar, 1998) to find a small cut while approximately balancing the number of records among partitions. Clay (Serafini et al., 2016) builds the same workload graph as Schism, but instead takes an incremental approach to partitioning by building on the previously produced layout as opposed to recomputing it from scratch. E-store (Taft et al., 2014) balances the load in the presence of skew in tree-structured schemas by spreading the hottest records across different partitions, and then moving large blocks of cold records to the partition where their co-accessed hot record is located. Given the schema of a database, Horticulture (Pavlo et al., 2012) heuristically navigates its search space of table schemas to find the ideal set of attributes to partition the database. As stated earlier, all of these methods share their main objective of minimizing inter-partition transactions, which in the past have been known to be prohibitively expensive. However, in the age of new networks and much “cheaper” distributed transactions, such an objective is no longer optimal.
Determinism and Contention-Reducing Execution: Another line of work aims to reduce contention by enforcing determinism in part or all of the concurrency control unit (Cowling and Liskov, 2012; Kallman et al., 2008; Thomson et al., 2012). In Granola (Cowling and Liskov, 2012), servers exchange timestamps to serialize conflicting transactions. Calvin (Thomson et al., 2012) takes a similar approach, except that it relies on a global agreement scheme to deterministically sequence the lock requests. Faleiro et al. (Faleiro et al., 2014, 2017) propose two techniques for deterministic databases, namely lazy execution and early write visibility, which aim to reduce data contention in those systems. All of these techniques and protocols require a priori knowledge of read-sets and write-sets.
In response to poor performance of multi-version concurrency control methods in high-conflict scenarios (Harding et al., 2017), MV3C (Dashti et al., 2017) repairs parts of failed transactions. Most related to Chiller is Quro (Yan and Cheung, 2016). Quro also re-orders operations inside transactions in a centralized DBMS, such that more contended data is accessed later in the transaction and hence its lock duration is reduced. However, unlike Chiller, the granularity of contention for Quro is tables, and not records. Furthermore, Quro does not have any notion of distributed transactions nor does it try to find a good partitioning scheme.
Transactions over Fast Networks: This paper continues the growing focus on distributed transaction processing on new RDMA-enabled networks (Binnig et al., 2016). The increasing adoption of these networks by key-value stores (Mitchell et al., 2013; Kalia et al., 2015; Li et al., 2017) and DBMSs (Dragojević et al., 2015; Zamanian et al., 2017; Kalia et al., 2016; Chen et al., 2017) is due to their much lower overhead for message processing using RDMA features, low latency, and high bandwidth. These systems are positioned at different points of the RDMA spectrum. For example, FaSST (Kalia et al., 2016) uses unreliable datagram connections to build an optimized RPC layer, while FaRM (Dragojević et al., 2015) and NAM-DB (Zamanian et al., 2017) leverage one-sided RDMA to directly read or write data on a remote partition. Though different in their design choices, scalability in the face of cross-partition transactions is a common promise of these systems, provided that the workload itself does not impose contention. Therefore, Chiller's two-region execution and its contention-centric partitioning are specifically suitable for this class of distributed data stores.
9. Conclusions and Future Work
This paper presents Chiller, a distributed transaction processing and data partitioning scheme that aims to minimize contention. Chiller is designed for fast RDMA-enabled networks where the cost of distributed transactions is already low, and the system’s scalability depends on the absence of contention in the workload. Chiller partitions the data such that the hot records which are likely to be accessed together are placed on the same partition. Using a novel two-region processing approach, it then executes the hot part of a transaction separately from the cold part. Our experiments show that Chiller can significantly outperform existing approaches under workloads with varying degrees of contention.
For future work, we intend to investigate the possibility of an opportunistic hybrid locking model for contended records. Ideally, using such a model, the system would be able to update the contended records of a partition using a single thread, without acquiring any locks for them in most cases, and resort to locking only when necessary. Also, extending this work to other concurrency control mechanisms, such as timestamp ordering and variants of optimistic schemes, would be an interesting direction for future research.
- Binnig et al. (2016) Carsten Binnig, Andrew Crotty, Alex Galakatos, Tim Kraska, and Erfan Zamanian. 2016. The End of Slow Networks: It’s Time for a Redesign. PVLDB 9, 7 (2016), 528–539.
- Chen et al. (2017) Haibo Chen, Rong Chen, Xingda Wei, Jiaxin Shi, Yanzhe Chen, Zhaoguo Wang, Binyu Zang, and Haibing Guan. 2017. Fast in-memory transaction processing using RDMA and HTM. ACM Transactions on Computer Systems (TOCS) 35, 1 (2017), 3.
- Cowling and Liskov (2012) James A Cowling and Barbara Liskov. 2012. Granola: Low-Overhead Distributed Transaction Coordination.. In USENIX Annual Technical Conference, Vol. 12.
- Curino et al. (2010) Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: a workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment 3, 1-2 (2010), 48–57.
- Dashti et al. (2017) Mohammad Dashti, Sachin Basil John, Amir Shaikhha, and Christoph Koch. 2017. Transaction Repair for Multi-Version Concurrency Control. In Proceedings of the 2017 ACM International Conference on Management of Data. 235–250.
- Dragojević et al. (2015) Aleksandar Dragojević, Dushyanth Narayanan, Edmund B Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. 2015. No compromises: distributed transactions with consistency, availability, and performance. In Proceedings of the 25th Symposium on Operating Systems Principles. 54–70.
- Faleiro et al. (2017) Jose M Faleiro, Daniel J Abadi, and Joseph M Hellerstein. 2017. High performance transactions via early write visibility. Proceedings of the VLDB Endowment 10, 5 (2017), 613–624.
- Faleiro et al. (2014) Jose M Faleiro, Alexander Thomson, and Daniel J Abadi. 2014. Lazy evaluation of transactions in database systems. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. 15–26.
- Harding et al. (2017) Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. An evaluation of distributed concurrency control. Proceedings of the VLDB Endowment 10, 5 (2017), 553–564.
- Instacart (2017) Instacart. 2017. The Instacart Online Grocery Shopping Dataset 2017. https://www.instacart.com/datasets/grocery-shopping-2017. (2017). [Online; accessed 2018-08-05].
- Kalia et al. (2015) Anuj Kalia, Michael Kaminsky, and David G Andersen. 2015. Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review 44, 4 (2015), 295–306.
- Kalia et al. (2016) Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 185–201.
- Kallman et al. (2008) Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan PC Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, et al. 2008. H-store: a high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment 1, 2 (2008), 1496–1499.
- Karypis and Kumar (1998) George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.
- Kemper and Neumann (2011) Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In 2011 IEEE 27th International Conference on Data Engineering. 195–206.
- Kraska et al. (2009) Tim Kraska, Martin Hentschel, Gustavo Alonso, and Donald Kossmann. 2009. Consistency rationing in the cloud: pay only when it matters. Proceedings of the VLDB Endowment 2, 1 (2009), 253–264.
- Li et al. (2017) Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. 2017. KV-direct: high-performance in-memory key-value store with programmable NIC. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 137–152.
- Li et al. (2016) Feng Li, Sudipto Das, Manoj Syamala, and Vivek R Narasayya. 2016. Accelerating relational databases by leveraging remote memory and rdma. In Proceedings of the 2016 International Conference on Management of Data. ACM, 355–370.
- Mahmoud et al. (2014) Hatem A Mahmoud, Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, and Amr El Abbadi. 2014. Maat: Effective and scalable coordination of distributed transactions in the cloud. Proceedings of the VLDB Endowment 7, 5 (2014), 329–340.
- Mitchell et al. (2013) Christopher Mitchell, Yifeng Geng, and Jinyang Li. 2013. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX Annual Technical Conference. 103–114.
- Oracle (2012) Oracle Technical White Paper. 2012. Delivering Application Performance with Oracle’s InfiniBand Technology. Technical Report.
- Pavlo et al. (2012) Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 61–72.
- Quamar et al. (2013) Abdul Quamar, K Ashwin Kumar, and Amol Deshpande. 2013. SWORD: scalable workload-aware data placement for transactional workloads. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 430–441.
- Serafini et al. (2016) Marco Serafini, Rebecca Taft, Aaron J Elmore, Andrew Pavlo, Ashraf Aboulnaga, and Michael Stonebraker. 2016. Clay: fine-grained adaptive partitioning for general database schemas. Proceedings of the VLDB Endowment 10, 4 (2016), 445–456.
- Stonebraker and Weisberg (2013) Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. IEEE Data Eng. Bull. 36, 2 (2013), 21–27.
- Taft et al. (2014) Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J Elmore, Ashraf Aboulnaga, Andrew Pavlo, and Michael Stonebraker. 2014. E-store: Fine-grained elastic partitioning for distributed transaction processing systems. Proceedings of the VLDB Endowment 8, 3 (2014), 245–256.
- Thomson et al. (2012) Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 1–12.
- Tran et al. (2014) Khai Q Tran, Jeffrey F Naughton, Bruhathi Sundarmurthy, and Dimitris Tsirogiannis. 2014. JECB: A join-extension, code-based approach to OLTP data partitioning. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 39–50.
- Yan and Cheung (2016) Cong Yan and Alvin Cheung. 2016. Leveraging lock contention to improve OLTP application performance. Proceedings of the VLDB Endowment 9, 5 (2016), 444–455.
- Zamanian et al. (2017) Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The end of a myth: Distributed transactions can scale. Proceedings of the VLDB Endowment 10, 6 (2017), 685–696.
- Zamanian et al. (2015) Erfan Zamanian, Carsten Binnig, and Abdallah Salama. 2015. Locality-aware partitioning in parallel database systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 17–30.