Improving High Contention OLTP Performance via Transaction Scheduling

10/03/2018 ∙ by Guna Prasaad, et al. ∙ University of Washington

Research in transaction processing has made significant progress in improving the performance of multi-core in-memory transactional systems. However, the focus has mainly been on low-contention workloads. Modern transactional systems perform poorly on workloads with transactions accessing a few highly contended data items. We observe that most transactional workloads, including those with high contention, can be divided into clusters of data conflict-free transactions and a small set of residuals. In this paper, we introduce a new concurrency control protocol called Strife that leverages the above observation. Strife executes transactions in batches, where each batch is partitioned into clusters of conflict-free transactions and a small set of residual transactions. The conflict-free clusters are executed in parallel without any concurrency control, followed by executing the residual cluster either serially or with concurrency control. We present a low-overhead algorithm that partitions a batch of transactions into clusters that do not have cross-cluster conflicts and a small residual cluster. We evaluate Strife against the optimistic concurrency control protocol and several variants of two-phase locking, where the latter is known to perform better than other concurrency protocols under high contention, and show that Strife can improve transactional throughput by up to 2x. We also perform an in-depth micro-benchmark analysis to empirically characterize the performance and quality of our clustering algorithm.




1. Introduction

Online Transaction Processing (OLTP) systems rely on a concurrency control protocol to ensure serializability of transactions executed concurrently. When two transactions executing in parallel try to access the same data item (e.g., a tuple, index entry, or table), the concurrency control protocol coordinates their accesses such that the final result is still serializable. Different protocols achieve this in different ways. Locking-based protocols such as two-phase locking (2PL) associate a lock with each data item, and a transaction must acquire all locks (either in shared or exclusive mode) for the data items it accesses before releasing any. Validation-based protocols such as optimistic concurrency control (OCC) (Kung and Robinson, 1981) optimistically execute a transaction with potentially stale or dirty (i.e., uncommitted) data and validate for serializability before commit.

Validation-based protocols (Diaconu et al., 2013; Tu et al., 2013) are known to be well-suited for workloads with low data contention, i.e., few conflicts, where a conflict occurs when transactions access the same data item concurrently and at least one access is a write. Since conflicting accesses are rare, it is unlikely that the value of a data item is updated by another transaction during a transaction's execution, and hence validation mostly succeeds. On the other hand, for workloads where data contention is high, locking-based concurrency control protocols are generally preferred as they pessimistically block other transactions that require access to the same data item instead of incurring the overhead of repeatedly aborting and restarting the transaction as in OCC. When the workload is known to be partitionable, partitioned concurrency control (Kallman et al., 2008) is preferred as it eliminates the lock acquisition and release overhead for individual data items by replacing it with a single partition lock.

Recent empirical studies (Yu et al., 2014) have revealed that even 2PL-based protocols incur a heavy overhead in processing highly contended workloads due to lock thrashing (for ordered lock acquisition), high abort rates (for no-wait and wait-die protocols) or expensive deadlock-detection. Our main proposal in this paper is to eliminate concurrency control-induced overheads by intelligently scheduling these highly-contended transactions on cores such that the execution is serializable even without any concurrency control.

In this paper, we propose a new transaction processing scheme called Strife that exploits data contention to improve performance in multi-core OLTP systems under high contention. The key insight behind Strife is to recognize that most transactional workloads, even those with high data contention, can be partitioned into two portions: multiple clusters of transactions, where there are no data conflicts between any two clusters; and some residuals, i.e., those that have data conflicts with at least two other transactions belonging to different clusters.

As an example (to be elaborated in Section 2), a workload that consists of TPC-C new order transactions can be divided into two portions: each set of transactions that orders from the same warehouse constitutes a cluster, while those that order from multiple warehouses constitute the residuals.

Since transactions in different clusters access disjoint sets of data items, they can be executed in parallel by assigning each cluster to a different core; each core can execute a given cluster without any concurrency control, and all of the executed transactions are guaranteed to commit (unless explicitly aborted). The residuals, on the other hand, can be executed serially either on a single core, again without any concurrency control, or across multiple cores with concurrency control applied. Our protocol aims to capture the “best of both worlds”: partition the workload to identify as many clusters as possible as they can be executed without the overhead of running any concurrency control protocols, and minimize the number of residual transactions.

The idea of transaction partitioning is similar to partitioned databases (Kallman et al., 2008), where data items are split across different machines or cores to avoid simultaneous access of the same data items from multiple transactions. However, data partitioning needs to be done statically prior to executing any transactions, and migrating data across different machines during transaction execution is expensive. Strife instead partitions transactions rather than data, and treats each batch of transactions as an opportunity to repartition, based on the access patterns that are inherent in the batch.

Furthermore, since data access is a property that can change over different workloads, Strife is inspired by deterministic database systems (Thomson et al., 2012) and executes transactions in batches. More precisely, Strife collects transactions into batches; partitions the transactions into conflict-free clusters and residuals; and executes them as described above. The same process repeats with a new batch of transactions, where they are partitioned before execution.

Implementing Strife raises a number of challenges. Clearly, a naive partitioning that classifies all transactions as residuals would fulfill the description above, although doing so simply reduces to standard concurrency control-based execution and yields no performance benefit. On the other hand, if the residual cluster is forced to be small, then the number of conflict-free clusters produced might be smaller. As such, we identify the following desiderata for our partitioning algorithm:

Minimize the number of residuals.

Maximize the number and size of conflict-free clusters.

Minimize the amount of time required to partition the transactions; time spent on partitioning takes away performance gain from executing without concurrency control.

To address these challenges, Strife comes with a novel algorithm to partition an incoming batch of transactions. It first represents the transaction batch as an access graph, which is a bipartite graph describing each transaction and the data items that it accesses. Partitioning then proceeds in steps: we first sample on the access graph to form the initial seed clusters, and then allocate the remaining transactions into the clusters. The resulting clusters are merged based on their sizes, and any leftover transactions are collected as the residuals. The final clusters are then stored in a worklist, with the cores executing them in parallel before proceeding to execute the residuals afterwards. Our prototype implementation has shown that the Strife protocol can improve transaction throughput by up to 2x, as compared to traditional protocols such as two-phase locking for high-contention workloads.

In summary, we make the following contributions:

We propose a novel execution scheme for high contention workloads that is based on partitioning a batch of transactions into many conflict-free clusters and a residual cluster. We use this clustering to execute most transactions in the batch without any form of concurrency control, except for a few in the residual cluster.

We design a new algorithm for partitioning transactions based on their access patterns. Our algorithm uses a combination of sampling techniques and parallel data structures to ensure efficiency.

We have implemented a prototype of the Strife concurrency control protocol, and evaluated it using two popular transaction benchmarks: TPC-C (Council, 2018) and YCSB (Cooper et al., 2010). The experiments show that Strife can substantially improve the performance of transactional systems under high contention by partitioning the workloads into clusters and residuals.

The rest of this paper is organized as follows: We first provide an overview of Strife in Section 2. Then in Section 3 we discuss our partitioning algorithm in detail. We present our evaluation of Strife in Section 4, followed by discussion of related work in Section 6.

2. Overview

In Strife, transactions are scheduled on cores based on their data-access pattern. Strife collects and executes transactions in batches, and assumes that the read-write set of a transaction can be obtained statically. In scenarios where that is not possible, one can use a two-stage execution strategy similar to deterministic databases (Thomson et al., 2012): first dynamically obtain the complete read-write set using a reconnaissance query, then conditionally execute the transaction.

Strife employs a micro-batch architecture to execute transactions. Incoming transactions are grouped together into batches, partitioned into clusters and residuals, and scheduled to be executed on multiple cores. Micro-batching allows Strife to analyze conflicting data accesses and utilize them to intelligently partition the workload.

Figure 1. Execution Scheme of Strife on cores

The execution scheme of Strife is shown in Figure 1. A batch of transactions is executed in three phases: analysis, conflict-free and residual phase. First, the batch of transactions is analyzed and partitioned into a number of conflict-free clusters and a small residual. Each conflict-free cluster is then executed without any concurrency control in parallel on all cores in the conflict-free phase. After all clusters have finished execution, the residual transactions are executed on multiple cores with conventional concurrency control. (As mentioned in Section 1, the residuals can be executed serially on a single core as well, although our experiments have shown that executing using multiple cores with concurrency control is a better strategy.) Once a batch is completed, Strife repeats by analyzing the next batch.

We next give an overview of each of the three phases using an example workload.

2.1. Analysis Phase

Figure 2. Access Graph of TPC-C transactions
Figure 3. (a) Access Graph of TPC-C transactions (b) Optimal solutions for partitioning schemes

The goal of the analysis phase is to partition the batch of transactions into clusters such that any two transactions from two different clusters are conflict-free. We explain the details next.

To partition a batch of transactions, we first represent them using a data access graph. A data access graph is an undirected bipartite graph $G = (B \cup D, E)$, where $B$ is the set of transactions in the batch, $D$ is the set of all data items (e.g., tuples or tables) accessed by transactions in $B$, and the edges $E$ contain all pairs $(T, d)$ where transaction $T$ accesses data item $d$. Two transactions $T, T'$ are said to be in conflict if they access a common data item and at least one of the accesses is a write.
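The definition above can be sketched in a few lines of Python. This is only an illustration: the `Txn` class, the function names, and the item labels (`w1`, `s1`, `i7`, ...) are hypothetical and do not come from the Strife prototype.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction record with a statically known read-write set."""
    tid: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()

    @property
    def items(self):
        return self.reads | self.writes


def build_access_graph(batch):
    """Return both sides of the bipartite graph: txn -> items and item -> txn ids."""
    txn_to_items = {t.tid: set(t.items) for t in batch}
    item_to_txns = defaultdict(set)
    for t in batch:
        for d in t.items:
            item_to_txns[d].add(t.tid)
    return txn_to_items, dict(item_to_txns)


def in_conflict(a, b):
    """True if a and b share a data item that at least one of them writes."""
    return bool((a.writes & b.items) | (b.writes & a.items))


batch = [
    Txn("T1", writes=frozenset({"w1", "s1"})),
    Txn("T2", writes=frozenset({"w1", "s2"})),
    Txn("T3", reads=frozenset({"i7"}), writes=frozenset({"w2"})),
    Txn("T4", reads=frozenset({"i7"})),
]
txn_to_items, item_to_txns = build_access_graph(batch)
```

Here `T1` and `T2` conflict on `w1`, while `T3` and `T4` only share the read-only item `i7` and therefore do not conflict.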

For example, Figure 2 depicts the access graph of a batch of transactions from the TPC-C benchmark. A new-order transaction simulates a customer order for approximately ten different items. The example in Figure 2 contains three different warehouses. Each warehouse maintains stock for a number of different items in the catalog.

As shown in the figure, transactions access data items from different tables in the TPC-C database. A transaction in the figure, for example, writes to the tuple of its warehouse and a few other tuples from other tables such as district and stock that belong to that warehouse as well. Two transactions that order from the same warehouse are in conflict because they both access its warehouse tuple, whereas transactions that order from different warehouses are not. The batch shown in Figure 2 is said to be partitionable as groups of transactions access disjoint sets of data items. It can be partitioned into three clusters that do not conflict with each other, and the clusters can be executed in parallel with each one scheduled on a different core.

However, real workloads contain outliers that access data items from multiple clusters. Consider the example shown in Figure 3(a), again of TPC-C new-order transactions. Here, two transactions order items from multiple warehouses, so each of them conflicts with transactions of more than one warehouse. There are two ways to execute these two transactions: either merge the clusters that each of them spans and assign the resulting cluster to be executed on a single core, or move the two transactions into a separate cluster to be executed afterwards. As the former might result in a single large cluster that takes a significant amount of time to execute, we take the latter approach and consider the two transactions as residuals. This results in the remaining batch being partitioned into three conflict-free clusters along with the residuals, as shown in Figure 3(b).

A clustering is a partition of the transactions in the batch $B$ into sets $C_1, \ldots, C_k, R$ such that, for any $i \neq j$ and any transactions $T \in C_i$, $T' \in C_j$, $T$ and $T'$ are not in conflict. Notice that no requirement is placed on the residuals $R$. The data access graph does not distinguish between read and write access, because Strife considers only data items for which there is at least one write by a transaction. Consequently, if any two transactions that access the same common item are placed in two distinct clusters, then at least one of them will have a write conflict with some other transaction, hence we do not need to consider the type of access to the data items.
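As a sanity check on this definition, a small validity test can be written directly from it. The sketch below assumes each transaction is a hypothetical `(read_set, write_set)` pair; the function names are illustrative only. The residual set is deliberately not checked, since the definition places no requirement on it.

```python
from itertools import combinations


def conflicts(a, b):
    """a and b are (read_set, write_set) pairs; conflict = shared item with a write."""
    (ra, wa), (rb, wb) = a, b
    return bool(wa & (rb | wb)) or bool(wb & (ra | wa))


def is_valid_clustering(clusters):
    """True if no two transactions from different clusters are in conflict."""
    for ci, cj in combinations(clusters, 2):
        for a in ci:
            for b in cj:
                if conflicts(a, b):
                    return False
    return True


t1 = (frozenset(), frozenset({"w1"}))
t2 = (frozenset(), frozenset({"w1", "s1"}))
t3 = (frozenset(), frozenset({"w2"}))
t4 = (frozenset(), frozenset({"w1", "w2"}))  # spans both warehouses

ok = is_valid_clustering([[t1, t2], [t3]])        # t4 kept aside as a residual
bad = is_valid_clustering([[t1, t2], [t3, t4]])   # t4 conflicts with t1 and t2
```

Moving `t4` into the residuals is what makes the first clustering valid.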

During the conflict-free phase, each cluster $C_i$ is executed on one dedicated core without any concurrency control between cores. After all clusters have finished, then, during the residual phase, the residual transactions $R$ are executed, still concurrently, but with conventional concurrency control applied. Ideally, we want the number of clusters $k$ to be at least as large as the number of cores to exploit parallelism at maximum during the conflict-free phase, and we want $R$ to be empty or as small as possible to reduce the cost of the residual phase. To get an intuition about the tradeoffs, we describe two naive clusterings. The first is the fallback clustering, where we place all transactions in $R$ and set $k = 0$; this corresponds to running the entire batch using a conventional concurrency control mechanism. The second is the sequential clustering, where we place all transactions in $C_1$, and set $k = 1$ and $R = \emptyset$; this corresponds to executing all transactions sequentially, on a single core. As one can imagine, neither of these would result in any significant performance improvement. Hence in practice we constrain $k$ to be at least as large as the number of cores, and $|R|$ to be no larger than some small fraction of the transaction batch.

In practice, a good clustering exists for most transaction workloads, except for the extreme cases. In one extreme, when all transactions in the batch access a very small number of highly contentious data items, no good clustering exists besides fallback and sequential, and our system simply resorts to fallback clustering. Once the data contention decreases, i.e., the number of contentious data items increases to at least the number of cores, a good clustering exists, and it centers around the contentious items. When the contention further decreases to the other extreme where all transactions access different data items, any partitioning of the transactions into sets of roughly equal size would be the best clustering. Thus, we expect good clusterings to exist in all but the most extreme workloads, and we validate this in Section 4. The challenge is to find such a clustering very efficiently; we describe our clustering algorithm in Section 3.

2.2. Conflict-Free Phase

After partitioning the incoming workload into conflict-free clusters, Strife then schedules them to be executed in parallel on multiple execution threads. Each execution thread obtains a conflict-free cluster and executes it to completion before moving to the next. Transactions belonging to the same cluster are executed serially one after another in the same execution thread.

Since the scheduling algorithm guarantees that there are no conflicts across transactions from different clusters, there is no need for concurrency control in this phase. As noted earlier, concurrency control is a significant overhead in transactional systems, especially for workloads that have frequent access to highly contended data. Hence removing it will significantly improve performance.

The degree of parallelism in this phase is determined by the number of conflict-free clusters. A larger number of clusters allows more of them to be executed in parallel, thereby reducing the total time to execute all transactions in conflict-free clusters.

Once an execution thread has completed executing a cluster, it tries to obtain the next one. If there is none, it waits for all other threads that are processing conflict-free clusters before moving to the next phase. This is because residual transactions could conflict with transactions that are currently being executed by other conflict-free phase threads without concurrency control. Hence, a skew in cluster sizes can cause a reduction in parallelism as threads that complete early cannot advance to the next phase, although as our experiments in Section 4 show, that is usually not the case.
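The worklist-and-barrier behaviour of this phase can be sketched as follows. This is a minimal illustration using Python threads, not the prototype's implementation; `run_conflict_free_phase`, `execute`, and the toy `db` dictionary are assumptions made for the example.

```python
import queue
import threading


def run_conflict_free_phase(clusters, execute, num_threads):
    """Run every cluster serially on some thread; return once all clusters are done."""
    worklist = queue.Queue()
    for cluster in clusters:
        worklist.put(cluster)

    def worker():
        while True:
            try:
                cluster = worklist.get_nowait()
            except queue.Empty:
                return  # nothing left: this thread idles until the join below
            for txn in cluster:  # serial within a cluster, no locks taken
                execute(txn)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # barrier: the residual phase must not start before this point


# Toy workload: each "transaction" increments the counter of one warehouse,
# and every cluster touches a different warehouse.
db = {"w1": 0, "w2": 0, "w3": 0}
clusters = [["w1"] * 100, ["w2"] * 50, ["w3"] * 25]


def execute(key):
    db[key] += 1


run_conflict_free_phase(clusters, execute, num_threads=2)
```

Because each key is only ever touched by the thread that owns its cluster, the unsynchronized increments are safe here; the final `join` loop plays the role of the wait described above, and it is also where skewed cluster sizes would leave early finishers idle.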

2.3. Residual Phase

As we saw in our example from Figure 3(a), Strife identifies a few transactions to be outliers and considers them as residuals. These transactions conflict with transactions from more than one conflict-free cluster. We execute these residual transactions concurrently on all execution threads, but apply some form of concurrency control. Unlike the conflict-free phase, where we guarantee conflict-freedom, transactions executed in the residual phase require concurrency control to ensure serializability.

We could use any serializable concurrency control protocol in this phase. In Strife, we use 2PL with the NO-WAIT deadlock prevention policy as it has been shown to be highly scalable (Yu et al., 2014) with much less overhead compared to other protocols. Under this policy, when a transaction is aborted due to the concurrency control scheme, the execution thread retries it until it either commits successfully or aborts logically.
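A minimal sketch of the NO-WAIT retry loop is shown below. It makes two simplifying assumptions that are not from the paper: all locks are exclusive, and every lock is acquired before the transaction body runs (possible here because the read-write set is known up front), so an abort never has to undo any writes. Logical aborts are not modelled, and the names `run_no_wait`, `Abort`, and `bump_both` are hypothetical.

```python
import threading


class Abort(Exception):
    """Raised when a lock cannot be taken immediately (NO-WAIT)."""


def run_no_wait(txn_items, body, locks):
    """Execute body under exclusive 2PL with NO-WAIT; retry until it commits."""
    while True:
        held = []
        try:
            for d in txn_items:  # growing phase
                if not locks[d].acquire(blocking=False):
                    raise Abort(d)  # never wait, so deadlock cannot occur
                held.append(d)
            body()
            return  # commit; locks are released in the finally block
        except Abort:
            pass  # release whatever was taken and retry
        finally:
            for d in held:  # shrinking phase
                locks[d].release()


locks = {"w1": threading.Lock(), "w2": threading.Lock()}
db = {"w1": 0, "w2": 0}


def bump_both():
    db["w1"] += 1
    db["w2"] += 1


def worker():
    for _ in range(200):
        run_no_wait(["w1", "w2"], bump_both, locks)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every residual transaction in this toy run touches both warehouses, so the four threads contend on the same two locks, yet no update is lost because the body only runs while both locks are held.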

Once all the residual transactions have been executed, Strife processes the next batch by running the analysis phase again.

3. Transaction Partitioning Algorithm

Figure 4. Example batch of TPC-C transactions

As mentioned, Strife partitions a batch of transactions $B$ into a set of conflict-free clusters $C_1, \ldots, C_k$, and a residual $R$ of size at most $\alpha |B|$, with $\alpha$ being a configurable parameter. This partitioning problem can be modeled as graph partitioning on the data access graph that corresponds to $B$. Graph partitioning in general is NP-complete, and hence the best known algorithms for obtaining the optimal solution take time exponential in the size of the batch. Nevertheless, graph partitioning is a well researched area with many heuristic solutions. We review some of these solutions in Section 6. However, it is challenging to use an off-the-shelf solution to the problem at hand as most of them do not meet the performance requirements of a low-latency transaction processing system. So, we developed a heuristic solution that exploits the contentious nature of each batch of transactions.

Our partitioning algorithm is divided into three stages: (1) spot, (2) allocate and (3) merge. In the spot stage, we identify highly contended data items from the batch using a randomized sampling strategy. Each of those data items is allocated to a different cluster. Transactions in the batch are then allotted to one of these clusters in the allocate stage. Finally, in the merge stage, some of these clusters are merged together to form larger clusters when a significant number of transactions in the batch co-access data items from multiple clusters.

We use Figure 4, derived from the TPC-C benchmark, as an illustrative example. In the figure, a new-order transaction (black dot) shown inside a warehouse (circle) orders items only from that warehouse, and transactions that order from multiple warehouses are shown at the intersections of the corresponding circles. As shown in the figure, in the given batch the majority of transactions order only locally from two of the warehouses, while many transactions involving two other warehouses order from multiple warehouses.

Before running the three stages of our partitioning algorithm, we first perform simple pre-processing on the transactions. During the pre-processing step, Strife receives incoming transactions, stores them in the current batch, and computes the set of data items that are accessed in write mode by at least one transaction. Data items that are read-only in the entire batch are ignored for partitioning purposes. For example, items is a dimension table in the TPC-C benchmark that is mostly read and rarely updated; as a consequence, many items tuples accessed in the batch are ignored by our algorithm. In the rest of the algorithm, we consider only those data items that are written to by at least one transaction in the batch.
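The pre-processing step can be illustrated with a short sketch, assuming each transaction is given as a hypothetical `(read_set, write_set)` pair. The function names and item labels are made up for the example.

```python
def written_items(batch):
    """Data items written by at least one transaction in the batch."""
    written = set()
    for _reads, writes in batch:
        written |= writes
    return written


def prune_read_only(batch):
    """Per-transaction access sets restricted to items that someone writes."""
    written = written_items(batch)
    return [(reads & written) | writes for reads, writes in batch]


batch = [
    ({"i7"}, {"w1", "s1"}),        # reads an item tuple, writes warehouse and stock
    ({"i7", "w1"}, {"s2"}),        # reads w1, which the first transaction writes
    ({"i7", "i9"}, set()),         # read-only transaction
]
pruned = prune_read_only(batch)
```

The read-only item tuples `i7` and `i9` disappear from every access set, whereas the read of `w1` in the second transaction is kept because another transaction in the batch writes it.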

3.1. Spot Stage

During the spot stage we create initial seeds for the clusters by randomly choosing mutually non-conflicting transactions. The pseudo-code for this stage is shown in Algorithm 1. Initially, all data items and transactions in the access graph are unallocated to any cluster. We begin by picking a transaction $T$ from the batch $B$ uniformly at random. If none of the data items accessed by $T$, denoted $D(T)$, is allocated to any cluster, then we create a new cluster $C_k$ and allot each $d \in D(T)$ to cluster $C_k$. If any of the data items is already allocated to a cluster, we reject $T$ and move on to the next sample. We repeat this randomized sampling of transactions a constant number of times $s$, where $s$ is some small factor times the number of cores. When all transactions in the batch access a single highly contended data item $d$, for example, the initial pick will create cluster $C_1$ and allot $d$ to $C_1$. All future samples are now rejected as they access $d$. In such a case, we revert back to fallback clustering and execute sequentially.

The goal of the spot stage is to quickly identify highly contentious data items in the workload, as each such item should form its own cluster. To get some intuition for how the spot stage works, suppose there are $c$ cores and the workload happens to have $c$ “hot spot” data items $d_1, \ldots, d_c$, meaning data items that are each accessed by a fraction $1/c$ of all transactions in the batch. An ideal clustering should place each hot spot $d_i$ in a different cluster $C_i$. We observe that, in this case, with high probability, each of the hot spots is accessed by one of the transactions chosen by the spot stage as initial seeds. Indeed, every hot spot data item $d_i$ is picked with probability at least $1 - (1 - 1/c)^s$, because during each of the $s$ iterations, the probability that we choose a transaction that does not access $d_i$ is $1 - 1/c$ and, assuming $s = mc$ for a small constant factor $m$, we have $(1 - 1/c)^{mc} \le e^{-m}$. This means that, with high probability, the spot stage will form cluster seeds that are centered precisely around those few hot spot data items. By the same reasoning, if two hot spot data items are frequently accessed together, then with high probability the cluster that contains one will also contain the other.
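The bound in this argument can be written out explicitly. The derivation below is a sketch under stated assumptions that go beyond the text: each hot spot $d_i$ is accessed by a fraction $1/c$ of the batch, the $s$ samples are independent and uniform, $s = mc$ for a constant $m$ (a symbol introduced here for illustration), and "picked" means that at least one sampled transaction accesses $d_i$.

```latex
\begin{align*}
\Pr[\text{no sample accesses } d_i]
  &= \left(1 - \tfrac{1}{c}\right)^{s} \;\le\; e^{-s/c}
  && \text{since } 1 - x \le e^{-x}, \\
\Pr[\text{some sample accesses } d_i]
  &\ge 1 - e^{-s/c} \;=\; 1 - e^{-m}
  && \text{when } s = mc .
\end{align*}
```

For instance, $m = 3$ already gives $1 - e^{-3} \approx 0.95$ for each hot spot, independently of the number of cores.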

Function SpotStage(B)
      // D(T): data items accessed by transaction T
      foreach data item d accessed in B do
            d.Cluster = NULL;
      R := {}; // Residual Cluster
      k := 1;
      repeat
            Pick a random transaction T from B;
            if d.Cluster = NULL for every d in D(T) then
                  Create a new cluster C_k;
                  Add T to C_k;
                  foreach d in D(T) do
                        d.Cluster = C_k;
                  k++;
      until s times;
      return C_1, ..., C_{k-1};
Algorithm 1 Pseudo-Code for the Spot Stage
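A runnable sketch of the spot stage in Python follows. The data layout is an assumption made for illustration: the batch is a dictionary from a transaction id to the set of (write-contended) data items it accesses, and clusters are identified by their index in a list. None of these names come from the Strife prototype.

```python
import random


def spot_stage(batch, num_samples, rng=random):
    """Sample transactions and turn mutually disjoint ones into seed clusters.

    batch: dict txn_id -> set of data items.
    Returns (clusters, item_cluster): a list of seed clusters (lists of txn ids)
    and a dict mapping each allocated data item to its cluster index.
    """
    item_cluster = {}
    clusters = []
    txn_ids = list(batch)
    for _ in range(num_samples):
        tid = rng.choice(txn_ids)
        items = batch[tid]
        if any(d in item_cluster for d in items):
            continue  # reject: touches an item that already belongs to a cluster
        k = len(clusters)
        clusters.append([tid])
        for d in items:
            item_cluster[d] = k
    return clusters, item_cluster


batch = {
    "T1": {"w1", "s1"},
    "T2": {"w1", "s2"},
    "T3": {"w2", "s3"},
    "T4": {"w2", "s4"},
    "T5": {"w1", "w2"},  # orders from both warehouses
}
clusters, item_cluster = spot_stage(batch, num_samples=8, rng=random.Random(0))
```

Every transaction in this toy batch touches `w1` or `w2`, so at most two seeds can ever be accepted, however many samples are drawn.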

In our example, it is we pick one of or -only transaction in one of the rounds with a high probability. So, any or transaction picked in the future is simply rejected as the corresponding warehouse tuple is already allotted a cluster. Similarly, a transaction might be picked in one of the rounds resulting in three base clusters. At this stage, further sampling of transactions does not increase the number of base clusters as all other transactions will be rejected. In an alternate scenario, a -only and a -only transaction might be picked before any transaction due to the randomness of the event resulting in base clusters.

3.2. Allocate Stage

In this stage, we build on the initial seed clusters created previously by allocating more transactions, and the data items accessed by them, to these clusters.

We allocate transactions in two rounds. Let the seed clusters be $C_1, \ldots, C_k$. In the first round, we scan through the transactions in the batch $B$ and try to allot each previously unallocated transaction $T$, with accessed data items $D(T)$, to one of these clusters or to the residuals based on the following criteria (refer to Algorithm 2 for details):

If none of the data items in $D(T)$ is allocated to a cluster, then we leave $T$ unallocated.

If all of the allocated data items in $D(T)$ belong to a unique cluster $C$, then we allocate $T$ to $C$ as well and allocate all the other data items in $D(T)$ to $C$.

When data items in $D(T)$ are allocated to more than one cluster, we allot $T$ to the residuals $R$.

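Let the distance between two transactions $T$ and $T'$, denoted $\mathit{dist}(T, T')$, be the length of the shortest path between them in the access graph, counted in data items along the path. For example, two transactions in Figure 3(a) that access a common data item are at distance 1. The distance between a transaction $T$ and a cluster $C$ is the shortest distance between $T$ and any transaction in $C$.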

At the end of the first round, all transactions that are at a distance 1 from the initial seed clusters are allocated to one of the clusters $C_1, \ldots, C_k$ or to the residuals $R$. If $\delta$ is the maximum distance between two transactions in the same connected component, then repeating the above allocation round $\delta$ times will eventually allocate all transactions. However, we observe that in practice, $\delta$ is close to 1 for high contention workloads. Hence, in many cases we only need to run the above allocation mechanism once.

Next, we handle the transactions in the batch $B$ that are left unallocated after the above process has taken place. To allocate them, we run a second round of allocation, but with a slight modification. Instead of skipping a transaction $T$ when every data item it accesses, $D(T)$, is unallocated, we allot it to one of the clusters randomly. Transactions that were at a distance of 2 from the initial seed clusters are now at a distance of 1, as new transactions have been allocated to the clusters in the first round. So, some unallocated transactions will now have allocated data items. These are processed as in the first round: allocate $T$ to $C$ if it is the unique cluster for data items in $D(T)$, and to the residuals $R$ if data items in $D(T)$ are allocated to more than one cluster.

At the end of the allocate stage, we have a set of clusters $C_1, \ldots, C_k$ and residual transactions $R$ such that all transactions in a cluster $C_i$ access data items only in $C_i$, and the transactions in $R$ access data items belonging to more than one of the clusters.

Function AllocateStage(B, C_1, ..., C_k)
      // D(T): data items accessed by transaction T
      foreach unallocated T in B do                       // first round
            if d.Cluster = NULL for every d in D(T) then
                  skip;
            else if d.Cluster = NULL or a unique C for every d in D(T) then
                  Add T to C;
                  foreach d in D(T) and d.Cluster = NULL do
                        d.Cluster = C;
            else
                  Add T to R;
      foreach unallocated T in B do                       // second round
            if d.Cluster = NULL for every d in D(T) then
                  Pick a random C from C_1, ..., C_k;
                  Add T to C;
                  foreach d in D(T) and d.Cluster = NULL do
                        d.Cluster = C;
            else if d.Cluster = NULL or a unique C for every d in D(T) then
                  Add T to C;
                  foreach d in D(T) and d.Cluster = NULL do
                        d.Cluster = C;
            else
                  Add T to R;
      return (C_1, ..., C_k, R);
Algorithm 2 Pseudo-Code for the Allocate Stage
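The two allocation rounds can be sketched in Python as below. The representation is assumed for illustration only: the batch maps a transaction id to its set of data items, clusters are lists of transaction ids, and `item_cluster` maps an allocated data item to the index of its cluster. The sketch also assumes at least one seed cluster exists.

```python
import random


def allocate_stage(batch, clusters, item_cluster, rng=random):
    """Grow the seed clusters in two rounds; return the residual txn ids.

    Mutates clusters (lists of txn ids) and item_cluster (item -> cluster index).
    """
    residuals = []
    allocated = {tid for c in clusters for tid in c}

    def place(tid, allow_random):
        owners = {item_cluster[d] for d in batch[tid] if d in item_cluster}
        if not owners:
            if not allow_random:
                return False  # first round: skip, no item is allocated yet
            owners = {rng.randrange(len(clusters))}  # second round: random cluster
        if len(owners) == 1:
            (k,) = owners
            clusters[k].append(tid)
            for d in batch[tid]:
                item_cluster.setdefault(d, k)  # claim the still-unallocated items
        else:
            residuals.append(tid)  # items span more than one cluster
        return True

    for allow_random in (False, True):  # round 1, then round 2
        for tid in batch:
            if tid not in allocated and place(tid, allow_random):
                allocated.add(tid)
    return residuals


batch = {
    "T1": {"w1", "s1"}, "T2": {"w1", "s2"},
    "T3": {"w2", "s3"}, "T4": {"w2", "s4"},
    "T5": {"w1", "w2"},  # spans both clusters
    "T6": {"s9"},        # touches nothing that is allocated
}
clusters = [["T1"], ["T3"]]
item_cluster = {"w1": 0, "s1": 0, "w2": 1, "s3": 1}
residuals = allocate_stage(batch, clusters, item_cluster, random.Random(0))
```

In this run `T2` and `T4` join the clusters of their warehouses in the first round, `T5` becomes a residual, and `T6` is only placed in the second round, in a randomly chosen cluster.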

In our TPC-C example, if the spot stage had resulted in base clusters one each for and , then most transactions in the batch will be allocated to one of the clusters in the allocate stage. A small number of transactions that are or , however, will not be allocated to either of these clusters and will be added to the residuals. If the spot stage produced and as base clusters, most of and transactions from the batch will be allocated to its clusters. However, none of the transactions can be added to any of the clusters and hence will be added to the residual cluster. We further process the resulting clusters from this stage to reduce the size of the residual cluster. Our example does not execute the second round as all transactions are already allocated during the first round.

3.3. Merge Stage

Depending on the nature of the base clusters created in the spot stage, the number of residual transactions that remain at the end of the allocate stage could be large. During merge, we merge some of these clusters to improve the quality of the clusters and to reduce the size of the residual cluster. When two clusters $C_i$ and $C_j$ are merged to form a new cluster $C$, transactions in the residuals $R$ that access data items only from $C_i$ and $C_j$ can now be allocated to $C$ using the allocation criteria mentioned above.

While merging reduces the number of residual transactions, excessive merging of clusters could result in forming one large cluster, which reduces parallelism in the conflict-free phase. Hence, we merge clusters until the size of the residual cluster is within the bound specified by the parameter $\alpha$, i.e., $|R| \le \alpha |B|$ for a batch $B$ with residuals $R$, using the scheme detailed in Algorithm 3. $\alpha$ serves as a parameter that trades off executing transactions on multiple cores with concurrency control against executing them on fewer cores but with no conflicts and without concurrency control. We chose the value of $\alpha$ empirically in our experiments.

Let $\mathrm{count}(C_i, C_j)$ denote the number of transactions in the residuals $R$ that access data items in both $C_i$ and $C_j$. Note that the transactions that are accounted for in $\mathrm{count}(C_i, C_j)$ can access data items from clusters other than $C_i$ and $C_j$ as well. If the two clusters $C_i$ and $C_j$ are kept separate, then all of the $\mathrm{count}(C_i, C_j)$ transactions will be marked as residuals. So, we merge cluster pairs using the following criterion:

A merge scheme using the above criterion always results in the number of residuals being smaller than the bound set by the parameter. Once all such clusters have been merged, transactions in the residual cluster are re-allocated to the new clusters when all data items accessed by a transaction belong to one unique cluster. The resulting conflict-free clusters are then executed in parallel without any concurrency control, followed by the residuals with concurrency control applied as discussed in Section 2.

Function MergeStage()
      Clusters := ;
      foreach  do
            Create new cluster ;
            := ;
            Remove from Clusters;
            Add to Clusters;
            foreach  d.Cluster = or  do
                  d.Cluster = ;
      foreach  do
            if  then
                  Add to ;
      return (Clusters, );
Algorithm 3 Pseudo-code for the Merge Stage
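The merge stage described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the merge criterion is not recoverable from this copy of the text, so it is passed in as a hypothetical predicate `should_merge` over the pair count; every item accessed by a residual transaction is assumed to already belong to some cluster; and all names (`merge_stage`, `item_cluster`, `alias`) are invented for the example.

```python
from itertools import combinations


def merge_stage(clusters, item_cluster, residuals, should_merge):
    """Merge cluster pairs, then re-allocate residual transactions.

    clusters:     dict cluster_id -> list of transactions (each a set of items)
    item_cluster: dict item -> cluster_id (assumed complete for residual items)
    residuals:    list of transactions (each a set of items)
    should_merge: hypothetical predicate on the pair count; the paper's actual
                  criterion is not reproduced here
    """
    clusters = {cid: list(txns) for cid, txns in clusters.items()}
    item_cluster = dict(item_cluster)

    # Count the residual transactions that touch each pair of clusters.
    pair_count = {}
    for txn in residuals:
        touched = sorted({item_cluster[d] for d in txn})
        for pair in combinations(touched, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    alias = {}  # merged-away cluster id -> id of the cluster that replaced it

    def current(cid):
        while cid in alias:
            cid = alias[cid]
        return cid

    next_id = max(clusters, default=-1) + 1
    for (a, b), count in pair_count.items():
        a, b = current(a), current(b)
        if a == b or not should_merge(count):
            continue
        # Create the new cluster as the union of the two old ones.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        # Relabel every data item that belonged to either old cluster.
        for d in list(item_cluster):
            if item_cluster[d] in (a, b):
                item_cluster[d] = next_id
        alias[a] = alias[b] = next_id
        next_id += 1

    # A residual joins a cluster only if all of its items map to that cluster.
    still_residual = []
    for txn in residuals:
        owners = {item_cluster[d] for d in txn}
        if len(owners) == 1:
            clusters[owners.pop()].append(txn)
        else:
            still_residual.append(txn)
    return clusters, still_residual
```

For example, with three singleton clusters and a placeholder threshold of two shared residuals, the two clusters that share two residual transactions are merged and absorb them, while a transaction that still spans two clusters stays residual.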

In our example, when the base clusters are , the number of transactions that are allotted as residuals is small, and hence no merging of clusters is needed. However, if the clusters are , then the size of the residual cluster is large and clusters and are merged. None of the other clusters are merged together as they do not satisfy the merge criterion. The final clusters are then and with a small number of transactions ( and ) in the residual cluster. In this example, our algorithm has essentially identified conflict-free clusters that can now be executed without any concurrency control, where all transactions in these clusters access hot data items.

4. Evaluation

We have implemented a prototype of Strife and evaluated the following aspects of Strife:

We compare the performance of Strife with variants of the two-phase locking protocol for high contention workloads. The results show that Strife achieves up to better throughput on both the YCSB and TPC-C benchmarks.

We study the impact of the number of “hot” records on performance by varying the number of partitions in the YCSB mixture and number of warehouses in TPC-C. We show that Strife is able to improve its performance as the number of hot items that are mostly independently accessed increases.

We characterize the scalability of Strife along with other protocols by varying the number of threads in the system for highly contended YCSB and TPC-C workloads. Strife outperforms traditional protocols by in terms of throughput.

We evaluate the impact of contention by varying the zipfian constant of the YCSB workload. We observe that while the 2PL protocols perform better on lower-contention workloads, Strife outperforms them by up to in throughput when the contention is higher.

4.1. Implementation

We have implemented a prototype of Strife that schedules transactions based on the algorithm described above. Strife, at its core, is a multi-threaded transaction manager that interacts with the storage layer of the database using the standard get-put interface. A table is implemented as a collection of in-memory pages that contain sequentially organized records. Some tables are indexed using a hashmap that maps primary keys to records. We implement the index using the libcuckoo library (Li et al., 2014), a fast, thread-safe concurrent hashmap. Records contain additional meta-data required for scheduling and concurrency control. We chose to co-locate this meta-data with the record to avoid overheads due to multiple hash lookups and to focus primarily on concurrency control.

As discussed earlier, Strife groups transactions together into batches and processes them in three phases: analysis, conflict-free execution, and residual execution.


We prioritize minimizing the cost of the analysis phase over the optimality of our scheduling algorithm. Threads synchronize after each of the three stages of the algorithm. First, the spot stage is executed using a single thread, followed by the allocate stage in parallel. The batch is partitioned into equal chunks, and each thread allocates the transactions in its chunk to the base clusters created in the spot stage. Each record has a clusterId as part of its meta-data, which is updated using an atomic compare-and-swap operation. A transaction is allotted to a cluster only when all atomic operations for that transaction succeed.
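The parallel allocate stage can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: Python has no hardware compare-and-swap, so a per-record lock stands in for it; a transaction is taken to be allocatable when exactly one cluster already owns some of its records and the rest are unassigned; and a failed allocation does not roll back records it already claimed. The names `Record`, `UNASSIGNED`, `try_allocate`, and `allocate_stage` are hypothetical.

```python
import threading

UNASSIGNED = -1  # hypothetical sentinel: record not yet claimed by a cluster


class Record:
    def __init__(self, cluster_id=UNASSIGNED):
        self.cluster_id = cluster_id
        self._guard = threading.Lock()  # stands in for an atomic CAS instruction

    def compare_and_swap(self, expected, new):
        """Set cluster_id to `new` only if it currently equals `expected`."""
        with self._guard:
            if self.cluster_id != expected:
                return False
            self.cluster_id = new
            return True


def try_allocate(keys, records):
    """Return the cluster the transaction joins, or None if it is a residual."""
    owners = {records[k].cluster_id for k in keys} - {UNASSIGNED}
    if len(owners) != 1:
        return None
    target = owners.pop()
    for k in keys:
        rec = records[k]
        if rec.cluster_id != target and not rec.compare_and_swap(UNASSIGNED, target):
            # The CAS lost a race; that is fine only if the winner chose `target`.
            if rec.cluster_id != target:
                return None
    return target


def allocate_stage(batch, records, num_threads):
    """Split the batch into equal chunks and allocate each chunk on its own thread."""
    results = [None] * len(batch)
    chunk = max(1, -(-len(batch) // num_threads))  # ceiling division

    def work(start):
        for i in range(start, min(start + chunk, len(batch))):
            results[i] = try_allocate(batch[i], records)

    threads = [threading.Thread(target=work, args=(s,))
               for s in range(0, len(batch), chunk)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A transaction that touches records owned by two different clusters comes back as `None`, which corresponds to placing it in the residual cluster.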

This is followed by the merge stage that is carried out by a single thread. The cluster pair counts used in merge stage are gathered during the allocate phase using thread-local data structures and finally merged to obtain the global counts. Each cluster has a root, which initially points to itself. When merging two clusters, we modify the root of one cluster to point to another. To obtain the cluster to which a record or transaction belongs, we trace back to the root. Finally, similar to the allocate phase, the residual transactions are re-allocated to clusters in parallel.
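The root-pointer bookkeeping described above resembles a disjoint-set (union-find) structure. The following Python sketch shows one way it could look; the class and function names are hypothetical, and the paper does not state whether path compression or union-by-size is used, so neither is applied here.

```python
from collections import Counter


class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.root = self  # a fresh cluster is its own root


def find_root(cluster):
    """Follow root pointers until reaching a cluster that points to itself."""
    while cluster.root is not cluster:
        cluster = cluster.root
    return cluster


def merge(a, b):
    """Merge the clusters containing a and b by redirecting one root to the other."""
    ra, rb = find_root(a), find_root(b)
    if ra is not rb:
        rb.root = ra
    return ra


def global_pair_counts(thread_local_counts):
    """Combine per-thread cluster-pair counters into one global counter."""
    total = Counter()
    for local in thread_local_counts:
        total.update(local)
    return total
```

Because a lookup only follows pointers, merging two clusters costs a single pointer update rather than a relabeling pass over every record in the smaller cluster.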


The analysis phase produces a set of conflict-free clusters and the residuals. The conflict-free clusters are stored in a multi-producer-multi-consumer concurrent queue, called the worklist. Each thread dequeues a cluster from the worklist, executes it completely without any concurrency control and obtains the next. Threads wait until all other threads have completed executing the conflict-free clusters.
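A minimal Python sketch of the conflict-free phase follows. It assumes `queue.Queue` as the multi-producer-multi-consumer worklist and `threading.Barrier` as the synchronization point at the end of the phase; the function name `run_conflict_free` and the `execute` callback are hypothetical, and the prototype itself is written in C++.

```python
import queue
import threading


def run_conflict_free(clusters, num_threads, execute):
    """Run every cluster to completion on some thread, with no locking per transaction."""
    worklist = queue.Queue()  # thread-safe queue shared by all workers
    for cluster in clusters:
        worklist.put(cluster)
    barrier = threading.Barrier(num_threads)

    def worker():
        try:
            while True:
                try:
                    cluster = worklist.get_nowait()
                except queue.Empty:
                    break
                # One thread owns the whole cluster, so its transactions run serially.
                for txn in cluster:
                    execute(txn)
        finally:
            barrier.wait()  # wait until all threads have drained the worklist

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Correctness here depends entirely on the clusters being conflict-free: if two clusters shared a data item, the unsynchronized updates in `execute` could race.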

Once the conflict-free clusters are executed, threads then execute the residual transactions. The residuals are stored in a concurrent queue shared among the threads. Threads dequeue a transaction and execute it using two-phase locking concurrency control under the NoWait policy (i.e., a transaction is aborted immediately if it fails to grab a lock). Strife moves to the analysis phase of the next batch once the residual phase is completed. Technically, the threads can start analyzing the next batch while the residual phase of the previous batch is in progress. However, we did not implement this optimization to simplify the analysis and interpretability of results.
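The NoWait policy used for residuals can be sketched as a non-blocking try-lock with retry. The Python below is a simplification under stated assumptions: it uses exclusive locks only (no shared mode), it acquires all locks for a known key list before running the transaction body so that an abort needs no undo, and the names `LockedRecord`, `try_execute`, and `execute_with_retry` are hypothetical.

```python
import threading


class LockedRecord:
    def __init__(self, value=0):
        self.value = value
        self.lock = threading.Lock()  # record-level lock co-located with the data


def try_execute(records, keys, body):
    """Run body under NoWait locking; return False (abort) if any lock is busy."""
    held = []
    try:
        for k in dict.fromkeys(keys):  # drop duplicate keys, keep first-seen order
            lock = records[k].lock
            if not lock.acquire(blocking=False):
                return False  # never wait: abort immediately
            held.append(lock)
        body(records)
        return True
    finally:
        # Shrinking phase: release everything on commit or abort.
        for lock in reversed(held):
            lock.release()


def execute_with_retry(records, keys, body):
    """Retry an aborted transaction until it commits; return the abort count."""
    aborts = 0
    while not try_execute(records, keys, body):
        aborts += 1
    return aborts
```

Since no transaction ever waits while holding a lock, no waits-for cycle can form, which is why this policy needs no deadlock detection.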

4.2. Experimental Setup

We run all our experiments on a multi-socket Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz with 2TB of memory. Each socket has 15 physical and 30 logical cores. All our experiments are limited to cores on a single socket.

We implemented Strife and our baselines in C++. In all our experiments we set the batch size to be K transactions resulting in a latency of at most ms. Note that this is lower than the recommended client response time of ms for the TPC-C benchmark (Council, 2018), and it did not result in any significant difference in the results.

4.3. Workloads

All our experiments use the following workloads:

TPC-C: We use a subset of the standard TPC-C benchmark. We restrict our evaluation to a 50:50 mixture of New-Order and Payment transactions. We pick these two transactions as they are short ones that can stress the overhead of using a concurrency control protocol.

All tables in TPC-C have a key dependency on the Warehouse table, except for the Items table. Hence most transactions will access at least one warehouse tuple during execution. Each warehouse contains districts, each district contains K customers. The catalog lists K items and each warehouse has a stock record for each item. Our evaluation adheres to the TPC-C standards specification regarding remote accesses: % payment transactions are to remote warehouses and % of items are ordered from remote warehouses. Each New-Order transaction orders approximately items resulting in a total of % remote stock accesses. We do not, however, use a secondary index to retrieve a customer record by last name; we restrict queries to the primary key customer_id only.

YCSB: The YCSB workload is designed to stress the concurrency control further and to support various micro-benchmark experiments. YCSB transactions are a sequence of read-write requests on a single table. The table contains M keys with a payload size of  bytes and is queried using its primary key. Transactions are generated as belonging to specific partitions, where the intra-partition key access is determined by a zipfian distribution. The distribution can be controlled using the zipfian constant, denoted theta. The higher the value of theta, the higher the frequency of accessing the hotter keys in the distribution. Each transaction consists of accesses with a % probability of reads vs. writes. In our experiments, we control the number of hot records by varying the number of partitions.
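A partitioned, zipfian-skewed transaction generator of the kind described above can be sketched in Python. This is not the authors' generator: the key-numbering scheme, the function names, and every parameter value used when calling it are assumptions, because the concrete table size, access count, and read probability are not recoverable from this copy of the text. The sketch uses the common convention that the key of rank i within a partition is drawn with probability proportional to 1 / i^theta.

```python
import random
from itertools import accumulate


def zipf_cum_weights(num_keys, theta):
    """Cumulative weights where the key of rank i has weight 1 / i**theta."""
    return list(accumulate(1.0 / (rank ** theta) for rank in range(1, num_keys + 1)))


def make_txn(rng, num_partitions, keys_per_partition, cum_weights, num_accesses, read_prob):
    """Generate one transaction whose accesses all fall inside a single partition."""
    partition = rng.randrange(num_partitions)
    ranks = rng.choices(range(keys_per_partition), cum_weights=cum_weights, k=num_accesses)
    base = partition * keys_per_partition  # hypothetical key layout: contiguous ranges
    return [(base + r, "read" if rng.random() < read_prob else "write") for r in ranks]
```

With a larger theta, most accesses land on the first few ranks of each partition, so the number of partitions effectively sets how many independent groups of hot records the workload contains.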

4.4. Baselines

We compare our Strife prototype with variants of the two-phase locking (2PL) protocol. Several experimental studies (Yu et al., 2014) have shown that 2PL strategies outperform other validation-based or multi-version protocols for highly contended workloads. Below are the implementation specifics of our baselines:

NoWait: NoWait is a variant of the 2PL protocol where a transaction acquires shared or exclusive locks depending on the access type during execution, and releases them only upon commit or abort. If a transaction is unable to acquire a lock, it is aborted immediately (without waiting) to prevent deadlocks. We use as many locks as the number of records in the database, each co-located with the record to avoid overheads due to a centralized lock table.

WaitDie: WaitDie is another variant of 2PL that uses a combination of abort and wait decisions based on timestamps to prevent deadlocks. A transaction requesting a shared or exclusive lock waits in the queue corresponding to the lock only when its timestamp is smaller than those of all current owners of the data item. Since transactions in the waiting queue always have decreasing timestamps, no deadlock is possible. We use a hardware counter-based scalable timestamp generation technique proposed in prior work (Yu et al., 2014).

LockOrdered: This is a deadlock-free variant of 2PL. Before execution, a transaction acquires shared or exclusive locks on all data items it accesses in some pre-determined order to prevent deadlocks, and releases them after committing the transaction.

WaitsForGraph: We use a graph, called the waits-for graph, to track dependencies between transactions waiting to acquire logical locks and their current owners. Each database thread maintains a local partition of the waits-for graph, similar to prior work (Yu et al., 2014). A transaction is added to the waits-for graph only when a lock is currently held in a conflicting mode by other transaction(s). A cycle in the dependency graph implies a deadlock, and the most recently added transaction is aborted.
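Two of the baseline rules above are compact enough to state as code. The Python sketch below shows the WaitDie decision as described (wait only when older than every current owner, otherwise abort) and the pre-determined acquisition order used by LockOrdered. The function names and return values are hypothetical, sorted key order is only one possible pre-determined order, and granting a lock that has no owners is an assumption added for completeness.

```python
def wait_die(requester_ts, owner_timestamps):
    """Decide what a lock requester does under WaitDie.

    A smaller timestamp means an older transaction.
    """
    owners = list(owner_timestamps)
    if not owners:
        return "grant"  # assumption: a free lock is granted immediately
    if all(requester_ts < ts for ts in owners):
        return "wait"   # older than every owner: allowed to queue
    return "die"        # younger than some owner: abort and retry


def lock_order(keys):
    """Return the keys in one global order so that all transactions lock consistently."""
    return sorted(set(keys))
```

Under `lock_order`, two transactions that access the same pair of records both request the smaller key first, so neither can hold one lock while waiting on the other in the opposite order.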

4.5. Varying number of hot records

Figure 5. Performance of TPC-C on cores: (a) Throughput vs. Number of Warehouses (b) Runtime Breakdown
Figure 6. Performance of YCSB on cores: (a) Throughput vs. Number of Warehouses (b) Runtime Breakdown

We first analyze the performance of Strife and compare it with our baseline concurrency control protocols under high contention by varying number of hot records.

Contention in the TPC-C workload can be controlled by varying the number of records in the warehouse table, as all transactions access (i.e., read or write) the warehouse tuple. Figure 5(a) shows throughput in number of transactions committed per second vs. number of warehouses. As the number of warehouses increases from left to right, contention in the workload decreases. A Payment transaction updates two contentious records, district and warehouse, while a New-Order transaction reads warehouse and items tuples and updates district, customer, and other low contention items from the stock table. In our experimental setup, we retry a transaction if it aborts due to concurrency control.

The results of the experiment are shown in Figure 5(a) (for TPC-C) and Figure 6(a) (for YCSB). The results show that Strife significantly outperforms all other protocols by up to in terms of throughput. When contention decreases, any concurrency control protocol is expected to improve in performance. Specifically, the number of warehouses in the workload determines the number of conflict-free clusters produced by the analysis phase. When the number of warehouses (in TPC-C) or partitions (in YCSB) is greater than the number of available cores (15 in our experiments), the conflict-free clusters are executed in parallel without any concurrency control. However, other protocols are unable to exploit this parallelism as well as Strife because the workload still has a significant number of conflicts within each warehouse.

We now explain the results in detail. The LockOrdered protocol is based on ordered acquisition of logical locks on records. A thread spin-waits until a requested lock is granted. When the number of warehouses is , most threads are blocked except for the two that currently have ownership of locks on the warehouses. So, the performance of LockOrdered is poor when the number of warehouses is small. However, as we increase the number of warehouses from to , the chance that a thread is blocked decreases by a factor of , so the LockOrdered protocol recovers its performance and outperforms the other 2PL variants. On the other hand, Strife eliminates the locking overhead, and thus results in much better performance.

The NoWait and WaitDie protocols use aborts to avoid deadlocks. The advantage of NoWait over WaitDie is that the former has very little overhead, as it only needs a single record-level mutex for concurrency control. Hence aborts are cheap and even repeated retries are not very expensive. The WaitDie protocol incurs additional overhead in the form of a waiting queue for each record. Another reason for the poor performance of WaitDie is that when a transaction with timestamp gets aborted because it is younger than the current owner, it is also younger than the waiters, and hence it is highly likely to abort again during retry. We observe that the abort rate of WaitDie is more than % in our experiments.

The WaitsForGraph is more conservative regarding aborting a transaction. It aborts a transaction only when there is a cycle in the global waits-for graph. Even though the graph is maintained in a scalable thread-local fashion, deadlock detection acquires locks on the data structures of other threads and hence serves as a concurrency bottleneck. Note that in TPC-C the actual occurrence of deadlocks is rare and cycle detection is purely an overhead.
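Deadlock detection on a waits-for graph amounts to checking whether a newly added waiter can reach itself. The Python sketch below uses a single dictionary for the whole graph, which is a simplification: the baseline described above partitions the graph across threads and takes locks on other threads' partitions during detection, which is the overhead the text refers to. The names `has_cycle_from` and `add_wait` are hypothetical.

```python
def has_cycle_from(waits_for, start):
    """Depth-first search: is there a path from `start` back to `start`?"""
    stack = list(waits_for.get(start, ()))
    seen = set()
    while stack:
        txn = stack.pop()
        if txn == start:
            return True
        if txn in seen:
            continue
        seen.add(txn)
        stack.extend(waits_for.get(txn, ()))
    return False


def add_wait(waits_for, waiter, owners):
    """Record that `waiter` waits on `owners`; abort the waiter if that closes a cycle.

    Returns True if the waiter may keep waiting, False if it must abort.
    """
    waits_for.setdefault(waiter, set()).update(owners)
    if has_cycle_from(waits_for, waiter):
        del waits_for[waiter]  # the most recently added transaction is the victim
        return False
    return True
```

Every blocked lock request pays for a graph traversal even when no deadlock exists, which matches the observation that cycle detection is pure overhead when deadlocks are rare.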

Figure 5(b) depicts the average time taken by each phase in Strife for a batch of size K transactions. The cost of analysis is almost constant as we vary the number of partitions. However, the residual phase time and hence the number of residual transactions drops steadily. This is because when a new order transaction belonging to warehouse orders an item from a remote warehouse , it accesses the corresponding stock . It is considered an outlier access only if there also exists a local transaction to that accesses the same . Otherwise will be part of ’s cluster and considered to be conflict-free. So, when the number of warehouses is small, there is a high probability that this happens, and hence there are more residual transactions. We also observe that the time for the conflict-free phase decreases steadily as we increase the number of partitions. This further validates that Strife exploits parallelism even in a high-contention workload.

Next, we perform a similar experiment on the YCSB workload by varying the number of partitions. We use a zipfian constant of to produce a highly contended workload. The main difference between the TPC-C and YCSB workloads is that all transactions access one of the highly contended warehouse tuples in TPC-C, thereby reducing the diameter of the access graph of the batch to . In YCSB, by contrast, transactions belonging to a partition need not all access a single contentious data item. The zipfian distribution creates a set of hot items in each partition with varying degrees of “hotness”.

Finally, Figure 6(a) shows the comparison of Strife with the baselines. The observations are largely similar to TPC-C. As the number of partitions increases, the total amount of contention in the batch decreases. While this improves the performance of all protocols, Strife still outperforms others with up to improvement in throughput, despite the fact that unlike TPC-C, almost % of time is spent in analysis for the partitions. The conflict-free phase steadily decreases as the batch can be executed in higher degrees of parallelism. We also note that even though the batch is completely partitionable into clusters by design, analysis phase produces more than clusters resulting in some single partition transactions being labeled as residuals.

4.6. Scalability

In this section, we analyze the scalability of Strife and the baselines on a high contention workload. We set the number of warehouses to be in TPC-C and vary the number of cores.

Figure 7. Scalability of TPC-C workload with warehouses: (a) Throughput vs. Cores (b) Runtime Breakdown
Figure 8. Scalability of YCSB workload with partitions: (a) Throughput vs. Cores (b) Runtime Breakdown

The results are shown in Figure 7(a). When the number of cores is , the performance of Strife and the other protocols is similar. Strife clusters the batch into conflict-free clusters and residuals, which are then executed concurrently on cores. When the number of cores is increased to , the throughput of Strife doubles as all the clusters can be executed simultaneously. Beyond cores, the number of conflict-free clusters produced is still and so there is no significant change in throughput. But since the analysis phase is executed in parallel, the time spent there decreases, and this improves throughput as we scale up the number of cores.
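The plateau described above follows from a simple scheduling bound, sketched here in Python as a back-of-the-envelope model rather than a measurement. The model assumes equally sized clusters, one cluster per core at a time, and ignores the analysis and residual phases; the function names and the cluster counts used when calling it are hypothetical.

```python
import math


def conflict_free_rounds(num_clusters, num_cores):
    """Rounds needed when each core runs one whole cluster at a time."""
    return math.ceil(num_clusters / num_cores)


def relative_speedup(num_clusters, num_cores):
    """Speedup of the conflict-free phase relative to a single core."""
    return conflict_free_rounds(num_clusters, 1) / conflict_free_rounds(num_clusters, num_cores)
```

Under this model, adding cores helps only until there is one core per cluster; beyond that point any further gain must come from other phases, such as the parallel analysis.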

As the number of cores increases with the same degree of contention (i.e., with warehouses), the other concurrency protocols improve only marginally and still perform much worse than Strife. Increasing the number of cores in the LockOrdered protocol, for example, results in additional threads unnecessarily spin-waiting on locks. Earlier work has revealed that for very high degrees of parallelism (1024 cores) this can lead to lock thrashing (Yu et al., 2014) and can be more detrimental.

The WaitsForGraph protocol performs poorly under high contention because the number of transactions added to the waits-for graph grows as the number of cores increases. The cost of cycle detection increases as it involves acquiring locks on the thread-local state of other threads. In NoWait and WaitDie, on the other hand, more cores result in increased abort rates because the probability that two conflicting transactions access the same warehouse concurrently increases.

For YCSB, we set the number of partitions to be . The keys in the transactions are generated using a zipfian distribution with a value of . For , transactions access a set of hot items. Figure 8(a) depicts the scalability of our system and the 2PL variants on YCSB workload. Throughput doubles when increasing the cores from to due to similar reasons as TPC-C. However, beyond cores improvement in Strife performance is mainly attributed to the parallel execution of analysis phase.

Figure 7(b) and Figure 8(b) show the runtime breakdown for the TPC-C and YCSB workloads respectively. In TPC-C, around % of time is spent on executing residuals with NoWait concurrency control. This is due to the remote payments and orders specified in the TPC-C standards specification. The analysis phase in YCSB is more expensive due to the nature of contention. We perform a more detailed analysis in the following section.

4.7. Factor Analysis

Figure 9. Analysis Phase Breakdown (a) TPC-C with warehouses (b) YCSB with partitions

We now analyze the cost of the various stages in the analysis phase. The analysis phase happens in four stages: pre-processing, spot, allocate and merge.

Figure 9(a) shows the breakdown of the cost of the various stages during analysis of each batch of K transactions from the TPC-C workload with a 50:50 mixture of new-order and payment transactions. With cores, most of the time (around %) is spent in allocating transactions to seed clusters and % on the pre-processing stage. As we increase the degree of parallelism, the time taken by the allocate and pre-processing stages drops steadily. Since we sample only a few transactions ( for our experiments), the cost of the spot stage is almost negligible and hence overlaps the cost of merge in the figure.

As our TPC-C transactions can be clustered based on the warehouse that they access, the spot stage is able to identify seed clusters that correspond to warehouses easily. We set the value of (that determines ratio of size of residual cluster) to be in all our experiments. Based on the seed clusters, Strife is able to allocate all transactions to a cluster corresponding to its warehouse cluster when there are no remote accesses and to the residual cluster when there are remote accesses. The main observation here is that Strife does not enter the merge stage as the number of transactions in residual cluster is already within the bounds specified by . Most of the time in analysis phase is spent in scanning through the read-write sets of transactions to allot them to a cluster in the allocate stage. The reduction in allocate and pre-processing stage cost is reflected marginally in the overall performance, as shown in Figure 7(a).

The analysis phase breakdown for the YCSB workload is shown in Figure 9(b). YCSB is different from TPC-C in that it has a set of hot items belonging to each partition. Let and be two such hot items: if and are selected into two different clusters in the spot stage, then all data items co-accessed with get allocated separately and those with separately. Essentially, this creates a partition within the YCSB partition, rendering most transactions residual in the allocate stage. Since the residual cluster size is large, we spend more time merging clusters.

4.8. Impact of Contention

Figure 10. Contention Analysis on YCSB ( partitions)

Finally, we compare the performance of Strife with other protocols by varying the contention parameter of the YCSB workload. (A similar experiment is not possible for the TPC-C workload, as every new-order and payment transaction accesses the warehouse tuple, making it highly contended within the warehouse.) Even though Strife is not designed for low-contention workloads, we present this analysis to empirically understand the behavior of our partitioning algorithm.

The zipfian constant determines the frequency of access to popular items. For a small value of theta, most transactions access different items, leading to fewer conflicts in the workload. In the low contention case, most concurrency control protocols perform well. In particular, we see that the NoWait and WaitDie protocols perform about % and LockOrdered about % better than Strife, as shown in Figure 10. But when we increase the value of theta, the number of conflicts in the workload increases as many transactions access a few hot items. In this high contention case, Strife outperforms other protocols by x. Compared to the overheads of executing such a workload with concurrency control, the analysis quickly reveals the underlying structure of the batch, which is used to schedule conflict-free clusters in parallel without any concurrency control.

5. Discussion

In this section we discuss a few tradeoffs in the design of Strife. As discussed in Section 1, Strife is inspired by empirical studies (Harizopoulos et al., 2008; Yu et al., 2014) that reported concurrency control to be a substantial overhead for transactional workloads with high contention. It is built on the insight to identify conflict-free clusters in workloads that can be executed in parallel without concurrency control.

The main challenge in realizing this insight is to partition transactions into conflict-free clusters (as discussed in Section 2) fast enough that the cost of partitioning does not outweigh the benefits of parallel execution without concurrency control. Most traditional approaches to graph partitioning (as discussed in Section 6) do not meet this criterion. Hence, we designed a new heuristic that exploits the amount of data contention in each workload. Following are some pros and cons of our design choices:

Randomly sampling transactions (Section 3.1) allows us to quickly spot contentious items in the workload to form initial seed clusters. Most other techniques require expensive tracking of access patterns for each data item.

Another key observation that is specific to high contention workloads is that the diameter of the access graph is often small (close to 1). We characterize this in Sec. 4.8. This allows us to optimize the allocate phase by running at most 2 rounds of allocation.

On the contrary, for low contention workloads, the diameter of the access graph tends to be greater than 1. Hence we either need to run multiple rounds of allocate, or assign transactions randomly to the initial seed clusters, as detailed in Sec. 3.2. The former increases the amount of time spent in analysis, while the latter results in sub-optimal clusters. We currently choose the latter, although the resulting throughput is still comparable to other locking-based protocols, as shown in Figure 10.

The merge step uses an additional parameter that determines which clusters to merge. Merging clusters results in two competing effects: (1) it reduces parallelism in the fast conflict-free phase; but (2) it increases the number of transactions that are executed without concurrency control. A high value of reduces the overall benefit of using Strife, while a small value of can force merging of clusters and reduce parallelism. So, picking the right is important to achieve good performance.

The number of conflict-free clusters produced for a batch is an indicator of reasonable trade-off between performance and core utilization. This quantity can be used to provision the amount of resources for the OLTP component in hybrid transactional/analytical processing (HTAP) (Özcan et al., 2017) databases.

6. Related Work

Graph Partitioning

The scheduling problem that we posed, and for which we provided a heuristic, can fundamentally be modeled as a graph partitioning problem called k-way min-cut optimization. Even though the partitioning problem is NP-Complete, several algorithms have been developed that produce good partitions, including spectral methods (Pothen et al., 1990; Hendrickson and Leland, 1995) and geometric partition methods (Heath and Raghavan, 1995; Miller et al., 1993), among others. Multi-level partitioning algorithms (Karypis and Kumar, 1998) are known to produce better partitions with moderate computational complexity: the basic idea is that a large graph is coarsened down to a few hundred vertices, a bisection of this graph is computed, and the result is then projected back to the original graph through multi-step refinement. METIS (LaSalle et al., 2015) is an open-source graph partitioning library developed based on this scheme. Our preliminary investigation revealed that these techniques are much more expensive and do not match the practical low-latency processing requirements of OLTP systems.

Data Clustering

Our scheduling problem computes a clustering of data items at the same time as it clusters transactions, in order to ensure conflict-freedom among the conflict-free clusters. An alternative approach that we investigated is to first partition the data items based on co-occurrence in the batch, followed by clustering of transactions based on this partition. Most data clustering algorithms, such as k-means clustering, are iterative. Multiple scans of transactions and their read-write sets incur a significant overhead compared to the actual execution of transactions. However, our randomized solution is inspired by the solution of Ailon et al. (Ailon et al., 2008) to the correlation clustering problem, where elements are clustered based on a similarity and dissimilarity score.

Lock Contention

Johnson et al. (Johnson et al., 2009) identify the problem of lock contention on hot items and apply a speculative lock inheritance technique to skip the interaction with a centralized lock table by directly passing locks from transaction to transaction; a core assumption in this work, which does not apply to the highly contended workloads we deal with, is that transactions mostly acquire shared locks on the hot items. Jung et al. (Jung et al., 2013) identify the lock manager as a significant bottleneck and propose a new lock manager design with reduced latching. We have shown that Strife outperforms the 2PL protocols even under the optimistic assumption of highly scalable record-level locks. Sadoghi et al. (Sadoghi et al., 2014) identify lock contention as a significant overhead and propose an MVCC-based optimization to reduce it.

Orthrus (Ren et al., 2016) is a database design proposal for high contention workloads that partitions the concurrency control and transaction execution functionalities into different threads. However, unlike our design, Orthrus still uses locks to perform concurrency control. Yu et al. (Yu et al., 2014) evaluate OLTP performance on 1000 cores and report that locking-based protocols perform worse on write-intensive workloads due to lock thrashing, while lightweight 2PL protocols such as NoWait and WaitDie result in a very high abort rate.

Modular and Adaptive Concurrency Control

Callas (Xie et al., 2015) presents a modular concurrency control for ACID transactions that partitions transactions into groups that, when executed independently under different concurrency protocols, still ensure serializability. Callas uses the dependency graph of a workload, similar to our system, to group data items, but differs from our approach in that we analyze every batch independently and hence can adapt to changing access patterns quickly. IC3 (Wang et al., 2016) uses static analysis and dynamic dependency tracking to execute highly contended transactions as pieces on multiple cores in a constrained fashion that ensures serializability. Tang et al. propose adaptive concurrency control (ACC) that dynamically clusters data items and chooses the optimal concurrency control for each cluster using a machine learning model trained offline.

Improvements to Traditional Protocols

Dashti et al. (Dashti et al., 2017) propose a new approach for validating MVCC transactions that uses the dependency graph to avoid unnecessary aborts of transactions. BCC (Yuan et al., 2016) improves traditional OCC by dynamically tracking dependencies that help avoid false aborts during the validation phase. Yan et al. (Yan and Cheung, 2016) improve 2PL by statically analyzing stored procedures to find an efficient lock acquisition order based on contention of data items in the workload.

Partitioned and Deterministic Databases

H-Store (Kallman et al., 2008; Jones et al., 2010) partitions the database such that most transactions access a single partition, thereby reducing the overall concurrency control overhead. Partitioned databases require static determination of partitions and do not adapt to changing access patterns. Moreover, multi-partition transactions are known to cause a significant drop in OLTP performance (Yu et al., 2014) for partitioned databases. Pavlo et al. (Pavlo et al., 2012) automatically repartition a database based on a given workload using local neighborhood search. Calvin (Thomson et al., 2012) is a distributed database that executes transactions by deterministically ordering them. Our choice of micro-batching transactions to schedule them optimally is inspired by Calvin and related literature (Ren et al., 2014) on deterministic databases.

Pavlo et al. (Pavlo et al., 2011) predict and choose the optimizations (such as intelligent scheduling) that a distributed OLTP system can employ during runtime using a combination of offline machine learning and Markov models. This approach, however, is not adaptive to dynamic workloads with changing access patterns.

7. Conclusion

We presented Strife, a transaction processing system for high-contention workloads. Strife is designed based on the insight that portions of a transactional workload can be executed as conflict-free clusters without any concurrency control, even when the workload has high data contention. We achieved this by developing a low-overhead partitioning algorithm that divides a batch of transactions into a set of conflict-free clusters and residuals. The clusters are executed on multiple cores without any concurrency control, followed by the residuals executed with concurrency control. Our experiments have shown that Strife can achieve substantial performance improvement, with throughput increase compared to standard locking-based protocols on TPC-C and YCSB workloads.


  • Ailon et al. (2008) Nir Ailon, Moses Charikar, and Alantha Newman. 2008. Aggregating Inconsistent Information: Ranking and Clustering. J. ACM 55, 5, Article 23 (Nov. 2008), 27 pages.
  • Cooper et al. (2010) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). ACM, New York, NY, USA, 143–154.
  • Council (2018) Transaction Processing Performance Council. 2018. TPC-C Standards Specification. [Online; accessed 19-Dec-2017].
  • Dashti et al. (2017) Mohammad Dashti, Sachin Basil John, Amir Shaikhha, and Christoph Koch. 2017. Transaction Repair for Multi-Version Concurrency Control. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). ACM, New York, NY, USA, 235–250.
  • Diaconu et al. (2013) Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Ake Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. Hekaton: SQL Server’s Memory-optimized OLTP Engine. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1243–1254.
  • Harizopoulos et al. (2008) Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. 2008. OLTP Through the Looking Glass, and What We Found There. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 981–992.
  • Heath and Raghavan (1995) Michael T. Heath and Padma Raghavan. 1995. A Cartesian Parallel Nested Dissection Algorithm. SIAM J. Matrix Anal. Appl. 16, 1 (Jan. 1995), 235–253.
  • Hendrickson and Leland (1995) Bruce Hendrickson and Robert Leland. 1995. A Multilevel Algorithm for Partitioning Graphs. In Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (Supercomputing ’95). ACM, New York, NY, USA, Article 28.
  • Johnson et al. (2009) Ryan Johnson, Ippokratis Pandis, and Anastasia Ailamaki. 2009. Improving OLTP Scalability Using Speculative Lock Inheritance. Proc. VLDB Endow. 2, 1 (Aug. 2009), 479–489.
  • Jones et al. (2010) Evan P.C. Jones, Daniel J. Abadi, and Samuel Madden. 2010. Low Overhead Concurrency Control for Partitioned Main Memory Databases. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10). ACM, New York, NY, USA, 603–614.
  • Jung et al. (2013) Hyungsoo Jung, Hyuck Han, Alan D. Fekete, Gernot Heiser, and Heon Y. Yeom. 2013. A Scalable Lock Manager for Multicores. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 73–84.
  • Kallman et al. (2008) Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. 2008. H-store: A High-performance, Distributed Main Memory Transaction Processing System. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1496–1499.
  • Karypis and Kumar (1998) George Karypis and Vipin Kumar. 1998. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (Dec. 1998), 359–392.
  • Kung and Robinson (1981) H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (June 1981), 213–226.
  • LaSalle et al. (2015) Dominique LaSalle, Md Mostofa Ali Patwary, Nadathur Satish, Narayanan Sundaram, Pradeep Dubey, and George Karypis. 2015. Improving Graph Partitioning for Modern Graphs and Architectures. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3 ’15). ACM, New York, NY, USA, Article 14, 4 pages.
  • Li et al. (2014) Xiaozhou Li, David G. Andersen, Michael Kaminsky, and Michael J. Freedman. 2014. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In Proceedings of the Ninth European Conference on Computer Systems (EuroSys ’14). ACM, New York, NY, USA, Article 27, 14 pages.
  • Miller et al. (1993) Gary L. Miller, Shang-Hua Teng, William Thurston, and Stephen A. Vavasis. 1993. Automatic Mesh Partitioning. In Graph Theory and Sparse Matrix Computation, Alan George, John R. Gilbert, and Joseph W. H. Liu (Eds.). Springer New York, New York, NY, 57–84.
  • Özcan et al. (2017) Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. 2017. Hybrid Transactional/Analytical Processing: A Survey. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD ’17). ACM, New York, NY, USA, 1771–1775.
  • Pavlo et al. (2012) Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems. In SIGMOD ’12: Proceedings of the 2012 international conference on Management of Data. 61–72.
  • Pavlo et al. (2011) Andrew Pavlo, Evan P. C. Jones, and Stanley Zdonik. 2011. On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems. Proc. VLDB Endow. 5, 2 (Oct. 2011), 85–96.
  • Pothen et al. (1990) Alex Pothen, Horst D. Simon, and Kan-Pu Liou. 1990. Partitioning Sparse Matrices with Eigenvectors of Graphs. SIAM J. Matrix Anal. Appl. 11, 3 (May 1990), 430–452.
  • Ren et al. (2016) Kun Ren, Jose M. Faleiro, and Daniel J. Abadi. 2016. Design Principles for Scaling Multi-core OLTP Under High Contention. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1583–1598.
  • Ren et al. (2014) Kun Ren, Alexander Thomson, and Daniel J. Abadi. 2014. An Evaluation of the Advantages and Disadvantages of Deterministic Database Systems. Proc. VLDB Endow. 7, 10 (June 2014), 821–832.
  • Sadoghi et al. (2014) Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, and Kenneth A. Ross. 2014. Reducing Database Locking Contention Through Multi-version Concurrency. Proc. VLDB Endow. 7, 13 (Aug. 2014), 1331–1342.
  • Thomson et al. (2012) Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). ACM, New York, NY, USA, 1–12.
  • Tu et al. (2013) Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy Transactions in Multicore In-memory Databases. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13). ACM, New York, NY, USA, 18–32.
  • Wang et al. (2016) Zhaoguo Wang, Shuai Mu, Yang Cui, Han Yi, Haibo Chen, and Jinyang Li. 2016. Scaling Multicore Databases via Constrained Parallel Execution. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, New York, NY, USA, 1643–1658.
  • Xie et al. (2015) Chao Xie, Chunzhi Su, Cody Littley, Lorenzo Alvisi, Manos Kapritsos, and Yang Wang. 2015. High-performance ACID via Modular Concurrency Control. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ’15). ACM, New York, NY, USA, 279–294.
  • Yan and Cheung (2016) Cong Yan and Alvin Cheung. 2016. Leveraging Lock Contention to Improve OLTP Application Performance. Proc. VLDB Endow. 9, 5 (Jan. 2016), 444–455.
  • Yu et al. (2014) Xiangyao Yu, George Bezerra, Andrew Pavlo, Srinivas Devadas, and Michael Stonebraker. 2014. Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores. Proc. VLDB Endow. 8, 3 (Nov. 2014), 209–220.
  • Yuan et al. (2016) Yuan Yuan, Kaibo Wang, Rubao Lee, Xiaoning Ding, Jing Xing, Spyros Blanas, and Xiaodong Zhang. 2016. BCC: Reducing False Aborts in Optimistic Concurrency Control with Low Cost for In-memory Databases. Proc. VLDB Endow. 9, 6 (Jan. 2016), 504–515.