SCAR: Strong Consistency using Asynchronous Replication with Minimal Coordination

03/01/2019 ∙ by Yi Lu, et al. ∙ MIT

Data replication is crucial in modern distributed systems as a means to provide high availability. Many techniques have been proposed to utilize replicas to improve a system's performance, often requiring expensive coordination or sacrificing consistency. In this paper, we present SCAR, a new distributed and replicated in-memory database that allows serializable transactions to read from backup replicas with minimal coordination. SCAR works by assigning logical timestamps to database records so that a transaction can safely read from a backup replica without coordinating with the primary replica, because the records cannot be changed up to a certain logical time. In addition, we propose two optimization techniques, timestamp synchronization and parallel locking and validation, to further reduce coordination. We show that SCAR outperforms systems with conventional concurrency control algorithms and replication strategies by up to a factor of 2 on three popular benchmarks. We also demonstrate that SCAR achieves higher throughput by running under reduced isolation levels and detects concurrency anomalies in real time.




1. Introduction

High availability (HA) is crucial in modern data-oriented applications. In clusters with hundreds to thousands of servers, failure is a norm rather than an exception. When a failure happens, a highly available system is able to mask the failure using standby servers. In most applications, high availability is implemented using data replication.

A desirable property of any approach to high availability is strong consistency between replicas, i.e., that there is no way for clients to tell when a failover happened, because the state reflected by the replicas is identical. Enforcing strong consistency in a replicated and distributed database is a challenging task. The most common approach is based on primary-backup replication, where all reads and writes are handled at the primary replica, which synchronously ships writes to the backup replicas. As a result, the primary releases locks and commits only after writes have propagated to all replicas, blocking other transactions from accessing modified records and limiting performance. In typical configurations, reads are always executed at the primary replica to ensure that a transaction observes the latest data, but this also incurs long latency if the primary replica is far away from the client and further loads the primary.

If a system could process reads at replicas and asynchronously ship writes to replicas, it could achieve considerably lower latency and higher throughput, because clients can read from the nearest replica, and transactions can release locks before replicas respond to writes. Indeed, these features are central to many recent systems that offer eventual consistency (e.g., Dynamo (DeCandia et al., 2007)). Observe that both local reads and asynchronous writes introduce the same problem: the possibility of stale reads at replicas. Thus, they both introduce the same consistency challenge: the database cannot determine whether the records a transaction reads are consistent or not. One naive way of consistently reading from backup replicas is to always send a validation request to the primary replica after reading a record (or set of records) at a backup replica, to verify that the records at both replicas are the same. However, this also incurs significant network traffic and latency, and requires the primary to be involved in most or all reads, and thus is not likely to be better than the traditional primary-backup scheme. In this paper, we propose an alternative method. In our approach, the primary provides a promise to a backup replica that a record will not change at the primary for a certain period of logical time. In this way, a transaction can read the backup record without validation at the primary, which significantly reduces the amount of coordination required in transaction processing.

The system we built that embodies this idea is called SCAR. It is a single-version distributed and replicated database that supports strong isolation levels, i.e., serializability and snapshot isolation. To achieve this goal, SCAR implements a logical timestamp-based optimistic concurrency control (OCC) algorithm with critical performance optimizations. To the best of our knowledge, we are the first to use logical timestamps to allow a transaction to read from any backup replica and asynchronously replicate its writes in a replicated database system.

By default, transactions in SCAR are serializable. In practice, database transactions often execute under reduced isolation levels (e.g., snapshot isolation) for better performance. With minor changes to the commit protocol, SCAR supports snapshot isolation as well. It achieves this without maintaining multiple data versions in the database, and thus requires a smaller storage footprint compared to multi-version concurrency control (MVCC) algorithms. Furthermore, SCAR provides a low-overhead concurrency anomaly detector to report whether each individual transaction running under snapshot isolation actually committed under serializability. This allows a user to detect when isolation violations are absent and determine whether the current isolation level of transactions meets the needs of the workload.

Our evaluation on an eight-server cluster shows that SCAR outperforms systems that use conventional concurrency control algorithms and replication strategies by up to a factor of 2 on Retwis (a commonly used benchmark that emulates a Twitter-like social network), YCSB, and TPC-C. The performance advantage of SCAR is even more significant in the Wide-Area Network (WAN) setting. Further, SCAR achieves a 2x performance improvement over serializability when running under snapshot isolation. Finally, we show how the concurrency anomaly detector gives a breakdown of concurrency anomalies in real time.

In summary, this paper makes the following major contributions:

  • We present SCAR, a distributed and replicated in-memory database. It allows serializable transactions to read from backup replicas and replicates writes asynchronously, with minimal coordination across nodes.

  • We show how SCAR supports snapshot isolation with minor changes to the commit protocol.

  • We introduce a concurrency anomaly detection scheme which detects anomaly-free executions in real time with low overhead.

  • We propose two optimizations to reduce the overhead of coordination during transaction validation in SCAR.

2. Background

This section discusses the background of distributed concurrency control and data replication.

2.1. Distributed Concurrency Control

Concurrency control enforces two critical properties of a database: atomicity and isolation. Atomicity requires a transaction to expose either all or none of its changes to the database. The isolation level specifies when a transaction is allowed to see another transaction’s writes.

Both serializability and snapshot isolation are considered strong isolation levels in a distributed database. Serializability (SR) requires transactions to produce the same results as if they were executed sequentially; it is the gold standard isolation level due to its robustness and ease of understanding. Snapshot isolation (SI) requires a transaction to read a consistent snapshot and perform all its writes at the same time, but the reads can happen earlier than the writes. Snapshot isolation is weaker than serializability and thus allows more transactions to commit. This leads to better performance at the cost of potential concurrency anomalies.

Three classes of concurrency control protocols are commonly used in distributed systems: two-phase locking (2PL) (Bernstein et al., 1979; Eswaran et al., 1976), optimistic concurrency control (OCC) (Kung and Robinson, 1981), and multi-version concurrency control (MVCC) (Reed, 1978). 2PL protocols are pessimistic and use locks to avoid conflicts. An MVCC protocol maintains multiple versions of each tuple in the database. This offers higher concurrency since a transaction can potentially pick from among several consistent versions to read, at the cost of higher storage overhead and complexity. In OCC, a transaction does not acquire locks during execution; after execution, the database validates a transaction to determine whether it commits or aborts. At low contention, OCC has better performance than 2PL due to its non-blocking execution. Although a large number of distributed OCC protocols have been proposed in recent years (Mahmoud et al., 2014; Mu et al., 2014; Shute et al., 2013; Yu et al., 2018; Zhang et al., 2015), there is no consensus on the best implementation of distributed OCC. In this paper, we adapt Silo's OCC protocol (Tu et al., 2013) to the distributed environment and use that to illustrate typical distributed OCC algorithms. More details of the baseline will be discussed in Section 3. Traditionally, all three classes of concurrency control support serializability, but only MVCC supports snapshot isolation. As we show in Section 5, although SCAR is a single-version OCC protocol, it supports both serializability and snapshot isolation without the overhead of storing multiple tuple versions.

The lifecycle of a distributed transaction contains an execution phase and an atomic commit protocol. During the execution phase, a transaction accesses the database and executes transaction logic. The atomic commit protocol guarantees that all participating nodes agree on the outcome of the transaction (i.e., commit or abort) and this outcome survives failures. In Section 3, we will discuss how a transaction is executed and committed in SCAR.

2.2. Replication

Modern database systems support high availability (HA) such that when a subset of servers fail, the rest of the servers can carry out the database functionality, so that end users do not notice the server failures. High availability requires the database to replicate data across multiple servers and propagate each update to all the replicas.

Both Paxos-based and primary-backup replication schemes are commonly used in replicated systems. Paxos-based (Lamport, 2001) replication synchronizes each read and write operation to the database. While Paxos handles node failures more gracefully, it requires heavy coordination that incurs excessive network traffic and performance degradation (Zhang et al., 2015).

Primary-backup replication is also a commonly used replication scheme. A transaction can write to only the primary copy of each record. The primary replica then propagates the write to the backup replicas. In a typical primary-backup replication scheme, reads always go to the primary copies in order to observe the latest data. However, if the primary copy is on a remote node while a backup copy is local, it is desirable if a transaction can read the local backup copy instead, since it can avoid the network latency. Some existing systems have tried to allow transactions to read from backup replicas by having multi-version data storage and global timestamp allocation (Corbett et al., 2012; Peluso et al., 2012). This solution, however, incurs higher storage overhead and complexity for version management; furthermore, generation of consistent timestamps across nodes requires complex protocols. Other systems achieve this by only supporting weak consistency levels like causal consistency (Mehdi et al., 2017) or eventual consistency (Terry et al., 1995). Although this reduces the complexity of managing replication, it introduces concurrency anomalies into transactions, which makes it difficult to implement correct application code for database users. In Section 3, we discuss how SCAR allows transactions to read from backup copies while enforcing serializability with minimal coordination.

3. SCAR

In this section, we first give an overview of SCAR to show its benefits over typical distributed OCC protocols. We next explain in detail how SCAR reads from replicas, validates transactions, and applies writes asynchronously without relaxing the consistency model. Finally, we discuss the non order-preserving serializability that SCAR achieves.

3.1. Overview

The database in SCAR is partitioned across a cluster of nodes. Each partition is mastered on one machine with one or more backup partitions on other nodes. SCAR implements primary-backup replication. Each record has a primary replica and one or more backup replicas. Each replica resides on a different machine. If the primary replica of some records fails, one of the backup replicas is promoted as the new primary and the database as a whole does not stop its service.

An important feature of SCAR is that a transaction can read from any backup replica without necessarily coordinating with the primary replica. Figure 1 illustrates how a pair of transactions work in SCAR and in a distributed OCC system similar to Google’s F1 (Shute et al., 2013). Here, we have a two-node database with two replicas and two records, each mastered on a different node.

In a typical distributed OCC protocol (right-hand side of Figure 1), the transaction may read from a local replica (the read at Node 1), but before the transaction can commit, the database must validate the transaction’s local read at the primary replica (Node 2), because the database running at Node 1 does not know whether the record has been changed at its primary replica. (The database can avoid this validation if each write locks and updates all the replicas, but this degrades the performance of write operations.) Such validation introduces round-trip messages, which degrade performance.

SCAR avoids many of these validations because the database provides per-record read/write timestamps to the backup replica as a promise that each record will not be updated until after its read timestamp, as shown on the left side of Figure 1. Therefore, a transaction does not need to coordinate with the primary replica to validate a record, as long as the transaction commits no later than the record’s read timestamp. In practice, the timestamps can be either physical (e.g., no update can happen in the next 10 seconds) or logical (e.g., no update can happen until the logical timestamp reaches 10). SCAR uses the logical timestamp design to avoid some difficulties of physical clocks (e.g., distributed clock synchronization). Specifically, each record in the database is associated with two logical timestamps, which are represented by two 64-bit integers: [wts, rts]. The wts is the logical write timestamp, indicating when the record was written, and the rts is the logical read validity timestamp, indicating that the record can be read at any logical time ts such that wts ≤ ts ≤ rts.
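The per-record timestamp pair can be pictured with a small sketch. This is only an illustration under our own naming (the `Record` class and `readable_at` method are hypothetical and not taken from SCAR's implementation); it encodes the validity condition that a version is readable at any logical time between its wts and rts.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical in-memory record carrying SCAR-style logical timestamps."""
    value: object
    wts: int  # logical write timestamp: when this version was written
    rts: int  # logical read validity timestamp: version is valid up to here

    def readable_at(self, ts: int) -> bool:
        # A version may be read at logical time ts only inside [wts, rts].
        return self.wts <= ts <= self.rts


# The backup copy from the running example carries timestamps [10, 20].
backup_copy = Record(value="v", wts=10, rts=20)
print(backup_copy.readable_at(16))  # True
print(backup_copy.readable_at(21))  # False
```

A transaction whose commit timestamp falls inside the interval can trust the backup copy without contacting the primary.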

Suppose the two records have Node 1 and Node 2 as their respective primary nodes, and Node 2 and Node 1 as their respective backup nodes. On the left side of Figure 1, the transaction running on Node 1 reads the record mastered locally, which has logical timestamps [5, 15]; it also reads the other record from the local backup replica, which has logical timestamps [10, 20]. The rts of 20 on the backup copy is a promise that the primary will not update that record until at least logical time 21. In this example, the transaction can commit locally at Node 1 at timestamp 16 (larger than the rts of the locally mastered record), at which point its operations on both records are valid; thus there is no need to coordinate with the primary replica of the second record (i.e., Node 2).

Figure 1. Illustrating the SCAR algorithm

Logical timestamp-based protocols were first proposed in TicToc (Yu et al., 2016). Unlike TicToc, however, SCAR focuses on using logical timestamps to allow transactions to read from backup replicas without relaxing the consistency model and apply writes to the database asynchronously to reduce round-trip communication.

The rest of this section explains in detail how SCAR runs a transaction and manages replication. We will discuss how consistency and fault tolerance are achieved in Section 4.

3.2. Reading from Replicas

A transaction in SCAR runs in multiple phases: an execution phase, a validation phase, and a commit phase. We say the node initiating a transaction is the coordinator node, and other nodes are participant nodes.

In the execution phase, a transaction reads records from the database and maintains local copies of them in its read set (RS). Each entry in the read set contains the value as well as the record’s associated wts and rts.

For a read request, the coordinator node first checks if the request’s primary key is already in the read set. This happens when a transaction reads a data record multiple times. In this case, the coordinator node simply uses the value of the first read. Otherwise, the coordinator node reads the record from the database. A record can be read from any replica in SCAR. To avoid network communication, the coordinator node always reads from its local database if a local copy is available.

The coordinator first locates the primary node and backup nodes of the record. As illustrated in Figure 2, a transaction can read the record from its local database in two scenarios: (1) the coordinator node happens to be the primary node. For example, the transaction on the left side of Figure 1 reads one record locally on Node 1, which is that record’s primary node. (2) the coordinator node is one of the backup nodes, which already has a copy of the record. For example, the same transaction reads the other record locally on Node 1 as well, even though Node 2 is that record’s primary node (Node 1 is a backup node that has a copy of it). If no local copy is available, a read request is sent to a participant node, i.e., the remote primary node.

In SCAR, logical timestamps (i.e., 64-bit wts and rts) are associated with records in both primary and backup replicas. For a read request, the system returns both the value and the logical timestamps of a record; and both are stored in the transaction’s local read set. Later in Section 3.3, we explain how logical timestamps are used to avoid coordination during a transaction’s validation phase.

All computation is performed in the execution phase. Since SCAR’s algorithm is optimistic, writes are not applied to the database but are stored in a per-transaction write set (WS), in which, as with the read set, each entry has a value and the record’s associated wts and rts.

For a write operation, if the primary key is not in the write set, a new entry is created with the value and then inserted into the write set. Otherwise, the system simply updates the write set with the new value. Note that for updates to records that are already in the read set, the transaction also copies the wts and the rts to the entry in the write set, which are used for validation later on.
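The read-set and write-set bookkeeping described above can be sketched as follows. This is a simplified, single-threaded illustration, not the paper's Figure 2 pseudocode: the `Transaction` and `Entry` classes, the `local_db` dictionary, and the `remote_read` callable are all hypothetical stand-ins for the local replica store and the request to a remote primary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    value: object
    wts: Optional[int] = None
    rts: Optional[int] = None


class Transaction:
    def __init__(self, local_db, remote_read):
        # local_db: key -> (value, wts, rts) for primary or backup copies
        # stored on the coordinator node (assumed layout).
        self.local_db = local_db
        # remote_read: hypothetical callable that fetches (value, wts, rts)
        # from the record's remote primary node.
        self.remote_read = remote_read
        self.read_set = {}
        self.write_set = {}

    def read(self, key):
        if key in self.read_set:          # repeated read: reuse first read
            return self.read_set[key].value
        if key in self.local_db:          # local primary or backup copy
            value, wts, rts = self.local_db[key]
        else:                             # no local copy: ask the primary
            value, wts, rts = self.remote_read(key)
        self.read_set[key] = Entry(value, wts, rts)
        return value

    def write(self, key, value):
        entry = self.write_set.get(key)
        if entry is None:
            entry = Entry(value)
            self.write_set[key] = entry
        else:
            entry.value = value
        seen = self.read_set.get(key)
        if seen is not None:              # keep timestamps for validation
            entry.wts, entry.rts = seen.wts, seen.rts
```

Writes stay buffered in `write_set` until validation succeeds, which matches the optimistic execution described in this section.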

Figure 2. Pseudocode to read from replicas

3.3. Transaction Validation

After a transaction finishes its execution phase, it must be successfully validated before it commits. We borrow the idea of transaction validation from TicToc (Yu et al., 2016), a single-node multi-core concurrency control protocol. Unlike TicToc, SCAR is the first to apply logical timestamps to a distributed and replicated database system, allowing a transaction to read from any backup replica and asynchronously replicate its writes. We now describe the three steps to validate a transaction: (1) lock all records in the transaction’s write set; (2) assign a commit timestamp to the transaction; (3) validate all records in the transaction’s read set. Optimizations involving replication will be introduced later in Section 6.

A transaction first tries to acquire locks on each record in the write set to prevent concurrent updates from other transactions. A locking request is sent to the primary replica of each record. To avoid deadlocks, we adopt a NO_WAIT policy (the NO_WAIT deadlock prevention strategy was shown to be the most scalable protocol (Harding et al., 2017)), i.e., if the lock is already held on the record, the transaction does not wait but simply aborts. For each acquired lock, if the record’s latest wts does not equal the stored wts, the transaction aborts as well. This is because the record has been changed at the primary replica since the transaction last read it. The transaction also updates each record’s rts in its local write set in this step.
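A minimal sketch of the locking step at a primary replica is shown below. It is single-threaded and illustrative only: `PrimaryRecord` and `try_lock` are hypothetical names, and skipping the wts check for records that were never read (blind writes, `expected_wts=None`) is our assumption rather than something the text specifies.

```python
class PrimaryRecord:
    """Hypothetical primary copy with a lock owner field."""

    def __init__(self, value, wts, rts):
        self.value, self.wts, self.rts = value, wts, rts
        self.locked_by = None


def try_lock(record, txn_id, expected_wts=None):
    """Return the record's current rts on success, or None to signal abort."""
    if record.locked_by is not None:
        return None  # NO_WAIT: never block on a held lock, abort instead
    if expected_wts is not None and record.wts != expected_wts:
        return None  # record changed at the primary since it was read
    record.locked_by = txn_id
    return record.rts  # caller refreshes the rts in its local write set


rec = PrimaryRecord("v", wts=5, rts=15)
print(try_lock(rec, "t1", expected_wts=5))  # 15
print(try_lock(rec, "t2", expected_wts=5))  # None
```

Returning the latest rts lets the coordinator compute the commit timestamp from up-to-date values in the next step.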

A commit timestamp cts is next assigned to the transaction based on all records the transaction accesses (both read set and write set). The cts is the smallest timestamp that meets the following two conditions: (1) not less than the wts of each entry in the read set; (2) larger than the rts of each entry in the write set. To see this, consider the example on the left side of Figure 1: the cts is 16, which is the maximum of the written record’s rts plus one and the wts of the two records read, i.e., max(15 + 1, 5, 10).
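The two conditions translate directly into a maximum over the accessed entries. The helper below is a sketch with a hypothetical name (`commit_timestamp`) and assumes each entry is given as a `(wts, rts)` pair.

```python
def commit_timestamp(read_set, write_set):
    """Smallest cts with cts >= wts for every read and cts > rts for every write.

    read_set and write_set are iterables of (wts, rts) pairs (assumed layout).
    """
    candidates = [wts for wts, _ in read_set]
    candidates += [rts + 1 for _, rts in write_set]
    return max(candidates, default=0)


# Running example: both records are read, the [5, 15] record is written.
reads = [(5, 15), (10, 20)]
writes = [(5, 15)]
print(commit_timestamp(reads, writes))  # 16
```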

Finally, a transaction validates its read set. The transaction’s cts is first compared with the rts of each record in its read set. A read validation request is sent only when a record’s rts is less than the cts. In this case, the transaction tries to extend the record’s rts at the primary node. The extension fails in two scenarios: (1) the record’s wts changed, meaning the record was modified by other concurrent transactions; (2) the record is locked by other transactions and the rts is less than the cts. In either case, the rts cannot be extended and the transaction must abort. Otherwise, the transaction extends the record’s rts to the transaction’s cts.
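The read validation rule can be sketched as a single function. The names (`PrimaryCopy`, `validate_read`) are hypothetical, the code is single-threaded, and the primary is accessed directly where a real system would send a validation request over the network.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PrimaryCopy:
    wts: int
    rts: int
    locked_by: Optional[str] = None


def validate_read(primary, seen_wts, seen_rts, cts, txn_id):
    """Return True if a read taken at [seen_wts, seen_rts] is valid at cts."""
    if seen_rts >= cts:
        return True   # already valid at cts: no validation request needed
    # From here on, the logic runs at the record's primary node.
    if primary.wts != seen_wts:
        return False  # scenario (1): record was overwritten
    if primary.rts < cts:
        if primary.locked_by not in (None, txn_id):
            return False  # scenario (2): locked by another transaction
        primary.rts = cts  # extend the read validity interval
    return True
```

The first branch is what allows reads from backup replicas to be validated without any message to the primary.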

3.4. Asynchronous Write and Replication

Figure 3. Pseudocode to commit a transaction

If a transaction fails the validation, it simply aborts, unlocks the acquired locks, and discards its local write set. Otherwise, it will commit changes in its write set to the database. SCAR applies the writes and replication asynchronously to reduce round-trip communication. We will discuss how consistency and fault tolerance are achieved in Section 4.

As illustrated in Figure 3, the value of each record in a transaction’s write set and the cts are sent to the primary and backup replicas from the coordinator node by calling the commit function. There are two scenarios in which writes are sent: (1) writes are sent to the primary replica: Since the primary replica is holding the lock, upon receiving the write request, the primary replica simply updates the value and the logical timestamps for the record in the database to [cts, cts]; (2) writes are sent to backup replicas: Since asynchronous replication is employed in SCAR, upon receiving the write request, the lock on the record is not necessarily held on the primary replica, meaning replication requests to the same record from multiple transactions could arrive out of order. SCAR determines whether a replication request at a backup replica should be applied using the Thomas write rule (Thomas, 1979): the database applies a write if the wts of the record in the write request is larger than the current wts of the record in the database (lines 14-15 of Figure 3). Because the wts of a record monotonically increases in the primary replica, this guarantees that backup replicas apply the writes in the same order in which transactions commit on primary replicas.
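The Thomas write rule at a backup replica can be illustrated with a short sketch. The function name `apply_replication` and the dictionary layout `key -> (value, wts, rts)` are assumptions made for this example; they are not the paper's Figure 3 pseudocode.

```python
def apply_replication(backup, key, value, cts):
    """Apply a replicated write only if it is newer than the stored version.

    backup maps key -> (value, wts, rts). Returns True if the write applied.
    """
    current = backup.get(key)
    if current is None or cts > current[1]:
        backup[key] = (value, cts, cts)  # new version starts at [cts, cts]
        return True
    return False  # stale request that arrived out of order: drop it


backup = {}
apply_replication(backup, "k", "a", 5)
apply_replication(backup, "k", "c", 9)
apply_replication(backup, "k", "b", 7)   # arrives late, ignored
print(backup["k"])  # ('c', 9, 9)
```

Because stale writes are dropped rather than reordered, the backup converges to the same final version as the primary regardless of arrival order.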

3.5. Non Order-Preserving Serializability

Figure 4. Illustrating non order-preserving serializability vs. order-preserving serializability

SCAR achieves less coordination in transaction execution but sacrifices external consistency (Corbett et al., 2012), i.e., the system commits transactions under non order-preserving serializability, which we will describe below.

Most OCC algorithms are based on physical time (e.g., Silo (Tu et al., 2013)). In these systems, the database validates a transaction’s read set by comparing the data versions from the read set to the latest ones on the primary replicas. If any record’s primary partition is not on the coordinator, a round-trip communication must be performed when a transaction’s read set is validated. We now use the same example from Figure 1 and show the events in physical-time order in Figure 4. Consider the example on the right side of Figure 4, in which the two transactions run concurrently. By the time the transaction on Node 1 tries to commit, the other transaction has committed and written a new value of the record that the first one read from its backup replica (i.e., version 11; the commit time is not less than the version of any record in a transaction’s read and write sets). Since that record in its read set has changed from version 10 to 11, the transaction on Node 1 cannot commit at time 11 and must abort. It must retry, and it commits at time 12. In systems with order-preserving serializability (e.g., Spanner (Corbett et al., 2012)), the commit time of conflicting transactions determines the transaction commit order.

In contrast, transaction commits in SCAR do not necessarily follow the order of commit timestamps, i.e., logical time does not always agree with physical time. Consider the example on the left side of Figure 4. Suppose the other transaction has already committed and written a new value of the record read from the backup replica (i.e., version [21, 21]). The transaction on Node 1 can still commit at time 16, even though that record is in its read set and its value has changed. This is because, in logical time, the commit happens before the record was last written: 16 falls within the validity interval [10, 20] of the version that was read. Non order-preserving serializability enables SCAR to significantly reduce network communication when a transaction is validated. For example, for each record in a transaction’s read set, the system first compares the record’s rts and the transaction’s assigned commit timestamp cts. A read validation request is sent only when the record’s rts is less than the cts and the primary node of the record is not the coordinator node of the transaction. Otherwise, the read is already consistent, since it is valid at logical time cts. If all records in the read set can be validated locally, a round-trip communication is eliminated entirely.

4. Consistency and Fault Tolerance

In this section, we first describe how SCAR ensures consistency with epochs and next show how fault tolerance is achieved.

4.1. Ensuring Consistency with Epochs

One well known issue with asynchronous replication is the potential for data inconsistency when a failure occurs. In a typical implementation, a transaction commits after successfully updating the primary replica, while the replication requests are still underway. If the primary replica fails after the transaction commits, the replicas are not guaranteed to receive the data. Therefore, the effect of the last several updates might be lost, leading to inconsistent behavior.

SCAR addresses this issue by delaying the commit of a transaction until the completion of its replication as well as the replication of transactions that it depends on. Specifically, SCAR borrows ideas from the epoch-based logging scheme used in Silo (Tu et al., 2013; Zheng et al., 2014), in which transactions commit in batches. A transaction commits only after all transactions within the same batch commit, although a transaction can release its locks early, before the replication process completes.

Transactions in SCAR are separated by epochs (of 10 ms each, by default) using global barriers. Each transaction is assigned the current epoch number as it starts. The epoch number advances when the next global barrier is reached. Transactions in an epoch commit if all the transactions in this epoch have replicated their write sets to the backup replicas. This guarantees that all committed transactions are fully replicated and survive failures. It also guarantees that for a committed transaction, all transactions that it depends on have the same or smaller epoch number and thus have committed as well (Zheng et al., 2014).
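The epoch rule can be pictured with a small bookkeeping sketch. Everything here is hypothetical (the `EpochTracker` class and its method names are ours), and the sketch ignores the distributed barrier itself; it only shows the condition under which an epoch's transactions may be reported as committed.

```python
class EpochTracker:
    """Tracks which transactions of each epoch still await replication."""

    def __init__(self):
        self.current_epoch = 0
        self.pending = {0: set()}

    def begin(self, txn_id):
        # A transaction is tagged with the epoch in which it starts.
        self.pending[self.current_epoch].add(txn_id)
        return self.current_epoch

    def replication_done(self, epoch, txn_id):
        # Called once all backup replicas have received the write set.
        self.pending[epoch].discard(txn_id)

    def advance(self):
        # Stand-in for reaching the next global barrier.
        self.current_epoch += 1
        self.pending[self.current_epoch] = set()

    def can_commit(self, epoch):
        # An epoch commits only after it has closed and every transaction
        # in it has finished replicating.
        return epoch < self.current_epoch and not self.pending[epoch]
```

Under this rule a transaction may release its locks early, but its result is only acknowledged once `can_commit` holds for its epoch.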

4.2. Fault Tolerance

The system can tolerate up to f simultaneous failures when each partition has f + 1 replicas. For ease of presentation, we discuss the case where only one node fails. For each partition on the failed node, if the primary partition is lost, a secondary partition on another node becomes the primary partition.

As discussed above, SCAR commits transactions by epochs. Once a fault occurs, SCAR rolls back the database to the last successful epoch, i.e., all tuples that were updated in the current epoch are reverted to their states in the last epoch. To achieve this, the database maintains two versions of each tuple. One always has the latest value. The other one has the most recent value up to the last successful epoch.
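The two-version scheme used for rollback can be sketched as follows. The class and method names are hypothetical, and the sketch tracks values only, leaving out timestamps and concurrency.

```python
class VersionedTuple:
    """Keeps the latest value plus the value as of the last successful epoch."""

    def __init__(self, value):
        self.latest = value
        self.stable = value

    def write(self, value):
        self.latest = value          # visible to running transactions

    def epoch_committed(self):
        self.stable = self.latest    # the epoch succeeded: promote the value

    def rollback(self):
        self.latest = self.stable    # a fault occurred: revert to last epoch


t = VersionedTuple("v0")
t.write("v1")
t.epoch_committed()
t.write("v2")
t.rollback()
print(t.latest)  # v1
```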

The system can continue processing transactions when a node fails. Once the failed node restarts, it copies the lost partitions from other nodes. In the meantime, the restarted node applies the writes of committed transactions using the Thomas write rule (Thomas, 1979), in the same way as in Section 3.4, to catch up to the other nodes.

5. Isolation Levels

SCAR supports serializable transactions by default. It also supports snapshot isolation (SI). In this section, we first describe the protocol to support SI transactions. We then discuss how SCAR can be used to monitor concurrency anomalies from SI transactions in real time.

5.1. Transactions under Snapshot Isolation

A transaction running under SI does not detect read/write conflicts. By not detecting these conflicts, the system is able to achieve a lower abort rate and higher throughput.

Many systems adopt a multi-version concurrency control (MVCC) algorithm to support snapshot isolation. In an MVCC-based system, a timestamp is assigned to a transaction when it starts to execute. By reading all records that have overlapping time intervals with the timestamp, the transaction is guaranteed to observe the state of the database (i.e., a consistent snapshot) at the time when the transaction began.

Instead of maintaining multiple versions for each record, we made minor changes to the algorithm discussed in Section 3 to support snapshot isolation. SI transactions do not have to follow a serial order. Instead, they only require that all reads come from a consistent snapshot of the database and that there are no conflicts with any concurrent updates made since that snapshot. SCAR achieves this by assigning an additional timestamp to validate the read set of a transaction. We introduce a new timestamp crts (short for commit read timestamp), which is the maximum value of the wts of all records in a transaction’s read set. The system next uses the crts to validate the transaction’s read set as discussed in Section 3.3 and ensures all reads are from the state of the database at logical time crts. The system then applies the writes at logical time cts as before to make sure there are no conflicts with updates. The crts is often smaller than the cts, because the crts does not have to be larger than the rts of each entry in the write set, which makes a transaction more likely to be validated. Note that SCAR can support a mix of transactions concurrently running under different isolation levels (SI/serializability) as well.
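The relationship between the two timestamps can be made concrete with a short sketch. The helper name `snapshot_timestamps` and the `(wts, rts)` pair layout are assumptions for illustration; the cts is computed by the same rule as in Section 3.3.

```python
def snapshot_timestamps(read_set, write_set):
    """Return (crts, cts) for a transaction running under snapshot isolation.

    read_set and write_set are iterables of (wts, rts) pairs (assumed layout).
    """
    # crts: reads are validated at the largest wts in the read set.
    crts = max((wts for wts, _ in read_set), default=0)
    # cts: writes are applied after the rts of every written record.
    cts = max([crts] + [rts + 1 for _, rts in write_set])
    return crts, cts


# Running example: reads [5, 15] and [10, 20], writes the [5, 15] record.
print(snapshot_timestamps([(5, 15), (10, 20)], [(5, 15)]))  # (10, 16)
```

In this example the reads only need to be valid at logical time 10, whereas a serializable execution would need them to be valid at 16.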

5.2. Concurrency Anomaly Detection

In practice, database transactions are often executed under reduced isolation levels, as there is an inherent trade-off between performance and isolation levels. For example, both Oracle and Microsoft SQL Server default to read committed. Unfortunately, such weaker isolation levels can result in concurrency anomalies that yield an interleaving of operations that could not arise in a serial execution of transactions. SCAR provides a real-time breakdown of how many transactions may have experienced anomalies by running under reduced isolation levels (SI in particular). It reports which transactions may have experienced anomalies and which transactions definitely did not. SCAR does this in a lightweight fashion that introduces minimal overhead, allowing developers to monitor their production systems and tune the isolation levels on the fly.

Recall that a snapshot isolation transaction is assigned one more timestamp, crts, to validate its read set. According to the SCAR protocol, a transaction whose two timestamps are equal (i.e., the crts equals the cts) is serializable, because all accesses occur at the same logical time. In the transaction validation phase, SCAR applies this lightweight equality check to all SI transactions to detect those that may have observed concurrency anomalies (i.e., the ones with two different timestamps).
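A minimal sketch of this equality check and the resulting breakdown is shown below. The function names and the input format (a list of (crts, cts) pairs for committed SI transactions) are assumptions made for illustration.

```python
from collections import Counter


def classify(crts, cts):
    # Equal timestamps: all accesses happen at one logical time, so the
    # transaction is serializable. Otherwise it may have seen an anomaly.
    return "serializable" if crts == cts else "possible_anomaly"


def breakdown(committed):
    """committed: non-empty iterable of (crts, cts) pairs."""
    counts = Counter(classify(crts, cts) for crts, cts in committed)
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("serializable", "possible_anomaly")}


stats = breakdown([(4, 4), (2, 4), (7, 7), (5, 9)])
```

Note that a transaction in the second class is only a candidate: unequal timestamps do not prove that an anomaly occurred.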

With the support of real-time anomaly detection, application developers can monitor how many anomalies arise with snapshot isolation in a timely manner and get insights to make better design decisions. For example, developers can switch to higher isolation levels when too many anomalies are detected or re-design application logic to eliminate anomalies in transactions running under reduced isolation levels.

6. Optimizations

In this section, we discuss two optimizations that further reduce network round trips.

6.1. Timestamp Synchronization

SCAR can validate a previous read locally without communicating with the primary replica. We now introduce an optimization that boosts the system’s performance by further reducing the frequency of remote validation.

Figure 5. Pseudocode to synchronize timestamps

The logical timestamps associated with each record on the primary replica are updated in two scenarios: (1) a new value is written to a record when a transaction commits (e.g., new timestamps with equal wts and rts are assigned), and (2) the rts is extended when a record is validated, as discussed in Section 3.3. In the first scenario, a record’s value and its associated timestamps are also updated on backup replicas. However, the rts of a record would not be updated on backup replicas in the second scenario by default. As more and more transactions validate a record on the primary replica, the gap between the rts on the primary replica and backup replicas becomes larger.

Since the records on backup replicas have stale and smaller rts, the record in a transaction’s read set is less likely to be validated locally. To address this problem, we apply an optimization we call timestamp synchronization. The idea is to actively propagate the rts from primary replicas to backup replicas. As shown in Figure 5, the function ts_sync is invoked when a transaction commits or aborts (note that some records may be successfully validated even if a transaction aborts). Since the value of a record is not included in this synchronization, the system only extends the rts when backup replicas have the same wts, as shown in the function update_rts (line 10 – 11 of Figure 5). Note that timestamp synchronization happens asynchronously and thus does not increase the latency of a transaction.
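The rule described above can be sketched as follows. This is not the pseudocode of Figure 5: the `Record` class, the dictionary-based replicas, and the direct function calls (standing in for asynchronous messages) are assumptions made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: object
    wts: int
    rts: int


def update_rts(backup, key, wts, rts):
    """Apply one synchronization message on a backup replica."""
    rec = backup.get(key)
    # The message carries no value, so the rts may only be extended when the
    # backup holds the same version (same wts) as the primary.
    if rec is not None and rec.wts == wts and rec.rts < rts:
        rec.rts = rts
        return True
    return False


def ts_sync(primary, backups, validated_keys):
    """Propagate the primary's timestamps for records a transaction validated."""
    for key in validated_keys:
        rec = primary[key]
        for backup in backups:
            update_rts(backup, key, rec.wts, rec.rts)


primary = {"a": Record("v2", wts=5, rts=9), "b": Record("v1", wts=2, rts=8)}
backup = {"a": Record("v1", wts=3, rts=3), "b": Record("v1", wts=2, rts=2)}
ts_sync(primary, [backup], ["a", "b"])
```

In this example the backup's copy of `a` is an older version, so its rts is left alone, while the rts of `b` is extended to match the primary.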

6.2. Parallel Locking and Validation

As we discussed in Section 3.3, if a transaction commits under serializability, the commit timestamp cts must be larger than the rts of each entry in the write set. Since a tuple’s timestamp may be changed by other conflicting transactions, the latest rts of each tuple is not available until it has been locked. In addition, a transaction must validate its read set with the cts. In other words, a transaction cannot validate its read set until it has locked its write set.

However, an SI transaction only requires that all reads are from a consistent snapshot. As we discussed in Section 5, an SI transaction has two timestamps (i.e., crts and cts). A transaction can calculate the crts with the wts of each entry in its read set. With the crts, the transaction is able to lock its write set and validate its read set in parallel. Once all tuples in the write set have been locked, the transaction next calculates the cts and commits if no conflicts exist.

We now use an example to illustrate how SCAR eliminates one network round trip with the parallel locking and validation (PLV) optimization for SI transactions.

Example 1.

Suppose an SI transaction reads two tuples and updates the value of one of them. The following operations are invoked: (1) read the first tuple, (2) read the second tuple, (3) write the tuple being updated, and (4) commit.

Figure 6. Illustrating parallel locking and validation

We show a step-by-step diagram in Figure 6, in which a tuple is shown as a vertical band. The start and end of a band indicate the tuple’s wts and rts. Each step shows a different phase from a transaction’s lifecycle. There are two steps to validate a serializable transaction and one step to validate an SI transaction. Both serializable and SI transactions have a step to commit in the end. A round trip communication may happen at the end of each step. For ease of presentation, suppose there is no conflicting transaction that updates the timestamps of either tuple.

The transaction reads two tuples whose [wts, rts] ranges are [2,3] and [2,2]. If the transaction commits under serializability, it goes through three steps: two validation steps followed by a commit step (shown on the top of Figure 6).

Step 1: The transaction locks the tuple in its write set. According to the algorithm in Section 3.3, the cts is 4, since it must be no smaller than the wts of every entry in the read set and larger than the rts of every entry in the write set.

Step 2: The tuple that is only read is validated at timestamp 4. Since its wts has not changed and it is not locked by another conflicting transaction, its rts can be extended to 4 and the validation succeeds.

Step 3: The transaction updates the locked tuple with a new value at the cts and commits.

As we discussed above, an SI transaction is able to lock the write set and validate the read set in a single step, as shown on the bottom of Figure 6.

Step 1: The transaction first generates the timestamp for read validation. In this example, the maximum wts in the read set is 2. The transaction uses this timestamp as the crts to validate its read set, and the validation succeeds. Meanwhile, the transaction also locks the tuple in its write set.

Step 2: The transaction next generates the cts, which is 4. Finally, the transaction updates the locked tuple and commits.
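The SI path of this example can be sketched in a single process as follows. The keys `x` and `y` are placeholder names for the two tuples, and `x`, with range [2,3], is assumed to be the one updated, which is the choice consistent with a cts of 4. The in-memory dictionary, the thread pool standing in for two concurrent requests, and all function names are assumptions rather than SCAR's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from threading import Lock


@dataclass
class Record:
    wts: int
    rts: int
    lock: Lock = field(default_factory=Lock)


def lock_write_set(db, write_keys):
    """Try to lock every written record; give up at once on a conflict."""
    acquired = []
    for key in write_keys:
        if not db[key].lock.acquire(blocking=False):
            for held in acquired:
                db[held].lock.release()
            return False
        acquired.append(key)
    return True


def validate_read_set(db, reads, crts):
    """Check each read at crts. `reads` maps key -> wts observed when reading."""
    for key, seen_wts in reads.items():
        rec = db[key]
        if rec.wts != seen_wts:
            return False  # the record was overwritten after it was read
        # Simplification: the lock check done when extending an rts is omitted.
        rec.rts = max(rec.rts, crts)
    return True


def commit_si(db, reads, write_keys):
    crts = max(reads.values())
    # Locking and read validation do not depend on each other, so they are
    # issued together (one round trip in a distributed setting).
    with ThreadPoolExecutor(max_workers=2) as pool:
        lock_job = pool.submit(lock_write_set, db, write_keys)
        check_job = pool.submit(validate_read_set, db, reads, crts)
        locked, valid = lock_job.result(), check_job.result()
    if not (locked and valid):
        if locked:
            for key in write_keys:
                db[key].lock.release()
        return None  # abort
    # With the write set locked, the rts values are stable and cts can be set.
    cts = max([crts] + [db[key].rts + 1 for key in write_keys])
    for key in write_keys:
        db[key].wts = db[key].rts = cts  # new version: equal wts and rts
        db[key].lock.release()
    return crts, cts


db = {"x": Record(wts=2, rts=3), "y": Record(wts=2, rts=2)}
result = commit_si(db, reads={"x": 2, "y": 2}, write_keys=["x"])
```

Under serializability the two submitted jobs could not overlap, because the validation timestamp (the cts) is only known after the write set has been locked.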

7. Discussion

As discussed in previous sections, SCAR improves the performance of distributed OCC protocols through asynchronous replication and coordination reduction by using logical timestamps. SCAR also allows transactions to read data from backup replicas to reduce network messages.

Besides SCAR, some MVCC-based systems like Spanner (Corbett et al., 2012) and TAPIR (Zhang et al., 2015) also allow transactions to read data from secondary replicas. As we will discuss in this section, however, MVCC-based systems may not be as effective as SCAR in reducing coordination. Furthermore, we demonstrate that the technique in SCAR can also be applied to an MVCC protocol as an improvement.

In an MVCC protocol, a transaction is assigned a unique commit timestamp at the beginning of its execution. The commit timestamp can be derived from either a synchronized clock (e.g., atomic clock (Corbett et al., 2012) or software-based solution (Eidson, 2002; Mills, 1991)), or a centralized timestamp allocator (Wu et al., 2017). Each database record contains a version number which is the commit timestamp of the creating transaction. A transaction may read a record from a secondary replica, if its commit timestamp intersects the valid timestamp range for an old version of the record.

When a transaction is accessing the latest version of a record, however, an MVCC protocol may not be able to determine the consistency of the data based on the transaction’s local information. Spanner (Corbett et al., 2012) solves this problem using atomic clocks, by letting the transaction wait until the uncertainty period has passed. This requires special hardware with atomic clocks, which is very expensive. A more generic solution (e.g., TAPIR (Zhang et al., 2015)) is to send a message to other replicas to check the consistency of the data. This introduces at least one round of network messages and therefore defeats the purpose of reading from replicas.

The new idea that SCAR introduces is the rts of each record. With the rts, a transaction knows that the data is guaranteed to be valid until that logical time and therefore is able to read the data without contacting other replicas. Note that the concept of rts can also be applied to any MVCC protocol, such as Spanner (Corbett et al., 2012) or TAPIR (Zhang et al., 2015), to reduce the number of coordination messages needed to check whether the latest version of the data was read.

8. Evaluation

In this section, we study the performance of SCAR focusing on the following key questions:

  • How does SCAR perform compared to other distributed concurrency control algorithms?

  • How does network latency affect SCAR?

  • What’s the performance of SCAR with different numbers of replicas?

  • How much performance gain can SCAR achieve under snapshot isolation vs serializability?

  • What fraction of transactions actually commit under serializability when they run under snapshot isolation?

  • How effective is each optimization in SCAR?

8.1. Experimental Setup

We run most of the experiments on a cluster of eight machines, each with 32 cores (four 8-core 2.13 GHz Intel(R) Xeon(R) E7-4830 CPUs) and 256 GB of DRAM. Each machine runs 64-bit Ubuntu 12.04 with Linux kernel 3.2.0-23 and the servers are connected with a 10 GigE network.

In our experiments, we run 24 worker threads and 2 threads for network communication on each machine. Each worker thread has an integrated workload generator. Aborted transactions are re-executed with an exponential back-off strategy. All results are the average of ten runs.
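As an illustration of the retry policy, the sketch below re-executes an aborted transaction with exponentially growing, randomized waits. The base delay, growth factor, cap, and attempt limit are placeholder values, not the settings used in the experiments, and `txn` is a hypothetical callable that returns True on commit.

```python
import random
import time


def backoff_delays(base=0.001, factor=2.0, cap=0.1, attempts=6):
    """Yield the upper bound (in seconds) of the wait before each retry."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def run_with_retry(txn, attempts=6, sleep=time.sleep, rng=random.random):
    """Run txn() until it returns True (commit) or the retries run out."""
    if txn():
        return True
    for bound in backoff_delays(attempts=attempts):
        sleep(rng() * bound)  # randomized wait below the current bound
        if txn():
            return True
    return False
```

Randomizing the wait below the current bound is one common way to keep workers that aborted on the same record from retrying in lockstep.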

In Section 8.3, we run experiments in a wide-area network setting using Amazon EC2 instances.

8.1.1. Workloads

To evaluate the performance of SCAR, we ran a number of experiments using the following three popular benchmarks:

Retwis: The Retwis benchmark is designed to model activity on Twitter (Leau, 2013). There is a single table and each row is a key-value pair. We support two transactions, namely, (1) PostTweet and (2) GetTimeline. A user can post a tweet to the social network via the PostTweet transaction. The GetTimeline transaction returns the latest tweets from a user and his/her followers.

YCSB: The Yahoo! Cloud Serving Benchmark (YCSB) is a simple transactional workload. It’s designed to be a benchmark for facilitating performance comparisons of database and key-value systems (Cooper et al., 2010). There is a single table and each row has ten attributes. The primary key of the table is a 64-bit integer and each attribute has 10 random bytes. Unless otherwise stated, a transaction consists of 4 operations in this benchmark. (YCSB+T (Dey, 2013; Dey et al., 2014), another extension to YCSB, wraps operations within transactions in a similar manner to model activities in a closed economy.)

TPC-C: The TPC-C benchmark is a popular benchmark to evaluate OLTP databases (tpc, 2010). It models a warehouse-centric order processing application. We support the NewOrder transaction in this benchmark, which involves customers placing orders in their districts within a local warehouse. The local warehouse fulfills most orders but a small fraction of the orders involve products from remote warehouses.

In Retwis and YCSB, we set the number of records to 400K per partition and the number of partitions to 192, which equals the total number of worker threads in the cluster. To model different access patterns, we vary the skew factor and the ratio of cross-partition transactions in our experiments (i.e., Section 8.2). In TPC-C, we set the number of warehouses to 192 as well. In all workloads, we set the number of replicas to 3, i.e., each partition has a primary partition and two secondary partitions, which are always hashed to three different nodes.
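One placement that satisfies the constraint of three distinct nodes per partition is sketched below. The round-robin rule is an assumption for illustration only. The text states just that replicas are hashed to three different nodes and does not give the hash function.

```python
NUM_NODES = 8
NUM_PARTITIONS = 192
NUM_REPLICAS = 3


def replica_nodes(partition, num_nodes=NUM_NODES, num_replicas=NUM_REPLICAS):
    # Assumed placement: primary on partition % num_nodes, backups on the
    # following nodes, so the replicas of one partition never share a node.
    return [(partition + i) % num_nodes for i in range(num_replicas)]


placement = {p: replica_nodes(p) for p in range(NUM_PARTITIONS)}
primaries_per_node = [
    sum(1 for nodes in placement.values() if nodes[0] == n)
    for n in range(NUM_NODES)
]
```

With 192 partitions on eight machines, this rule gives each machine 24 primary partitions, one per worker thread.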

8.1.2. Distributed Concurrency Control Algorithms

By default, SCAR is allowed to read from local secondary replicas. The timestamp synchronization optimization is also enabled, unless otherwise stated. We compared SCAR with the following distributed concurrency control algorithms. To avoid an apples-to-oranges comparison, we implemented all algorithms in C++ in our framework. All systems are compiled using GCC 5.4.1 with the -O2 option enabled.

S2PL: This is a distributed concurrency control algorithm based on strict two-phase locking. Read locks and write locks are acquired as a worker runs a transaction. To avoid deadlock, the same NO_WAIT policy is adopted as discussed in Section 3. The worker updates all records and replicates the writes to replicas before releasing all acquired locks.

OCC: This is a distributed optimistic concurrency control algorithm based on Silo’s OCC protocol (Tu et al., 2013). OCC assigns a transaction ID (TID) to each transaction when it commits based on the TID associated with each record in its read/write set. OCC also has the same three steps to validate a transaction as in SCAR. However, unlike SCAR, it must validate all records of a transaction. This is achieved by comparing the data versions from the transaction’s read set to the latest ones in the database on the primary replicas at commit time.

RC: This is a reduced consistency protocol adapted from OCC. It supports read committed transactions by not validating a transaction’s read set.

By default, OCC and RC are allowed to read from local secondary replicas. However, RC does not need to use TIDs to validate its read set on the primary replicas, since every record it sees is guaranteed to come from committed transactions. OCC and RC also use asynchronous writes and replication, and apply the same techniques as discussed in Section 3.4 for consistency reasons.

A transaction runs under serializable isolation level in SCAR, S2PL and OCC, and under read committed isolation level in RC. SCAR, OCC and RC commit transactions with a group commit that happens every 10 ms.

8.2. Performance Comparison

We now study the performance of SCAR and other algorithms using Retwis, YCSB and TPC-C workloads.

Figure 7. Throughput and cumulative distribution of commit request latency of each approach on Retwis: (a) 50% cross-partition; (b) skew factor = 1.2; (c) 50% cross-partition, skew factor = 1.2.
Figure 8. Throughput and cumulative distribution of commit request latency of each approach on YCSB: (a) 50% cross-partition; (b) skew factor = 1.2; (c) 50% cross-partition, skew factor = 1.2.

8.2.1. Retwis Results

We first analyze the performance of all distributed concurrency control algorithms on the Retwis benchmark. We run a workload mix of 80/20, i.e., the workload consists of 80% of the GetTimeline transaction and 20% of the PostTweet transaction.

The GetTimeline transaction has a number of read operations that load tweets from a given user and his/her followers, where the number is chosen at random from 1 to 10. There are 3 update operations and 2 write operations in the PostTweet transaction. In a social network, some popular tweets are read by many more people. To model this, each operation in the PostTweet transaction follows a uniform distribution, but each operation in the GetTimeline transaction follows a Zipfian distribution (Gray et al., 1994) with a skew factor.
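To make the role of the skew factor concrete, the sketch below draws keys from a bounded Zipfian distribution in which the probability of rank k is proportional to 1/k^skew. This is a simple weight-based sampler written for illustration, not the generator of Gray et al. (1994), and the function names are assumptions.

```python
import random
from itertools import accumulate


def zipf_cum_weights(n, skew):
    # P(rank k) is proportional to 1 / k**skew; skew = 0 gives a uniform
    # distribution over the n keys.
    return list(accumulate(1.0 / (k ** skew) for k in range(1, n + 1)))


def sample_keys(n, skew, count, seed=0):
    rng = random.Random(seed)
    cum = zipf_cum_weights(n, skew)
    return rng.choices(range(n), cum_weights=cum, k=count)


uniform_keys = sample_keys(1000, 0.0, 10)
skewed_keys = sample_keys(1000, 1.2, 10)
```

With 1000 keys and a skew factor of 1.2, the most popular key alone receives more than a fifth of all accesses under this distribution.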

Figure 7(a) shows the results with a varying skew factor and 50% cross-partition transactions. S2PL has consistently lower throughput than other algorithms, since it always reads from primary replicas and applies writes synchronously. RC has the highest throughput, since it avoids much coordination by running transactions under read committed. When the skew factor is 0, i.e., each access follows a uniform distribution, SCAR has 41% higher throughput than OCC. This is because OCC has to validate every record it reads. In contrast, SCAR can locally validate some records as discussed in Section 3. As we increase the skew factor, the rts of each record is more likely to be valid at a transaction’s commit timestamp in SCAR. For this reason, with a skew factor of 2.4, SCAR has 1.9x higher throughput than OCC and achieves similar throughput to RC.

We also ran Retwis with a fixed skew factor of 1.2 and a varying ratio of cross-partition transactions. Since the throughput of each approach decreases significantly as this ratio grows, we report each approach’s throughput relative to RC in Figure 7(b) for better visualization. Overall, SCAR achieves up to 70% higher throughput than OCC.

We now study the commit request latency of each approach. The commit request latency measures how long it takes a transaction to release its write locks since the beginning of execution (Athanassoulis et al., 2009). The overall execution latency is not reported due to the fact that all approaches except S2PL use a group commit. We report the cumulative distribution function (CDF) of the commit request latency in each approach in Figure 7(c). We fixed the skew factor to 1.2 and the ratio of cross-partition transactions to 50%. In Figure 7(c), we can observe that SCAR has consistently lower commit request latency than OCC and S2PL. The commit request latency of RC is the lowest, since it runs transactions under read committed and avoids more coordination than others.
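For readers who want to reproduce this kind of plot, an empirical CDF can be computed from per-transaction measurements as sketched below. The function name and the sample values are hypothetical and are not data from the experiments.

```python
def empirical_cdf(latencies):
    """Return (latency, fraction of transactions at or below it) pairs."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Hypothetical commit request latencies (any consistent time unit).
points = empirical_cdf([3, 1, 2, 4])
```

Each pair gives one point of the curve: the second component is the fraction of transactions whose commit request latency does not exceed the first.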

8.2.2. YCSB Results

We next study the performance of SCAR versus the other algorithms on the YCSB benchmark. As in Retwis, each read operation follows a Zipfian distribution (Gray et al., 1994) and each update operation follows a uniform distribution. We run a workload mix of 80/20, i.e., each operation in a transaction has an 80% probability of being a read operation and a 20% probability of being an update operation.

Figure 8(a) shows the throughput of each approach with a varying skew factor and 50% cross-partition transactions. We observe a similar result as for Retwis. For example, the throughput of SCAR goes up as we increase the skew factor. The throughput of other approaches is not sensitive to the skew factor. When the skew factor is 2.4, SCAR has 75% higher throughput than OCC and achieves similar throughput to RC. Figure 8(b) shows the relative throughput of each approach to RC with a varying ratio of cross-partition transactions. We fix the skew factor to 1.2 as well. Overall, SCAR has up to 65% higher throughput than OCC.

Figure 8(c) shows the CDF of the commit request latency on YCSB. We fixed the skew factor to 1.2 and the ratio of cross-partition transactions to 50%. As in Retwis, we can observe that the gap between SCAR and RC is smaller than the gap between RC and either OCC or S2PL.

8.2.3. TPC-C Results

Figure 9. Throughput of each approach on TPC-C
Figure 10. Results with a varying number of replicas

Finally, we study the performance of each approach on the TPC-C benchmark. We ran a workload with the NewOrder transaction only. 10% of transactions are fulfilled by a remote warehouse, meaning that they are cross-partition transactions. We consider two variations of TPC-C in this experiment: (1) Replicated Item: the Item table is replicated on each node and considered to be a read-only index; (2) Partitioned Item: the Item table is partitioned across all nodes in the same way as all other tables. In the second variation, more transactions become cross-partition transactions. This is because every NewOrder transaction has 5 to 15 reads from the Item table. If no local replica is available, these reads are remote.

For the purpose of better visualization, we report each approach’s relative throughput to RC as we did in Section 8.2. We show the result of Replicated Item in the left side of Figure 9. Since the NewOrder transaction is a write-intensive transaction, i.e., almost every read comes with an update (the reads from the Warehouse and the Customer table are always local), this benchmark gives no benefits to SCAR and makes it slightly slower than OCC due to more messages being sent (e.g., messages to synchronize timestamps as discussed in Section 6.1). The result of Partitioned Item is shown in the right side of Figure 9. With more reads from the Item table being remote, SCAR achieves 32% higher throughput than OCC because of coordination reduction. RC has the highest throughput, since it never validates remote reads.

In summary, SCAR is able to achieve higher throughput than OCC and S2PL by reducing coordination. In the case that there exists high access skew in read operations, its performance is even close to running transactions under reduced isolation levels (e.g., read committed).

8.3. Wide-Area Network Experiment

Table 1. Round trip times between EC2 nodes
Figure 11. Results in wide-area network setting

In this section, we study how SCAR performs compared to other approaches in the wide-area network (WAN) setting. For users concerned with very high availability, wide-area replication is important because it allows the database to survive the failure of a whole data center (e.g., due to a power outage or a natural disaster). We use three m5.2xlarge nodes running on Amazon EC2 (ec2, 2019). The three nodes are in Northern Virginia, Ohio, and Northern California respectively, and each node has 8 virtual CPUs. The round trip times between any two nodes are shown in Table 1. We consider two variations in this experiment: (1) Partitioned Database: the database is partitioned across three area zones and no replication is used; (2) Replicated Database: each partition of the database is fully replicated in all area zones, meaning each one has a primary partition and two backup partitions. The primary partition is randomly chosen from the 3 nodes.

In this experiment, we run 6 worker threads and 2 threads for network communication on each node. The group commit frequency is set to once per second. The same YCSB workload from Section 8.2.2 is used, with the skew factor being 1.2 and 50% cross-partition transactions.

As we can observe from Figure 11, RC has the highest throughput in both scenarios, since less coordination is required. In addition, all approaches except S2PL have higher throughput when the database is replicated across area zones. The reasons are twofold: (1) S2PL uses synchronous writes and replication, so more replicas lead to higher latency and lower throughput; (2) all other protocols are optimistic and are able to take advantage of local replicas to reduce network communication. In Partitioned Database, SCAR achieves 26% higher throughput than S2PL and 41% higher throughput than OCC. In Replicated Database, the performance of SCAR and RC is significantly improved. For example, in the experiment running in the local area network setting, SCAR achieves only 62% higher throughput than OCC (i.e., Figure 8(a)). In the wide-area network setting, SCAR’s performance improvement over OCC is 2.3x.

Overall, the performance advantage of SCAR over other approaches is even more significant in the WAN setting.

8.4. Effect of Different Numbers of Replicas

We now study how SCAR performs with different numbers of replicas. In this experiment, we report the results on the YCSB benchmark in Figure 10, with the skew factor being 1.2 and 50% cross-partition transactions. We varied the number of replicas from 1 (only the primary replica exists) to 8 (one primary replica plus seven backup replicas) for each partition.

Since the workload is read-intensive, each approach is expected to have higher throughput when more replicas are available. This is because writes and replication are not a bottleneck in this workload, and more read requests are served locally. For example, RC achieves almost 2.4x higher throughput when 8 replicas are available. In contrast, OCC achieves only 52% higher throughput, since many reads still need validation. SCAR reduces coordination through logical timestamps and achieves 2.2x higher throughput.

In summary, reading from local replicas effectively boosts a system’s performance. By default, each partition in SCAR has three replicas.

Figure 12. Serializability (SR) vs. Snapshot Isolation (SI); Dotted line shows the performance gain of SI over SR
Figure 13. Percentage of serializable (SR) transactions running under snapshot isolation (SI); Red area shows the throughput of SR transactions

8.5. Serializability vs. Snapshot Isolation

We next study how much performance gain SCAR is able to achieve when it runs transactions under snapshot isolation (SI) versus serializability. Our intuition is that the system should commit more transactions per second under SI. The reasons are twofold. First, SI transactions do not detect read/write conflicts, which leads to a lower abort rate in a highly contended workload. Second, SCAR running under SI can lock a transaction’s write set and validate its read set in a single round trip as discussed in Section 6.2.

In this experiment, we only report the results on the YCSB benchmark. This is because, in the Retwis benchmark, the GetTimeline transaction is a read-only transaction and these two isolation levels are the same to it, and TPC-C is a write-intensive benchmark. To increase read/write conflicts, we set the number of operations to 8 on YCSB and all operations follow a Zipfian distribution with a skew factor.

We vary the skew factor from 0 to 2.4. The ratio of cross-partition transactions is 50% and the workload mix is 80/20.

We report the results in Figure 12. Two solid lines show the throughput of SCAR running under different isolation levels. The dotted line shows the performance gain of running under SI over serializability. As there is more contention in the workload, the performance of the system running under both isolation levels goes down. This is because the system has a higher abort rate when the workload becomes more contended. As expected, SCAR running under SI has a larger performance gain when the skew factor increases. For example, the performance gain goes up from 27% to 111% as the skew factor goes up from 0.6 to 2.4.

8.6. Concurrency Anomaly Detection

There is an inherent trade-off between throughput and isolation levels. A system commits more transactions per second under SI than under serializability, but concurrency anomalies may arise. As we discussed in Section 5.2, SCAR is able to give a real-time breakdown of whether each transaction that started as a snapshot isolation transaction commits under serializability.

We used the same workload as in Section 8.5 and report the throughput of SCAR running transactions under SI as well as serializability in Figure 13. As the workload has more contention, the percentage of SI transactions that commit under serializability goes down. For example, when the skew factor is 0.6, 58% of SI transactions commit under serializability. The percentage goes down to 40% as we increase the skew factor to 2.4. This is because fewer transactions are able to commit under a higher isolation level in a highly contended workload.

In summary, there are a large number of transactions that actually commit under serializability when they started as SI transactions. We believe application developers can achieve higher performance with SCAR running under SI and monitor how many concurrency anomalies arise at the same time, with an eye towards switching to serializable mode if too many transactions are not achieving serializability.

Figure 14. Factor analysis for SCAR, OCC and RC: (a) Retwis; (b) YCSB; (c) YCSB. 50% Cross-Partition; Skew factor = 1.2; LR = local read; LV = local read validation; TS = timestamp synchronization; PLV = parallel locking and validation

8.7. Factor Analysis

We now study the effectiveness of each optimization technique in more detail through a factor analysis.

8.7.1. LR, LV and TS Results

In the baseline implementation, SCAR is only allowed to read from primary replicas and validation messages are always sent to primary replicas to better study the effectiveness of local read validation. The timestamp synchronization, as discussed in Section 6.1, is also disabled. Similarly, the baseline implementations of OCC and RC always read from primary replicas as well. We don’t show the results on S2PL in this experiment, since it is not clear to us how to apply these optimization techniques to it.

We introduce one technique at a time to the baseline implementation and show the results in Figure 14(a) and Figure 14(b). +LR refers to the technique that allows a transaction to read from local secondary replicas. +LR+LV refers to adding the local read validation on top of SCAR when the first technique is enabled. A record in a transaction’s read set can be locally validated if its rts ends after the transaction’s commit timestamp. Finally, +LR+LV+TS refers to adding the timestamp synchronization optimization with the first two techniques enabled.

We first ran the same Retwis workload as we did in Figure 7(a) with a skew factor of 1.2. The results are shown in Figure 14(a). The +LR technique gives SCAR a 20% performance gain. Similarly, the +LR technique also helps OCC and RC achieve 10% and 23% performance gains respectively. As we further add the local read validation technique to SCAR (shown as +LR+LV), the performance gain goes up to 1.7x compared to the baseline implementation. When the first two techniques are used with the timestamp synchronization (shown as +LR+LV+TS), SCAR is able to achieve 1.8x higher throughput in total than the baseline implementation.

Similar results are also observed on the YCSB workload as shown in Figure 14(b). The workload is the same as in Figure 8(a) with a skew factor of 1.2. The +LR technique helps OCC and RC achieve 15% and 29% performance gains respectively. SCAR is able to achieve 1.8x higher throughput with all three techniques enabled (shown as +LR+LV+TS).

8.7.2. PLV Results

We now study the effectiveness of the parallel locking and validation (PLV) optimization. This technique potentially reduces one network round trip for SI transactions.

As we discussed in Section 6.2, the +PLV technique can only be applied to SI transactions. For this reason, we ran the workload from Section 8.5. In Figure 14(c), the baseline implementation, which is shown as SCAR SI, refers to SCAR running under SI with all three techniques from Section 8.7.1 enabled. SCAR SI + PLV refers to adding the parallel locking and validation optimization on top of SCAR SI. The results are reported as the performance gain of running under SI over serializability.

When each operation follows a uniform distribution, the +PLV technique enables SCAR SI to have 5% more performance gain. This is because one network round trip is eliminated. As we add more contention to the workload, the additional performance gain goes up to 67%. This is because the +PLV technique can effectively reduce the time that an SI transaction holds locks as well, allowing the system to have a lower abort rate and higher throughput.

9. Related Work

The design of SCAR is inspired by many pieces of related work, including transaction processing, strong consistency with replication, and snapshot isolation systems.

Transaction Processing. The seminal survey by Bernstein et al. (Bernstein and Goodman, 1981) summarizes classic distributed concurrency control protocols, with the exception of optimistic concurrency control (Kung and Robinson, 1981). As in-memory databases become more popular, there has been a resurgent interest in transaction processing in both multicore processors (Kim et al., 2016; Lim et al., 2017; Tu et al., 2013; Wu et al., 2017; Yu et al., 2016) and distributed systems (Cowling and Liskov, 2012; Harding et al., 2017; Mahmoud et al., 2014; Mu et al., 2014). None of these protocols, however, provide high availability via replication.

Query Fresh (Wang et al., 2017a) uses an append-only storage architecture so that the backup nodes of a hot standby system do not block the primary node. SCAR can use the same technique to further decrease the overhead of replication. Obladi (Crooks et al., 2018) reduces bandwidth cost and increases system throughput by delaying updates within epochs. Similarly, SCAR uses asynchronous writes and replication with epochs to increase system throughput.

Strong Consistency with Replication. High availability is typically implemented via replication. Paxos (Lamport, 2001) is a popular solution to coordinate the concurrent reads and writes to different copies of data while providing consistency. Spanner (Corbett et al., 2012) is a Paxos-based transaction processing system based on a two-phase locking protocol. Each read request goes to the master replica, which can be a remote node. Each master replica initiates a Paxos protocol to synchronize with backup replicas. The protocol incurs multiple round-trip messages for data accesses and replication coordination. MDCC (Kraska et al., 2013) is an OCC protocol that exploits generalized Paxos (Lamport, 2005) to reduce the coordination overhead, so that a transaction can commit with a single message round trip in normal operation. Ganymed (Plattner and Alonso, 2004; Plattner et al., 2008) runs update transactions on a single node and propagates writes of committed transactions to a potentially unlimited number of read-only replicas. TAPIR (Zhang et al., 2015) eliminates the overhead of Paxos by allowing inconsistency in the storage system and building consistent transactions using inconsistent replication. Similar to SCAR, TAPIR uses an optimistic protocol to validate transactions. The behavior of TAPIR is similar to the OCC configuration in Section 8, and it suffers from the same limitations. While the systems above are different from the primary-backup design in SCAR, the use of logical timestamps to reduce coordination among replicas is applicable to these systems as well. We leave the exploration of this to future work.

By maintaining multiple data versions, TxCache (Ports et al., 2010) ensures that a transaction always reads from a consistent snapshot regardless of whether each read operation is served by the database or the cache. In SCAR, reads are from a consistent snapshot as long as they can be validated at a given logical time. Warranties (Liu et al., 2014) reduces coordination on read validation by maintaining time-based leases on popular records, but writes have to be delayed. SCAR reduces coordination without penalizing writes, because writes immediately expire the read validity timestamps on old records.
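The validity condition described here can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each record version carries two logical timestamps: `wts`, the time at which the version was written, and `rts`, the time up to which the version is known to be unchanged. The names `Version`, `consistent_read_time`, and `overwrite` are illustrative and are not taken from SCAR's implementation. Under these assumptions, a set of reads forms a consistent snapshot if some logical time falls inside every version's validity interval, and a writer is never delayed because it installs the new version just past the old interval.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Version:
    """A record version valid over the logical interval [wts, rts] (assumed layout)."""
    value: str
    wts: int  # logical time at which this version was written
    rts: int  # logical time up to which this version is known to be unchanged


def consistent_read_time(reads: Iterable[Version]) -> Optional[int]:
    """Return a logical time at which all reads are valid, or None if there is none."""
    reads = list(reads)
    if not reads:
        return 0
    lo = max(v.wts for v in reads)  # every version must already exist
    hi = min(v.rts for v in reads)  # no version may be past its validity interval
    return lo if lo <= hi else None


def overwrite(old: Version, new_value: str) -> Version:
    """Install a new version immediately after the old version's validity interval.

    The writer does not wait: the new version starts at old.rts + 1, so the old
    version is simply not valid at any later logical time.
    """
    ts = old.rts + 1
    return Version(new_value, wts=ts, rts=ts)


x = Version("x0", wts=1, rts=4)
y = Version("y0", wts=3, rts=4)
print(consistent_read_time([x, y]))   # both versions are valid at logical time 3

y2 = overwrite(y, "y1")               # new version of y is valid from logical time 5
print(consistent_read_time([x, y2]))  # None: x is only known valid up to time 4
```

In the second call the reader holds a version of `x` whose validity ends before the new version of `y` begins, so in this sketch it would have to re-read or revalidate `x` (or abort) rather than delay the writer.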

Isolation Levels. Berenson et al. (Berenson et al., 1995) provide an excellent explanation of the isolation levels commonly used in databases. Binnig et al. (Binnig et al., 2014) show how only distributed SI transactions pay the cost of coordination. In SCAR, cross-node coordination for local transactions is likewise unnecessary, owing to the use of logical timestamps. Serial safety net (SSN) (Wang et al., 2017b) can make almost any concurrency control protocol serializable by detecting dependency cycles. The same technique can be applied to SCAR as well. Furthermore, SCAR is able to monitor concurrency anomalies from SI transactions through a simple equality check.

Due to the overhead of implementing strong isolation, many systems use weaker isolation levels instead (e.g., PSI (Sovran et al., 2011), causal consistency (Mehdi et al., 2017), eventual consistency (Terry et al., 1995), or no consistency (Recht et al., 2011)). Lower isolation levels trade programmability for performance and scalability. In this paper, we focus on serializability and snapshot isolation, which are the gold standard for transactional applications and the default isolation levels in all major relational systems.

10. Conclusion

In this paper, we presented SCAR, a new distributed and replicated in-memory database. It allows transactions to read from secondary replicas and enforces strong consistency. The writes of transactions are asynchronously replicated to secondary replicas and applied in any order. By serializing transactions in a logical-time order, the system is able to enforce strong consistency without expensive coordination. Our results on three popular benchmarks show that the system outperforms conventional designs by up to a factor of 2. In workloads with high contention, we also demonstrated that higher throughput can be achieved by running transactions under reduced isolation levels and the system can effectively monitor concurrency anomalies in real time.
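One well-known way to make replicated writes safe to apply in any order is the Thomas write rule (Thomas, 1979): a replica applies a write only if its timestamp is newer than the one already stored with the record. The following Python sketch illustrates that rule under two stated assumptions: each replicated write carries the logical timestamp of the transaction that produced it, and timestamps are unique per key. The `Write` tuple and the `apply_writes` function are hypothetical names used for illustration, and the sketch is not a description of SCAR's actual replication code.

```python
from itertools import permutations
from typing import Dict, Iterable, Tuple

Write = Tuple[str, int, str]  # (key, logical write timestamp, value)


def apply_writes(writes: Iterable[Write]) -> Dict[str, Tuple[int, str]]:
    """Apply replicated writes in the given order, keeping only the newest per key."""
    store: Dict[str, Tuple[int, str]] = {}
    for key, wts, value in writes:
        current = store.get(key)
        # Skip a write that is not newer than what the replica already holds.
        if current is None or wts > current[0]:
            store[key] = (wts, value)
    return store


log = [("a", 1, "a1"), ("a", 3, "a3"), ("b", 2, "b2"), ("a", 2, "a2")]

# Every delivery order of the same writes converges to the same replica state.
states = [apply_writes(order) for order in permutations(log)]
assert all(state == states[0] for state in states)
```

Because stale writes are dropped rather than applied, the replica does not need the primary to ship writes in commit order, which is what removes ordering coordination from the replication path in this sketch.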


References

  • tpc (2010) 2010. TPC Benchmark C.
  • ec2 (2019) 2019. Amazon EC2.
  • Athanassoulis et al. (2009) Manos Athanassoulis, Ryan Johnson, Anastasia Ailamaki, and Radu Stoica. 2009. Improving OLTP Concurrency through Early Lock Release.
  • Berenson et al. (1995) Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton, Elizabeth J. O’Neil, and Patrick E. O’Neil. 1995. A Critique of ANSI SQL Isolation Levels. In SIGMOD Conference. 1–10.
  • Bernstein and Goodman (1981) Philip A. Bernstein and Nathan Goodman. 1981. Concurrency Control in Distributed Database Systems. ACM Comput. Surv. 13, 2 (1981), 185–221.
  • Bernstein et al. (1979) Philip A. Bernstein, David W. Shipman, and Wing S. Wong. 1979. Formal Aspects of Serializability in Database Concurrency Control. IEEE Trans. Software Eng. 5, 3 (1979), 203–216.
  • Binnig et al. (2014) Carsten Binnig, Stefan Hildenbrand, Franz Färber, Donald Kossmann, Juchang Lee, and Norman May. 2014. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally. VLDB J. 23, 6 (2014), 987–1011.
  • Cooper et al. (2010) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In SoCC. 143–154.
  • Corbett et al. (2012) James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In OSDI. 261–264.
  • Cowling and Liskov (2012) James A. Cowling and Barbara Liskov. 2012. Granola: Low-Overhead Distributed Transaction Coordination. In ATC. 223–235.
  • Crooks et al. (2018) Natacha Crooks, Matthew Burke, Ethan Cecchetti, Sitar Harel, Rachit Agarwal, and Lorenzo Alvisi. 2018. Obladi: Oblivious Serializable Transactions in the Cloud. In OSDI. 727–743.
  • DeCandia et al. (2007) Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: amazon’s highly available key-value store. In SOSP. 205–220.
  • Dey (2013) Akon Dey. 2013. Scalable Transactions across Heterogeneous NoSQL Key-Value Data Stores. PVLDB 6, 12 (2013), 1434–1439.
  • Dey et al. (2014) Akon Dey, Alan Fekete, Raghunath Nambiar, and Uwe Röhm. 2014. YCSB+T: Benchmarking web-scale transactional databases. In ICDE Workshops. 223–230.
  • Eidson (2002) John Eidson. 2002. IEEE 1588 standard for a precision clock synchronization protocol for networked measurement and control systems.
  • Eswaran et al. (1976) Kapali P. Eswaran, Jim Gray, Raymond A. Lorie, and Irving L. Traiger. 1976. The Notions of Consistency and Predicate Locks in a Database System. Commun. ACM 19, 11 (1976), 624–633.
  • Gray et al. (1994) Jim Gray, Prakash Sundaresan, Susanne Englert, Kenneth Baclawski, and Peter J. Weinberger. 1994. Quickly Generating Billion-Record Synthetic Databases. In SIGMOD Conference. 243–252.
  • Harding et al. (2017) Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. An Evaluation of Distributed Concurrency Control. PVLDB 10, 5 (2017), 553–564.
  • Kim et al. (2016) Kangnyeon Kim, Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2016. ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads. In SIGMOD Conference. 1675–1687.
  • Kraska et al. (2013) Tim Kraska, Gene Pang, Michael J. Franklin, Samuel Madden, and Alan Fekete. 2013. MDCC: multi-data center consistency. In Eurosys. 113–126.
  • Kung and Robinson (1981) H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (1981), 213–226.
  • Lamport (2001) Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 18–25.
  • Lamport (2005) Leslie Lamport. 2005. Generalized Consensus and Paxos.
  • Leau (2013) Costin Leau. 2013. Spring Data Redis - Retwis-J.
  • Lim et al. (2017) Hyeontaek Lim, Michael Kaminsky, and David G. Andersen. 2017. Cicada: Dependably Fast Multi-Core In-Memory Transactions. In SIGMOD Conference. 21–35.
  • Liu et al. (2014) Jed Liu, Tom Magrino, Owen Arden, Michael D. George, and Andrew C. Myers. 2014. Warranties for Faster Strong Consistency. In NSDI. 503–517.
  • Mahmoud et al. (2014) Hatem A. Mahmoud, Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, and Amr El Abbadi. 2014. MaaT: Effective and scalable coordination of distributed transactions in the cloud. PVLDB 7, 5 (2014), 329–340.
  • Mehdi et al. (2017) Syed Akbar Mehdi, Cody Littley, Natacha Crooks, Lorenzo Alvisi, Nathan Bronson, and Wyatt Lloyd. 2017. I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades. In NSDI. 453–468.
  • Mills (1991) David L Mills. 1991. Internet time synchronization: the network time protocol. IEEE Transactions on communications 39, 10 (1991), 1482–1493.
  • Mu et al. (2014) Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. 2014. Extracting More Concurrency from Distributed Transactions. In OSDI. 479–494.
  • Peluso et al. (2012) Sebastiano Peluso, Paolo Romano, and Francesco Quaglia. 2012. SCORe: A Scalable One-Copy Serializable Partial Replication Protocol. In Middleware Conference. 456–475.
  • Plattner and Alonso (2004) Christian Plattner and Gustavo Alonso. 2004. Ganymed: Scalable Replication for Transactional Web Applications. In Middleware Conference. 155–174.
  • Plattner et al. (2008) Christian Plattner, Gustavo Alonso, and M. Tamer Özsu. 2008. Extending DBMSs with satellite databases. VLDB J. 17, 4 (2008), 657–682.
  • Ports et al. (2010) Dan R. K. Ports, Austin T. Clements, Irene Zhang, Samuel Madden, and Barbara Liskov. 2010. Transactional Consistency and Automatic Management in an Application Data Cache. In OSDI. 279–292.
  • Recht et al. (2011) Benjamin Recht, Christopher Ré, Stephen J. Wright, and Feng Niu. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS. 693–701.
  • Reed (1978) David P. Reed. 1978. Naming and synchronization in a decentralized computer system. Ph.D. Dissertation. MIT, Cambridge, MA, USA.
  • Shute et al. (2013) Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. 2013. F1: A Distributed SQL Database That Scales. PVLDB 6, 11 (2013), 1068–1079.
  • Sovran et al. (2011) Yair Sovran, Russell Power, Marcos K. Aguilera, and Jinyang Li. 2011. Transactional storage for geo-replicated systems. In SOSP. 385–400.
  • Terry et al. (1995) Douglas B. Terry, Marvin Theimer, Karin Petersen, Alan J. Demers, Mike Spreitzer, and Carl Hauser. 1995. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. In SOSP. 172–183.
  • Thomas (1979) Robert H. Thomas. 1979. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. TODS 4, 2 (1979), 180–209.
  • Tu et al. (2013) Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy transactions in multicore in-memory databases. In SOSP. 18–32.
  • Wang et al. (2017b) Tianzheng Wang, Ryan Johnson, Alan Fekete, and Ippokratis Pandis. 2017b. Efficiently making (almost) any concurrency control mechanism serializable. VLDB J. 26, 4 (2017), 537–562.
  • Wang et al. (2017a) Tianzheng Wang, Ryan Johnson, and Ippokratis Pandis. 2017a. Query Fresh: Log Shipping on Steroids. PVLDB 11, 4 (2017), 406–419.
  • Wu et al. (2017) Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. 2017. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. PVLDB 10, 7 (2017), 781–792.
  • Yu et al. (2016) Xiangyao Yu, Andrew Pavlo, Daniel Sánchez, and Srinivas Devadas. 2016. TicToc: Time Traveling Optimistic Concurrency Control. In SIGMOD Conference. 1629–1642.
  • Yu et al. (2018) Xiangyao Yu, Yu Xia, Andrew Pavlo, Daniel Sánchez, Larry Rudolph, and Srinivas Devadas. 2018. Sundial: Harmonizing Concurrency Control and Caching in a Distributed OLTP Database Management System. PVLDB 11, 10 (2018), 1289–1302.
  • Zhang et al. (2015) Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, and Dan R. K. Ports. 2015. Building consistent transactions with inconsistent replication. In SOSP. 263–278.
  • Zheng et al. (2014) Wenting Zheng, Stephen Tu, Eddie Kohler, and Barbara Liskov. 2014. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In OSDI. 465–477.