Recent years have seen a number of in-memory transaction processing systems that can run hundreds of thousands to millions of transactions per second by leveraging multi-core parallelism (Stonebraker et al., 2007; Tu et al., 2013; Yu et al., 2016). These systems can be broadly classified into i) partitioning-based systems, e.g., H-Store (Stonebraker et al., 2007), which partitions data onto different cores or machines, and ii) non-partitioned systems that try to minimize the overheads associated with concurrency control in a single-server, multi-core setting by avoiding locks and contention whenever possible (Narula et al., 2014; Neumann et al., 2015; Tu et al., 2013; Yu et al., 2016), while allowing any transaction to run on any core/processor.
As shown in Figure 1, partitioning-based systems work well when workloads have few cross-partition transactions, because they are able to avoid the use of any synchronization and thus scale out to multiple machines. However, these systems suffer when transactions need to access more than one partition, especially in a distributed setting where expensive protocols like two-phase commit are required to coordinate these cross-partition transactions. In contrast, non-partitioned approaches provide somewhat lower performance on highly-partitionable workloads due to their inability to scale out, but are not sensitive to how partitionable the workload is.
In this paper, we propose a new transaction processing system, STAR, that is able to achieve the best of both worlds. We start with the observation that most modern transactional systems will keep several replicas of data for high availability purposes. In STAR, we ensure that one of these replicas is complete, i.e. it stores all records on a single machine, as in the recent non-partitioned approaches. We also ensure that at least one of the replicas is partitioned across several processors or machines, as in partitioned schemes. The system runs in two phases, using a novel phase-switching protocol: in the partitioned phase, transactions that can be executed completely on a single partition are mastered at one of the partial replicas storing that partition, and replicated to all other replicas to ensure fault tolerance and consistency. Cross-partition transactions are executed during the single-master phase, during which mastership for all records is switched to one of the complete replicas, which runs the transactions to completion and replicates their results to the other replicas. Because this node already has a copy of every record, changing the master for a record (re-mastering) can be done without transferring any data, ensuring lightweight phase switching. Furthermore, because the node has a complete replica, transactions can be coordinated within a single machine without a slow commit protocol like two-phase commit. In this way, cross-partition transactions do not incur commit overheads as in typical partitioned approaches (because they all run on the single master), and can be executed using a lightweight concurrency control protocol like Silo (Tu et al., 2013). Meanwhile, single-partition transactions can still be executed without any concurrency control at all, as in H-Store (Stonebraker et al., 2007), and can be run on several machines in parallel to exploit more concurrency. 
The latency that phase switching incurs is no larger than the typical delay used in high-performance transaction processing systems for group commit.
Although the primary contribution of STAR is this phase-switching protocol, to build the system, we had to explore several novel aspects of in-memory concurrency control. In particular, prior systems like Silo (Tu et al., 2013) and TicToc (Yu et al., 2016) were not designed with high availability (replication) in mind. Making replication work efficiently in these systems requires some care, and is amenable to a number of optimizations. For example, STAR uses intra-phase asynchronous replication to achieve high performance. At the same time, it ensures consistency among replicas via a replication fence when phase-switching occurs. In addition, with our phase-switching protocol, STAR can use a cheaper replication strategy than that employed by replicated systems that need to replicate entire records (Zheng et al., 2014). This optimization can significantly reduce bandwidth requirements (e.g., by up to an order of magnitude in our experiments with TPC-C).
Our system does require one replica to hold an entire copy of the database; however, we believe this is not an onerous restriction. First, existing single-node main-memory database systems have a similar restriction. Second, modern high-end servers are equipped with hundreds of gigabytes to terabytes of RAM and tens of CPU cores, which is sufficient for most transactional workloads. For example, a database machine with 2 TB of RAM might store a 1 TB database, which is sufficient to store 10 kilobytes of online state about each customer in a database with about 100M customers.
In summary, STAR is a new distributed and replicated in-memory database that employs both partitioning and replication. It encompasses a number of interesting aspects:
By exploiting multicore parallelism and fast networks, STAR is able to provide high throughput with serializability guarantees.
It employs a phase-switching scheme which enables STAR to execute cross-partition transactions without two-phase commit while preserving fault tolerance guarantees.
It uses a hybrid replication strategy to reduce the overhead of replication, while providing transactional consistency and high-availability.
In addition, we present a detailed evaluation of STAR that demonstrates its ability to provide adaptivity, high availability, and high-performance transaction execution. STAR outperforms systems that employ conventional distributed concurrency control algorithms by up to one order of magnitude on YCSB and TPC-C.
In this section, we describe how concurrency control allows database systems to execute transactions with serializability guarantees. We also introduce how replication is used in database systems with consistency guarantees.
2.1. Concurrency Control Protocols
Serializability — where the operations of multiple transactions are interleaved to provide concurrency while ensuring that the state of the database is equivalent to some serial ordering of the transactions — is the gold standard for transaction execution.
Many serializable concurrency control protocols have been proposed, starting from early protocols that played an important role in hiding latency from disks (Bernstein et al., 1987) to modern OLTP systems that are designed to exploit multicore parallelism (Tu et al., 2013; Yu et al., 2016), typically employing lock-based and/or optimistic techniques.
Two-phase locking (2PL) is the most widely used classical protocol to ensure serializability of concurrent transactions (Eswaran et al., 1976). 2PL is considered pessimistic since the database acquires locks on operations even when there is no conflict. In contrast, optimistic concurrency control (OCC) protocols avoid this by only checking for conflicts at the end of a transaction’s execution (Kung and Robinson, 1981). OCC runs transactions in three phases: read, validation, and write. In the read phase, transactions perform read operations from the database and write operations to local copies of objects without acquiring locks. In the validation phase, conflict checks are done against all concurrently committing transactions. If conflicts exist, the transaction aborts. Otherwise, it enters the write phase and copies its local modifications to the database. Modern systems, like Silo (Tu et al., 2013), typically employ OCC-like techniques because they make it easier to avoid the overhead of shared locks during query execution.
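The three OCC phases described above can be sketched as follows. This is a minimal, single-threaded illustration rather than the protocol of any particular system: the `Database` and `Transaction` classes and the per-record version counter are assumptions made for the example, and validation plus write are assumed to run atomically.

```python
class Database:
    """Toy versioned key-value store; every record carries a version counter."""
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def get(self, key):
        return self.data.get(key, (None, 0))


class Transaction:
    def __init__(self, db):
        self.db = db
        self.read_set = {}   # key -> version observed during the read phase
        self.write_set = {}  # key -> new value, buffered locally

    def read(self, key):
        if key in self.write_set:          # read your own buffered write
            return self.write_set[key]
        value, version = self.db.get(key)
        self.read_set[key] = version
        return value

    def write(self, key, value):
        self.write_set[key] = value        # no lock taken, nothing visible yet

    def commit(self):
        # Validation phase: abort if anything we read has changed since.
        for key, seen in self.read_set.items():
            if self.db.get(key)[1] != seen:
                return False
        # Write phase: install buffered writes and bump versions.
        for key, value in self.write_set.items():
            _, version = self.db.get(key)
            self.db.data[key] = (value, version + 1)
        return True
```

If two transactions read the same record and both try to update it, the first to commit succeeds and the second fails validation because the version it observed is stale.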
In distributed database systems, cross-partition transactions involving many machines are classically coordinated using a two-phase commit (2PC) (Mohan et al., 1986) protocol to achieve fault tolerance, since machines can fail independently. The coordinator decides to commit or abort a transaction based on decisions collected from workers in the prepare phase. Workers must durably remember their decisions in the prepare phase until they learn the transaction outcome. Once the decision on the coordinator is made, 2PC enters the commit phase and workers commit or abort the transaction based on its decision. Although 2PC does ensure serializability, the additional overhead of multiple log messages and network round trips for each transaction can significantly reduce the throughput of distributed transactions. In STAR, we avoid two-phase commit by employing a phase-switching protocol to re-master records in distributed transactions to a single primary, which runs transactions locally, and replicates them asynchronously.
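To make the cost of 2PC concrete, the sketch below walks through the prepare and commit phases in a single process. The `Worker` class, its in-memory `log` list (standing in for a durable log), and the `two_phase_commit` function are hypothetical names chosen for illustration; failures and timeouts are not modeled.

```python
class Worker:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []          # stands in for a durable log
        self.state = "active"

    def prepare(self, txn_id):
        vote = "yes" if self.can_commit else "no"
        self.log.append((txn_id, "prepared", vote))  # the vote must survive a crash
        return vote

    def finish(self, txn_id, decision):
        self.log.append((txn_id, decision))
        self.state = decision


def two_phase_commit(txn_id, workers):
    # Phase 1 (prepare): one round trip to collect votes.
    votes = [w.prepare(txn_id) for w in workers]
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    # Phase 2 (commit): a second round trip to broadcast the outcome.
    for w in workers:
        w.finish(txn_id, decision)
    return decision
```

Every transaction pays for two log writes per worker and two network round trips, which is the per-transaction overhead the text refers to.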
Modern database systems need to be highly available. When a subset of servers in a cluster fail, the system needs to quickly reconfigure itself and replace a failed server with a standby machine, such that an end user does not experience any noticeable downtime. High availability requires the data to be replicated across multiple machines in order to allow for fast fail-over.
Primary/backup replication is one of the most widely used schemes. After the successful execution of a transaction, the primary node sends the log to all involved backup nodes. The log can contain either values (Microsoft, 2016; McInnis, 2003; MySQL, 2018; PostgreSQL, 2018), which are applied by the backup nodes to the replica, or operations (Qin et al., 2017), which are re-executed by the backup nodes. For a distributed database, each piece of data has one primary node and one or multiple backup nodes. A backup node for some data can also be a primary node for some other data.
In distributed database systems, both 2PC and replication are important to ensure ACID properties. They are both expensive compared to a single node system, but in different ways. In particular, replication incurs very large data transfer but does not necessarily need expensive coordination for each transaction.
3. STAR Architecture
STAR is a distributed and replicated in-memory database. It supports a relational data model, where a table has typed and named records. As in most high performance transactional systems (Stonebraker et al., 2007; Tu et al., 2013; Yu et al., 2016), clients send requests to STAR by calling pre-defined stored procedures. Parameters of a stored procedure must also be passed to STAR with the request. Arbitrary logic (e.g., read/write operations) is supported in stored procedures, which are implemented in C++.
Tables in STAR are implemented as collections of hash tables, which is typical in many in-memory databases (Wang and Kimura, 2016; Wei et al., 2015). Each table is built on top of a primary hash table and contains zero or more secondary hash tables as secondary indexes. To access a record, STAR probes the hash table with the primary key. Fields with secondary indexes can be accessed by mapping a value to the relevant primary key. Although STAR is built on top of hash tables, it is easily adaptable to other data structures such as Masstree (Mao et al., 2012).
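The table layout described above can be sketched with ordinary dictionaries. This is an illustration only: the `Table` class and its method names are hypothetical, and the choice to let one secondary-index value map to several primary keys is an assumption that the text does not specify.

```python
class Table:
    def __init__(self, secondary_fields=()):
        self.primary = {}  # primary key -> record (a dict of fields)
        # One hash table per indexed field: field value -> set of primary keys.
        self.secondary = {f: {} for f in secondary_fields}

    def insert(self, key, record):
        self.primary[key] = record
        for field, index in self.secondary.items():
            index.setdefault(record[field], set()).add(key)

    def get(self, key):
        """Probe the primary hash table with the primary key."""
        return self.primary.get(key)

    def lookup(self, field, value):
        """Map a secondary-index value to primary keys, then fetch the records."""
        keys = self.secondary[field].get(value, set())
        return [self.primary[k] for k in sorted(keys)]
```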
STAR uses a variant of Silo’s OCC protocol (Tu et al., 2013). Each transaction is assigned a transaction ID (TID) when it begins validation. The TID is used to determine which other transactions a committing transaction may have a conflict with. There are three criteria for the TID obtained from each thread: (a) it must be larger than the TID of any record in read/write-set; (b) it must be larger than the thread’s last chosen TID; (c) it must be in the current global epoch. Unlike Silo, STAR commits transactions and increments the global epoch when the system switches to the next phase.
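The three TID criteria can be expressed in a few lines. The bit layout below (epoch in the high bits, a sequence number in the low 32 bits) is an assumption made for this sketch, as are the names `generate_tid` and `epoch_of`; the text does not state STAR's actual encoding, and sequence overflow is ignored.

```python
EPOCH_SHIFT = 32  # assumed layout: epoch in the high bits, sequence in the low bits


def generate_tid(record_tids, last_tid, global_epoch):
    """Return the smallest TID that satisfies criteria (a), (b), and (c)."""
    # (a) and (b): exceed every record TID in the read/write set and the
    # thread's last chosen TID.
    floor = max([last_tid, *record_tids])
    # (c): the smallest TID that lies inside the current global epoch.
    epoch_start = global_epoch << EPOCH_SHIFT
    return max(floor + 1, epoch_start)


def epoch_of(tid):
    return tid >> EPOCH_SHIFT
```

The sketch assumes that no record carries a TID from a later epoch than the current one, which holds if the epoch only ever increases.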
STAR is serializable by default but can also support read committed and snapshot isolation. A transaction runs under read committed by skipping read validation on commit, since STAR uses OCC and uncommitted data never appears in the database. STAR can provide snapshot isolation by retaining additional versions for records (Tu et al., 2013).
STAR runs on a cluster of nodes such that each node has a partial or full copy of each table, as shown in Figure 2. Here nodes (left side of figure) employ full replication. During the partitioned phase, these nodes act as the masters for a fraction of the database and backups for the rest of the database, while the remaining partial replicas (right side of figure), act as the masters for the rest of the database. During the single-master phase, one of the nodes (left side of figure) acts as the master for the whole database. Note that writes of committed transactions are replicated at most times in a cluster of nodes. We envision being small (e.g., 1), while can be much larger.
At any point in time, each record is mastered on one node, with other nodes serving as secondaries. Transactions are only executed over primary records; writes of committed transactions are propagated to all replicas. Each stored procedure request is sent to a database worker thread. In STAR, each physical CPU core runs only one worker thread. After a transaction commits, the worker thread replicates the transaction’s writes to other nodes, where a worker thread serves the replication request.
STAR separates the execution of single-partition transactions and cross-partition transactions using a novel phase-switching protocol. The system dynamically switches between the partitioned phase and the single-master phase. In STAR, each partition is mastered by a single worker thread. Queries that touch only a single partition are executed one at a time to completion on that partition in the partitioned phase. Queries that touch multiple partitions (even if those partitions are on a single machine) are executed in the single-master phase on a single designated coordinator node, selected from amongst the nodes with a full copy of the database. Because the coordinator employs full replication, STAR can perform this logical re-partitioning (where the coordinator node takes mastership of all records in the database) nearly instantaneously. Thus, during the single-master phase, the coordinator node becomes the primary for all records, and executes the cross-partition transactions on multiple threads. The writes of these cross-partition transactions still need to be sent to replicas, but a multi-round distributed transaction coordination protocol like two-phase commit, which can significantly limit the throughput of distributed database systems (Harding et al., 2017), is not needed as the coordinator runs all cross-partition transactions and propagates writes to replicas. After all cross-partition transactions are executed, the system switches back to the partitioned phase and mastership is re-allocated to individual nodes and worker threads in the cluster.
There are several advantages of this phase switching approach. First, in the partitioned phase, each partition is touched only by a single thread, which means that no locks are needed, and, as will be described in Section 5, this also allows us to perform operation replication, where logical updates of operations are streamed to replicas, instead of transmitting full records. Second, in the single-master phase, we avoid executing distributed transactions. However, because a transaction runs over multiple partitions and multiple threads may access a single partition, the system has to employ record locking during the validation and write phases (as in Silo (Tu et al., 2013)), and also has to use a more expensive full-record replication scheme (similar to the way that SiloR must write full copies of records to logs (Zheng et al., 2014) for durability).
STAR can tolerate up to failed nodes simultaneously with a cluster of nodes. The details of our fault tolerance approach, including how we handle coordinator failure, will be given in Section 7.
4. The Phase Switching Algorithm
We now describe the phase switching protocol we use to separate the execution of single-partition and cross-partition transactions. We first describe the two phases and next discuss how the system transitions between them. In the end, we will give a brief proof to show that STAR produces serializable results.
4.1. Partitioned Phase Execution
Each node serves as primary for a subset of the records in the partitioned phase, as shown on the top of Figure 3. During this phase, we restrict the system to run transactions which only read from and write into a single partition. Cross-partition transactions are deferred for later execution in the single-master phase. During the partitioned phase, we use operation replication for better performance, as discussed in Section 5.2.
Each record contains a TID indicating the last transaction that updated the record. In the partitioned phase, a transaction keeps a read set and a write set in its local copy, in case a transaction is aborted by the application explicitly. For example, an invalid item ID may be generated in TPC-C and a transaction with an invalid item ID is supposed to abort during execution. At commit time, it’s not necessary to lock all records in the write set and do read validation, since each partition is only updated by one thread. The system still generates a TID for each transaction and replicates a transaction’s write set for recovery purposes.
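A minimal sketch of partitioned-phase execution follows. It assumes one `Partition` object per worker thread, models the stored procedure as a callable that receives `read` and `write` functions, and simplifies TID generation to a per-partition counter; all of these names and simplifications are illustrative assumptions.

```python
class Abort(Exception):
    """Raised by a stored procedure to abort explicitly (e.g., invalid item ID)."""


class Partition:
    def __init__(self):
        self.records = {}          # key -> (value, tid of last writer)
        self.last_tid = 0
        self.replication_log = []  # (tid, write_set) entries in commit order

    def run(self, procedure):
        """Run one transaction to completion on the thread that owns this partition."""
        write_set = {}

        def read(key):
            if key in write_set:
                return write_set[key]
            return self.records.get(key, (None, 0))[0]

        def write(key, value):
            write_set[key] = value  # buffered so an explicit abort leaves no trace

        try:
            procedure(read, write)
        except Abort:
            return False            # buffered writes are simply dropped
        self.last_tid += 1          # a TID is still generated for recovery
        for key, value in write_set.items():
            self.records[key] = (value, self.last_tid)  # no locks, no read validation
        self.replication_log.append((self.last_tid, dict(write_set)))
        return True
```

Because only one thread ever calls `run` on a given partition, the commit path needs neither locking nor validation, and log entries leave the partition in commit order.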
4.2. Single-Master Phase Execution
Any transaction can run in the single-master phase. Threads on the designated coordinator can access any record in any partition, since it has become the primary for all records, as shown on the bottom of Figure 3. We use multiple threads to run transactions using a variant of Silo’s OCC protocol (Tu et al., 2013) in the single-master phase.
A transaction reads data and the associated TIDs, and keeps them in its read set for later read validation. During transaction execution, a write set is computed and kept in a local copy. At commit time, each record in the write set is locked in a global order (e.g., the addresses of records) to prevent deadlocks. The transaction next generates a TID based on its read set and write set. The transaction will also abort during read validation if any record in the read set is modified (by comparing TIDs in the read set) or locked. Finally, records are updated and unlocked with the new TID. After a transaction commits (note that the result is not released to clients until the next phase switch occurs), the system replicates its write set with value replication.
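The commit steps above can be sketched as a single function. This is an assumption-laden illustration, not STAR's implementation: `Record` and `commit` are hypothetical names, Python's `id()` stands in for a record's memory address, the epoch component of the TID is omitted, and a locked read-set record is treated as a conflict only when this transaction does not hold the lock itself.

```python
import threading


class Record:
    def __init__(self, value):
        self.value = value
        self.tid = 0
        self.lock = threading.Lock()


def commit(read_set, write_set, last_tid):
    """read_set: {Record: TID seen}; write_set: {Record: new value}.
    Returns the new TID on success, or None if the transaction aborts."""
    locked = sorted(write_set, key=id)   # one global lock order prevents deadlock
    for rec in locked:
        rec.lock.acquire()
    try:
        # TID larger than every TID in the read/write set and the last one chosen.
        seen_tids = [r.tid for r in read_set] + [r.tid for r in write_set]
        tid = max([last_tid, *seen_tids]) + 1
        # Read validation: abort on a changed TID or a lock held by someone else.
        for rec, seen in read_set.items():
            if rec.tid != seen:
                return None
            if rec.lock.locked() and rec not in write_set:
                return None
        # Write phase: install new values tagged with the new TID.
        for rec, value in write_set.items():
            rec.value, rec.tid = value, tid
        return tid
    finally:
        for rec in locked:
            rec.lock.release()           # unlock after install, or on abort
```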
4.3. Phase Transitions
We now describe how STAR transitions between the two phases, which alternate after the system is started. The system starts in the partitioned phase in which cross-partition transactions are deferred for later execution.
For ease of presentation, we assume that all cross-partition transaction requests go to the coordinator node and all participant nodes only receive single-partition transaction requests. In real scenarios, this could be implemented via router nodes that are aware of the partitioning of the database. If some transaction accesses multiple partitions on a participant node, the system would re-route the request to the coordinator node.
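A router of the kind mentioned above might classify requests as in the sketch below. It assumes the set of keys a stored procedure touches can be derived from its parameters before execution; `route` and the `partition_of` callback are hypothetical names.

```python
def route(keys, partition_of):
    """Classify a request by the set of partitions its keys map to."""
    touched = {partition_of(k) for k in keys}
    if len(touched) == 1:
        return ("partitioned", touched.pop())   # run on that partition's master
    return ("single-master", None)              # defer to the coordinator node
```

For a TPC-C-like schema, `partition_of` could map a key such as `(warehouse_id, item_id)` to a partition based on the warehouse ID alone.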
In the partitioned phase, a thread on each partition fetches requests from clients and runs these transactions as discussed in Section 4.1. When the execution time in the partitioned phase exceeds a given threshold , the coordinator node switches all nodes into the single-master phase.
Before the phase switching occurs, the coordinator stops all worker threads. It also initiates a “replication fence” by sending a “start fence” message to all nodes, as shown in Figure 4. During a replication fence, all nodes synchronize statistics about the number of committed transactions with one another. From these statistics each node learns how many outstanding writes it is waiting to see; nodes then wait until they have received and applied all writes from the replication stream to their local database. When a node has applied all writes from the replication stream, it sends an “end fence” message to the coordinator. Once the coordinator has received these messages from all nodes, it switches the system to the other phase.
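The replication fence can be sketched in a single process as follows. The sketch counts individual replicated writes rather than committed transactions, which is a simplifying assumption, and `Node`, `outstanding`, and `replication_fence` are illustrative names; real nodes would exchange these counters over the network.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.sent = {}      # peer name -> number of writes sent to that peer
        self.applied = {}   # peer name -> number of writes applied from that peer
        self.inbox = []     # replication stream entries not yet applied

    def replicate(self, peer):
        self.sent[peer.name] = self.sent.get(peer.name, 0) + 1
        peer.inbox.append(self.name)   # applying may lag behind sending

    def apply_one(self):
        src = self.inbox.pop(0)
        self.applied[src] = self.applied.get(src, 0) + 1

    def outstanding(self, peers):
        """Writes still to apply, given the counts peers report during the fence."""
        return sum(p.sent.get(self.name, 0) - self.applied.get(p.name, 0)
                   for p in peers)


def replication_fence(nodes):
    """Coordinator view: returns once every node would send its 'end fence'."""
    for node in nodes:
        peers = [p for p in nodes if p is not node]
        while node.outstanding(peers) > 0:
            node.apply_one()
    return True
```

Only after the fence returns may the coordinator switch phases, so no write from the previous phase can arrive after mastership has changed.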
In the single-master phase, worker threads on the coordinator node pull requests from clients and run transactions as discussed in Section 4.2. Meanwhile, the coordinator node sends writes of committed transactions to replicas and all the other nodes stand by for replication. To further improve the utilization of servers, read-only transactions can run on replicas at the client’s discretion. STAR only supports the read committed isolation level for read-only transactions, since replication is out of order in the single-master phase and a client is not guaranteed to see a consistent view on replicas even if it reads from one partition. Once the execution time in the single-master phase exceeds a given threshold , the system switches back to the partitioned phase using another replication fence.
The parameters and are set dynamically according to the ratio of cross-partition transactions in the workload. The iteration time (i.e., ) dictates the latency of committed transactions, since the system commits transactions only when the current phase ends. Intuitively, the system spends less time on synchronization with a longer iteration time. As will be noted in Section 8.4, we set the default iteration time to 10 ms; this provides good throughput while keeping latency at a typical level for high throughput transaction processing systems (e.g., Silo (Tu et al., 2013) uses 40 ms as a default).
We now give a brief argument that transactions executed in STAR are serializable.
A transaction only executes in a single phase, i.e., it either runs in the partitioned phase or the single-master phase. A replication fence between the partitioned phase and the single-master phase ensures that all writes from the replication stream have been applied to the database before switching to the next phase.
In the partitioned phase, there is only one thread running transactions serially on each partition. Each executed transaction only touches one partition, which makes transactions clearly serializable. In the single-master phase, STAR implements a variant of Silo’s OCC protocol (Tu et al., 2013) to ensure that concurrent transactions are serializable. With value replication, the secondary records are correctly synchronized, even though the log entries from replication stream may be applied in an arbitrary order, as will be noted in Section 5.
In this section, we first justify our decision to employ a combination of full replication and partial replication, and then discuss the details of our replication scheme.
5.1. STAR’s Replication Design
Replication is used to make database systems highly available, allowing service to continue despite the failure of one or more nodes. In STAR, all records are replicated on the nodes with a full copy of the database and partitioned across the nodes with a partial copy of the database. This design makes the system most applicable to a medium number of nodes.
Each record has a single primary, with other nodes serving as secondaries. Transactions run on primary records, and updates are propagated to secondaries. We use partition-level mastering in STAR. It would be easy to adapt our scheme to finer granularity mastering via two-tier partitioning (Taft et al., 2014) or look-up tables (Tatarowicz et al., 2012).
STAR employs intra-phase asynchronous replication and inter-phase synchronous replication. With intra-phase asynchronous replication, a transaction finishes its execution before writes are replicated (note that the result is not released to the client as discussed in Section 4). To ensure consistency, writes are replicated synchronously across different phases. For example, a replication fence is always used when the system switches to the next phase. Intra-phase asynchronous replication makes transactions have lower write latency, since unnecessary network round trips are eliminated. On the other hand, inter-phase synchronous replication allows STAR to avoid any inconsistency between replicas.
We now describe the details of our replication strategy, and how replication is done depending on the execution phase.
5.2. Value Replication vs. Operation Replication
As discussed in Section 4, STAR runs single-partition and cross-partition transactions in different phases. The system uses different replication schemes in these two phases: in the single-master phase, because a partition can be updated by multiple threads, records need to be fully-replicated to all replicas to ensure correct replication. However, in the partitioned phase, where a partition is only updated by a single thread, the system can use a better replication strategy based on replicating operations to improve performance.
To illustrate this, consider two transactions being run by two threads: T1: R1.A = R1.B + 1; R2.C = 0 and T2: R1.B = R1.A + 1; R2.C = 1. Suppose R1 and R2 are two records from different partitions and we are running in the single-master phase. In this case, because the writes are done by different threads, the order in which the writes arrive on replicas may be different from the order in which transactions commit on the primary. To ensure correctness, we employ the Thomas write rule (Thomas, 1979): tag each record with the last TID that wrote it, and apply a write only if the TID of the write is larger than the current TID of the record. Because TIDs of conflicting writes are guaranteed to be assigned in the serial-equivalent order of the writing transactions, this rule will guarantee correctness. However, for this rule to work, each write must include the values of all fields in the record, not just the updated fields. To see this, consider the example in the left side of Figure 5 (only R1 is shown). For record R1, if T1 only replicates A, and T2 only replicates B, and T2’s updates are applied before T1’s, transaction T1’s update to field A is lost, since the TID of T1 is less than the TID of T2 and the rule therefore discards T1’s write. Thus, when a partition can be updated by multiple threads, all fields of a record have to be replicated as shown in the middle of Figure 5 (note that fields that are always read-only do not need to be replicated).
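The lost-update problem for record R1 can be reproduced in a few lines. The sketch assumes that A and B both start at 0, that T1 commits with TID 1 and T2 with TID 2, and that a replica is a plain dictionary; `apply_write` is a hypothetical helper implementing the Thomas write rule.

```python
def apply_write(replica, tid, fields):
    """Thomas write rule: apply a write only if its TID is newer than the record's."""
    if tid > replica["tid"]:
        replica["fields"].update(fields)
        replica["tid"] = tid


# On the primary: T1 sets A = B + 1 = 1, then T2 sets B = A + 1 = 2.
# The correct final state of R1 is therefore A = 1, B = 2.

# Replicating only the updated fields, with T2's write arriving first:
partial = {"tid": 0, "fields": {"A": 0, "B": 0}}
apply_write(partial, 2, {"B": 2})           # applied, record TID becomes 2
apply_write(partial, 1, {"A": 1})           # discarded: 1 is not larger than 2
# partial["fields"] is now {"A": 0, "B": 2}; T1's update to A is lost.

# Replicating all fields of the record, in the same arrival order:
full = {"tid": 0, "fields": {"A": 0, "B": 0}}
apply_write(full, 2, {"A": 1, "B": 2})      # T2's write carries T1's effect on A
apply_write(full, 1, {"A": 1, "B": 0})      # discarded, and nothing is lost
# full["fields"] is now {"A": 1, "B": 2}, matching the primary.
```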
Now, suppose R1 and R2 are from the same partition, and we run the same transactions in the partitioned phase, where transactions are run by only a single thread on each partition. If T2 commits after T1, T1 is guaranteed to be ahead of T2 in the replication stream since they are executed by the same thread. For this reason, only the updated fields need to be replicated, i.e., T1 can just send the new value for A, and T2 can just send the new value for B as shown in the right side of Figure 5. Furthermore, in this case, the system can also choose to replicate the operation made to a field instead of the value of a field in a record. This can significantly reduce the amount of data that must be sent. For example, in the Payment transaction in TPC-C, a string is concatenated to a field with a 500-character string in the Customer table. With operation replication, the system only needs to replicate a short string and can re-compute the concatenated string on each replica, which is much less expensive than sending a 500-character string over the network. This optimization can result in order-of-magnitude reductions in replication cost.
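The difference between the two log-entry formats can be sketched as below. The entry layouts, the function names, the `c_data` field name, and the choice to prepend the new string and truncate to 500 characters are assumptions made for this illustration; the sketch relies on entries being applied in commit order, which the partitioned phase guarantees.

```python
MAX_LEN = 500  # assumed fixed width of the string field


def value_entry(key, record):
    """Value replication: ship every field of the record."""
    return ("value", key, dict(record))


def operation_entry(key, field, prefix):
    """Operation replication: ship only the short argument of the operation."""
    return ("prepend", key, field, prefix)


def apply_entry(replica, entry):
    if entry[0] == "value":
        _, key, record = entry
        replica[key] = dict(record)
    else:
        _, key, field, prefix = entry
        # Re-compute the concatenation locally instead of receiving the result.
        replica[key][field] = (prefix + replica[key][field])[:MAX_LEN]
```

With a 7-character payment note, the operation entry carries 7 characters of payload where the value entry carries the full 500-character field.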
Previous systems have advocated the use of command replication (Malviya et al., 2014; Vandiver et al., 2007), where parameters of stored procedures are sent from the primary to replicas. In this way, replicas replay all the transactions executed on the primary replica. As long as all the transactions run in the same order, the database on each replica is consistent. Such a scheme could be used on single-partition transactions just as with our operation replication above.
There are some weaknesses in systems based on command replication. First, it’s expensive to execute transactions based on a single serial order, since computing a global order requires workers to communicate with each other when running a cross-partition transaction. As the ratio of cross-partition transactions is increased, this will significantly impair throughput. In contrast, in the single-master phase, where we replicate entire records, we can commit transactions without explicitly computing a global order. Second, secondary replicas under command replication need the same resources to replay transactions executed on the primary replica, which gives STAR’s replication design no benefits, as nodes employing full replication have to do the same amount of work for every transaction in the partitioned phase.
In STAR, a hybrid replication strategy is used, i.e., the coordinator node uses value replication strategy in the single-master phase and all nodes use the operation replication strategy in the partitioned phase.
We now discuss the trade-offs that non-partitioned systems and partitioning-based systems achieve, and how STAR achieves the best of both worlds.
6.1. Non-partitioned Systems
A typical approach to build a fault tolerant non-partitioned system is to adopt the primary/backup model. A primary node runs transactions and replicates writes of committed transactions to one or more backup nodes. If the primary node fails, one of the backup nodes can take over instantaneously without loss of availability.
As we show in Figure 6, the writes of committed transactions can be replicated from the primary node to backup nodes either synchronously or asynchronously. With synchronous replication, the primary node holds the write locks when replicating writes of a committed transaction to backup nodes. With locks being held, the writes on backup nodes follow the same order as the primary node, which makes it possible to apply some replication optimizations (e.g., operation replication) as discussed in Section 5.2. A transaction releases the write locks and commits as soon as the writes are replicated (low commit latency). However, round trip communication is needed even for single-partition transactions (high write latency).
An alternative approach is to replicate writes of committed transactions asynchronously. Without write locks being held on the primary node, the writes may be applied in any order on backup nodes. Value replication must be used with the Thomas write rule (Thomas, 1979). To address the potential inconsistency issue when a fault occurs, a group commit (high commit latency) must be used as well. The group commit serves as a barrier that guarantees all writes are replicated when transactions commit. Asynchronous replication reduces the amount of time that a transaction holds write locks during replication (low write latency) but incurs high commit latency for all transactions.
Non-partitioned systems are not sensitive to workloads (e.g., the ratio of cross-partition transactions), but they cannot easily scale out. The CPU resources on backup nodes are often under-utilized, using more hardware to provide a lower overall throughput.
6.2. Partitioning-based Systems
In partitioning-based systems, the database is partitioned in a way such that each node owns one or more partitions. Each transaction has access to one or more partitions and commits with distributed protocols (e.g., two-phase locking) and 2PC. This approach is a good fit for workloads that have a natural partitioning as the database can be treated as many disjoint sub-databases. However, cross-partition transactions are frequent in real-world scenarios. For example, in the standard mix of TPC-C, 10% of NewOrder and 15% of Payment are cross-partition transactions. Many other applications, especially those that involve groups of communicating users such as social networks, are fundamentally hard to partition, as any partition of users may need to communicate at some point with users in another partition.
The same primary/backup model can be utilized to make partitioning-based systems fault tolerant. The write latency of partitioning-based systems with synchronous replication is the same as that of non-partitioned systems. With asynchronous replication, the write latency depends on the number of remote writes.
Partitioning-based systems scale out to multiple nodes at the cost of a high sensitivity to workloads. For example, if all transactions are single-partition transactions, partitioning-based systems are able to achieve linear scalability. However, even with a small fraction of cross-partition transactions, partitioning-based systems suffer from high round trip communication cost such as remote reads and distributed commit protocols.
6.3. Achieving the Best of Both Worlds
The phase switching algorithm enables STAR to achieve the best of both worlds. The system dynamically switches between the partitioned phase and the single-master phase. The iteration time can be set the same as the length of a group commit in non-partitioned systems or partitioning-based systems.
STAR has a low sensitivity to workloads in the sense that the ratio of cross-partition transactions only determines how many transactions are run on the node employing full replication. All single-partition transactions are run on nodes with partial replicas, which makes the system utilize more CPU resources from multiple nodes. The write latency is also low since all cross-partition transactions are run on a node employing full replication. In addition, writes of committed transactions are replicated asynchronously.
Workloads consisting entirely of single-partition transactions or entirely of cross-partition transactions are the two extreme scenarios. In the former scenario, STAR works almost as well as a partitioning-based system, with transactions running on each node. In the latter scenario, STAR works the same as a non-partitioned system, with all transactions running on the node employing full replication. This is consistent with our vision as shown in Figure 1, and we verify this vision with experiments in Section 8.2.
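The division of work between the two phases can be sketched as follows. This is a simplified illustration under stated assumptions: a transaction is modeled as a tuple of the partition ids it accesses, and `schedule`, `partition_master`, and the node names are hypothetical.

```python
def schedule(txns, partition_master):
    """Split a batch into work for the partitioned and single-master phases.

    txns: iterable of tuples, each listing the partition ids a transaction touches.
    partition_master: dict mapping partition id -> node mastering it in the
                      partitioned phase (an assumed, illustrative mapping).
    """
    partitioned = {}     # node -> single-partition transactions it runs
    single_master = []   # cross-partition transactions for the full replica
    for txn in txns:
        parts = set(txn)
        if len(parts) == 1:
            (partition,) = parts
            partitioned.setdefault(partition_master[partition], []).append(txn)
        else:
            single_master.append(txn)
    return partitioned, single_master
```

With no cross-partition transactions the second list is empty and all nodes work in parallel. With only cross-partition transactions the first mapping is empty and the node with the full replica runs everything, which matches the two extreme scenarios described above.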
7. Fault Tolerance and Recovery
In this section, we describe how STAR uses replication to tolerate faults and performs recovery on failed nodes, without imposing any system-wide down time. We also provide a brief argument that our algorithm is correct and a discussion on durability.
7.1. Fault Tolerance
Before introducing how STAR tolerates faults, we first give some definitions and assumptions about failures. An active node is a node that can accept a client’s requests, run transactions and replicate writes to other nodes. A failed node is one on which a failure has occurred (e.g., the process of a STAR instance has crashed). In this paper, we assume fail-stop failures and no network partitions, where any active node in the system can detect when a particular node has failed. A node is considered failed when it does not respond to messages sent from other nodes (DeCandia et al., 2007).
As discussed earlier, writes of committed transactions are propagated to replicas. To avoid round trip communication, transaction execution is decoupled from replication. In other words, within the partitioned phase or the single-master phase, writes are propagated asynchronously. One well-known issue is the potential data inconsistency when a failure occurs. For correctness, STAR cannot tell a client that a transaction is committed before the system switches to the next phase. This is because the system only guarantees that all writes are replicated to active replicas when the phase is switched from one to another.
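The rule that a result is released only after the next phase switch can be sketched as a small buffer. The class and method names are hypothetical, and the sketch assumes that the caller invokes `phase_switch` only after all writes of the finished phase have reached the active replicas.

```python
class ResultBuffer:
    """Holds client replies until the phase in which they committed ends."""

    def __init__(self):
        self.pending = []        # committed locally, writes possibly in flight
        self.acknowledged = []   # results that may be reported to clients

    def commit(self, txn_id):
        # Execution is decoupled from replication: the transaction finishes
        # locally, but the client is not told yet.
        self.pending.append(txn_id)

    def phase_switch(self):
        # Assumed precondition: replication of the finished phase is complete.
        released, self.pending = self.pending, []
        self.acknowledged.extend(released)
        return released
```

The design trades commit latency, which is bounded by the iteration time, for the absence of a round trip on every transaction.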
Our current implementation re-masters the primary records of failed nodes over active nodes instantaneously when a failure occurs. For example, if a node employing partial replication fails, the primary records of the failed node can be re-mastered to a node employing full replication. This instant fail-over allows the system to continue to serve requests from clients without taking any down time.
The system can tolerate failures simultaneously in a cluster of nodes, where nodes employ full replication and nodes employ partial replication. This is because nodes have a copy of every record.
We now discuss how STAR recovers when a fault occurs. The database maintains two versions of each record. One is the most recent version prior to the current phase and the other one is the latest version written in the current phase. When a fault occurs, the system enters recovery mode and ignores all data versions written in the current phase, since they have not been committed by the database.
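A minimal sketch of the two-version scheme follows. The class `VersionedRecord` and its method names are hypothetical; the sketch only illustrates how versions written in an unfinished phase can be ignored on failure.

```python
class VersionedRecord:
    """Record with a stable version and an optional in-phase version."""

    def __init__(self, value):
        self.stable = value    # most recent version prior to the current phase
        self.current = None    # latest version written in the current phase
        self.dirty = False     # True once the record is written in this phase

    def write(self, value):
        self.current = value
        self.dirty = True

    def read(self):
        return self.current if self.dirty else self.stable

    def end_phase(self):
        # Phase switch: in-phase writes are now replicated and become stable.
        if self.dirty:
            self.stable = self.current
        self.current, self.dirty = None, False

    def revert(self):
        # Failure: ignore the version written in the unfinished phase.
        self.current, self.dirty = None, False
```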
A failed node copies data from remote nodes, corrects its database and catches up to the current state of other replicas using the writes from the replication stream in parallel with the Thomas write rule (Thomas, 1979) as discussed in Section 5.2. We restrict the discussion here to the case where only one node fails for ease of presentation, but the system can recover when up to nodes fail simultaneously.
In recovery mode, STAR uses only value replication with which the system can support transaction execution and parallel recovery simultaneously. If the coordinator node crashes, the next available node with a full copy of the database becomes the coordinator node. If no active node has a full copy of the database, the system falls back to a mode in which a distributed concurrency control algorithm is employed, as in partitioning-based systems. Once the failed node restarts, each node that does not fail iterates through the database and sends a portion of its database to the restarted node. Meanwhile, the restarted node also receives writes of committed transactions. When parallel recovery finishes, the system goes back to the normal mode where the partitioned phase and the single-master phase alternate.
We show our recovery strategy is correct by showing two properties that STAR maintains.
Once a node receives a request from a client, the system does not reply to the client until the system finishes the current phase in which the transaction commits. For this reason, all updates made by committed transactions have at least copies when the result is released to the client.
The database recovered directly from other nodes may miss some updates made by concurrent transactions during recovery. However, all updates are available in the replication stream. The restarted node can correct its database by applying the writes from the replication stream to its database with the Thomas write rule (Thomas, 1979).
A disadvantage of the way that STAR handles failure is that it cannot tolerate correlated failures of nodes, e.g., a power outage. Similar limitations exist in other systems as well (e.g., H-Store (Stonebraker et al., 2007)). Additional disk-based recovery (which will incur downtime) can be added to the system without incurring a significant performance penalty (Zheng et al., 2014). For example, the system may log the writes of committed transactions to disk along with some metadata (e.g., the information about the current phase). Periodic checkpointing may also be conducted to bound the recovery time.
8. Evaluation
In this section, we evaluate the performance of STAR focusing on the following key questions:
- How does STAR perform compared to non-partitioned systems and partitioning-based systems?
- How does the phase switching algorithm affect the throughput of STAR and what’s the overhead?
- How effective is hybrid replication compared to value replication?
- How does network latency affect STAR and how does the phase switching algorithm help?
8.1. Experimental Setup
We ran our experiments on a cluster of 4 machines running 64-bit Ubuntu 12.04 with Linux kernel 3.2.0-23. Each machine has four 8-core 2.13 GHz Intel(R) Xeon(R) E7-4830 CPUs and 256 GB of DRAM. The machines are connected with a 10 GigE network. We implemented STAR and other distributed concurrency control algorithms in C++ and ran experiments without networked clients. The system is compiled using GCC 5.4.1 with -O2 option enabled.
In our experiments, we run 24 worker threads on each node, yielding a total of 96 worker threads. Each node also has 2 threads for network communication. We made the number of partitions equal to the total number of worker threads. All results are the average of three runs. We ran transactions at the serializability isolation level.
To study the performance of STAR, we ran a number of experiments using two popular benchmarks:
YCSB: The Yahoo! Cloud Serving Benchmark (YCSB) is a simple transactional workload designed to facilitate performance comparisons of database and key-value systems (Cooper et al., 2010). It has a single table with 10 columns. The primary key of each record is a 64-bit integer and each column consists of 10 random bytes. A transaction consists of 4 read/write operations in this benchmark and each operation accesses a random record following a uniform distribution. We set the number of records to 200K per partition, and we run a workload mix of 90/10, i.e., each access in a transaction has a 90% probability of being a read operation and a 10% probability of being a read/write operation.
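The YCSB transaction shape described above can be sketched as a small generator. The function name `make_txn` and the way a cross-partition transaction is formed (redirecting the last access to another partition) are assumptions made for illustration; the text does not specify how remote accesses are chosen.

```python
import random

RECORDS_PER_PARTITION = 200_000   # 200K records per partition
OPS_PER_TXN = 4                   # 4 read/write operations per transaction
READ_RATIO = 0.9                  # 90/10 workload mix


def make_txn(rng, home, n_partitions, cross_ratio):
    """Return a list of (kind, partition, key) accesses for one transaction."""
    cross = n_partitions > 1 and rng.random() < cross_ratio
    ops = []
    for i in range(OPS_PER_TXN):
        partition = home
        if cross and i == OPS_PER_TXN - 1:
            # Assumption: one access of a cross-partition transaction is remote.
            partition = rng.choice([p for p in range(n_partitions) if p != home])
        key = rng.randrange(RECORDS_PER_PARTITION)   # uniform key choice
        kind = "read" if rng.random() < READ_RATIO else "read_write"
        ops.append((kind, partition, key))
    return ops
```

For example, `make_txn(random.Random(0), home=3, n_partitions=96, cross_ratio=0.1)` yields four accesses, all on partition 3 unless the transaction is drawn as cross-partition.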
TPC-C: The TPC-C benchmark (Revision 5.11, http://www.tpc.org/tpcc/) is the gold standard for evaluating OLTP databases. It models a warehouse order processing system, which simulates the activities found in complex OLTP applications. It has nine tables and we partition all the tables by Warehouse ID. We support two transactions in TPC-C, namely, (1) NewOrder and (2) Payment. 88% of the standard TPC-C mix consists of these two transactions. The other three transactions require range scans, which are currently not supported in our system.
8.1.2. Concurrency Control Algorithms
We implemented each of the following concurrency control algorithms in C++ in our framework:
STAR: Our algorithm as discussed in Section 4. We set the iteration time of a phase switch to 10 ms. To have a fair comparison to other algorithms, the hybrid replication optimization is only enabled in Section 8.5.
PB. OCC: This is a variant of Silo’s OCC protocol (Tu et al., 2013) adapted for a primary/backup setting. The primary node runs all transactions and replicates the writes to the backup node. Only two machines are used in this setting.
Dist. OCC: The above protocol adapted for a distributed and replicated setting. Multiple rounds of communication are needed in both the execution phase and the commit phase, e.g., for validating the read set. When a transaction commits, we use the NO_WAIT policy to avoid deadlocks, i.e., a transaction aborts if it fails to acquire a lock on some record in its write set.
Dist. S2PL: This is a distributed strict two-phase locking protocol. As a worker runs a transaction, it first acquires the transaction’s read and write locks. To avoid deadlocks, the same NO_WAIT policy is adopted. The worker next updates each record and replicates the writes to secondary replicas. Finally, the worker releases all acquired locks.
In our experiments, PB. OCC is a non-partitioned system, and Dist. OCC and Dist. S2PL are considered as partitioning-based systems. We do not report the results on PB. S2PL, since it always performs worse than PB. OCC (Yu et al., 2016).
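The NO_WAIT policy used by Dist. OCC and Dist. S2PL can be illustrated with a single-threaded sketch. `LockTable` and `lock_write_set` are hypothetical names, and a real implementation would use atomic operations on per-record lock words rather than a dictionary.

```python
class LockTable:
    """Maps each locked key to the transaction holding its write lock."""

    def __init__(self):
        self.owner = {}

    def try_lock(self, txn, key):
        # Succeeds if the key is free or already held by the same transaction.
        holder = self.owner.setdefault(key, txn)
        return holder == txn

    def release_all(self, txn):
        for key in [k for k, t in self.owner.items() if t == txn]:
            del self.owner[key]


def lock_write_set(table, txn, write_set):
    """Return True if all locks were acquired; otherwise abort and return False."""
    for key in sorted(write_set):
        if not table.try_lock(txn, key):
            # NO_WAIT: never block on a held lock; drop everything and abort.
            table.release_all(txn)
            return False
    return True
```

Because a transaction never waits while holding locks, no cycle of waiting transactions can form, at the cost of aborts under contention.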
8.1.3. Partitioning and Replication Configuration
In our experiment, we set the number of replicas of each partition to 2. Each partition is assigned to a node by a hash function. The primary partition and secondary partition are always hashed to two different nodes.
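One way to guarantee that the primary and secondary replicas of a partition land on different nodes is sketched below. The modulo-based `place` function is a stand-in chosen for illustration; the actual hash function is not specified in the text.

```python
def place(partition, n_nodes):
    """Return (primary_node, secondary_node) for a partition id."""
    if n_nodes < 2:
        raise ValueError("two replicas need at least two nodes")
    primary = partition % n_nodes
    # An offset in 1..n_nodes-1 guarantees the secondary differs from the primary.
    offset = 1 + (partition // n_nodes) % (n_nodes - 1)
    secondary = (primary + offset) % n_nodes
    return primary, secondary
```

With 96 partitions on 4 nodes, this placement makes each node the primary for 24 partitions.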
In STAR, we have 1 node with full replication and 3 nodes with partial replication. Each node masters a different portion of the database, as shown in Figure 2.
We consider two variations of PB. OCC, Dist. OCC, and Dist. S2PL: (1) Synchronous replication: all writes are synchronously replicated and a transaction commits as soon as its write set is replicated. (2) Asynchronous replication and group commit: all writes are asynchronously replicated and transactions commit with a group commit.
In the variation of synchronous replication, Silo’s epoch-based group commit is not needed in PB. OCC, since the replication order on the backup node follows the commit order on the primary node, which holds all write locks during replication. In Dist. OCC and Dist. S2PL, the write locks on primary partitions must be held until the writes are replicated to the secondary partitions. With synchronous replication, every transaction therefore holds its write locks for the duration of the round trip needed for replication.
In the variation of asynchronous replication and group commit, the locks can be released as soon as the writes on the primary partition are finished. The writes are replicated asynchronously to secondary partitions, on which the Thomas write rule (Thomas, 1979) is applied. In our experiment, PB. OCC, Dist. OCC, and Dist. S2PL commit transactions with a group commit that happens every 10 ms.
8.2. Performance Comparison
We now compare STAR with a non-partitioned system and two partitioning-based systems using both YCSB and TPC-C workloads.
8.2.1. Results of asynchronous replication and group commit
We first study the performance of STAR compared to other approaches with asynchronous replication and group commit. We ran a YCSB workload and both NewOrder and Payment transactions from TPC-C with a varying ratio of cross-partition transactions.
Figure 7 shows the throughput of each approach on each transaction. When there are no cross-partition transactions, STAR has a throughput similar to that of Dist. OCC and Dist. S2PL on the YCSB workload and the Payment transaction. This is because the workload is embarrassingly parallel. With asynchronous replication and group commit, transactions do not need to hold locks for a round trip. For the NewOrder transaction, the throughput of STAR is 83.4% of the throughput of Dist. OCC and Dist. S2PL. This is because it is more expensive to replicate the writes of NewOrder transactions, since the coordinator node in STAR serves as the backup for all the other nodes.
As we increase the ratio of cross-partition transactions, the throughput of PB. OCC stays almost the same. The throughput of other approaches drops. When 10% of transactions are cross-partition transactions, STAR starts to outperform Dist. OCC and Dist. S2PL. For example, STAR has a 1.5x higher throughput than Dist. S2PL on the NewOrder transaction. As more cross-partition transactions are present, the throughput of Dist. OCC and Dist. S2PL is significantly lower than that of STAR (and also lower than that of PB. OCC), and the throughput of STAR approaches the throughput of PB. OCC. This is because STAR behaves similarly to a non-partitioned system when all transactions are cross-partition transactions.
As a result, we believe that STAR is a good fit for workloads with both single-partition and cross-partition transactions and it can outperform both non-partitioned and partitioning-based systems, as we envisioned in Figure 1.
8.2.2. Results of synchronous replication
We next study the performance of STAR compared to other approaches with synchronous replication. We ran the same workload as in Section 8.2.1 with a varying ratio of cross-partition transactions and report the results in Figure 8.
When there are no cross-partition transactions, the workload is embarrassingly parallel. However, PB. OCC, Dist. OCC, and Dist. S2PL all have a much lower throughput than STAR in this scenario. This is because even single-partition transactions need to hold locks for a round trip due to synchronous replication, as discussed in Section 8.1.3. As we increase the ratio of cross-partition transactions, we observe that the throughput of PB. OCC stays almost the same, since a non-partitioned system is not sensitive to cross-partition transactions. The throughput of STAR drops because more transactions are run on the coordinator node. For Dist. OCC and Dist. S2PL, more transactions need to read from remote nodes during the execution phase. They also need multiple rounds of communication to validate and commit transactions (2PC). Overall, STAR has much higher throughput than other approaches: at least 4.2x higher throughput on YCSB, 3.8x higher throughput on NewOrder transactions, and 6.7x higher throughput on Payment transactions.
Overall, the throughput of PB. OCC, Dist. OCC and Dist. S2PL is much lower than that with asynchronous replication due to the overhead of network round trips for every transaction.
8.2.3. Latency of each approach
We now study the latency of each approach with both synchronous and asynchronous replication. We ran the same workload as in Section 8.2.1 and Section 8.2.2, but only report the latency at the 50th percentile and the 99th percentile in Figure 9, when the ratio of cross-partition transactions is 10%, 50% and 90%.
We first discuss the latency of each approach with synchronous replication. We observe that PB. OCC’s latency at the 50th percentile is always less than 0.05 ms (shown as 0.0). PB. OCC’s latency at the 99th percentile is also not sensitive to the ratio of cross-partition transactions, for the same reason as discussed above. Dist. OCC and Dist. S2PL have a higher latency at both the 50th percentile and the 99th percentile as we increase the ratio of cross-partition transactions. This is because there are more remote reads and the commit protocols they use need multiple rounds of communication.
In STAR, the iteration time determines the latency of transactions. Similarly, the latency of transactions in Dist. OCC and Dist. S2PL with asynchronous replication also depends on the frequency of a group commit. For this reason, STAR has a similar latency at the 50th percentile and the 99th percentile to other approaches with asynchronous replication. In Figure 9, we only report the results on YCSB with 10% of cross-partition transactions. Results on other workloads are not reported, since they are all similar to one another.
8.3. Comparison with Calvin
We next compare STAR with Calvin (Thomson et al., 2012), which is a deterministic concurrency control and replication algorithm. In Calvin, a central sequencer determines the order for a batch of transactions before they start execution. The transactions are then sent to all the replica groups (a replica group is a set of nodes containing a replica of the database) to execute deterministically, namely, all replica groups will produce the same results for the same batch of transactions. On each node, a single-threaded lock manager grants locks to multiple execution threads following the deterministic order. As a result, Calvin does not perform replication at the end of each transaction. Instead, it replicates inputs at the beginning of the batch of transactions and deterministically executes the batch across replica groups.
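The deterministic lock granting described above can be sketched as follows. This is a simplified, single-process illustration and not Calvin's actual code: `deterministic_schedule` is a hypothetical name, all locks are treated as exclusive, and each transaction is assumed to list each key at most once.

```python
from collections import defaultdict, deque


def deterministic_schedule(batch):
    """Group a sequenced batch into waves of transactions that can run together.

    batch: list of (txn_id, keys) pairs in the order fixed by the sequencer.
    """
    queues = defaultdict(deque)
    # Single-threaded pass: lock requests are enqueued in sequence order.
    for txn_id, keys in batch:
        for key in keys:
            queues[key].append(txn_id)

    remaining = dict(batch)
    waves = []
    while remaining:
        # A transaction may run once it is at the head of every queue it is in.
        ready = [t for t, keys in remaining.items()
                 if all(queues[k][0] == t for k in keys)]
        for t in ready:
            for k in remaining[t]:
                queues[k].popleft()   # release the locks after execution
            del remaining[t]
        waves.append(ready)
    return waves
```

Since the result depends only on the input order, every replica group that receives the same batch derives the same schedule. The enqueue pass is sequential, which is consistent with the lock manager being a potential bottleneck.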
In this experiment, we consider three different configurations for Calvin, namely, (1) Calvin-1:3: There are two replica groups, with the first one having 1 node and the second one having 3 nodes. This replication configuration is similar to STAR, (2) Calvin-2:2: There are two replica groups with each one having 2 nodes. This replication configuration is similar to Dist. OCC and Dist. S2PL, (3) Calvin-4: There is only one replica group with 4 nodes. We implemented the Calvin algorithm in C++ in our framework as well to have a fair comparison.
The speedup of STAR’s throughput is measured relative to Calvin with different configurations. We ran the NewOrder transaction from the TPC-C benchmark with a varying ratio of cross-partition transactions and report the result in Figure 10. As we increase the ratio of cross-partition transactions, more remote reads happen within a replica group in Calvin. Meanwhile, more transactions are run on the coordinator node in STAR, which lowers the system's throughput. From Figure 10, we can observe that STAR has a smaller relative speedup as we increase the ratio of cross-partition transactions. For example, when no cross-partition transactions are present, STAR is 9x faster than Calvin-4. When all transactions are cross-partition transactions, the relative speedup goes down to 5.3x.
Overall, Calvin is less sensitive to the ratio of cross-partition transactions. However, the performance it achieves is much lower than that of STAR. This is because the single-threaded lock manager of Calvin is the bottleneck, which prevents it from fully utilizing all 24 worker threads.
8.4. The Overhead of Phase Transitions
We now study how the iteration time of a phase switch affects the overall throughput of STAR and the overhead due to this phase switching algorithm. We set the ratio of cross-partition transactions to 10% in this experiment and report the result on the NewOrder transaction from TPC-C in Figure 11. Similar results were obtained on other workloads but are not reported due to space limitations.
In each iteration, there are two phase transitions in STAR (from the partitioned phase to the single-master phase, and vice versa). We varied the iteration time of the phase switching algorithm from 1.5 ms to 150 ms, and report the system’s throughput and overhead in Figure 11. The overhead is measured as the system’s performance regression compared to the one running with a 150 ms iteration time. Increasing the iteration time decreases the overhead of the phase switching algorithm as expected, since less time is spent on synchronization. For example, when the iteration time is 1.5 ms, the overhead is as high as 48.7% and the system achieves only around half of its maximum throughput (i.e., the throughput achieved with a 150 ms iteration time). As we increase the iteration time, the system’s throughput goes up. The throughput levels off when the iteration time is larger than 20 ms.
In all experiments in this paper, we set the iteration time to 10 ms. With this setting, the system can achieve 94% of its maximum throughput and have a good balance between throughput and latency.
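The overhead metric used in this experiment is a relative regression against the 150 ms configuration. The helper below is a hypothetical illustration of that definition; the throughput values passed to it are normalized percentages taken from the figures quoted in the text, not additional measurements.

```python
def overhead(throughput, reference_throughput):
    """Fraction of the reference throughput lost at a given iteration time."""
    return 1.0 - throughput / reference_throughput
```

Under this definition, achieving 94% of the maximum throughput at 10 ms corresponds to an overhead of about 6%, and an overhead of 48.7% at 1.5 ms corresponds to 51.3% of the maximum throughput.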
8.5. Replication Strategies
We next study how effective hybrid replication is compared to value replication. In this experiment, we ran the Payment transaction from the TPC-C benchmark. We do not report the performance on the NewOrder transaction and the YCSB benchmark, since they update the whole record and hybrid replication is the same as value replication.
We varied the ratio of cross-partition transactions from 0% to 100%, and report the average replication cost in bytes per transaction and the overall throughput in Figure 12. With value replication, the cost to replicate the writes of a transaction stays the same as we increase the ratio of cross-partition transactions. With hybrid replication, the cost increases from 100 bytes to 600 bytes per transaction, as shown in Figure 12(a).
We now study how replication strategies affect the system’s throughput. Figure 12(b) shows the system’s throughput with both replication strategies. When there are no cross-partition transactions, STAR with hybrid replication achieves 2.9x higher throughput, since the cost of hybrid replication is only 16.8% of that of value replication. As we increase the ratio of cross-partition transactions, the performance gap becomes smaller, since more transactions run on the coordinator and are replicated with value replication. For example, when 50% of transactions are cross-partition transactions, STAR with hybrid replication achieves 50% higher throughput than the one with value replication.
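The trend in replication cost can be approximated with a simple mixing model. The sketch assumes that the average cost is a linear blend of two per-transaction costs, using the approximate endpoints quoted above (100 bytes with no cross-partition transactions, 600 bytes when all transactions are cross-partition). The linear form and the name `hybrid_cost` are assumptions for illustration, not measured results.

```python
OPERATION_BYTES = 100   # approx. cost when no transaction is cross-partition
VALUE_BYTES = 600       # approx. cost when all transactions are cross-partition


def hybrid_cost(cross_ratio):
    """Estimated average replication cost in bytes per transaction."""
    if not 0.0 <= cross_ratio <= 1.0:
        raise ValueError("cross_ratio must be in [0, 1]")
    # Single-partition transactions ship operations; cross-partition
    # transactions run on the coordinator and ship full values.
    return (1.0 - cross_ratio) * OPERATION_BYTES + cross_ratio * VALUE_BYTES
```

The model reproduces the qualitative behavior: the advantage of hybrid replication shrinks as more transactions are replicated by value.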
8.6. Network Latency
The performance of a distributed concurrency control algorithm is usually sensitive to the network latency, since a cross-partition transaction may access a remote partition during the execution phase. Furthermore, multiple rounds of communication (e.g., to validate a transaction with an OCC algorithm) and/or a two-phase commit are also necessary to guarantee ACID properties.
In a non-partitioned system with group commit, round trip communication only happens when the system commits a group of transactions, making it less sensitive to network delay. Similarly, STAR only waits for round trip communication when the system switches from one phase to another. However, both Dist. OCC and Dist. S2PL require waiting for multiple round trips during the execution phase and commit phase.
In this experiment, we ran both NewOrder and Payment transactions. On average, each NewOrder transaction is followed by one Payment transaction. PB. OCC, Dist. OCC, and Dist. S2PL use a group commit that happens every 40 ms. We set the iteration time in STAR to 40 ms as well. We added an artificial network delay between all nodes when a message is sent. In real scenarios, a distributed database may be deployed within the same rack, across racks, in the same city, or across several cities. To model these scenarios, we varied the network delay from 0.1 ms to 10 ms. We only report the overall throughput in Figure 13, since all approaches have similar latency with group commit.
When we increase the delay from 0.1 ms to 0.5 ms, we can observe that STAR has almost the same throughput and behaves the same way as a non-partitioned system (e.g., PB. OCC). This is because the overhead due to the phase switching algorithm is still relatively small. All other algorithms have lower overall throughput as we increase the delay. For example, the throughput of Dist. OCC with a 1 ms network delay is only 21.3% of its original throughput. As we further increase the delay, we observe that the throughput of both STAR and PB. OCC starts to decrease, as several round trips with a large network delay are not negligible relative to an iteration time of 40 ms. When the delay is 10 ms, the throughput of STAR drops to 32.6% of its original throughput (PB. OCC shows a similar performance regression), but it is still 200x faster than partitioning-based systems that employ distributed concurrency control algorithms (e.g., Dist. OCC and Dist. S2PL).
9. Related Work
STAR builds on a number of pieces of related work for its design, including in-memory transaction processing, replication and durability.
In-memory Transaction Processing. Modern fast in-memory databases have long been an active area of research (Larson et al., 2011; Neumann et al., 2015; Stonebraker et al., 2007; Tu et al., 2013; Yu et al., 2016). In H-Store (Stonebraker et al., 2007), transactions local to a single partition are executed by a single thread associated with the partition. This extreme form of partitioning makes single-partition transactions very fast but creates significant contention for cross-partition transactions, where whole-partition locks are held. Silo (Tu et al., 2013) divides time into a series of short epochs and each thread generates its own timestamp by embedding the global epoch to avoid shared-memory writes, avoiding contention for a global critical section. Because of its high throughput and simple design, we adopted the Silo architecture for STAR, reimplementing it and adding our phase-switching protocol and replication. Doppel (Narula et al., 2014) executes highly contentious transactions in a separate phase from other regular transactions such that special optimizations (i.e., commutativity) can be applied to improve scalability.
F1 (Shute et al., 2013) is an OCC protocol built on top of Google’s Spanner (Corbett et al., 2012). MaaT (Mahmoud et al., 2014) reduces transaction conflicts with dynamic timestamp ranges. ROCOCO (Mu et al., 2014) tracks conflicting constituent pieces of transactions and re-orders them in a serializable order before execution. To reduce the conflicts of distributed transactions, STAR runs all cross-partition transactions on a single machine in the single-master phase. Clay (Serafini et al., 2016) improves data locality to reduce the number of distributed transactions in a distributed OLTP system by smartly partitioning and migrating data across the servers. Some previous work (Lin et al., 2016; Das et al., 2010; Chairunnanda et al., 2014) proposed to move the master node of a tuple dynamically, in order to convert distributed transactions into local transactions. Unlike STAR, however, moving the mastership still requires network communication. FaRM (Dragojevic et al., 2014), FaSST (Kalia et al., 2016) and DrTM (Wei et al., 2015) improve the performance of a distributed OLTP database by exploiting RDMA. STAR can use RDMA to further decrease the overhead of replication and the phase switching algorithm as well.
Replicated Systems. Replication is the way in which database systems achieve high availability. Eager replication was popularized by systems like Postgres-R (Kemme and Alonso, 2000) and Galera Cluster (Galera Cluster, 2017), which showed how to make eager replication practical using group communication and deferred propagation of writes. Calvin (Thomson et al., 2012) replicates transaction requests amongst replica groups and assigns a global order to each transaction for deterministic execution (Thomson and Abadi, 2010), allowing it to eliminate expensive cross-partition coordination. However, cross-node communication is still necessary during transaction execution because of remote reads. Also, a single-threaded lock manager significantly impairs the system’s performance. Mencius (Mao et al., 2008) is a state machine replication method that improves Paxos to achieve high throughput under high client load and low latency under low client load by partitioning sequence numbers, even under changing wide-area network environments. In STAR, neither a global order nor group communication is necessary, even for cross-partition transactions, since we run these cross-partition transactions in parallel on a single node.
Recoverable Systems. H-Store (Malviya et al., 2014) uses transaction-level logging. It periodically checkpoints a transactionally consistent snapshot to disk and logs all the parameters of stored procedures. H-Store executes transactions following a global order and replays all the transactions in the same order during recovery. SiloR (Zheng et al., 2014) uses a multi-threaded parallel value logging scheme that supports parallel replay in non-partitioned databases. In contrast, transaction-level logging requires that transactions be replayed in the same order. In STAR, different replication strategies, including both SiloR-like parallel value replication and H-Store-like operational logging are used in different phases, significantly reducing bandwidth requirements.
10. Conclusion
In this paper, we presented STAR, a new distributed and replicated in-memory database. STAR employs a new phase-switching scheme where single-partition transactions are run on multiple machines in parallel, and cross-partition transactions are run on a single machine by re-mastering records on the fly, allowing us to avoid cross-node communication and the use of distributed commit protocols like 2PC for distributed transactions. Our results on TPC-C and YCSB show that STAR is able to dramatically exceed the performance of systems that employ conventional concurrency control and replication algorithms by up to one order of magnitude.
- Bernstein et al. (1987) Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley.
- Chairunnanda et al. (2014) Prima Chairunnanda, Khuzaima Daudjee, and M. Tamer Özsu. 2014. ConfluxDB: Multi-Master Replication for Partitioned Snapshot Isolation Databases. PVLDB 7, 11 (2014), 947–958.
- Cooper et al. (2010) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In SoCC. 143–154.
- Corbett et al. (2012) James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s Globally-Distributed Database. In OSDI. 261–264.
- Das et al. (2010) Sudipto Das, Divyakant Agrawal, and Amr El Abbadi. 2010. G-Store: a scalable data store for transactional multi key access in the cloud. In SoCC. 163–174.
- DeCandia et al. (2007) Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s highly available key-value store. In SOSP. 205–220.
- Dragojevic et al. (2014) Aleksandar Dragojevic, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. FaRM: Fast Remote Memory. In NSDI. 401–414.
- Eswaran et al. (1976) Kapali P. Eswaran, Jim Gray, Raymond A. Lorie, and Irving L. Traiger. 1976. The Notions of Consistency and Predicate Locks in a Database System. Commun. ACM 19, 11 (1976), 624–633.
- Galera Cluster (2017) Galera Cluster. 2017. http://galeracluster.com/products/technology/.
- Harding et al. (2017) Rachael Harding, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. An Evaluation of Distributed Concurrency Control. PVLDB 10, 5 (2017), 553–564.
- Kalia et al. (2016) Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. In OSDI. 185–201.
- Kemme and Alonso (2000) Bettina Kemme and Gustavo Alonso. 2000. Don’t Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication. In VLDB. 134–143.
- Kung and Robinson (1981) H. T. Kung and John T. Robinson. 1981. On Optimistic Methods for Concurrency Control. ACM Trans. Database Syst. 6, 2 (1981), 213–226.
- Larson et al. (2011) Per-Åke Larson, Spyros Blanas, Cristian Diaconu, Craig Freedman, Jignesh M. Patel, and Mike Zwilling. 2011. High-Performance Concurrency Control Mechanisms for Main-Memory Databases. PVLDB 5, 4 (2011), 298–309.
- Lin et al. (2016) Qian Lin, Pengfei Chang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, and Zhengkui Wang. 2016. Towards a Non-2PC Transaction Management in Distributed Database Systems. In SIGMOD Conference. 1659–1674.
- Mahmoud et al. (2014) Hatem A. Mahmoud, Vaibhav Arora, Faisal Nawab, Divyakant Agrawal, and Amr El Abbadi. 2014. MaaT: Effective and scalable coordination of distributed transactions in the cloud. PVLDB 7, 5 (2014), 329–340.
- Malviya et al. (2014) Nirmesh Malviya, Ariel Weisberg, Samuel Madden, and Michael Stonebraker. 2014. Rethinking main memory OLTP recovery. In ICDE. 604–615.
- Mao et al. (2008) Yanhua Mao, Flavio Paiva Junqueira, and Keith Marzullo. 2008. Mencius: Building Efficient Replicated State Machine for WANs. In OSDI. 369–384.
- Mao et al. (2012) Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In EuroSys. 183–196.
- McInnis (2003) Dale McInnis. 2003. The Basics of DB2 Log Shipping. https://www.ibm.com/developerworks/data/library/techarticle/0304mcinnis/0304mcinnis.html.
- Microsoft (2016) Microsoft. 2016. About Log Shipping (SQL Server). https://msdn.microsoft.com/en-us/library/ms187103.aspx.
- Mohan et al. (1986) C. Mohan, Bruce G. Lindsay, and Ron Obermarck. 1986. Transaction Management in the R* Distributed Database Management System. ACM Trans. Database Syst. 11, 4 (1986), 378–396.
- Mu et al. (2014) Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. 2014. Extracting More Concurrency from Distributed Transactions. In OSDI. 479–494.
- MySQL (2018) MySQL. 2018. MySQL 5.7 Reference Manual. https://dev.mysql.com/doc/refman/5.7/en/replication.html.
- Narula et al. (2014) Neha Narula, Cody Cutler, Eddie Kohler, and Robert Morris. 2014. Phase Reconciliation for Contended In-Memory Transactions. In OSDI. 511–524.
- Neumann et al. (2015) Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. 2015. Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. In SIGMOD Conference. 677–689.
- PostgreSQL (2018) PostgreSQL. 2018. PostgreSQL 9.4.19 Documentation. https://www.postgresql.org/docs/9.4/static/warm-standby.html.
- Qin et al. (2017) Dai Qin, Ashvin Goel, and Angela Demke Brown. 2017. Scalable Replay-Based Replication For Fast Databases. PVLDB 10, 13 (2017), 2025–2036.
- Serafini et al. (2016) Marco Serafini, Rebecca Taft, Aaron J. Elmore, Andrew Pavlo, Ashraf Aboulnaga, and Michael Stonebraker. 2016. Clay: Fine-Grained Adaptive Partitioning for General Database Schemas. PVLDB 10, 4 (2016), 445–456.
- Shute et al. (2013) Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. 2013. F1: A Distributed SQL Database That Scales. PVLDB 6, 11 (2013), 1068–1079.
- Stonebraker et al. (2007) Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. 2007. The End of an Architectural Era (It’s Time for a Complete Rewrite). In VLDB. 1150–1160.
- Taft et al. (2014) Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andrew Pavlo, and Michael Stonebraker. 2014. E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing. PVLDB 8, 3 (2014), 245–256.
- Tatarowicz et al. (2012) Aubrey Tatarowicz, Carlo Curino, Evan P. C. Jones, and Sam Madden. 2012. Lookup Tables: Fine-Grained Partitioning for Distributed Databases. In ICDE. 102–113.
- Thomas (1979) Robert H. Thomas. 1979. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM Trans. Database Syst. 4, 2 (1979), 180–209.
- Thomson and Abadi (2010) Alexander Thomson and Daniel J. Abadi. 2010. The Case for Determinism in Database Systems. PVLDB 3, 1 (2010), 70–80.
- Thomson et al. (2012) Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In SIGMOD Conference. 1–12.
- Tu et al. (2013) Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Madden. 2013. Speedy transactions in multicore in-memory databases. In SOSP. 18–32.
- Vandiver et al. (2007) Ben Vandiver, Hari Balakrishnan, Barbara Liskov, and Samuel Madden. 2007. Tolerating byzantine faults in transaction processing systems using commit barrier scheduling. In SOSP. 59–72.
- Wang and Kimura (2016) Tianzheng Wang and Hideaki Kimura. 2016. Mostly-Optimistic Concurrency Control for Highly Contended Dynamic Workloads on a Thousand Cores. PVLDB 10, 2 (2016), 49–60.
- Wei et al. (2015) Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. 2015. Fast in-memory transaction processing using RDMA and HTM. In SOSP. 87–104.
- Yu et al. (2016) Xiangyao Yu, Andrew Pavlo, Daniel Sanchez, and Srinivas Devadas. 2016. TicToc: Time Traveling Optimistic Concurrency Control. In SIGMOD Conference. 1629–1642.
- Zheng et al. (2014) Wenting Zheng, Stephen Tu, Eddie Kohler, and Barbara Liskov. 2014. Fast Databases with Fast Durability and Recovery Through Multicore Parallelism. In OSDI. 465–477.
Appendix A More Results
A.1. Micro Benchmark
We now study how STAR performs with a varying number of nodes. We ran a YCSB workload and both the NewOrder and Payment transactions from TPC-C, with the hybrid replication optimization enabled. To observe the best performance that the system can achieve, we set the ratio of cross-partition transactions to 0; in other words, all transactions run in parallel in the partitioned phase.
Figure 14 reports the performance gain that STAR achieves with 3 to 6 nodes relative to a 2-node configuration. We do not report absolute throughput, since STAR's throughput on YCSB is much higher than on TPC-C. As we increase the number of nodes from 2 to 4, performance improves on all workloads, for example by 47% on NewOrder and 51% on YCSB. However, when we further increase the number of nodes to 6, only the YCSB workload sees additional gain. This is because the two TPC-C transactions are write-intensive: the coordinator node in STAR holds a full replica of the database, so it must receive the replicated writes of every partition, and with a single 10GigE NIC it becomes the bottleneck. The YCSB workload is read-intensive, which allows STAR to continue scaling out with more nodes. We set the number of nodes to 4 in all experiments in this paper.
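As a small illustration of the metric, the sketch below computes a gain relative to the 2-node baseline. It assumes that the gain is defined as the throughput with n nodes divided by the throughput with 2 nodes, minus one; the function name `relative_gain` and the throughput values are hypothetical placeholders chosen for illustration, not measurements from the experiments.

```python
def relative_gain(throughput, baseline):
    """Percentage gain of `throughput` over `baseline` (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (throughput / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical throughputs in transactions per second, not measured data.
    two_nodes = 100_000
    four_nodes = 147_000
    print(f"gain over 2 nodes: {relative_gain(four_nodes, two_nodes):.0f}%")
```

Normalizing each workload to its own 2-node baseline makes workloads with very different absolute throughput comparable on a single plot.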