Considerable research has been devoted in recent years to eventually consistent replicated data stores, such as modern NoSQL databases (e.g.,   ). This is because these systems, unlike their strongly consistent counterparts (e.g., traditional relational database systems ), are scalable, guarantee high availability, and provide low response times. These traits make them essential tools for building globally accessible Internet services that are able to cope with the traffic generated by millions of clients.
Developing systems or services using eventually consistent data stores is often difficult and error-prone, because one needs to anticipate the possibility of working on skewed data due to the weaker consistency guarantees provided by such a data store. (Footnote 1: Using conflict-free replicated data types (CRDTs)  may help to some degree, as CRDTs offer clear guarantees and hassle-free state convergence. However, CRDTs have limited semantics: they require either that the operations they define always commute, or that there exists a commutative, associative and idempotent state merge procedure.) Moreover, programmers who are used to traditional relational database systems miss the fully fledged support for serializable transactions, which, naturally, cannot be provided in a highly available fashion  . Therefore, in recent years, various NoSQL vendors started adding (quasi) transactional support to their systems. These add-on mechanisms are often very restrictive and do not perform well. For example, neither Riak nor Apache Cassandra offers cross-object/cross-record transactions  . Additionally, Riak allows strongly consistent (serializable) operations to be performed only on distinct data , whereas using the so-called lightweight transactions in Apache Cassandra on data that are also accessed in the regular, eventually consistent fashion leads to undefined behaviour .
In the past, various researchers attempted to incorporate transactional semantics into eventually consistent systems. Among others, the proposed solutions assumed weaker guarantees for transactional execution (e.g.,   ) or restricted the semantics of transactions (e.g.,  ). Interestingly, the first eventually consistent transactional systems, namely Bayou  and the implementations of Eventually-Serializable Data Services (ESDS) , followed a different approach. In these systems, each server speculatively total-orders all received client requests without prior agreement with other servers. The final request serialization is established by a primary server. In case the speculation is wrong, some of the requests are rolled back and reexecuted (Bayou), or, to obtain the response for a new client request, many of the requests whose final execution order is not yet established are repeatedly reexecuted (ESDS). Understandably, such implementations cannot perform and scale well. Moreover, they are not fault-tolerant because of the reliance on the primary. However, these systems have one very important trait: reasoning about their behaviour is relatively easy and intuitive, because, similarly to state machine replication (SMR)  , which executes all requests on all servers in the same order, on each server there is always a single serialization of all client requests the server knows about.
In this paper we take inspiration from Bayou and ESDS, and propose Creek, a novel fault-tolerant transactional replication scheme that features mixed weak-and-strong semantics. Clients submit requests to Creek in the form of arbitrary (but deterministic) transactions, called operations, each marked as weak or strong. Creek executes weak operations optimistically, thus ensuring low-latency responses, in the order corresponding to the operations’ timestamps that are assigned upon operation submission. Strong operations are also executed optimistically (similarly to weak operations), yielding tentative responses, but eventually their final serialization is established and a stable response is returned to the client. The final operation execution order is established using our new total order protocol, called conditional atomic broadcast (CAB), which is based on indirect consensus . The messages broadcast using CAB are as small as possible and are limited to the identifiers of strong operations. The contents of all (weak and strong) operations are disseminated among Creek replicas using only a gossip protocol. Creek leverages a multiversioning scheme  to facilitate concurrent execution of operations as well as to minimize the number of necessary operation rollbacks and reexecutions.
Creek can gracefully tolerate (partial) node failures, because CAB can be efficiently implemented by extending a quorum-based protocol, such as Multi-Paxos . When network partitions occur, replicas within each partition are still capable of executing weak operations and obtaining tentative responses to strong operations, and of converging to the same state (when the stream of client requests ceases). Formally, Creek guarantees linearizability  for strong operations, and fluctuating eventual consistency  for weak operations.
Creek causally binds the execution of strong and weak operations, so that the execution of operations of different types is not entirely independent. More precisely, for any strong operation, if its tentative response was produced on a replica state that included the effects of the execution of some weak operation, then the stable response of the strong operation will also reflect the effects of the execution of that weak operation.
We use the TPC-C benchmark  to test the performance of Creek in comparison to other replication schemes that enable arbitrary transactional semantics and from which Creek draws inspiration: Bayou, SMR, and a state-of-the-art speculative SMR scheme . By leveraging the multicore architecture of modern hardware, Creek easily outperforms SMR and Bayou. Creek provides throughput that is on par with the speculative SMR scheme, but exhibits much lower latencies in serving client requests (up to 3 times lower for weak transactions and up to 15% lower for strong transactions). In the vast majority of cases (92-100%, depending on the scenario), the effects of the speculative request execution correspond to the final execution order as established by solving distributed consensus among Creek replicas.
II Related work
As stated earlier, Creek is inspired by Bayou  and ESDS . There are a number of subtle characteristics of both systems that we have not yet mentioned. Unlike in Creek, in Bayou updating transactions do not provide return values. Bayou also features dependency check and merge procedure mechanisms, which allow the system to perform application-level conflict detection and resolution. We do not make any (strong) assumptions on the specification of operations handled by Creek (see also Section V-A), so these mechanisms can be emulated on the level of operation specification, if required.
Creek fulfills the specification of ESDS, although ESDS features a somewhat richer interface than Creek, because it allows the user to attach to an operation an arbitrary causal context that must be satisfied before the operation is executed. However, Creek can be easily extended to accommodate the full specification of ESDS. Interestingly, the basic implementation of ESDS  did not maintain an up-to-date state that is updated every time a new operation is executed. Instead, in order to obtain a response to an operation, ESDS first created a provisional state by reexecuting (some of) the previously submitted operations. Local computation was assumed to happen instantaneously. Naturally, this assumption is not realistic, so an optimized version of ESDS has been implemented, which, to some degree, limited the number of operation reexecutions and network usage .
We are aware of several systems that similarly to Creek feature requests that can be executed with different consistency guarantees. The system in  enables enforcing two kinds of stronger guarantees than causal consistency, by either a consistent numbering of requests, or the use of the three-phase-commit protocol. Unlike Creek, the system does not enforce a single total order of all client requests. Zeno  is very similar to Bayou, but it has been designed to tolerate Byzantine failures. Li et al.  demonstrate Gemini, a replicated system that satisfies RedBlue consistency. Gemini ensures causal consistency for all operations, but unlike the strong (red) operations, the weak (blue) operations must commute with all other operations. Burckhardt et al.  describe global sequence protocol (GSP), in which client applications perform operations locally and periodically synchronize with the cloud, the single source of truth. The cloud is responsible for establishing the final operation execution order. Changes to the execution order might lead to operation rollbacks and reexecutions in the client applications. When the cloud is unavailable, GSP does not guarantee progress: the clients can issue new operations that are executed locally, but they are not propagated to other clients. In effect, when the cloud is down, each client is forced to work in a separate network partition.
Since in Creek all operations are eventually serialized, the research on speculative execution in state machine replication (SMR) is also relevant. In basic SMR, every server sequentially executes all client requests (transactions) in the same order  . To this end, SMR might rely on the atomic broadcast (AB) (also called total order broadcast) protocol to ensure that all servers deliver the same set of requests in the same order. The speculative schemes based on SMR (e.g.,   ) start the execution of a request before the final operation order is established, to minimize latency in providing response to the client. However, the response is withheld until the system ensures the execution is serializable. Hence, these approaches do not guarantee low-latency responses.
To enable SMR to scale, some schemes (e.g.,  ) utilize partial replication, in which data is divided into partitions, each of which can be accessed and modified independently. The issue of data partitioning is orthogonal to speculative execution and lies outside the scope of this paper. We leave extending Creek to support partial replication for future work.
In Section VI we compare the performance of Creek to the performance of Bayou, SMR, and Archie , the state-of-the-art speculative SMR scheme. To disseminate requests among servers, Archie uses a variant of optimistic atomic broadcast which guarantees that in stable conditions (when the leader of the broadcast protocol does not change), the optimistic message delivery order is the same as the final one. Similarly to Creek, Archie utilizes a multiversioning scheme and takes full advantage of multi-core hardware.
Recently there have been several attempts to formalize the guarantees provided by Bayou and systems similar to it. Shapiro et al.  propose a (rather informal) definition of capricious total order, in which each server total-orders all operations it received, without prior agreement with others. In , Girault et al. propose a more formal property called monotonic prefix consistency. The definition is, however, limited to systems that, unlike Creek, only feature read-only operations and updating operations that do not have a return value. To formalize Creek’s correctness we use the framework and a property called fluctuating eventual consistency that we introduced in  (see Section V-B).
We consider a fully asynchronous, message-passing system consisting of a set of processes, to which external clients submit requests in the form of operations (also called transactions) to be executed by the processes. We model each process, which we call a replica, as a state automaton, that has a local state and, in reaction to events, executes steps that atomically transition the replica from one state to another. We consider a crash-stop failure model, in which a process can crash by ceasing communication. A replica that never crashes is said to be correct, otherwise it is faulty.
Replicas communicate via reliable channels. Replicas can use reliable broadcast (RB) , which is defined through two primitives: RB-cast and RB-deliver. Intuitively, RB guarantees that even if a faulty replica RB-casts some message $m$ and $m$ is RB-delivered by at least one correct replica, all other correct replicas eventually RB-deliver $m$. Formally, RB requires: (1) validity: if a correct replica RB-casts some message $m$, then the replica eventually RB-delivers $m$, (2) uniform integrity: for any message $m$, every process RB-delivers $m$ at most once and only if $m$ was previously RB-cast, and (3) agreement: if a correct replica RB-delivers some message $m$, then eventually all correct replicas RB-deliver $m$.
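The relay-before-deliver idea behind reliable broadcast can be illustrated with a minimal in-memory sketch. It is not taken from the paper: the `Replica` class, the shared `network` registry and the `step` loop are hypothetical names introduced only for illustration, and message loss, timing and real networking are ignored. The scenario simulates a sender that crashed after reaching only one replica.

```python
from collections import deque

class Replica:
    def __init__(self, rid, network):
        self.rid = rid
        self.network = network   # hypothetical shared registry: replica id -> Replica
        self.inbox = deque()     # messages waiting to be processed
        self.seen = set()        # ids of messages already delivered
        self.delivered = []      # local delivery log
        self.crashed = False
        network[rid] = self

    def rb_cast(self, msg_id, payload):
        # Broadcasting is handled like receiving one's own message.
        self._handle(msg_id, payload)

    def _handle(self, msg_id, payload):
        if self.crashed or msg_id in self.seen:
            return               # deliver each message at most once
        self.seen.add(msg_id)
        for peer in self.network.values():
            if peer is not self:
                peer.inbox.append((msg_id, payload))   # relay first
        self.delivered.append((msg_id, payload))       # then deliver

    def step(self):
        while self.inbox:
            self._handle(*self.inbox.popleft())

network = {}
replicas = [Replica(i, network) for i in range(3)]
replicas[0].crashed = True                   # the sender has crashed...
replicas[1].inbox.append(("m1", "hello"))    # ...after reaching only replica 1
for _ in range(2):
    for r in replicas:
        r.step()
```

Because replica 1 relays the message before delivering it, replica 2 also delivers it even though the original sender never contacted replica 2.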
As we have already outlined, clients may issue two kinds of operations: weak and strong. Weak operations are meant to be executed in a way that minimizes the latency in providing a response to the client. Hence, we require that a replica that received a weak operation executes it immediately in an eventually consistent fashion on the local state and issues a response to the client without waiting for coordination with other replicas. This behaviour is necessary (but not sufficient) to ensure that in the presence of network partitions (when communication between subgroups of replicas is not possible for long enough), the replicas’ states in each subgroup converge once the stream of client requests ceases. Naturally, a response to a weak operation might not be correct, in the sense that it might not reflect the state of replicas once they synchronize. On the other hand, a replica returns to a client a (stable) response to a strong operation only after the replicas synchronize and achieve agreement on the final operation execution order (relative to other, previously handled operations). Achieving agreement among replicas on how to execute (serialize) a strong operation requires solving distributed consensus. We assume availability of failure detector $\Omega$ , the weakest failure detector capable of solving distributed consensus in the presence of failures .
IV Conditional Atomic Broadcast
Similarly to atomic broadcast (AB) (also called total order broadcast) , CAB enables dissemination of messages among processes with the guarantee that each process delivers all messages in the same order. Unlike AB, however, CAB allows a process to defer the delivery of a message until a certain condition is met (e.g., certain network communication is completed). To this end, CAB defines two primitives: CAB-cast$(m, q)$, which is used by processes to broadcast a message $m$ with a test predicate $q$ (or simply, predicate $q$), and CAB-deliver$(m)$, which delivers $m$ on each process but only when the predicate $q$ is satisfied. Since $q$ might depend on $m$, we write $q(m)$ if $q$ is evaluated to true (on some process $p$). The predicate $q$ needs to guarantee eventual stable evaluation, i.e., $q$ needs to be a stable predicate that eventually evaluates to true on every correct process. Otherwise, not only would CAB be unable to terminate, but different processes could CAB-deliver different sets of messages. We formalize CAB through the following requirements:
validity: if a correct process $p$ CAB-casts a message $m$ with predicate $q$, and eventual stable evaluation holds for $q$, then $p$ eventually CAB-delivers $m$,
uniform integrity: for any message $m$ with predicate $q$, every process $p$ CAB-delivers $m$ at most once, and only if $(m, q)$ was previously CAB-cast and $q(m)$ holds at $p$,
uniform agreement: if a process $p$ (correct or faulty) CAB-delivers $m$ (with predicate $q$), and eventual stable evaluation holds for $q$, then all correct processes eventually CAB-deliver $m$ (with $q$),
uniform total order: if some process $p$ (correct or faulty) CAB-delivers $m$ (with predicate $q$) before $m'$ (with predicate $q'$), then every process CAB-delivers $m'$ (with $q'$) only after it has CAB-delivered $m$ (with $q$).
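The delivery rule implied by these requirements can be sketched from the point of view of a single process. This is only an illustration under strong assumptions: the agreement step is abstracted away entirely (the hypothetical `on_decided` callback stands for "the total order now contains this message"), and `CabEndpoint`, `try_deliver` and `has_payload` are names invented for the example, not part of the paper's protocol.

```python
class CabEndpoint:
    """One process's view of conditional delivery over an already agreed order."""

    def __init__(self):
        self.agreed = []      # (message, predicate) pairs in the agreed total order
        self.next_idx = 0     # position of the next message to deliver
        self.delivered = []   # messages delivered so far, in order

    def on_decided(self, message, predicate):
        # Hypothetical hook: called when agreement appends a message to the order.
        self.agreed.append((message, predicate))
        self.try_deliver()

    def try_deliver(self):
        # Deliver strictly in the agreed order; stop at the first message
        # whose test predicate does not hold yet on this process.
        while self.next_idx < len(self.agreed):
            message, predicate = self.agreed[self.next_idx]
            if not predicate(message):
                break
            self.delivered.append(message)
            self.next_idx += 1

received = set()                       # payloads this process has obtained so far
has_payload = lambda m: m in received  # a stable predicate: once true, stays true

p = CabEndpoint()
p.on_decided("a", has_payload)
p.on_decided("b", has_payload)
received.add("b")
p.try_deliver()                        # still blocked: "a" precedes "b" in the order
blocked = list(p.delivered)
received.add("a")
p.try_deliver()                        # now both can be delivered, "a" first
```

The sketch shows why the predicate must be stable and eventually true: a predicate that never holds for the message at the head of the order would block every later message.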
IV-B Reducing CAB to indirect consensus
There is a strong analogy between CAB and atomic broadcast (AB) built using indirect consensus . Our approach is quite a straightforward extension of the AB reduction to indirect consensus presented there, as we now discuss.
As shown in , AB can be reduced to a series of instances of distributed consensus. In each instance, processes reach agreement on a set of messages to be delivered. Once agreement is reached, the messages included in the decision value are delivered in some deterministic order by each process. Indirect consensus reduces the latency of reaching agreement among the processes by distributing the messages (the values proposed by the processes) using a gossip protocol and having processes agree only on the identifiers of the messages. Hence, a proposal in indirect consensus is a pair of values $(v, rcv)$, where $v$ is a set of message identifiers (and $msgs(v)$ are the messages whose identifiers are in $v$), and $rcv$ is a function such that $rcv(v)$ is true only if the process has received $msgs(v)$. Indirect consensus’ primitives are almost identical to the ones of classic distributed consensus: $propose(k, v, rcv)$ and $decide(k, v)$, where $k$ is the number identifying a concrete consensus execution. Naturally, whenever a decision is taken on $v$, indirect consensus must ensure that all correct processes eventually receive $msgs(v)$. We formalize this requirement by assuming eventual stable evaluation of $rcv$. (Footnote 2: In the original paper , this requirement has been called hypothesis A.) Formally, indirect consensus requires:
termination: if eventual stable evaluation holds, then every correct process eventually decides some value,
uniform validity: if a process decides $v$, then $(v, rcv)$ was proposed by some process,
uniform integrity: every process decides at most once,
uniform agreement: no two processes (correct or not) decide a different value,
no loss: if a process decides $v$ at time $t$, then $rcv(v)$ holds for one correct process at time $t$.
In indirect consensus, the function $rcv$ explicitly concerns local delivery of the messages whose identifiers are in $v$. However, $rcv$ could be replaced by any function $f$ that has the same properties as $rcv$, i.e., eventual stable evaluation holds for $f$. In CAB, instead of $rcv$, we use a conjunction of $rcv$ and the test predicates $q(m)$ for each message $m$ whose identifier is in $v$. This way we easily obtain an efficient implementation of CAB, because we minimize the sizes of the proposals on which consensus is executed. The complete reduction of CAB to indirect consensus follows the approach from  and is presented in Appendix A-A. We formally show that the reduction satisfies the requirements of CAB. In practice, a very efficient implementation of CAB can be obtained by slightly modifying the indirect variant of Multi-Paxos .
V-A Basic scheme
Our specification of Creek, shown in Algorithm 1, is rooted in the specification of the Bayou protocol  presented in . We assume that clients submit requests to the system in the form of operations with encoded arguments (line 15), and await responses. Operations are defined by a specification of a (deterministic) replicated data type  (e.g., read/write operations on a register, list operations such as append or getFirst, or arbitrary SQL queries/updates). Each operation is marked as weak or strong (through a dedicated argument). Operations are executed on the state object (line 4), which encapsulates the state of a copy of a replicated object implementing the data type (see how StateObject can be implemented in Appendix A-B).
Upon invocation of an operation $op$ (line 15), it is wrapped in a request structure (line 17) that also contains the current timestamp (which will be used to order $op$ among weak operations and strong operations executed in a tentative way) and a unique identifier (a pair consisting of the Creek replica number and the value of the monotonically increasing local event counter). Such a package is then distributed among replicas using some gossip protocol (here represented by reliable broadcast, line 23; we simply say that $op$ has been RB-cast and later RB-delivered; through the code in lines 22 and 24 we simulate immediate local RB-delivery of $op$). If $op$ is a strong operation, we additionally attach to the message the causal context of $op$, i.e., the identifiers of all operations that have already been RB-delivered by the replica and which will be serialized before $op$ (line 19). (Footnote 3: Operations serialized before $op$ include all operations RB-delivered by the replica whose final operation execution order is already established, and other weak operations whose timestamp is smaller than $op$’s timestamp. Later we explain why the causal context of $op$ cannot include the identifiers of any strong operations whose final execution order is not yet determined.)
This information can be effectively stored in a dotted version vector (dvv), which is logically a set of pairs of a replica identifier and an event number (in a dedicated causal-context variable, line 7, a replica maintains the identifiers of all operations the replica knows about, see the routines in lines 26 and 32). For a strong operation, the replica also CAB-casts the operation’s identifier with a test predicate specified by a dependency-checking function (line 20). By specification of CAB, this function (line 10) is evaluated by each replica when solving distributed consensus on a concrete operation identifier that is about to be CAB-delivered, and then, after the decision has been reached, in an attempt to CAB-deliver the identifier locally. The function checks whether the replica has already RB-delivered the strong operation with the given identifier, and if so, whether it has also already RB-delivered all operations that are in the causal context of that operation. Note that a replica will CAB-deliver an operation’s identifier only if it has already RB-delivered the operation’s request structure.
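A dotted version vector and the dependency test described above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: the class layout, the assumption that event numbers start at 1, and the names `DottedVersionVector`, `check_dep` and `delivered` (the set of identifiers this replica has received through gossip) are all hypothetical.

```python
class DottedVersionVector:
    """Logically a set of (replica id, event number) pairs, stored compactly."""

    def __init__(self):
        self.contiguous = {}   # replica id -> n such that events 1..n are all present
        self.dots = set()      # isolated pairs beyond the contiguous prefix

    def add(self, replica, event):
        n = self.contiguous.get(replica, 0)
        if event <= n:
            return                                  # already covered by the prefix
        self.dots.add((replica, event))
        while (replica, n + 1) in self.dots:        # fold dots into the prefix
            self.dots.discard((replica, n + 1))
            n += 1
        self.contiguous[replica] = n

    def __contains__(self, dot):
        replica, event = dot
        return event <= self.contiguous.get(replica, 0) or dot in self.dots


def check_dep(delivered, strong_id, causal_ctx):
    """True once the strong operation and its whole causal context are known locally."""
    return strong_id in delivered and all(dot in delivered for dot in causal_ctx)


delivered = DottedVersionVector()
for dot in [(1, 1), (1, 2), (2, 1), (1, 4)]:
    delivered.add(*dot)

ready = check_dep(delivered, (1, 4), [(1, 1), (2, 1)])      # context fully received
not_ready = check_dep(delivered, (1, 4), [(1, 3)])          # (1, 3) still missing
delivered.add(1, 3)                                         # the gap is filled
ready_later = check_dep(delivered, (1, 4), [(1, 3)])
```

Because identifiers are only ever added, the result of `check_dep` can change from false to true but never back, which is the stability property that a CAB test predicate needs.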
When an operation $op$ is RB-delivered (line 26), the replica adds its identifier to the causal-context variable if $op$ is a weak operation (line 30), and then uses $op$’s timestamp to correctly order $op$ among other weak operations and strong operations targeted for speculative execution (on the list of tentative requests, line 39). Then the new operation execution order is established by concatenating the list of committed requests and the list of tentative requests (line 40). The committed list maintains the request structures for all operations whose final execution order has already been established. Then, a dedicated function (line 42) compares the newly established operation execution order with the order in which (some) operations have already been executed. Operations for which the orders are different need to be rolled back (in the order opposite to their execution order) and reexecuted. In the ideal case, $op$ is simply added to the end of the execution order, and awaits execution. (Footnote 4: Note that no rollbacks are required also when operation execution lags a bit behind the RB-delivery of operations. In such a case, only the tail of the list of not yet executed operations undergoes reordering.) To limit the number of possible rollbacks, local clocks, which are used to generate timestamps for request structures, should not deviate too much from each other.
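The ordering step and the comparison of the new order with the already executed one can be sketched as below. This is a hedged illustration rather than the paper's pseudocode: `Req`, `insert_tentative` and `adjust_execution` are hypothetical names, requests are reduced to a timestamp, an identifier and a strong/weak flag, and ties between equal timestamps are broken by identifier (an assumption of this sketch).

```python
import bisect
from collections import namedtuple

# Hypothetical request record: timestamp, (replica, event) identifier, strong flag.
Req = namedtuple("Req", ["ts", "id", "strong"])

def insert_tentative(tentative, req):
    """Insert req into the tentative list, kept sorted by (timestamp, id)."""
    keys = [(r.ts, r.id) for r in tentative]
    tentative.insert(bisect.bisect(keys, (req.ts, req.id)), req)

def adjust_execution(committed, tentative, executed):
    """Return (operations to roll back, operations to execute) for the new order."""
    new_order = committed + tentative
    k = 0   # length of the common prefix of the executed order and the new order
    while (k < len(executed) and k < len(new_order)
           and executed[k].id == new_order[k].id):
        k += 1
    # Undo in reverse execution order, then (re)execute the rest of the new order.
    return executed[k:][::-1], new_order[k:]

a = Req(1, (0, 1), False)
b = Req(2, (1, 1), False)
c = Req(3, (0, 2), False)

committed, tentative = [], []
insert_tentative(tentative, a)
insert_tentative(tentative, c)
executed = [a, c]                 # both already executed speculatively
insert_tentative(tentative, b)    # b arrives late but has an earlier timestamp than c
to_roll_back, to_execute = adjust_execution(committed, tentative, executed)
```

Here only `c` has to be undone, after which `b` and `c` run in timestamp order; a request with the highest timestamp would instead be appended with nothing rolled back.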
When the identifier of a strong operation $op$ is CAB-delivered (line 32), the replica can commit $op$, i.e., establish its final execution order. To this end, the replica first stabilizes some of the operations, i.e., moves the request structures of all operations included in the causal context of $op$ from the tentative list to the end of the committed list (line 49). Then the replica adds $op$’s request structure to the end of the committed list as well (line 51). Note that this procedure maintains the relative order in which weak operations from the causal context of $op$ appear on the tentative list. This procedure also maintains the causal precedence of those operations in relation to $op$. All operations not included in the causal context of $op$ stay on the tentative list (line 50). As before, the execution-adjusting function is called to mark some of the executed operations for rollback and reexecution (line 54). Note that in the ideal case, operations (including $op$) can be moved from the tentative list to the committed list without causing any rollbacks or reexecutions. Unfortunately, if any (weak) operation submitted to some other replica is ordered in between operations from the causal context of $op$, and some of these operations are already executed, rollbacks cannot be avoided in the basic version of Creek. In Section V-C we discuss how this situation can be mitigated to some extent.
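The commit step can be sketched as a pure function over the two lists. As before, this is an illustrative simplification and not the paper's code: `Req`, `commit` and the representation of the causal context as a plain set of identifiers are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical request record: timestamp, (replica, event) identifier, strong flag.
Req = namedtuple("Req", ["ts", "id", "strong"])

def commit(committed, tentative, strong_req, causal_ctx):
    """Move strong_req and the operations from its causal context to committed.

    Returns the new (committed, tentative) lists; the relative order of the
    moved operations is the order they had on the tentative list.
    """
    in_ctx = [r for r in tentative
              if r.id in causal_ctx and r.id != strong_req.id]
    rest = [r for r in tentative
            if r.id not in causal_ctx and r.id != strong_req.id]
    return committed + in_ctx + [strong_req], rest

a = Req(1, (0, 1), False)
x = Req(2, (1, 1), False)   # weak operation submitted to another replica
b = Req(3, (0, 2), False)
s = Req(4, (0, 3), True)    # strong operation whose causal context is {a, b}

committed, tentative = commit([], [a, x, b, s], s, {(0, 1), (0, 2)})
```

In this scenario `x` was ordered between `a` and `b` by timestamp but is not in the causal context of `s`, so it ends up after `s` in the final order. If `x`, `b` and `s` had already been executed in timestamp order, this is exactly the case in which the basic scheme has to roll them back.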
Recall that the causal context of a strong operation $op$ does not include the identifiers of any strong operations that are not yet committed. We cannot include such dependencies because, ultimately, the order of strong operations is established by CAB, which is unaware of the semantics of, and the possible causal dependencies between, the messages sent through CAB. Hence, the order of strong operations established by CAB could be different from the order following from the causal dependency we would have defined. In principle, such dependencies could be enforced using Zab  or executive order broadcast . However, these schemes would have to be extended to accommodate the capabilities of CAB. We opted not to further complicate the presentation of Creek. In Creek, the identifier of a strong operation $op$ is added to the global causal-context variable (which we use to create a causal context for all strong operations) only upon CAB-delivery of that identifier. But then we commit $op$, thus establishing its final execution order.
Operation rollbacks and executions happen within transitions specified in lines 58–61 and 62–76. Whenever an operation is executed on a given replica, the replica checks whether the operation was originally submitted to this replica (line 65). Then, when necessary, it returns the response to the client. Note that in our pseudocode, before a client receives a stable response to a strong operation, it may receive multiple tentative responses, one for each time the operation is (re)executed. Sometimes the replica returns a stable response already while committing the operation (line 48). This happens when a strong operation has been speculatively executed in an order equivalent to its final execution order.
The most faithful description of the characteristics of Creek can be made using the framework from , where we analyze the behaviour and then formalize the guarantees of the seminal Bayou protocol . Creek’s principle of operation is very similar to Bayou’s, and so Creek also exhibits some of Bayou’s quirky behaviour. Most crucially, Creek allows for temporary operation reordering, which means that the replicas may temporarily disagree on the relative order in which the operations submitted to the system were executed. In consequence, clients may observe operation return values which do not correspond to any operation execution order that can be produced by traditional relational database systems or typical NoSQL systems. As we prove, this characteristic is unavoidable in systems that mix weak and strong consistency. The basic version of Creek is also not free of the other two traits of Bayou mentioned in , namely circular causality and unbounded wait-free execution of operations. The former can be mitigated in a similar fashion as in Bayou.
Formally, the guarantees provided by Creek can be expressed using Fluctuating Eventual Consistency (FEC) , a property that precisely captures temporary operation reordering and is not tied to a concrete data type. (Footnote 5: Creek does not make any assumptions on the semantics of operations issued to replicas other than that operations must be deterministic.) In Appendix A-C we argue why Creek satisfies FEC for weak operations and linearizability  for strong operations.
V-C High-performance protocol
An obvious optimization of Creek involves executing weak read-only operations without performing any network communication with other replicas. This optimization does not address the core limitation of Creek, which comes from an excessive number of operation rollbacks and reexecutions. To improve Creek’s performance, we modified Creek in several ways. In our discussion below we focus on the updating operations.
Since rolling back operations is costly, we want to perform rollbacks only if necessary. Suppose there are two already executed operations $op_i$ and $op_j$, and $op_i$ appears before $op_j$ on the tentative list. If $op_j$ is moved to the committed list (e.g., because $op_j$ is being committed and $op_i$ does not belong to the causal context of $op_j$), the basic version of Creek must roll back both operations and reexecute them in the opposite order. However, if $op_i$ and $op_j$ operated on distinct data, no rollbacks or reexecutions are necessary (at least with respect to only these two operations). Typical workloads exhibit locality, i.e., the requests do not access all data items with uniform probability. Hence, such an optimization brings a dramatic improvement in Creek’s performance.
To facilitate efficient handling of situations similar to the one described above, we extended Creek with a multiversioning scheme . The modified version of Creek holds multiple immutable objects called versions for data items accessed by operations. Versions are maintained within a version store. Each version is created during execution of some operation $op$ and is marked using a special timestamp that corresponds to the location of $op$ in the current operation execution order. The execution of any operation $op$ happens in isolation, on a valid snapshot. This means that the snapshot includes all and only the versions created as the result of execution of all operations $op'$ such that $op'$ appears before $op$ in the execution order at the time of execution of $op$. A rollback of $op$ does not dispose of versions created during its execution. Instead, all versions created during execution of $op$ are marked, so they are not included in the snapshots used during execution of all operations that start execution after the rollback of $op$.
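The snapshot rule can be made concrete with a small in-memory sketch. It is not the paper's StateObject: `VersionStore`, its methods, and the choice to resolve a version's position by looking up its writer in the current order (rather than storing a timestamp) are simplifying assumptions made for illustration.

```python
class VersionStore:
    """Keeps every version of every data item, tagged with the writing operation."""

    def __init__(self):
        self.versions = {}   # key -> list of {"writer", "value", "rolled_back"}

    def write(self, key, value, writer):
        self.versions.setdefault(key, []).append(
            {"writer": writer, "value": value, "rolled_back": False})

    def rollback(self, writer):
        # Versions are not deleted, only marked as invalid for future snapshots.
        for versions in self.versions.values():
            for v in versions:
                if v["writer"] == writer:
                    v["rolled_back"] = True

    def read(self, key, order, reader):
        """Value of key in reader's snapshot: the valid version written by the
        operation closest before reader in the execution order, or None."""
        position = {op: i for i, op in enumerate(order)}
        limit = position[reader]
        best = None
        for v in self.versions.get(key, []):
            pos = position.get(v["writer"])
            if v["rolled_back"] or pos is None or pos >= limit:
                continue
            if best is None or pos > best[0]:
                best = (pos, v["value"])
        return None if best is None else best[1]

store = VersionStore()
store.write("x", 1, writer="a")
store.write("x", 3, writer="c")
order = ["a", "b", "c"]                     # b was ordered between a and c later on
x_seen_by_b = store.read("x", order, "b")   # c's version is invisible to b
store.write("y", 2, writer="b")
y_seen_by_c = store.read("y", order, "c")
store.rollback("a")
x_after_rollback = store.read("x", order, "b")
```

Because versions are never overwritten in place, an operation inserted in the middle of the order can still read the state it should have seen, without undoing the operations that follow it.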
A rollback of an operation may cascade into rollbacks of other operations. Suppose, as before, that there are two already executed operations $op_i$ and $op_j$, and $op_i$ appears before $op_j$ on the tentative list. Suppose also that $op_k$ is RB-delivered, and $op_k$ has a lower timestamp than $op_i$. In the basic version of Creek, both $op_i$ and $op_j$ would be rolled back and reexecuted after the execution of $op_k$. Thanks to multiversioning, we can execute $op_k$ on a consistent snapshot corresponding to the desired position of $op_k$ on the tentative list and then check whether the execution of $op_k$ created new versions for any objects read by $op_i$. If not, we do not need to roll back $op_i$ and we can proceed to check in a similar way the possible conflict between $op_k$ and $op_j$. On the other hand, if $op_i$ is rolled back, we need to check for conflicts between $op_k$ and $op_j$ as well as between $op_i$ and $op_j$, because $op_j$ might have read some no longer valid versions created by $op_i$.
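The cascading check can be sketched with explicit read and write sets. This is a conservative approximation written for illustration only: `ops_to_roll_back`, `read_sets` and `write_sets` are hypothetical names, the sets are assumed to have been recorded during execution, and a rolled-back operation is assumed to write the same items when reexecuted.

```python
def ops_to_roll_back(new_op, later_ops, read_sets, write_sets):
    """Later operations invalidated by executing new_op before them.

    later_ops lists the already executed operations ordered after new_op,
    in their execution order.
    """
    dirty = set(write_sets[new_op])   # items whose earlier reads may now be stale
    invalid = []
    for op in later_ops:
        if read_sets[op] & dirty:
            invalid.append(op)
            # Versions written by a rolled-back operation become invalid too,
            # so anything that read them must be checked as well.
            dirty |= write_sets[op]
    return invalid

write_sets = {"k": {"x"}, "i": {"y"}, "j": {"z"}}

# i read x, which k now writes; j read y, which i wrote: the rollback cascades.
cascade = ops_to_roll_back(
    "k", ["i", "j"], {"i": {"x"}, "j": {"y"}}, write_sets)

# Neither i nor j read anything written by k: nothing has to be undone.
no_conflict = ops_to_roll_back(
    "k", ["i", "j"], {"i": {"a"}, "j": {"y"}}, write_sets)
```

The check is deliberately pessimistic: it may flag an operation whose read was in fact served by an intermediate writer, which costs an unnecessary reexecution but never misses a stale read.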
Note that one needs to be careful when garbage collecting versions. Since a newly RB-delivered operation can be placed in the middle of the tentative list, we need to maintain all versions produced during execution of the operations on the tentative list. We track live operations (operations being executed) to see which snapshots they operate on. This way we never garbage collect versions that might be used by live operations. With that in mind, for each data item we can attempt to garbage collect all versions created during executions of committed operations, except for the most recently created one. We can also eventually remove all versions created by operations that were later rolled back (by specification of Creek, new transactions that start execution after the rollback has happened will not include the rolled back versions in their snapshots).
Under normal operation, when strong operations are committed every once in a while, the number of versions for each data item should remain roughly the same. However, when no strong operations are being committed (because no strong operations are submitted for a longer period of time or no message can be CAB-delivered due to a network partition), the number of versions starts to grow. We could counter such a situation by, e.g., periodically issuing strong no-op operations that would gradually stabilize weak operations. Otherwise, we need to maintain all versions created by the operations on the tentative list. In such a case, we could limit the amount of data we need to store by collecting complete snapshots (that represent some prefix of the tentative list), and recreate some versions when needed by reexecuting some previously executed operations on the snapshot.
Thanks to multiversioning, we could relatively easily further extend Creek to support concurrent execution of multiple operations. Having multiple operations execute concurrently does not violate correctness, because each operation executes in isolation and on a consistent snapshot. The newly created versions are added to our version store after the operation completes execution. We do so atomically and only after we have checked for conflicts with other concurrently executed operations which have already completed their execution. In case of a conflict, we discard the versions created during the execution and reexecute the operation.
VI Experimental evaluation
Since Creek has been designed with low latency in mind, we are primarily interested in the on-replica latencies (or simply latencies) exhibited by Creek when handling client requests (the time between a replica receives a request from a client and sends back a response; the network delay in communication with the client is not included). One can expect that from a client’s perspective, important are the stable latencies for strong requests and the tentative latencies for weak requests: when a request is marked as strong, it means it is essential for the client to obtain a response that is correct (results from a state that is agreed upon by replicas). On the other hand, weak requests are to be executed in an eventually consistent fashion, so the client expects that the tentative response might not be 100% accurate (i.e., the same as the stable response produced once the request is stabilized).
We compare the latencies exhibited by Creek with the latencies produced by other replication schemes that enable arbitrary semantics. To this end we test Creek against SMR, Archie (a state-of-the-art speculative SMR), and Bayou (mainly due to its historical significance). For all systems we also measure the average CPU load and network congestion. Moreover, for Creek and Archie we check the accuracy of the speculative execution, i.e., the percentage of weak operations for which the first speculative execution yielded results that match the final results for the same request. Archie, as specified in the original paper, does not return tentative responses after completing speculative execution. We can, however, predict what the tentative latency for Archie would be, and thus we plot it alongside the stable latency.
Recall that Creek (similarly to Bayou) uses a gossip protocol to disseminate (both weak and strong) operations among replicas. To ensure minimal communication footprint of the inter-replica synchronization necessary for strong operations, Creek uses an indirect consensus-based implementation of CAB. On the other hand, Archie and efficient SMR implementations (e.g., ) disseminate entire messages (operations) solely through atomic broadcast (AB). Since our goal is to conduct a fair comparison between the schemes, our implementations of SMR and Archie rely on a variant of AB that is also based on indirect consensus.
VI-A Test environment
We test the systems in a simulated environment, which allows us to conduct a fair comparison: all systems share the same implementation of the data store abstraction and the networking stack, the sizes of the exchanged messages are uniform across systems (apart from the additional messages exchanged through CAB in Creek), and all statistics related to the test executions are measured in the same fashion. We simulate a 5-replica system connected via 1Gbps network. Each replica can run up to 16 tasks in parallel (thus simulating a 16-core CPU).
For our tests we use TPC-C , a popular OLTP benchmark. Every test run involves a uniform stream of client requests (transactions), each randomly marked weak or strong and sent to a randomly chosen replica. The fraction of strong transactions in the workload depends on the strong transaction percentage (stxp) parameter, set to 10%. The network communication latencies are set to represent the typical latencies achieved in a data center (0.2-0.3 ms). To simulate different contention levels, we conduct tests with the TPC-C scale factor set to 1 and 5 (the dataset contains either 1 or 5 warehouses; smaller number of warehouses means higher contention).
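The shape of such a request stream can be sketched in Python as follows. This only illustrates how requests are marked and routed. It does not generate TPC-C transactions. The function name `generate_workload` and its parameters are hypothetical, with defaults taken from the setup described above (5 replicas, stxp of 10%).

```python
import random


def generate_workload(num_requests, num_replicas=5, stxp=0.10, seed=0):
    """Yield (request_id, kind, replica) tuples for a synthetic request stream.

    stxp is the probability that a request is marked strong; the seed makes
    the stream reproducible across test runs.
    """
    rng = random.Random(seed)
    for request_id in range(num_requests):
        kind = "strong" if rng.random() < stxp else "weak"
        replica = rng.randrange(num_replicas)   # uniformly chosen target replica
        yield request_id, kind, replica
```

Fixing the seed is a design choice made here so that every compared system could be fed the same sequence of requests.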
VI-B Test results
[Figure 1: on-replica latency as a function of achieved throughput. Top panels: high contention, medium contention. Bottom panels: zoom on low throughput.]
In Figure 1 we present the on-replica latencies for all systems as a function of the achieved throughput. In all tests the network is not saturated for any of the systems: messages exchanged between the replicas are small and transactions take a significant time to execute.
SMR and Bayou, whose maximum throughput is about 2.7k txs/s (regardless of the contention level), are easily outperformed by Creek and Archie, both of which take advantage of the multicore architecture. The latter systems’ peak throughput is about 12k txs/s in the high contention scenario and 35k txs/s in the medium contention scenario. When the CPU is not saturated, Creek’s latency for tentative responses (for both weak and strong transactions) is steady at around 0.3 ms, which corresponds to the average time of executing a TPC-C transaction in the simulation. Creek’s latency in obtaining a stable response is a few times higher, as producing the response involves inter-replica synchronization. More precisely, to produce a stable response, a request needs to be CAB-delivered, which means that under our assumptions and using a Paxos-based implementation of CAB, the request can be CAB-delivered after 3 communication phases. Hence, network communication adds at least about 1 ms to the latency. In both the medium and high contention scenarios, the stable latencies for strong transactions in Creek are about 15% lower than those exhibited by Archie (see also below). In Creek, CPU utilization gradually increases with the increasing load. Eventually, the CPU saturates and, when the backlog of unprocessed transactions starts to build up (as signified by the latency plot shooting up), Creek reaches its peak throughput.
Returning tentative responses makes little sense when most of the time they are incorrect (i.e., they do not match the stable responses). Our tests show, however, that the tentative responses produced by Creek are correct in the majority of cases: the accuracy of the speculative execution ranges from 92% to 100% in the high contention scenario and from 99% to 100% in the medium contention scenario (see Figures 2 and 3 in Appendix A-D).
The tentative response latency observed for Creek is up to 3 times lower than for Archie. This is because an Archie replica must first broadcast and deliver a transaction before it can start processing it. More precisely, an Archie replica starts processing a transaction upon optimistic delivery of a message containing the transaction, which was sent using AB. The speculative execution in Archie has little impact on stable latency: on average, before a speculative transaction execution is completed by an Archie replica, an appropriate AB message is delivered by the replica, thus confirming the optimistic delivery order (hence the perfect accuracy of the speculative execution, see Figures 2 and 3 in Appendix A-D). This means that a replica starts executing a transaction only slightly before the corresponding message is delivered by AB. The small benefits of returning a tentative response earlier can be seen in the bottom-left and bottom-right plots in Figure 1, which show zoomed-in views of the top plots (for modest workloads). [Footnote 6: The impressive speed-up achieved by Archie, as described in the original paper, was due to network communication latencies, which were about 3-4 ms, over 10 times higher than the ones we assume in the simulation.]
In the high contention scenario, for both Creek and Archie the execution ratio (the average number of executions performed for each transaction) gradually increases from 1 to almost 1.9 as the CPU becomes saturated (see Figures 2 and 3 in Appendix A-D). The execution ratio is slightly higher for Creek than for Archie due to Creek’s higher variance in the relative order between tentatively executed transactions. Archie curbs the number of rollbacks and reexecutions by allowing the replicas to limit the number of concurrently executed transactions. Moreover, in Archie, when the leader process of the underlying AB does not change, the optimistic message delivery order always matches the final delivery order. For the medium contention scenario, the execution ratio for both systems ranges from 1 to 1.3, with smaller differences between the systems.
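The two metrics used in this discussion, the execution ratio and the accuracy of speculative execution, can be computed from per-transaction records as in the following Python sketch. The record layout (`TxRecord` and its fields) is a hypothetical illustration, not the format used by the simulator.

```python
from dataclasses import dataclass


@dataclass
class TxRecord:
    executions: int          # how many times the transaction was (re)executed
    first_tentative: object  # result of the first speculative execution
    stable: object           # result once the final order was established


def execution_ratio(records):
    """Average number of executions performed per transaction."""
    return sum(r.executions for r in records) / len(records)


def speculation_accuracy(records):
    """Percentage of transactions whose first tentative result was final."""
    correct = sum(1 for r in records if r.first_tentative == r.stable)
    return 100.0 * correct / len(records)
```

An execution ratio of 1 means that no transaction was ever reexecuted. Note that a transaction may be reexecuted and still count as accurate, as long as its first tentative result equals the stable one.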
SMR executes all transactions sequentially, after they have been delivered by AB. It means that SMR exhibits high latency compared to Creek and Archie, and has very limited maximum throughput. Bayou cuts the latency compared to SMR, because Bayou speculatively executes transactions before the final transaction execution order is established. However, its maximum throughput is comparable to SMR’s, as Bayou also processes all transactions sequentially.
VI-C Varying test parameters
A low contention level (when the scale factor in TPC-C is set to 20) translates into better overall throughput (about 40k txs/s) and more uniform latencies for both Creek and Archie. In this scenario, transactions in both systems are rarely rolled back (the execution ratio never exceeds 1.1), because there are few conflicts between concurrently executed transactions. In these conditions, Creek always achieves perfect accuracy of speculative execution. Understandably, a low contention level does not have any impact on the performance of SMR, which executes all transactions sequentially. Bayou’s performance is similar to its performance in the other scenarios.
Increasing the percentage of strong transactions brings the latency of stable responses for strong transactions in Creek a bit closer to Archie’s latency. This is because, on average, there are now fewer transactions in the causal context of each strong transaction, and thus the transaction can be CAB-delivered earlier. The smaller causal contexts also translate into a slightly higher execution ratio, as fewer transactions can be committed together (recall that a strong transaction stabilizes weak transactions from its causal context upon commit). Changes to the stxp parameter affect the performance of neither SMR nor Bayou.
Now we consider what happens when transactions take longer to execute. In the additional tests we increased the transaction execution time five times. Understandably the maximum throughput of all systems decreased five times. The maximum execution ratio for both Creek and Archie is lower than before, because there are fewer transactions issued concurrently. Longer execution times also mean that the inter-replica communication latency has smaller influence on the overall latency in producing (stable) responses (execution time dominates network communication time). In effect, when stxp is high (50%), the latency of stable execution in Creek matches the (tentative/stable) execution latency in Archie, and the latency of Bayou is closer to SMR’s. When stxp is relatively low (10%), the latency for Creek is lower compared to Archie’s due to the same reasons as before.
Understandably, using machine clusters containing more replicas does not yield better performance, because all tested replication schemes assume full replication. Consequently, every replica needs to process all client requests. To improve the horizontal scalability of Creek, it needs to be adapted to support partial replication. We leave that for future work.
Using CPUs with more cores has no effect on SMR and Bayou, but allows Creek and Archie to (equally well) handle higher load. We skip the plots for these tests, as they resemble the already shown results but scaled out to higher maximum throughput values.
Similarly to Bayou, but unlike Archie and SMR, Creek remains available under network partitions (naturally, stable responses for strong transactions are provided only in the majority partition, if such exists). Under a heavy workload, Creek takes a long time to reconcile system partitions once the connection between the partitions is reestablished: the execution order of many transactions needs to be revisited, which means that they have to be reexecuted. In principle there is no other way to do it if we make no assumptions about the system semantics. Making such assumptions could allow us in some cases to, e.g., efficiently merge the states of the replicas from different partitions, as in CRDTs .
As shown by the TPC-C tests, Creek greatly improves the latency compared to Archie, the state-of-the-art speculative SMR system, and also provides much better overall throughput than SMR and Bayou. In fact, the tentative latency exhibited by Creek is up to 3 times lower than Archie’s. Moreover, when the percentage of strong transactions is relatively low, Creek improves the stable latency by 15% compared to Archie (when the percentage of strong transactions is high, the stable latencies exhibited by Creek and Archie are comparable). Crucially, the tentative responses provided by Creek for both weak and strong transactions are correct in the vast majority of cases.
Naturally, eventually consistent systems which restrict the semantics of operations (e.g., by providing only CRUD semantics), such as Apache Cassandra , can perform much better than Creek. It is because these systems limit or avoid altogether operation reexecutions resulting from changes in the order in which the updates are processed. However, as we argued in Section I, these systems are not suitable for all applications and are difficult to use correctly.
In this paper we presented Creek, a proof-of-concept, eventually consistent, transactional scheme that also enables execution of strongly consistent requests. By design, Creek provides low latency in handling client requests and yields throughput that is comparable with a state-of-the-art speculative SMR scheme. Creek does so while remaining general: it supports execution of arbitrary (deterministic) transactions. We believe that Creek’s principle of operation can be used as a good starting point for other mixed-consistency replication schemes that are optimized for more specific uses.
-  Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon’s highly available key-value store. SIGOPS Operating Systems Review, 41(6):205–220, October 2007.
-  Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C Hsieh, Deborah A Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, June 2008.
-  Avinash Lakshman and Prashant Malik. Cassandra: A decentralized structured storage system. SIGOPS Operating Systems Review, 44(2):35–40, April 2010.
-  Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency control and recovery in database systems. Addison-Wesley, 1987.
-  Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In Proc. of SSS ’11, May 2011.
-  Eric A. Brewer. Towards robust distributed systems (abstract). In Proc. of PODC ’00, July 2000.
-  Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
-  Basho documentation. Consistency levels in Riak. https://docs.basho.com/riak/kv/2.2.3/developing/app-guide/strong-consistency, 2019.
-  Apache Cassandra documentation. Light weight transactions in Cassandra. https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertLWT.html, 2019.
-  Apache Cassandra Issues (Jira). Mixing LWT and non-LWT operations can result in an LWT operation being acknowledged but not applied. https://jira.apache.org/jira/browse/CASSANDRA-11000, 2016.
-  Sebastian Burckhardt, Daan Leijen, Manuel Fähndrich, and Mooly Sagiv. Eventually consistent transactions. In Proc. of ESOP ’12, March 2012.
-  Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. Highly available transactions: Virtues and limitations. Proc. VLDB Endow., 7(3):181–192, November 2013.
-  Deepthi Devaki Akkoorath, Alejandro Z. Tomsic, Manuel Bravo, Zhongmiao Li, Tyler Crain, Annette Bieniusa, Nuno M. Preguiça, and Marc Shapiro. Cure: Strong semantics meets high availability and low latency. In Proc. of ICDCS ’16. IEEE Computer Society, June 2016.
-  Andrea Cerone, Giovanni Bernardi, and Alexey Gotsman. A framework for transactional consistency models with atomic visibility. In Proc. of CONCUR ’15, volume 42. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, September 2015.
-  Eric Koskinen, Matthew Parkinson, and Maurice Herlihy. Coarse-grained transactions. In Proc. of POPL ’10, January 2010.
-  Douglas Terry, Marvin Theimer, Karin Petersen, Alan Demers, Mike Spreitzer, and Carl Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. of SOSP ’95, December 1995.
-  Alan Fekete, David Gupta, Victor Luchangco, Nancy Lynch, and Alex Shvartsman. Eventually-serializable data services. In Proc. of PODC ’96, May 1996.
-  Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), July 1978.
-  Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys (CSUR), 22(4):299–319, December 1990.
-  Richard Ekwall and André Schiper. Solving atomic broadcast with indirect consensus. In Proc. of DSN ’06, June 2006.
-  Philip A. Bernstein and Nathan Goodman. Multiversion concurrency control—theory and algorithms. ACM Transactions on Database Systems (TODS), 8(4), December 1983.
-  Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2), 1998.
-  Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3), 1990.
-  Maciej Kokociński, Tadeusz Kobus, and Paweł T. Wojciechowski. Brief announcement: On mixing eventual and strong consistency: Bayou revisited. In Proc. of PODC ’19, 2019.
-  Transaction Processing Performance Council. TPC Benchmark C, Standard Specification Version 5.11, 2010.
-  Sachin Hirve, Roberto Palmieri, and Binoy Ravindran. Archie: A speculative replicated transactional system. In Proc. of Middleware ’14, December 2014.
-  Oleg M. Cheiner and Alexander A. Shvartsman. Implementing and evaluating an eventually-serializable data service. In Proc. of PODC ’98, June 1998.
-  Rivka Ladin, Barbara Liskov, Liuba Shrira, and Sanjay Ghemawat. Providing high availability using lazy replication. ACM Transactions on Computer Systems (TOCS), 10(4):360–391, November 1992.
-  Atul Singh, Pedro Fonseca, Petr Kuznetsov, Rodrigo Rodrigues, and Petros Maniatis. Zeno: Eventually consistent byzantine-fault tolerance. In Proc. of NSDI ’09, April 2009.
-  Cheng Li, Daniel Porto, Allen Clement, Johannes Gehrke, Nuno Preguiça, and Rodrigo Rodrigues. Making geo-replicated systems fast as possible, consistent when necessary. In Proc. of OSDI ’12, October 2012.
-  Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich. Global sequence protocol: A robust abstraction for replicated shared state. In Proc. of ECOOP ’15, July 2015.
-  B. Kemme, F. Pedone, G. Alonso, A. Schiper, and M. Wiesmann. Using optimistic atomic broadcast in transaction processing systems. IEEE Tran. on Knowledge and Data Engineering, 15(4), July 2003.
-  Roberto Palmieri, Francesco Quaglia, and Paolo Romano. OSARE: Opportunistic Speculation in Actively REplicated transactional systems. In Proc. of SRDS ’11, Oct 2011.
-  Carlos Eduardo Bezerra, Fernando Pedone, and Robbert Van Renesse. Scalable state-machine replication. In Proc. of DSN ’14, 2014.
-  Michael Wei, Amy Tai, Christopher J. Rossbach, Ittai Abraham, Maithem Munshed, Medhavi Dhawan, Jim Stabile, Udi Wieder, Scott Fritchie, Steven Swanson, Michael J. Freedman, and Dahlia Malkhi. vCorfu: A cloud-scale object store on a shared log. In Proc. of NSDI ’17, March 2017.
-  Marc Shapiro, Masoud Saeida Ardekani, and Gustavo Petri. Consistency in 3D. In Proc. of CONCUR ’16, volume 59, August 2016.
-  Alain Girault, Gregor Gößler, Rachid Guerraoui, Jad Hamza, and Dragos-Adrian Seredinschi. Monotonic prefix consistency in distributed systems. In Proc. of FORTE ’18, June 2018.
-  Vassos Hadzilacos and Sam Toueg. A modular approach to fault-tolerant broadcasts and related problems. Technical report, Ithaca, NY, USA, 1994.
-  Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4), July 1996.
-  Xavier Défago, André Schiper, and Péter Urbán. Total order broadcast and multicast algorithms: Taxonomy and survey. ACM Computing Surveys, 36(4), December 2004.
-  Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
-  Jan Kończak, Nuno Santos, Tomasz Żurkowski, Paweł T. Wojciechowski, and André Schiper. JPaxos: State machine replication based on the Paxos protocol. Technical Report EPFL-REPORT-167765, EPFL, July 2011.
-  Sebastian Burckhardt. Principles of eventual consistency. Foundations and Trends in Programming Languages, 1(1-2):1–150, October 2014.
-  Nuno M. Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo Gonçalves. Dotted version vectors: Logical clocks for optimistic replication. CoRR, abs/1011.5808, 2010.
-  Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. Zab: High-performance broadcast for primary-backup systems. In Proc. of DSN ’11, 2011.
-  Maciej Kokociński, Tadeusz Kobus, and Paweł T. Wojciechowski. Make the leader work: Executive deferred update replication. In Proc. of SRDS ’14, October 2014.
-  L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9), September 1979.
-  Tao Jiang, Qianlong Zhang, Rui Hou, Lin Chai, Sally A. McKee, Zhen Jia, and Ninghui Sun. Understanding the behavior of in-memory computing workloads. In Proc. of IISWC ’14, October 2014.
-  Paweł T. Wojciechowski, Tadeusz Kobus, and Maciej Kokociński. State-machine and deferred-update replication: Analysis and comparison. IEEE Transactions on Parallel and Distributed Systems, 28(3):891–904, 2017.
Appendix A Appendix
A-A Reducing CAB to indirect consensus
In Algorithm 2 each process maintains a set of ed messages in . We use the set to keep the identifiers of messages received but not yet ordered by the process, and the list, to store the identifiers of messages ordered but not yet ed by the process.
In order to a message with predicate , both and are first to other processes (line 9). Once a process s a message, it is added to the set (line 13). If the message has not been already ordered, its identifier is placed in the set (line 15). Whenever the set is not empty, a process initiates another indirect consensus execution and tries to propose the set (line 18). A process agrees to terminate an indirect consensus instance on a set of message identifiers (line 19) only if has already ed all messages, whose identifiers are in (as in indirect consensus), but also for each message the test predicate holds on (line 11). Once a decision on a set of message identifiers is reached, messages whose identifiers are in are deterministically ordered (line 21), and ed in a proper order by when the content of the message is available (line 23).
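The local bookkeeping described above can be sketched in Python as follows. This is a hedged illustration rather than a transcription of Algorithm 2. The class `CabProcess` and the names `received`, `unordered`, `ordered`, and `decided` are assumptions introduced here. Reliable broadcast and indirect consensus themselves are left out. Predicates are modelled as zero-argument callables. Delivery is assumed to wait until both the message content is available and its predicate holds.

```python
class CabProcess:
    """Local state of one process; networking and consensus are not modelled."""

    def __init__(self):
        self.received = {}      # message id -> (payload, predicate)
        self.unordered = set()  # ids received but not yet ordered
        self.ordered = []       # ids ordered but not yet delivered
        self.decided = set()    # every id that has ever been ordered
        self.delivered = []     # payloads handed to the application, in order

    def on_receive(self, msg_id, payload, predicate):
        """Called when the message is delivered by the underlying broadcast."""
        self.received[msg_id] = (payload, predicate)
        if msg_id not in self.decided:       # already ordered ids are not re-proposed
            self.unordered.add(msg_id)
        self.deliver_ready()

    def proposal(self):
        """Ids to propose in the next consensus instance, or None if empty."""
        return frozenset(self.unordered) or None

    def accepts(self, ids):
        """May this process let a consensus instance terminate with ids?"""
        return all(i in self.received and self.received[i][1]() for i in ids)

    def on_decide(self, ids):
        """Called with the decision of a consensus instance."""
        new_ids = sorted(i for i in ids if i not in self.decided)
        self.decided.update(new_ids)
        self.unordered.difference_update(new_ids)
        self.ordered.extend(new_ids)         # deterministic order within a decision
        self.deliver_ready()

    def deliver_ready(self):
        """Deliver from the head of the ordered list while that is possible."""
        while self.ordered and self.ordered[0] in self.received:
            payload, predicate = self.received[self.ordered[0]]
            if not predicate():
                break                        # head blocks everything behind it
            self.delivered.append(payload)
            self.ordered.pop(0)
```

In this sketch `deliver_ready` runs only when a message is received or a decision is made. A fuller implementation would also call it whenever the value of a predicate may have changed.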
Now we will show that Algorithm 2 satisfies CAB. To this end, we formulate a number of lemmas first. [Footnote 7: The original paper on indirect consensus does not include formal correctness proofs of the reduction of AB to indirect consensus.]
Let be a correct process. If, for some message (with predicate ), at , where is some decided instance of indirect consensus (IC), then eventually s .
We give a proof by contradiction. Assume that never s , and is the first such message. Since , by the algorithm, on the list. Since is the first not ed message, . Since has been decided (at some time ), by the no loss property of IC, the function has been evaluated to true at some correct process (say ) at time . Hence for all messages , such that , must be ed at at time , and must be true at at time . By the agreement property of RB eventually s all such messages (including ). By eventual stable evaluation, eventually at . Then, by the algorithm, must , which contradicts our assumption. ∎
Let be any process. If, for some message (with predicate ), at , where is some decided instance of indirect consensus, then there does not exist a process and , so that at , where is some decided instance of indirect consensus.
Without loss of generality assume that . Let us make two observations:
any process can only proceed to round after is decided at , and every process decides only once per IC instance (by the uniform integrity property of IC),
any process proposes through IC (for some message ) only if at at that time.
Now we consider two cases:
Let at . It means that ed and will not again (by the uniform integrity property of RB). Since at then is removed from the set and subsequently added to the list on . Hence will not propose in any IC instance .
has yet to . Since at , then is added to the list on . When s , by the algorithm, will not be added to the set. So again will not propose in any IC instance
For to be decided in any IC instance , some process would have to propose in instance . But, as we saw, this is impossible. ∎
Algorithm 2 satisfies the validity property of CAB.
We need to show that if a correct process s a message with predicate , and eventual stable evaluation holds for , then eventually s . We give a proof by contradiction. Assume that never s (with ).
Since s a message with predicate , there exists some message , such that and , which was by . Since is correct, by the validity property of RB, s . By the agreement property of RB all correct processes will eventually .
Upon y of , adds to the set and then repeatedly proposes until is no longer in . Since all correct processes will eventually , all propositions in some instance of IC will contain . By the termination property of IC, some decision is made for instance . By the uniform validity property of indirect consensus, this decision must be , such that . Then, by Lemma