Causal Consistency and Latency Optimality: Friend or Foe?

03/12/2018 · by Diego Didona, et al.

Causal consistency is an attractive consistency model for replicated data stores. It is provably the strongest model that tolerates partitions, it avoids the long latencies associated with strong consistency, and, especially when using read-only transactions (ROTs), it prevents many of the anomalies of weaker consistency models. Recent work has shown that causal consistency allows "latency-optimal" read-only transactions, which are nonblocking, single-version and single-round in terms of communication. On the surface, this latency optimality is very appealing, as the vast majority of applications are assumed to have read-dominated workloads. In this paper, we show that such "latency-optimal" read-only transactions induce an extra overhead on writes that is so high that performance is actually jeopardized, even in read-dominated workloads. We show this result from a practical and a theoretical angle. First, we present a protocol that implements "almost latency-optimal" ROTs but does not impose on writes any of the overhead of latency-optimal protocols. In this protocol, ROTs are nonblocking and one-version, and they can be configured to use either two or one and a half rounds of client-server communication. We experimentally show that this protocol not only provides better throughput, as expected, but, surprisingly, also better latencies for all but the lowest loads and most read-heavy workloads. Then, we prove that the extra overhead imposed on writes by latency-optimal read-only transactions is inherent, i.e., it is not an artifact of the design we consider and cannot be avoided by any implementation of latency-optimal read-only transactions. We show in particular that this overhead grows linearly with the number of clients.




1 Introduction

Geo-replication is gaining momentum in industry [Nishtala:2013, Lu:2015, Noghabi:2016, Corbett:2013, Bacon:2017, Calder:2011, actor:2017, DeCandia:2007, Verbitski:2017] and academia [Kraska:2013, Crooks:2016b, Sovran:2011, Zhang:2013, Corbett:2013, Moniz:2017, Zhang:2015, Zhang:2016, Nawab:2015] as a design choice for large-scale data platforms to meet the strict latency and availability requirements of on-line applications [Rahman:2017, Terry:2013, Ardekani:2014]. Geo-replication aims to reduce operation latencies, by storing a copy of the data closer to the clients, and to increase availability, by keeping multiple copies of the data at different data centers (DC).

Causal consistency. Causal consistency (CC) [Ahamad:1995] is an attractive consistency model for building geo-replicated data stores. On the one hand, it has an intuitive semantics and avoids many anomalies that are allowed under weaker consistency properties [Vogels:2009, DeCandia:2007]. On the other hand, it avoids the long latencies incurred by strong consistency (e.g., linearizability and strict serializability) [Herlihy:1990, Corbett:2013] and tolerates network partitions [Lloyd:2011]. In fact, CC is provably the strongest consistency that can be achieved in an always-available system [Mahajan:2011, Attiya:2015]. CC is the target consistency level of many systems [Lloyd:2011, Lloyd:2013, Du:2013, Du:2014, Almeida:2013, Bravo:2017, Eunomia:2017], it is used in platforms that support multiple levels of consistency [Balegas:2016, Li:2014], and it is a building block for strong consistency systems [Balegas:2015] and formal checkers of distributed protocols [Gotsman:2016].

Read-only transactions in CC. High-level operations such as producing a web page often translate to multiple reads from the underlying data store [Nishtala:2013]. Ensuring that all these reads are served from the same consistent snapshot avoids undesirable anomalies, in particular the following well-known one: Alice removes Bob from the access list of a photo album and adds a photo to it, but Bob reads the original permissions and the new version of the album [Lloyd:2011, Lu:2016]. Therefore, the vast majority of CC systems provide read-only transactions (ROTs) to read multiple items at once from a causally consistent snapshot [Lloyd:2011, Lloyd:2013, Du:2014, Almeida:2013, Akkoorath:2016, Zawirski:2015]. Large-scale applications are often read-heavy [Atikoglu:2012, Nishtala:2013, Noghabi:2016, Lu:2015], making low-latency ROTs a first-class concern for CC systems.

Earlier CC ROT designs were either blocking [Du:2013, Du:2014, Akkoorath:2016, Almeida:2013] or required multiple rounds of communication to complete [Lloyd:2011, Lloyd:2013, Almeida:2013]. Recent work on the COPS-SNOW system [Lu:2016] has, however, demonstrated that it is possible to perform causally consistent ROTs in a single round of communication, sending only one version of the keys involved, and in a nonblocking fashion. Because it exhibits these three properties, the COPS-SNOW ROT protocol was termed latency-optimal (LO). The protocol achieves latency optimality by imposing additional processing costs on writes. One could argue that this is a correct tradeoff for the common case of read-heavy workloads, because the overhead affects the minority of operations and is thus to the advantage of the majority of them. This paper sheds a different light on this tradeoff.

Contributions. In this paper we show that the extra cost on writes is so high that so-called latency-optimal ROTs in practice exhibit higher latencies than alternative designs, even in read-heavy workloads. This extra cost not only reduces the available processing power, leading to lower throughput, but it also leads to higher resource contention, which results in higher queueing times, and, ultimately, in higher latencies. We demonstrate this counterintuitive result from two angles, a practical and a theoretical one.

1) From a practical standpoint, we show how an existing and widely used design of CC can be improved to achieve almost all the properties of a latency-optimal design, without incurring the overhead on writes that latency optimality implies. We implement this improved design in a system that we call Contrarian. Measurements in a variety of scenarios demonstrate that, for all but the lowest loads and the most read-heavy workloads, Contrarian provides better latencies and throughput than an LO protocol.

2) From a theoretical standpoint, we show that the extra cost imposed on writes to achieve LO ROTs is inherent to CC, i.e., it cannot be avoided by any CC system that implements LO ROTs. We also provide a lower bound on this extra cost in terms of communication overhead. Specifically, we show that the amount of extra information exchanged on writes potentially grows linearly with the number of clients.

The relevance of our theoretical results goes beyond the scope of CC. In fact, they apply to any consistency model strictly stronger than causal consistency, e.g., linearizability [Herlihy:1990]. Moreover, our results are also relevant for systems that implement hybrid consistency models that include CC [Balegas:2016] or that implement strong consistency on top of CC [Balegas:2015].

Roadmap. The remainder of this paper is organized as follows. Section 2 provides introductory concepts and definitions. Section 3 surveys the complexities involved in the implementation of ROTs. Section 4 presents our Contrarian protocol. Section 5 compares Contrarian and an LO design. Section 6 presents our theoretical results. Section 7 discusses related work. Section 8 concludes the paper.

2 System Model

2.1 API

We consider a multi-version key value data store. We denote keys by lower case letters, e.g., x, and the versions of their corresponding values by capital letters, e.g., X. The key value store provides the following operations:

X ← GET(x): A GET operation returns the value of the item identified by x. GET may return ⊥ to show that there is no item yet identified by x.

PUT(x, X): A PUT operation creates a new version X of the item identified by x. If item x does not yet exist, the system creates a new item with value X.

(X, Y, …) ← ROT(x, y, …): A ROT returns a vector (X, Y, …) of versions of the requested keys (x, y, …). A ROT may return ⊥ to show that some item does not exist.

In the remainder of this paper we focus on PUT and ROT operations.

2.2 Causal Consistency

The causality order is a happens-before relationship between any two operations in a given execution [Lamport:1978, Ahamad:1995]. For any two operations a and b, we say that b causally depends on a, and we write a ⇝ b, if and only if at least one of the following conditions holds: i) a and b are operations in a single thread of execution, and a happens before b; ii) there exist x and X such that a creates version X of key x, and b reads X; iii) there exists an operation c such that a ⇝ c and c ⇝ b. If a and b are two PUTs with values X and Y respectively, and a ⇝ b, then (with a slight abuse of notation) we also say that Y causally depends on X, and we write X ⇝ Y.
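The causality order is thus the transitive closure of the program-order and read-from edges. As a toy illustration (a sketch with invented operation names, not part of any protocol discussed in this paper), the relation can be computed explicitly:

```python
from itertools import product

def causally_precedes(edges, a, b):
    """Transitive closure over direct happens-before edges.
    `edges` holds pairs (u, v) derived from same-thread program
    order and read-from relations; all names are illustrative."""
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, y), (y2, z) in product(list(reach), list(reach)):
            if y == y2 and (x, z) not in reach:
                reach.add((x, z))   # chain u -> v -> w into u -> w
                changed = True
    return (a, b) in reach

# Client c1 issues PUT(x, X1) then PUT(y, Y1): a program-order edge.
# Another client reads X1: a read-from edge.
edges = [("put_x", "put_y"), ("put_x", "read_x")]
assert causally_precedes(edges, "put_x", "put_y")
assert not causally_precedes(edges, "put_y", "read_x")  # concurrent
```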

A causally consistent key value store respects the causality order. Intuitively, if a client reads Y and X ⇝ Y, then any subsequent read performed by the client on x returns either X or a newer version; i.e., the client cannot read a version X′ such that X′ ⇝ X. A ROT operation returns item versions from a causally consistent snapshot [Mattern89, Lloyd:2011]: if a ROT returns X and Y such that X ⇝ Y, then there is no X′ such that X ⇝ X′ ⇝ Y.

To circumvent trivial implementations of causal consistency, we require that a value, once written, becomes eventually visible, meaning that it is available to be read by all clients after some finite time [Bailis:2013].

Causal consistency does not establish an order among concurrent (i.e., not causally related) updates on the same key. Hence, different replicas of the same key might diverge and expose different values [Vogels:2009]. We consider a system that eventually converges: if there are no further updates, eventually all replicas of any key take the same value, for instance using the last-writer-wins rule [Thomas:1979].
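A minimal sketch of the last-writer-wins rule mentioned above (field names are illustrative; the tie-break by replica id is the standard way to make the winner deterministic when timestamps collide):

```python
def lww_merge(v1, v2):
    """Last-writer-wins: pick the version with the higher
    (timestamp, replica_id) pair. The replica id breaks timestamp
    ties, so every replica converges on the same winner."""
    return max(v1, v2, key=lambda v: (v["ts"], v["replica"]))

a = {"ts": 5, "replica": 0, "value": "A"}
b = {"ts": 5, "replica": 1, "value": "B"}
# Convergence: the outcome is independent of arrival order.
assert lww_merge(a, b) == lww_merge(b, a) == b
```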

Hereafter, when we use the term causal consistency, eventual visibility and convergence are implied.

2.3 Partitioning and Replication

We target key value stores where the data set is split into partitions. Each key is deterministically assigned to a partition by a hash function. A PUT(x, X) is sent to the partition that stores x. For a ROT(x, …) a read request is sent to all partitions that store keys in the specified key set.

Each partition is replicated at multiple DCs. Our theoretical and practical results hold both for a single DC and for multiple replicated DCs. In the case of replication, we consider a multi-master design, i.e., all replicas of a key accept PUT operations.

3 Challenges in Implementing Read-only Transactions

Figure 1: Challenges in implementing CC ROTs. c1 issues PUT(x, X1) and then PUT(y, Y1). If px returns X0 to c2, then py cannot return Y1, because there is X1 such that X1 ⇝ Y1.

Single DC case. Even in a single DC, partitions involved in a ROT cannot simply return the most recent version of a requested key if one wants to ensure that a ROT observes a causally consistent snapshot. Consider the scenario of Figure 1, with two keys x and y, with initial values X0 and Y0, residing on partitions px and py, respectively. Client c2 performs a ROT on keys x and y, and client c1 performs a PUT on x with value X1 and later a PUT on y with value Y1. Because of asynchrony, the read on x by c2 arrives at px before the PUT by c1 on x, and the read by c2 on y arrives at py after the PUT by c1 on y. Clearly, py cannot return Y1 to c2, because a snapshot consisting of X0 and Y1, with X0 ⇝ X1 ⇝ Y1, violates the causal consistency property for snapshots (see Section 2.2).

COPS [Lloyd:2011] presented the first solution to this problem. It encodes causality as direct dependencies of the form "version Y of y depends on version X of x", stored with Y, and "client c has established a dependency on version X of x", stored with c. These dependencies are passed around as necessary to maintain causality. COPS solves the aforementioned challenge as follows. When c2 performs its ROT, in the first round of the protocol, px and py return the most recent versions of x and y, namely X0 and Y1. Partition py also returns to c2 the dependency "Y1 depends on X1". From this piece of information c2 can determine that X0 and Y1 do not form a causally consistent snapshot. Thus, in the second round of the protocol, c2 requests from px a more recent version of x to obtain a causally consistent snapshot, in this case X1. This protocol is nonblocking, but requires (potentially) two rounds of communication and two versions of the key(s) being communicated. Eiger [Lloyd:2013] improves on this design by using less meta-data, but maintains the potentially two-round, two-version implementation of ROTs.

In later designs for CC systems [Du:2013, Du:2014, Akkoorath:2016], direct dependencies were abandoned in favor of timestamps, which provide a more compact and efficient encoding of causality. To maintain causality, a timestamp is associated with every version of every data item. Each client and each partition also maintain the highest timestamp they have observed. When performing a PUT, a client sends along its timestamp. The timestamp of the newly created version is then one plus the maximum between the client’s timestamp and the partition’s timestamp, thus encoding causality. After completing a PUT, the partition replies to the client with this new version’s timestamp. To implement ROTs it then suffices to pick a timestamp for the snapshot, and send it with the ROTs to the partitions. A partition first makes sure that its timestamp has caught up to the snapshot timestamp [Akkoorath:2016]. This ensures that later a version cannot be created with a lower timestamp than the snapshot timestamp. Then, the partition returns the most recent key values with a timestamp smaller than or equal to the snapshot timestamp.
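The timestamp mechanics described above can be sketched as follows. This is a deliberate simplification with illustrative names (a real partition would also handle replication, persistence and concurrency); it captures how a PUT's timestamp encodes causality and how a snapshot read stays nonblocking with logical clocks:

```python
class Partition:
    """Minimal sketch of timestamp-based CC on one partition."""
    def __init__(self):
        self.clock = 0
        self.versions = {}            # key -> list of (ts, value)

    def put(self, key, value, client_ts):
        # The new version's timestamp is one plus the max of the
        # client's and the partition's clocks, encoding causality.
        self.clock = max(self.clock, client_ts) + 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock             # client updates its own clock

    def read_at(self, key, snap_ts):
        # Nonblocking: jump the logical clock to the snapshot
        # timestamp, so no later PUT can be timestamped <= snap_ts.
        self.clock = max(self.clock, snap_ts)
        older = [(ts, v) for ts, v in self.versions.get(key, [])
                 if ts <= snap_ts]
        return max(older)[1] if older else None

p = Partition()
t1 = p.put("x", "X1", client_ts=0)    # X1 gets timestamp 1
p.put("x", "X2", client_ts=t1)        # X2 gets timestamp 2
assert p.read_at("x", snap_ts=t1) == "X1"   # snapshot excludes X2
```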

The snapshot timestamp is picked by a transaction coordinator [Du:2014, Akkoorath:2016]. Any server can act as the coordinator of a ROT. The client contacts the coordinator and provides it with the highest timestamp it has observed; the coordinator picks the transaction timestamp as the maximum of the client's timestamp and its own, and the client or the coordinator then sends this timestamp, along with the keys to be read, to the partitions. Observe that, in general, the client cannot pick the snapshot timestamp itself, because its own timestamp may be arbitrarily far behind, compromising eventual visibility.

Timestamps may be generated by logical or by physical clocks. Returning to our example of Figure 1, assume that the logical clocks at c1 and c2 are initially 0, that the logical clocks at px and py are initially 90, that the timestamps of X0 and Y0 are 70, and that a transaction coordinator chooses a snapshot timestamp of 100. When receiving the read of c2 with snapshot timestamp 100, px advances its logical clock to 100 and returns X0. When px receives PUT(x, X1), it creates X1 with timestamp 101 and returns that timestamp to c1. c1 then sends the PUT(y, Y1) to py with timestamp 101, and Y1 is created with timestamp 102. When the read of c2 on y arrives with snapshot timestamp 100, py uses the timestamps of Y0 and Y1 to conclude that it needs to return Y0, the most recent version with timestamp smaller than or equal to 100. As with COPS, this protocol is nonblocking; unlike COPS, it requires only a single version of each key, but it always requires two rounds of communication [Akkoorath:2016].

Figure 2: COPS-SNOW design. c1 declares that Y1 depends on X1. Before making Y1 visible, py runs a "readers check" with px and is informed that T1 has observed a snapshot that does not include X1.

A further complication arises when (loosely synchronized) physical clocks are used for timestamping [Du:2014, Akkoorath:2016], since physical clocks, unlike logical clocks, can only move forward with the passage of time. As a result, in our example, when the read on x arrives with snapshot timestamp 100, px has to wait until its physical clock advances to 100 before it can return X0. This makes the protocol blocking, in addition to being one-version and two-round.

(a) 1 1/2 rounds (3 communication steps).
(b) 2 rounds (4 communication steps).
Figure 3: ROT implementation in Contrarian. Numbered circles depict the order of operations. The client always piggybacks on its requests the last snapshot it has seen (not shown), so as to observe monotonically increasing snapshots. Any node involved in a ROT can act as the coordinator of the ROT. Using 1 1/2 rounds reduces the number of communication hops with respect to 2 rounds, at the expense of more messages exchanged to run a ROT.

The question then becomes: does there exist a single-round, single-version, nonblocking protocol for CC ROTs? This question was answered in the affirmative by a follow-up to the COPS and Eiger systems, called COPS-SNOW [Lu:2016]. Using again the previous example, we depict in Figure 2 how the COPS-SNOW protocol works at a high level. Each ROT is given a unique identifier. When a ROT T1 reads X0, px records T1 as a reader of X0. It also records the (logical) time at which the read occurred. On a later PUT on x, T1 is added to the "old readers of x", the set of transactions that have read a version of x that is no longer the most recent version, again together with the logical time at which the read occurred.

When c1 later sends its PUT on y to py, it includes (as in COPS) the information that this PUT depends on X1. Partition py now interrogates px as to whether there are old readers of x, and, if so, records the old readers of x into the old reader record of Y1, together with their logical time. When later the read of T1 on y arrives, py finds T1 in the old reader record of Y1. py therefore knows that it cannot return Y1. Using the logical time in the old reader record, it returns the most recent version of y created before that time, in this case Y0. In the rest of the paper, we refer to this procedure as the readers check. This protocol is one-round, one-version and nonblocking, and is therefore termed latency-optimal.
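The reader and old-reader bookkeeping can be sketched for a single key as follows (names are illustrative; the actual COPS-SNOW protocol tracks this per key version across partitions and garbage-collects old reader records):

```python
class SnowPartition:
    """Sketch of COPS-SNOW bookkeeping for one key."""
    def __init__(self):
        self.time = 0
        self.readers = {}       # version -> {rot_id: read_time}
        self.old_readers = {}   # version -> {rot_id: read_time}
        self.latest = None

    def rot_read(self, rot_id):
        # Record the ROT as a reader of the current version.
        self.time += 1
        self.readers.setdefault(self.latest, {})[rot_id] = self.time
        return self.latest

    def put(self, version):
        # Readers of the overwritten version become "old readers".
        if self.latest is not None:
            self.old_readers[self.latest] = self.readers.pop(self.latest, {})
        self.latest = version

    def old_readers_of(self, version):
        # Answer a readers check from a dependent partition.
        return self.old_readers.get(version, {})

px = SnowPartition()
px.put("X1")
px.rot_read("T1")            # T1 reads X1
px.put("X2")                 # X1 overwritten: T1 is now an old reader
assert "T1" in px.old_readers_of("X1")
```

Note how, under a skewed workload, `old_readers_of` can return a set that grows with the number of concurrent ROTs, which is exactly the overhead the text describes.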

This protocol, however, incurs a very high cost on PUTs. We demonstrate this cost by slightly modifying our example. Let us assume that hundreds of ROTs read X0 before the PUT(x, X1) (as might well occur with a skewed workload in which x is a hot key). Then all these transactions must be stored as readers and then as old readers of X0, communicated to py, and examined by py on each incoming read from a ROT. Let us further modify the example by assuming that c1 reads other keys from partitions different from px and py before writing Y1. Because c1 has established a dependency on all the versions it has read, in order to compute the old readers for Y1, py needs to interrogate not only px, but all these other partitions as well.

Challenges of geo-replication. Further complications arise in a geo-replicated setting with multiple DCs. We assume that keys are replicated asynchronously, so a new key version may arrive at a DC before its causal dependencies. COPS and COPS-SNOW deal with this situation through a technique called dependency checking. When a new key version is replicated, its causal dependencies are sent along. Before the new version is installed, the system checks by means of dependency check messages to other partitions that its causal dependencies are present. When its dependencies have been installed in the DC, the new key version can be installed as well. In COPS-SNOW, in addition, the readers check for the new key version proceeds in a remote data center as it does in the data center where the PUT originated. To amortize the overhead, the dependency check and the readers check are performed as a single protocol.

An alternative technique, commonly used with timestamp-based methods, is to use a stabilization protocol [Babaoglu:1993, Du:2014, Akkoorath:2016]. Variations exist, but in general each data center establishes a cutoff timestamp below which it has received all remote updates. Updates with a timestamp lower than this cutoff can then be installed. Stabilization protocols are cheaper to implement than dependency checking [Du:2014], but they lead to a complication in making ROTs nonblocking, in that one needs to make sure that the snapshot timestamp assigned to a ROT is below the cutoff timestamp, so that there is no blocking upon reading.
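A minimal sketch of the cutoff computation and its interaction with snapshot selection (illustrative names; real stabilization protocols run continuously in the background, and capping the snapshot at the cutoff is only one simple policy for keeping reads nonblocking):

```python
def stable_cutoff(per_partition_received):
    """DC-wide cutoff: the minimum, over the local partitions, of the
    highest timestamp below which each partition has received every
    remote update."""
    return min(per_partition_received)

def pick_snapshot(client_ts, coordinator_ts, cutoff_ts):
    """Cap the ROT snapshot at the cutoff so that no read can block
    waiting for updates that have not yet been replicated."""
    return min(max(client_ts, coordinator_ts), cutoff_ts)

# Three local partitions have received all remote updates up to
# timestamps 17, 23 and 19 respectively.
assert stable_cutoff([17, 23, 19]) == 17
assert pick_snapshot(client_ts=10, coordinator_ts=15, cutoff_ts=17) == 15
```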

4 Contrarian: An efficient but not latency-optimal design

We now present Contrarian, a protocol that provides almost all the properties of latency-optimal ROTs without incurring the overhead that latency optimality implies, thereby achieving low latency, resource efficiency and high throughput.

Our goal is not to propose a radically new design of CC. Rather, we aim to show how an existing and widely employed non-latency-optimal design can be improved to achieve almost all the desirable properties of latency optimality, without incurring the overhead that inherently results from achieving all of them (as we demonstrate in Section 6).

Contrarian builds on the aforementioned coordinator-based design of ROTs and on the stabilization-protocol-based approach (to determine the visibility of remote items) in the geo-replicated setting. These characteristics, all or in part, lie at the core of many state-of-the-art systems, such as Orbe [Du:2013], GentleRain [Du:2014], Cure [Akkoorath:2016] and CausalSpartan [CausalSpartan]. The improvements we propose in Contrarian can thus be employed to improve the design of these and similar systems.

Properties of ROTs. Contrarian’s ROT protocol runs in 1 1/2 rounds, is one-version, and nonblocking. While Contrarian sacrifices a half round in latency compared to the theoretically LO protocol, it retains the low cost of PUTs as in other non-LO designs.

Contrarian implements ROTs in 1 1/2 rounds of communication: one round trip between the client and the partitions (one of which is chosen as the coordinator), with an extra hop from the coordinator to the partitions. As shown in Figure 3, this design requires only three communication steps instead of the four of the classical coordinator-based approach described in Section 3. Contrarian reduces the number of communication hops to improve latency, at the expense of generating more messages to serve a ROT with respect to a 2-round approach. As we shall see in Section 5.3, this leads to a slight throughput loss. Contrarian can also be configured to run ROTs with 2 rounds (even on a per-ROT basis) to maximize throughput.

Contrarian achieves the one-version property because partitions read the freshest version within the snapshot proposed by the coordinator.

Contrarian implements nonblocking ROTs by using logical clocks. In the single-DC case, logical clocks allow a partition to move its local clock’s value to the snapshot timestamp of an incoming ROT, if needed. Hence, ROTs can be served without blocking (as described in Section 3).

We now describe how Contrarian implements geo-replication and retains the nonblocking property in that setting.

Geo-replication. Similarly to Cure [Akkoorath:2016], Contrarian uses dependency vectors to track causality, and employs a stabilization protocol to determine a cutoff vector in a DC (rather than a cutoff timestamp as discussed earlier). Every partition maintains a version vector VV with one entry per DC. VV[i], where i is the index of the local DC, is the timestamp of the latest version created by the partition. VV[j], j ≠ i, is the timestamp of the latest update received from the replica in the j-th DC. A partition sends a heartbeat message with its current clock value to its replicas if it does not process a PUT for a given amount of time.

Periodically, the partitions within a DC exchange their VVs and compute the aggregate minimum vector, called the Global Stable Snapshot (GSS). The GSS represents a lower bound on the snapshot of remote items that have been installed by every node in the DC. The GSS is exchanged between clients and partitions upon each operation, so as to update their views of the snapshot installed in the DC.

Items track causal dependencies by means of dependency vectors DV, with one entry per DC. If DV[i] = t, then the item (potentially) causally depends on all the items originally written in the i-th DC with a timestamp up to t. DV[s], where s is the source replica, is the timestamp of the item, and it is enforced to be higher than any other entry in DV upon creation of the item, to reflect causality. The remote entries of the GSS are used to build the remote entries of the DV of newly created items. An item X can be made visible to clients in a remote DC when the remote entries of its DV are entry-wise smaller than or equal to the GSS on the server that handles X's key in that DC. This condition implies that all of X's dependencies have already been received in that DC.
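The visibility condition can be sketched as an entry-wise vector comparison (illustrative layout: one entry per DC, with the local DC at a known index):

```python
def visible(dv, gss, source_dc, local_dc):
    """A remote item becomes visible in the local DC once every
    remote entry of its dependency vector is covered by the local
    GSS; locally created items are visible immediately."""
    if source_dc == local_dc:
        return True
    return all(dv[i] <= gss[i] for i in range(len(dv)) if i != local_dc)

gss = [50, 40, 30]                    # local DC is index 0
assert visible([10, 35, 20], gss, source_dc=1, local_dc=0)
assert not visible([10, 45, 20], gss, source_dc=1, local_dc=0)
```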

The ROT protocol uses a vector SV to encode a snapshot. The local entry of SV is the maximum between the clock at the coordinator and the highest local timestamp seen by the client. The remote entries of SV are given by the entry-wise maximum between the GSS at the coordinator and the highest GSS seen by the client. An item belongs to the snapshot encoded by SV if its DV is entry-wise smaller than or equal to SV. This protocol is nonblocking, because partitions can move the value of their local clock forward to match the local entry of SV, and the remote entries of SV correspond to a causally consistent snapshot of remote items that have already been received in the DC.
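How the coordinator could assemble the snapshot vector can be sketched as follows (names are illustrative and follow the description above):

```python
def rot_snapshot(coord_clock, coord_gss, client_local_ts, client_gss, local_dc):
    """Sketch of snapshot-vector construction: remote entries from the
    entry-wise max of the two GSS views, local entry from the clocks."""
    sv = [max(c, k) for c, k in zip(coord_gss, client_gss)]
    sv[local_dc] = max(coord_clock, client_local_ts)
    return sv

sv = rot_snapshot(coord_clock=12, coord_gss=[0, 8, 5],
                  client_local_ts=9, client_gss=[0, 6, 7], local_dc=0)
assert sv == [12, 8, 7]
```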

Freshness of the snapshots. The GSS is computed by means of the minimum operator. Because logical clocks on different nodes may advance at different paces, a single laggard node in one DC can keep entries in the GSS from progressing, thus increasing the staleness of the snapshot. A solution to this problem is to use loosely synchronized physical clocks [Du:2013, Du:2014, Akkoorath:2016]. However, physical clocks cannot be moved forward to match the timestamp of an incoming ROT, which can jeopardize the nonblocking property [Akkoorath:2016].

To achieve fresh snapshots and preserve nonblocking ROTs, Contrarian uses Hybrid Logical Physical Clocks (HLC) [Kulkarni:2014]. In brief, an HLC is a logical clock that generates timestamps by taking the maximum between the local physical clock and the highest timestamp seen by the node, plus one. On the one hand, HLCs behave like logical clocks: a server can move its clock forward to match the timestamp of an incoming ROT request, thereby preserving the nonblocking behavior of ROTs. On the other hand, HLCs behave like physical clocks: they advance even in the absence of events and inherit the (loose) synchronicity of the underlying physical clocks. Hence, the stabilization protocol identifies fresh snapshots. Importantly, the correctness of Contrarian does not depend on the synchronization of the clocks, and Contrarian preserves its properties even when using plain logical clocks.

Contrarian is not the first CC system to propose the use of HLCs to generate event timestamps. However, existing systems use HLCs either to avoid blocking PUT operations [CausalSpartan], to reduce replication delays [Eunomia:2017], or to improve the clock synchronization among servers [Mehdi:2017]. Here, we show how HLCs can be used to implement nonblocking ROTs.

5 Experimental Study

Write/read ratio (w) — #PUTs/(#PUTs + #individual reads):
  0.01            Extremely read-heavy workload
  0.05 (default)  Default read-heavy parameter in YCSB [Cooper:2010]
  0.1             Default parameter in COPS-SNOW [Lu:2016]
Size of a ROT (p) — number of partitions involved in a ROT:
  4 (default), 8, 24  Application operations span multiple partitions [Nishtala:2013]
Size of values (b) — value size in bytes (keys take 8 bytes):
  8 (default)     Representative of many production workloads [Atikoglu:2012, Nishtala:2013, Reda:2017]
  128             Default parameter in COPS-SNOW [Lu:2016]
  2048            Representative of workloads with large items
Skew in key popularity (z) — parameter of the zipfian distribution:
  0.99 (default)  Strong skew typical of many production workloads [Atikoglu:2012, Balmau:2017]
  0.8             Moderate skew and default in COPS-SNOW [Lu:2016]
  0               No skew (uniform distribution)

Table 1: Workload parameters considered in the evaluation. Default values are marked "(default)".

5.1 Summary of the results

Main findings. We show that the resource demands of PUT operations in the latency-optimal design are in practice so high that they affect not only the performance of PUTs, but also the performance of ROTs, even with read-heavy workloads. In particular, with the exception of scenarios corresponding to extremely read-heavy workloads and modest loads, where the two designs are comparable, Contrarian achieves ROT latencies lower than those of a latency-optimal design. In addition, Contrarian achieves higher throughput for almost all workloads.

Lessons learnt. In light of our experimental findings, we draw three main conclusions.

Overall system efficiency is key to both low latency and high throughput. It is fundamental to understand the system-wide cost of optimizing an operation, even when the optimized operation dominates the workload.

The high-level theoretical model of a design may not capture the resource utilization dynamics incurred by the design. While a theoretical model represents a powerful lens to compare and qualitatively analyze designs, the choice of a target design for a system should rely also on a more quantitative analysis, e.g., by means of analytical modeling [Tay:2010].

Ultimately, the optimality of a design is closely related to the target workload as well as target architecture and available computational resources.

5.2 Experimental environment

Implementation and optimizations. We implement Contrarian, Cure and the COPS-SNOW design in the same C++ code-base. (Cure supports an API that is different from Contrarian's [Akkoorath:2016]; we modify Cure to comply with the model described in Section 2.) Clients and servers use Google Protocol Buffers [protobuf] for communication. We call CC-LO the system that implements the design of COPS-SNOW. We improve its performance over the original design by a more aggressive eviction of transactions from the old reader record. Specifically, we garbage-collect a ROT id 500 msec after its insertion in the readers list of a key (vs the 5 seconds of the original implementation), and we enforce that each readers-check response contains at most one ROT id per client, i.e., the one corresponding to the most recent ROT of that client. These two optimizations reduce by one order of magnitude the amount of ROT ids exchanged, leading it to approach the lower bound we describe in Section 6. We use NTP [ntp] to synchronize clocks in Contrarian and Cure, and the stabilization protocol is run every 5 msec.

Platform. We use 64 machines equipped with 2x4 AMD Opteron 6212 (16 hardware threads) and 130 GB of RAM and running Ubuntu 16.04 with a 4.4.0-89-generic kernel. We consider a data set sharded across 32 partitions. Each partition is assigned to a different server. We consider a single DC scenario and a replicated scenario with two replicas. Machines communicate over a 10Gbps network.

Using only two replicas is a favorable condition for CC-LO: since the readers check has to be performed also for replicated updates in the remote DCs, the corresponding overhead grows linearly with the number of DCs. We also note that the overheads of the designs we consider are largely unaffected by the communication latency between replicas, because update replication is asynchronous and happens in the background. Thus, emulating a multi-DC scenario over a local area network suffices to capture the most relevant performance dynamics that depend on (geo-)replication [Lu:2016].

Methodology. Experiments run for 90 seconds, and clients issue operations in closed loop. We generate different loads for the system by spawning different numbers of client threads (starting from one thread per client machine). We have run each experiment up to 5 times, with minimal variations between runs, and report the median result.

Workloads. Table 1 summarizes the workload parameters we consider. We use read-heavy workloads, in which clients issue ROTs and PUTs according to a given write/read ratio (w), defined as #PUTs/(#PUTs + #individual reads); a ROT reading p keys counts as p reads. ROTs span a target number of partitions (p), chosen uniformly at random, and read one key per partition. Keys in a partition are chosen according to a zipfian distribution with a given parameter (z). Every partition stores 1M keys, and items have a constant size (b).

The default workload we consider uses w = 0.05, i.e., the default value for the read-heavy workload in YCSB [Cooper:2010]; z = 0.99, which is representative of skewed workloads [Atikoglu:2012]; p = 4, which corresponds to small ROTs (which exacerbate the extra communication cost in Contrarian); and b = 8, as many production workloads are dominated by tiny items [Atikoglu:2012].

Performance metrics. We focus our study on the latencies of ROTs because, by design, CC-LO favors ROT latencies over PUT latencies. As an aside, in our experiments CC-LO incurs up to one order of magnitude higher PUT latencies than Contrarian. Due to space constraints, we focus on average latencies, and report the 99-th percentile of latencies for a subset of the experiments. We measure the throughput of the systems as the number of PUTs and ROTs per second.

Figure 4: Evaluation of Contrarian’s design (2-DC, default workload). Throughput vs average ROT latency (y axis in log scale). Contrarian achieves lower latencies than Cure by means of nonblocking ROTs. Using 1 1/2 rounds of communication reduces latency at low load, but it leads to exchanging more messages than using 2 rounds, and hence to a lower maximum throughput (Section 4).
(a) Throughput vs Avg. ROT latency.
(b) Throughput vs 99-th percentile of ROT latencies.
Figure 5: ROT latencies (average and 99-th percentile) in Contrarian and CC-LO as a function of throughput (default workload). The resource contention induced by the extra overhead that CC-LO imposes on PUTs especially affects tail latency.
Figure 6: ROT ids collected on average during a readers check in CC-LO (1-DC, default workload). The amount of information exchanged grows linearly with the number of clients, matching the bound stated in Section 6. The average number of servers contacted during a readers check is 12.
(a) Throughput vs Avg. ROT latency (1 DC).
(b) Throughput vs Avg. ROT latency (2 DCs).
Figure 7: Performance with different w/r ratios. Contrarian achieves lower ROT latencies than CC-LO, except at very moderate load and for the most read-heavy workload. Contrarian also consistently achieves higher throughput. Higher write intensities hinder the performance of CC-LO because the readers check is triggered more frequently.

5.3 Contrarian design

We first evaluate the design of Contrarian, by assessing its improvement over Cure, and by analyzing the behavior of the system when implementing ROTs in 1 1/2 or 2 rounds of communication. Figure 4 compares the three designs given the default workload in 2 DCs.

Contrarian achieves lower latencies than Cure, up to a factor of 3x (0.35 vs 1.0 msec), by implementing nonblocking ROTs; in Cure, by contrast, the latency of ROTs is affected by clock skew. At low load, the 1 1/2-round version of Contrarian completes ROTs in 0.35 msec vs the 0.45 msec of the 2-round version. The two variants achieve comparable latencies at medium/high load (from 150 to 350 Kops/s). The 2-round version achieves a higher throughput than the 1 1/2-round version (by 8% in this case), because it is more resource-efficient: it requires fewer messages to run ROTs.

Because we focus on latency more than throughput, hereafter we report results corresponding to the 1 1/2-round version of Contrarian.

5.4 Default workload.

Figure 5 reports the performance of Contrarian and CC-LO with the default workload, in the 1-DC and 2-DC scenarios. Figure 5(a) reports average latencies, and Figure 5(b) reports 99-th percentile. Figure 6 reports information on the readers check overhead in CC-LO in the single-DC case.

Latency. Figure 5(a) shows that Contrarian achieves higher latencies than CC-LO only at very moderate load. Under trivial load conditions, ROTs in CC-LO take 0.3 msec on average vs the 0.35 msec of Contrarian. For throughput higher than 60 Kops/s in the 1-DC case and 120 Kops/s in the 2-DC case, Contrarian achieves lower latencies than CC-LO. These load conditions correspond to roughly 25% of Contrarian’s peak throughput. That is, CC-LO achieves slightly better latencies than Contrarian only under load conditions in which the available resources are severely under-utilized.

CC-LO achieves worse latencies than Contrarian for nontrivial load conditions because of the overhead caused by the readers check, needed to achieve latency optimality. This overhead induces higher resource utilization, and hence higher contention on physical resources. Ultimately, this leads to higher latencies, even for ROTs.

Tail latency. The effect of contention on physical resources is especially visible at the tail of the ROT latencies distribution, as shown in Figure 5 (b). CC-LO achieves lower 99-th percentile latencies only at the lowest load condition (0.35 vs 0.45 msec).

Throughput. Contrarian consistently achieves a higher throughput than CC-LO. Contrarian’s maximum throughput is 1.45x CC-LO’s in the 1-DC case, and 1.6x in the 2-DC case. In addition, Contrarian achieves a 1.9x throughput improvement when scaling from 1 to 2 DCs. By contrast, CC-LO improves its throughput only by 1.6x. This result is due to the higher replication costs in CC-LO, which has to communicate the dependency list of a replicated update, and perform the readers check in the remote DC.

Overhead analysis. To provide a sense of the overhead of the readers check, we present some data collected on the single-DC platform at the load at which CC-LO achieves its peak throughput (corresponding to 256 client threads). A readers check targets on average 20 keys, causing the checking partition to contact on average 12 other partitions. A readers check collects on average 252 distinct ROT ids, which almost matches the number of clients in this experiment. However, the same ROT id can appear in the readers set of multiple keys that have to be checked at different partitions. This increases the cumulative number of ROT ids exchanged during the readers-check phase to, on average, 855 ROT ids per readers check (71 per contacted node), corresponding to roughly 7KB of data (using 8 bytes per ROT id). Figure 6 shows that the average overhead of a readers check grows linearly with the number of clients in the system. This result matches our theoretical analysis (see Section 6) and highlights the inherent scalability limitations of latency-optimal ROTs.

5.5 Effect of write intensity.

Figure 7 shows how the performance of the systems is affected by varying the write intensity of the workload.

Figure 8: Effect of the skew in data popularity (single-DC). Skew hampers the performance of CC-LO, because it leads to long causal dependency chains among operations and thus to much information exchanged during the readers check.

Latency. As seen previously, for non-trivial load conditions Contrarian achieves lower ROT latencies than CC-LO in both the 1-DC and 2-DC scenarios and for almost all write intensities. The only exception occurs at the lowest write intensity, and even in this case the differences remain small, especially in the replicated environment.

For w = 0.01 in the single-DC case (Figure 7(a)), at the lowest load CC-LO achieves an average ROT latency of 0.3 msec vs the 0.35 msec of Contrarian; at high load (200 Kops/s), ROTs in CC-LO complete in 1.11 msec vs 1.33 msec in Contrarian. In the 2-DC deployment, however, the latencies achieved by the two systems are practically the same, except under trivial load conditions (Figure 7(b)). This change in the relative performance of the two systems is due to the higher replication cost of CC-LO, whose readers check has to be performed for each update in each DC.

Throughput. Contrarian achieves a higher throughput than CC-LO in almost all scenarios (up to 2.35x for w=0.1 and 2 DCs). The only exception is the w = 0.01 case in the single DC deployment (where CC-LO achieves a throughput that is 10% higher). The throughput of Contrarian grows with the write intensity, because PUTs only touch one partition and are faster than ROTs. Instead, higher write intensities hinder the performance of CC-LO, because they cause more frequent execution of the expensive readers check.

Overhead analysis. Surprisingly, the latency benefits of CC-LO are not very pronounced, even at the lowest write intensities. This is due to the inherent tension between the frequency of writes and their costs. A low write intensity leads to a low frequency at which readers checks are performed. However, it also means that every write is dependent on many reads, resulting in more costly readers checks.
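This tension can be illustrated with a toy single-key model (ours, not the systems’ code): each read adds a potential old reader, and each write triggers a readers check that ships and then clears the accumulated entries.

```python
import random

def readers_check_traffic(write_prob, n_ops, rng):
    """Single-key model of the frequency/cost tension: reads accumulate
    reader entries; each write performs a readers check that communicates
    and clears them. Returns (ids shipped per op, ids shipped per check)."""
    readers = total_ids = checks = 0
    for _ in range(n_ops):
        if rng.random() < write_prob:
            checks += 1
            total_ids += readers   # ROT ids shipped by this readers check
            readers = 0
        else:
            readers += 1           # one more reader to track
    per_op = total_ids / n_ops
    per_check = total_ids / max(checks, 1)
    return per_op, per_check
```

In this model the per-check cost grows roughly as (1 - w)/w as w shrinks, so rarer checks are individually more expensive, while the per-operation traffic stays near 1 - w: lowering the write intensity barely reduces the aggregate readers-check overhead.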

5.6 Effect of skew in data popularity.

Figure 8 depicts how performance varies with skew in data popularity, in the single-DC platform. We focus on this deployment scenario to factor out the replication dynamics of CC-LO and focus on the inherent costs of latency optimality.

Latency. As in the previous cases, Contrarian achieves ROT latencies that are lower than CC-LO’s for non-trivial load conditions (i.e., above roughly 30% of Contrarian’s maximum throughput).

Throughput. The data popularity skew does not noticeably affect Contrarian, whereas it hampers the throughput of CC-LO. The performance of CC-LO degrades because a higher skew causes longer causal dependency chains among operations [Bailis:2013, Du:2014], leading to a higher overhead incurred by the readers checks.

Overhead analysis. With low skew, each key is infrequently accessed, so it is likely that many entries in the old reader record of a key can be garbage-collected by the time the key is involved in a readers check. With higher skew levels, a few hot keys are accessed most of the time, which leads to old reader records with many fresh entries. High skew also leads to more duplicates among the ROT ids retrieved from different partitions, because the same ROT id is likely to be present in many old reader records. Our experimental results (not reported due to space constraints) confirm this analysis. They also show that, at any skew level, the number of ROT ids exchanged during a readers check grows linearly with the number of clients (which matches our theoretical analysis in Section 6).

Figure 9: Effect of ROT sizes (single-DC). The latency advantage of CC-LO at low load decreases as p grows, because contacting more partitions amortizes the cost of the extra communication needed by Contrarian to execute a ROT.

5.7 Effect of size of transactions.

Figure 9 shows the performance of the systems while varying the number of partitions involved in a ROT. We again report results corresponding to the single-DC platform.

Latency. Contrarian achieves ROT latencies that are lower than or comparable to CC-LO’s for any number of partitions involved in a ROT. The latency benefits of CC-LO over Contrarian at low load decrease as p grows, because contacting more partitions amortizes the cost of the extra communication needed by Contrarian to execute a ROT.

Throughput. Contrarian achieves higher throughput than CC-LO (up to 1.45x higher, with p=4) for any value of p. The throughput gap between the two systems shrinks with p, because of the extra messages that are sent in Contrarian from the coordinator to the other partitions involved in a ROT. The fact that only one key per partition is read in our experiment is an adversarial setting for Contrarian, because it exacerbates the cost of the extra communication hop used to implement ROTs. Such communication cost would be amortized if ROTs read multiple items per partition. Contrarian can be configured to resort to the 2-round ROT implementation when contacting a large number of partitions, to increase resource efficiency. We are currently testing this optimization.

5.8 Effect of size of values.

Larger items naturally result in higher CPU and network costs for marshalling, unmarshalling and transmission operations. As a result, the performance gap between the systems shrinks as the size of the item values increases. Even in the case corresponding to large items, however, Contrarian achieves ROT latencies lower than or comparable to the ones achieved by CC-LO, and a 43% higher throughput (in the single-DC scenario). We omit plots and additional details for space constraints.

6 Theoretical Results

Our experimental study shows that the state-of-the-art CC design for LO ROTs delivers sub-optimal performance, caused by the overhead (imposed on PUTs) for dealing with old readers. One can, however, conceive of alternative implementations. For instance, rather than storing old readers with the data items in the partitions, one could contemplate an implementation that stores old readers at the client which does a PUT, and forwards this piece of information to other partitions upon subsequent PUTs. Albeit in a different manner, such an implementation still communicates the old readers to the partition where a PUT is performed. One may then wonder: is there an implementation that avoids this overhead altogether, and thus does not exhibit the performance issues we have seen with CC-LO in Section 5?

We now address this question. We show, by Theorem 1, that the extra overhead on PUTs is inherent to LO. Furthermore, we show that this extra overhead grows with the number of clients, implying growth with the number of ROTs and echoing the measurement results we have reported in Section 5. Our proof is by contradiction and consists of three steps. First, we construct a set E of at least two executions, in each of which different clients issue the same ROT on two keys x and y, and causally related PUTs on x and y then occur. Our assumption for contradiction is as follows: in our construction, although different clients issue the same ROT, the communication between servers remains the same. (In other words, roughly speaking, servers do not notice all clients that issue the ROT.) Then, based on our assumption, we construct another execution β in which some clients issue the ROT while the causally related PUTs on x and y occur. Finally, still based on our assumption, we show that in β, although the ROT runs in parallel with the causally related PUTs, no server is able to tell, and the ROT returns a causally inconsistent snapshot. This completes our proof: (roughly speaking) servers must communicate all clients that issue a ROT, and the worst-case communication is then linear in the number of clients.

Our theorem applies to the system model described in Section 2. Below we start with an elaboration of our system model (Section 6.1) and the definition of LO (Section 6.2). Then we present and prove our theorem (Section 6.3).

6.1 System Model

For ease of definitions (as well as proofs), we assume the existence of an accurate real-time clock to which no partition or client has access. When we mention time, we refer to this clock. Furthermore, when we say that two client operations are concurrent, we mean that the durations of the two operations overlap according to this clock.

Among other things, this clock allows us to give a precise definition of eventual visibility. If PUT(x, X) starts at time t (and eventually ends), then there exists a finite time τ ≥ t such that any ROT that reads x and is issued at or after τ returns either X or some value X′ whose PUT(x, X′) starts no earlier than t; we say that X is visible since τ.

We assume the same APIs as described in Section 2.1. Clients and partitions exchange messages whose delays are finite, but can be unbounded. Clients and partitions can use their local clocks; however, clock drift can be arbitrarily large (so, for some time moment t, some clock may never reach t). To capture the design of CC-LO, we also assume that an idle client sends no message to any partition; that, when performing an operation on some keys, a client sends messages only to the partitions which store values for these keys; that a partition sends messages to a client c only when responding to some operation issued by c; and that clients do not communicate with each other. For simplicity, we consider that any client issues a new operation only after its previous operation returns. We assume at least two partitions and a potentially growing number of clients.

6.2 Properties of LO ROTs

(a) Execution α_i
(b) Execution β (with e_y^j omitted)
Figure 10: Two (in)distinguishable executions in the proof of Theorem 1

We adopt the definition of LO ROTs from [Lu:2016], which comprises three properties: one-round, one-version, and nonblocking. The one-round property states that, for every client c’s ROT T, c sends one message to, and receives one message from, each partition involved in T. The nonblocking property states that any partition p to which c sends a message eventually sends a message (the one defined by the one-round property) back to c, even if p receives no message from any other server during T. A formal definition of the one-version property is more involved. Basically, for every client c’s ROT T, we consider the maximum amount of information that may be calculated by any implementation algorithm of c based on the messages which c receives during T. (We consider the amount of information instead of the plaintext, as values can be encoded in different ways. For example, if a message contains v1 and v1⊕v2 for two values v1, v2 of the same key, then in the plaintext there is only one version, yet some implementation can calculate two versions from the plaintext. The definition of the one-version property excludes such messages as well as such implementations.) The one-version property specifies that, given the messages which c receives from any (non-empty) subset P of the partitions during T, this maximum amount of information contains only one version per key for the keys stored in P.

6.3 The cost of LO

We say that a PUT operation PUT(x, X) completes if (1) PUT(x, X) returns to the client that issued it, and (2) the value X written by PUT(x, X) becomes visible. Our theoretical result (Theorem 1) highlights that the cost of LO may occur before any dangerous PUT completes. (We say that a PUT operation PUT(y, Y2) is dangerous if PUT(y, Y2) causally depends on some PUT(x, X2) that causally depends on, and overwrites, a non-⊥ value X1.)

Theorem 1 (Cost of LO ROTs).

Achieving LO ROTs requires communication, potentially growing linearly with the number of clients, before every dangerous PUT completes.

The intuition behind the cost of LO is that a (dangerous) PUT operation, PUT(y, Y2), eventually completes; however, due to the asynchronous network, a request resulting from a ROT operation that reads keys x and y may arrive after PUT(y, Y2) completes, regardless of the other request(s) resulting from the same ROT. Suppose that the partition storing x has returned value X1 while a value X2 exists such that PUT(x, X2) overwrites X1; then the ROT can be at risk of breaking causal consistency. As a result, the partition which provides X1 should notify the others of the risk, and hence the communication.
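The race behind this intuition can be replayed in a few lines. This toy script (ours, not from the paper) stands in for the two partitions and the delayed read request, with no old-reader tracking in place.

```python
def delayed_rot():
    """Replay of the anomaly: a one-round ROT whose read of y is delayed
    past two causally related PUTs, with no old-reader tracking."""
    store = {"x": "X1", "y": "Y1"}   # X1 and Y1 are initially visible
    snapshot = {}

    # One-round ROT: one request per partition. The request to the
    # partition of x arrives immediately and is answered, nonblocking,
    # with the newest visible value.
    snapshot["x"] = store["x"]       # -> X1

    # Meanwhile a writer issues causally related PUTs:
    # PUT(x, X2), then PUT(y, Y2); Y2 causally depends on X2.
    store["x"] = "X2"
    store["y"] = "Y2"

    # The request to the partition of y was delayed by the network and
    # arrives only now. Without old-reader information, the partition
    # answers with its newest visible value.
    snapshot["y"] = store["y"]       # -> Y2
    return snapshot
```

The returned snapshot contains Y2 but not the X2 it causally depends on, i.e., it is not causally consistent; avoiding exactly this outcome is why the partition serving X1 must track and communicate old readers.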

Following this intuition, we assume that keys x and y belong to different partitions p_x and p_y, respectively. We call a client c an old reader of p_x with respect to PUT(x, X2) if c issues a ROT operation which (1) is concurrent with PUT(x, X2) and PUT(y, Y2), and (2) returns X1. (The definition of an old reader here specifies a certain PUT on x, and is thus more specific than the definition in CC-LO of an old reader of x in general. The reason to specify a certain PUT is to emphasize the causal relation between PUT(x, X2) and PUT(y, Y2). The proof hereafter takes the more specific definition when mentioning old readers.) In general, if c issues a ROT operation that reads x, then we say that c is a reader of p_x. The risk thus lies in the fact that, due to the asynchronous network, any reader can potentially be an old reader.

To have PUT(x, X2) causally precede PUT(y, Y2), for simplicity we consider a scenario where some client ω performs four PUT operations in the following order: PUT(x, X1), PUT(y, Y1), PUT(x, X2) and PUT(y, Y2), and issues each PUT (except for the first one) after the previous PUT completes. To prove Theorem 1, we consider the worst case: all clients except ω can be readers. We identify similar executions in which different subsets of clients are readers. Let Π be the set of all clients except ω. We construct the set E of executions such that each execution has one subset of Π as readers; hence E contains 2^|Π| executions in total. We later show one execution in E in which the communication carrying readers grows linearly with |Π|, and thus prove Theorem 1.

The 2^|Π| executions E. Each execution in E is based on a subset C of Π as readers. Every client c ∈ C issues ROT(x, y) at the same time t_R. By the one-round property, c sends two messages m_x and m_y, to p_x and p_y respectively, at t_R. We denote the event that p_x receives m_x by e_x, and the event that p_y receives m_y by e_y. By the nonblocking property, p_x and p_y can be considered to receive messages from, and send messages to, c at the same time. (Clearly, p_x and p_y may receive messages at different times, and the proof still holds; the same time is assumed for simplicity of presentation.) Finally, c receives the messages from p_x and p_y at the same time. We order the events as follows: X1 and Y1 are visible, e_x, e_y, PUT(x, X2) is issued, c’s ROT returns, PUT(y, Y2) is issued. Let t_P be the time when PUT(y, Y2) completes. Every execution in E takes the same values of these times, while t_P denotes the maximum completion time over E.

To emphasize the burden on p_y, we consider communication that precedes a message that p_y receives: we say that a message m1 precedes a message m2 if (1) some process sends m2 after receiving m1, or (2) there exists a message m3 such that m1 precedes m3 and m3 precedes m2. Clearly, the executions in E are the same until t_R. Since the reader sets differ, the executions may then diverge; in particular, the communication between p_x and p_y may change. We construct all executions in E together: if, at some point in time, in one execution some server sends a message, then we construct all other executions such that the same server sends the same message, unless the server is p_x or p_y, or is contaminated by p_x or p_y. By contamination, we mean that at some point p_x or p_y sends a message m, but we are unable to construct all other executions to do the same; then the message m and the server which receives m are contaminated, and can further contaminate other servers. In our construction, we focus on the non-contaminated messages, which are received at the same time across all executions in E. For other messages, if in at least two executions the same contaminated message can be sent, then we let it be received at the same time across these executions; otherwise, we do not restrict the schedule.

We show that the worst-case execution exists in our construction of E. To do so, we first show a property of E: for any two executions α_i, α_j ∈ E (with different readers), the communication of p_x and p_y must be different, as formalized in Lemma 1. (Lemma 1 abstracts the ways in which p_x and p_y may communicate, so that it is independent of specific implementations. It covers the communication of old readers as implemented in CC-LO, the example implementation introduced at the beginning of this section, as well as, e.g., the following: p_y keeps asking whether a reader of x is also a reader of y, to determine whether all readers of x have arrived at p_y, so that there is no old reader with respect to PUT(y, Y2).)

Lemma 1 (Different readers, different messages).

Consider any two executions α_i, α_j ∈ E. In each α ∈ {α_i, α_j}, denote by msg(α) the messages which p_x or p_y sends to another process and which precede some message that p_y receives before t_P in α, and denote by M(α) the concatenation of the messages in msg(α), ordered by the time when each message is sent. Then M(α_i) ≠ M(α_j).

The main intuition behind the proof is that, if the communication were the same regardless of the readers, p_y would be unable to distinguish readers from old readers. Suppose now by contradiction that M(α_i) = M(α_j). Then our construction of E allows us to construct a special execution β based on α_i (as well as α_j). Let C_k denote the subset of Π that acts as readers in α_k, for k ∈ {i, j}. W.l.o.g., C_i ≠ ∅. We construct β such that the clients in C_i are old readers (and show that β breaks causal consistency due to these old readers).

Execution β with old readers. In β, both C_i and C_j issue ROT(x, y) at t_R. To distinguish between the events (as well as the messages) resulting from C_i and C_j, we use superscripts i and j, respectively. For simplicity of notation, in α_k (k ∈ {i, j}) we also call the two events at the server side (in which p_x and p_y receive the messages from the clients in C_k) e_x^k and e_y^k, as illustrated in Figure 10(a). In β, we thus have four events at the server side: e_x^i, e_y^i, e_x^j, e_y^j. We construct β based on α_i by scheduling e_x^i and e_x^j in β at the same times as in α_i and α_j, and by postponing e_y^i (as well as e_y^j), as illustrated in Figure 10(b). The ordering of events in β is thus different from α_i. More specifically, the order is: X1 and Y1 are visible, e_x^i, e_x^j, PUT(x, X2) is issued, PUT(y, Y2) is issued, e_y^i (for every client in C_i, as e_x^i has occurred), e_y^j (for every client in C_j, not shown in Figure 10(b)), and every such client returns from its ROT. By asynchrony, this order is legitimate, and it turns the clients in C_i into old readers.

Proof of Lemma 1.

Our proof is by contradiction, so assume M(α_i) = M(α_j). Then, according to our construction, p_y does not receive any message preceded by different contaminated messages in α_i and α_j. Therefore, even if we replace the readers C_i (as in α_i) by C_j (as in α_j), then by M(α_i) = M(α_j), p_y is unable to distinguish between α_i and α_j.

So far, our construction of β covers the execution up to t_P. Let us now extend β so that β and α_j are the same after t_P. Namely, in α_j, after t_P, every client c ∈ C_i issues ROT(x, y); and, as illustrated in Figure 10, e_y^i is scheduled at the same time in α_j and in β.

Let (X, Y) be the values returned by c’s ROT in either execution. We first examine α_j. By eventual visibility, as e_y^i occurs after t_P, when X2 and Y2 are visible, Y = Y2. We now examine β. As e_x^i occurs before PUT(x, X2) is issued, X = X1. By p_y’s inability to distinguish between α_j and β, and according to the one-version property, Y = Y2 in β, as in α_j. Thus, in β, c’s ROT returns X = X1 and Y = Y2, a snapshot that is not causally consistent. A contradiction. ∎

Lemma 1 establishes a property of any two executions in E, which implies another property of E: since the communication must differ between any two executions, the number of possibilities for what is communicated grows with the number of elements of E. Recall that |E| is a function of |Π|. Hence, we connect the communication and |Π| in Lemma 2.

Lemma 2 (Lower bound on the cost).

Before PUT(y, Y2) completes, in at least one execution in E, the communication of p_x and p_y takes at least f(|Π|) bits, where f is a linear function.

Proof of Lemma 2.

We index each execution by the set C of clients which issue ROT(x, y) at time t_R. We therefore have 2^|Π| executions: E = {α_C | C ⊆ Π}. Let msg(α_C) be the messages which p_x and p_y send in α_C, as defined in Lemma 1, and let W = {M(α_C) | C ⊆ Π}. By Lemma 1, we can show that |W| = 2^|Π|. It is then impossible for every element of W to have fewer than |Π| bits, as there are only 2^|Π| − 1 binary strings of fewer than |Π| bits. In other words, E contains at least one execution where the communication takes at least |Π| bits, a linear function of |Π|. ∎
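The counting step of this proof can be checked mechanically. A minimal sketch (ours; the helper name is hypothetical):

```python
def strings_shorter_than(n_bits):
    """Number of distinct binary strings with fewer than n_bits bits
    (including the empty string): 2^0 + 2^1 + ... + 2^(n_bits - 1)."""
    return sum(2 ** k for k in range(n_bits))

# There are 2^n reader subsets, and by Lemma 1 each must map to a
# distinct message sequence; since only 2^n - 1 strings are shorter
# than n bits, some execution must communicate at least n bits.
for n in range(1, 20):
    assert strings_shorter_than(n) == 2 ** n - 1
    assert strings_shorter_than(n) < 2 ** n
```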

Recall that |Π| grows linearly with the number of clients. Thus, following Lemma 2, E contains a worst-case execution that supports Theorem 1, which completes the proof of Theorem 1.

Remark on implementations. The proof shows the necessary communication of readers when each client issues one operation. Here we want to make the link back to the implementation of LO ROTs in CC-LO. The reader may wonder in particular about the relationship between the transaction identifiers that are sent as old readers in CC-LO, and the worst-case communication linear in the number of clients derived in the theorem. In fact, the CC-LO implementation considers that clients may issue multiple transactions at the same time, and then different ROTs of a single client should be considered as different readers, hence the use of transaction identifiers to distinguish one from another.

A final comment concerns a straw-man implementation in which each operation is attached the output of a Lamport clock [Lamport:1978] (called logical time below) alone. Such an implementation (without communication of potentially old readers) still fails. The problem is that the number of increments in logical time caused by the ROTs is at most the number of ROTs, i.e., |Π|. Then, for some α_i and α_j, Lemma 1 does not hold, i.e., the communication is the same. Although, when issuing the ROT, a client in C_i can send its logical time to the servers, the logical time sent in α_i and α_j is the same, and thus does not help to distinguish between α_i and α_j, resulting again in a violation of causal consistency. Hence the communication of readers, as Theorem 1 indicates, is still required for this straw-man implementation.
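The pigeonhole argument behind this failure can be made concrete with a short sketch (ours; the helper name is hypothetical):

```python
def logical_time_values(n_rots):
    """A Lamport clock increments at most once per ROT, so after n_rots
    concurrent ROTs the servers observe at most n_rots + 1 distinct
    logical-time values."""
    return n_rots + 1

# For n >= 2 there are 2^n possible reader subsets but only n + 1
# observable logical times, so two different subsets must look identical
# to the servers (pigeonhole), and Lemma 1 cannot hold.
for n in range(2, 20):
    assert logical_time_values(n) < 2 ** n
```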

7 Related work

System ROT latency optimality Write cost Clock
Nonblocking #Rounds #Versions Communication Meta-data
COPS [Lloyd:2011] 1 - deps - Logical
Eiger [Lloyd:2013] 1 - deps - Logical
ChainReaction [Almeida:2013] 2 1 1 1 deps M Logical
Orbe [Du:2013] 2 1 1 - NxM - Logical
GentleRain [Du:2014] 2 1 1 - 1 - Physical
Cure [Akkoorath:2016] 2 1 1 - M - Physical
OCCULT [Mehdi:2017] 1 1 1 - O(P) - Hybrid
POCC [Spirovska:2017] 2 1 1 - M - Physical
COPS-SNOW [Lu:2016] 1 1 1 O(N) deps O(K) Logical
Contrarian 1 1/2 (or 2) 1 1 - M - Hybrid
Table 2: Characterization of CC systems with ROT support, in a geo-replicated setting. N, M and K represent, respectively, the number of partitions, DCs, and clients in a DC. For single-master systems, M represents the number of DCs that act as master for at least one partition. Communication overheads refer either to client-server or to inter-server communication.

Causally consistent systems. Table 2 classifies existing systems with ROT support according to the cost of performing ROT and PUT operations. COPS-SNOW is the only latency-optimal system; it achieves latency optimality at the expense of more costly writes, which carry detailed dependency information and incur extra communication overhead. Previous systems fail to achieve at least one of the sub-properties of latency optimality.

ROTs in COPS and Eiger might require two rounds of client-server communication to complete. The second round is needed if the client reads, in the first round, two items that might belong to different causally consistent snapshots. COPS and Eiger rely on fine-grained protocols to track and check the dependencies of replicated updates (see Section 3), which have been shown to limit their scalability [Du:2013, Du:2014, Akkoorath:2016]. ChainReaction uses a potentially-blocking and potentially multi-round protocol based on a per-DC sequencer node.

Orbe, GentleRain, Cure and POCC use a coordinator-based approach similar to the one described in Section 3, and require two communication rounds. These systems use physical clocks and may block ROTs, either because of clock skew or to wait for the receipt of remote updates.

Occult uses a primary-replica approach and uses HLCs to avoid blocking due to clock skew. Occult implements ROTs that may run in more than one round and may span multiple DCs (which makes the system not always-available). Occult requires at least one dependency timestamp for each DC that hosts a master replica.

Unlike these systems, Contrarian leverages HLCs to implement ROTs that are always-available, nonblocking and always complete in 1 1/2 (or 2) rounds of communication.

Other CC systems include SwiftCloud [Zawirski:2015], Bolt-On [Bailis:2013], Saturn [Bravo:2017], Bayou [Petersen:1997, Terry:1995], PRACTI [Belaramani:2006], ISIS [Birman:1987], lazy replication [Ladin:1992], causal memory [Ahamad:1995], EunomiaKV [Eunomia:2017] and CausalSpartan [CausalSpartan]. These systems either do not support ROTs, or target a model different from the one considered in this paper, e.g., they do not shard the data set into partitions. Our theoretical results require at least two partitions. Investigating the cost of LO in other system models is an avenue for future work.

CC is also implemented by systems that support different consistency levels [Crooks:2016], implement strong consistency on top of CC [Balegas:2015], and combine different consistency levels depending on the semantics of operations [Li:2014, Balegas:2016] or on target SLAs [Ardekani:2014, Terry:2013]. Our theorem provides a lower bound on the overhead of latency-optimal ROTs with CC. Hence, any system that implements CC or a strictly stronger consistency level cannot avoid such overhead. We are investigating how the lower bound on this overhead varies depending on the consistency level, and what is its effect on performance.

Theoretical results on causal consistency. Causality was introduced by Lamport [Lamport:1978]. Hutto and Ahamad [Hutto:1990] provided the first definition of causal consistency, later revisited from different angles [Mosberger:1993, Adya:1999, Crooks:2016, Viotti:2016]. Mahajan et al. have proved that real-time CC is the strongest consistency level that can be obtained in an always-available and one-way convergent system [Mahajan:2011]. Attiya et al. have introduced the observable CC model and have shown that it is the strongest that can be achieved by an eventually consistent data store implementing multi-value registers [Attiya:2015].

The SNOW theorem [Lu:2016] shows that LO can be achieved by any system that is not strictly serializable [Papadimitriou:1979] or does not support write transactions. Based on this result, the SNOW paper suggests that any protocol that meets one of these two conditions can be improved to be latency-optimal, and indicates that a way to achieve this is to shift the overhead from ROTs to writes. In this paper, we prove that achieving latency optimality in CC implies an extra cost on writes, which is inherent and significant.

Bailis et al. study the overhead of replication and dependency tracking in geo-replicated CC systems [Bailis:2012b]. By contrast, we investigate the inherent cost of latency-optimal CC designs, i.e., even in absence of (geo-)replication.

8 Conclusion

Causally consistent read-only transactions are an attractive primitive for large-scale systems, as they eliminate a number of anomalies and facilitate the task of developers. Furthermore, given that most applications are expected to be read-dominated, low latency of read-only transactions is of paramount importance to overall system performance. It would therefore appear that latency-optimal read-only transactions, which provide a nonblocking, single-version and single-round implementation, are particularly appealing. The catch is that these latency-optimal protocols impose an overhead on writes that is so high that it jeopardizes performance, even in read-heavy workloads.

In this paper, we present an “almost latency-optimal” protocol that preserves the nonblocking and one-version properties of its latency-optimal counterparts, but sacrifices the one-round property, running instead in one and a half rounds. In exchange, this protocol avoids the entire overhead that latency-optimal protocols impose on writes. As a result, measurements show that this “almost latency-optimal” protocol outperforms latency-optimal protocols, not only in terms of throughput, but also in terms of latency, for all but the lowest loads and the most read-heavy workloads.

In addition, we show that the overhead of the latency-optimal protocol is inherent. In other words, it is not an artifact of current implementations. In particular, we show that this overhead grows linearly with the number of clients.