PaRiS: Causally Consistent Transactions with Non-blocking Reads and Partial Replication

Kristina Spirovska et al. · 02/25/2019

Geo-replicated data platforms are at the backbone of several large-scale online services. Transactional Causal Consistency (TCC) is an attractive consistency level for building such platforms. TCC avoids many anomalies of eventual consistency, eschews the synchronization costs of strong consistency, and supports interactive read-write transactions. Partial replication is another attractive design choice for building geo-replicated platforms, as it increases the storage capacity and reduces update propagation costs. This paper presents PaRiS, the first TCC system that supports partial replication and implements non-blocking parallel read operations, whose latency is paramount for the performance of read-intensive applications. PaRiS relies on a novel protocol to track dependencies, called Universal Stable Time (UST). By means of a lightweight background gossip process, UST identifies a snapshot of the data that has been installed by every DC in the system. Hence, transactions can consistently read from such a snapshot on any server in any replication site without having to block. Moreover, PaRiS requires only one timestamp to track dependencies and define transactional snapshots, thereby achieving resource efficiency and scalability. We evaluate PaRiS on a large-scale AWS deployment composed of up to 10 replication sites. We show that PaRiS scales well with the number of DCs and partitions, while being able to handle larger data-sets than existing solutions that assume full replication. We also demonstrate a performance gain of non-blocking reads vs. a blocking alternative (up to 1.47x higher throughput with 5.91x lower latency for read-dominated workloads and up to 1.46x higher throughput with 20.56x lower latency for write-heavy workloads).


I Introduction

Modern large-scale data platforms rely on geo-replication and sharding to store and manipulate large volumes of data efficiently. Geo-replication allows keeping a copy of the data in a data center (DC) closer to the users, thus reducing access latencies. Sharding enables horizontal scalability, by slicing the dataset in disjoint partitions, each of which can be assigned to a different server.

In geo-replicated environments, partial replication is an effective technique to improve storage capacity and reduce replication costs. In partial replication, each DC stores only a subset of the partitions. Hence, the system can scale to a higher number of partitions with respect to a full replication approach, and updates performed in one DC are propagated to fewer replicas.

Causal Consistency (CC) has emerged recently as an attractive consistency model for geo-replicated data platforms [Lloyd:2011, Lloyd:2013, Zawirski:2015, Akkoorath:2016, Mehdi:2017, Du:2014, Du:2013, Almeida:2013, Spirovska:2017, Didona:2018, Roohitavaf:2017]. CC provides intuitive semantics and avoids many anomalies allowed by weaker models, such as eventual consistency [DeCandia:2007]. Moreover, CC eschews the high synchronization costs of stronger consistency levels, such as linearizability [Herlihy:1990]. Transactional CC (TCC) [Zawirski:2015, Akkoorath:2016] extends CC by providing transactions that observe a causally consistent view of the data, and can perform atomic multi-object writes.


PaRiS. This paper presents PaRiS, the first system that implements TCC in a partially replicated data platform, and that supports non-blocking parallel read operations (and hence, non-blocking one-round read-only transactions). Parallel non-blocking reads are an important requirement to guarantee good performance [Lu:2016, Tomsic:2018, Corbett:2013], especially for the important and wide class of read-intensive applications [Urdaneta:2009, Nishtala:2013, Noghabi:2016].

Achieving non-blocking parallel transactional reads with partial replication is challenging. This is mainly because different reads within the same transaction may be served in parallel by servers in different DCs. In existing approaches, a DC is not aware of the set of transactions performed by other DCs. Ultimately, this can lead to consistency violations because a server in a DC is unaware of which data is returned by other servers in other DCs to the same transaction.

PaRiS addresses this issue by means of a new causal dependency tracking protocol that we call Universal Stable Time (UST). In short, UST identifies a snapshot of the dataset that has been installed in all DCs. Hence, a transaction can read from such a snapshot in any DC without blocking. In addition to the snapshot defined by UST, PaRiS equips clients with a private cache, in which clients store their own updates that are not yet reflected in the snapshot identified by the UST. This allows PaRiS to achieve TCC even though it exposes to clients a snapshot of the data store that is slightly in the past. PaRiS implements UST efficiently as a periodic, lightweight intra- and inter-DC gossiping protocol. In addition, PaRiS uses only one timestamp to track dependencies and to define transactional snapshots, thus enabling scalability both in the number of DCs and in the number of partitions per DC.

The trade-off made by PaRiS, which is provably unavoidable [Tomsic:2018], is to expose to transactions a view of the data that is slightly in the past. We argue that a moderate increase in data staleness is a reasonable price to pay for the performance benefits brought by PaRiS.

Overall, PaRiS achieves a triad of low latency, high storage capacity and rich transactional semantics. This represents a significant improvement over existing systems, which either do not support partial replication, do not support generic read-write transactions, or block read operations to preserve consistency.

We evaluate PaRiS on a large-scale AWS deployment comprising up to 10 DCs, with heterogeneous workloads exhibiting different degrees of locality in the data accesses. We compare PaRiS with a variant of PaRiS that supports partial replication by blocking read operations. We show that PaRiS scales well with the number of DCs and partitions, while being able to handle larger datasets than existing solutions that assume full replication.


Roadmap. The remainder of the paper is organized as follows. Section II presents the system model. Section III describes the design of PaRiS. Section IV describes the protocols and correctness of PaRiS. Section V reports the results of the evaluation of PaRiS. Section VI discusses related work. Section VII concludes the paper.

II Definitions and System Model

II-A Causal Consistency

A system is causally consistent if its servers return values that are consistent with the order defined by the causality relationship. Causality is defined as a happens-before relationship between two events [Ahamad:1995, Lamport:1978]. For two operations a and b, we say that b causally depends on a, and write a ⇝ b, if and only if at least one of the following conditions holds: (i) a and b are operations in a single thread of execution, and a happens before b; (ii) a is a write operation, b is a read operation, and b reads the version written by a; (iii) there is some other operation c such that a ⇝ c and c ⇝ b. Intuitively, CC ensures that if a client has seen the effects of operation b and a ⇝ b, then the client also sees the effects of operation a.

We use lower-case letters, e.g., x, to refer to a key and the corresponding capital letter, e.g., X, to refer to a version of that key. We say that Y depends on X if the write of Y causally depends on the write of X.
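The definition above can be stated compactly as follows; the predicates po (program order) and rf (reads-from) are shorthand introduced here, not notation from the paper.

```latex
% a \leadsto b: operation b causally depends on operation a.
a \leadsto b \;\iff\;
    \underbrace{po(a,b)}_{\text{same thread, $a$ before $b$}}
    \;\lor\;
    \underbrace{rf(a,b)}_{\text{$b$ reads the version written by $a$}}
    \;\lor\;
    \exists c:\; a \leadsto c \,\land\, c \leadsto b .
```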

II-B Transactional Causal Consistency

Semantics. TCC extends CC by means of interactive read-write transactions in which clients can read and write multiple items. TCC enforces two properties.


1. Read from a causal snapshot. A causally consistent snapshot is a set of item versions such that all causal dependencies of those versions are also included in the snapshot. All transactions read from a causally consistent snapshot. For any two items, x and y, if X ⇝ Y and both X and Y belong to the same causally consistent snapshot, then there is no other version X′ of x such that X′ is created after X and X′ ⇝ Y.


Transactional reads from a causal snapshot prevent undesirable anomalies which can arise by simply issuing multiple consecutive single read operations  [Lloyd:2011].

The majority of existing CC systems implement transactional reads by means of one-shot read-only transactions [Du:2014, Lloyd:2011, Lloyd:2013, Lu:2016].


2. Atomic updates. Either all the items written by a transaction are visible to other transactions, or none of them is. If a transaction writes X and Y, then any snapshot visible to other transactions either includes both X and Y or neither of them.


Conflict resolution. Two writes are conflicting if they are not related by causality and update the same key. Conflicting writes are resolved by means of a commutative and associative function that decides the value corresponding to a key given its current value and the set of updates on the key [Lloyd:2011].

For simplicity, PaRiS resolves write conflicts using the last-writer-wins rule [Thomas:1979] based on the timestamp of the updates. Possible ties are settled by looking at the id of the DC combined with the identifier of the transaction that created the update. PaRiS can be extended to support other conflict resolution mechanisms [Akkoorath:2016, Lloyd:2011, Lloyd:2013, Shapiro:2011].
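As a concrete illustration of the last-writer-wins rule and its tie-breaking, the sketch below compares two conflicting versions using the total order stated in Section IV-B (timestamp, then transaction id, then source DC id); the struct and field names are placeholders, not PaRiS's actual identifiers.

```cpp
#include <cstdint>
#include <tuple>

// Minimal version descriptor; field names are illustrative only.
struct Version {
    uint64_t ts;        // update (commit) timestamp
    uint64_t tx_id;     // id of the transaction that created the version
    uint32_t source_dc; // id of the DC where the version was created
};

// Last-writer-wins: the version with the higher timestamp wins; ties are
// broken by transaction id and then by source DC id.
inline bool wins(const Version& a, const Version& b) {
    return std::tie(a.ts, a.tx_id, a.source_dc) >
           std::tie(b.ts, b.tx_id, b.source_dc);
}
```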

II-C System Model

We assume a distributed key-value store whose dataset is split into N partitions. Each key is deterministically assigned to one partition by a hash function. We assume that each server is assigned a single partition, and we denote by p_x the server responsible for key x. Each partition is replicated at R different DCs, where R is the replication factor of the data. There are M DCs in total, with R < M; hence, only a fragment of the full dataset is present in each DC.
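A minimal sketch of the deterministic key-to-partition assignment described above; std::hash and the modulo scheme are illustrative choices, not necessarily the hash function PaRiS uses.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Deterministically maps a key to one of num_partitions (N) partitions.
inline uint32_t partition_of(const std::string& key, uint32_t num_partitions) {
    return static_cast<uint32_t>(std::hash<std::string>{}(key) % num_partitions);
}
```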

We assume a multi-master system, i.e., all replicas can update keys they are responsible for. Updates are replicated asynchronously to remote DCs.

We assume a multi-version data store. An update operation creates a new version of a key. Each version stores the value corresponding to the key and some meta-data to track causality. The system periodically garbage-collects old versions of keys. Partitions communicate through point-to-point lossless FIFO channels (e.g., a TCP socket).

At the beginning of a session, a client c connects to a partition p in one DC, chosen according to some load-balancing scheme. This DC is referred to as c's local DC. The partition p serves all of c's operations. If p does not store a key x targeted by an operation, p transparently forwards the operation to a replica of p_x. c does not issue the next operation until it receives the reply to the current one.

Availability. We use the term availability to indicate the ability of a system to never block a client operation in the presence of a network partition among DCs [Brewer:2000].

II-D APIs

PaRiS’s programming interface offers the following operations for interactive read-write transactions:


START: starts an interactive transaction T and returns T's transaction identifier and the causal snapshot S visible to T.


READ: reads in parallel the set of items corresponding to the input set of keys, within the transaction identified by the given transaction id.


WRITE: updates a set of keys, given as input, to the corresponding values, within the transaction identified by the given transaction id.


COMMIT: finalizes the transaction identified by the given transaction id, atomically updating the items that have been modified by means of a WRITE operation in the scope of the transaction, if any.


After starting a transaction T, a client can issue multiple read and write operations, each of which can operate on multiple keys, before committing T.
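A possible C++ rendering of this interface; the type names and exact signatures are assumptions made here to make the four operations concrete, not the actual API of PaRiS.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Key = std::string;
using Value = std::string;
using TxId = uint64_t;
using Timestamp = uint64_t;

// Hypothetical client-side handle for PaRiS interactive transactions.
class TransactionalClient {
public:
    virtual ~TransactionalClient() = default;
    // START: begins an interactive transaction and returns its id and snapshot.
    virtual std::pair<TxId, Timestamp> start() = 0;
    // READ: reads the given keys in parallel within transaction tx.
    virtual std::map<Key, Value> read(TxId tx, const std::vector<Key>& keys) = 0;
    // WRITE: buffers updates of the given key-value pairs within transaction tx.
    virtual void write(TxId tx, const std::map<Key, Value>& updates) = 0;
    // COMMIT: finalizes tx, atomically installing its buffered writes, if any.
    virtual void commit(TxId tx) = 0;
};
```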


Because, under the TCC programming model, conflicting updates are resolved rather than forbidden, transactions in PaRiS never abort due to conflicts. Transactions can still abort because of system-related issues, e.g., insufficient space on a server to perform an update, but for simplicity we do not consider such aborts in this paper.

III Design of PaRiS

The main goal of PaRiS is to implement TCC in a partially replicated sharded system, while providing non-blocking parallel read operations and achieving scalability. We describe the challenges of achieving these goals in Section III-A. Next, in Section III-B we present how PaRiS overcomes these challenges by means of a novel dependency tracking protocol and a small client-side cache. Finally, in Section III-C we discuss fault tolerance and availability in PaRiS.

III-A Challenges of partial replication

Since TCC must simultaneously guarantee the preservation of causal consistency and the atomicity of multi-item writes, achieving non-blocking reads while maintaining TCC is challenging. In a fully replicated environment, the task of enforcing this behavior is eased by two invariants: (i) all remote updates are received by all DCs, and hence every DC receives the dependencies of each update; and (ii) all updates of a transaction are performed within the same DC, and hence all the updates of the same transaction can be found in each DC. As an example, assume that X ⇝ Y, and that X and Y are the latest versions of their keys, x and y, respectively. Causal consistency dictates that if a client reads x and y in a transaction, then if Y is returned, X has to be returned as well. Furthermore, assume that Z, the latest version of key z, has been written by the same transaction as Y. TCC further implies that either both Y and Z are visible to the transaction, or neither of them is. Hence, some form of communication among partitions in the same DC is enough to ensure that Y is visible in the DC only after X, and that Y and Z are made visible atomically.

Partial replication, instead, violates the two invariants described above. This leads to a new set of challenges in enforcing consistency and atomicity. First, tracking consistency is harder. In the previous example with keys x and y, X may be replicated to DC1 and Y to DC2. Then, assume that a transaction originating at a third DC reads x in DC1 and y in DC2. The transaction has to ensure that Y is read in DC2 only if X is also read, concurrently, in DC1. Similarly, enforcing atomicity is harder. Assume that Y is replicated to DC1 and Z to DC2, and that a transaction originating at a third DC reads y and z. Then, the transaction has to ensure that either both Y and Z are read or neither of them is, in an atomic fashion.

Addressing these two challenges is made more difficult by the fact that a read operation can target any replica of the target key. Hence, consistency and atomicity have to be preserved despite the fact that different transactions targeting the same keys can hit different replicas of those keys. The complexity of the problem is further exacerbated by the fact that different replicas of a version X may be in different DCs that store different subsets of the dependencies of X.

One possible solution to these challenges could be allowing more than one round of client-server communication to perform a single parallel read operation. Servers can return possibly inconsistent versions in the first round(s), and the client can detect and fix these violations by issuing additional read requests [Lloyd:2011, Lloyd:2013, Mehdi:2017].

Another possible solution could be blocking a read on a partition until the partition knows that all other involved partitions are serving the read operations from the same causal snapshot of the data store [Almeida:2013, Du:2014, Akkoorath:2016]. Clearly, these solutions increase the latency experienced by the transactions, and reduce the achievable throughput, because they introduce waiting times or require extra communication.

III-B Non-blocking reads in partial replication by PaRiS

PaRiS addresses these challenges by a combination of a novel dependency tracking protocol, called UST, and a client-side cache. UST identifies snapshots of the data store that can be read by transactions without blocking. These snapshots are such that they have been already installed by every DC, so they are slightly in the past. The client-side cache stores the versions written by the client that are not yet reflected in the snapshot determined by UST. This allows clients to observe monotonically increasing snapshots even if UST identifies slightly stale snapshots. We now explain how PaRiS leverages UST and the client-side cache in its transactional protocol.

Transactions in PaRiS. PaRiS identifies key versions and snapshots by means of a scalar timestamp. Upon starting, a transaction is assigned a snapshot timestamp that, together with the content of the client-side cache, determines the snapshot visible to the transaction. Upon completing, each transaction that writes at least one key is assigned a commit timestamp that reflects causality, determined by means of a two-phase commit (2PC) protocol.

Non-blocking reads in PaRiS. The key idea in PaRiS is to identify a snapshot that has been installed by each DC. We call such a snapshot stable. A stable snapshot with timestamp t contains the versions with a timestamp lower than or equal to t, and indicates that every transaction T with a commit timestamp lower than or equal to t has been applied in every data center that stores a replica of a key written by T.

Hence, a transaction can read from a stable snapshot without blocking or running multiple client-server rounds, regardless of the DC in which the individual reads of the transaction are performed.

A coordinator partition is responsible for assigning a stable snapshot to a transaction that starts. Any node can act as the coordinator of any transaction. The coordinator enforces that the snapshots assigned to transactions issued by the same client advance monotonically. To this end, the client piggybacks its last observed snapshot timestamp on the transaction start message.

UST.

UST is the new protocol implemented by PaRiS to identify, in a scalable fashion, stable snapshots. Each partition maintains a version vector that indicates the timestamps of the latest applied transactions, both the ones executed by the partition itself and the ones received from remote replicas. Periodically, partitions within the same DC and across DCs, by means of a gossiping protocol, exchange the minimum of the timestamps in their version vectors. The aggregate minimum of the exchanged values identifies a timestamp such that all transactions with lower timestamps have been applied by the corresponding partitions in every DC. Namely, such aggregate minimum timestamp identifies the stable snapshot that transactions are assigned upon starting.
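The sketch below illustrates the aggregation UST performs, under the assumption that each partition's contribution to the gossip is the minimum entry of its version vector; the data layout is illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Timestamp = uint64_t;

// Contribution of one partition to the gossip: the minimum entry of its
// version vector, i.e., a timestamp up to which it has applied both its own
// transactions and those received from its remote replicas.
// Assumes a non-empty version vector.
inline Timestamp partition_contribution(const std::vector<Timestamp>& version_vector) {
    return *std::min_element(version_vector.begin(), version_vector.end());
}

// The UST is the aggregate minimum of the contributions of all partitions in
// all DCs: every transaction with a lower timestamp has been applied
// everywhere, so the snapshot it identifies is stable.
inline Timestamp universal_stable_time(const std::vector<Timestamp>& contributions) {
    Timestamp ust = contributions.empty() ? 0 : contributions.front();
    for (Timestamp t : contributions) ust = std::min(ust, t);
    return ust;
}
```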

UST identifies stable causally consistent snapshots with a single timestamp, regardless of the scale of the system. This enables high scalability and efficiency, by reducing partition-to-partition and client-to-partition communication overhead.

Cache. UST alone cannot enforce causality. In fact, the commit timestamp assigned to a transaction T issued by a client c is higher than the stable snapshot assigned to T. On the one hand, this allows commit timestamps to reflect causality. On the other hand, it means that the commit timestamp of T may be higher than the snapshot assigned to the next transaction issued by c. In that case, such a snapshot would not include the modifications performed by c in T, which may lead to a violation of the read-your-own-writes property required by causal consistency.

PaRiS overcomes this issue by storing on the client the versions written by the client itself. Upon receiving a snapshot timestamp, a client c removes from its cache all versions with a timestamp lower than or equal to it. These versions are, in fact, included in the snapshots visible to any future transaction issued by c. Upon reading a key x, c first checks its cache. If a version of x exists in the cache, that version has to be read by c to enforce the read-your-own-writes property. Otherwise, c issues a read request to a replica. In both cases, the read completes without blocking.
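A sketch of the client-side cache behavior just described: entries covered by a newly received snapshot timestamp are pruned, and reads check the cache before contacting a replica. Class and member names are illustrative, not taken from the PaRiS code-base.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

using Key = std::string;
using Value = std::string;
using Timestamp = uint64_t;

struct CachedVersion {
    Value value;
    Timestamp commit_ts;  // commit timestamp of the transaction that wrote it
};

class ClientCache {
public:
    // Called when the client receives snapshot timestamp st from the coordinator:
    // versions already covered by the stable snapshot can be dropped.
    void prune(Timestamp st) {
        for (auto it = entries_.begin(); it != entries_.end();) {
            if (it->second.commit_ts <= st) it = entries_.erase(it);
            else ++it;
        }
    }

    // Read path: a cached version (the client's own write, newer than the
    // snapshot) must be returned to preserve read-your-own-writes.
    std::optional<Value> lookup(const Key& k) const {
        auto it = entries_.find(k);
        if (it == entries_.end()) return std::nullopt;
        return it->second.value;
    }

    // Called after a commit, for every key written by the transaction.
    void put(const Key& k, Value v, Timestamp ct) { entries_[k] = {std::move(v), ct}; }

private:
    std::unordered_map<Key, CachedVersion> entries_;
};
```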

Generating timestamps. As in recent proposals [Spirovska:2018, Mehdi:2017, Roohitavaf:2017, Gunawardhana:2017], PaRiS uses Hybrid Logical Physical Clocks (HLC) [Kulkarni:2014] to generate timestamps. An HLC is a logical clock whose value on a partition is the maximum of the local physical clock and the highest timestamp seen by the partition plus one. Like logical clocks, HLCs can be moved forward to match the timestamp of an incoming event, without blocking to wait for the local physical clock to catch up with the timestamp of the event. Like physical clocks, HLCs advance in the absence of events and at approximately the same pace. Hence, HLCs improve the freshness of the snapshot determined by UST over a solution that uses logical clocks, which can advance at very different rates on different partitions.
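A sketch of the simplified HLC rule described above, where the clock value is the maximum of the local physical clock and the highest timestamp seen so far plus one; real HLC implementations [Kulkarni:2014] keep separate physical and logical components, which is omitted here.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

using Timestamp = uint64_t;

// Simplified hybrid logical clock, following the description in the text.
class HybridClock {
public:
    // Generates a new local timestamp (e.g., a proposed prepare time).
    Timestamp tick() {
        highest_seen_ = std::max(physical_now(), highest_seen_ + 1);
        return highest_seen_;
    }

    // Called when a timestamp is observed on an incoming message; the clock
    // moves forward without waiting for the physical clock to catch up.
    void observe(Timestamp remote_ts) {
        highest_seen_ = std::max(highest_seen_, remote_ts);
    }

private:
    static Timestamp physical_now() {
        using namespace std::chrono;
        return static_cast<Timestamp>(
            duration_cast<microseconds>(system_clock::now().time_since_epoch()).count());
    }
    Timestamp highest_seen_ = 0;
};
```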

III-C Fault tolerance

Failures (within a DC). PaRiS can tolerate failures of a server by integrating existing solutions for 2PC-based systems, e.g., based on Paxos [Lamport:1998]. Reads are non-blocking also with such mechanisms enabled, because they access a snapshot corresponding to transactions that have been already committed.

As in previous systems based on dependency aggregation protocols, the failure of a server blocks the progress of UST, but only as long as a backup has not taken over.

Client failures are transparent to the system. The clients only keep local meta-data, and cache data that have already been committed to the data-store. The contexts corresponding to transactions of failed clients are cleaned in the background after a timeout.

Availability (among DCs). PaRiS achieves availability in a DC as long as at least one replica of each partition is reachable from that DC. In fact, remote operations can be performed at any DC, because the snapshot visible to a transaction is the same regardless of the partition contacted to serve an operation. In addition, local operations never block.

If all replicas of one partition cannot be reached by a DC, then PaRiS cannot complete remote operations that target that partition, thus leading to unavailability.

If a DC partitions from the rest of the system, then the UST freezes at all DCs, because it is computed as a system-wide minimum. As a result, transactions see increasingly stale snapshots of the data, and the client cache cannot be pruned.

IV Protocols of PaRiS

We now describe the meta-data stored and the protocols implemented by the clients and servers in PaRiS.

IV-A Meta-data

Items. An item version is represented as a tuple containing the key, the value, the update timestamp, the id of the transaction that created the version, and the id of the DC where the version was created. The update timestamp is assigned upon creation of the version and determines the snapshot to which the version belongs.


Clients. In a client session, a client c maintains the highest stable snapshot timestamp known by c and the commit time of its last update transaction (its highest write time). The client also maintains a private cache, which stores items written by c that are not yet included in the stable snapshot. Finally, the client maintains the meta-data and data of the transaction that is currently running: the unique identifier of the transaction, its write set and its read set.


Server. A server p is identified by a partition id and a DC id, the latter indicating the local DC of the server. Additionally, p also stores its replica id, which identifies p among the R replicas of its partition, R being the replication factor.

Each server p has access to a monotonically increasing physical clock. The local clock value on p is represented by a hybrid logical clock (HLC). p also maintains two vector clocks of HLC timestamps: a version vector and a Global Stabilization Vector (GSV). The version vector has one entry per replica of p's partition; the entry for the i-th replica indicates the timestamp of the latest update received by p from that replica, and the entry corresponding to p itself is the version clock of the server, representing the local snapshot installed by p. The GSV has one entry per DC; a value of t in the entry for the m-th DC means that p is aware that all the nodes in the m-th data center have installed all events generated in the m-th data center with timestamp up to t. The server also stores the UST known by p: a UST value of t indicates that p is aware that every partition in every DC has installed a snapshot with timestamp at least t.

Finally, as is standard practice for systems that run a 2PC protocol, each server keeps two queues, holding prepared and committed transactions, respectively. The former stores transactions for which the server has proposed a commit timestamp and for which it is waiting for the commit message. The latter stores transactions that have been assigned a commit timestamp and whose modifications are going to be applied to the server's local store.
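The per-item and per-server meta-data above can be summarized with the following illustrative structures; all names are placeholders, and the vector sizes follow the description (one version-vector entry per replica of the partition, one GSV entry per DC).

```cpp
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

using Timestamp = uint64_t;

// A version of an item (Section IV-A); field names are illustrative.
struct ItemVersion {
    std::string key;
    std::string value;
    Timestamp   update_ts;  // commit timestamp of the creating transaction
    uint64_t    tx_id;      // id of the creating transaction
    uint32_t    source_dc;  // DC where the version was created
};

// A transaction parked in the prepared or committed queue of a server.
struct PendingTx {
    uint64_t tx_id;
    Timestamp proposed_or_commit_ts;
    std::vector<ItemVersion> writes;
};

// Per-server state (Section IV-A), kept deliberately minimal.
struct ServerState {
    uint32_t partition_id;
    uint32_t dc_id;
    uint32_t replica_id;
    std::vector<Timestamp> version_vector;  // one entry per replica of the partition
    std::vector<Timestamp> gsv;             // Global Stabilization Vector, one entry per DC
    Timestamp ust = 0;                      // universal stable time known by this server
    std::deque<PendingTx> prepared;         // awaiting the commit message
    std::deque<PendingTx> committed;        // awaiting application to the store
};
```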

1:function Start
2:       send StartTxReq to
3:       receive StartTxResp from
4:       
5:       
6:       Remove from all items with commit timestamp up to
7:end function
8:
9:function Read()
10:       
11:       for each  do
12:              check (in this order)
13:             if
14:       end for
15:       
16:       send ReadReq to
17:       receive ReadResp from
18:       
19:       
20:       return
21:end function
22:
23:function Write()
24:       for each  do Update or write new entry
25:             if then else
26:       end for
27:end function
28:
29:function Commit Only invoked if
30:       send CommitReq to
31:       receive CommitResp from
32:        Update client’s highest write time
33:       Tag entries with
34:       Move entries to  Overwrite (older) duplicate entries
35:end function
Algorithm 1 Client (open session towards ).
1:upon receive StartTxReq from  do
2:        Update universal stable time
3:       
4:        Save TX context
5:       send StartTxResp  Assign transaction snapshot
6:
7:upon receive ReadReq from  do
8:       
9:       
10:        Partitions with 1 key to read
11:       for (do  Done in parallel
12:               Returns an id of a DC that replicates partition
13:             send ReadSliceReq to
14:             receive ReadSliceResp from
15:             
16:       end for
17:       send ReadResp to
18:
19:upon receive CommitReq  from  do
20:       
21:        Max timestamp seen by the client
22:       
23:       for (do Done in parallel
24:               Returns an id of a DC that replicates partition
25:             send PrepareReq to
26:             receive PrepareResp from
27:       end for
28:       ct  Max proposed timestamp
29:       for () do send CommitReq to end for
30:       delete TX[]  Clear transactional context of
31:       send CommitResp to
Algorithm 2 Server - transaction coordinator.

IV-B Operations

Algorithm 1 reports the client protocol. Algorithm 2 and Algorithm 3 report the protocols executed by a server to run a transaction, for the cases in which the server is or is not the transaction coordinator, respectively. Algorithm 4 describes the replication and the UST protocols.

Start. Client c starts a transaction T by picking at random a coordinator partition in the local DC and sending it a start request that piggybacks the highest snapshot timestamp seen by c. The coordinator uses this value to update its UST, so that it can assign to the new transaction a snapshot that is at least as fresh as the ones accessed by c in previous transactions. The coordinator uses its updated UST value as the snapshot for T. It also generates a unique identifier for T and inserts T in a private data structure. The coordinator replies to c with the transaction identifier and the snapshot timestamp.

Upon receiving the reply, c updates its snapshot timestamp and evicts from the cache any version with a timestamp lower than or equal to it. c can prune such versions because the UST protocol ensures that they are included in the snapshot installed by every partition in the system. This means that if, after pruning, a version X of a key x remains in the private cache of c, then X is newer than any version of x in the stable snapshot, and hence the freshest version of x visible to c is X.

1:upon receive ReadSliceReq  from  do
2:        Update universal stable time
3:       
4:       for (do
5:               Universally visible
6:              Freshest visible vers. of
7:       end for
8:       send SliceResp to
9:
10:upon receive PrepareReq  do from
11:        Update HLC
12:        Update universal stable time
13:       pt  Proposed commit time
14:        Append to pending list
15:       send PrepareResp to
16:
17:upon receive CommitReq  do from
18:        Update HLC
19:       
20:        Remove from pending
21:        Mark to commit
Algorithm 3 Server - transaction cohort.
1:function update()
2:       create
3:       insert new item in the version chain of key
4:end function
5:
6:upon Every  do
7:       if (then 
8:       else  end if
9:       
10:       if (then Commit tx by increasing order of
11:             
12:             for ()) do
13:                    for (do
14:                          for (do update (end for
15:                    end for
16:                    for (do send Replicate to  end for
17:                    
18:             end for
19:              Set version clock
20:       else
21:              Set version clock
22:             for (do send Heartbeat to  end for  Send heartbeat to replicas
23:       end if
24:
25:upon receive Replicate from  do
26:       for (do
27:             for (do
28:                    update ()
29:             end for
30:       end for
31:       
32:        Update version clock of i-th replica of n-th partition
33:
34:upon receive Heartbeat t from  do
35:       
36:        Update version clock of i-th replica in n-th partition
37:
38:upon every time do  Gather global stable times from other DCs
39:       
40:
41:upon every time do  Compute universal stable time
42:       
43:        Enforce monotonicity of
Algorithm 4 Server - Auxiliary functions.

Read. For each key to read, the client searches the write-set, the read-set and the client cache, in this order. If an item corresponding to the key is found, it is added to the set of items to return, ensuring read-your-own-writes and repeatable-reads semantics. Reads for keys that cannot be served locally are sent to the transaction coordinator together with the transaction id. The coordinator retrieves the snapshot corresponding to the transaction and sends to each involved partition, in parallel, the set of keys to be read. Because each DC stores only a subset of the full dataset, some of the contacted partitions may belong to a remote DC that replicates the partitions to which the keys belong. Remote DCs can be chosen depending on geographical proximity or on some load-balancing scheme.

Upon receiving a read request, regardless of whether it originates from the local DC or from a remote one, the server first updates its UST if it is smaller than the transaction's snapshot (Alg. 3 Line 2). Next, the server returns, for each key, the version within the snapshot with the highest timestamp (Alg. 3 Lines 4–7). As we shall see shortly, the commit protocol of PaRiS allows concurrent updates on the same key, both within a DC and in different DCs. This is typically the case in TCC systems, to avoid costly validation protocols for update transactions [Akkoorath:2016, Spirovska:2018]. In case multiple versions of a key are assigned the same timestamp, PaRiS totally orders versions by the concatenation of timestamp, transaction id and source data center id, in this order. Once the coordinator has received the replies from all the contacted partitions, it sends the items to the client, which inserts them in its read-set.
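A sketch of the version-selection step a partition performs when serving a read: among the versions of a key, it returns the freshest one whose timestamp is within the transaction snapshot, breaking ties by transaction id and source DC id as described above. The version-chain layout is an assumption.

```cpp
#include <cstdint>
#include <tuple>
#include <vector>

struct Version {
    uint64_t ts;        // update (commit) timestamp
    uint64_t tx_id;     // id of the creating transaction
    uint32_t source_dc; // DC where the version was created
    // value omitted for brevity
};

// Returns the index of the freshest version visible at snapshot `snap`,
// or -1 if no version is visible. Total order: (ts, tx_id, source_dc).
inline int freshest_visible(const std::vector<Version>& chain, uint64_t snap) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(chain.size()); ++i) {
        const Version& v = chain[i];
        if (v.ts > snap) continue;  // outside the stable snapshot
        if (best < 0 ||
            std::tie(v.ts, v.tx_id, v.source_dc) >
            std::tie(chain[best].ts, chain[best].tx_id, chain[best].source_dc)) {
            best = i;
        }
    }
    return best;
}
```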


Write. The client locally buffers the writes in its write-set. If a key being written is already present in the write-set, the corresponding entry is updated; otherwise, a new entry is inserted.


Commit. To finalize the transaction, the client sends a commit request to the coordinator with the content of its write-set, the id of the transaction and the commit timestamp of its last update transaction, if any. The commit protocol of PaRiS is based on the two-phase commit (2PC) protocol. The coordinator contacts the partitions that store the keys to be updated and sends them the corresponding updates together with the client's last commit timestamp. Note that some of the contacted partitions can belong to a remote DC. Each involved partition first updates its clock to be at least as high as the maximum of the transaction's snapshot timestamp and the client's last commit timestamp. Then, each partition increases its clock and sends the updated clock value to the coordinator as a proposed commit timestamp. The proposed timestamp reflects causality because it is higher than both the snapshot timestamp and the client's last commit timestamp. Each partition also inserts the transaction id, the set of keys to be modified on the partition and the proposed timestamp in the queue of pending transactions.

The coordinator picks the maximum among the proposed timestamps, sends it to the partitions involved in the transaction, clears the local context of the transaction and sends the commit timestamp to the client. Upon receiving the commit message, a partition increases its clock to match the commit time, if needed, and moves the transaction from the pending queue to the commit one, with the new commit timestamp.
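A sketch of the coordinator's choice of the commit timestamp in the step just described: it is simply the maximum of the prepare timestamps proposed by the involved partitions, each of which already exceeds the snapshot timestamp and the client's last commit time. Assumes at least one participant.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Timestamp = uint64_t;

// Commit timestamp = maximum of the proposed prepare timestamps. Because every
// proposal exceeds both the transaction snapshot and the client's previous
// commit time, the chosen commit timestamp reflects causality.
inline Timestamp choose_commit_timestamp(const std::vector<Timestamp>& proposed) {
    return *std::max_element(proposed.begin(), proposed.end());
}
```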


Applying and replicating transactions. Periodically, the effects of transactions committed by a partition p are applied to p's local key-value store, in increasing commit timestamp order (Alg. 4 Lines 6–21). p applies the modifications of transactions that have a commit timestamp strictly lower than the lowest timestamp present in the pending list. This timestamp represents a lower bound on the commit timestamps of future transactions on p. After applying the transactions, p updates its local version clock and replicates the update operations of the applied transactions to its remote replicas.

If a partition does not commit a transaction for a given amount of time, it updates its local clock and sends it to its peer replicas by means of a heartbeat message. This ensures the progress of the UST also in the absence of updates.


Stabilization protocol. At fixed time intervals, partitions within a data center exchange the minimum of their version vectors to compute the global stable time (GST) of the local data center. Similarly to previous work [Du:2014, Akkoorath:2016], PaRiS organizes the nodes within a DC as a tree to reduce message exchange. The GST is progressively aggregated from the leaves to the root, and then propagated from the root to all the nodes in the DC. Next, the roots of all DCs exchange their GST values.

At fixed time intervals, the roots compute the UST as the aggregate minimum of the received GSTs and propagate it to all the other nodes in their DC.


Garbage collection. Periodically, the partitions in the system exchange the oldest snapshot corresponding to an active transaction (a partition sends its current stable snapshot timestamp if it has no running transactions). The aggregate minimum of these values determines the oldest snapshot that is visible to any running transaction. The partitions scan the version chain of each key backwards and keep all the versions up to (and including) the oldest one within that snapshot. Older versions are removed. The same protocol that computes the UST also computes this garbage-collection snapshot.
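A sketch of one standard way to implement the trimming described above, assuming the version chain is ordered from newest to oldest: keep every version newer than the garbage-collection snapshot plus the freshest version at or below it (the one still visible to the oldest active transaction), and drop the rest.

```cpp
#include <cstdint>
#include <vector>

struct Version {
    uint64_t ts;  // update timestamp; value omitted for brevity
};

// `chain` is sorted from newest to oldest. Keeps all versions with ts > gc_ts
// plus the first version with ts <= gc_ts; everything older is removed.
inline void garbage_collect(std::vector<Version>& chain, uint64_t gc_ts) {
    for (size_t i = 0; i < chain.size(); ++i) {
        if (chain[i].ts <= gc_ts) {
            chain.resize(i + 1);  // keep this one, drop all older versions
            return;
        }
    }
}
```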

IV-C Correctness

We now provide an informal proof sketch that PaRiS provides causal consistency by showing that reads observe a causally consistent snapshot and writes are atomic.

Lemma 1.

The snapshot time of a transaction T is always lower than the commit time of T.

Proof sketch.

Let T be a transaction with a snapshot time and a commit time. The snapshot time is determined during the start of the transaction (Alg. 2 Line 4). The commit time is calculated in the commit phase of the 2PC protocol as the maximum of the prepare times proposed by all the partitions participating in the transaction (Alg. 2 Line 26). To reflect causality when proposing a prepare timestamp, each partition proposes a timestamp higher than the snapshot timestamp (Alg. 3 Line 12). Thus, the commit time of a transaction is always greater than its snapshot time. ∎

(a) Throughput vs average TX latency (95:5 r:w ratio).
(b) Throughput vs average TX latency (50:50 r:w ratio).
Fig. 1: Performance of PaRiS and BPR (logarithmic scale) with different 95:5 (a) and 50:50 (b) r:w ratios, 4 partitions involved per transaction (5 DCs, 45 partitions, replication factor is 2, 18 machines per DC). PaRiS outperforms BPR for both read-heavy and write-heavy workloads.
Proposition 1.

If an update Y causally depends on an update X, i.e., X ⇝ Y, then the update time of X is lower than the update time of Y.

Proof sketch.

Let c be the client that wrote Y. There are three cases in which Y can depend on X, corresponding to the conditions described in Section II-A: 1) c committed X in a previous transaction; 2) c has read X, which was written in a previous transaction; and 3) there exists a chain of direct dependencies that leads from X to Y, i.e., X ⇝ ... ⇝ Y.

Case 1. When a client commits a transaction, it piggybacks the commit time of its last update transaction, if any, on the commit request sent to the transaction coordinator (Alg. 1 Line 27), which, in turn, piggybacks it on the prepare requests sent to the involved partitions (Alg. 2 Line 23). To reflect causality when proposing a commit timestamp, each partition proposes a timestamp higher than both this value and the snapshot timestamp (Alg. 3 Lines 10–14). The coordinator of the transaction chooses the maximum of the timestamps proposed by the participating partitions (Alg. 2 Line 26) to serve as the commit time for all the items updated in the transaction. The new version of a data item is written to the key-value store with this commit time as its update time (Alg. 4 Lines 2 and 13). When c commits the transaction that updates Y, it piggybacks the commit time of the transaction that updated X. Hence, from the discussion above, the commit time of the transaction that writes Y is higher than the commit time of the transaction that wrote X. Because the commit time of a transaction is the update time of all the data item versions it creates (Alg. 4 Lines 2 and 13), the update time of X is lower than the update time of Y.

Case 2: c could have read X either from its client cache or from the transaction's causal snapshot.

If c has read X from its client cache, then c has written X either in a previous transaction in the same thread of execution or in the current one. If c wrote X in the same transaction in which Y is also written, then it is not possible to have X ⇝ Y, because all the updates from that transaction are given the same commit, i.e., update, timestamp. Thus, X must have been written in a previous transaction, and from Case 1 it follows that the update time of X is lower than the update time of Y.

Next, we consider the case in which c read X from the transaction's causal snapshot. When a transaction is started, its snapshot is determined from the coordinator's UST (Alg. 2 Line 2). From Alg. 3 Line 5 we have that the update time of X is lower than or equal to the snapshot time of the transaction that writes Y. From Lemma 1 it follows that this snapshot time is lower than the commit time of that transaction, i.e., the update time of Y. Therefore, the update time of X is lower than the update time of Y.

Case 3: If Y depends on X because of a transitive dependency outside the thread of execution, then there exists a chain of direct dependencies that leads from X to Y, i.e., X ⇝ ... ⇝ Y. Each pair of consecutive elements in the chain falls under either Case 1 or Case 2. Hence, the proof of Case 3 comes down to a chained application of the arguments from Case 1 and Case 2, proving that each element of the chain has an update time lower than its successor's. ∎

Proposition 2.

If the entry of a partition's version vector corresponding to replica i equals t, then the partition has received all updates from replica i with commit time lower than or equal to t.

Proof sketch.

We need to show that the proposition holds for both local and remote updates. For the former, we show that there are no pending local updates with a commit timestamp lower than or equal to the local entry of the version vector. When a partition updates the local entry of its version vector, it finds the minimum prepare timestamp of all transactions that are currently in the prepare phase (Alg. 4 Line 6). Because the commit time is calculated as the maximum of all prepare times (Alg. 2 Line 26) and the HLC is monotonic (Alg. 3 Lines 10 and 16), all future transactions are guaranteed to receive a commit time greater than or equal to this minimum prepare timestamp. So, when the local entry is set to the minimum of the prepare times of all transactions in the prepare queue minus 1 (Alg. 4 Line 6), the partition has already received all local updates belonging to the snapshot identified by that value.

For the latter, we use a proof by contradiction. Assume there is a remote update d from replica i such that d's commit time is lower than or equal to the i-th entry of the partition's version vector, and the partition has not received d. By Alg. 4 Line 30, the i-th entry was set upon receiving from replica i an update with a commit timestamp at least as high as d's. The updates are sent in the order of their commit timestamps (Alg. 4 Lines 9–16). Hence, the partition could not have received that update before d, implying that it must already have received d, which contradicts the assumption. ∎

Proposition 3.

Snapshots in PaRiS are causal.

Proof sketch.

To start a transaction, a client c piggybacks the freshest snapshot it has seen, ensuring the monotonicity of the snapshots seen by c (Alg. 2 Line 2). Commit timestamps reflect causality (Alg. 2 Line 26), and the UST tracks a lower bound on the snapshot installed by every partition in all DCs (Alg. 4 Lines 36–38). If an item version is within the snapshot of a transaction, so are its dependencies (Proposition 1). On top of the snapshot provided by the coordinator, c can also read from the cache its own writes that are not yet included in the snapshot. These writes cannot depend on items created by other clients that are outside the snapshot visible to c. ∎

Proposition 4.

Writes are atomic.

Proof sketch.

Although updates are made visible independently on each partition involved in the commit phase, either all updates of a transaction are made visible or none of them is, i.e., atomicity is not violated. All updates of a transaction belong to the same snapshot because they all receive the same commit timestamp (Alg. 2 Line 27). Updates are installed in the order of their commit timestamps (Alg. 4 Lines 9–16). The visibility of item versions is determined by the transaction snapshot (Alg. 3 Line 5), which is based on the coordinator's universal stable time, computed by the UST protocol as the aggregate minimum of the version vector entries of all partitions in all data centers (Alg. 4 Lines 34–38). ∎

PaRiS implements TCC, as every transaction reads from a causally consistent snapshot (Proposition 3) that includes all effects (Proposition 4) of its causally dependent transactions.

V Evaluation

(a) Throughput when varying the number of machines per DC.
(b) Throughput when varying the number of DCs.
Fig. 2: Throughput achieved by PaRiS when increasing the number of machines per DC (a) and the number of DCs (b). PaRiS achieves good scalability both when increasing the number of machines per DC and when increasing the number of DCs.
(a) Throughput when varying the locality of the transactions.
(b) Latency when varying the locality of the transactions.
Fig. 3: Throughput (a) and latency (b) achieved by PaRiS when varying the locality of the transactions with 100:0, 95:5, 90:10 and 50:50 local-DC:multi-DC ratio.

Competitor system. To assess the advantages of having non-blocking reads, we compare PaRiS against a blocking protocol that we call Blocking Partial Replication (BPR). In BPR, the snapshot of a transaction issued by a client is determined as the maximum of the highest causal snapshot seen by the client and the clock value of the transaction coordinator. BPR uses one timestamp to encode transactional snapshots, so we can fairly compare the resource efficiency of PaRiS with that of BPR. BPR also favors the freshness of the snapshots that are visible to transactions. BPR, however, requires blocking transactional reads, because a server must ensure that the returned version belongs to a causal snapshot. To this end, a read operation with snapshot timestamp t is blocked on a partition until the partition has applied all local and remote transactions with timestamp up to t.
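For contrast with PaRiS's non-blocking reads, the sketch below shows the kind of wait a BPR-like blocking design performs before serving a read: the partition blocks until it has applied all local and remote transactions with timestamp up to the read's snapshot. The synchronization primitives and names are illustrative.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

using Timestamp = uint64_t;

// Per-partition applied-watermark used by a blocking design such as BPR:
// a read at snapshot `snap` must wait until the partition has applied all
// local and remote transactions with timestamp up to `snap`.
class AppliedWatermark {
public:
    void wait_until_applied(Timestamp snap) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return applied_up_to_ >= snap; });
    }

    // Called by the apply/replication path after installing transactions.
    void advance(Timestamp new_watermark) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            if (new_watermark > applied_up_to_) applied_up_to_ = new_watermark;
        }
        cv_.notify_all();
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    Timestamp applied_up_to_ = 0;
};
```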

V-A Experimental environment

Platform. We consider a geo-replicated setting deployed across up to 10 replication sites on Amazon EC2 (North Virginia, Oregon, Ireland, Mumbai, Sydney, Canada, Seoul, Frankfurt, Singapore and Ohio). When using 3 DCs, we use Virginia, Oregon and Ireland. When using 5 DCs, we add Mumbai and Sydney. In each DC we use up to 18 servers (c5.xlarge instances with 4 vCPUs and 8 GB of RAM). The replication factor is 2; we choose this value because it allows us to use 3 as the minimum number of DCs in our experiments while still exercising partial replication.

We spawn one client process per partition in each DC. Clients are collocated with the server partition they use as a transaction coordinator. The clients issue requests in a closed loop. To generate different load conditions, we spawn a different number of threads per client process. Depending on the type of workload and the protocol, a different number of threads is needed to saturate the target system. Each "dot" in the curve plots we report corresponds to a different number of active threads per client process.


Implementation. We implement PaRiS and BPR in the same C++ code-base. Both protocols implement the last-writer-wins rule for convergence. We use Google Protobufs for communication, and NTP to synchronize physical clocks. The stabilization protocols run every 5 milliseconds.


Workloads. We use workloads with 95:5 and 50:50 r:w ratios, which correspond to the read-heavy (B) and update-heavy (A) YCSB workloads [Cooper:2010], respectively. These are standard workloads also used to benchmark other TCC systems [Akkoorath:2016, Mehdi:2017, Zawirski:2015, Spirovska:2018]. Transactions generate these workloads by executing 19 reads and 1 write (95:5) or 10 reads and 10 writes (50:50). Hence, in each workload a transaction executes 20 operations. A transaction first executes all its reads in parallel, and then all its writes in parallel.

A transaction can target only partitions in the local DC, or can touch random partitions in remote DCs. In the first case, we say that a transaction is “local-DC”; else, we say it is “multi-DC”. When accessing a remote partition, a client can choose among two replicas. We assign to every client in a DC the same preferred remote replica for each partition. We vary the preferred replica in the DCs using a round-robin assignment, to balance the load. To evaluate the effect of the partial replication, we use workloads with 100:0, 95:5, 90:10 and 50:50 local-DC:multi-DC ratios.

The default workload we consider uses the 95:5 r:w ratio, 95:5 local-DC:multi-DC ratio and runs transactions that involve 4 partitions on a platform deployed over 90 machines spread over 5 DCs. The default deployment has 45 partitions that are replicated with replication factor 2. Hence, each DC has a total of 18 machines.

We also consider variations of this workload in which we change the value of one parameter and keep the others at their default values. Transactions access keys within a partition according to a zipfian distribution with parameter 0.99, which is the default in YCSB and resembles the strong skew that characterizes many production systems [Atikoglu:2012, Nawab:2015, Balmau:2017]. We use small items (8 bytes), which are prevalent in many production workloads [Atikoglu:2012, Nawab:2015].

V-B Latency and throughput


Blocking vs. non-blocking. Figure 1(a) and Figure 1(b) report the average transaction latency vs. throughput achieved by PaRiS and BPR with the 95:5 (default) and the 50:50 r:w ratios. In the read-dominated case, PaRiS achieves up to 5.91x lower response times and up to 1.47x higher throughput than BPR. PaRiS also achieves up to 20.56x lower response times and up to 1.46x higher throughput than BPR in the write-dominated workload. PaRiS achieves lower latencies because it never has to wait for a snapshot to be installed. PaRiS achieves higher throughput because it does not incur any overhead to block and unblock read requests. Because BPR is a blocking protocol, it needs a higher number of concurrent client threads to fully utilize the processing power left idle by blocked reads. This creates more contention on the physical resources and more synchronization overhead to block and unblock reads, which ultimately leads to lower throughput.


Blocking time. The average blocking time of the read phase of a transaction in BPR is 29 ms at the peak throughput of the read-dominated workload (Figure 1(a)) and 41 ms at the peak throughput of the write-dominated workload (Figure 1(b)).

V-C Scalability


Varying the number of machines per DC. Figure 2(a) reports the throughput achieved by PaRiS when using 6, 12 and 18 machines/DC. We consider two geo-replicated deployments that use 3 and 5 DCs. In both cases, PaRiS achieves the ideal improvement of 3x when scaling from 6 to 18 machines/DC. This result showcases the ability of PaRiS to scale horizontally regardless of the number of DCs on which it is deployed.


Varying the number of DCs. Figure 2(b) reports the throughput achieved by PaRiS when deployed on 3, 5 and 10 DCs. We consider two cases corresponding to 6 and 12 machines/DC. In both cases PaRiS achieves the ideal improvement of 3.33x when scaling from 3 to 10 DCs. This result shows that PaRiS scales well to higher numbers of DCs for different sizes of the platform within each DC.

V-D Varying data access locality

Figure 3(a) reports the maximum throughput achieved by PaRiS when varying the locality ratio (local-DC:multi-DC) of transactions from 100:0 to 50:50. Figure 3(b) shows the average transaction latency corresponding to the throughput values reported in Figure 3(a). Performance deteriorates as the percentage of local accesses decreases. The maximum achievable throughput drops slightly, from 350 to 300 KTx/sec. Latency is more penalized, increasing from 8 ms to 150 ms. We note that the number of threads needed to saturate the system increases as the locality decreases (from 32 to 512 in this case), because requests spend much of their time traveling across DCs. This explains why the maximum throughput decreases only by 16% as opposed to the order-of-magnitude increase in latency.

Like any partially replicated system, PaRiS targets workloads with high locality in the data access pattern. In case of limited locality, the performance penalty incurred by PaRiS, and by partial replication in general, is the inevitable price to pay for higher storage capacity.

V-E Data staleness

We measure the staleness of the data returned by PaRiS by means of the visibility latency of updates. The visibility latency of an update in a DC is the difference between the wall-clock time when the update becomes visible in that DC and the wall-clock time when the update was committed in its original DC. Figure 4 shows the CDF of the update visibility latency achieved by PaRiS and BPR with 5 DCs and the default workload. The CDFs are computed as follows: we first obtain the CDF on every partition and then compute the mean for each percentile.

BPR achieves lower update visibility latency than PaRiS, with a difference of around 200 ms in the worst case. That is to be expected, because UST identifies a lower bound on the update times of the transactions applied in the whole cluster. BPR effectively trades performance for data freshness: it exposes more recent snapshots of the data at the cost of blocking reads, hence achieving much lower performance than PaRiS.

Fig. 4: PaRiS has higher update visibility latency than BPR (logarithmic scale).

VI Related work

Table I classifies existing CC systems according to the transactional model they expose, their capability of achieving non-blocking parallel reads, their support for partial replication, and their meta-data requirements. The table focuses on systems that target the system model described in Section II-C.

The vast majority of the systems assume full replication and provide no or only restricted transactional capabilities. This class of systems includes COPS [Lloyd:2011], Eiger [Lloyd:2013], ChainReaction [Almeida:2013], Orbe [Du:2013], GentleRain [Du:2014], Bolt-on CC [Bailis:2013], Contrarian [Didona:2018], POCC [Spirovska:2017], CausalSpartan [Roohitavaf:2017], COPS-SNOW [Lu:2016] and EunomiaKV [Gunawardhana:2017]. All these systems implement one-shot read-only transactions, and only Eiger additionally supports one-shot write-only transactions.

A few systems support partial replication, i.e., Saturn [Bravo:2017], C [Fouto:2018], Karma [Mahmood:2018] and the one by Xiang and Vaidya [Xiang:2018]. These systems, however, implement only single-object read and write operations. Among them, only Karma discusses extensions to support read-only transactions by using an approach similar to Eiger’s.

To the best of our knowledge, only four systems implement TCC. Among these, OCCULT [Mehdi:2017] and Cure [Akkoorath:2016] can block reads on a node while waiting for a snapshot to be installed on that node (OCCULT may retry read operations multiple times instead of blocking; retrying has the same effect on latency as blocking the read until the correct version is available on the server that processes the operation). Wren [Spirovska:2018] and AV [Tomsic:2018] avoid blocking by identifying stable snapshots in a way that is similar to PaRiS. All these systems, however, target full replication.

System Txs Nonbl. reads Partial rep. Meta-data
COPS [Lloyd:2011] ROT O(deps)
Eiger [Lloyd:2013] ROT/WOT O(deps)
ChainReaction [Almeida:2013] ROT M
Orbe [Du:2013] ROT 1 ts
GentleRain [Du:2014] ROT 1 ts
POCC [Spirovska:2017] ROT M
COPS-SNOW [Lu:2016] ROT O(deps)
OCCULT [Mehdi:2017] Generic O(M)
Cure [Akkoorath:2016] Generic M
Wren [Spirovska:2018] Generic 2 ts
AV [Tomsic:2018] Generic M
Xiang, Vaidya [Xiang:2018] 1 ts
Contrarian [Didona:2018] ROT M
C  [Fouto:2018] M
Saturn [Bravo:2017] 1 ts
Karma [Mahmood:2018] ROT O(deps)
CausalSpartan [Roohitavaf:2017] M
Bolt-on CC [Bailis:2013] M
EunomiaKV [Gunawardhana:2017] M
PaRiS (this work) Generic 1 ts
TABLE I: Taxonomy of the main CC systems. M is the number of DCs; ts stands for timestamp; deps stands for dependencies. For systems that do not support transactions, the non-blocking read property refers to single-item reads. PaRiS is the only system that supports partial replication with generic transactions, non-blocking parallel reads, and constant meta-data to track dependencies.

Other relevant systems include TARDiS [Crooks:2016b], GSP [gsp], SwiftCloud [Zawirski:2015], Lazy Replication [Ladin:1992], ISIS [Birman:1991] and Bayou [Terry:1995]. These systems do not support sharding, and hence do not support partial replication either. Many protocols have also been proposed to implement causally consistent distributed shared memories, e.g., [Baldoni:2006, Xiang:2018b, Hsu:2018]. These protocols do not support transactions and require more meta-data than PaRiS.

PaRiS is also related to systems that implement stronger consistency levels and support partial replication, such as Jessy [Ardekani:2013], P-Store [Schiper:2014], STR [Li:2018], and Spanner [Corbett:2013]. On the one hand, these systems allow fewer anomalies than what is allowed by TCC [Akkoorath:2016] and provide fresher data to the clients. On the other hand, they incur higher synchronization costs to determine the outcome of transactions. PaRiS targets applications that can tolerate weaker consistency and some degree of data staleness, e.g., social networks, and offers them low latency, scalability and high storage capacity.

VII Conclusion

We present PaRiS, the first system that implements TCC in a partially replicated system and achieves non-blocking read operations. PaRiS implements a novel dependency tracking protocol, called UST, which requires only one timestamp to track dependencies. UST identifies a snapshot of the data that is available at every DC, thereby enabling non-blocking reads regardless of the DC in which the read takes place.

We evaluate PaRiS on a data platform replicated on up to 10 DCs. PaRiS scales well and achieves lower latency than the blocking alternative, while being able to handle larger datasets than existing solutions that assume full replication.

Acknowledgments

This research has been supported by The Swiss National Science Foundation through Grant No. 166306 and by an EcoCloud postdoctoral research fellowship.

References