Toward Adaptive Causal Consistency for Replicated Data Stores

by   Mohammad Roohitavaf, et al.

Causal consistency for key-value stores has two main requirements: (1) do not make a version visible if some of its dependencies are invisible, as doing so may violate causal consistency in the future, and (2) make a version visible as soon as possible so that clients have the most recent information (to the extent feasible). These two requirements conflict with each other. Existing key-value stores that provide causal consistency (or detection of causal violations) take a static approach to the trade-off between these requirements. Depending on the choice, this approach benefits some applications and penalizes others. We propose an alternative where the system provides a set of tracking groups and checking groups. This allows each application to choose the settings that suit it best. Furthermore, these groups can be changed dynamically based on application requirements.





1 Introduction

Causal consistency for distributed key-value stores has received much attention from academia in recent years. Existing protocols take a static approach to the trade-off between conflicting requirements (e.g., consistency, visibility, and throughput). They also treat all clients the same and assume that usage patterns never change. For example, they assume that clients only access their local data center and that any client may access any part of the data. However, different applications may have different usage patterns. To illustrate, consider a simple system that consists of two partitions, each with two geographically distributed copies, for four partition copies in total. Suppose we are using a causal consistency protocol like [11, 12, 7, 8, 14] that does not make a version visible until it has made sure that all partitions inside a replica are sufficiently up to date. We consider two possible ways to organize the copies: (1) two full replicas, each with two partitions, or (2) four partial replicas, each with one partition. These two organizations are shown in Figure 1(a) and Figure 1(b), respectively.

Now, consider two applications. The first application consists of two clients that each access a different copy of the same partition for collaborative work. In this application, each client updates the data after it reads the new version written by the other client. Since each client waits for the other client's update, any increase in update visibility latency reduces the throughput of the application. In the scenario in Figure 1(a), each copy used by the application belongs to the same replica as a copy of the other partition, so it does not make a version visible until it has made sure that the other copy in its replica is sufficiently up to date. Thus, if the communication between the two copies of the other partition is slow, it takes more time to make a version visible. Since the data on the other partition is irrelevant to this application, the delay is unnecessary; it increases visibility latency, which in turn reduces the throughput of the application. Furthermore, if there were a large number of such partitions, this delay would be even more pronounced. By contrast, there is no such penalty in the scenario in Figure 1(b), where the copies of the two partitions are considered to be in different replicas and therefore do not check each other.

Figure 1: Two ways to organize replicas

On the other hand, consider a second application that consists of a single client accessing data from both partitions. In the scenario in Figure 1(a), where the two copies it accesses belong to the same replica, the client is guaranteed to always read consistent data. However, in the scenario in Figure 1(b), since these copies do not check the freshness of each other, the client may encounter inconsistent versions (or suffer delays or repeated requests to find a consistent version) while accessing them.

From the above discussion, it follows that no matter how we configure the given key-value store, a system with a static configuration that treats all clients the same will penalize some clients. The goal of this work is to develop a broad framework that instead of relying on a fixed set of assumptions, allows the system to be dynamically reconfigured after learning the actual client activities and requirements.

In Section 2, we propose an approach that lets us effectively trade off between different objectives and serve different groups of clients differently. Next, in Section 3, we provide a framework that uses our proposed approach. In Section 4, we discuss ideas for creating adaptive causal systems based on our protocol. Finally, in Section 5 we conclude the paper.

2 Adaptive Causal Consistency

The broad approach for providing causal consistency is to track the causal dependencies of a version, and check them before making the version visible in another replica. Tracking and checking are usually done using timestamping versions as follows:

  • Dependency Tracking: Upon creating a new version for a key, we assign a timestamp to the version that somehow captures causal dependencies of the version.

  • Dependency Checking: Upon receiving a version, the receiving replica does not make the version visible to the clients until it makes sure that all of the dependencies of the version are also visible to the clients.

The goal of timestamping is to capture the causal relation between two versions. To satisfy $v_1 \ \mathrm{dep}\ v_2 \iff v_1.t > v_2.t$ (where $v_1 \ \mathrm{dep}\ v_2$ means that the event of writing $v_2$ has happened-before [10] the event of writing $v_1$, and $v.t$ is the timestamp assigned to $v$), we need timestamps of size $O(N)$ [5], where $N$ is the number of nodes that clients can write on. To solve the issue of large timestamps, causal consistency protocols consider servers in groups and track causality with vectors that have one entry per group. We refer to such groups as tracking groups. Tracking dependencies in groups provides timestamps that satisfy the weaker condition $v_1 \ \mathrm{dep}\ v_2 \Rightarrow v_1.t > v_2.t$. This condition lets us guarantee causal visibility of the versions. However, since it does not provide accurate causality information, we may need to unnecessarily delay the visibility of a version by waiting for versions that are not its real dependencies. Thus, by grouping servers into tracking groups, we trade off the visibility of versions for a lower metadata size.
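As a rough illustration of this trade-off, the following Python sketch compresses per-server dependency timestamps into one entry per tracking group. The server names, group names, and the `compress_dependencies` helper are hypothetical and not part of any protocol discussed here. The sketch only shows how grouping shrinks the metadata while introducing dependencies that the version does not really have.

```python
def compress_dependencies(deps_by_server, tracking_group_of):
    """Collapse per-server dependency timestamps into one entry per tracking group.

    Each group entry keeps the maximum timestamp among its servers, so every
    real dependency is still covered, at the cost of over-approximation.
    """
    deps_by_group = {}
    for server, timestamp in deps_by_server.items():
        group = tracking_group_of[server]
        deps_by_group[group] = max(deps_by_group.get(group, 0), timestamp)
    return deps_by_group


# Hypothetical layout: three servers in two tracking groups.
tracking_group_of = {"s1": "g1", "s2": "g1", "s3": "g2"}
# Real dependencies: s1 up to time 5, s2 up to time 9, s3 up to time 2.
real_deps = {"s1": 5, "s2": 9, "s3": 2}

compressed = compress_dependencies(real_deps, tracking_group_of)
print(compressed)
```

With a single entry for group `g1`, the version now appears to depend on everything written at `s1` up to timestamp 9, although its real dependency on `s1` stops at 5. That extra waiting is the visibility cost paid for the smaller timestamp.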

We face a similar trade-off in dependency checking. Dependency checking determines how conservative we are in making versions visible to the clients. Since checking the whole system is expensive, causal consistency protocols consider servers in groups, and each server only checks servers in its own group. We refer to such groups as checking groups. Most current protocols [11, 12, 7, 8, 15, 14] group servers by their replicas. Thus, a server only checks the dependencies inside the replica that it belongs to. Table 1 shows the tracking and checking groups for some recent causal systems.

Protocol             Tracking            Checking
COPS [11]            Per key             Per replica
Eiger [12]           Per key             Per replica
Orbe [7]             Per server          Per replica
GentleRain [8]       Per system          Per replica
Occult [13]          Per master server   No checking
Okapi [6]            Per replica         Per system
CausalSpartan [14]   Per replica         Per replica

Table 1: Tracking and checking groups in some causal systems

When we are designing a causally consistent key-value store, two natural questions arise based on the trade-offs explained above: 1) how much tracking accuracy is enough for a system? 2) how much should we be conservative in making versions visible? We believe the answer to these questions depends on the factors that should be learned at the run-time. A practical distributed data store performs in a constantly changing environment; the usage pattern of clients can change due to many reasons including time of the day in different time zones or changes in load balancing policies; data distribution can change, because we may need to add or remove some replicas; components may fail or slow down, and so on. These changes can easily invalidate assumptions made by existing causal consistency protocols such as [11, 12, 7, 8, 14, 13] which leads to their reduced performance in practical settings [3]. To solve this issue, we believe that a key-value store must monitor the factors mentioned above and dynamically trade off between different conflicting objectives. We believe dynamically changing tracking and checking grouping based on what we learn from the system is an effective approach to perform such dynamic trade-offs. Using a flexible tracking and checking grouping we are also able to treat different applications in different ways.

To use the above approach, however, we need a protocol that can be easily configured for different groupings. As shown in Table 1, existing protocols assume fixed groupings that cannot be changed. To solve this issue, in the next section, we provide a protocol that can be configured to use any desired grouping. This flexible algorithm provides a basis for creating adaptive causal systems. This algorithm also lets us treat clients in different ways, and unlike most of the existing protocols that require a certain data distribution schema, our algorithm allows us to replicate and partition our data any way we like including creating partial replicas.

3 Adaptive Causal Consistency Framework

In this section, we present the Adaptive Causal Consistency Framework (ACCF), a configurable framework that lets us deal with the trade-offs explained in Section 2. Specifically, ACCF receives as input 1) a function that assigns each server to exactly one tracking group, and 2) a function that assigns each server to a non-empty set of checking groups.
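To make the two inputs concrete, here is a minimal Python sketch that represents them as dictionaries and checks the two constraints stated above. The server names (`A1`, `B1`, `A2`, `B2`), the group ids, and the `validate_groupings` helper are illustrative assumptions, not the interface of the actual implementation.

```python
def validate_groupings(servers, tracking_group_of, checking_groups_of):
    """Check that every server has a tracking group and at least one checking group.

    A dictionary maps each server to a single tracking group id, so the
    "exactly one tracking group" requirement holds by construction.
    """
    for server in servers:
        if server not in tracking_group_of:
            raise ValueError(f"server {server} has no tracking group")
        if not checking_groups_of.get(server):
            raise ValueError(f"server {server} has no checking group")


# Hypothetical system: two partitions (A, B), two copies of each.
SERVERS = ["A1", "B1", "A2", "B2"]

# Assumed tracking grouping: one tracking group per server.
TRACKING = {server: server for server in SERVERS}

# Checking grouping that mimics two full replicas.
FULL_REPLICAS = {"A1": {"r1"}, "B1": {"r1"}, "A2": {"r2"}, "B2": {"r2"}}

# Checking grouping that mimics four partial replicas.
PARTIAL_REPLICAS = {server: {server} for server in SERVERS}

# A server may belong to several checking groups at once.
BOTH = {s: FULL_REPLICAS[s] | PARTIAL_REPLICAS[s] for s in SERVERS}

for grouping in (FULL_REPLICAS, PARTIAL_REPLICAS, BOTH):
    validate_groupings(SERVERS, TRACKING, grouping)
```

Offering both groupings at once, as in `BOTH`, is what would let different clients pick the checking group that suits their access pattern.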

3.1 Client-side

Algorithm 1 shows the client side of ACCF. A client maintains a set of pairs of tracking group ids and timestamps called the dependency set. For each tracking group, there is at most one entry in the dependency set, and its timestamp is the maximum timestamp of the versions read by the client that were originally written in servers of that tracking group. Each data object has a key and a version chain containing the different versions of the object. Each version is a tuple that includes the value of the version and a list that has at most one entry per tracking group and captures the dependencies of the version on writes in different tracking groups.

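Input: load balancer
GET (key, checking group id)
    send GetReq with the key, the checking group id, and the dependency set to the server chosen by the load balancer
    receive the reply from the server
    update the dependency set with the version that was read
    return the value of the version
PUT (key, value)
    send a PUT request with the key, the value, and the dependency set to the server chosen by the load balancer
    receive PutReply with the tracking group of the server and the assigned timestamp
    update the dependency set with the version that was written
Algorithm 1 Client operations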

To read the value of an object, the client calls the GET method with the desired key. The client also specifies the id of the checking group that the server must use. We will see how the server uses this id in Section 3.2. The client finds the preferred server for the key using the given load balancer service and sends a GetReq request to that server. In addition to the key and the checking group id, the request message includes the client's dependency set. The server tries to find the most recent version that is consistent with the client's past reads. In Section 3.2, we explain how the server looks for a consistent version based on the dependency set.

To write a new value for an object, the client calls the PUT method. The server writes the version and records the client's dependency set as the dependency of the version. After receiving a response from the server for a GET (or PUT) operation, the client updates its dependency set such that any later version written by the client depends on the version read (or written) by this operation.
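The client-side bookkeeping described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: timestamps are plain integers, the dependency set is a dictionary from tracking group id to timestamp, and the function names are hypothetical. Merging the dependency entries of the version that was read is also an assumption here, a conservative way to keep dependencies transitive.

```python
def merge_entry(dependency_set, group, timestamp):
    """Keep at most one entry per tracking group, holding the maximum timestamp."""
    current = dependency_set.get(group)
    if current is None or timestamp > current:
        dependency_set[group] = timestamp


def after_get(dependency_set, version_group, version_timestamp, version_deps):
    """Record that the client now depends on the version it has just read."""
    merge_entry(dependency_set, version_group, version_timestamp)
    # Assumption: also absorb the dependencies of the version that was read.
    for group, timestamp in version_deps.items():
        merge_entry(dependency_set, group, timestamp)


def after_put(dependency_set, server_group, assigned_timestamp):
    """Record that later writes depend on the version just written."""
    merge_entry(dependency_set, server_group, assigned_timestamp)


ds = {}
after_get(ds, "g1", 7, {"g2": 3})   # read a version written in group g1
after_put(ds, "g2", 10)             # wrote a version on a server in group g2
after_get(ds, "g1", 4, {})          # an older read does not lower the entry
print(ds)
```

Because each entry only ever grows, the dependency set stays bounded by the number of tracking groups regardless of how many operations the client performs.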

3.2 Server-side

In this section, we focus on the server side of the protocol. Each server has a physical clock. To satisfy the weaker dependency condition from Section 2 while assigning timestamps close to the physical clocks, ACCF relies on Hybrid Logical Clocks (HLCs) [9], and each server maintains an HLC value. Each server keeps a version vector (VV) that has one entry for each tracking group. The entry for a tracking group is the minimum of the latest timestamps that the server has received from the servers in that tracking group. To keep each other updated, servers send heartbeat messages to each other if they have not sent any replicate message for a specific amount of time. If there is no key that is hosted by both the server and a server in a given tracking group, then the entry for that tracking group is $\infty$. Each server is a member of one or more dependency checking groups. Servers inside a checking group periodically share their VVs with each other and compute the Stable Version Vector (SVV) as the entry-wise minimum of the VVs. Each server maintains one SVV for each checking group that it belongs to.
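The SVV computation is a plain entry-wise minimum, which the following Python sketch illustrates. The dictionary representation (tracking group id to timestamp) and the function name are assumptions made for the example, and all vectors are assumed to have the same set of entries.

```python
def stable_version_vector(version_vectors):
    """Entry-wise minimum of the version vectors of one checking group."""
    if not version_vectors:
        raise ValueError("a checking group needs at least one member")
    groups = version_vectors[0].keys()
    return {g: min(vv[g] for vv in version_vectors) for g in groups}


# Hypothetical checking group with two member servers.
vv_server_1 = {"g1": 5, "g2": 8}
vv_server_2 = {"g1": 7, "g2": 3}

svv = stable_version_vector([vv_server_1, vv_server_2])
print(svv)
```

An SVV entry is therefore a lower bound on what every member of the checking group has received from that tracking group, which is why a slow member holds back the whole group.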

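Algorithm 2 shows the PUT and GET operations at the server side. When a client asks to read an object, the server waits if there exists an entry in the client's dependency set whose timestamp is greater than the server's VV entry for the same tracking group, which means that the server is not updated enough and reading from the current version chain could violate causal consistency. Once the server has made sure that no entry in the dependency set exceeds the corresponding VV entry, it checks the SVV of the checking group specified by the client. If no entry in the dependency set exceeds the corresponding SVV entry, the server returns the most recent version of the key such that no entry in the dependency list of the version exceeds the corresponding SVV entry. This guarantees that the client never has to wait if it only reads from servers in that checking group. If a client uses different checking groups for different reads, it is possible that the server finds an entry in the dependency set whose timestamp is greater than the corresponding SVV entry. In this situation, the server ignores the SVV and gives the client the most recent version that it has for the key. Note that this version is guaranteed to be causally consistent with the client's previous reads.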
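The read path described above can be sketched in Python as two checks followed by a selection. This is an illustrative sketch, not the implementation: versions are dictionaries with `value` and `deps` fields, timestamps are integers, every vector has an entry for every tracking group, and the exact comparison (strict or non-strict) is an assumption.

```python
def must_wait(dependency_set, vv):
    """True while the server has not yet received something the client has seen."""
    return any(vv[group] < ts for group, ts in dependency_set.items())


def covered(deps, svv):
    """True if every dependency entry is at or below the stable version vector."""
    return all(ts <= svv[group] for group, ts in deps.items())


def select_version(chain, dependency_set, svv):
    """Pick the version to return; `chain` is ordered from oldest to newest."""
    if covered(dependency_set, svv):
        # Return the newest version whose own dependencies are stable too.
        for version in reversed(chain):
            if covered(version["deps"], svv):
                return version
        return None
    # The client read through another checking group: ignore the SVV.
    return chain[-1] if chain else None


chain = [
    {"value": "old", "deps": {"g1": 1}},
    {"value": "new", "deps": {"g1": 9}},
]
vv = {"g1": 9, "g2": 6}
svv = {"g1": 5, "g2": 5}

print(must_wait({"g1": 10}, vv))                      # server is behind
print(select_version(chain, {"g1": 4}, svv)["value"])  # stable version
print(select_version(chain, {"g1": 7}, svv)["value"])  # SVV ignored
```

In the last call the client's dependency exceeds the SVV, so the sketch falls back to the newest version, which is safe only because the earlier wait on the VV has already passed.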

Once a server receives a PUT request, it updates its HLC by calling updateHLCforPut with the highest timestamp in the client's dependency set. Next, the server creates a new version for the key specified by the client. The server updates its VV with the timestamp of the new version, and sends back its tracking group and the assigned timestamp to the client in a PutReply message.

Upon creating a new version for an object in one server, we send the new version to the other servers hosting the object via replicate messages. Upon receiving a replicate message from another server, the receiving server adds the new version to the version chain of the corresponding object. The server also updates the entry for the sending server in its version vector.

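Input: tracking grouping function, data placement function
Upon receiving GetReq with a key, a checking group id, and a dependency set
    while some entry in the dependency set has a timestamp greater than the VV entry for its tracking group
        wait
    if no entry in the dependency set exceeds the corresponding entry in the SVV of the given checking group
        select the latest version in the version chain of the key such that no entry in its dependency list exceeds the corresponding SVV entry
    else
        select the latest version in the version chain of the key
    send the selected version to the client
Upon receiving a PUT request with a key, a value, and a dependency set
    take the maximum timestamp in the dependency set
    call updateHLCforPut with that timestamp
    create a new version with the value, the updated HLC timestamp, and the dependency set
    insert the new version into the version chain of the key
    update the VV with the timestamp of the new version
    send PutReply with the tracking group of the server and the assigned timestamp to the client
    for each other server that hosts the key according to the data placement function
        send a replicate message with the new version to that server
Upon receiving a replicate message from another server
    insert the received version into the version chain of its key
    update the VV with the timestamp of the received version
updateHLCforPut (timestamp)
    advance the HLC to the maximum of its current value, the physical clock, and the given timestamp, adjusting the logical counter according to which of the three is the maximum
Algorithm 2 PUT and GET operations at a server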
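For readers unfamiliar with HLCs, the clock update on a PUT can be sketched with the standard HLC rule for incorporating an external timestamp [9]. The Python class below is an assumption-laden illustration: the field names `l` and `c`, the method name, and the choice to pass the physical time as a parameter are all hypothetical, and the four cases are assumed to correspond to the four branches of updateHLCforPut in Algorithm 2.

```python
from dataclasses import dataclass


@dataclass
class HybridLogicalClock:
    l: int = 0  # largest physical time seen so far
    c: int = 0  # logical counter that breaks ties within the same l

    def update_for_put(self, physical_time, dep_l, dep_c):
        """Advance the clock past both the local state and a dependency timestamp."""
        previous_l = self.l
        self.l = max(previous_l, physical_time, dep_l)
        if self.l == previous_l and self.l == dep_l:
            # Local clock and dependency tie: go past both counters.
            self.c = max(self.c, dep_c) + 1
        elif self.l == previous_l:
            # Local clock is still the maximum: bump the local counter.
            self.c += 1
        elif self.l == dep_l:
            # The dependency is ahead: continue from its counter.
            self.c = dep_c + 1
        else:
            # Physical time moved ahead of everything: reset the counter.
            self.c = 0
        return (self.l, self.c)


clock = HybridLogicalClock(l=10, c=2)
print(clock.update_for_put(physical_time=5, dep_l=10, dep_c=7))
print(clock.update_for_put(physical_time=30, dep_l=1, dep_c=0))
```

The returned pair is strictly greater, in lexicographic order, than both the previous clock value and the dependency timestamp, which is what lets the new version's timestamp respect the dependency condition while staying close to physical time.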

3.3 Evaluation

We have implemented ACCF using DKVF [16]; the implementation is available in the DKVF repository [2]. In this section, we report results for the two groupings of Figure 1 and the two applications described in Section 1. We run the system described in Section 1, consisting of the four partition copies, in different data centers of Amazon AWS [1]. Note that since we focus on partial replication, we do not assume that the copies placed in the same replica in Figure 1(a) are collocated.

Observations for the first application. The first application consists of two clients. The first client writes a value using its copy of the partition. The second client reads that value from the other copy and then writes to that copy. Subsequently, the first client waits to read the new value and writes again, and so on. The best scenario for this case is a system that contains only the two partition copies used by the application. Hence, we normalize the throughput with respect to this scenario.

The results for the first application are shown in Figure 2, for the organizations in Figure 1(a) and Figure 1(b). In Figure 2(a), the locations of three of the partition copies are fixed, and we vary the location of the fourth from California to Singapore (ordered by increasing ping time from California). In Figure 2(b), we keep the locations fixed but artificially add delay to every message sent by one partition copy. As we can see, when the system is viewed as in Figure 1(b), performance is unaffected, whereas when it is viewed as in Figure 1(a), performance drops by more than 50%.

Observations for the second application. In the second application, the client alternates between reading from the two partition copies that it accesses. To provide fresh versions, another client writes the same objects on copies of the two partitions. Here, viewing the system as in Figure 1(b) reduces the performance substantially as the message delay of one partition copy increases. This is because GET operations block while waiting to receive consistent versions. By contrast, when the replicas are viewed as in Figure 1(a), performance remains unaffected. Throughputs are normalized with respect to the case where there is no update.

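Figure 2: Normalized throughput of the two applications for the two groupings of Figure 1. Panels: (a) changing the location of a partition copy; (b) and (c) changing the message delay of a partition copy. The first application has higher throughput with the grouping of Figure 1(b), while the second application has higher throughput with the grouping of Figure 1(a).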

4 Discussion

In this section, we discuss how our approach differs from existing work on adaptive causal systems and identify future work that we are currently pursuing.

Dynamically adding or removing checking groups: Adding checking groups is a straightforward process. Each checking group is associated with a data structure (e.g., SVV in the algorithm provided in Section 3) that the servers need to maintain (in RAM). Hence, if we want to add a new checking group, the system can run the protocol to initialize these fields and make the new checking group available. Removing a checking group is somewhat challenging especially if some client is using it. In this case, we anticipate that the principle-of-locality would be of help. If a client has not utilized a checking group for a while, in most cases, all the data the client has read has been propagated to all copies in the system. In other words, if a client is using a checking group that has disappeared, we can have the client choose a different checking group. It is unlikely to lead to delays, as all replicas already have the data that the client has read. Two practical questions in removing checking groups are (1) the time after which we can remove a checking group and (2) how servers can determine that no client has accessed that checking group in that time. A more difficult question in this work is when to add a new checking group and how many checking groups to maintain. Clearly, we cannot create a checking group for each possible client, as it would require exponentially many checking groups.

Utilizing multiple checking groups simultaneously: Yet another question is whether clients could have multiple checking groups or whether clients can change their checking group. The former would be desirable when the system does not offer a checking group that the client needs. In that situation, the client could choose two (or a few) checking groups whose union is a superset of the checking group requested by the client. The server providing the data would then have to utilize all of these checking groups on the fly to determine which data should be provided to the client.

Learning required checking groups automatically: In this case, the system will learn from client requests to identify when new checking groups should be added and when existing checking groups should be removed. We expect that dynamically changing the checking groups in this manner would be beneficial due to the principle-of-locality, where clients are likely to access data that is similar to the data they accessed before. (Recall that we assume that keys are partitioned with semantic knowledge rather than by approaches such as uniform hashing.) We anticipate that learning techniques such as evolutionary or machine learning techniques would be useful to identify the checking groups that one should maintain.

Dynamically changing the tracking groups: Dynamically changing the tracking groups is more challenging but still potentially feasible in some limited circumstances. The reason is that while checking groups affect the data maintained by the servers at run time (in RAM), tracking groups affect the metadata stored with keys (in long-term permanent storage). In other words, at run time, we may run into a key that was stored with a different tracking grouping. In this case, it is necessary to convert the data stored with the old tracking grouping into the corresponding data in the new tracking grouping. We expect that the principle-of-locality would be of help in this context as well; keys stored long ago are likely to have been updated in all replicas. Conversion of the data stored with keys is protocol specific but still feasible. For example, if we wanted to switch from the tracking grouping used by CausalSpartan [14] (where a vector DSV is maintained with one entry per replica) to that of GentleRain [8] (where only a scalar entry GST is maintained), then we could convert the DSV entry into a GST entry that corresponds to the minimum of the DSV entries. However, the exact approach to do this for different tracking groupings requires semantic knowledge of those tracking groupings.
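The DSV-to-GST conversion mentioned above amounts to taking a minimum, as the following Python sketch shows. The function name and the list representation of the DSV are assumptions made for illustration; the sketch does not reproduce the data layout of either protocol.

```python
def dsv_to_gst(dsv):
    """Convert a per-replica stable vector into a single scalar stable time.

    Taking the minimum is conservative: anything stable under the scalar
    is also stable under every entry of the original vector.
    """
    if not dsv:
        raise ValueError("the DSV needs at least one entry")
    return min(dsv)


# Hypothetical DSV with one entry per replica.
dsv = [12, 7, 30]
gst = dsv_to_gst(dsv)
print(gst)
```

The conversion loses information in one direction only: a version with timestamp 10 from the first replica is stable under the vector (10 is below 12) but not under the scalar (10 is above 7), so it would become visible later after the switch.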

Comparison with related work: Our approach for providing adaptivity in causal consistency is different from other approaches considered in the literature. Occult [13] utilizes structural compression to reduce the size of the timestamps. Other approaches include Bloom filters [4]. While these features are intended as configurable parameters, we believe that it is not possible to change them dynamically at run time while preserving causal consistency (or detection of its violation). Furthermore, in all these cases, the reconfiguration provided is client-agnostic; it does not take client requests into consideration. By contrast, our framework can give different clients different views of the system in a manner that improves their performance. Finally, it is possible to take client requests into consideration to identify how adaptivity should be provided.

5 Conclusion

In this paper, we focused on developing a system that provides causal consistency in an adaptive manner. Specifically, we introduced the notion of tracking and checking groups as a way to generalize existing protocols as well as to develop new adaptive protocols. We provided a framework that, unlike existing causal consistency protocols, can be configured to work with different tracking and checking groupings. This flexibility enables us to trade off between conflicting objectives, and provide different views to different applications so that each application gets the best performance. We argue that the approach and the framework introduced in this paper provide a basis for adaptive causal consistency for replicated data stores.


  • [1] Amazon aws.
  • [2] DKVF.
  • [3] Phillipe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. Challenges to adopting stronger consistency at scale. In HotOS, 2015.
  • [4] Sérgio Almeida, João Leitão, and Luís Rodrigues. Chainreaction: a causal+ consistent datastore based on chain replication. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 85–98. ACM, 2013.
  • [5] Bernadette Charron-Bost. Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39(1):11–16, 1991.
  • [6] Diego Didona, Kristina Spirovska, and Willy Zwaenepoel. Okapi: Causally consistent geo-replication made faster, cheaper and more available. arXiv preprint arXiv:1702.04263, 2017.
  • [7] Jiaqing Du, Sameh Elnikety, Amitabha Roy, and Willy Zwaenepoel. Orbe: Scalable causal consistency using dependency matrices and physical clocks. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, pages 11:1–11:14, New York, NY, USA, 2013.
  • [8] Jiaqing Du, Călin Iorgulescu, Amitabha Roy, and Willy Zwaenepoel. Gentlerain: Cheap and scalable causal consistency with physical clocks. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, pages 4:1–4:13, New York, NY, USA, 2014.
  • [9] Sandeep S Kulkarni, Murat Demirbas, Deepak Madappa, Bharadwaj Avva, and Marcelo Leone. Logical physical clocks. In International Conference on Principles of Distributed Systems, pages 17–32. Springer, 2014.
  • [10] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.
  • [11] Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. Don’t settle for eventual: Scalable causal consistency for wide-area storage with cops. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 401–416, New York, NY, USA, 2011.
  • [12] Wyatt Lloyd, Michael J Freedman, Michael Kaminsky, and David G Andersen. Stronger semantics for low-latency geo-replicated storage. In NSDI, volume 13, pages 313–328, 2013.
  • [13] Syed Akbar Mehdi, Cody Littley, Natacha Crooks, Lorenzo Alvisi, Nathan Bronson, and Wyatt Lloyd. I can’t believe it’s not causal! scalable causal consistency with no slowdown cascades. In NSDI, pages 453–468, 2017.
  • [14] Mohammad Roohitavaf, Murat Demirbas, and Sandeep Kulkarni. Causalspartan: Causal consistency for distributed data stores using hybrid logical clocks. In Reliable Distributed Systems (SRDS), 2017 IEEE 36th Symposium on, pages 184–193. IEEE, 2017.
  • [15] Mohammad Roohitavaf and Sandeep Kulkarni. Gentlerain+: Making gentlerain robust on clock anomalies. arXiv preprint arXiv:1612.05205, 2016.
  • [16] Mohammad Roohitavaf and Sandeep Kulkarni. Dkvf: A framework for rapid prototyping and evaluating distributed key-value stores. arXiv preprint arXiv:1801.05064, 2018.