The purpose of this paper is to propose global stabilization for implementing causal consistency in a partially replicated distributed storage system. Geo-replicated storage systems play a vital role in many distributed systems, providing fault tolerance and low latency when accessing data. In general, there are two types of replication methods: full replication, where the data are identical across all servers, and partial replication, where each server can store a different subset of the data. As the amount of data stored grows rapidly, partial replication is receiving increasing attention (Dahlin et al., 2006; Hélary and Milani, 2006; Shen et al., 2015; Crain and Shapiro, 2015; Xiang and Vaidya, 2017; Bravo et al., 2017). To simplify the development of applications on top of distributed storage, most systems provide consistency guarantees when clients access the data. Among the various consistency models, causal consistency has received significant attention recently because of its emerging applications in social networks. To ensure causal consistency, when a client can get a version of some key, it must also be able to get the versions of other keys that causally precede it. There have been numerous designs for causally consistent distributed storage systems, especially in the context of full replication. For instance, Lazy Replication (Ladin et al., 1992) and SwiftCloud (Zawirski et al., 2014)
utilize vector timestamps as metadata for recording and checking causal dependencies. COPS (Lloyd et al., 2011) and Bolt-on CC (Bailis et al., 2013) track dependent updates more precisely and check causality explicitly. GentleRain (Du et al., 2014) proposed the global stabilization technique for achieving causal consistency, which trades data freshness for throughput. Eunomia (Gunawardhana et al., 2017) also uses global stabilization, but only within each data center, and serializes updates between data centers in a total order that is consistent with causality. This approach avoids the expensive global stabilization among all data centers, but requires a centralized serializer for each data center. Occult (Mehdi et al., 2017) moves dependency checking to the read operation issued by the client to prevent slowdowns from cascading across data centers. In terms of partial replication, there has been some recent progress as well. PRACTI (Dahlin et al., 2006) implements a protocol that sends updates only to the servers that store the corresponding keys, but the metadata is still spread to all servers. In contrast, our algorithm only requires sending the metadata attached to updates and heartbeats to the necessary subset of servers. Saturn (Bravo et al., 2017) implements tree-based metadata dissemination via a tree shared among the datacenters to provide both high throughput and data visibility. All updates between data centers are serialized and transmitted through the shared tree. Our algorithm does not need to maintain such a shared tree topology for propagating metadata; instead, it allows updates and metadata to be sent directly from one server to another. Most relevant to this paper is the global stabilization technique used in GentleRain (Du et al., 2014).
Distributed systems often require their components to exchange heartbeat messages periodically in order to achieve fault tolerance. In the design of GentleRain, each server is equipped with a loosely synchronized physical clock for acquiring the physical time. When a server sends a heartbeat, the value of its physical clock is piggybacked on the message. Also, the timestamp of each update message is the physical time at which the update is issued, and all updates are serialized in a total order by their timestamps. The communication between any two servers is via a FIFO channel, hence the timestamps received by one server from another server are monotonically increasing. Suppose the latest timestamp that server $i$ has received from server $j$ is $t$; then every update from $j$ to $i$ with timestamp $\le t$ has already been received by server $i$. Due to the total ordering of all updates by their physical time, to achieve causal consistency, each server $i$ only needs to calculate the time point $T$ such that the latest timestamp value received from any other server is no less than $T$. This indicates that server $i$ has received all updates with timestamp $\le T$ from other servers, and hence no causal dependency will be missing if server $i$ returns versions with timestamp $\le T$. However, there are several constraints on the design of GentleRain. In particular, (i) GentleRain applies only to full replication, where each datacenter stores a full copy of all the data (key-value pairs); within a datacenter, the key space is partitioned among the servers in that datacenter, and this partition needs to be identical for every datacenter; and (ii) each client can access servers within only one datacenter. In this paper, we aim to extend the idea of global stabilization to a more general case; in particular, (i) we allow arbitrary data assignment across all the servers, and (ii) each client can communicate with any subset of servers for accessing data. The contributions of this paper are the following:
We propose an algorithm that implements causal consistency for general partially replicated distributed storage system. The algorithm allows each server to store an arbitrary subset of the data, and each client can communicate with any subset of the servers.
We prove the optimality of our algorithm in terms of remote update visibility latency, i.e., how fast an update from a remote server becomes visible to the client locally.
We also provide trade-offs to further optimize the remote update visibility latency for realistic scenarios.
We evaluate the performance of our algorithm in comparison with GentleRain in terms of remote update visibility latency.
Other Related Work
Aside from the previous work mentioned above, other work has been dedicated to implementing causal consistency without any false dependencies in partially replicated distributed shared memory. Hélary and Milani (Hélary and Milani, 2006) identified the difficulty of implementing causal consistency in partially replicated distributed storage systems in terms of metadata size. They proposed the notion of the share graph and argued that the metadata size would be large if causal consistency is achieved without false dependencies. However, no solution is provided in their work. Raynal and Ahamad (Raynal and Ahamad, 1998) proposed an algorithm whose worst-case metadata size depends on both the number of servers and the number of replicated objects. Given the huge number of keys stored in a key-value store nowadays, the metadata size in their algorithm is too large for practical use. Shen et al. (Shen et al., 2015) proposed two algorithms, Full-Track and Opt-Track, that keep track of dependent updates explicitly to achieve causal consistency without false dependencies, where Opt-Track is proved to be optimal with respect to the size of metadata in local logs and on update messages. Their amortized message size complexity increases linearly with the number of operations, the number of nodes in the system, and the replication factor. Xiang and Vaidya (Xiang and Vaidya, 2017) investigated how metadata is affected by data replication and client migration by proposing an algorithm that utilizes vector timestamps and by studying lower bounds on the size of metadata. The vector timestamp in their algorithm is a function of the share graph and the client-server communication pattern, and its worst-case size depends on the number of nodes in the system. In the above-mentioned algorithms, the metadata size required to eliminate false dependencies is large, in particular superlinear in the number of servers.
In comparison, the global stabilization technique that our algorithm adapts to partial replication requires only metadata of constant size, independent of the number of servers, clients, or keys.
2. System Model
We consider the classic client-server architecture, as illustrated in Figure 1. Let there be $n$ servers, $S = \{s_1, \dots, s_n\}$, and $m$ clients, $C = \{c_1, \dots, c_m\}$. We do not enforce any constraints on how clients access the servers; that is, each client $c_i$ can talk to an arbitrary set of servers $S_{c_i}$, and each server $s_j$ can handle requests from a set of clients $C_{s_j}$. We assume that each client $c_i$ can access all the keys stored at any server in $S_{c_i}$. For brevity, we drop the index and write $S_c$ and $C_s$ in the following when there is no ambiguity. Let $\mathcal{G}$ consist of all server sets that some client can access, i.e., $\mathcal{G} = \{S_c \mid c \in C\}$. Notice that the size of $\mathcal{G}$ is at most $2^n$, where $n$ is the total number of servers. The communication channel between servers is point-to-point, reliable, and FIFO. Each server has a multi-version key-value store, where a new version of a key is created when a client writes a new value to that key. Each version of a key also stores some metadata for the purpose of maintaining causal consistency. Each server has a physical clock (reflecting physical time in the real world) that is loosely synchronized across all servers by a time synchronization protocol such as NTP (Mills, 1992). Each server periodically sends heartbeat messages (denoted HB) carrying its physical clock value to a selected subset of servers (the choice of the subset is described later). The clock synchronization precision may affect only the performance of our algorithm, not its correctness. To access the data store at a server, a client can issue GET(key) and PUT(key, value) to the server. GET(key) returns to the client the value of the key as well as some metadata. PUT(key, value) creates a new version of the key at the server and returns some metadata to the client. We call all PUT operations issued at a server $s_i$ local PUTs at $s_i$, and all other PUT operations non-local PUTs with respect to $s_i$. We allow an arbitrary partition of the keys across the servers, i.e., each server can store an arbitrary subset of the keys.
In order to model the data partition, we define a share graph, which was originally introduced by Hélary and Milani (Hélary and Milani, 2006). We also define an augmented share graph that further captures how clients access servers.
Definition 1 (Share Graph (Hélary and Milani, 2006)).
The share graph is an unweighted undirected graph, defined as $G^s = (V, E^s)$, where $V = \{1, 2, \dots, n\}$, vertex $i$ represents server $s_i$, and there exists an undirected edge $(i, j) \in E^s$ if servers $s_i$ and $s_j$ share common keys.
The augmented share graph extends the share graph by also connecting the nodes that can be accessed by the same client.
Definition 2 (Augmented Share Graph).
The augmented share graph is an unweighted undirected multi-graph, defined as $G^a = (V, E^a)$, where $V = \{1, 2, \dots, n\}$ and vertex $i$ represents server $s_i$. There exists a real edge $(i, j) \in E^a$ if servers $s_i$ and $s_j$ share common keys, and there exists a virtual edge $(i, j) \in E^a$ if there exists some client $c$ such that $s_i, s_j \in S_c$. Denote the set of real edges in $G^a$ as $E_r$ and the set of virtual edges in $G^a$ as $E_v$.
Example: Figure 2 shows an example of the augmented share graph defined above. In the example, consists vertices , and the common keys shared by any two servers is labeled on each edge. There exists a client that can access , thus vertices are connected by virtual edges. For convenience, we assume that both and are connected. However, the results here trivially extend to the case when the graphs are partitioned. Let denote the set of keys stored at server . Let denote the set of keys shared by servers and . For example, in Figure 2, , , , and , . We assume the augmented share graph is static for most of the paper, and briefly discuss how our algorithm may be adapted when there are data insertion/deletion or adding/removing servers in Section 9.3.
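To make the two graph definitions concrete, the following is a minimal Python sketch that derives the real edges (servers sharing keys) and the virtual edges (servers accessible by a common client) from a key placement and a client access pattern. The placement, the client names, and the function names are hypothetical and are not the ones used in Figure 2.

```python
from itertools import combinations


def share_graph(keys_at):
    """Real edges: unordered server pairs that store at least one common key.

    Returns a dict mapping each edge (a, b) with a < b to the shared key set.
    """
    edges = {}
    for a, b in combinations(sorted(keys_at), 2):
        shared = keys_at[a] & keys_at[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def virtual_edges(servers_of_client):
    """Virtual edges: unordered server pairs that some single client can access."""
    edges = set()
    for servers in servers_of_client.values():
        edges.update(combinations(sorted(servers), 2))
    return edges


# Hypothetical placement of keys on four servers (an assumption for illustration).
keys_at = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z", "w"}, 4: {"w", "x"}}
# Hypothetical access pattern: client c1 can access servers 1 and 3, c2 only server 2.
servers_of_client = {"c1": {1, 3}, "c2": {2}}

real = share_graph(keys_at)
virtual = virtual_edges(servers_of_client)
```

Because the augmented share graph is a multi-graph, a pair of servers may appear in both `real` and `virtual`; keeping the two edge sets separate preserves that distinction.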
2.1. Causal Consistency
Now we provide the formal definition of causal consistency. Firstly, we define the happened-before relation for a pair of operations.
Definition 3 (Happened-before (Lamport, 1978)).
Let $a$ and $b$ be two operations (PUT or GET). $a$ happens before $b$, denoted as $a \rightarrow b$, if and only if at least one of the following rules is satisfied:
$a$ and $b$ are two operations by the same client, and $a$ happens earlier than $b$;
$a$ is a PUT($k$, $v$) operation, $b$ is a GET($k$) operation, and GET($k$) returns the value written by $a$;
there is another operation $c$ such that $a \rightarrow c$ and $c \rightarrow b$.
The above happens-before relation defines a standard causal relationship between two operations. Recall that each client’s PUT operation will create a new version of the key at the server.
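As an illustration of the three rules, the sketch below computes the happened-before relation over a small, hypothetical history of operations. The `Op` record and its field names are assumptions made for this example; the relation is obtained by applying the first two rules directly and then closing the result under transitivity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    name: str                          # unique label of the operation
    client: str                        # issuing client
    seq: int                           # position in that client's issue order
    kind: str                          # "PUT" or "GET"
    key: str
    reads_from: Optional[str] = None   # for a GET: label of the PUT whose value it returned


def happened_before(ops):
    """Return the set of label pairs (a, b) such that a happens before b."""
    hb = set()
    for a in ops:
        for b in ops:
            # Rule 1: same client, a issued earlier than b.
            if a.client == b.client and a.seq < b.seq:
                hb.add((a.name, b.name))
            # Rule 2: b is a GET that returns the value written by the PUT a.
            if (a.kind == "PUT" and b.kind == "GET"
                    and a.key == b.key and b.reads_from == a.name):
                hb.add((a.name, b.name))
    # Rule 3: transitive closure.
    changed = True
    while changed:
        changed = False
        for (a, b) in list(hb):
            for (c, d) in list(hb):
                if b == c and (a, d) not in hb:
                    hb.add((a, d))
                    changed = True
    return hb


# Hypothetical history: bob reads alice's write to x and then writes y.
ops = [
    Op("p1", "alice", 0, "PUT", "x"),
    Op("g1", "bob", 0, "GET", "x", reads_from="p1"),
    Op("p2", "bob", 1, "PUT", "y"),
    Op("p3", "carol", 0, "PUT", "z"),
]
hb = happened_before(ops)
```

In this history `p1` happens before `p2` only through the intermediate GET `g1`, while `p3` is concurrent with every other operation.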
Definition 4 (Causal Dependency (Roohitavaf et al., 2017)).
Let $X$ be a version of key $x$, and $Y$ be a version of key $y$. We say $X$ causally depends on $Y$, and denote it as $X$ dep $Y$, if and only if PUT($y$, $Y$) $\rightarrow$ PUT($x$, $X$).
Now we define the meaning of visibility for a client.
Definition 5 (Visibility (Roohitavaf et al., 2017)).
A version of key is visible to a client , if and only if issued by client to any server in returns a version such that or dep . We say is visible to a client from a server if and only if issued by client to server returns such that or dep .
We say a client can access a key if the client can issue GET() operation to a server that stores . Causal consistency is defined based on the visibility of versions to the clients as follows.
Definition 6 (Causal Consistency (Roohitavaf et al., 2017)).
The key-value storage is causally consistent if both of the following conditions are satisfied.
Let $x$ and $y$ be any two keys in the store. Let $X$ be a version of key $x$, and $Y$ be a version of key $y$ such that $X$ dep $Y$. For any client $c$ that can access both $x$ and $y$, when $X$ is read by client $c$, $Y$ is visible to $c$.
Version $X$ of a key $x$ is visible to a client $c$ after $c$ completes the PUT($x$, $X$) operation.
In Section 3, we will first present the structure of the algorithm for both clients and servers. Then in Section 4, we complete the algorithm by specifying the heartbeat exchange procedure, and the definition of the Local Stable Time (LST) and the Global Stable Time (GST) used for maintaining causal consistency. We also prove in Section 6 the optimality of our algorithm in terms of data visibility, i.e. how fast a local update is visible to clients at other servers. To further optimize the data visibility for remote updates, in Section 7 we present algorithms that can provide trade-off between data visibility and client migration latencies. The evaluation of our algorithm is provided in Section 8. More discussions and extensions are mentioned in Section 9.
3. Algorithm
In this section, we propose the algorithms for the client (Algorithm 1) and the server (Algorithm 2). The overall algorithm structure is inspired by GentleRain (Du et al., 2014) and designed for partial replication. The main idea of our algorithm is to serialize all PUT operations and the resulting versions by their physical clock time (which is a scalar). For all causally dependent versions, our algorithm guarantees that the total order established by their timestamps is consistent with their causal relation, i.e., if $X$ dep $Y$ then $X$'s timestamp is strictly larger than $Y$'s timestamp. Such ordering simplifies causality checking: assuming FIFO channels between all servers, each server knows up to which physical time point it has received updates from other servers. When a server returns a version of a key to a client, the server needs to guarantee that all versions on which the returned version causally depends are already visible to the client. How to decide which version of the key to return is the main challenge of our algorithm, and is addressed by computing and using the Global Stable Time (GST) in the algorithm below and in Section 4. When presenting our algorithm in this section, we leave the Global Stable Time (GST) and the Local Stable Time (LST) undefined; their definitions are provided in Section 4. Intuitively, GST defines a time point such that the versions no later than this time point can be returned to the client without violating causal consistency, and LST is a component for computing GST. We prove the correctness of our algorithm in Section 5. We also prove in Section 6 that our definition of GST is optimal in terms of remote update visibility, i.e., how fast a version of a remote update becomes visible to the client. In Table 1 below, we summarize the symbols used in our algorithm. Recall that $S_c$ is the set of servers that client $c$ can access, and that $\mathcal{G}$ is the collection of all such server sets.
|update time, scalar|
|item version tuple|
|metadata stored at client for get dependencies, scalar|
|metadata stored at client for put dependencies, scalar|
|Local Stable Time stored at client , vector of size|
|Local Stable Time for server set at server , scalar|
|Global Stable Time, scalar|
|server set that is in|
|set of neighbors of server in the share graph|
|heartbeat value from server to server|
|physical clock at server|
|set of servers that server needs to send heartbeat to|
Algorithm 1 is the client's algorithm. Each client can issue GET and PUT operations to the set of servers that it can access. Each client stores one dependency clock (a scalar) for PUT operations, one (also a scalar) for GET operations, and a vector of Local Stable Time values for remote dependencies. All these parameters are specified in Section 4. When issuing an operation, the client attaches its clocks to the operation, and when receiving the result from the server, the client updates its clocks, as specified in Algorithm 1.
Algorithm 2 below is inspired by the algorithm in (Du et al., 2014), with several important differences: (1) The Local Stable Time and the Global Stable Time computations are different and more involved, as will be specified in Section 4. (2) The heartbeat/LST exchange procedures are different. (3) The client keeps slightly more metadata locally, such as the vector of Local Stable Time values. (4) The client's GET operation may block. Such blocking is necessary for satisfying the second condition for causal consistency in Definition 6, i.e., the version written by a client's own PUT is always visible to that client. The intuition of the algorithm is straightforward. When handling a GET operation, the server first checks whether the client may have issued a PUT at another server on some key that this server also stores, and makes sure that such a version is visible to the client. The server then returns the latest version of the key that does not violate causal consistency. The computation of the Global Stable Time (GST) is designed for this purpose, as will be specified in Section 4. When handling a PUT operation, the server first waits until its physical clock exceeds the client's causal dependencies. The server then performs the put locally, sends the update to the other related servers, and replies to the client. The server also handles updates received from other servers. The rest of the algorithm specifies how heartbeats and LSTs are exchanged among the servers.
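The PUT and GET handling described above can be sketched as follows. This is a simplified illustration, not Algorithm 2 itself: the GST is passed in as a parameter, the blocking check for the client's own remote PUTs is omitted, replication is done by calling `receive_update` by hand, and the version-selection rule (return the latest version that is local or no newer than the GST) is an assumption in the style of GentleRain. The `ticking_clock` helper is a hypothetical stand-in for a physical clock.

```python
import itertools


def ticking_clock(start=0):
    """Fake physical clock: every reading is one tick later than the previous one."""
    counter = itertools.count(start)
    return lambda: next(counter)


class Server:
    def __init__(self, sid, clock):
        self.sid = sid
        self.clock = clock      # callable returning this server's physical time
        self.versions = {}      # key -> list of (update_time, value, origin_server)

    def put(self, key, value, client_dep):
        # Wait until the physical clock exceeds the client's dependency time, so
        # the new timestamp is strictly larger than everything it depends on.
        while self.clock() <= client_dep:
            pass
        ut = self.clock()
        self.versions.setdefault(key, []).append((ut, value, self.sid))
        return ut               # a full server would also replicate (key, value, ut)

    def receive_update(self, key, value, ut, origin):
        # Store a version created by a PUT at another server.
        self.versions.setdefault(key, []).append((ut, value, origin))

    def get(self, key, gst):
        # Latest version that was written locally or is no newer than the stable time.
        ok = [v for v in self.versions.get(key, [])
              if v[2] == self.sid or v[0] <= gst]
        return max(ok, default=None)


s1 = Server(1, ticking_clock(100))
s2 = Server(2, ticking_clock(0))                   # this clock lags far behind s1's
ut_x = s1.put("x", "a", client_dep=0)
ut_y = s2.put("y", "b", client_dep=ut_x)           # waits until s2's clock passes ut_x
s1.receive_update("y", "b", ut_y, origin=2)
```

Even though the second server's clock starts far behind, the wait in `put` forces `ut_y > ut_x`, which is the property that lets a single scalar threshold stand in for explicit dependency checks.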
4. Global Stabilization
In this section, we complete the algorithm by defining heartbeat exchange procedure and Global Stable Time computation. We will specify for each server the set of destination servers its heartbeat/LST messages need to be sent to and how to compute GST from received messages. The Global Stable Time is a function of the augmented share graph defined in Section 2.
4.1. Server Side: GST Computation and Heartbeat Exchange
Global Stable Time computation. Let denote the value attached with the heartbeat message sent from server to . We will later use the term heartbeat value or heartbeat message to refer . Basically, the Global Stable Time in our Algorithm 2 computes the minimum of a set of heartbeat values, representing the time point up to which all the causal dependencies have been received at corresponding servers. In this section, we provide the computation of function to be the value of . We define the length of a cycle to be the number of nodes in the cycle. Nodes with both a real edge and a virtual edge between is considered a valid cycle of length . We will use to denote the directed edge from node to . Define set with respect to a key as follows. For every simple cycle of length in such that , , we have , and if is a real edge, we also have . As an example, in Figure 2, . Recall that consists all servers set that some client can access, i.e. . Notice that the size of is where is the total number of servers. Define set with respect to as follows. For every simple path in such that , , we have if is a real edge, and if is a real edge. Define set for to be a subset of , such that for , if and only if . As an example, in Figure 2, let , then and . For client , let where . Define as follows.
Observation 1: For any and , we have . Observation 2: For any , if for some server , we have . Heartbeat exchange. In order to compute the defined above, server needs to know the heartbeat values for all pairs . Thus, our algorithm requires servers to send the following messages for computing .
will send heartbeat messages to , for such that .
will send heartbeat messages to , for such that .
will compute the Local Stable Time as
and send it to periodically
Thus, the target server set that server needs to send heartbeats to can be computed as . Now we reformulate the expression for using the definition of above. Recall that . Let us denote
meaning the clock value for the local dependencies. Observation 3: For a server set containing server and any , if for some server , we have . Then can be written as
If we denote the second part as meaning the clock value for the remote dependencies, then the can be simply written as
Finally, the computation of used in our Algorithm 2 also depends on client’s dependency clock for remote dependencies:
4.2. Client Side
Due to the delay of communication between servers, the values of s may be different at different servers in . For instance, server may receive from server at time , but server may only receive stale messages at due to network delay. To avoid issues caused by such inconsistency, the client will locally keep the value of the largest it has seen so far for . Therefore, the client maintains a vector of size for values. Also, the client will keep two scalars and as the metadata for get and put dependencies respectively.
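The client-side bookkeeping described above amounts to keeping running maxima. The sketch below assumes hypothetical field names (`dt_get`, `dt_put`, `remote`) and hypothetical reply formats; the exact update rules of Algorithm 1 are not reproduced here.

```python
class ClientClocks:
    def __init__(self, server_sets):
        self.dt_get = 0                              # scalar for GET dependencies
        self.dt_put = 0                              # scalar for PUT dependencies
        # Largest stable-time value seen so far, one entry per server set.
        self.remote = {g: 0 for g in server_sets}

    def on_get_reply(self, version_ut, remote_from_server):
        # Remember the newest version read and merge the server's remote values.
        self.dt_get = max(self.dt_get, version_ut)
        self._merge(remote_from_server)

    def on_put_reply(self, ut):
        # Remember the timestamp assigned to the client's own write.
        self.dt_put = max(self.dt_put, ut)

    def _merge(self, remote_from_server):
        # Entry-wise maximum, so a stale reply never lowers a stored value.
        for g, t in remote_from_server.items():
            self.remote[g] = max(self.remote.get(g, 0), t)


clocks = ClientClocks(["g1", "g2"])
clocks.on_get_reply(10, {"g1": 7, "g2": 3})
clocks.on_get_reply(8, {"g1": 5, "g2": 6})           # reply from a server with staler g1
clocks.on_put_reply(12)
```

Taking the entry-wise maximum means that a reply from a server holding stale information cannot move the client's view backwards.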
Lemma 1.
Suppose that PUT() PUT(), and thus dep . Let denote the corresponding updates of PUT() and PUT(), and let denote their timestamps. Then , and .
If two PUTs are issued by the same client, when PUT() is issued, by line of Algorithm 2, will be larger than the client’s value, which is by line of Algorithm 1. Hence . If two PUTs are issued by different clients, and the happen-before relation is due to the second client reading the version of the first client’s PUT(), and then issuing PUT(). By line of Algorithm 2 and line of Algorithm 1, when the second client issues PUT(), the dependency timestamp in line of Algorithm 1 will be . Similarly due to line of Algorithm 2. For other cases when PUT() PUT(), by transitivity we have . Since the timestamp of a version equals the timestamp for the corresponding replication update , we also have . ∎
Lemma 2.
Suppose at some real time , a version of key is read by client from server . Consider any server and version of key such that is due to a PUT at some server other than , and dep . Then at time , (i) has been received by server , (ii) the version is visible to client from server .
The proof is provided in Section 11. ∎
Theorem 3.
The key-value storage is causally consistent.
To prove the first condition: Let and be any two keys in the store. Let be a version of key , and be a version of key such that dep . For any client that can access both and , when is read by client , is visible to . If is due to a local PUT at the server that client is accessing, then by line of Algorithm 2, is visible to client . Otherwise, if is due to a non-local PUT, according to Lemma 2, is received by the server which the client is accessing, and is also visible to the client. To prove the second condition: A version of a key is visible to a client after completes PUT() operation. Consider a client issuing GET() after a PUT() operation. When the GET operation is returned, we have according to line of Algorithm 2. The definition of implies that is already received by . Then by line of Algorithm 2, version is visible to client . ∎
6. Optimality of the Algorithm
In this section, we prove that the computed by our algorithm is optimal in terms of remote update visibility, that is, the version of remote update is visible to the client as soon as possible. To show the optimality, we show that at line of Algorithm 2 returning any version with timestamp larger than our value may violate causal consistency, indicating our definition of is optimal. Formally, we have the following theorem.
Theorem 1.
If any version with is returned to client from server as a result of its operation, then there may exists a version of some key such that dep and client can access key , but version is not visible to client , which violates the causal consistency by Definition 6.
Recall the definition of from Section 4.
where , , and . By line of Algorithm 1 and line of Algorithm 2, the value of the client keeps is the largest value it has seen so far from servers it accessed so far for . By definition, , which implies that is also computed as the minimum value of a set of heartbeat values. By the definitions above, we observe that our is computed as the minimum of a set of heartbeat values from server to server where . Let be the minimum heartbeat value from the set and therefore . There are two cases. Case I: , and thus . By the definition of , there exists a simple cycle of length in such that , , we have , and if is a real edge. First observe that due to the fact that version with is returned to the client, we have , otherwise version is not received by server yet since the latest heartbeat value received by from is . Without loss of generality, let . We can show the following possible execution that will violate causal consistency. Let there be a at server which results in a version with timestamp such that dep . The causal dependency can be created by the same procedure as described in Case I of the proof for Lemma 2. For completeness, we state the procedure here again. First, a client issues PUT() at server , which leads to an update from to . Then for sequentially, a client reads the version written by the previous client from server via a GET operation at server . If , client then issues PUT() at where , which leads to a replication update from to . If , without loss of generality, suppose can access both . Then issues PUT() at where . In the end, client reads the version , written by client , from server , and issues PUT() at server , which results in an update from to . By the definition of happens-before relation, it is clear that PUT() PUT(), namely dep . Since , is not received by at the time when version is returned to the client . Now let be delayed indefinitely, which is possible since the system is asynchronous. 
Consider the case that after reading version , client issues at server . Suppose that client does not issue any operation before, and thus its . Notice that the get operation is non-blocking when by line of Algorithm 2, only an older version of key such that dep may be returned to client since is delayed and not received by server . Hence is not visible to client , which violates the causal consistency.
Case II: . Then by definition, there exists a simple cycle of length in such that and . Let . Let there be a at server which results in a version with timestamp such that dep . The causal dependency can be created by a similar procedure as described in Case I above, with differences at the end: client reads the version from server , and issues PUT() at server where . Then some client that only access server () reads the version and issues PUT() at server . The fact that ensure that when client can read without being received by . Now let be delayed indefinitely. Suppose that after client gets version , it issues at server . Similar to Case I, is not visible to client , which violates the causal consistency. ∎
7. Optimization for Better Visibility
We say a client migrates from server $s_i$ to server $s_j$ if the client issues some operation to server $s_i$ first, and then to server $s_j$. Previously, in Sections 3 and 4, we allowed each client $c$ to communicate freely with any server in $S_c$, even when the client migrates across different servers. In reality, the frequency of such migration may be low, i.e., a client is likely to communicate with a single main server for a long period before changing to another one, or to communicate with another server only when the main server is not responding. If such migration across different servers occurs infrequently, it is possible to introduce extra latency during the migration in order to achieve better remote update visibility when clients issue GET operations. In fact, some system designs have already observed this trade-off, such as Saturn (Bravo et al., 2017). However, Saturn's solution requires maintaining an extra shared tree topology among all the servers, and is different from our global stabilization approach. In Section 7.1 below, we demonstrate how to design the algorithm to achieve better data visibility along the lines discussed above. Then, in Section 7.2, we generalize the idea from individual servers to groups of servers.
7.1. Basic Case
We will use the same notation from Section 3 and 4. Recall that the Global Stable Time , computed for the client accessing server for the value of key , is the minimum of a set of heartbeat clock values, including all possible local dependencies and remote dependencies. Essentially, the reason for taking remote heartbeat values received by servers other than is to ensure that after client migrates to another server in , there is no need for blocking in the operation since all causal dependencies are ensured to be visible to the client as proven in Lemma 2. Naturally, if the client can wait for a certain period of time during the migration such that the target server is guaranteed to have the client’s causal dependencies visible, then the GST computation does not need to include the remote heartbeat values necessarily. Specifically, the Global Stable Time simply becomes
which only reflects the causal dependencies locally. When a client migrates to another server, it needs to execute operation MIGRATE as shown in Algorithm 3. Basically, the client will send its dependencies clock to the new target server for migration. For the target server, it needs to ensure the local storage has already included all the versions in the client’s causal dependencies it should have before returning acknowledgment. Specifically, the server needs to wait until is no less than the client’s dependency clock, as shown in line of Algorithm 3.
Also, there is no exchange of Local Stable Time among the servers, since now the computation of GST does not include the remote heartbeat values. This implies significant savings in bandwidth usage as the number of servers in the system scales. The major advantage of Algorithm 3 is increasing the data visibility. By line of Algorithm 3, the GST for client’s GET operations is equal to , which is very likely to be larger than the original GST, because the latter takes the minimum latest remote heartbeat values for computation. Then the version returned by client’s GET operation is more likely to have larger timestamps compared to Algorithm 2. Although there are extra delays incurred during the client’s migration procedure as in line of Algorithm 3, the penalty caused by migration delays is small if the frequency of migration is low.
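The waiting step of MIGRATE can be sketched as below. This is an illustration under assumptions, not Algorithm 3: the stable time at the target server is taken to be the minimum of the latest heartbeat values it holds, and an iterator of `(sender, value)` pairs stands in for blocking on the network. All names are hypothetical.

```python
def local_stable_time(latest_heartbeat):
    """Minimum over the latest heartbeat values this server has received."""
    return min(latest_heartbeat.values())


def migrate(latest_heartbeat, client_dep, incoming):
    """Consume heartbeats until the client's dependency clock is covered.

    Returns how many heartbeats had to be processed before the target server
    could acknowledge the migration.
    """
    consumed = 0
    incoming = iter(incoming)
    while local_stable_time(latest_heartbeat) < client_dep:
        sender, value = next(incoming)               # stands in for a blocking receive
        latest_heartbeat[sender] = max(latest_heartbeat[sender], value)
        consumed += 1
    return consumed


# Hypothetical state at the target server and a hypothetical heartbeat stream.
heartbeats = {"s2": 5, "s3": 9}
waited = migrate(heartbeats, client_dep=8,
                 incoming=[("s3", 10), ("s2", 7), ("s2", 8), ("s3", 11)])
```

The acknowledgment is held back only until the slowest relevant sender catches up with the client's dependency clock; a client whose dependencies are already covered is acknowledged immediately.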
7.2. General Case
The above idea can be further extended. In the basic case, we consider each server as a separate "group", and introduce extra latencies when clients migrate from one group to another. In general, a client may frequently communicate with a subset of servers for a period of time, and later migrate to another subset of servers for frequent access. For instance, each subset may be a datacenter that consists of several servers, and each client usually accesses only one datacenter for PUT/GET operations. In this case, each "group" that the client will access consists of a subset of servers.
For the general case mentioned above, we can design an algorithm in which the client can access servers within a group freely without extra delays, but needs to wait extra time when migrating across groups. The algorithm is presented in Algorithm 4. For brevity, we only show the differences from the algorithm in Section 3. Let be the set of all groups that clients may access. Then the size of is where is the total number of servers. We will use the same notation as in Sections 3 and 4. The augmented share graph in this section contains virtual edges connecting all servers accessible by one client, including servers within the same group and across groups. Then when a client is accessing server in group , the Global Stable Time is computed as
When the client migrates to another group , an extra delay will be enforced. In particular, the server in group needs to wait until where is the dependency clock of the client. The extra latency here ensures that all of the client’s causal dependencies have been received by the servers in the group , and are visible to the client. Notice that the algorithm in Section 3 and Algorithm 3 are both special cases of Algorithm 4, where group equals and individual server respectively.
In this section, we evaluate the remote update visibility latency (or latency for short) of our algorithm compared with the original GentleRain (Du et al., 2014). The remote update visibility latency is defined as the time period from when a remote update is received by the server to when the remote update is visible to the client.
8.1. Evaluation Setup
For evaluation purposes, we only implement and evaluate the stabilization procedure described in our algorithm from Section 3, instead of the entire key-value store. We also simulate multiple server processes within a single machine, and control network latencies by manually adding extra delays to all network packets. There are several reasons for using this approach: i) The performance of our algorithm is independent of the underlying key-value store, and can be evaluated individually. ii) It is easier to control all parameters and evaluate their influence on the remote update visibility latency of our algorithm.
We mainly evaluate our algorithm on a special family of share graphs for ease of comprehension. The graphs we use are ring graphs of size , with each node being both a client and a server. The client of one node will only access the server of that node. This family of share graphs can represent simple robotic networks in practice: each node is a robot that stores key-value pairs dependent on its physical location, and only shares keys with its neighbors. Our algorithm turns out to be very straightforward for this family of share graphs: each node sends heartbeat messages only to its neighbors, and GST is computed as the minimum of the heartbeat values received from its neighbors.
For the original GentleRain algorithm, it is not clear how to handle partial replication. Therefore we adopt the most straightforward approach: treating the system as fully replicated, for which GentleRain achieves causal consistency correctly. Then in GentleRain, the GST for each node is computed as the minimum of the heartbeat values from all nodes in the ring. For example, for a ring graph of size , our algorithm will compute GST for any node as the minimum of the heartbeat values from its neighbors. In GentleRain, the GST is computed as the minimum of the heartbeat values from all other nodes.
Hence, intuitively, GentleRain will have a smaller GST value at each server than our algorithm, because its GST is computed as the minimum of a larger set of heartbeat values. This implies that only older versions can be visible to the client compared with our algorithm, leading to higher remote update visibility latencies.
The machine used in this experiment runs Ubuntu 16.04 with an 8-core 3.4 GHz CPU, 16 GB of memory and 128 GB of SSD storage. The program is written in Golang, and uses standard TCP socket communication. The program executes multiple threads simultaneously, including i) one thread that periodically sends heartbeat messages to its neighbors, ii) one thread that periodically sends update messages (due to operations) to target nodes, iii) one thread that listens for and receives messages from other nodes, and iv) one thread that periodically computes and checks which remote updates are visible to the client.
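The difference between the two GST computations on a ring can be made concrete with a small sketch. It is written in Python for brevity (the evaluated program itself is in Golang), assumes a ring of at least three nodes, and uses made-up heartbeat values; the function names `ring_neighbors`, `gst_neighbors` and `gst_all` are hypothetical.

```python
def ring_neighbors(i, n):
    """Indices of the two neighbors of node i in a ring of n >= 3 nodes."""
    return [(i - 1) % n, (i + 1) % n]


def gst_neighbors(i, heartbeats):
    """GST at node i as the minimum heartbeat received from its ring neighbors."""
    n = len(heartbeats)
    return min(heartbeats[j] for j in ring_neighbors(i, n))


def gst_all(i, heartbeats):
    """GST at node i as the minimum heartbeat from all other nodes
    (the full-replication treatment used for the GentleRain baseline)."""
    return min(h for j, h in enumerate(heartbeats) if j != i)


# Illustrative latest heartbeat values known for nodes 0..5 (made-up numbers).
heartbeats = [100, 98, 90, 97, 99, 101]
print(gst_neighbors(0, heartbeats))  # 98: minimum over nodes 5 and 1
print(gst_all(0, heartbeats))        # 90: minimum over nodes 1..5
```

Because the neighbor set is a subset of all other nodes, the neighbor-only minimum can never be smaller than the minimum over all nodes, which is the intuition stated above.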
8.2. Evaluation Results
We evaluate the influence of several parameters on the remote update visibility latency (or latency for short), including update throughput, heartbeat frequency, stabilization frequency, clock skew, and ring size. The latency presented in this section is the average latency over all updates from all servers. In each experiment, we repeat the measurement times and take the average as a data point. Each experiment varies one or two parameters while keeping the other parameters fixed. The default parameters for all experiments are listed below: stabilization frequency = , heartbeat frequency = , network delay = or , ring size = , update throughput = and clock skew =.
Update Throughput. Since we run all server programs in a single VM, there is a limit on the maximum update throughput, which is about updates per second for each server program when we have nodes running on our machine. There also exists a threshold beyond which the machine is overwhelmed by the update messages, leading to a dramatic increase in the remote update visibility latencies. In order to find this threshold, we plot the latency against the update throughput in Figures 3(a) and 3(b) with and network delays respectively.
As we can see from Figures 3(a) and 3(b), the threshold would be some value when network delay is and when network delay is . For the evaluations below, we set the update throughput to be for each node, since we will later increase other parameters such as ring size, heartbeat frequency and stabilization frequency.
Stabilization Frequencies and Heartbeat Frequencies. In this experiment, we vary both the stabilization frequency and the heartbeat frequency, and compare the average latency of our algorithm with that of GentleRain. The network delay is set to be in this experiment.
| HB fq. \ Stab fq. | 1 | 10 | 100 | 500 | 1000 |
From Table 2 we can observe that our algorithm significantly improves latencies compared with GentleRain for the cases we measured. Here are some observations:
For both algorithms, the visibility latency decreases significantly with higher stabilization frequencies, except when the heartbeat frequency is high in GentleRain. In the latter case, the machine is already overwhelmed by heartbeat messages, so increasing the stabilization frequency actually degrades the performance.
The heartbeat frequency does not influence the visibility latency of our algorithm much, since update messages from neighbors at frequency about also carry clock values, and computation can proceed with such clock values. However, this is not the case for GentleRain, because each node needs to receive clocks from all other nodes rather than only its neighbors, but the update messages each node receives only come from its neighbors. Low heartbeat frequencies therefore delay the GST computation and thus increase the visibility latencies of GentleRain.
Therefore, the visibility latencies improve with higher heartbeat frequencies in GentleRain, until the system is overwhelmed by the heartbeat message communication, as in the case of heartbeat frequency . On the other hand, our algorithm does not suffer from this problem, since the heartbeat messages in our algorithm are only sent to a small set of nodes in the system, and this set is optimal according to Section 6.
Clock Skew. To evaluate the influence of clock skew on the visibility latency, we manually add clock skews between any pair of neighbors in the ring. Label the nodes in the ring with id where is the ring size. For a skew value , we add clock skew to node . We vary the skew value from ms to ms, and plot the resulting visibility latency in Figure 4 below.
As we can see from Figures 4(a) and 4(b), the remote update visibility latencies grow almost linearly with the clock skew in both cases. This is expected, since the latency is determined by the minimum clock value received by the server, which is influenced by the clock skew between servers. Also, our algorithm performs significantly better than GentleRain in terms of remote update visibility latency under various clock skews.
Ring Sizes. Intuitively, the ring size will affect the remote update visibility latency of GentleRain, since the number of heartbeat values received by any node grows linearly with the ring size, leading to smaller GST and larger latencies. However, our algorithm will not be affected much, since the number of heartbeat values received is fixed and equal to the number of neighbors in the ring. Figures 5(a) and 5(b) below validate the discussion above, and demonstrate the scalability of our algorithm. In both cases, the visibility latency of our algorithm remains stable while the latency of GentleRain increases as the ring size increases. Notice that with network delay of , the visibility latency grows dramatically larger (more than ) as ring size increases.
9. Discussions and Extensions
9.1. Fault Tolerance
In this section, we discuss how various failures such as server failure, network failure or network partitioning may affect our stabilization algorithm. Our discussion is similar to that in GentleRain (Du et al., 2014), since similar analysis can be applied to all stabilization-based algorithms for causal consistency. The main observation is that our stabilization algorithm guarantees causal consistency even if the system suffers from machine failure, machine slowdown, network delay or partitioning. Recall that in our algorithm, versions are totally ordered by their timestamps, which correspond to the physical time at which each version is created. When a client issues a GET operation, the version returned has a timestamp no greater than the Global Stable Time.
When a server fails, the client may not receive any response from the server. However, since our algorithm allows clients to migrate across servers, the client can time out after a long period of waiting and then connect to another server to issue operations. The failure of the server, however, will affect the computation of GST at other servers, since the failed server no longer sends heartbeat messages to its neighbors, and thus the value of GST at some servers may stop advancing. In this case, causal consistency is preserved, since the version returned to the client may be stale but is causally consistent. However, to make sure the system can make progress and eventually make newer versions visible to the client, other servers should be able to detect the failure and eventually recompute the value of GST. For instance, servers can set a timeout for heartbeat and LST exchanges. If one server does not receive a message from another server before the timeout expires, it can mark that server as failed. How to recompute the new GST to make progress while keeping the system causally consistent is an interesting open problem.
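The timeout-based detection mentioned above could look like the following minimal sketch. It only covers marking a neighbor as suspected; it does not attempt to recompute GST, which the discussion above leaves open. The class `HeartbeatMonitor`, its methods, and the use of explicit `now` arguments (instead of reading a real clock) are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Hypothetical per-server tracker of when each neighbor was last heard from."""

    def __init__(self, neighbors, timeout):
        self.timeout = timeout
        # Assumption: time 0.0 is when monitoring starts.
        self.last_seen = {n: 0.0 for n in neighbors}

    def record(self, neighbor, now):
        """Note that a heartbeat or update from `neighbor` arrived at time `now`."""
        self.last_seen[neighbor] = max(self.last_seen[neighbor], now)

    def suspected(self, now):
        """Neighbors whose last message is older than the timeout at time `now`."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


monitor = HeartbeatMonitor(["s1", "s2"], timeout=5.0)
monitor.record("s1", 10.0)
monitor.record("s2", 3.0)
print(monitor.suspected(12.0))  # {'s2'}: nothing heard from s2 for 9 time units
```

A suspected server may merely be slow or partitioned rather than failed, so any action taken on suspicion would still need to preserve causal consistency.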
We will briefly discuss ideas on how to handle dynamic changes in the system, such as node join/leave and key insert/delete, in Section 9.3. For other issues such as machine slowdown, network delay or partitioning, the computation of GST may similarly stop making progress, but the version returned to the client is guaranteed to be causally consistent. Once the system recovers from the failure, the pending heartbeats or updates can be applied at the corresponding servers, and GST can continue to increase. One possible failure that can cause a violation of causal consistency is packet loss, in particular the loss of update messages. Update loss may result in returning a stale version to the client that is not causally consistent due to missing versions. In practice, we can always use a reliable communication protocol for transmitting update messages to handle this issue.
9.2. Using Hybrid Logical Clocks
To reduce the latency of the PUT operation caused by clock skew, we can use hybrid logical clocks (HLC) (Roohitavaf et al., 2017) instead of a single scalar as the timestamps. The HLC for an event has two parts, a physical clock and a bounded logical clock . The HLC is designed to have the property that if event happens before event , then (Roohitavaf et al., 2017). By replacing the scalar timestamp with HLC, we can avoid the blocking at line 8 of Algorithm 2. More details about HLC can be found in (Roohitavaf et al., 2017).
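As a rough illustration of how an HLC orders causally related events without blocking, the sketch below follows the update rules as they are commonly presented for hybrid logical clocks; these rules are an assumption here and are not taken from this paper's algorithms. The class name `HLC`, the fields `l` and `c`, and the injected `physical_time` callable are hypothetical.

```python
class HLC:
    """Sketch of a hybrid logical clock: l tracks physical time, c breaks ties."""

    def __init__(self, physical_time):
        self.physical_time = physical_time  # callable returning an integer time
        self.l = 0
        self.c = 0

    def tick(self):
        """Timestamp a local or send event."""
        prev = self.l
        self.l = max(prev, self.physical_time())
        self.c = self.c + 1 if self.l == prev else 0
        return (self.l, self.c)

    def receive(self, ml, mc):
        """Timestamp the receipt of a message stamped (ml, mc)."""
        prev = self.l
        self.l = max(prev, ml, self.physical_time())
        if self.l == prev and self.l == ml:
            self.c = max(self.c, mc) + 1
        elif self.l == prev:
            self.c += 1
        elif self.l == ml:
            self.c = mc + 1
        else:
            self.c = 0
        return (self.l, self.c)


# The sender's physical clock (100) is ahead of the receiver's (90).
sender = HLC(lambda: 100)
receiver = HLC(lambda: 90)
sent = sender.tick()            # (100, 0)
got = receiver.receive(*sent)   # (100, 1): ordered after `sent` without waiting
```

Timestamps are compared lexicographically as pairs, so the receive event is ordered after the send event even though the receiver's physical clock lags behind, which is why waiting for the physical clock to catch up is no longer needed.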
9.3. Dynamic Systems
This section briefly discusses how the algorithm may be adapted for dynamic systems where keys can be inserted or deleted at each server, and servers themselves can also be added or removed. A change in the system can essentially be modeled as an augmented share graph change from to . The system may batch the changes to improve efficiency. When the system experiences changes, the algorithm should guarantee that causal consistency is not violated. That is, the versions returned to the client during the system change should still be causally consistent. Therefore, the algorithm should ensure that during the system change, the computed Global Stable Time is still nondecreasing. However, due to the change of the augmented share graph, it is possible that the GST computed in the new augmented share graph has smaller values than the old one. To ensure causal consistency, the algorithm can continue to use the old GST value at the time point when the augmented share graph changes, until the new GST value exceeds