Fault-tolerant distributed systems provide a dependable service on top of unreliable computers and networks. They implement fault tolerance protocols that replicate the system and ensure that, from the outside, all (unreliable) replicas are perceived as a single reliable one. This has been formalized by notions such as strong consistency, consensus, and state machine replication. These protocols are crucial parts of many distributed systems, and their correctness is very hard to establish. Protocol designers are faced with the challenges of buffered message queues, message re-ordering by the network, message loss, asynchrony and concurrency, and process faults. Reasoning about all these features is notoriously hard, as discussed in several research papers [Chandra2007PML, OngaroO14, HawblitzelHKLPR17, EPaxos, RenesseSS15], and as a consequence testing tools like Jepsen [Jepsen] have found conceptual design bugs in deployed implementations.
A programming abstraction of synchronous rounds [DLS88:jacm, Lyn96:book, Charron-BostS09] would relieve the designer from many of these difficulties. Synchronous round-based algorithms are more structured, easier to understand, and have simpler behaviors. As one only has to reason about specific global states at the round boundaries, they entail simpler correctness arguments. However, it is also well-understood that synchronous distributed systems are often “impossible or inefficient to implement” [Lyn96:book, p. 5]. Hence, designers turn to the asynchronous model, in which the performance emerges [Lann03] from the current load of the system, and which in normal operation has significantly better performance. Thus, no synchronous algorithm is used in any real large-scale system we are aware of.
In the face of the complementary advantages of synchronous and asynchronous models, the question is how to connect these two worlds. We consider the question: given an asynchronous algorithm, does it have a synchronous canonic counterpart?
The main difficulty stems from the fundamental discrepancy in the control structure, that is, (i) interleavings and (ii) message buffers:
(i) Interleavings: In synchronous round-based models, computation is structured in rounds that are executed by all processes in lock-step. There is no interleaving between steps; the beginning and the end of each step are synchronized across processes, by definition. In the asynchronous computational model, executions are much less structured. Processes are scheduled according to interleaving semantics. This leads to an exponential number of intermediate global states (exponential in the number of steps), vs. a linear one in the synchronous case.
(ii) Message buffers: In the synchronous model, messages are received in the same round they are sent. Thus, the number of messages that are in transit is bounded at all times and depends on the number of processes. In the asynchronous model, fast processes may generate messages quicker than slow processes can process them. Thus, communication needs to be buffered, and the buffer size is unbounded. The number of messages in a buffer depends on the number of processes, but also on the number of send instructions executed by each process. Moreover, the network may reorder messages, that is, a process may receive a new message before older ones that are still in transit.
Due to this discrepancy, there is no obvious reason why an asynchronous algorithm should have a synchronous canonic form. In general, there is none. We characterize asynchronous systems that allow us to dissolve this discrepancy.
As not all conceivable asynchronous protocols can be rewritten into synchronous ones, we focus on characteristics of practical distributed systems. From a high-level viewpoint, distributed systems are about coordination in the absence of a global clock. Thus, distributed algorithms implement an abstract notion of time to coordinate. This notion of time may be implicit. However, the local state of a process maintains this abstract time notion, and a process timestamps the messages it sends accordingly. Synchronous algorithms do not need to implement an abstract notion of time, as it is present from the beginning: the round number plays this role and it is embedded in the definition of any synchronous computational model. The key insight of our results is the existence of a correspondence between values of the abstract clock in the asynchronous systems and round numbers in the synchronous ones. Using this correspondence, we make explicit the “hidden” round-based synchronous structure of an asynchronous algorithm. More systematically, in an asynchronous system:
abstract time is encoded in local variables. Modifications of their values mark the progress of abstract time, and — making the correspondence to round-based algorithms — the local beginning/ending of a round;
a global round consists of all the steps processes execute for the same value of the abstract clock;
messages are timestamped with the value of the abstract time of the sender, when the message was sent; a receiver can read the abstract time at which the message was sent, and compare it with its own abstract time;
in order to have a faithful correspondence to round-based semantics, we consider communication-closed protocols: the reception of a message is effectful only if its timestamp is equal to or greater than the local time of the receiver. In other words, stale messages are discarded.
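The discard rule above can be written as a small mailbox filter. This is only a sketch: the dict representation of messages and the integer field `ts` (the sender's abstract time at the send) are illustrative assumptions, not the paper's concrete types.

```python
def filter_mailbox(local_time, mailbox):
    """Communication-closed reception: keep only messages whose timestamp
    is the current local time or a future one; stale messages are discarded."""
    return [m for m in mailbox if m["ts"] >= local_time]
```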
Based on these insights (i) we characterize asynchronous protocols whose executions can be reduced to well-formed canonic executions, (ii) we define a computational model CompHO whose executions are all canonic by definition, (iii) we show how to translate an asynchronous protocol into code for the CompHO framework, and (iv) we show the benefits of this computed canonic form for design, validation, and verification of distributed systems. We discuss these four points by using a running example in the following section.
2 Our approach at a glance
The running example in Fig. 1 is inspired by typical fault-tolerant distributed protocols that often rely on the notion of leadership. For example, primary back-up algorithms use a leader to order the client requests and to ensure this order among all replicas. A leader is a process that is connected via timely links to a majority of replicas. Hence, the system should try to make progress only in ballots in which such a well-connected leader exists. The algorithm in Fig. 1 implements just the leader election part. All processes execute the same code; the number of processes is a parameter. In each loop iteration, each process queries its coord oracle in line 1 to check whether it is a leader candidate. Multiple processes may be candidates in the same iteration. Depending on the outcome, the code then branches into a leader branch and a follower branch. A candidate process sends to all in line 1. Then, the leader branch has the same code as the follower branch, that is, waiting for the first message by a candidate in the loop starting at lines 1 and 1. Thus, if there are multiple candidates, there is a race between them. If a message from a candidate for a current or future ballot is received, processes update their ballot in lines 1 and 1
, and then set their leader estimate in the next line. This estimate is then sent to all in lines 1 and 1, and then processes wait to receive messages from a majority (), for the current ballot. If all received messages carry the same leader identity, then a process knows a leader is elected in the current ballot, and it records the leader’s identity in lines 1 and 1. From a more structural viewpoint, because this protocol is supposed to be fault-tolerant, the receive statements in, e.g., lines 1 and 1 are non-blocking and may return NULL if no message is there.
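The election check described above (messages from a majority, all carrying the same leader identity) can be sketched as follows. The function name `elected_leader` and the flat list representation of the mailbox are hypothetical, for illustration only.

```python
def elected_leader(n, mailbox):
    # mailbox: leader identities carried by messages received for the
    # current ballot; n is the number of processes.
    # A leader is recorded only if more than half of the n processes sent
    # a message and all received messages carry the same leader identity.
    if len(mailbox) > n // 2 and len(set(mailbox)) == 1:
        return mailbox[0]
    return None
```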
Consider the asynchronous execution on the left of Figure 2. Process P1 always takes the follower branch, P3 is a candidate in ballot 1, but its messages get delayed. So process P2 times out in ballot 1 in lines 1 and 1, and becomes a candidate in ballot 2. Its message reaches P1, which jumps to ballot 2 and sends its leader estimate to all in line 1, as does P2. However, only P1 receives all these messages, so that it gets over the threshold to set its leader in line 1. Messages marked with a cross are dropped by the network. As the messages by P1 arrive late, i.e., the receiver’s local time passed the timestamp of the message, the messages sent by P1 become stale and are disregarded by P2 and P3.
As a result, the late messages sent by P1 have the same effect as if they were dropped by the network. In this view, the execution on the right of Figure 2 is obtained from the one on the left by a “rubber band transformation” [Mattern89], that is, transitions are reordered while maintaining the local control flow and the causality imposed by sending and receiving a message. We call the execution on the right the canonic form of the execution on the left.
Characterization of existence of a canonic form.
Let us understand whether each execution of the asynchronous protocol from the example can be brought into a canonic form. The first observation is that the variables ballot and label encode abstract time: abstract time ranges over pairs of values of ballot and label. We fix NewBallot to be less than AckBallot, and consider the lexicographic order over these pairs. Then we observe that the sequence of pairs induced by an execution at a process is monotonically increasing; thus it encodes a notion of time. However, a locally monotonically increasing sequence of values is not sufficient to derive a global notion of time, i.e., a globally aligned ascending sequence of values, as in Figure 2 on the right. Technically, aligning means that we need a reduction argument where (i) we tag an event with the local time of the process at which it occurs, and (ii) if in an execution a transition tagged with a larger time happens before a transition tagged with a smaller time at another process, then swapping these two transitions should again result in an execution. That is, such transitions should commute. This condition is satisfied when stale messages are discarded. In other words, it is ensured if the protocol is communication-closed: first, each process only sends for the current timestamp, e.g., the send statement in line 1 sends a message that carries the current pair. Second, each process receives only for the current or a higher timestamp, e.g., received messages are stored, e.g., in line 1, only if they carry the current or a future pair; cf. line 1.
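The lexicographic order on (ballot, label) pairs and the local monotonicity check can be sketched as follows. The numeric encoding of the two labels is an assumption made for illustration; it only fixes NewBallot < AckBallot as above.

```python
LABELS = {"NewBallot": 0, "AckBallot": 1}   # fixes NewBallot < AckBallot

def tag(ballot, label):
    # Abstract time as a pair, compared lexicographically.
    return (ballot, LABELS[label])

def locally_monotone(trace):
    """Check that the (ballot, label) tags along one local execution
    form a monotonically increasing sequence."""
    tags = [tag(b, l) for b, l in trace]
    return all(t1 <= t2 for t1, t2 in zip(tags, tags[1:]))
```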
We introduce a tag annotation in which the programmer can provide us with the variables and parts of the messages that are supposed to encode abstract time and timestamps. Using these tags, the protocol is annotated with verification conditions stating that (1) abstract time is monotonically increasing (line 1), which uses the predicate in Fig. 10 (page 10), (2) each process sends only for the current timestamp (line 1), and (3) each process receives only messages from the current or a higher timestamp (line 1). Given an annotated asynchronous protocol, we check the validity of these assertions using the static verifier VeriFast [verifast]. VeriFast was extremely useful to prove the conditions on the content of the mailbox, which is an unbounded list (lines 1 and 1). For example, the assert at line 1 states that all messages in the mailbox have their ballot and label fields equal to the local variables ballot and label. The other predicates are given in Fig. 10.
If these checks are successful, the existence of a canonic form is guaranteed by a new reduction theorem proven in Section 6. Our reduction uses ideas from [EF82], where the notion of communication closure was introduced for CSP. Our notion of communication closure is more permissive than the original one, as we allow a process to react to a message that is timestamped with a higher value than its current local abstract time, provided that the code immediately “jumps forward in time” to the message’s timestamp. This corresponds in our example to P1 jumping to ballot 2 upon reception of the message by P2.
In contrast to Lipton’s reduction [Lipton75], where one proves that actions in an execution can be moved in order to get a similar execution with large atomic blocks of local code, for distributed algorithms one proves that one can group together the send transitions of all processes, then the receive transitions of all processes, and then all computation steps, for all times in increasing order. In this way, we formally establish that the asynchronous execution on the left of Figure 2 corresponds to the so-called round-based execution on the right. We call executions of this form canonic.
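The grouping argument can be illustrated by a stable sort on tagged events. The triple representation and the numeric kind ordering (sends before receives before computation steps) are illustrative assumptions; communication closure is what makes the reordering sound.

```python
KIND = {"send": 0, "recv": 1, "comp": 2}

def canonic(events):
    """Reorder an execution into canonic form: for each time value in
    increasing order, first all sends, then all receives, then all
    computation steps. Events are (process, time, kind) triples; the
    sort is stable, so events with equal keys keep their relative order."""
    return sorted(events, key=lambda e: (e[1], KIND[e[2]]))
```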
A computational model for canonic executions.
Since we can reduce an asynchronous execution to a canonic one, our goal is to rewrite an asynchronous protocol into a program with round-based semantics. For this we first need to establish a programming model for canonic (round-based) protocols. Several round models exist in the literature [DLS88:jacm, Gafni98, SW89:stacs, Charron-BostS09]. We adapt ideas from these models for our needs, and introduce our new CompHO model. It allows us to express a more fine-grained modeling of faults, network timing, and sub-routines. The closest model from the literature is the Heard-Of Model [Charron-BostS09], which CompHO extends to multi-shot algorithms (multiple inputs received during the executions) and to a compositional semantics based on synchronized distributed procedure calls. Figure 3 shows the CompHO program obtained from Figure 1.
The interesting feature that abstracts away faults and timeouts are the so-called HO sets. For each round and each process, the HO set contains the set of processes from which that process hears of in that round, i.e., whose messages show up in mbox in, e.g., line 3. For instance, in the simplest case, if a message from a process is lost, the sender just does not appear in the HO set. But the forward jumping is also accounted for. Consider the execution on the left of Figure 4. While the processes P1 and P2 made progress, P3 was disconnected from them. Then, while locally still being in ballot 1, P3 receives a message for ballot 20. The execution on the right is an execution in CompHO, and the jump by process P3 is captured by its HO sets being empty for the skipped rounds. For all the skipped rounds, mbox evaluates to the empty list.
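A minimal sketch of how an HO set induces a mailbox, assuming a map from sender identities to payloads; an empty HO set yields an empty mailbox, which is how skipped rounds appear.

```python
def round_mbox(ho_set, sent):
    """A process's mailbox for a round contains exactly the messages
    whose senders are in its HO set; anything else counts as lost."""
    return {q: m for q, m in sent.items() if q in ho_set}
```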
We have augmented the Heard-Of Model with in() and out() primitives for multi-shot algorithms such as state machine replication where the system gets commands from an external client and should output results.
Computing the canonic form.
Having defined the round-based semantics of CompHO, we introduce a rewriting procedure. It takes as input the asynchronous protocol, together with the annotations that have been checked to entail a canonic form, and produces as output the protocol rewritten in CompHO.
The main challenges for the rewriting come from the different possible control flows of the programs and their relation to the abstract notion of time. Due to branching (e.g., in line 1), code that appears in different places may need to be composed into code for the same round; e.g., line 1 on the leader branch belongs to the same round as line 1 on the follower branch and will end up in line 3 of the canonic form. (More precisely, the generated code has branching within a round and the statement of line 3 appears in both branches. We simplified the code given in Figure 3 for better readability.) In addition, there are jumps in the local time; cf. Figure 4. This corresponds in CompHO to phases and rounds that are skipped by a process, that is, it neither receives nor sends messages, and maintains (stutters) its local state. The asynchronous statements before and after a jump must be properly mapped to rounds in CompHO. We address these issues in Section 7 and have implemented our solution. We used it to automatically generate CompHO code for several asynchronous protocols.
Benefits of a round-based synchronous normal form.
The generated CompHO code represents a valuable design artifact. First, a designer can check whether the implementation meets the original intuition. For instance, the left execution in Figure 4 gives a typical asynchronous execution. In papers on systems, designers explain their systems with well-formed executions like the one on the right. The designer can check with the CompHO code whether the asynchronous protocol implements the intended ballot and round structure, and whether phase jumps can occur only at intended places. Second, it helps in comparing protocols: Different ways to implement the branching due to roles (e.g., leader, follower) lead to different asynchronous protocols. If different asynchronous protocols have the same canonic form, then they encode the same distributed algorithm.
Finally, the canonic form paves the way for automated verification. The specification of the running example (in the asynchronous and the synchronous version) is that in any ballot, if two processes find that leader election was successful (i.e., their log_ballot entry is true), then they agree on the leader.
To prove this property in the asynchronous model, already for the first ballot, one needs to introduce auxiliary variables to be able to state an inductive invariant. A process elects a leader if it receives messages from a majority of processes carrying the same leader id as payload, so one needs to reason about the set of messages received. As discussed in [SergeyWT18], this can only be achieved by introducing auxiliary variables that record the complete history of the message pool. Then one can state an invariant over the asynchronous execution that relates the local state of processes (decided or not) with the message pool history.
The proof of the same property for the synchronous protocol requires no such invariant. Due to communication closure, no messages need to be maintained after a round has terminated, that is, there is no message pool. One just needs to consider the transition relation of a phase, or ballot (the composition of the two rounds). The global state after the transition, that is, at the end of the phase, captures exactly which processes elected a leader in the considered phase.
In general, to prove the specification, we need invariants that quantify over the ballot number. As processes decide asynchronously, the proof of ballot 1 for some process must refer to the first entry of log_ballot of processes that might already be in ballot 400. Thus, the invariants need to capture the complete message history and the complete local state of processes. The proof in the synchronous case is modular: for any two phases, messages do not interfere, and processes write to different ballot/phase entries. Therefore the agreement proof for one ballot generalizes to all ballots.
Many verification techniques benefit from a reduction from asynchronous to synchronous. In particular, the model checking techniques in [MaricSB17, TS11] are designed specifically for the Heard-Of model [Charron-BostS09, BielyWCGHS07] and can be applied to our output. Theorem provers like Isabelle/HOL were successfully used to prove total correctness of algorithms in the Heard-Of model [Charron-BostDM11]. We used deductive verification methods for the Heard-Of model [vmcai] and proved the (partial) correctness of the synchronous version of the running example (and other protocols).
3 Asynchronous protocols
Protocols are written in the core language in Fig. 5. All processes execute the same sequential code, which is enriched with send, receive, and timeout statements.
The communication between processes is done via typed messages. Message payloads are wrappers of primitive or composite type. Wrappers are used to distinguish payload types from the types of the other program variables. Send instructions take as input an object of some payload type and the receiver’s identity, or a special value corresponding to a send to all (broadcast). Receive statements return an object of payload type and the identity of the sender, that is, one message is received at a time. Receives are non-blocking: if no message is available, receive returns NULL. We assume that each loop contains at least one send or receive statement. The iterative sequential computations are done in local functions, i.e., f(). The instructions in() and out() are used to communicate with an external environment (processes not running the protocol).
The semantics of a program is the asynchronous parallel composition of the actions performed by all processes. Formally, the state of a protocol is a tuple consisting of a valuation s of the protocol variables, where the program location is added to the local state, and the multiset of messages in transit (the network may lose and duplicate messages). Given a process p ∈ P, s(p) is the local state of p, which is a valuation of p’s local variables, i.e., s(p) ∈ [Vp → D]. We use a special value ⊥ to represent the state of crashed processes. When comparing local states, ⊥ is treated as a wildcard state that matches any state.
The messages sent by a process are added to the global pool of messages, and a receive statement removes a message from the pool. The interface operations in and out do not modify the local state of a process. These are the only statements that generate observable events.
An execution is an infinite alternating sequence of protocol states and transitions, where each transition corresponds to the execution of a local statement and is labeled with the observable events generated by that statement (if any). The set of executions of a protocol is defined accordingly.
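A minimal sketch of the global message pool as a multiset, assuming string messages for illustration: send adds a message, receive removes one, and receive on an empty pool returns None, matching the non-blocking receives of the language.

```python
from collections import Counter

class Network:
    """The global message pool as a multiset."""
    def __init__(self):
        self.pool = Counter()

    def send(self, msg):
        self.pool[msg] += 1       # duplicates are possible: a multiset

    def receive(self):
        for msg, count in self.pool.items():
            if count > 0:
                self.pool[msg] -= 1
                return msg
        return None               # no message available: non-blocking
```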
4 Round-based model
We introduce CompHO by first presenting the syntax and semantics of the intra-procedural version of CompHO, and then extending it to the inter-procedural case.
Intra-procedural CompHO Model.
CompHO captures round-based distributed algorithms: all processes execute the same code, and the computation is structured in rounds, where the round number is an abstract notion of time: processes are in the same round and progress to the next round simultaneously. We denote by P the set of processes; its cardinality is a parameter. Faults and timeouts are modeled by messages not being received. In this way, the central concept is the Heard-Of set, HO-set for short: the HO-set of a process in a round contains the processes from which it has heard of — has received messages from — in that round.
protocol  ::= interface variable init phase
interface ::= in: () type | out: type ()
A CompHO protocol is composed of local variables, an initialization operation init, and a non-empty sequence of rounds, called a phase. The syntax is given in Fig. 6. A round is an object with a send and an update method, and the phase is a fixed-size array of rounds. Each round is parameterized by a type T, which represents the payload of the messages. The send function has no side effects and returns the messages to be sent, a partial map from receivers to payloads, based on the local state of each sender. The update function takes as input the received messages, i.e., a partial map from senders to payloads, and updates the local state of a process. It may communicate with an external client via in(), which returns an input value, and out(), which outputs a value to the client. For data computations, update uses iterative control structures only indirectly via auxiliary functions, like all_same in the running example, whose definition we omit.
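The shape of a round object can be sketched as follows. The class name, the numeric-estimate payload, and the update rule are illustrative assumptions, not the paper's concrete API; the point is the signature discipline: send is side-effect free, update consumes a partial map from senders to payloads.

```python
class EstimateRound:
    """A round: `send` returns a partial map from receivers to payloads
    without side effects; `update` consumes a partial map from senders
    to payloads and updates the local state."""
    def send(self, state, processes):
        # broadcast the local estimate to every process
        return {q: state["est"] for q in processes}

    def update(self, state, mbox):
        # adopt the smallest received estimate, if any message arrived
        if mbox:
            state["est"] = min(mbox.values())
        return state
```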
The set of executions of a CompHO protocol is defined by the execution of the send and update functions of the rounds in the phase array in a loop, starting from the initial configuration defined by init.
A protocol state is a tuple where:
is the set of processes executing the protocol;
indicates if the next operation is send or update;
stores the process local states;
is the round number, i.e., the counter for the executed rounds;
stores the in-transit messages, whose payload has type T;
evaluates the HO-sets for the current round.
The semantics is shown in Figure 7. Initially the system state is undefined. The first transition calls the init operation on all processes (see Start in Fig. 7), initializing the state: the round counter is initialized, and no messages are in the system. Start brings the system into a state that requires the next transition to be a Send. After that, an execution alternates Send and Update transitions. In the Send step, all processes send messages, which are added to a pool of messages, without modifying the local states. The values of the HO sets are updated non-deterministically to subsets of P. The messages in the pool are triples of the form (sender, payload, recipient), where the sender and recipient are processes and the payload has type T. The triples are obtained from the map returned by send, to which we add the identity of the process that executed send. In an Update step, messages are received and the update operation is applied at each process. A message is lost if the sender’s identity does not belong to the HO set of the receiver. The set of received messages is the input of update. If the processes communicate with an external process, then update might produce observable events. These events correspond to calls to in(), which returns an input value, and out(), which sends the value given as parameter to the client. The communication with external processes is non-blocking; we assume that in() always returns a value when called. At the end of the round, the message pool is purged and the round number is incremented by 1.
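The Send/Update alternation can be sketched as follows, with the HO sets passed in explicitly in place of the non-deterministic choice; the round functions `bcast` and `keep_max` are illustrative stand-ins for a round's send and update.

```python
def run_round(states, ho, send, update):
    """One Send/Update alternation: every process sends (messages go into
    a pool), then each process p receives exactly the messages whose
    senders are in ho[p], then every process updates its local state."""
    pool = {p: send(states[p], list(states)) for p in states}   # Send step
    for p in states:                                            # Update step
        mbox = {q: pool[q][p] for q in ho[p] if p in pool[q]}
        states[p] = update(states[p], mbox)
    return states   # the pool is discarded here: the round is purged

# Illustrative round functions: broadcast the local value, keep the max.
bcast = lambda st, procs: {q: st for q in procs}
keep_max = lambda st, mbox: max([st] + list(mbox.values()))
```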
The right diagram of Fig. 2 corresponds to an execution of the CompHO protocol in Fig. 3. The Send step of round AckEpoch consists of process P3 sending in line 3, and the environment dropping its messages to P1 and P2. As they do not receive messages, Update does not result in a state change due to line 3. Hence old_mbox1 does not change, so that the guard in line 3 evaluates to false at P2 and P3, so that they do not send in the AckEpoch round.
Inter-procedural CompHO Model.
We introduce distributed procedure calls to capture realistic examples. In Multi-Paxos [generalizedpaxos], processes agree on an order over client commands. This order is stored in a local log that contains the commands received/committed so far. Consider Figure 8. Here a new leader gets elected with a NewBallot and AckBallot message exchange, almost as in our example in Section 2. The difference is in the AckBallot round, where followers (1) send only to the leader instead of using all-to-all communication, and (2) the message payload contains the current log of the follower. Then the leader computes the longest log and sends it to its followers, in the third round called NewLog. Those that receive the new log start a subprotocol. The subprotocol iterates through an unbounded number of phases, each consisting of a sequence of rounds, Prepare, PrepareOK, and Commit, in which the replicas put commands in their logs. Iteratively, the leader takes a new input command from the client and forwards it to the replicas using a Prepare message. Followers reply with PrepareOK, acknowledging the reception of the new command. If the leader receives acknowledgements from a quorum, it sends a Commit message; otherwise it considers its quorum lost and returns to leader election. A follower that does not receive a message from the leader considers the leader crashed, and control returns from the subprotocol to the leader election protocol.
We only sketch the model. The inter-procedural CompHO protocol differs from its intra-procedural version only in the update function. A process may call another protocol and block until the call to this other protocol returns. An update may call at most one protocol on each path in its control flow (a sequence of calls can be implemented using multiple rounds). Due to branching, only a subset of the processes may make a call in a round. Thus, an inter-procedural CompHO protocol is a collection of (inter/intra-procedural) non-recursive CompHO protocols that call each other, with a main protocol as entry point.
5 Formalizing Communication Closure using Tags
We now introduce tags that use so-called synchronization variables to annotate protocols. A tagging function induces a local decomposition of any execution, where a new block starts whenever the evaluation of the synchronization variables changes. (Recall the variables ballot and label in our example in Section 2.) This tagging thus represents a novel formalization of communication-closed protocols using syntactic definitions of local decompositions.
Definition 1 (Tag annotation)
For a protocol , a tag annotation is a tuple :
is a tuple of fresh variables,
is a function that annotates each control location with a partially defined injective function that maps the synchronization variables to protocol variables, and
is an injective partially defined function, that maps variables in to components of the message type (of the same type).
The evaluation of a tag over ’s semantics is denoted , where
, is a function over the set of local process states, , defined by , with
if , where is the i-th variable in and is the program counter,
is a function that for any value of message type associates a tuple with
if , where is the i-th variable in and , the mapping of over the message type T, is defined in ;
For the protocol in Fig. 1, we consider the tag annotation over two variables that, at all control locations, associates the first with the ballot number and the second with label. The first two components of the messages are mapped to these variables. A message (sent in line 1) is thus evaluated by the tag into the pair formed of its ballot and label fields, and the state tag evaluates into the values of the variables ballot and label.
We characterize tag annotations that imply communication closure:
Definition 2 (Synchronization tag)
Given a program , an annotation tag is called synchronization tag iff:
for any local execution of a process in the semantics of , is a monotonically increasing sequence of tuples of values w.r.t. the lexicographic order.
for any local execution , if is a transition of , with a value of some message type, then and .
for any local execution , if is a transition of , with a value of some message type, then
if the tag mapping is surjective (T is a message type).
, otherwise. Also .
if then .
for any local execution , if is a transition of such that
, , that is, s and s’ differ on the variables that are neither of some message type nor synchronization variables,
or stm is a send, break, continue, or out(),
then the state tag must equal the maximal tag among the messages stored in the reception variables of type Set(T). That is, observable state changes and sends happen only if the state tag matches the maximal received message tag.
If an annotation tag is a synchronization tag, the variables that annotated the protocol are called synchronization variables.
Condition 1 states that the variables incarnating the abstract time are not decreased by any local statement. Condition 2 states that any message sent is tagged with a timestamp that equals the local time of the sender. Condition 3 states that any message received and stored is tagged with a timestamp greater than or equal to the current time of the process, i.e., the tag of the state where it is received; moreover, all stored messages timestamped with values greater than the local time must carry equal timestamps. Finally, Condition 4 states that if messages from future rounds are stored in the reception variables, any statement executed afterwards must not change the observable state, but rather increase the tag until the process has arrived at the maximal time it received a message from.
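Conditions 1 to 3 on a single local execution can be sketched as a trace check. The event encoding is an illustrative assumption: each event is a triple (kind, state_tag, msg_tag), with msg_tag set to None for internal steps and tags compared lexicographically as Python tuples.

```python
def check_sync_tag(trace):
    """Check Conditions 1-3 of a synchronization tag on one local trace."""
    prev = None
    for kind, state_tag, msg_tag in trace:
        if prev is not None and state_tag < prev:
            return False               # Condition 1: time never decreases
        if kind == "send" and msg_tag != state_tag:
            return False               # Condition 2: send at the current time
        if kind == "recv" and msg_tag < state_tag:
            return False               # Condition 3: no stale message stored
        prev = state_tag
    return True
```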
Tags and CompHO protocols.
An intra-procedural CompHO protocol as defined in Section 4 executes an (infinite) sequence of phases, each consisting of a fixed number of rounds. It is thus natural to annotate the code of an asynchronous protocol with a tag. In an inter-procedural CompHO protocol, within a round, processes may call an inner CompHO protocol. Here, an instance of an inner round can be identified by the phase and round of the outer (calling) protocol, and the phase and round of the inner protocol. We are thus led in the following to consider tags that capture this structure:
We start with preliminary definitions. Given two pairs of values (p, r) and (p', r'), the pair (p', r') is the successor of (p, r) w.r.t. the lexicographic order, denoted succ((p, r)), where (1) if r is the maximum value in its domain, then p' is the successor of p w.r.t. the order on its domain and r' is the minimum value in the domain of r; (2) else p' = p and r' is the successor of r in its domain.
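Assuming the usual lexicographic successor on (phase, round) pairs, this definition can be sketched in C; the fixed round domain 0..2 is a hypothetical example, not taken from any protocol.

```c
#include <assert.h>

/* Hypothetical round domain: rounds range over 0..MAX_ROUND. */
#define MAX_ROUND 2

typedef struct { int phase; int round; } tag_t;

/* Lexicographic successor: if the round is maximal in its domain,
 * advance the phase and reset the round to its minimum (case 1);
 * otherwise keep the phase and advance the round (case 2). */
tag_t succ(tag_t t) {
    tag_t s;
    if (t.round == MAX_ROUND) {
        s.phase = t.phase + 1;
        s.round = 0;
    } else {
        s.phase = t.phase;
        s.round = t.round + 1;
    }
    return s;
}
```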
Definition 3 (CompHO synchronization tag)
Given a protocol annotated with a synchronization tag , the tag is called a CompHO synchronization tag if it has an even number of variables, i.e., , such that each pair has a different type (on at least one of the components) and
takes a constant number of values, for all in ,
the monotonically increasing order is refined; for any local execution of a process in the semantics of , where iff
if then for all and for all .
Further, if or , the tag is called incremental.
For every , is called a phase tag and is called round tag.
5.1 Verification of synchronization tags
Given a protocol annotated with a (CompHO-) tag , checking that the tag is a (CompHO-) synchronization tag reduces to checking a reachability problem on the local code, that is, in a sequential system.
The non-sequential instructions are the sends and receives, appearing in Conditions 2 and 3 of Definition 2. Checking that sent messages are tagged with the tag of the state they are sent in, that is Condition 2, reduces to checking equality between local variables: the components (tagged by ) of a message type variable and the local variables associated by at the control location that sends . Recall that a send does not modify the local state, so it can be replaced with an assert corresponding to the aforementioned equality.
Checking that messages with lower tags are dropped, that is, Condition 3, is done by checking that the messages that are added to mbox have values (on the tagged components) greater than or equal to the local variables associated by at the control location where the addition occurs. We assume that may return any message and we check that the filters that guard the message’s addition to the mailbox respect the order relation w.r.t. the state tags. This is again expressed by a state property that relates message fields with tag variables.
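Such a filter check can be sketched as follows in C. The real check is a state assertion verified by Verifast, not executable code; the struct layout, the lexicographic comparison, and all names here are illustrative assumptions.

```c
#include <assert.h>

#define MBOX_CAP 16

typedef struct { int phase; int round; int payload; } msg_t;

/* Local state tag and mailbox (sketch; names are illustrative). */
typedef struct {
    int phase, round;
    msg_t mbox[MBOX_CAP];
    int mbox_size;
} proc_t;

/* Lexicographic comparison of a message tag against the state tag. */
static int tag_geq(const msg_t *m, const proc_t *p) {
    return m->phase > p->phase ||
          (m->phase == p->phase && m->round >= p->round);
}

/* Filter guarding the addition to the mailbox: messages with lower
 * tags are dropped (Condition 3). The assert mirrors the state
 * property relating message fields with tag variables. */
void store(proc_t *p, msg_t m) {
    if (!tag_geq(&m, p)) return;   /* drop stale message */
    assert(tag_geq(&m, p));        /* property checked at this location */
    if (p->mbox_size < MBOX_CAP)
        p->mbox[p->mbox_size++] = m;
}
```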
Conditions 1 and 4 in Def. 2 (and Def. 3) translate into transition invariants over the synchronization variables. They state that the lexicographic order (monotonic, respectively increasing) is preserved by any two consecutive assignments to the synchronization variables.
We automated these checks with the static verifier Verifast, and report in Section 8 on our experiments.
6 Reducing an asynchr. execution to its canonic form
After having introduced synchronization tags, we now show that any execution of an asynchronous protocol that has a synchronization tag can be reduced to a canonic execution. The proof proceeds in several steps, where in each step we will obtain a more restricted execution. The steps are as follows:
- Asynchronous executions.
We start with an asynchronous execution as defined in Section 3. Due to asynchronous interleavings, an action at process that belongs to round may come before an action at some other process in round , for .
- Big receive.
In order to capture jumping forward in rounds, we regroup statements at different processes to arrive at an asynchronous execution where, for each process, a sequence of receive statements (followed by local computations for a jump) appears in a block. Thus, we can replace these blocks by a single atomic . The resulting executions we denote by .
- Monotonic executions.
We reduce asynchronous executions with Big receive semantics to executions where all tags are (non-strictly) monotonically increasing. As a result, all actions for round appear before all actions for all rounds , for .
- Round-based executions.
We reduce monotonic executions to CompHO executions as defined in Section 4.
In each step, we maintain the following important property between the original execution and the execution we reduce to:
Definition 4 (Indistinguishability)
Given two executions and of a protocol , we say a process cannot distinguish locally between and w.r.t. a set of variables , denoted , if the projection of both executions on the sequence of states of , restricted to the variables in , agree up to finite stuttering, denoted, .
Two executions and are indistinguishable w.r.t. a set of variables , denoted , iff no process can distinguish between them, i.e., .
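Abstracting local states to integers, agreement up to finite stuttering can be sketched as follows; this is an illustrative model for intuition, not part of the tool.

```c
#include <assert.h>

/* Remove finite stuttering from a sequence of (abstracted) local
 * states: keep only positions where the state changes. Writes the
 * destuttered sequence into out and returns its length. */
int destutter(const int *seq, int n, int *out) {
    int m = 0;
    for (int i = 0; i < n; i++)
        if (m == 0 || out[m - 1] != seq[i])
            out[m++] = seq[i];
    return m;
}

/* Two projections agree up to finite stuttering iff their
 * destuttered forms are equal (sequences bounded by 64 here). */
int stutter_equiv(const int *a, int na, const int *b, int nb) {
    int da[64], db[64];
    int ma = destutter(a, na, da);
    int mb = destutter(b, nb, db);
    if (ma != mb) return 0;
    for (int i = 0; i < ma; i++)
        if (da[i] != db[i]) return 0;
    return 1;
}
```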
We focus on indistinguishability because it preserves so-called local properties [Chaouch-SaadCM09], or equivalently properties that are closed under local stuttering. Important fault-tolerant distributed safety and liveness specifications fall into this class: consensus, state machine replication, primary back-up, k-set consensus, etc.
Definition 5 (Local properties)
A property is local if, for any two indistinguishable executions and , one satisfies the property iff the other does.
In the following we will denote by a global state and by the local state of process in the global state .
Reducing Asynchrony to Big receive.
This reduction considers the receive statements. If the local execution is of the form , , the two receive actions can be interleaved with actions of other processes in the asynchronous execution. Following the theory by Lipton [Lipton75], all receive statements are right movers with respect to all other operations of other processes, as the corresponding send must always be to the left of the receive. In this way, we reduce an asynchronous execution to one where local sequences of receives appear as a block. By the same argument, this block can be moved to the right until the first action of this process. Again the resulting block can be moved to the right w.r.t. actions at other processes. By repeating this argument, we obtain an asynchronous execution with blocks that consist of several receives (possibly just one) and statements such that at the end the local state tag matches the maximal received message tag, i.e., the process has jumped forward to a round from which it received a message. We subsume such a block by an (atomic) action , and denote by the asynchronous semantics with the atomic Receive.
Reducing Big receive to monotonic.
Given a program if there is a synchronization tag for , then , if , and , then
Further are indistinguishable w.r.t. all protocol variables, i.e., .
From and Condition 1 it follows that , so that swapping cannot violate the local control flow. As , if is a send or a stm, the action at has no influence on the applicability of to . The only remaining case is that is a . Only if sends a message that is received in can not be moved to the left. We prove by contradiction that this is not the case: by Condition 2, . By Conditions 3 and 4, and the atomicity of Receive, . Thus , which provides the required contradiction to the assumption of the lemma.
The statement on indistinguishability follows from the reduction. ∎
By inductive application of the theorem, we obtain:
Given a program if there is a synchronization tag for , then , there is a monotonic asynchronous execution , where for each and any two processes and , .
The monotonic execution is thus a sequential composition of actions of rounds in increasing order, that is, all actions of round occur before all actions in round , for all . Thus, the global state between the last round action and the first round action constitutes the boundary between these rounds. In the following section we will show that we can simplify the reasoning within a round.
Reducing a Round to a Synchronous round.
In order to reduce monotonic executions to semantics, we re-use arguments from [Chaouch-SaadCM09], which we extend to asynchronous programs. We consider distributed programs of a specific form: the local code within each round is structured such that first there are sends, then Receives, and then other statements. Similarly, we only consider protocols where it is sufficient to check states only when the tags change. If the local code within a round is “subsumed” into a single local transition, we do not lose any observable events. Rather, the subsumption is locally stutter equivalent to the original asynchronous semantics.
As we start from monotonic executions here, we can restrict ourselves to swapping actions within a round and only have to care about moving send and receive actions. For this, we can use the arguments from [Chaouch-SaadCM09]: the send actions are left movers with respect to all other operations, and Receive actions are left movers with respect to all statements except sends. By repeated application of their arguments, we arrive at executions where within a round all send actions come before all Receive actions, which come before all other actions. We call these executions send-receive-compute executions:
For each monotonic asynchronous execution, there exists an indistinguishable asynchronous send-receive-compute execution.
All sends are non-interfering and can thus be “accelerated”, or “subsumed”, into one global send action. As in CompHO all messages sent in a round must be of identical payload type, the type sent in the subsumed action is the union of the payload types. The receives are treated similarly. Here, the sets are defined by the processes from which a process received messages in its receive operations. If in the original execution process jumped over round , there are no send, receive, or local computation actions for in . As we require in the semantics that every process performs these steps in each round, we complete the execution with nop steps for the missing rounds. As these do not change which messages are received in the asynchronous execution, nor which local states the processes go through, we again remain stutter equivalent, and obtain:
For each asynchronous send-receive-compute execution, there exists an indistinguishable CompHO execution se, where the messages received in a Receive statement correspond to the sets.
Following Definition 5, local properties are those closed under indistinguishability, so that we obtain the following theorem.
If there exists a synchronization tag for , then there exists a CompHO-execution se that satisfies the same local properties.
7 Code to code rewriting of Asynchronous to CompHO
We introduce a rewriting algorithm make-CompHO that takes as input an asynchronous protocol annotated with a synchronization tag and either produces an (inter-procedural) CompHO protocol, denoted CompHO(), whose executions are indistinguishable from the executions of , or aborts.
Replacing reception loops with atomic mailbox definition.
We consider asynchronous protocols where message reception is implemented in a distinguished loop, which we refer to as a “reception loop”. A reception loop is a simple while(true) loop that (1) contains recv statements, (2) writes only to variables of message type, or containers of message type objects, and (3) is exited either because of a timeout or because some condition over the message type variables written in the loop holds. The algorithm in Fig. 1 has four reception loops. The exit condition of a reception loop is typically a cardinality constraint or a timeout, e.g., mboxsize > n/2 or mboxsize == 1.
A reception loop is replaced by havoc assignments of the message type (or container of message type) variables written by the loop. The code following the loop is left unchanged except in the following cases: (1) the boolean conditions that refer to a loop timeout are replaced by the negation of all the other conditions to exit the loop; (2) if the loop does not have a timeout exit, that is, processes wait until all required messages are received, the code following the loop is wrapped into an if statement, allowing its execution only if the loop's exit condition holds. In the rest of the section we consider only protocols whose reception loops have been replaced by havoc statements.
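A minimal sketch of the rewrite for a loop without a timeout exit follows, with the havoc modeled as an arbitrary value supplied by the environment; the names, the process count, and the majority threshold are illustrative assumptions.

```c
#include <assert.h>

#define N 5   /* number of processes (illustrative) */

int decided = 0;

/* Rewritten form of
 *     while (1) { recv(...); if (mboxsize > N/2) break; }
 *     decide();
 * The loop is replaced by a havoc of the mailbox (here: an arbitrary
 * size passed in by the caller). Since the loop had no timeout exit,
 * the code that followed it is wrapped in an if over the loop's
 * exit condition. */
void round_body(int havoc_mboxsize) {
    int mboxsize = havoc_mboxsize;   /* havoc: arbitrary value */
    if (mboxsize > N / 2) {
        decided = 1;   /* code that followed the reception loop */
    }
}
```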
Rewriting protocols with incremental synchronization tags.
Let be a protocol consisting of only one loop annotated with an incremental CompHO synchronization tag . The rewriting in this section builds an (intra-procedural) CompHO protocol in two steps: (1) each iteration of 's loop defines a phase, and (2) the code of each phase (the loop's body) is decomposed into rounds.
Phases are matched with loop iterations if (representing the phase number) is increased exactly once in each iteration to its successor value (like the loop counter). To this end, we assume the protocol is verified for strengthened annotations regarding tag monotonicity: the relation is an invariant of the loop, where is the previous value of . If has initial value 1, then the phase number matches the iteration number; otherwise it is shifted by a bounded value. The communication closure induced by the tags (see Theorem 6.2) ensures that two processes communicate only when they are in the same iteration. Hence, it is sound to construct a phase by composing the th loop iteration of all processes. Within a phase it remains to locate the round boundaries.
A CompHO synchronization tag ensures that the round variable takes a bounded number of values; in the running example these values are NewBallot and AckBallot. Round boundaries are defined by the beginning/end of a loop iteration and the assignments to the round variable .
Processes can have different behaviors in the same round, depending on their local state and the messages received, although they execute the same code and go through the same sequence of rounds. For example, in the round NewBallot only the processes designated as coordinators by the oracle send a message. Similarly, in the AckBallot round only the processes that received a message in this round update their logs. As usual, these different behaviors are captured by branching instructions in the loop's body, and each path in the loop's body identifies a possible process behavior in the sequence of rounds.
For each value of , to compute the code of round , we consider each path in the control flow graph of the loop's body and identify (1) a block of instructions (possibly empty) : a sequence of instructions in that starts with and ends with the instructions preceding the next assignment to ; (2) the context under which each block is executed, that is, a condition that is the conjunction of all the branches leading to on the path . The is the sequential composition of all if () with path in the control flow graph. Fig. 9 shows the two blocks defining round NewBallot, corresponding to the leader and follower paths in the control flow graph.
To maintain the context in which a sequence of instructions is executed, i.e., , we introduce auxiliary variables. For each variable x in a conditional we introduce an auxiliary variable (of the same type as ) that is assigned only once to x, i.e., , before the condition is evaluated. The conditionals are defined over these auxiliary variables. If the variable is not read without first being assigned a default value, we can abstract it to a boolean. This is the case in all our benchmarks, where auxiliary variables remember values of the mailbox in previous rounds.
Moreover, if the round variables do not take all values in their domain, each condition is conjoined with a check whether the round number of the CompHO protocol equals the round tag variable. Intuitively, if the check fails, the asynchronous code has set the round tag to a future round (of the same phase), which results in skipping the CompHO round.
Finally, the code of every round, that is, , is split into a block consisting of all send statements guarded by the conditionals preceding them, and an block that contains the rest of the code in except the mbox's havoc. We assume contains no send, no recv, and no assignments to message type variables; otherwise the rewriting aborts. (One could try compiler optimization techniques to reorder instructions towards the imposed order.)
The rewriting eliminates the phase and round tag variables from the local process variables (if no rounds are skipped, all program locations are tagged with the same variables). Reads of these variables are replaced with reads of the round, respectively phase, number of the CompHO protocol.
For example, the asynchronous protocol in Fig. 1 in Sec. 2 is rewritten into an intra-procedural CompHO one, given in Fig. 3 in Sec. 2. However, Fig. 3 contains a simplification w.r.t. what is automatically generated: a block of code appears twice, once if the process is the leader and once if it is not. The same holds for the first round.
An asynchronous protocol is structured if its reception loops are in the form defined in Sec. 7 and the blocks associated with a round are a sequence of sends followed by update statements.
Given a structured asynchronous protocol consisting of only one loop that is annotated with a strictly incremental CompHO synchronization tag of size two, , make-CompHO builds an intra-procedural CompHO protocol whose executions are indistinguishable from the executions of . The resulting protocol has only one phase, which consists of as many rounds as the domain of evaluation of the round_tag. It sends exactly the same messages as .
Jumping over phases.
The catch-up mechanism allows processes to receive messages from future rounds, which leads to a jump to the received phase number. Moreover, in general non-incremental tags allow processes to skip to future tags (which may happen, e.g., if the leader of the current phase is suspected to have crashed). In this section, we reduce the problem of rewriting a protocol with non-incremental tags to the rewriting of a protocol with incremental tags.
In Sec. 7 the loop counter coincides (modulo an initial shift) with the phase tag. Jumping over phases potentially increases the phase tag by more than one, “desynchronizing” it from the loop counter. To apply the rewriting from Sec. 7, we introduce empty loop iterations when the loop counter is smaller than the phase tag, and we reinterpret the initial increasing tags over the new loop counter, resulting in an incremental tag annotation.
First we identify the jumping control locations. These are locations where the phase tag (1) is assigned a value that depends on the mailbox, and (2) the communication closure checks show that a message tag in the mailbox may be strictly greater than the local tag; cf. Section 5.1. In this case the tool partitions the path containing the jumping instruction into three sequences of instructions: Before_Jump, Jump, and After_Jump. Since jumps are conditional, we have to capture the case without a jump, where Before_Jump and After_Jump are both part of one round, and the case with a jump, where Before_Jump and After_Jump are parts of the code of different rounds. The rewriting encodes these cases with an auxiliary boolean variable that non-deterministically flags a jump, and a continue statement before the jumping instruction.
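A sketch of this encoding follows; the variable names are hypothetical, and the nondeterministic jump flag is modeled as a plain global that a test driver can set.

```c
#include <assert.h>

/* Nondeterministic oracle flagging a jump (stubbed for the sketch). */
int jump_flag;

int phase;               /* loop counter / phase tag              */
int mbox_max_phase;      /* maximal phase tag seen in the mailbox */
int executed_after_jump; /* records in which phase After_Jump ran */

/* One loop iteration with a conditional jump. When the flag is set
 * and the mailbox holds a strictly greater phase tag, the iteration
 * ends early (the 'continue') after advancing the phase tag to the
 * received phase; After_Jump then belongs to a later round. */
void iterate(void) {
    /* Before_Jump: empty in this sketch */
    if (jump_flag && mbox_max_phase > phase) {
        phase = mbox_max_phase;   /* the jump */
        return;                   /* encodes the 'continue' */
    }
    /* After_Jump */
    executed_after_jump = phase;
    phase = phase + 1;
}
```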
In all examples we explored, Before_Jump is either empty or consists of only one send instruction. Both cases are simple and correspond to no code being executed in the CompHO semantics, which is naturally captured by empty HO sets (in case there are send instructions, the messages can always be lost).
Protocols with nested loops.
Let us consider a protocol without reception loops. The rewriting algorithm proceeds bottom-up: it starts by rewriting the innermost loop using the procedure above. For each outer loop, it first replaces the nested loop with a call to the computed CompHO protocol and then applies the same rewriting procedure. Since we consider call-by-value procedure calls in the CompHO semantics, all local variables are input parameters.
Inner loops appearing on different branches may belong to the same sub-protocol; in other words, these different loops exchange messages. If associates different synchronization variables to different loops, then the rewriting builds one (sub-)protocol for each loop. Otherwise, the rewriting merges the loops tagged with the same synchronization variables into one CompHO protocol.
To soundly merge several loops into the same CompHO protocol, the rewrite algorithm identifies the context in which the inner loop is executed.
Given a structured asynchronous program with a CompHO synchronization tagging function , make-CompHO applied to it returns an inter-procedural CompHO protocol whose executions are indistinguishable from the executions of .
8 Experimental results
We implemented the rewriting procedure in a prototype tool and applied it to several fault-tolerant distributed protocols. The tool is available online at https://github.com/alexandrumc/async-to-sync-translation. Fig. 11 summarizes our experimental results.
Verification of synchronization tags.
The tool takes as input protocols in a C embedding of the language from Sec. 3. We use a C embedding to be able to use Verifast [verifast] for checking the conditions in Sec. 5.1, i.e., the communication closure of an asynchronous protocol. Verifast is a deductive verification tool for sequential programs based on separation logic. The C embedding uses the prototypes of the functions send and receive (we assume their semantics is the one in Sec. 3).
The user specifies the synchronization tag in a configuration file by (i) defining the number of (nested) protocols in the input file, (ii) for each protocol, the phase and round variables, and (iii) for each message type, the fields that encode the timestamp, i.e., the phase and round number. Fig. 11 gives the names of the phase and round variables of the published protocols we use as benchmarks.
The tool expects the input file to be annotated with assert statements for checking the conditions in Definition 2 w.r.t. the tags given in the configuration file, together with the auxiliary annotations Verifast needs to prove these asserts (inductive invariants). The annotations are defined over program variables and auxiliary history variables; the history variables are needed to encode the monotonicity of the tags. The tool calls Verifast and checks that the input contains assert tag_leq(oldballot, oldlabel, ballot, label) after each (pair of) assignment(s) to the phase and round variables. Observe that Conditions 1–3 in Definition 2 are numeric constraints over the phase and round variables, and several other tools could verify them. However, Condition 4 requires reasoning about the content of the mailbox, a potentially unbounded data structure. Here we used the strength of Verifast in reasoning about dynamically allocated data structures. The size of the program annotated with the proofs for the asserts is given in LoC in the column “Annot.” in Fig. 11. If all checks pass, the rewriting proceeds; otherwise the tool outputs a warning.
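A sketch of what the tag_leq check amounts to, assuming the lexicographic order over (phase, round) pairs; the definition of tag_leq below is our reconstruction for illustration, not the tool's actual code.

```c
#include <assert.h>

/* Lexicographic "less than or equal" over (phase, round) tags, as
 * assumed for the tag_leq assert the tool requires after each
 * (pair of) assignment(s) to the phase and round variables. */
int tag_leq(int old_phase, int old_round, int phase, int round) {
    return old_phase < phase ||
          (old_phase == phase && old_round <= round);
}

/* Phase and round variables, named as in the running example. */
int ballot = 0, label = 0;

/* A (pair of) assignment(s) followed by the monotonicity assert;
 * oldballot/oldlabel play the role of the history variables. */
void advance(int new_ballot, int new_label) {
    int oldballot = ballot, oldlabel = label;
    ballot = new_ballot;
    label  = new_label;
    assert(tag_leq(oldballot, oldlabel, ballot, label));
}
```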
While verifying the synchronization tags can be done for any annotated asynchronous protocol, the rewriting tool checks whether the asynchronous protocol is in a specific form and only then translates it into CompHO. While in theory this is a restriction, the benchmarks in Fig. 11 show that well-known algorithms are rewritten by our tool. For instance, the algorithm [ChandraT96, Fig. 6] solves consensus using an eventually strong failure detector. The algorithm jumps over rounds in a specific way: if a special decision message is received, a process jumps forward to a decision round and outputs the decision value. The resulting algorithm is much like Last Voting in [Charron-BostS09, Fig. 5]. ViewChange is a leader election algorithm similar to the one in ViewStamped: unlike in the running example (unlike Paxos), in ViewChange processes first agree to change the current leader, and then agree on a leader. The phase number is the view number (as in Paxos), that is, two processes either agree on the identity of a leader in a view or they know of no leader. Normal-Op is the sub-protocol used in ViewStamped to implement the broadcasting of new commands by a stable leader. Multi-Paxos is described in Sec. 4. It is Paxos from [generalizedpaxos] over sequences, without fast paths, where the classic path is repeated as long as the leader is stable. In Paxos parlance, the tags for leader election (the outer protocol) are Phase1a, Phase1b, Phase1aStart (in this order). The rounds of the sub-protocol are called Phase2aClassic, Phase2bClassic, and learn. We consider that acceptors and leaders also play the role of learners.
Our tool has rewritten the protocols from Fig. 11. The implementation uses pycparser [pycparser], a parser for the C language written in pure Python, to obtain the abstract syntax tree of the input protocol. The last two columns of Fig. 11 give the size in LoC of the asynchronous protocol without annotations and the size of its synchronous counterpart computed by the rewriting procedure from Sec. 7.
We have verified the safety specification (agreement) of the CompHO counterparts of the running example (Figure 1), Normal-Op, and Multi-Paxos, by deductive verification using the Consensus Logic (CL for short) defined in [vmcai]. To this end, we encoded the specification and the transition relation in CL, and used CL's semi-decision procedure for satisfiability [psynctool] to discharge the verification conditions. For Multi-Paxos we did a modular proof. First, we prove the correctness of the sub-protocol (executed in case of a stable leader). Its specification is that the logs of all processes that execute the sub-protocol are equal at the beginning and at the end of each phase (after an iteration of Prepare, PrepareOk, Commit), knowing that processes start the sub-protocol with equal logs. Moreover, the sub-protocol preserves the invariant that a majority of processes have the same prefix, consisting of all the committed commands. Then we prove the leader election outer loop correct. Its specification states that there is at most one leader in a ballot (as in (1)) and that a majority of processes have the same prefix, consisting of all the committed commands. The leader picks the longest log of its followers. The fact that all committed values are logged by a majority of processes ensures that the new log proposed by the leader does not lose any committed commands. However, there are no guarantees for the uncommitted commands.
9 Related work
Our goal is to link synchronous or round-based models to asynchronous models via the notion of communication closure [EF82]. Exploiting this for better design and simpler paper-and-pencil proofs was considered, e.g., in [DBLP:journals/siamcomp/MosesR02, ChouG88, EngelhardtM05a]. Several round-based computational models are based on this idea [DLS88:jacm, Lyn96:book, Charron-BostS09, Gafni98, SW89:stacs]. Commonly the underlying idea is to design an algorithm for the round-based setting and deduce results for the asynchronous one. A method that takes round-based code as input and generates asynchronous code was given in [DragoiHZ16]. However, for efficiency reasons, designers often prefer to work with asynchronous code. Therefore, in this paper we start from asynchronous protocols and compute the round-based canonic form as a design artifact and for verification purposes. From the canonic form one can choose one of the existing automated verification methods for round-based distributed algorithms [DBLP:conf/srds/TsuchiyaS07, DBLP:conf/wdag/TsuchiyaS08, Charron-BostDM11, DBLP:journals/afp/DebratM12, vmcai, GleissenthallBR16, MaricSB17, AminofRSWZ18].
There are several other frameworks for the verification of asynchronous distributed algorithms, e.g., Verdi [DBLP:conf/pldi/WilcoxWPTWEA15], IronFleet [HawblitzelHKLPR17], ByMC [KLVW17:POPL], Ivy [PadonMPSS16], and Disel [SergeyWT18]. Very interesting distributed algorithms have been verified in these frameworks. Still, they require considerable expertise, either in manually fitting asynchronous code into the fragment that can be handled by the method or in guiding interactive theorem provers. Typically, these works also consider verification of specific algorithms, which makes it hard to generalize the ideas.
Our research belongs to an effort to develop techniques for automated reduction to synchronized executions. Three concurrent approaches in this quest are the exciting results in [BouajjaniEJQ18], [KraglQH18], and [DBLP:journals/pacmpl/BakstGKJ17, GGB19]. Compared to their work, our approach is less guided by specific communication patterns of existing systems. Rather, we put communication closure at the center of our considerations. Hence, we are more permissive regarding communication structures. For instance, the recent paper [GGB19] does not allow skipping rounds, with the side effect that it cannot model that a process remains leader for several consecutive iterations, which is an important efficiency mechanism in systems that implement ideas from Paxos [generalizedpaxos] and Viewstamped Replication [OkiL88]. The notion of k-synchronizability in [BouajjaniEJQ18] is restricted to FIFO communication channels. In contrast, our method does not make any assumptions about the communication model between processes (it works for UDP and TCP/IP). Moreover, unbounded jumps over phases cannot be captured by k-synchronizability. The method in [KraglQH18] adapts verification methods for remote procedure calls to leader/follower communication. As a result, they support neither rounds with all-to-all communication nor a leader that also plays the role of a follower; both occur in our running example.
10 Conclusion and future work
We formalized the notion of communication closure of asynchronous protocols and showed that several challenging benchmarks satisfy this property. We showed that communication closure formally captures the intuition of protocol designers and is an enabler for a synchronous canonic form of asynchronous protocols. This canonic form enables the use of different verification techniques, and we verified several benchmarks using the Consensus Logic framework [vmcai].
We consider the verification of synchronous round-based protocols an orthogonal problem; however, progress in this research area has a direct impact on the verification of asynchronous protocols that are communication-closed. Roughly, the main difficulty in automating reasoning about synchronous systems comes from the data they manipulate and not from their control structure.
Our method preserves relevant safety and liveness properties. Reasoning about liveness in CompHO requires assumptions about the sets, which can be expressed in Consensus Logic. However, besides initial theoretical results [SHQ18], the connection between the sets and the asynchronous world is not yet formally well understood. Thus, there is no automated method that translates asynchronous reception loops with timeouts, etc., into sets. This would be required for total correctness regarding liveness and is subject to future work.