## 1 Introduction

Consensus algorithms are an essential component of the modern fault-tolerant
deterministic services implemented as message-passing distributed systems.
In such systems, each of the distributed nodes contains a replica of the
system’s state (*e.g.*, a database to be accessed by the system’s clients), and
certain nodes may propose values for the next state of the system (*e.g.*, requesting an update in the database). Since any node can crash at any moment, all the replicas have to keep copies of the state that are consistent with each other. To achieve this, at each update to the system, all the non-crashed nodes run an instance of a *consensus protocol*, uniformly deciding on its outcome. The safety requirements for consensus can thus be stated as follows: “only a single value is decided uniformly by all non-crashed nodes, it never changes in the future, and the decided value has been proposed by some node participating in the protocol” [16].

The Paxos algorithm [15, 16] is the classic consensus
protocol, and its single-decree version (SD-Paxos for short) allows a set of
distributed nodes to reach an agreement on the outcome of a *single*
update.
Optimisations and modifications to SD-Paxos are common. For instance, the
multi-decree version, often called Multi-Paxos
[15, 27], considers multiple slots
(*i.e.*, multiple positioned updates) and decides upon a result for *each*
slot, by running a slot-specific instance of an SD-Paxos.
Even though it is customary to think of Multi-Paxos as a series of independent
SD-Paxos instances, in reality the implementation features multiple
protocol-aware optimisations, exploiting intrinsic dependencies between
separate single-decree consensus instances to achieve better throughput. To a
great extent, these and other optimisations to the algorithm are pervasive,
and verifying a modified version usually requires devising a new protocol
definition and a proof from scratch.
New versions are constantly springing up (cf. Section 5 of
[27] for a comprehensive survey), widening the
gap between the description of the algorithms and their real-world
implementations.

We tackle the challenge of *specifying* and *verifying* these
distributed algorithms by contributing two verification techniques for
consensus protocols.

Our first contribution is a family of composable specifications for Paxos’
core subroutines.
Our starting point is the deconstruction of SD-Paxos by
Boichat *et al.* [2, 3],
allowing one to consider a distributed consensus instance as a
*shared-memory concurrent program*.
We introduce novel specifications for Boichat *et al.*’s modules, and let them be
non-deterministic.
This might seem an unorthodox design choice, as it *weakens* the
specification. To show that our specifications are still *strong enough*,
we restore the top-level *deterministic* abstract specification of the
consensus, which is convenient for client-side reasoning.
The weakness introduced by the non-determinism in the specifications
has been impelled by the need to prove
that the implementations of Paxos’ components *refine* the specifications
we have ascribed [9]. We prove the refinements modularly via the
Rely/Guarantee reasoning with prophecy variables and explicit linearisation
points [26, 11].
On the other hand, this weakness becomes a virtue when better understanding
the volatile nature of Boichat *et al.*’s abstractions and of the Paxos
algorithm, which may lead to newer modifications and optimisations.

Our second contribution is a methodology for verifying composite consensus
protocols by reusing the proofs of their constituents, targeting specifically
Multi-Paxos.
We do so by distilling protocol-aware system optimisations into a separate
semantic layer and showing how to obtain the realistic Multi-Paxos implementation
from SD-Paxos by a *series of transformations* to the *network
semantics* of the system, as long as these transformations preserve the
behaviour observed by clients.
We then provide a family of such transformations along with the formal
conditions allowing one to compose them in a behaviour-preserving way.

We validate our approach for construction of modularly verified consensus protocols by providing an executable proof-of-concept implementation of Multi-Paxos with a high-level shared memory-like interface, obtained via a series of behaviour-preserving network transformations. The full proofs of lemmas and theorems from our development, as well as some boilerplate definitions, are given in the appendices at the end of the paper.

## 2 The Single-Decree Paxos Algorithm

We start with explaining SD-Paxos through an intuitive scenario. In SD-Paxos,
each node in the system can adopt the roles of *proposer* or
*acceptor*, or both. A value is decided when a *quorum* (*i.e.*, a
majority of acceptors) accepts the value proposed by some proposer. Now
consider a system with three nodes N1, N2 and N3, where N1 and N3 are both
proposers and acceptors, and N2 is an acceptor, and assume N1 and N3 propose
values $v_1$ and $v_3$, respectively.

The algorithm works in two phases. In Phase 1, a proposer polls every acceptor in the system and tries to convince a quorum to promise that they will later accept its value. If the proposer succeeds in Phase 1 then it moves to Phase 2, where it requests the acceptors to fulfil their promises in order to get its value decided. In our example, it would seem in principle possible that N1 and N3 could respectively convince two different quorums (one consisting of N1 and N2, and the other consisting of N2 and N3) to go through both phases and to respectively accept their values. This would happen if the communication between N1 and N3 gets lost and if N2 successively grants the promise and accepts the value of N1, and then does the same with N3. This scenario breaks the safety requirements for consensus because both $v_1$ and $v_3$, which can be different, would get decided. However, this cannot happen. Let us explain why.

The way SD-Paxos enforces the safety requirements is by distinguishing each
attempt to decide a value with a unique *round*, where the rounds are
totally ordered. Each acceptor stores its current round, initially the least
one, and only grants a promise to proposers with a round greater than or equal to
its current round, at which moment the acceptor switches to the proposer’s
round. Figure 1 depicts a possible run of the
algorithm. Assume that rounds are natural numbers, that the acceptors’ current
rounds are initially $0$, and that the nodes N1 and N3 attempt to decide their
values with rounds $1$ and $3$ respectively. In Phase 1, N1 tries to convince
a quorum to switch their current round to $1$ (messages P1A(1)). The
message to N3 gets lost and the quorum consisting of N1 and N2 switches round
and promises to only accept values at a round greater than or equal to $1$. Each
acceptor that switches to the proposer’s round sends back to the proposer its
stored value and the round at which this value was accepted, or an undefined
value if the acceptor never accepted any value yet (messages
P1B(ok, $\bot$, 0), where $\bot$ denotes a default undefined value).
After Phase 1, N1 picks as a candidate value the one accepted at the greatest
round from those returned by the acceptors in the quorum, or its proposed
value if all acceptors returned an undefined value. In our case, N1 picks its
value $v_1$. In Phase 2, N1 requests the acceptors to accept the candidate
value $v_1$ at round $1$ (messages P2A($v_1$, 1)). The message to N3
gets lost, and N1 and N2 accept value $v_1$, which gets decided (messages
P2B(ok)).

Now N3 goes through Phase 1 with round $3$ (messages P1A(3)). Both N2
and N3 switch to round $3$. N2 answers N3 with its stored value $v_1$ and with
the round $1$ at which $v_1$ was accepted (message
P1B(ok, $v_1$, 1)), and N3 answers itself with an undefined value,
as it has never accepted any value yet (message
P1B(ok, $\bot$, 0)). This way, if some value has been already
decided upon, *any* proposer that convinces a quorum to switch to its
round would receive the decided value from some of the acceptors in the quorum
(recall that two quorums have a non-empty intersection). That is, N3 picks the
$v_1$ returned by N2 as the candidate value, and in Phase 2 it manages to get
the quorum N2 and N3 to accept $v_1$ at round $3$ (messages
P2A($v_1$, 3) and P2B(ok)). N3 succeeds in making a new
decision, but the decided value remains the same, and, therefore, the safety
requirements of a consensus protocol are satisfied.
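The run described above can be replayed in a few lines of Python. This is only a minimal sketch under simplifying assumptions: message passing is replaced by direct method calls, a lost message is modelled by leaving the acceptor out of the quorum, and the names `Acceptor`, `promise`, `accept` and `propose` are ours, not taken from the paper's figures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    current_round: int = 0          # initially the least round
    value: Optional[str] = None     # None plays the role of the undefined value
    accepted_round: int = 0         # round at which `value` was accepted

    def promise(self, rnd):
        """Phase 1: grant a promise only if rnd >= the current round."""
        if rnd < self.current_round:
            return None
        self.current_round = rnd
        return (self.value, self.accepted_round)

    def accept(self, rnd, value):
        """Phase 2: accept the value unless a greater round was promised."""
        if rnd < self.current_round:
            return False
        self.current_round = rnd
        self.value, self.accepted_round = value, rnd
        return True


def propose(quorum, rnd, own_value):
    """Run both phases against `quorum`; return the decided value or None."""
    replies = [a.promise(rnd) for a in quorum]
    if any(r is None for r in replies):
        return None
    # Candidate: the value accepted at the greatest round, else our own value.
    value, _ = max(replies, key=lambda r: r[1])
    candidate = own_value if value is None else value
    if all([a.accept(rnd, candidate) for a in quorum]):
        return candidate
    return None


n1, n2, n3 = Acceptor(), Acceptor(), Acceptor()
first = propose([n1, n2], 1, "v1")    # N1 with round 1; the messages to N3 are lost
second = propose([n2, n3], 3, "v3")   # N3 with round 3; the quorum is N2 and N3
print(first, second)
```

Both calls return the same value because N2 belongs to both quorums and reports the value it accepted at round 1, so N3 adopts it instead of its own proposal.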

## 3 The Faithful Deconstruction of SD-Paxos

We now recall the faithful deconstruction of SD-Paxos in [2, 3], which we take as the reference architecture for the implementations that we aim to verify. We later show how each module of the deconstruction can be verified separately.

The deconstruction is depicted on the left of Figure 2,
which consists of modules *Paxos*, *Round-Based Consensus* and
*Round-Based Register*. These modules correspond to the ones in Figure 4
of [2], with the exception of *Weak Leader Election*. We assume
that a correct process that is trusted by every other correct process always
exists, and omit the details of the leader election. Leaders take the role of
proposers and invoke the interface of *Paxos*. Each module uses the
interface provided by the module below it.

The entry module *Paxos* implements SD-Paxos. Its specification (right of
Figure 2) keeps a variable vP that stores the
decided value (initially undefined) and provides the operation
proposeP that takes a proposed value v0 and returns
vP if some value was already decided, or otherwise it returns
v0. The code of the operation runs *atomically*, which we
emphasise via angle brackets $\langle \ldots \rangle$. We define this specification
so that it meets the safety requirements of consensus; therefore, any
implementation whose entry point refines this specification will have to meet
the same safety requirements.
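As an informal illustration, the atomic specification can be mimicked with a lock-protected object. This is a sketch rather than the specification of Figure 2 itself: the class name `PaxosSpec` is ours, a `threading.Lock` stands in for the angle brackets, and we assume that the first proposed value is the one stored in vP.

```python
import threading

UNDEF = None  # stands in for the literal undef


class PaxosSpec:
    def __init__(self):
        self._vP = UNDEF                 # decided value, initially undefined
        self._lock = threading.Lock()    # models the atomic block

    def proposeP(self, v0):
        with self._lock:
            if self._vP is UNDEF:
                # Assumption: the first proposal becomes the decided value.
                self._vP = v0
            return self._vP
```

Any interleaving of calls to `proposeP` returns one and the same value, which is the value of some caller, matching the three safety requirements quoted in the introduction.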

In this work we present both specifications and implementations in pseudo-code for an imperative WHILE-like language with basic arithmetic and primitive types, where val is some user-defined type for the values decided by Paxos, and undef is a literal that denotes an undefined value. The pseudo-code is self-explanatory and we refrain from giving formal semantics to it, which could be done in standard fashion if so wished [30]. At any rate, the pseudo-code is ultimately a vehicle for illustration and we stick to this informal presentation.

The implementation of the modules is depicted in
Figures 3–5. We
describe the modules following a bottom-up approach, which better fits the
purpose of conveying the connection between the deconstruction and
SD-Paxos. We start with module *Round-Based Register*, which offers
operations read and write
(Figure 3) and implements the
replicated processes that adopt the role of acceptors
(Figure 4). We adapt the
wait-free, crash-stop implementation of *Round-Based Register* in
Figure 5 of [2] by adding loops for the explicit reception of each
individual message and by counting acknowledgement messages one by
one. Processes are identified by integers from $1$ to $n$, where $n$ is the
number of processes in the system. Proposers and acceptors exchange read and
write requests, and their corresponding acknowledgements and
non-acknowledgements. We assume a type msg for messages and let the
message vocabulary be as follows. Read requests [RE, k] carry the
proposer’s round k. Write requests [WR, k, v] carry the
proposer’s round k and the proposed value v. Read
acknowledgements [ackRE, k, v, k’] carry the proposer’s round
k, the acceptor’s value v, and the round k’ at
which v was accepted. Read non-acknowledgements [nackRE, k]
carry the proposer’s round k, as do write acknowledgements
[ackWR, k] and write non-acknowledgements [nackWR, k].

In the pseudo-code, we use _ for a wildcard that could take any literal value. In the pattern-matching primitives, the literals specify the pattern against which an expression is being matched, and operator @ turns a variable into a literal with the variable’s value. Compare the case [ackRE, @k, v, kW]: in Figure 3, where the value of k specifies the pattern and v and kW get some values assigned, with the case [RE, k]: in Figure 4, where k gets some value assigned.

We assume the network ensures that messages are neither created, modified,
deleted, nor duplicated, and that they are always delivered but with an
arbitrarily large transmission delay. (We allow creation and
duplication of [RE, k] messages in
Section 5, where we obtain Multi-Paxos from SD-Paxos
by a series of transformations of the network semantics.) Primitive send takes the destination j and the message
m, and its effect is to send m from the current process to
the process j. Primitive receive takes no arguments, and its
effect is to receive at the current process a message m from origin
i, after which it delivers the pair (i, m) of identifier
and message. We assume that send is non-blocking and that
receive blocks and suspends the process until a message is available,
in which case the process awakens and resumes execution.

Each acceptor (Figure 4) keeps a
value v, a current round r (called the *read round*),
and the round w at which the acceptor’s value was last accepted
(called the *write round*). Initially, v is undef and
both r and w are $0$.

Phase 1 of SD-Paxos is implemented by operation read on the left of Figure 3. When a proposer issues a read, the operation requests each acceptor’s promise to only accept values at a round greater than or equal to k by sending [RE, k] (lines 4–5). When an acceptor receives a [RE, k] (lines 5–7 of Figure 4) it acknowledges the promise depending on its read round. If k is strictly less than r then the acceptor has already made a promise to another proposer with a greater round and it sends [nackRE, k] back (line 8). Otherwise, the acceptor updates r to k and acknowledges by sending [ackRE, k, v, w] (line 9). When the proposer receives an acknowledgement (lines 8–10 of Figure 3) it counts acknowledgements up (line 10) and calculates the greatest write round at which the acceptors acknowledging so far accepted a value, and stores this value in maxV (lines 11–12). If a majority of acceptors acknowledged, the operation succeeds and returns (true, maxV) (lines 15–16). Otherwise, if the proposer received some [nackRE, k] the operation fails, returning (false, _) (lines 13–14).

Phase 2 of SD-Paxos is implemented by operation write on the right of
Figure 3. After having collected
promises from a majority of acceptors, the proposer picks the candidate value
vW and issues a write. The operation requests each acceptor
to accept the candidate value by sending [WR, k, vW]
(lines 20–21). When an acceptor receives [WR, k, vW] (line 10 of
Figure 4) it accepts the value
depending on its read round. If k is strictly less than r,
then the acceptor never promised to accept at such round and it sends
[nackWR, k] back (line 11). Otherwise, the acceptor fulfils its
promise and updates both w and r to k and assigns
vW to its value v, and acknowledges by sending
[ackWR, k] (line 12). Finally, when the proposer receives an
acknowledgement (lines 23–25 of
Figure 3) it counts
acknowledgements up (line 26) and checks whether a majority of acceptors
acknowledged, in which case vW is decided and the operation succeeds
and returns true (lines 29–30). Otherwise, if the proposer received
some [nackWR, k] the operation fails and returns false
(lines 27–28). (For the implementation to be correct with our
shared-memory-concurrency approach, the update of the data in acceptors must
happen atomically with the sending of acknowledgements in lines 9 and 12 of
Figure 4.)
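The acceptor's reaction to read and write requests, as described in the last two paragraphs, can be sketched as a message handler in Python. This is an illustration under assumptions, not the code of Figure 4: messages are modelled as tuples, the reply is returned instead of being sent over a network, and the names `Acceptor` and `handle` are hypothetical.

```python
UNDEF = None  # stands in for the literal undef


class Acceptor:
    def __init__(self):
        self.v = UNDEF   # accepted value
        self.r = 0       # read round
        self.w = 0       # write round

    def handle(self, msg):
        """Process one request and return the reply for the proposer."""
        tag, k = msg[0], msg[1]
        if tag == "RE":
            if k < self.r:
                return ("nackRE", k)      # promised to a greater round already
            self.r = k
            return ("ackRE", k, self.v, self.w)
        if tag == "WR":
            vW = msg[2]
            if k < self.r:
                return ("nackWR", k)      # never promised to accept at round k
            self.r = self.w = k
            self.v = vW
            return ("ackWR", k)
        raise ValueError(f"unknown message tag: {tag!r}")
```

Because `handle` updates the state and produces the reply in a single call, the update and the acknowledgement are atomic by construction in this sketch, which is the property that the remark on lines 9 and 12 of Figure 4 asks for.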

Next, we describe module *Round-Based Consensus* on the left of
Figure 5. The module offers an
operation proposeRC that takes a round k and a proposed
value v0, and returns a pair (res, v) of Boolean and value,
where res informs of the success of the operation and v is
the decided value in case res is true. We have taken the
implementation from Figure 6 in [2] but adapted to our pseudo-code
conventions. *Round-Based Consensus* carries out Phase 1 and Phase 2 of
SD-Paxos as explained in Section 2. The operation
proposeRC calls read (line 3) and if it succeeds then
chooses a candidate value between the proposed value v0 or the value
v returned by read (line 5). Then, the operation calls
write with the candidate value and returns (true, v) if
write succeeds, or fails and returns (false, _)
(line 8) if either the read or the write fails.

Finally, the entry module *Paxos* on the right of
Figure 5 offers an operation
proposeP that takes a proposed value v0 and returns the
decided value. We assume that the system primitive pid() returns the
process identifier of the current process. We have come up with this
straightforward implementation of operation proposeP, which calls
proposeRC with increasing round until the call succeeds, starting at
a round equal to the process identifier pid() and increasing it by
the number of processes in each iteration. This guarantees that the round
used in each invocation to proposeRC is unique.
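The round-selection scheme can be sketched as follows. This is a hedged illustration, assuming that process identifiers range over 1..n; the helper `rounds` and the explicit `proposeRC` callback parameter are ours and only serve to make the sketch self-contained.

```python
from itertools import count, islice


def rounds(pid, n):
    """Rounds used by process `pid`: pid, pid + n, pid + 2n, ..."""
    return count(start=pid, step=n)


def proposeP(pid, n, v0, proposeRC):
    """Retry proposeRC with increasing rounds until one call succeeds."""
    for k in rounds(pid, n):
        res, v = proposeRC(k, v0)
        if res:
            return v


# With n = 3, the first four rounds of each process never collide.
n = 3
used = {pid: list(islice(rounds(pid, n), 4)) for pid in range(1, n + 1)}
print(used)
```

Two rounds of different processes differ modulo n, and two rounds of the same process differ by a positive multiple of n, so no round is ever reused.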

#### 3.0.1 The Challenge of Verifying the Deconstruction of Paxos.

Verifying each module of the deconstruction separately is cumbersome because of the distributed character of the algorithm and the nature of a linearisation proof. A process may not be aware of the information that will flow from itself to other processes, but this future information flow may dictate whether some operation has to be linearised at the present. Figure 6 illustrates this challenge.

Let N1, N2 and N3 adopt both the roles of acceptors and proposers, which
propose values $v_1$, $v_2$ and $v_3$ with rounds $1$, $2$ and $3$
respectively. Consider the history on the top of the figure. N2 issues a read
with round $2$ and gets acknowledgements from all but one of the acceptors in a
quorum. (Let us call this one acceptor A.) None of these acceptors have
accepted anything yet and they all return $\bot$ as the last accepted value at
round $0$. In parallel, N3 issues a read with round $3$ (third line in the
figure) and gets acknowledgements from a quorum in which A does not
occur. This read succeeds as well and returns (true, undef). Then N3
issues a write with round $3$ and value $v_3$. Again, it gets acknowledgements
from a quorum in which A does not occur, and the write succeeds deciding value
$v_3$ and returns true. Later on, and in real time order with the
write by N3 but in parallel with the read by N2, node N1 issues a write with
round $1$ and value $v_1$ (first line in the figure). This write is to fail
because the value $v_3$ was already decided with round $3$. However, the write
manages to “contaminate” acceptor A with value $v_1$, which now acknowledges
N2 and sends $v_1$ as its last accepted value at round $1$. Now N2 has gotten
acknowledgements from a quorum, and since the other acceptors in the quorum
returned $0$ as the round of their last accepted value, the read will catch
value $v_1$ accepted at round $1$, and the operation succeeds and returns
(true, $v_1$). This history linearises by moving N2’s read after
N1’s write, and by respecting the real time order for the rest of the
operations. (The linearisation ought to respect the information flow order
between N1 and N2 as well, *i.e.*, N1 contaminates A with value $v_1$, which is
read by N2.)

In the figure, a segment ending in an $\times$ indicates that the operation fails. The value returned by a successful read operation is depicted below the end of the segment. The linearisation points are depicted with a thick vertical line, and the dashed arrow indicates that two operations are in the information flow order.

The variation of this scenario on the bottom of Figure 6 is also possible, where N1’s write and N2’s read happen concurrently, but where N2’s read is shifted backwards to happen before in real time order with N3’s read and write. Since N1’s write happens before N2’s read in the information flow order, then N1’s write has to inexorably linearise before N3’s operations, which are the ones that will “steal” N1’s valid round.

These examples give us three important hints for designing the specifications
of the modules. First, after a decision is committed it is *not enough*
to store only the decided value, since a posterior write may contaminate some
acceptor with a value different from the decided one. Second, a read operation
*may succeed* with some round even if by that time another operation has
already succeeded with a higher round. And third, a write with a valid round
*may fail* if its round will be “stolen” by a concurrent operation. The
non-deterministic specifications that we introduce next allow one to model
execution histories such as the ones in Figure 6.

## 4 Modularly Verifying SD-Paxos

In this section, we provide non-deterministic specifications for
*Round-Based Consensus* and *Round-Based Register* and show that
each implementation refines its specification [9].
To do so, we instrument the implementations of all the modules with
*linearisation-point* annotations and use Rely/Guarantee
reasoning [26].

This time we follow a top-down order and start with the entry module
*Paxos*.

#### 4.0.1 Module *Paxos*.

In order to prove that the implementation on the right of Figure 5 refines its specification on the right of Figure 2, we introduce the instrumented implementation in Figure 7, which uses the helping mechanism for external linearisation points of [18]. We assume that each proposer invokes proposeP with a unique proposed value. The auxiliary pending thread pool ptp[] is an array of pairs of Booleans and values of length $n$, where $n$ is the number of processes in the system. A cell ptp[$i$] containing a pair (true, $v$) signals that the process $i$ proposed value $v$ and the invocation proposeP($v$) by process $i$ awaits to be linearised. Once this invocation is linearised, the cell ptp[$i$] is updated to the pair (false, $v$). A cell ptp[$i$] containing undef signals that the process $i$ never proposed any value yet. The array abs_resP[] of Boolean single-assignment variables stores the abstract result of each proposer’s invocation. A linearisation-point annotation lin($i$) takes a process identifier $i$ and performs atomically the abstract operation invoked by proposer $i$ and assigns its result to abs_resP[$i$]. The abstract state is modelled by variable abs_vP, which corresponds to variable vP in the specification on the right of Figure 2.

One invocation of proposeP may help linearise other invocations as follows. The linearisation point is together with the invocation to proposeRC (line 6). If proposeRC committed with some value v, the instrumented implementation traverses ptp and linearises all the proposers which were proposing value v (the proposer may linearise itself in this traversal) (lines 8–9). Then, the current proposer linearises itself if its proposed value v0 is different from v (line 10), and the operation returns v (line 12). All the annotations and code in lines 6–10 are executed inside an atomic block, together with the invocation to proposeRC(k, v0).

#### 4.0.2 Module *Round-Based Consensus*.

The top of Figure 8 shows the module’s non-deterministic specification. Global variable vRC is the decided value, initially undef. Global variable roundRC is the highest round at which some value was decided, initially $0$; a global set of values valsRC (initially empty) contains values that may have been proposed by proposers. The specification is non-deterministic in that local value vD and Boolean b are unspecified, which we model by assigning random values to them. We assume that the current process is the one that owns round k, which is consistent with how rounds are assigned to each process and incremented in the code of proposeP on the right of Figure 5. If the unspecified value vD is neither in the set valsRC nor equal to v0 then the operation returns (false, _) (line 11). This models that the operation fails without contaminating any acceptor. Otherwise, the operation may contaminate some acceptor and the value vD is added to the set valsRC (line 6). Now, if the unspecified Boolean b is false, then the operation returns (false, _) (lines 7 and 10), which models that the round will be stolen by a posterior operation. Finally, the operation succeeds if k is greater than or equal to roundRC (line 7), and roundRC and vRC are updated and the operation returns (true, vRC) (lines 7–9).
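To make the non-determinism concrete, the following Python sketch turns the two unspecified choices vD and b into explicit parameters, so that each outcome described above can be exercised deterministically. It follows our reading of the prose rather than Figure 8 itself; in particular, we assume that on success vRC is set to vD, as the description of the linearisation points suggests. The class name is hypothetical.

```python
UNDEF = None  # stands in for the literal undef


class RoundBasedConsensusSpec:
    def __init__(self):
        self.vRC = UNDEF        # decided value
        self.roundRC = 0        # highest round at which a value was decided
        self.valsRC = set()     # values that may have been proposed

    def proposeRC(self, k, v0, vD, b):
        """vD and b are the unspecified choices, passed in explicitly."""
        if vD not in self.valsRC and vD != v0:
            return (False, None)            # fails without contaminating
        self.valsRC.add(vD)                 # may contaminate some acceptor
        if b and k >= self.roundRC:
            self.roundRC = k
            self.vRC = vD                   # assumption: vRC takes the value vD
            return (True, self.vRC)
        return (False, None)                # round stolen (b false) or too low
```

Passing vD and b as arguments mirrors how the linearisation-point annotations disambiguate the abstract operation: the instrumented implementation supplies the values that the specification leaves open.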

In order to prove that the implementation in Figure 5 linearises with respect to the specification on the top of Figure 8, we use the instrumented implementation on the bottom of the same figure, where the abstract state is modelled by variables abs_vRC, abs_roundRC and abs_valsRC in lines 1–2, the local single-assignment variable abs_resRC stores the result of the abstract operation, and the linearisation-point annotations linRC(vD, b) take a value and a Boolean as parameters, invoke the non-deterministic abstract operation and disambiguate it by assigning the parameters to the unspecified vD and b of the specification. There are two linearisation points together with the invocations of read (line 6) and write (line 8). If read fails, then we linearise forcing the unspecified vD to be undef (line 6), which ensures that the abstract operation fails without adding any value to abs_valsRC or updating the round abs_roundRC. Otherwise, if write succeeds with value v, then we linearise forcing the unspecified value vD and Boolean b to be v and true respectively (line 8). This ensures that the abstract operation succeeds and updates the round abs_roundRC to k and assigns v to the decided value abs_vRC. If write fails then we linearise forcing the unspecified vD and b to be v and false respectively (line 9). This ensures that the abstract operation fails.

#### 4.0.3 Module *Round-Based Register*.

Figure 9 shows the module’s non-deterministic specification. Global variable vRR represents the decided value, initially undef. Global variable roundRR represents the current round, initially $0$, and global set of values valsRR, initially containing undef, stores values that may have been proposed by some proposer. The specification is non-deterministic in that method read has unspecified local Boolean b and local value vD (we assume that vD is in valsRR), and method write has unspecified local Boolean b. We assume the current process is the one that owns the round passed to the operation.

Let us explain the specification of the read operation. The operation can succeed regardless of the proposer’s round k, depending on the value of the unspecified Boolean b. If b is true and the proposer’s round k is valid (line 8), then the read round is updated to k (line 9) and the operation returns (true, v) (line 14), where v is the read value, which coincides with the decided value if some decision was committed already or with vD otherwise. Now to the specification of operation write. The value vW is always added to the set valsRR (line 25). If the unspecified Boolean b is false (the round will be stolen by a posterior operation) or if the round k is non-valid, then the operation returns false (lines 26 and 30). Otherwise, the current round is updated to k, and the decided value vRR is updated to vW and the operation returns true (lines 27–29).
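The specification just described can be sketched in Python, again with the unspecified vD and b passed explicitly. Two points are assumptions on our part: a round k is taken to be valid when it is at least roundRR, and a read with b true but an invalid round is taken to return vD, since this is what the history at the top of Figure 6 requires. The class name and the string values used in the replay are illustrative.

```python
UNDEF = None  # stands in for the literal undef


class RoundBasedRegisterSpec:
    def __init__(self):
        self.vRR = UNDEF          # decided value
        self.roundRR = 0          # current round
        self.valsRR = {UNDEF}     # values that may have been proposed

    def read(self, k, vD, b):
        if vD not in self.valsRR:
            raise ValueError("vD must be drawn from valsRR")
        if not b:
            return (False, None)
        if k >= self.roundRR:                     # assumption: valid round
            self.roundRR = k
            v = vD if self.vRR is UNDEF else self.vRR
        else:
            v = vD                                # assumption: see lead-in
        return (True, v)

    def write(self, k, vW, b):
        self.valsRR.add(vW)                       # always recorded
        if b and k >= self.roundRR:
            self.roundRR = k
            self.vRR = vW
            return True
        return False                              # round stolen or invalid


# Replay of the linearised history at the top of Figure 6.
reg = RoundBasedRegisterSpec()
history = [
    reg.read(3, UNDEF, True),     # N3 reads with its round
    reg.write(3, "v3", True),     # N3 decides its value
    reg.write(1, "v1", False),    # N1's write fails but records its value
    reg.read(2, "v1", True),      # N2's read succeeds with N1's value
]
print(history)
```

The replay exercises the three hints of Section 3: the failed write still leaves its value in valsRR, the last read succeeds although a higher round has already succeeded, and a write fails when its Boolean choice says the round is stolen.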

In order to prove that the implementation in Figures 3 and 4 linearises with respect to the specification in Figure 9, we use the instrumented implementation in Figures 10 and 11, which uses prophecy variables [1, 26] that “guess” whether the execution of the method will reach a particular program location or not. The instrumented implementation also uses external linearisation points. In particular, the code of the acceptors may help to linearise some of the invocations to read and write, based on the prophecies and on auxiliary variables that count the number of acknowledgements sent by acceptors after each invocation of a read or a write. The next paragraphs elaborate on our use of prophecy variables and on our helping mechanism.

Variables abs_vRR, abs_roundRR and abs_valsRR in
Figure 10 model the abstract state. They are
initially set to undef, $0$ and the set containing undef respectively. Variable abs_res_r[] is an infinite array of
single-assignment pairs of Boolean and value that model the abstract results
of the invocations to read. (Think of an infinite array as a map from
integers to some type; we use the array notation for convenience.) Similarly,
variable abs_res_w[] is an infinite array of single-assignment Booleans
that models the abstract results of the invocations to write. All the
cells in both arrays are initially undef (*i.e.*, the initial maps are
empty). Variables count_r[] and count_w[] are infinite arrays of
integers that model the number of acknowledgements sent (but not necessarily
received yet) from acceptors in response to respectively read or write
requests. All cells in both arrays are initially $0$. The variable
proph_r[] is an infinite array of single-assignment pairs
bool $\times$ val, modelling the prophecy for the invocations of
read, and variable proph_w[] is an infinite array of
single-assignment Booleans modelling the prophecy for the invocations of
write.

The linearisation-point annotations linRE(k, vD, b) for read take the proposer’s round k, a value vD and a Boolean b, and they invoke the abstract operation and disambiguate it by assigning the parameters to the unspecified vD and b of the specification on the left of Figure 9. At the beginning of a read(k) (lines 11–14 of Figure 10), the prophecy proph_r[k] is set to (true, $v$) if the invocation reaches PL: RE_SUCC in line 26. The value $v$ is defined to coincide with maxV at the time when that location is reached. That is, $v$ is the value accepted at the greatest round by the acceptors acknowledging so far, or undefined if no acceptor ever accepted any value. If the operation reaches PL: RE_FAIL in line 24 instead, the prophecy is set to (false, _). (If the method never returns, the prophecy is left undef since it will never linearise.) A successful read(k) linearises in the code of the acceptor in Figure 11, when the $\lceil (n+1)/2 \rceil$th acceptor (the one whose acknowledgement completes a majority) sends [ackRE, k, v, w], and only if the prophecy is (true, $v$) and the operation was not linearised before (lines 10–14). We force the unspecified vD and b to be $v$ and true respectively, which ensures that the abstract operation succeeds and returns (true, $v$). A failing read(k) linearises at the return in the code of read (lines 23–24 of Figure 10), after the reception of [nackRE, k] from one acceptor. We force the unspecified vD and b to be undef and false respectively, which ensures that the abstract operation fails.

The linearisation-point annotations linWR(k, vW, b) for write take the proposer’s round k and value vW, and a Boolean b, and they invoke the abstract operation and disambiguate it by assigning the parameter to the unspecified b of the specification on the right of Figure 9. At the beginning of a write(k, vW) (lines 31–33 of Figure 10), the prophecy proph_w[k] is set to true if the invocation reaches PL: WR_SUCC in line 45, or to false if it reaches PL: WR_FAIL in line 43 (or it is left undef if the method never returns). A successful write(k, vW) linearises in the code of the acceptor in Figure 11, when the $\lceil (n+1)/2 \rceil$th acceptor (the one whose acknowledgement completes a majority) sends [ackWR, k], and only if the prophecy is true and the operation was not linearised before (lines 17–24). We force the unspecified b to be true, which ensures that the abstract operation succeeds deciding value vW and updates roundRR to k. A failing write(k, vW) may linearise either at the return in its own code (lines 41–43 of Figure 10) if the proposer received one [nackWR, k] and no acceptor sent any [ackWR, k] yet, or at the code of the acceptor, when the first acceptor sends [ackWR, k], and only if the prophecy is false and the operation was not linearised before. In both cases, we force the unspecified b to be false, which ensures that the abstract operation fails.

## 5 Multi-Paxos via Network Transformations

We now turn to more complicated distributed protocols that build upon the idea
of Paxos consensus. Our ultimate goal is to reuse the verification result from
Sections 3–4, as
well as the high-level round-based register interface. In this section, we
will demonstrate how to reason about an implementation of Multi-Paxos as an
array of *independent* instances of the *Paxos* module defined
previously, despite the subtle dependencies between its sub-components, as
present in Multi-Paxos’s “canonical”
implementations [27, 15, 5].
While an abstraction of Multi-Paxos to an array of independent shared
“single-shot” registers is almost folklore, what appears to be inherently
difficult is to verify a Multi-Paxos-based consensus (*wrt.* the array-based
abstraction) by means of *reusing* the proof of an SD-Paxos. All proofs of
Multi-Paxos we are aware of are, thus, *non-modular* with respect to
underlying SD-Paxos
instances [22, 5, 24], *i.e.*, they
require one to redesign the invariants of the *entire* consensus
protocol.

This proof modularity challenge stems from the optimised nature of a classical
Multi-Paxos protocol, as well as its real-world
implementations [6].
Our goal in this part of the work is to distil such protocol-aware optimisations into
a separate *network semantics layer*, and show that each of them refines
the semantics of a Cartesian product-based view, *i.e.*, exhibits the very same
client-observable behaviours.
To do so, we will establish the refinement between the optimised
implementations of Multi-Paxos and a simple Cartesian product abstraction, which
will allow us to extend the register-based abstraction, explored before in this
paper, to what is considered to be a canonical amortised Multi-Paxos implementation.

### 5.1 Abstract Distributed Protocols

We start by presenting the formal definitions for encoding distributed protocols (including Paxos), their message vocabularies, protocol-based network semantics, and the notion of observable behaviours.

##### Protocols and messages.

Figure 12 provides basic definitions of the distributed protocols
and their components.
Each protocol $p$ is a tuple whose components are described below.
One component is a set of
local states, which can be assigned to each of the participating nodes, also
determining the node’s role via an additional tag, if
necessary (*e.g.*, the states of an acceptor and of a proposer in Paxos are different).
(We leave implicit the consistency laws for the state, which are protocol-specific.)
Another component is a “message vocabulary”, determining the set of messages that can
be used for communication between the nodes.

Messages can be thought of as JavaScript-like dictionaries, pairing unique
fields (isomorphic to strings) with their values. For the sake of a uniform
treatment, we assume that each message has at least two fields,
$\mathit{from}$ and $\mathit{to}$, that point to the source and the destination node of a
message, correspondingly. In addition to that, for simplicity we will assume
that each message carries a Boolean field $\mathit{active}$, which is set to $\mathit{True}$
when the message is sent and is set to $\mathit{False}$ when the message is received
by its destination node. This flag is required to keep history information
about messages sent in the past, which is customary in frameworks for
reasoning about distributed
protocols [23, 10, 28].
We assume that a “message soup” is a multiset of messages (*i.e.*, a set
with zero or more copies of each message) and we consider that each copy of
the same message in the multiset has its own “identity”, so that two copies of a
particular message can be told apart even when their contents coincide.

Finally, are step-relations that correspond to the internal changes in the local state of a node (), as well as changes associated with sending () and receiving () messages by a node, as allowed by the protocol. Specifically, relates a local node state before and after the allowed internal change; relates the initial state and an incoming message with the resulting state; relates the internal state, the output state and the set of atomically sent messages. For simplicity we will assume that .

In addition, we consider the set of the
allowed *initial* states, in which the system can be at the very
beginning of its execution.
The global state of the network $\sigma$ is a map from node
identifiers ($n$) to local states from the set of states
defined by the protocol.

##### Simple network semantics.

The simple initial operational semantics of the network
is parametrised by a protocol $p$ and relates the initial *configuration*
(*i.e.*, the global state and the set of messages) with the resulting
configuration. It is defined as the reflexive closure of the union of three
relations, whose rules are
given in Figure 13.

The rule StepInt corresponds to a node $n$ picked non-deterministically from the domain of a global state $\sigma$, executing an internal transition, thus changing its local state from $\delta$ to $\delta'$. The rule StepReceive non-deterministically picks a message $m$ from a message soup $M$, changes the state using the protocol’s receive-step relation $p.S_{rcv}$ at the corresponding host node $m.\mathit{to}$, and updates its local state accordingly in the common mapping. Finally, the rule StepSend non-deterministically picks a node $n$ and executes a send-step, which results in updating its local state and emitting a set of messages $ms$, which is added to the resulting soup. In order to “bootstrap” the execution, the allowed initial states are assigned to the nodes.
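The three rules can be illustrated with a minimal Python sketch. It is not the paper's formalisation: step-relations are modelled as deterministic functions rather than relations, the message soup is a list (so list positions stand in for the identities of message copies), and the toy ping-counting protocol, the `body` field and all function names are assumptions made for illustration. As in the rule WStepReceiveT of Section 5.4, a received message stays in the soup with its `active` flag cleared.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Msg:
    frm: str             # source node (the `from` field; `from` is a Python keyword)
    to: str              # destination node
    body: str            # hypothetical protocol-specific payload
    active: bool = True  # True when sent, False once received


Sigma = Dict[str, int]   # global state: node id -> local state (a counter here)
Soup = List[Msg]         # message soup; positions act as message identities


def step_int(sigma: Sigma, n: str, s_int: Callable[[int], int]) -> Sigma:
    """StepInt: node n takes an internal step, changing only its local state."""
    return {**sigma, n: s_int(sigma[n])}


def step_receive(sigma: Sigma, soup: Soup, i: int,
                 s_rcv: Callable[[int, Msg], int]) -> Tuple[Sigma, Soup]:
    """StepReceive: deliver the i-th message of the soup to its destination."""
    m = soup[i]
    if not (m.active and m.to in sigma):
        raise ValueError("message is not deliverable")
    new_sigma = {**sigma, m.to: s_rcv(sigma[m.to], m)}
    # The message is kept as history, but marked as received.
    new_soup = soup[:i] + [replace(m, active=False)] + soup[i + 1:]
    return new_sigma, new_soup


def step_send(sigma: Sigma, soup: Soup, n: str,
              s_snd: Callable[[int, str], Tuple[int, List[Msg]]]) -> Tuple[Sigma, Soup]:
    """StepSend: node n updates its state and atomically emits messages."""
    delta2, ms = s_snd(sigma[n], n)
    return {**sigma, n: delta2}, soup + ms


# Toy protocol: any sender pings "b"; a receiver counts the pings it got.
def toy_snd(delta: int, n: str) -> Tuple[int, List[Msg]]:
    return delta, [Msg(frm=n, to="b", body="ping")]


def toy_rcv(delta: int, m: Msg) -> int:
    return delta + 1


sigma0: Sigma = {"a": 0, "b": 0}
sigma1, soup1 = step_send(sigma0, [], "a", toy_snd)
sigma2, soup2 = step_receive(sigma1, soup1, 0, toy_rcv)
print(sigma2, soup2)
```

A trace of configurations in the sense of the behaviours defined next is then just the sequence of `(sigma, soup)` pairs produced by such steps.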

We next define the observable protocol behaviours *wrt.* the simple network
semantics as the prefix-closed set of all system’s configuration traces.

###### Definition 1 (Protocol behaviours)

That is, the set of behaviours captures all possible configurations of initial states for a fixed set of nodes. In this case, the set of nodes is an implicit parameter of the definition, which we fix in the remainder of this section.

###### Example 1 (Encoding SD-Paxos)

An abstract distributed protocol for SD-Paxos can be extracted from the pseudo-code of Section 3 by providing a suitable small-step operational semantics à la Winskel [30]. We refrain from giving such a formal semantics, but in Appendix D of the extended version of the paper we outline how the distributed protocol would be obtained from the given operational semantics and from the code in Figures 3, 4 and 5.

### 5.2 Out-of-Thin-Air Semantics.

We now introduce an intermediate version of a simple protocol-aware semantics
that generates messages “out of thin air” according to a certain predicate
$P$ over local states and messages, which determines whether the
network generates a certain message without exercising the corresponding
send-transition. The rule is as follows:
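$$
\frac{n \in \mathrm{dom}(\sigma) \qquad \delta = \sigma(n) \qquad P(\delta, m) \qquad M' = M \cup \{m\}}
     {\langle \sigma, M \rangle \overset{\mathit{ota}}{\Longrightarrow}_{p,P} \langle \sigma, M' \rangle}
\;\text{(OTASend)}
$$
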
That is, a random message $m$ can be sent at any moment in the
semantics described by $\overset{\mathit{ota}}{\Longrightarrow}_{p,P}$, given
that the node $n$, “on behalf of which” the message is sent, is in a
state $\delta$ such that $P(\delta, m)$ holds.

###### Example 2

In the context of Single-Decree Paxos, we can define as follows:

In other words, if a node is a *Proposer* currently operating with a
round , the network semantics can always send another
request “on its behalf”, thus generating the message “out-of-thin-air”.
Importantly, the last conjunct in the definition of $P$ is in terms of
$\le$, rather than equality. This means that the predicate is intentionally
loose, allowing for sending even “stale” messages, with expired rounds that
are smaller than what the node currently holds (no harm in that!).
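The OTASend rule and a predicate of this shape can be sketched in Python as follows. The predicate below is only an assumption written in the spirit of the prose above (a proposer holding some round may have a read request with any round up to its own sent on its behalf); the field names `role`, `round` and `type`, and the message type `"RE"`, are hypothetical and do not reproduce the paper's exact definition.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]  # local state, e.g. {"role": "proposer", "round": 3}
Msg = Dict[str, object]    # dictionary-like message
Sigma = Dict[str, State]   # global state: node id -> local state


def P(delta: State, m: Msg) -> bool:
    """Hypothetical OTA predicate: a proposer with round r justifies any
    read request whose round is at most r (stale rounds are allowed)."""
    return (delta.get("role") == "proposer"
            and m.get("type") == "RE"
            and int(m["round"]) <= int(delta["round"]))


def ota_send(sigma: Sigma, soup: List[Msg], n: str, m: Msg,
             pred: Callable[[State, Msg], bool]) -> Optional[Tuple[Sigma, List[Msg]]]:
    """OTASend: add m to the soup on behalf of node n if pred(sigma[n], m)
    holds; the global state is left untouched. Returns None if the rule
    does not apply."""
    if n not in sigma or not pred(sigma[n], m):
        return None
    return sigma, soup + [m]


sigma: Sigma = {
    "p1": {"role": "proposer", "round": 3},
    "a1": {"role": "acceptor", "round": 0},
}
stale = {"from": "p1", "to": "a1", "type": "RE", "round": 2, "active": True}
future = {"from": "p1", "to": "a1", "type": "RE", "round": 4, "active": True}

print(ota_send(sigma, [], "p1", stale, P) is not None)   # the stale request is justified
print(ota_send(sigma, [], "p1", future, P) is not None)  # a future round is not
```

Note that the rule never changes `sigma`: the message appears without the corresponding send-transition being exercised.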

By definition of the single-decree Paxos protocol, the following lemma holds:

### 5.3 Slot-Replicating Network Semantics.

With the basic definitions at hand, we now proceed to describe alternative
network behaviours that make use of a specific protocol
$p$, which we will consider
to be fixed for the remainder of this section, so we will at times
refer to its components without a
qualifier.

Figure 14 describes a semantics of a *slot-replicating*
(SR) network that exercises multiple copies of the *same* protocol $p$,
one instance for each index in $I$, some, possibly infinite, set of indices, to
which we will also refer as *slots*. Multiple copies of the
protocol are incorporated by enhancing the messages from $p$’s vocabulary
with the corresponding indices, and implementing the on-site dispatch
of the indexed messages to the corresponding protocol instances at each node. The
local protocol state of each node is, thus, no longer a single element being
updated, but rather an *array*, mapping each index in $I$ to the
corresponding local state component.
The small-step relation for SR semantics is denoted by . The
rule SRStepInt is similar to StepInt of the simple
semantics, with the difference that it picks not only a node but also an index
, thus referring to a specific component as and
updating it correspondingly .
For the remaining transitions, we postulate that the messages from $p$’s
vocabulary are enhanced to have a dedicated field $\mathit{slot}$, which
indicates the protocol copy at a node to which the message is directed.
The receive-rule SRStepReceive is similar
to StepReceive but takes into account the value of
$\mathit{slot}$ in the received message $m$, thus redirecting it to the
corresponding protocol instance and updating the local state
appropriately. Finally, the rule SRStepSend can now be
executed for any slot in $I$, reusing most of the logic of the
initial protocol and otherwise mimicking its simple network semantics
counterpart StepSend.
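The slot-based dispatch can be sketched in Python as follows, under the same simplifying assumptions as a functional model: the receive-step relation is a deterministic function, the soup is a list, and the toy counting protocol and the helper `project` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Msg:
    frm: str
    to: str
    slot: int            # protocol copy at the destination node
    body: str            # hypothetical payload
    active: bool = True


# Each node now holds an array of local states, one per slot.
SRSigma = Dict[str, Dict[int, int]]


def sr_step_receive(sigma: SRSigma, soup: List[Msg], i: int,
                    s_rcv: Callable[[int, Msg], int]) -> Tuple[SRSigma, List[Msg]]:
    """SRStepReceive: dispatch the message to the instance named by m.slot."""
    m = soup[i]
    if not (m.active and m.to in sigma):
        raise ValueError("message is not deliverable")
    delta2 = s_rcv(sigma[m.to][m.slot], m)
    new_local = {**sigma[m.to], m.slot: delta2}   # only one component changes
    new_soup = soup[:i] + [replace(m, active=False)] + soup[i + 1:]
    return {**sigma, m.to: new_local}, new_soup


def project(sigma: SRSigma, slot: int) -> Dict[str, int]:
    """Restrict every node's local state to a single slot."""
    return {n: local[slot] for n, local in sigma.items()}


def count_rcv(delta: int, m: Msg) -> int:
    return delta + 1


sigma0: SRSigma = {"b": {1: 0, 2: 0}}
soup0 = [Msg(frm="a", to="b", slot=1, body="ping")]
sigma1, soup1 = sr_step_receive(sigma0, soup0, 0, count_rcv)
print(project(sigma1, 1), project(sigma1, 2))
```

In the usage above, the step taken at slot 1 leaves the projection onto slot 2 unchanged.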

Importantly, in this semantics, for two different slots $i$ and $j$, such that
$i \neq j$, the corresponding “projections” of the state behave
*independently* from each other. Therefore, transitions and messages in
the protocol instances indexed by $i$ at different nodes *do not
interfere* with those indexed by $j$.
This observation can be stated formally. In order to do so, we first define
the behaviours of slot-replicating networks and their projections as follows:

###### Definition 2 (Slot-replicating protocol behaviours)

That is, the slot-replicated behaviours are merely behaviours with respect to
networks, whose nodes hold *multiple instances* of the same protocol,
indexed by slots.
For a slot $i$, we define the *projection* of these behaviours as a set
of global state traces, where each node’s local state is restricted only to
its $i$th component. The following simulation lemma holds naturally,
connecting the slot-replicating network semantics and simple network
semantics.

###### Lemma 2 (Slot-replicating simulation)

For all , .

###### Example 3 (Slot-replicating semantics and Paxos)

Given our representation of Paxos using roles (acceptors/proposers) encoded via the corresponding parts of the local state $\delta$, we can construct a “naïve” version of Multi-Paxos by using the SR semantics for the protocol. In such a construction, every slot corresponds to an SD-Paxos instance that does not interact with any other slot. From the practical perspective, such an implementation is rather non-optimal, as it does not exploit dependencies between rounds accepted at different slots.

### 5.4 Widening Network Semantics.

We next consider a version of the SR semantics, extended with a new rule for
handling received messages. In the new semantics, dubbed *widening*, a
node, upon receiving a message $m$, where $m \in T$ for a set $T$ of “triggering” messages, for a
slot $m.\mathit{slot}$, *replicates* it for all slots from the index set $I$, for the
very same node. The new rule is as follows:
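$$
\frac{\begin{array}{c}
m \in M \qquad m.\mathit{active} \qquad m.\mathit{to} \in \mathrm{dom}(\sigma) \qquad \delta = \sigma(m.\mathit{to})[m.\mathit{slot}] \\
\langle \delta, m, \delta' \rangle \in p.S_{rcv} \qquad m' = m[\mathit{active} \mapsto \mathit{False}] \qquad \sigma' = \sigma(n)[m.\mathit{slot} \mapsto \delta'] \\
ms = \text{if } (m \in T) \text{ then } \{ m' \mid m' = m[\mathit{slot} \mapsto j],\ j \in I \} \text{ else } \emptyset
\end{array}}
{\langle \sigma, M \rangle \overset{\mathit{rcv}}{\Longrightarrow}_{\nabla} \langle \sigma', (M \setminus \{m\}) \cup \{m'\} \cup ms \rangle}
\;\text{(WStepReceiveT)}
$$
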
At first, this semantics seems rather unreasonable: it might create more
messages than the system can “consume”. However, it is possible to prove
that, under certain conditions on the protocol $p$, the set of behaviours
observed under this semantics (*i.e.*, with SRStepReceive replaced by
WStepReceiveT) is *not larger* than the set given
by Definition 2.
To state this formally, we first relate the set $T$ of “triggering” messages
from WStepReceiveT to a specific predicate $P$.
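To make the effect of the widening rule concrete, here is a small Python sketch that follows WStepReceiveT literally: the replicas are copies of the original (still active) message, one per slot in $I$, including the slot the message arrived at. The triggering set $T$ is modelled as a Boolean function, the receive-step relation as a deterministic function, and the toy protocol and names (`in_T`, `count_rcv`, the `body` field) are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Msg:
    frm: str
    to: str
    slot: int
    body: str            # hypothetical payload
    active: bool = True


SRSigma = Dict[str, Dict[int, int]]


def w_step_receive(sigma: SRSigma, soup: List[Msg], i: int,
                   s_rcv: Callable[[int, Msg], int],
                   in_T: Callable[[Msg], bool],
                   I: Iterable[int]) -> Tuple[SRSigma, List[Msg]]:
    """WStepReceiveT: receive m at its slot and, if m is a triggering
    message, add one active replica of m for every slot in I."""
    m = soup[i]
    if not (m.active and m.to in sigma):
        raise ValueError("message is not deliverable")
    delta2 = s_rcv(sigma[m.to][m.slot], m)
    new_sigma = {**sigma, m.to: {**sigma[m.to], m.slot: delta2}}
    received = replace(m, active=False)
    replicas = [replace(m, slot=j) for j in I] if in_T(m) else []
    return new_sigma, soup[:i] + soup[i + 1:] + [received] + replicas


def count_rcv(delta: int, m: Msg) -> int:
    return delta + 1


slots = [1, 2, 3]
sigma0: SRSigma = {"acc": {1: 0, 2: 0, 3: 0}}
soup0 = [Msg(frm="p", to="acc", slot=1, body="RE")]
sigma1, soup1 = w_step_receive(sigma0, soup0, 0, count_rcv,
                               lambda m: m.body == "RE", slots)
print(sigma1, len(soup1))
```

The sketch makes the concern raised above visible: a single triggering message yields as many fresh active messages as there are slots, each of which may trigger again.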

###### Definition 3 (OTA-compliant message sets)

The set of messages is OTA-compliant with the predicate iff for any and , if , then .

In other words, the protocol is relaxed enough to “justify” the presence
of $m$ in the soup at *any* execution, by providing the predicate
$P$, relating the message to the corresponding sender’s state.
Next, we use this definition to connect the slot-replicating and widening semantics via
the following definition.

###### Definition 4 ($P$-monotone protocols)

A protocol is -monotone iff for any, , , , , and , if then we have that , where “removes” the field from .

Less formally, Definition 4 ensures that in a slot-replicated
product of a protocol $p$, different components cannot perform “out of
sync” *wrt.* $P$. Specifically, if a node in the $i$th projection is related to a certain
message via $P$, then any other projection of the same
node will be $P$-related to this message, as well.

###### Example 4

This is a “non-example”. A version of slot-replicated SD-Paxos, where we
allow for arbitrary increments of the round *per-slot* at the same
proposer node (*i.e.*, out of sync), would not be monotone *wrt.* $P$ from
Example 2.
In contrast, a slot-replicated product of SD-Paxos instances with
fixed rounds is monotone *wrt.* the same $P$.

###### Lemma 3

If from WStepReceiveT is OTA-compliant with predicate , such that and is -monotone, then .

### 5.5 Optimised Widening Semantics.

Our next step towards a realistic implementation of Multi-Paxos out of SD-Paxos
instances is enabled by an observation that in the widening semantics, the
replicated messages are *always* targeting the same node, to which the
initial message was addressed. This means that we can optimise the
receive-step, making it possible to execute multiple receive-transitions of
the core protocol in batch. The following rule OWStepReceiveT
captures this intuition formally:
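$$
\frac{m \in M \qquad m.\mathit{active} \qquad m.\mathit{to} \in \mathrm{dom}(\sigma) \qquad \langle \sigma', ms \rangle = \mathit{receiveAndAct}(\sigma, n, m)}
     {\langle \sigma, M \rangle \overset{\mathit{rcv}}{\Longrightarrow}_{\nabla^{*}} \langle \sigma', M \setminus \{m\} \cup \{m[\mathit{active} \mapsto \mathit{False}]\} \cup ms \rangle}
\;\text{(OWStepReceiveT)}
$$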

In essence, the rule OWStepReceiveT blends several steps of the widening semantics together for a single message: (a) it first receives the message and replicates it for all slots at a destination node; (b) performs receive-steps for the message’s replicas at each slot; (c) takes a number of internal steps, allowed by the protocol’s internal-step relation; and (d) takes a send-transition, eventually sending all emitted messages, instrumented with the corresponding slots.
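Steps (a) to (d) can be sketched in Python for a toy acceptor. The source does not spell out receiveAndAct here, so the function below is an assumption: each slot of the acceptor stores the highest round it has promised, a read request with round `k` is acknowledged at a slot iff `k` is at least that round, and the internal steps of (c) are taken to be empty. The message types `"RE"`, `"ackRE"` and `"nackRE"` follow the names used in the text; all other field and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Msg = Dict[str, object]
SRSigma = Dict[str, Dict[int, int]]  # node -> slot -> highest promised round


def receive_and_act(sigma: SRSigma, n: str, m: Msg) -> Tuple[SRSigma, List[Msg]]:
    """Toy receiveAndAct: process a read request at *every* slot of node n
    and collect one reply per slot."""
    local = dict(sigma[n])
    k = int(m["round"])
    replies: List[Msg] = []
    for slot in sorted(local):                 # (a) replicate for all slots
        if k >= local[slot]:                   # (b) per-slot receive-step
            local[slot] = k
            kind = "ackRE"
        else:
            kind = "nackRE"
        # (c) no internal steps in this toy; (d) emit the slot-tagged reply
        replies.append({"from": n, "to": m["from"], "slot": slot,
                        "type": kind, "round": k, "active": True})
    return {**sigma, n: local}, replies


def ow_step_receive(sigma: SRSigma, soup: List[Msg], i: int) -> Tuple[SRSigma, List[Msg]]:
    """OWStepReceiveT: one batched step for a single incoming message."""
    m = soup[i]
    if not (m["active"] and m["to"] in sigma):
        raise ValueError("message is not deliverable")
    new_sigma, ms = receive_and_act(sigma, str(m["to"]), m)
    received = {**m, "active": False}
    return new_sigma, soup[:i] + soup[i + 1:] + [received] + ms


sigma0: SRSigma = {"acc": {1: 0, 2: 5}}
request = {"from": "p", "to": "acc", "slot": 1, "type": "RE",
           "round": 3, "active": True}
sigma1, soup1 = ow_step_receive(sigma0, [request], 0)
print(sigma1)
print([(r["slot"], r["type"]) for r in soup1[1:]])
```

In this run, the single request for slot 1 is acknowledged at slot 1 and rejected at slot 2, which already holds a higher round.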

###### Example 6

Continuing Example 5, with the same parameters, the
optimising semantics will execute the transitions of an acceptor, *for
all slots*, triggered by receiving a single [RE, k] message for a
particular slot, sending back *all* the results for all the slots, which
might either agree to accept the value or reject it.

The following lemma relates the optimising and the widening semantics.

###### Lemma 4 (Refinement for OW semantics)

For any there exists , such that can be obtained from by replacing sequences of configurations that have just a single node , whose local state is affected in , by .

That is, behaviours in the optimised semantics are the same as in the widening semantics, modulo some sequences of locally taken steps that are being “compressed” to just the initial and the final configurations.

### 5.6 Bunching Semantics.

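$$
\frac{\begin{array}{c}
m \in M \qquad m.\mathit{active} \qquad m.\mathit{to} \in \mathrm{dom}(\sigma) \qquad \langle \sigma', ms \rangle = \mathit{receiveAndAct}(\sigma, n, m) \\
M' = M \setminus \{m\} \cup \{m[\mathit{active} \mapsto \mathit{False}]\} \qquad m' = \mathit{bunch}(ms, m.\mathit{to}, m.\mathit{from})
\end{array}}
{\langle \sigma, M \rangle \overset{\mathit{rcv}}{\Longrightarrow}_{B} \langle \sigma', M' \cup \{m'\} \rangle}
\;\text{(BStepRecvB)}
$$

$$
\frac{m \in M \qquad m.\mathit{active} \qquad m.\mathit{to} \in \mathrm{dom}(\sigma) \qquad m.\mathit{msgs} = ms \qquad M' = M \setminus \{m\} \cup ms}
     {\langle \sigma, M \rangle \overset{\mathit{rcv}}{\Longrightarrow}_{B} \langle \sigma, M' \rangle}
\;\text{(BStepRecvU)}
$$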

As the last step towards Multi-Paxos, we introduce the final network
semantics, which optimises executions according to the semantics
described in the previous section even further by making a simple addition
to the message vocabulary of a slot-replicated
SD-Paxos: *bunched messages*.
A bunched message simply packages together several messages, obtained
typically as a result of a “compressed” execution via the optimised
semantics from Section 5.5. We define two new rules
for packaging and “unpackaging” certain messages in Figure 15.
The two new rules can be added to enhance either of the versions of
the slot-replicating semantics shown before. In essence, the only
effect they have is to combine the messages resulting from the execution
of the corresponding steps of an optimised widening (via
BStepRecvB), and to unpackage the messages from a
bunched message, adding them back to the soup
(BStepRecvU).
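Packaging and unpackaging can be sketched in Python with dictionary-like messages. The `msgs` field and the argument order of `bunch` follow the rules BStepRecvB and BStepRecvU; the `"bunch"` type tag, the function `b_step_recv_u` and the sample replies are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Msg = Dict[str, object]
Sigma = Dict[str, object]


def bunch(ms: List[Msg], frm: str, to: str) -> Msg:
    """Package several messages into a single bunched message."""
    return {"from": frm, "to": to, "type": "bunch",
            "msgs": list(ms), "active": True}


def b_step_recv_u(sigma: Sigma, soup: List[Msg], i: int) -> Tuple[Sigma, List[Msg]]:
    """BStepRecvU: replace a bunched message in the soup by its components.
    The global state is not changed by this step."""
    m = soup[i]
    if not (m["active"] and m["to"] in sigma and "msgs" in m):
        raise ValueError("not an unpackageable bunched message")
    return sigma, soup[:i] + soup[i + 1:] + list(m["msgs"])


replies = [
    {"from": "acc", "to": "p", "slot": 1, "type": "ackRE", "active": True},
    {"from": "acc", "to": "p", "slot": 2, "type": "nackRE", "active": True},
]
bunched = bunch(replies, "acc", "p")
sigma = {"p": None, "acc": None}
sigma_after, soup_after = b_step_recv_u(sigma, [bunched], 0)
print(soup_after == replies)
```

Since unpackaging restores exactly the messages that were bunched, replacing each bunched message by its `msgs` component recovers a soup of the unbunched semantics.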
The following natural refinement result holds:

###### Lemma 5

For any there exists , such that can be obtained from by replacing all bunched messages in by their -component.

The rule BStepRecvU enables effective local caching of the
bunched messages, so they are processed *on demand* on the recipient side
(*i.e.*, by the per-slot proposers), allowing the implementation to *skip*
an entire round of Phase 1.

### 5.7 The Big Picture.

What exactly have we achieved by introducing the family of semantics described above? As illustrated in Figure 16, all behaviours of the leftmost-topmost, bunching semantics, which corresponds precisely to an implementation of Multi-Paxos with an “amortised” Phase 1, can be transitively related to the corresponding behaviours in the rightmost, vanilla slot-replicated version of a simple semantics (via the correspondence from Lemma 1) by constructing the corresponding refinement mappings [1], delivered by the proofs of Lemmas 3–5.

From the perspective of Rely/Guarantee reasoning, which was employed in
Section 4, the refinement result from Figure 16
justifies the replacement of a semantics on the right of the diagram by one to
the left of it, as all program-level assertions will remain substantiated by
the corresponding system configurations, as long as they are *stable*
(*i.e.*, resilient *wrt.* transitions taken by nodes different from the one being
verified), which they are in our case.

## 6 Putting It All Together

We culminate our story of faithfully deconstructing and abstracting
Paxos via a round-based register, as well as recasting Multi-Paxos via a
series of network transformations, by showing how to *implement*
the register-based abstraction from
Section 3 in tandem with the network
semantics from Section 5 in order to deliver a
provably correct, yet efficient, implementation of Multi-Paxos.

The crux of the composition of the two results (a register-based abstraction
of SD-Paxos and a family of semantics-preserving network transformations) is
a convenient interface for the end client, so that she can interact with a
consensus instance via the proposeM method in lines 1–4 of
Figure 17, no matter with which particular slot of a
Multi-Paxos implementation she is interacting.
To do so, we propose to introduce a *register provider*, a service that
would give a client a “reference” to the consensus object to interact
with. Lines 6–7 of Figure 17 illustrate the
interaction with the service provider, where the client requests two specific
slots, 1 and 2, of Multi-Paxos by invoking getR and providing a slot
parameter. In both cases the client proposes the very same value v in
the two instances that run the same machinery. (Notice that, except for the
reference to the consensus object, proposeM is identical to the
proposeP on the right of Figure 2, which we have
verified *wrt.* linearisability in Section 3.)
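The client-side view of the register provider can be illustrated with a deliberately simplified Python sketch. Only the names getR and proposeM come from the text; everything else is an assumption. In particular, `ToyConsensus` is a sequential, single-process stand-in that decides the first proposed value, and it does not model the round-based register, the proposers and acceptors, or the network layer.

```python
from typing import Dict, Optional


class ToyConsensus:
    """Hypothetical stand-in for a per-slot consensus object: the first
    proposed value is decided and never changes afterwards."""

    def __init__(self) -> None:
        self.decided: Optional[str] = None

    def propose(self, v: str) -> str:
        if self.decided is None:
            self.decided = v
        return self.decided


class RegisterProvider:
    """Hands out a reference to the consensus object of a given slot,
    creating it on first use."""

    def __init__(self) -> None:
        self._slots: Dict[int, ToyConsensus] = {}

    def getR(self, slot: int) -> ToyConsensus:
        return self._slots.setdefault(slot, ToyConsensus())


def proposeM(r: ToyConsensus, v: str) -> str:
    """Client-facing proposal against whichever slot the reference denotes."""
    return r.propose(v)


provider = RegisterProvider()
r1 = provider.getR(1)
r2 = provider.getR(2)
print(proposeM(r1, "v"), proposeM(r2, "v"))
```

The point of the sketch is only the shape of the interface: the client code is the same for every slot and differs solely in the reference it obtains from the provider.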

The implementation of Multi-Paxos that we have in mind resembles the one in Figures 3, 4 and 5 of Section 3, but where all the global data is provided by the register provider and passed by reference. What differs in this implementation with respect to the one in Section 3 and is hidden from the client is the semantics of the network layer used by the bottom layer (cf. left part of Figure 2) of the register-based implementation. The Multi-Paxos instances run (without changing the register’s code) over this network layer, which “overloads” the meaning of the send/receive primitives from Figures 3 and 4 to follow the bunching network semantics, described in Section 5.6.

###### Theorem 6.1

The implementation of Multi-Paxos that uses a register provider and bunching network semantics refines the specification in Figure 17.

We implemented the register/network semantics in a proof-of-concept prototype
written in Scala/Akka. (The code is available
at https://github.com/certichain/protocol-combinators.)
We relied on the abstraction mechanisms of Scala, allowing us to implement the
register logic, verified in Section 4, separately from the
network middleware, which provides the family of semantics from
Section 5.
Together, they provide a family of provably correct, modularly verified
*distributed* implementations, coming with a simple *shared
memory-like* interface.

## 7 Related Work

##### Proofs of Linearisability via Rely/Guarantee.

Our work builds on the results of Boichat *et al.* [3], who
were the first to propose a systematic deconstruction of Paxos into
read/write operations of a *round-based register* abstraction.
We extend and harness those abstractions, by intentionally introducing
more non-determinism into them, which allows us to provide the first
modular (*i.e.*, mutually independent) proofs of Proposer and Acceptor
using Rely/Guarantee with linearisation points and prophecies.
While several logics have been proposed recently to prove linearisability of
concurrent implementations using Rely/Guarantee
reasoning [18, 26, 19, 14],
none of them considers message-passing distributed systems or consensus
protocols.

##### Verification of Paxos-family Algorithms.

Formal verification of different versions of Paxos-family protocols *wrt.* inductive invariants and liveness has been a focus of multiple verification
efforts in the past fifteen years.
To name just a few, Lamport has specified and verified Fast
Paxos [17] using TLA+ and its accompanying model
checker [32].
Chand *et al.* used TLA+ to specify and verify Multi-Paxos implementation,
similar to the one we considered in this work [5].
A version of SD-Paxos has been verified by Kellomäki using
the PVS theorem prover [13].
Jaskelioff and Merz have verified Disk Paxos in
Isabelle/HOL [12].
More recently, Rahli *et al.* formalised an executable version of
Multi-Paxos in EventML [24], a dialect of NuPRL.
Dragoi *et al.* [8] implemented and verified SD-Paxos in the
PSync framework, which implements a partially synchronised
model [7], supporting automated proofs of system
invariants.
Padon *et al.* have proved the system invariants and the consensus property of
both simple Paxos and Multi-Paxos using the verification tool
Ivy [23, 22].

Unlike all those verification efforts that consider
(Multi-/Disk/Fast/)Paxos as a *single monolithic protocol*, our
approach provides the first *modular* verification of single-decree Paxos
using the Rely/Guarantee framework, as well as the first verification of Multi-Paxos that directly reuses the proof of SD-Paxos.

##### Compositional Reasoning about Distributed Systems.

Several recent works have partially addressed modular formal
verification of distributed systems.
The IronFleet framework by Hawblitzel *et al.* has been used to verify
both safety and liveness of a real-world implementation of a
Paxos-based replicated state machine library and a lease-based shared
key-value store [10]. While the proof is
structured in a modular way by composing specifications in a way
similar to our decomposition in
Sections 3–4,
that work does not address linearisability and does not provide
composition of proofs about complex protocols (*e.g.*, Multi-Paxos) from
proofs about their subparts.

The Verdi framework for deductive verification of distributed
systems [29, 31] suggests the idea of
*Verified System Transformers* (VSTs), as a way to provide
*vertical composition* of distributed system implementations.
While Verdi’s VSTs are similar in purpose and idea to our
network transformations, they *do not* exploit the properties of
the protocol, which was crucial for us to verify Multi-Paxos’s
implementation.

The Disel framework [25, 28]
addresses the problem of *horizontal composition* of distributed
protocols and their client applications. While we do not compose Paxos
with any clients in this work, we believe its register-based
specification could be directly employed for verifying applications
that use Paxos as a sub-component, which is what is demonstrated by
our prototype implementation.

## 8 Conclusion and Future Work

We have proposed and explored two complementary mechanisms for modular
verification of Paxos-family consensus protocols [15]: (a)
non-deterministic register-based specifications in the style of
Boichat *et al.* [3], which allow one to decompose the proof
of protocol’s linearisability into separate independent “layers”, and (b) a
family of protocol-aware transformations of network semantics, making it
possible to reuse the verification efforts.
We believe that the applicability of these mechanisms spreads beyond reasoning
about Paxos and its variants and that they can be used for verifying other
consensus protocols, such as Raft [21] and
PBFT [4]. We are also going to employ network
transformations to verify implementations of Mencius [20],
and accommodate more protocol-specific optimisations, such as implementation of master leases and epoch numbering [6].

##### Acknowledgements.

We thank the ESOP 2018 reviewers for their feedback. This work was supported by ERC Starting Grant H2020-EU 714729 and EPSRC First Grant EP/P009271/1.

## References

- [1] M. Abadi and L. Lamport. The existence of refinement mappings. In LICS, pages 165–175. IEEE Computer Society, 1988.
- [2] R. Boichat, P. Dutta, S. Frølund, and R. Guerraoui. Deconstructing Paxos, 2001. OAI-PMH server at infoscience.epfl.ch, record 52373 (http://infoscience.epfl.ch/record/52373).
- [3] R. Boichat, P. Dutta, S. Frølund, and R. Guerraoui. Deconstructing paxos. SIGACT News, 34(1):47–67, 2003.
- [4] M. Castro and B. Liskov. Practical Byzantine Fault Tolerance. In OSDI, pages 173–186. USENIX Association, 1999.
- [5] S. Chand, Y. A. Liu, and S. D. Stoller. Formal Verification of Multi-Paxos for Distributed Consensus. In FM, volume 9995 of LNCS, pages 119–136, 2016.
- [6] T. Chandra, R. Griesemer, and J. Redstone. Paxos made live: an engineering perspective. In PODC, pages 398–407. ACM, 2007.
- [7] B. Charron-Bost and S. Merz. Formal verification of a consensus algorithm in the heard-of model. Int. J. Software and Informatics, 3(2-3):273–303, 2009.
- [8] C. Dragoi, T. A. Henzinger, and D. Zufferey. PSync: a partially synchronous language for fault-tolerant distributed algorithms. In POPL, pages 400–415. ACM, 2016.
- [9] I. Filipovic, P. W. O’Hearn, N. Rinetzky, and H. Yang. Abstraction for concurrent objects. Theor. Comput. Sci., 411(51-52):4379–4398, 2010.
- [10] C. Hawblitzel, J. Howell, M. Kapritsos, J. R. Lorch, B. Parno, M. L. Roberts, S. T. V. Setty, and B. Zill. IronFleet: proving practical distributed systems correct. In SOSP, pages 1–17. ACM, 2015.
- [11] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, 1990.
- [12] M. Jaskelioff and S. Merz. Proving the correctness of disk paxos. Archive of Formal Proofs, 2005, 2005.
- [13] P. Kellomäki. An Annotated Specification of the Consensus Protocol of Paxos Using Superposition in PVS. Technical Report Report 36, Tampere University of Technology. Institute of Software Systems, 2004.
- [14] A. Khyzha, A. Gotsman, and M. J. Parkinson. A Generic Logic for Proving Linearizability. In FM, volume 9995 of LNCS, pages 426–443. Springer, 2016.
- [15] L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
- [16] L. Lamport. Paxos made simple. SIGACT News, 32, 2001.
- [17] L. Lamport. Fast paxos. Distributed Computing, 19(2):79–103, 2006.
- [18] H. Liang and X. Feng. Modular verification of linearizability with non-fixed linearization points. In PLDI, pages 459–470. ACM, 2013.
- [19] H. Liang and X. Feng. A program logic for concurrent objects under fair scheduling. In POPL, pages 385–399. ACM, 2016.
- [20] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: Building Efficient Replicated State Machine for WANs. In OSDI, pages 369–384. USENIX Association, 2008.
- [21] D. Ongaro and J. K. Ousterhout. In search of an understandable consensus algorithm. In 2014 USENIX Annual Technical Conference, pages 305–319, 2014.
- [22] O. Padon, G. Losa, M. Sagiv, and S. Shoham. Paxos made EPR: decidable reasoning about distributed protocols. PACMPL, 1(OOPSLA):108:1–108:31, 2017.
- [23] O. Padon, K. L. McMillan, A. Panda, M. Sagiv, and S. Shoham. Ivy: safety verification by interactive generalization. In PLDI, pages 614–630. ACM, 2016.
- [24] V. Rahli, D. Guaspari, M. Bickford, and R. L. Constable. Formal specification, verification, and implementation of fault-tolerant systems using EventML. In AVOCS. EASST, 2015.
- [25] I. Sergey, J. R. Wilcox, and Z. Tatlock. Programming and proving with distributed protocols. PACMPL, 2(POPL):28:1–28:30, 2018.
- [26] V. Vafeiadis. Modular fine-grained concurrency verification. PhD thesis, University of Cambridge, 2007.
- [27] R. van Renesse and D. Altinbuken. Paxos Made Moderately Complex. ACM Comput. Surv., 47(3):42:1–42:36, 2015.
- [28] J. R. Wilcox, I. Sergey, and Z. Tatlock. Programming Language Abstractions for Modularly Verified Distributed Systems. In SNAPL, volume 71 of LIPIcs, pages 19:1–19:12. Schloss Dagstuhl, 2017.
- [29] J. R. Wilcox, D. Woos, P. Panchekha, Z. Tatlock, X. Wang, M. D. Ernst, and T. E. Anderson. Verdi: a framework for implementing and formally verifying distributed systems. In PLDI, pages 357–368. ACM, 2015.
- [30] G. Winskel. The Formal Semantics of Programming Languages. The MIT Press, Cambridge, Massachusetts, 1993.
- [31] D. Woos, J. R. Wilcox, S. Anton, Z. Tatlock, M. D. Ernst, and T. E. Anderson. Planning for change in a formal verification of the Raft Consensus Protocol. In CPP, pages 154–165. ACM, 2016.
- [32] Y. Yu, P. Manolios, and L. Lamport. Model Checking TLA Specifications. In CHARME, volume 1703 of LNCS, pages 54–66. Springer, 1999.

## Appendix 0.A Proof Outline of Module *Paxos*

###### Proof (Theorem 4.1)

By the following proof of linearisation. The following predicates state the relation that connects the concrete with the abstract state and the invariant of Paxos:

We consider actions (ProposeP1)

(ProposeP2)

(ProposeP3)

and (ProposeP4)

The guarantee relation for proposeP(v0) is

where pid() is the process identifier of the proposer, and the rely relation is

∎

## Appendix 0.B Proof Outline of Module *Round-Based Consensus*

###### Proof (Theorem 4.2)

By the following proof of linearisation. The following predicates state the abstract relation between the concrete and the abstract state in the instrumented implementation of Figure 8.

Variables vRR, roundRR and valsRR are respectively the decided value, the
round and the set of values from the module *Round-Based
Register*. The abstraction predicate ensures that the abstract abs_vRC and
concrete vRR coincide, that the abstract round abs_roundRC is less than or
equal to the concrete roundRR, and that the abstract abs_valsRC
corresponds to the concrete valsRR minus undef.

The following predicate states the invariant of Round-Based Consensus.

The invariant ensures that either no value has been decided yet (*i.e.* ), or otherwise a value has been
decided (*i.e.* ) and the abstract abs_valsRC contains .

Now we define the rely and guarantee relations. We consider the actions (ProposeRC1)