1 Introduction
Distributed systems comprise multiple software and hardware components that are bound to eventually fail [10]. Such failures can cause service malfunction or unavailability, incurring significant costs to mission-critical systems, e.g., automation systems and online transactions. The impact of failures can be minimized by protocols that let systems agree on values and actions despite failures. Consequently, many variants of the agreement or consensus problem [29] have been studied under different assumptions. Of particular importance are the synchrony and failure model assumptions, as they determine the problem’s complexity.
In the simplest failure model, often called the crash-stop model, a process fails by halting the execution of its protocol, and never recovers. In an asynchronous system, i.e., a system without bounds on execution delays or message latency, the crash-stop failure model makes it impossible to distinguish a crashed process from a very slow one. This renders consensus-like problems deterministically unsolvable [16], even in this very simple failure model. To circumvent this impossibility, previous works have investigated ways to relax the underlying asynchrony assumption either explicitly, e.g., by using partial synchrony [12], or implicitly, by defining oracles that encapsulate time, e.g., failure detectors [8]. The result is a large and rich body of literature that builds on these techniques to solve consensus-like problems in the presence of crash-stop failures. Typically, the respective proofs rely on assumptions of the “eventually forever” form: the correct nodes stay up forever, incorrect nodes eventually crash and remain down forever, and failure detectors may produce wrong output in the beginning, but eventually provide correct results forever.
However, such "eventually forever" assumptions are not met by real distributed systems. In reality, processes may crash, but their processors reboot and the recovered processes rejoin the computation. Communication might also fail at any point in time, but be restored later. Hence, the failure and recovery modes of both processes and communication links are in reality probabilistic and temporary [13, 15, 31], especially in systems incorporating many unreliable off-the-shelf, low-cost devices and communication technologies. This observation led to the development of crash-recovery models, where processes repeatedly leave and join the computation unannounced. Such models require new failure detector definitions and new consensus algorithms built on top of these failure detectors [1, 24, 11, 21], as well as completely new solutions (without failure detectors) that consider different classes of failures, classified according to how many times a process can crash and recover [25]. However, such solutions eliminate the "eventually forever" assumptions only at the level of processes, not for the communication links and failure detectors. Moreover, they do not tell us whether crash-stop algorithms can be “ported unchanged” to crash-recovery settings.
To this end, this paper investigates how to reuse consensus algorithms defined for the crash-stop model with reliable links and failure detectors in a more realistic crash-recovery model, where processes and links can crash and recover probabilistically and for an unbounded number of times. Our models allow unstable nodes, i.e., nodes that fail and recover infinitely often. Such nodes are often excluded, or limited in number, in other models. In contrast, we explicitly allow unstable behavior of any number of processes and links, by modeling communication problems and crash-recovery behavior as probabilistic and temporary, rather than deterministic and perpetual. Our system model, like existing models that rely on probabilistic factors, e.g., coin flips, comes with the trade-off of solving consensus (specifically, its termination property) with probability 1, rather than deterministically.
However, unlike existing solutions that incorporate probabilistic behavior, our approach does not aim at inventing new consensus algorithms, but rather focuses on using existing deterministic ones to solve consensus with probability 1. Our approach is modular: we build a wrapper that interacts with a crash-stop algorithm as a black box, exchanges messages with other wrappers, and transforms these messages into messages that the crash-stop algorithm understands. We then formally define classes of algorithms and safety properties for which we prove that our wrapper constructs a system that preserves these properties. Additionally, we show that termination with probability 1 is guaranteed for wrapped algorithms of this class. Moreover, this class is wide, and includes the celebrated Chandra-Toueg algorithm [8] as well as the instantiation of the indulgent framework with failure detectors from [20]. Our work allows such algorithms to be ported unchanged to our crash-recovery model. Hence, applications built on top of such algorithms can run in real systems with crash-recovery behavior by simply using our wrapper.
Contributions: To summarize, our main contributions are:

New system models that capture probabilistic and temporary failures and recoveries of processes and communication links in real distributed systems (described in Section 3)

A wrapper framework that allows a wide class of crash-stop consensus algorithms to be used unchanged in our more realistic models (described in Section 4)
2 Related Work
Several works have addressed the impossibility of asynchronous consensus. One direction exploits the concept of partial synchrony [12], in which an asynchronous system becomes synchronous after some unknown global stabilization time (GST) for a bounded number of rounds. For the same model, ASAP [3] is a consensus algorithm in which every process decides within an optimal number of rounds. Another direction augments asynchronous systems with failure detector oracles, and builds asynchronous consensus algorithms on top [8]. These detectors typically behave erratically at first, but eventually start behaving correctly forever. As with partial synchrony, in practice the failure detectors must behave correctly only for "sufficiently long" instead of forever [8]; however, quantifying "sufficiently long" is not expressible in a purely asynchronous model [9]. Both lines of work initially investigated crash-stop failures of processes. In real systems, processes as well as network links crash and recover multiple times, and sometimes even indefinitely. This gave rise to a large body of literature that studies how to adapt the two lines of work to crash-recovery behavior of processes and links. We next survey some of this literature.
Failure detectors and consensus algorithms for crash recovery: Dolev et al. [11] consider an asynchronous environment where communication links first lose messages arbitrarily, but communication eventually stabilizes such that a majority of processes forms a strongly connected component forever. Processes belonging to such a strongly connected component are termed correct, and the others faulty. Process state is always (fully) persisted in stable storage. The authors propose a failure detector that allows the correct processes to reach a consensus decision, and show that the rotating coordinator algorithm [8] works unchanged in their setting, as long as all messages are constantly retransmitted. This relies on piggybacking all previous messages onto the last message, and regularly retransmitting the last message. As this yields very large messages, they also propose a modification of [8] for which no piggybacking is necessary. While our results also rely on strongly connected components, we require their existence to be neither deterministic nor perpetual. We also do not require piggybacking in order for algorithms like [8] to be used unchanged.
Oliveira et al. [28] consider a crash-recovery setting with correct processes that may crash only finitely many times (and thus eventually stay up forever) and faulty processes that permanently crash or crash infinitely often. As in [8], the authors note that, in practice, correct processes only need to stay up for long enough periods (rather than forever), but this cannot be expressed in the asynchronous model. The authors take the consensus algorithm of [30], which uses stubborn links, and transform it to work in the crash-recovery setting by logging every step into stable storage and adding a fast-forward mechanism for skipping rounds. Hurfin et al. [22] describe an algorithm using a failure detector in the crash-recovery case. The notions of correct/faulty processes and of failure detectors are the same as in Oliveira et al. [28]. Their algorithm is, however, more efficient in its use of stable storage compared to [28]: there is only one write per round (of multiple data items), and the required network buffer capacity for each channel (connecting a pair of processes) is one. Compared to [28] and [22], our system does not regard processes that crash and recover infinitely often as faulty, and hence we allow such “unstable” processes to implement consensus.
Aguilera et al. [1] consider a crash-recovery system with lossy links. They show that previously proposed failure detectors for the crash-recovery setting have anomalous behaviors, even in synchronous systems, when considering unstable processes, i.e., processes that crash and recover infinitely often. The authors propose new failure detectors to mitigate this drawback. They also determine the necessary conditions regarding stable storage that allow consensus to be solved in the crash-recovery model, and provide two efficient consensus algorithms: one with, and one without, stable storage. Unlike [1], we do not exclude unstable processes from implementing consensus; thus, our model tolerates a wider variety of node behavior. Furthermore, our wrapper requires no modifications to existing crash-stop consensus algorithms, as it treats them as black boxes.
Modular Crash-Recovery Approaches: Similar to [1], Freiling et al. [18] investigate the solvability of consensus in the crash-recovery model under varying assumptions regarding the number of unstable and correct processes and what is persisted in stable storage. They reuse existing algorithms from the crash-stop model in a modular way (without changing them) or in a semi-modular way, with some modifications to the algorithm (as in the case of [8]). Similar to our work, they provide algorithms to emulate a crash-stop system on top of a crash-recovery system. Our work, however, always reuses algorithms in a fully modular way, and we define a wide class of algorithms for which such reuse is possible. Furthermore, as we model message losses, process crashes, and process recoveries probabilistically, our results also apply if processes are unstable, i.e., crash and recover infinitely often.
Randomized Consensus Algorithms: Besides the literature on deterministic consensus algorithms, existing works have also explored randomized algorithms that solve “consensus with probability 1”. These include, for example, techniques based on random coin flips [2, 5, 17] or probabilistic schedulers [7]. In systems with dynamic communication failures, multiple randomized algorithms [27, 26] addressed the consensus problem, requiring only that processes eventually decide. Moniz et al. [27] considered a system with correct processes and a bound on the number of faulty transmissions. In a wireless setting, where multiple processes share a communication channel, Moniz et al. [26] devise an algorithm that tolerates a bounded number of Byzantine processes and requires a bound on the number of omission faults affecting correct processes. In comparison, our work does not use randomization in the algorithm itself: we focus on using existing deterministic algorithms to solve consensus (with probability 1) in networks with probabilistic failure and recovery patterns of processes and links.
3 System Models
We start by defining the notation we use, and then introduce general concepts common to all of our models. Then, we define each of our models in turn.
Notation: Given a set S, we define S_⊥ to be the set S ∪ {⊥}, where ⊥ is a distinguished element not present in S. The set of finite sequences over a set S is denoted by S*. We also call sequences words, when customary. The empty sequence is denoted by ε. Given a non-empty sequence, hd denotes its first element, tl the remainder of the sequence, and, if the sequence is finite, last its last element. Given two sequences u and v, where u is finite, u · v denotes their concatenation. For a word w, |w| denotes the length of w. Letting w_i be the i-th letter of w, we say that u is a subword of w if there exists a strictly monotone function f such that u_i = w_{f(i)}, for i ≤ |u| if u is finite, and for all i if u is infinite. Analogously, w is a superword of u.
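The subword relation above admits a simple executable check for finite words. The following is an illustrative sketch (not from the paper); the function name is ours:

```python
def is_subword(u, w):
    """Return True if u embeds into w via a strictly increasing index map,
    i.e., u is a subword of w (finite words only)."""
    it = iter(w)
    # `c in it` advances the iterator until it finds c, so successive
    # matches occur at strictly increasing positions of w.
    return all(c in it for c in u)
```

For example, "ace" is a subword of "abcde", while "aec" is not, since no strictly monotone index map exists for it.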
We denote the space of partial functions between sets A and B by A ⇀ B. Note that a partial function from A to B can be viewed as a total function from A to B_⊥. Given any function f, its graph, written graph(f), is the relation {(a, f(a)) | a ∈ dom(f)}. The range of f, written ran(f), is the set {f(a) | a ∈ dom(f)}.
Common concepts: We consider a fixed finite set of processes Π, and a fixed countable set of values, denoted V. For each algorithm there is an algorithm-specific countable set of local states S_p, for each process p. For simplicity, we restrict ourselves to algorithms where all the S_p coincide. Note that this does not exclude algorithms that take decisions based on identifiers. We define the global state space S as the product of the local state spaces, indexed by Π. Given an s ∈ S, we define s_p as the projection of s to its p-th component.
A property over an alphabet A is a set of infinite words over A. We use standard definitions of liveness and safety properties [4]. A property P is a safety property if, for every infinite word w ∉ P, there exists a finite prefix u of w such that the concatenation u · w′ ∉ P for all infinite words w′. Intuitively, the prefix u is “bad” and not recoverable from. A property P is a liveness property if for any finite word u there exists an infinite word w such that u · w ∈ P. Intuitively, “good” things can always happen later.
In this paper, we are interested in preserving properties over the alphabet S between the crash-stop and crash-recovery versions of an algorithm. In particular, we assume that the local states are records with two distinguished fields: input of type V and decision of type V_⊥. Intuitively, a decision value of ⊥ indicates that the process has not decided yet. For an infinite word w over the alphabet S, let w_i(p) denote the local state of the process p at the i-th letter of the word. Let us state the standard safety properties of consensus in our notation.
Validity. Decided values must come from the set of input values. Formally, validity describes the set of words w such that, for all processes p and indices i, w_i(p).decision ≠ ⊥ implies w_i(p).decision = w_0(q).input for some process q.

Integrity. Processes do not change their decisions. Formally, integrity describes the set of words w such that, for all processes p and indices i ≤ j, w_i(p).decision ≠ ⊥ implies w_j(p).decision = w_i(p).decision.

(Uniform) Agreement. No two processes ever make different non-⊥ decisions. Formally, agreement describes the set of words w such that, for all processes p, q and indices i, j, if w_i(p).decision ≠ ⊥ and w_j(q).decision ≠ ⊥, then w_i(p).decision = w_j(q).decision.
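On finite trace prefixes, the three safety properties can be checked mechanically. The following is a minimal sketch (not from the paper), representing a global state as a dict mapping each process to its decision field, with None playing the role of ⊥:

```python
def valid(trace, inputs):
    # Validity: every non-None decision is one of the input values.
    return all(d in inputs
               for state in trace for d in state.values() if d is not None)

def integrity(trace):
    # Integrity: once a process decides, its decision never changes.
    for p in trace[0]:
        decided = None
        for state in trace:
            d = state[p]
            if decided is not None and d != decided:
                return False
            if d is not None:
                decided = d
    return True

def agreement(trace):
    # (Uniform) Agreement: no two non-None decisions ever differ.
    decisions = {d for state in trace for d in state.values() if d is not None}
    return len(decisions) <= 1
```

Note that these checkers can only detect violations on a finite prefix; this is exactly the defining feature of safety properties, whose violations are witnessed by finite "bad" prefixes.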
To simplify our preservation results for safety properties, our models store information about process failures separately from the global states. As a consequence, the standard crash-stop termination property cannot be expressed as a property over S: it is conditioned on a process not failing. However, we do not directly use the crash-stop notion of termination, and we omit its definition here. Instead, we will prove the following property for the algorithms in our probabilistic crash-recovery model:
Probabilistic crash-recovery termination. With probability 1, all processes eventually decide.
3.1 The crash-stop model
Our definition of the crash-stop model is standard and closely follows [8]. We assume an asynchronous environment, with processes taking steps in an interleaved fashion. Processes communicate using reliable links, and can query failure detectors.
Failure detectors: A failure pattern is an infinite word over the alphabet 2^Π. Intuitively, each letter is the set of failed processes in a transition step of a run of a transition system. A failure detector D with range R is a function from failure patterns to properties over the alphabet R.¹ A failure detector D is unreliable if D(F) is a liveness property for every failure pattern F. Intuitively, a detector constrains how the failure detector outputs (the R values) must depend on the failure pattern of a run, and unreliable detectors can produce arbitrary outputs in the beginning. We write D_R for the set of all detectors with range R.

¹ This definition does not distinguish which process received the output, which is sufficient for the detectors considered in this paper. The definition can easily be extended to other failure detectors.
Algorithms and algorithm steps: The type of crash-stop steps over a message space M and a failure detector range R is defined as a pair of functions of the types S_loc × M_⊥ × R → S_loc (the state-transition function) and S_loc → (Π ⇀ M) (the send function), where S_loc denotes the local state space. Intuitively, given zero or one messages received from some other process and an output of the failure detector, a step maps the current process state to a new state, and maps the new state to a set of messages to be sent, with zero or one messages sent to each process.
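As a concrete toy instance of this two-function decomposition (an illustrative sketch; the integer states, payload messages, and the function names are our own choices, not the paper's): the state-transition function absorbs the received payload, and the send function offers the new state to every process.

```python
from typing import Optional, FrozenSet

# State-transition function: (state, optional message, detector output) -> state
def next_state(s: int, m: Optional[int], d: FrozenSet[str]) -> int:
    # This toy example ignores the detector output and adds the payload.
    return s + (m if m is not None else 0)

# Send function: new state -> at most one message per process
def to_send(s: int, procs) -> dict:
    return {q: s for q in procs}
```

A real algorithm would branch on the detector output d (e.g., to skip a suspected coordinator), but the shape of the two functions stays the same.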
A crash-stop algorithm over S_loc, M and R is a tuple (Init, step, D, f) where:

Init is the finite set of initial states,

step is the step function,

D is a failure detector, and

f is the resilience condition, i.e., the number of failures tolerated by the algorithm (recall that we consider a fixed Π).
We refer to the components of an algorithm A by Init_A, step_A, D_A and f_A. As Init_A is finite, it will admit a uniform distribution in our probabilistic model in Section 3.3.

Configurations: As noted earlier, we focus on preserving properties over S between the crash-stop and crash-recovery models. However, S contains insufficient information to model the algorithm’s crash-stop executions (runs). In particular, to account for asynchronous message delivery and process failures, we must extend states to configurations. A crash-stop configuration is a triple (s, flight, failed) where:

s ∈ S is the (global) state,

flight is the set of in-flight messages; an element of flight records a message together with its sender and its recipient. For simplicity, flight is a set, i.e., we assume that each message is sent at most once between any pair of processes during the entire execution of the algorithm. This suffices for round-based algorithms that tag messages with their round numbers, and exchange messages once per round.

failed ⊆ Π is the set of failed processes.
As with algorithms, we refer to the components of a configuration c by s_c, flight_c and failed_c.
Step labels and transitions: While the algorithm steps are deterministic, the asynchronous transition system is not: any (non-failed) process can take a step at any point in time, with different possible received messages, and different failure detector outputs. Accessing this nondeterminism information is useful in proofs, so we extract it as follows. A crash-stop step label is a quadruple (p, m, F, d), where:

p is the process taking the step,

m ∈ M_⊥ is the message p receives in the step (⊥ modeling a missing message),

F is the set of processes failed at the end of the step,

d is p’s output of the failure detector.
A crash-stop step of the algorithm is a triple (c, ℓ, c′), where c and c′ are configurations and ℓ = (p, m, F, d) is a label. Crash-stop steps must satisfy the following properties:

p ∉ failed_c, i.e., p is not failed at the start of the step.

With s and s′ denoting process p’s state in c and c′ respectively, s′ is the result of applying the state-transition function to s, m and d, and the states of all processes other than p are unchanged. That is, p takes a step according to the label and the algorithm’s rules, and the other processes do not move.

If m ≠ ⊥, then m ∈ flight_c, i.e., if a message was received, then it was in flight.

flight_{c′} is obtained from flight_c by removing the received message (if any) and adding the messages produced by the send function. That is, the received message is removed from the set of in-flight messages, and the produced messages are added.

failed_c ⊆ F. That is, failed processes do not recover in the crash-stop setting.

failed_{c′} = F.
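The step conditions above can be captured as an executable check. This is an illustrative sketch only: configurations are modeled as triples of (per-process states, in-flight messages, failed set), labels as quadruples (p, m, F, d), and next_state/to_send stand in for the algorithm's two step components; all names are ours.

```python
def is_step(cfg, label, cfg2, next_state, to_send, procs):
    (st, flight, failed) = cfg
    (st2, flight2, failed2) = cfg2
    p, m, f_new, d = label
    if p in failed:                            # p not failed at the start
        return False
    if m is not None and m not in flight:      # received message was in flight
        return False
    if st2[p] != next_state(st[p], m, d):      # p follows the algorithm
        return False
    if any(st2[q] != st[q] for q in procs if q != p):  # others do not move
        return False
    sent = {(p, q, payload)                    # messages produced by p
            for q, payload in to_send(st2[p], procs).items()
            if payload is not None}
    consumed = {m} if m is not None else set()
    if flight2 != (flight - consumed) | sent:  # message bookkeeping
        return False
    if not failed <= f_new:                    # no recovery in crash-stop
        return False
    return failed2 == f_new
```

In-flight messages are represented here as (sender, receiver, payload) triples, matching the convention that each message is sent at most once per channel.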
Algorithm runs: A finite, respectively infinite, crash-stop run of A is a finite, respectively infinite, alternating sequence of configurations and labels, ending in a configuration if finite, such that:

The state of the first configuration belongs to Init_A, i.e., the initial state is allowed by the algorithm.

Every consecutive triple of a configuration, a label, and the next configuration in the sequence is a step of A.

Every configuration in the sequence has at most f_A failed processes (the resilience condition is satisfied).

There is a failure pattern F such that the sequence of failed-process sets in the labels is a prefix of F, and the sequence of failure detector outputs is a prefix of a word in D_A(F). That is, the output of the failure detector satisfies the condition of the failure detector.
Such a run has reliable links if every in-flight message is eventually delivered, unless its sender or its receiver is faulty. The crash-stop system of the algorithm A is the set of all crash-stop runs of A. The crash-stop system with reliable links of A is the set of all crash-stop runs with reliable links.
As mentioned before, we are interested in properties over sequences of global states. In this sense, runs contain too much information (e.g., in-flight messages). Thus, given a run r, we define its state trace trace(r), obtained by removing the labels and projecting the configurations onto just the states. We introduce the notion of a state property: a property over the alphabet S, i.e., a set of infinite sequences of (global) states. The crash-stop system (with or without reliable links) satisfies a state property P if trace(r) ∈ P for every run r of the system. We later show that our crash-recovery wrappers for crash-stop protocols preserve important state properties of crash-stop algorithms. Lastly, we note down some simple properties of crash-stop runs.
Lemma (Reliable links irrelevant for prefixes). Let r be a finite crash-stop run of A. Then r can be extended to an infinite crash-stop run of A that has reliable links, by extending the run to include infinite message retransmissions for all sent messages.
Summary of time and failure assumptions
Time. Processes are asynchronous and have no notion of time. Links are asynchronous.
Failures. Processes can fail by halting forever, while links interconnecting them do not fail.
3.2 The lossy synchronous crash-recovery model
We next define our first crash-recovery model. Formally, this is a lossy synchronous crash-recovery model, with nondeterministic, but not probabilistic, losses, crashes, and recoveries. We use it to prove the preservation of safety properties without taking probabilities into account, since probabilities play no role in such arguments. In this model, we do not distinguish between the volatile and persistent memory of a process. Instead, we assume that all memory is persistent. This can be emulated in practice by persisting all volatile memory before taking any action with side effects (such as sending network messages). Finally, while the model is formally synchronous, in that all processes take steps simultaneously, it also captures processing delays, as a slow process behaves like a process that crashes and later recovers.
Algorithms and algorithm steps: A crash-recovery step over a message space M is defined as a pair of functions of the types S_loc × (Π ⇀ M) → S_loc (the state-transition function) and S_loc → (Π → M) (the send function). In other words, a step determines the new state based on the current state and the set of received messages. Given the new state, a process sends a message to every process (including itself). Compared to the asynchronous setting (Section 3.1), in this model:

A process can receive multiple messages simultaneously (rather than receiving at most one message in a step).

Every process sends a message to every other process at each step. We use this in later sections to send heartbeat messages, if there is nothing else to exchange. We do not require any guarantees on the delivery of the sent messages in this model.

No failure detector oracle is specified. The synchrony assumption of this model inherently provides spurious failure detection: each process can suspect all peers it did not hear from in the last message exchange. This is in fact exactly what we will use to provide failure detector outputs to the “wrapped” crash-stop algorithms run in this setting.²

² Similar to Gafni’s round-by-round fault detectors [19]; in our case, the detectors are “step-by-step”.
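The three points above can be sketched as one synchronous round. This is an illustrative sketch only (the function and parameter names are ours): every up process sends to everyone, a lossy `deliver` set decides which messages arrive, and each receiver suspects exactly the peers it did not hear from.

```python
def sync_round(states, failed, step, deliver):
    """One lossy synchronous round. `deliver` is the set of (sender,
    receiver) pairs whose messages get through this step; `step(state,
    inbox, suspects)` stands in for the algorithm's step function, fed
    the built-in 'suspect whoever you did not hear from' detection."""
    procs = list(states)
    # Every up process addresses a message (here: its state) to everyone.
    outbox = {(q, p): states[q]
              for q in procs if q not in failed for p in procs}
    new_states = dict(states)
    for p in procs:
        if p in failed:
            continue                           # crashed processes do not move
        inbox = {q: outbox[(q, p)] for q in procs
                 if (q, p) in outbox and (q, p) in deliver}
        suspects = set(procs) - set(inbox)     # unheard-from peers suspected
        new_states[p] = step(states[p], inbox, suspects)
    return new_states
```

A message can be missing for two indistinguishable reasons, a crashed sender or a lossy link, which is why this built-in detection is inherently spurious.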
A crash-recovery algorithm over S_loc and M is a pair (Init, step) where:

Init is a finite set of initial states, and

step is the step function.
Configurations: As in the crash-stop case, we require more than just the global states to model algorithm executions; we hence introduce configurations. These, however, differ from those of the crash-stop setting. As communication is synchronous, we need not store the in-flight messages; they are either delivered by the end of a step, or they are gone. Furthermore, as processes take steps synchronously, we can introduce a global step number.³

³ We use global step numbers later in the probabilistic model, to assign failure probabilities for processes and links (e.g., a probability of a message from p to q getting through, if sent in the t-th step).
A crash-recovery configuration is a tuple (t, s, failed) where

t is the step number,

s ∈ S is the (global) state,

failed ⊆ Π is the set of failed processes.
We denote the set of all crash-recovery configurations by C_cr. Note that this set is countable. This will allow us to impose a Markov chain structure on the system in the later model.
Step labels and transitions: As in the crash-stop setting, we use labels to capture all sources of nondeterminism in a step. We will use these labels to assign probabilities to different state transitions in the probabilistic model of the next section.
A crash-recovery step label is a pair (rcvd, F), where:

rcvd denotes the messages received in the step; rcvd(q, p) is the message received by p on the channel from q to p. The function rcvd is partial: as we assume that q always attempts to send a message to p, if rcvd(q, p) is undefined, then either the message on this channel was lost in the step, or the sender has failed.

F ⊆ Π is the set of processes that are failed at the end of the step.
The crash-recovery steps (or transitions) of A are the triples (c, ℓ, c′), where, letting ℓ = (rcvd, F) and letting s_p and s′_p denote the state of the process p in c and c′ respectively, we have:

if p ∉ F, then s′_p is the result of applying the step function to s_p and the messages delivered to p under rcvd; that is, processes that are up handle their messages. We assume that the state change is atomic; this can be implemented, since we assume that all process memory is persistent.

if p ∈ F, then s′_p = s_p. Failed processes do not change their state.

if rcvd(q, p) is defined, then q ∉ F, and rcvd(q, p) is exactly the message that q's send function addressed to p. That is, only the sent step messages are received (no message corruption), and messages from failed senders are not received.

the step number is incremented from c to c′.
Algorithm runs: A finite, resp. infinite, crash-recovery run of A is a finite, resp. infinite, alternating sequence of (crash-recovery) configurations and labels, ending in a configuration if finite, such that:

the state of the first configuration is allowed by the algorithm, i.e., drawn from the initial states of A, and

every consecutive triple of a configuration, a label, and the next configuration is a valid crash-recovery transition of A.

The crash-recovery system of an algorithm A is the set of all crash-recovery runs of A.
Summary of time and failure assumptions
Time. Processes are synchronous and operate in a time-triggered fashion. Links are synchronous (all delivered messages respect a timing upper bound on delivery).
Failure. Processes can fail and recover infinitely often. In every time step, a link can be either crashed or correct. A crashed link drops all sent messages (if any).
3.3 The probabilistic crash-recovery model
We now extend the lossy synchronous crash-recovery model to a probabilistic model, where both the successful delivery of messages and failures follow a distribution that can vary with time. A probabilistic network is a function net of type Π × Π × ℕ → [0, 1] whose values are bounded away from 0. Intuitively, net(p, q, t) is the delivery probability for a message sent from p to q at time (step number) t. A probabilistic failure pattern is a function fp of type Π × ℕ → [0, 1] whose values are bounded away from both 0 and 1. Intuitively, fp(p, t) gives the probability of p being up at time t.⁴

⁴ Considering infinite time, the upper and lower bounds on fp ensure that, with probability 1, there is a time when process p is up.
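The randomness of one step can be sampled directly from these two functions. The following is an illustrative sketch (the names net, fp, and sample_step are assumptions, not the paper's notation), where net(q, p, t) gives the delivery probability on the channel from q to p and fp(p, t) the probability of p being up:

```python
import random

def sample_step(t, procs, net, fp, rng):
    """Sample one step's random choices: which processes are up at time t
    under the probabilistic failure pattern fp, and which channels
    deliver under the probabilistic network net."""
    up = {p for p in procs if rng.random() < fp(p, t)}
    delivered = {(q, p) for q in procs for p in procs
                 if rng.random() < net(q, p, t)}
    return up, delivered
```

Feeding such samples into the lossy synchronous model, step by step, is exactly what turns its nondeterminism into the Markov chain defined next.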
Given a crash-recovery algorithm A, a probabilistic network net, and a probabilistic failure pattern fp, the probabilistic crash-recovery system is the Markov chain [6] with:

The set of states C_cr, i.e., the crash-recovery configuration set.

The transition probability from a configuration c to a configuration c′, defined as the normalized sum, over all labels ℓ such that (c, ℓ, c′) is a transition of the lossy synchronous model, of a weight assigned to ℓ: the product of the probabilities, under net and fp, of exactly the message deliveries, message losses, process failures, and process recoveries that ℓ records. (In the weight's definition, [·] maps the Boolean values true and false to 1 and 0, respectively.) Intuitively, a transition from c to c′ is only possible if it is possible in the lossy synchronous crash-recovery model, and its probability is calculated by summing over all labels that lead from c to c′, giving each such label a weight. Note that the only nondeterminism in the transitions of the lossy model comes exactly from the behavior we deem probabilistic: messages being dropped by the network, and process failures. It is easy to see that the normalization factor is well defined, as for a fixed configuration c, the weight sum is nonzero for only finitely many configurations c′.

The distribution over the initial states, defined to be the uniform distribution over the initial configurations of A. Note that normalization is possible, since we assumed that, after fixing Π, each algorithm comes with a finite set of initial states.

Summary of time and failure assumptions
Time. Processes are synchronous and operate in a time-triggered fashion. Links are synchronous (all delivered messages respect a timing upper bound on delivery).
Failure. Processes and links can fail and recover infinitely often. At the beginning of any time step, a crashed process/link can recover with positive probability, and a correct process/link can fail with positive probability.
4 Wrapper for Crash-Stop Algorithms
We now define the transformation of a crash-stop algorithm A into a crash-recovery algorithm W(A). Intuitively, we do this by (also illustrated in Figure 1):


Generating a synchronous crash-recovery step using a series of crash-stop steps. Each step in the series handles one individual received message, allowing us to iteratively handle multiple simultaneously incoming messages and bridge the synchrony mismatch between the crash-stop and crash-recovery models.

Using round-by-round failure detectors to produce the failure detector outputs to be fed to the crash-stop algorithm. These outputs are sets of suspected processes.⁵

⁵ We could instead produce outputs that never suspect anyone, since no process crashes forever in our probabilistic model. However, for a weaker model considered later (where processes are allowed to crash forever), we need failure detectors that suspect processes.

Providing reliable links, as required by the crash-stop algorithm. During each crash-recovery step, we buffer all outgoing messages of a process, and send them repeatedly in the subsequent crash-recovery steps, until an acknowledgment is received.
We first define the message and state spaces of the crash-recovery version of a given crash-stop algorithm as follows:

In the crash-recovery version, we send a pair of messages to each process in each step: the actual payload message (from M), replaced by a special heartbeat message when no payload needs to be sent; and an acknowledgment message, confirming the receipt of the last message on the channel in the opposite direction.

The local state of a process p has three components: (1) the first stores the state of p in the target crash-stop algorithm; (2) buff represents p’s outgoing message buffers, with one buffer for each process (including one for p itself); and (3) the third records the last message that p received from each process. The buffers are LIFO, a choice which proves crucial for our termination proof (Section 6).
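The three components, together with the heartbeat and acknowledgment mechanics, can be sketched as follows. This is an illustrative sketch only; the class and method names, and the 'HEARTBEAT' sentinel, are our own:

```python
class WrapperState:
    """Per-process wrapper state: the wrapped crash-stop state, one LIFO
    outgoing buffer per peer, and the last message received on each
    incoming channel (used for acknowledgments)."""
    def __init__(self, cs_state, procs):
        self.cs_state = cs_state
        self.buff = {q: [] for q in procs}        # LIFO: push/pop at the end
        self.last_rcvd = {q: None for q in procs}

    def enqueue(self, q, msg):
        self.buff[q].append(msg)

    def next_out(self, q):
        # Message to (re)send to q: top of the LIFO buffer, else a heartbeat.
        return self.buff[q][-1] if self.buff[q] else 'HEARTBEAT'

    def on_ack(self, q, acked):
        # Drop the buffered message for q once q acknowledges receiving it.
        if self.buff[q] and self.buff[q][-1] == acked:
            self.buff[q].pop()
```

Because the buffer is LIFO, retransmissions always carry the most recently produced unacknowledged message, which is the behavior the termination proof relies on.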
Next, given a crash-recovery state s, a process p, and the messages received by p in the given round, we define p’s local step unfolding for s and these messages: the sequence of intermediate steps that p takes. Said differently, the unfolding is a sequence of crash-recovery states and crash-stop labels, where the i-th intermediate state represents the state of p after processing the message of the i-th process. In a crash-recovery run, p does not actually transition through these intermediate states. They are listed separately to intuitively show how the next “real” state to which p will transition is computed. The unfolding also allows us to relate traces of the crash-stop algorithm and its crash-recovery version more easily in our proofs, as we produce a crash-stop run from a crash-recovery run when proving properties of the wrapper. The content of p’s buffers changes as we progress through the states of the unfolding, as the wrapper routes the messages to the crash-stop algorithm and receives new ones from it. The failure detector output (recorded in the crash-stop labels) remains constant through the unfolding: all processes from whom no message was received in the crash-recovery step are suspected. Finally, the set of failed processes in each label is defined to be empty: we emulate the process recovery that is possible in the crash-recovery model by crash-stop runs in which no processes fail.
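The core of the unfolding can be sketched in a few lines. This is an illustrative sketch (the function name and the choice of a sorted processing order are ours): the simultaneously received messages are fed to the crash-stop step function one at a time, while the suspect set stays fixed for the whole unfolding.

```python
def unfold(cs_state, received, cs_step, procs):
    """Local step unfolding: apply one crash-stop step per received
    message. `received` maps senders to their messages; peers we did not
    hear from are suspected, and the suspect set is constant throughout,
    as in the wrapper's definition."""
    suspects = frozenset(procs) - set(received)
    states = [cs_state]
    for q in sorted(received):        # one crash-stop step per message
        cs_state = cs_step(cs_state, received[q], suspects)
        states.append(cs_state)
    return states
```

The returned list corresponds to the intermediate states of the unfolding; only its last element becomes the next "real" crash-recovery state.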
Finally, given a crashstop algorithm with an unreliable failure detector and with the perprocess state space , we define its crashrecovery version where:

the initial states of constitute the set of crashrecovery configurations such that there exists a crashstop configuration satisfying the following: (i) the initial states of and correspond to each other and in all buffers are empty, (ii) no messages are acknowledged, and (iii) the failed processes in and are the same.

the next state of a process is computed by unfolding, based on the messages received in this round.

the message that sends to a process pairs the first element of ’s (LIFO) buffer for with the acknowledgment for the last message that received from .

the execution is shortcircuited as soon as a process decides. This is achieved by broadcasting a message to all other processes, announcing that it has decided. When a process receives such a message, it immediately decides and shortcircuits its execution.
Shortcircuiting behavior is a common pattern for consensus algorithms [8, 20]. It can be applied in a blackbox way and it is sound for any crashstop consensus algorithm.
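The shortcircuiting pattern can be sketched as follows; all names (`DECIDED`, `broadcast_decision`, `handle_message`, the dictionary-based process state) are illustrative assumptions, not the paper's notation.

```python
DECIDED = "DECIDED"  # illustrative tag for the decision announcement

def broadcast_decision(p_state, value, n):
    """On first deciding, record the decision, halt the underlying
    algorithm, and announce the decision to every other process."""
    p_state["decision"] = value
    p_state["halted"] = True
    return [(q, (DECIDED, value)) for q in range(n) if q != p_state["id"]]

def handle_message(p_state, msg):
    """Shortcircuit on receipt: a process that sees a decision
    announcement adopts the value immediately and stops executing
    further steps of the wrapped algorithm."""
    if isinstance(msg, tuple) and msg and msg[0] == DECIDED:
        p_state["decision"] = msg[1]
        p_state["halted"] = True
        return True   # handled here; execution is shortcircuited
    return False      # ordinary message; pass to the wrapped algorithm
```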
A more formal description of the wrapper can be found in Appendix A. Given a run of the crashrecovery version of an algorithm , we define its state trace, , as the sequence of configurations obtained by removing the labels, projecting each configuration onto , and projecting each local state onto . We overload the function symbol to work on both crashstop and crashrecovery runs. Note that both crashstop and crashrecovery state traces are sequences of states from the same state space .
5 Preservation Results
As our first main result, we show that crashrecovery versions of algorithms produced by our wrapper preserve a wide class of safety properties. The class includes the safety properties of consensus: validity, integrity and agreement (Section 3). In other words, if a trace of a crashrecovery version of an algorithm violates a property, then some crashstop trace of the same algorithm also violates that property. We show this in the nonprobabilistic crashrecovery model. However, the result also translates to the probabilistic model, since all allowed traces of the probabilistic model are also traces of the nonprobabilistic one.
Preserving all safety properties for all algorithms and failure detectors would be too strong a requirement, for two reasons. First, as our crashrecovery model assumes nothing about link or process reliability, in finite runs we can give no guarantees about the accuracy of the simulated failure detectors. Second, the crashrecovery model is synchronous, meaning that different processes take steps simultaneously. This is impossible in the crashstop model, which is asynchronous. Thus, the simple safety property defined by “the local state of at most one process changes between two successive states in a trace” holds in the crashstop model, but not in the crashrecovery model (equivalently, we can find a crashrecovery trace, but not a crashstop trace, that violates the property).
We work around the first problem by assuming that the crashstop algorithms use unreliable failure detectors. For the second problem, we restrict the class of safety properties that we wish to preserve as follows. Consider a property . Let and be runs with the same initial states (i.e., ) such that is a subword of (recall the definition of subwords from earlier). Then belongs to the class of properties not repairable through detours if for all such and . Intuitively, this means that the sequence of states represented by inherently violates ; adding “detours” by means of additional intermediate states (forming ) does not help satisfy .
The property is an example of a safety property that is repairable through detours: we can take any word that violates and extend it to a word that does not violate . However, we can easily show that the following safety properties are not repairable through detours:

The safety properties of consensus. E.g., consider the validity property: given a word such that is a noninitial and a non value, adding further states between the initial state and does not change the fact that is neither initial nor .

State invariant properties, defined by a set of “good” states, such that for a trace , only if . Equivalently, these properties rule out traces which reach the “bad” states in the complement of . Intuitively, if a bad state is reached, we cannot fix it by adding more states before or after the bad state.
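The subword relation and the state-invariant example can be made concrete with a short sketch; the function names are illustrative assumptions. `is_subword` tests whether one trace can be obtained from another by deleting states, and the invariant check illustrates why invariants are detour-irreparable: any superword of a trace containing a bad state still contains that bad state.

```python
def is_subword(u, w):
    """True iff u is a (not necessarily contiguous) subword of w."""
    it = iter(w)
    # For each character of u, scan forward in w until it is matched;
    # the shared iterator enforces left-to-right order.
    return all(any(x == y for y in it) for x in u)

def satisfies_invariant(trace, good):
    """State invariant: every state of the trace lies in the good set."""
    return all(s in good for s in trace)
```

For example, if `u = [1, 9]` violates the invariant `good = {1, 2, 3}` because of the bad state `9`, then every `w` with `is_subword(u, w)` also contains `9` and therefore also violates it: adding detours cannot repair the property.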
We establish the following lemma, which is essential to the ensuing preservation theorem.
[Crashrecovery traces have crashstop superwords] Let be a crashstop algorithm with an unreliable failure detector. Let be a finite run of the crashrecovery wrapper . Then, there exists a finite run of such that is a subword of , , , no processes fail in , and inflight messages of match the messages in the buffers of .
[Preservation of detourirreparable safety properties] Let be a crashstop algorithm with an unreliable failure detector, and let be a safety state property that is not repairable through detours. If satisfies , then so does .
Proof.
We prove the theorem’s statement by proving its contrapositive. Assume violates . By the definition of safety properties [4], there exists a finite run of such that no continuation of is in . By Lemma 5, there exists a run of such that is a subword of . By Lemma 3.1, can be extended to an infinite run of . By the choice of , we have that . As is a subword of , and since is not repairable through detours, also . Thus, also violates . ∎
Proving that the safety properties of consensus are detourirreparable means that these properties are preserved by our wrapper. Since state invariants are also detourirreparable, they too are preserved. This makes our wrapper potentially useful for reusing other kinds of crashstop algorithms in a crashrecovery setting, not just consensus algorithms.
If satisfies the safety properties of consensus, then so does . If satisfies a state invariant, then so does .
6 Probabilistic Termination
Termination of consensus algorithms depends on stable periods, during which communication is reliable and no crashes or recoveries occur. In this section, we first state a general result about socalled selective stable periods for our probabilistic crashrecovery model. We then define a generic class of crashstop consensus algorithms, which we call bounded algorithms. We prove that termination for these algorithms is guaranteed in our probabilistic model when they are run under the wrapper. Namely, we prove that, with probability 1, all processes eventually decide. We also show that the class of bounded algorithms covers a wide spectrum of existing algorithms, including the celebrated ChandraToueg algorithm [8] and the instantiation of the indulgent framework of [20] that uses failure detectors.
6.1 Selective Stable Periods
Similar to [11], our proofs will rely on forming stronglyconnected communication components between particular sets of processes. However, we will require their existence only for bounded periods of time, which we call selective stable periods.
[Selective stable period] Fix a crashrecovery algorithm . A selectivestable period of of length for a crashrecovery configuration and a set of processes , written , is the set of all sequences of crashrecovery configurations such that we have and there exist such that is a step of and is defined .
Such selective stable periods must occur in runs of a crashrecovery algorithm .
[Selective stable periods are mandatory] Fix a crashrecovery algorithm , a positive integer and a selection function , mapping crashrecovery configurations to process sets. Then, the set of crashrecovery runs
The next section shows how our wrapper exploits such periods to construct crashrecovery algorithms from existing crashstop ones in a blackbox manner. For future work, it might also be interesting to devise consensus algorithms directly on top of this property.
6.2 Bounded Algorithms
We next define the class of bounded crashstop algorithms for which our wrapper guarantees termination in the crashrecovery setting. This class comprises algorithms which operate in rounds, with an upper bound on the number of messages exchanged per round as well as on the number of rounds that correct processes can be apart. More formally, they are defined as follows. [Bounded algorithms] A crashstop consensus algorithm (using reliable links and a failure detector [8]) is said to be bounded if it satisfies all of the properties below:

Communicationclosed rounds: processes operate in rounds. The rounds must be communicationclosed [14]: only the messages from the current round are considered.

Externally triggered state changes: After the first step of every round, the processes change state only upon receipt of round messages, or on a change of the failure detector output.

Bounded round messages: There exists a bound such that, in any round, a process sends at most messages to any other process.

Bounded round gap: Let , that is, the number of correct processes according to the algorithm’s resilience criterion. Then, there exists a bound , such that the fastest processes are at most rounds apart.

Bounded termination: There exists a bound such that, for any reachable configuration where any fastest processes in are correct, the other processes are faulty, and the failure detector output at these processes is perfect after , all of these processes decide before any of them reaches round .
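The round discipline behind properties B1 and B2 can be sketched as follows; the representation of a process as a dictionary and the names `current_round_messages` and `try_advance` are illustrative assumptions, not the paper's formalism.

```python
def current_round_messages(p, inbox):
    """Communicationclosed rounds (B1): only messages tagged with p's
    current round number are considered; all others are discarded."""
    return [(sender, payload) for (sender, rnd, payload) in inbox
            if rnd == p["round"]]

def try_advance(p, inbox, quorum):
    """Externally triggered state changes (B2): after the first step of a
    round, p changes state only on receipt of current-round messages,
    advancing its round once a quorum of them has arrived."""
    for sender, payload in current_round_messages(p, inbox):
        p["received"][sender] = payload
    if len(p["received"]) >= quorum:
        p["round"] += 1
        p["received"] = {}
        return True
    return False
```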
We can check that an algorithm satisfies (i) property B4 by checking under what conditions a process increments its round number, and (ii) property B5 by observing the algorithm’s termination under perfect failure detection given a quorum of correct processes. Section 6.3 shows an example of how to check that properties B4 and B5 hold. Using this definition, we can prove that bounded algorithms terminate in selective stable periods of sufficient length.
[Bounded selective stable period termination] Let be a bounded algorithm and its wrapped crashrecovery version. Let be a reachable crashrecovery configuration of the , and let be some set of fastest processes in . Then, there exists a bound , such that, for any selective stable period of length for and , all processes in decide in . Moreover, the bound is independent of the configuration .
Proof.
We partition the processes from into the set (initially ) and (initially ). The processes in will advance their rounds further, and the processes in will not advance, but will have already decided. Let be some slowest process from in . We first claim that advances or decides in any selective stable period for and of length at most , defined as . Denote the period configurations by . Then, using Lemma 5, we obtain a crashstop configuration from that satisfies the conditions of bounded termination, with the processes from being correct, and the others faulty. We consider two cases.
First, if advances in the crashstop model after receiving all the inflight round messages in from other processes in , then it also advances in the crashrecovery model after receiving these messages. Moreover, our wrapper delivers all such messages within steps, since (1) it uses LIFO buffers, and (2) by Lemma 5, the same bounds also apply to .
Second, if does not advance in the crashstop model, then, since the failure detector output remains stable, and since no further round messages will be delivered to in the crashstop model, the requirement ii ensures that will not advance further in the crashstop setting. Moreover, requirements v and ii ensure that must decide after receiving all of its round messages; we move to the set .
We have thus established that the slowest process from can move to either or advance its round after steps. Next, we claim that we can repeat this procedure by picking the slowest member of again. This is because the procedure ensures that the processes in always have round numbers lower than the processes in . Thus, due to i and ii, the processes in cannot rely on those from for changing their state.
Lastly, we note that this procedure needs to be repeated at most times before all processes move to the round , by which point v guarantees that all processes terminate. Thus gives us the required bound . ∎
The main result of this paper shows that the wrapper guarantees all consensus properties for wrapped bounded algorithms, including termination.
[CR consensus preservation] If a bounded algorithm solves consensus in the crashstop setting, then also solves consensus in the probabilistic crashrecovery setting.
Proof.
By Corollary 5, we conclude that satisfies the safety properties of consensus in the crashrecovery setting. For (probabilistic) termination, the result follows from Theorem 6.2 and Lemma 6.1, using as the selection function for Lemma 6.1 any function that selects (i) some fastest processes in a configuration if no process has decided yet, and (ii) all processes if some process has decided. The latter allows us to propagate the decision to all processes, due to shortcircuiting in . ∎
6.3 Examples of Bounded Algorithms
We next give two prominent examples of bounded algorithms: the ChandraToueg (CT) algorithm [8] and the instantiation of the indulgent framework of [20] that uses failure detectors. For these algorithms, rounds are composite, consisting of the combination of what the authors refer to as rounds and phases. Checking that the algorithms satisfy conditions i–iii is then straightforward. Condition iv holds for the CT algorithm with : take the fastest process in a crashstop configuration; if its CT round number is or less, the claim is immediate. Otherwise, must have previously moved out of phase 2 of the last round in which it was the coordinator, which implies that at least processes have also already executed . Since CT uses the rotating coordinator paradigm, ; as each round consists of phases, . For the algorithm from [20], processes only advance to the next round (which consists of two phases) when they receive messages from other processes. Thus, . Finally, proving requirement v is similar to, but simpler than, the original termination proofs for the algorithms, since it only requires termination under conditions which include perfect failure detector output. For space reasons, we do not provide the full proofs here, but we note that for both algorithms, where is the number of phases per algorithm round. Intuitively, within this many rounds the execution hits a round where a correct process is the coordinator; since we assume perfect failure detection for this period, no process will suspect this coordinator, and thus no process will move out of this round without deciding.

The wrapped versions of ChandraToueg’s algorithm [8] and the instantiation of the indulgent framework of [20] using failure detectors solve consensus in the probabilistic crashrecovery model.
7 Concluding Remarks
This paper introduced new system models that closely capture the messy reality of distributed systems. Unlike the usual distributed computing models, we account for failure and recovery patterns of processes and links in a probabilistic and temporary, rather than a deterministic and perpetual manner. Our models allow an unbounded number of processes and communication links to probabilistically fail and recover, potentially for an infinite number of times. We showed how and under what conditions we can reuse existing crashstop distributed algorithms in our crashrecovery systems. We presented a wrapper that allows crashstop algorithms to be deployed unchanged in our crashrecovery models. The wrapper preserves the correctness of a wide class of consensus algorithms.
Our work opens several new directions for future investigation. First, we currently model failures of processes as well as communication links individually and independently, with a nonzero probability of failing/recovering at any point in time. In Appendix D, we sketch how our results can be extended to systems where some processes may never recover from failure. It would be interesting to investigate what results can be established with more complicated probability distributions, e.g., if the model is weakened to allow processes and links to fail/recover on average with some nonzero probability [15]. Second, our wrapper fully persists the processes’ state. Studying how to minimize the amount of persisted state while still allowing our results (or similar ones) to hold is another promising direction. Finally, we focus on algorithms that depend on the reliability of message delivery. Some algorithms, notably Paxos [23], do not. Finding a modular link abstraction for the crashstop setting that identifies these algorithms is another interesting topic. For those algorithms, we speculate that preserving termination in the crashrecovery model is simpler.

References
 [1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crashrecovery model. Distributed computing, 13(2):99–125, 2000.
 [2] Dan Alistarh, James Aspnes, Valerie King, and Jared Saia. Communicationefficient randomized consensus. In Distributed Computing, pages 61–75, 2014.
 [3] Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers. How to solve consensus in the smallest window of synchrony. In DISC. Springer, 2008.
 [4] Bowen Alpern and Fred Schneider. Defining Liveness. Information Processing Letters, 21:181–185, June 1985.
 [5] James Aspnes, Hagit Attiya, and Keren Censor. Combining sharedcoin algorithms. J. Parallel Distrib. Comput., 70(3):317–322, 2010.
 [6] Christel Baier and JoostPieter Katoen. Principles of model checking. MIT Press, 2008.
 [7] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4), 1985.
 [8] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225–267, 1996.
 [9] Bernadette CharronBost, Martin Hutle, and Josef Widder. In search of lost time. Information Processing Letters, 110(21), 2010.
 [10] Flavin Cristian. Understanding faulttolerant distributed systems. Commun. ACM, 34(2):56–78, 1991.
 [11] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. In PODC, pages 286–, 1997.
 [12] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. J. ACM, 35(2):288–323, 1988.
 [13] Dacfey Dzung, Rachid Guerraoui, David Kozhaya, and Yvonne Anne Pignolet. Never say never  probabilistic and temporal failure detectors. In IEEE International Parallel and Distributed Processing Symposium, IPDPS, pages 679–688, 2016.
 [14] Tzilla Elrad and Nissim Francez. Decomposition of distributed programs into communicationclosed layers. Science of Computer Programming, 2:155–173, 1982.
 [15] Christof Fetzer, Ulrich Schmid, and Martin Susskraut. On the possibility of consensus in asynchronous systems with finite average response times. In 25th IEEE International Conference on Distributed Computing Systems, pages 271–280, 2005.
 [16] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.
 [17] Pierre Fraigniaud, Mika Göös, Amos Korman, Merav Parter, and David Peleg. Randomized distributed decision. LNCS, 27, 2014.
 [18] Felix C. Freiling, Christian Lambertz, and Mila MajsterCederbaum. Modular Consensus Algorithms for the CrashRecovery Model. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 287–292, December 2009.
 [19] Eli Gafni. Roundbyround fault detectors: Unifying synchrony and asynchrony. In PODC, pages 143–152, 1998.
 [20] Rachid Guerraoui and Michel Raynal. A generic framework for indulgent consensus. In ICDCS, pages 88–, 2003.
 [21] Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. Consensus in asynchronous systems where processes can crash and recover. In Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems, pages 280–, 1998.
 [22] Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. A versatile family of consensus protocols based on ChandraToueg’s unreliable failure detectors. IEEE Transactions on Computers, 51(4):395–408, 2002.
 [23] Leslie Lamport. The Parttime Parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
 [24] Mikel Larrea, Cristian Martín, and Iratxe Soraluze. Communicationefficient leader election in crash–recovery systems. Journal of Systems and Software, 84(12):2186 – 2195, 2011.
 [25] Neeraj Mittal, Kuppahalli L. Phaneesh, and Felix C. Freiling. Safe termination detection in an asynchronous distributed system when processes may crash and recover. Theor. Comput. Sci., 410(67):614–628, February 2009.
 [26] H. Moniz, N.F. Neves, and M. Correia. Turquois: Byzantine consensus in wireless ad hoc networks. In DSN, 2010.
 [27] Henrique Moniz, NunoFerreira Neves, Miguel Correia, and Paulo Veríssimo. Randomization can be a healer: Consensus with dynamic omission failures. In LNCS, volume 5805. 2009.
 [28] Rui Oliveira, Rachid Guerraoui, and André Schiper. Consensus in the crashrecover model. 1997.
 [29] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228–234, 1980.
 [30] André Schiper. Early consensus in an asynchronous system with a weak failure detector. Distrib. Comput., 10(3):149–157, 1997.
 [31] Ulrich Schmid, Bettina Weiss, and Idit Keidar. Impossibility results and lower bounds for consensus under link failures. SIAM Journal on Computing, 38(5):1912–1951, 2009.
Appendix A Wrapper Description
In this section, we provide the formal description of the wrapper that produces the crashrecovery version of a crashstop algorithm . The message and local state spaces are denoted by and , respectively.
The message space of is defined as . is the message set of which we extend with a special heartbeat message being sent when no message needs to be sent. The second part of each message represents an acknowledgment, confirming the receipt of the last message on the channel in the opposite direction. The message indicates that no messages (including acknowledgment messages) have been received so far.
The types of the components of a record are

of type storing the state of in the target crashstop algorithm.

buff of type , where we recall that stands for the set of finite sequences of messages from . These are ’s outgoing message buffers, with one buffer for each process in the system (including one for ).

acks of type . This records, for each process , the last message that received from (if any). This will be used for acknowledgments.
Given a crashrecovery state , a process , and the partial function of messages received by in the given round, we define ’s local step unfolding for and , written , as follows. First, let:

be if , and let be otherwise. That is, is the message from that receives, using to indicate that no message was received.

Let . That is, all processes from whom no message was received in this step are suspected.
Then, unless has already decided (in which case it broadcasts the decision), the following sequence of intermediate steps is taken. That is, defines a sequence of crashrecovery states and crashstop labels , where the intermediate state represents the state record of after processing the message of the th process, defined as follows:


Recalling that we number the processes from to , and are computed as follows, for :

Unpack the message from , if one has been received. Let if , and let and otherwise. If and , we check that the message has not been acknowledged yet, , in which case we feed the message to ’s next function: . We also need the next state if no message (or a duplicate) has been received, as needs to be a transition of the crashstop system. Hence, if or , we feed in and not , i.e., .

We set or accordingly. In both cases, the set of failed processes in this label is empty.

We remove the acknowledged message (if any) from the head of the outgoing buffer for this process. More precisely, let be the buffer obtained as follows. First, copy the buffers of all other processes except : for all . Next, if there are messages to send to process , i.e., , and if and is equal to , then we let be ; otherwise, let .
To add a potential new message to the buffer, let . These are the new messages that wishes to send, at most one destined to each process in the system. We define , if is undefined, and otherwise. Notice that we add the new message at the head of the list; as we will also remove messages from the head when in the function, our buffers are LIFO.

If and , then , and for .
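The per-destination buffer maintenance described above can be sketched in a few lines; the name `update_buffer` and the dictionary representation are illustrative assumptions. The sketch combines the two operations in order: drop the head if it was just acknowledged, then push any new message at the head, which is exactly what makes the buffers LIFO.

```python
def update_buffer(buff, q, acked, new_msgs):
    """One round of buffer maintenance for destination q (schematic).

    buff:     dict destination id -> list of messages (head = index 0).
    acked:    message just acknowledged by q, or None.
    new_msgs: dict of new messages to send, at most one per destination.
    Returns the new buffer for q without mutating the input.
    """
    buf = list(buff.get(q, []))
    if buf and acked is not None and buf[0] == acked:
        buf = buf[1:]                 # acknowledged: remove from the head
    if q in new_msgs:
        buf = [new_msgs[q]] + buf     # new message goes to the head (LIFO)
    return buf
```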

Finally, given a crashstop algorithm using an unreliable failure detector with range and with the perprocess state space , we define its crashrecovery version where:

is the set of crashrecovery configurations such that there exists a crashstop configuration such that:

for each :


for all . That is, initially, no messages are buffered.

for all . Initially, no messages are acknowledged.




.

is:

if , and

, otherwise, i.e., if we have nothing else to send.

Appendix B Markov chains
Our probabilistic crashrecovery model uses Markov chains [6]. We recall here the basic notions relevant for our proofs.
A (discretetime) Markov chain is a tuple where:

is a countable, nonempty set of states

is the transition probability function such that for all states :

is the initial distribution, such that .
For a Markov chain, the cylinder set spanned by a finite word over is defined as the set: These sets serve as the basis events of the σ-algebras of Markov chains. If , then
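The probability of a cylinder set is the initial probability of the first state multiplied by the transition probabilities along the word. A minimal sketch, with dictionaries standing in for the initial distribution and the transition function (illustrative representation):

```python
def cylinder_probability(word, init, P):
    """Probability of the cylinder set spanned by a finite word s0..sk:
    init(s0) * P(s0, s1) * ... * P(s_{k-1}, s_k).

    init: dict state -> initial probability.
    P:    dict (state, state) -> transition probability.
    """
    if not word:
        return 1.0          # empty word spans the whole run space
    prob = init.get(word[0], 0.0)
    for s, t in zip(word, word[1:]):
        prob *= P.get((s, t), 0.0)
    return prob
```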
Appendix C Omitted Proofs
c.1 Proof of Lemma 5
Informally, we obtain by unfolding the crashrecovery steps of . In other words, we extend each crashrecovery step to its corresponding crashstop steps. Recall that a single step in the crashrecovery setting may abstract the receipt of multiple messages, while every received message in the crashstop setting corresponds to a separate step of the algorithm.
The only difficulty is handling process recovery, which does not exist in the crashstop system. Here, we exploit the fact that, in asynchronous systems, crashed processes are indistinguishable from delayed ones. This enables us to keep free from failed processes.
We now state our claim more formally, and then prove it by induction. We claim that there exists a finite run such that:

is a subword of , with the same first and last character: and .

No failures occur in , for all .

Inflight messages of correspond to messages in the buffers of ,
. Here, we overload , writing even though buff is a sequence.
Proving the base case is easy, as is an initial state of , and both the inflight set of messages and all the buffers are empty for initial configurations (follows from our wrapper’s definition).
We now prove the step case. Let and be as above, let:

,

,

, and

be the extension of with .
In other words, is the next “letter” we could add to a valid run of . Thus, we will prove that we can extend with a series of transitions, to obtain a run such that the claim holds for and . The extension is obtained by concatenating the unfoldings of each process’s local step. More precisely, we define the sequences (fragments) where is the unfolding of process ’s local steps:


if , then

if , then is the sequence such that:


is

for and (the state of other processes does not change)

, for (messages delivered to are removed from the inflight messages, and new messages sent by are added)

for

Since the successive fragments overlap at one state, we define