You Only Live Multiple Times: A Blackbox Solution for Reusing Crash-Stop Algorithms In Realistic Crash-Recovery Settings

11/12/2018
by   David Kozhaya, et al.
ABB

Distributed agreement-based algorithms are often specified in a crash-stop asynchronous model augmented with Chandra and Toueg's unreliable failure detectors. In such models, correct nodes stay up forever, incorrect nodes eventually crash and remain down forever, and failure detectors eventually behave correctly forever. However, in reality, nodes as well as communication links crash and recover without any deterministic guarantee to remain in some state forever. In this paper, we capture this realistic temporary and probabilistic behaviour in a simple new system model. Moreover, we identify a large class of algorithms for which we devise a property-preserving transformation. Using this transformation, many algorithms written for the asynchronous crash-stop model run correctly and unchanged in real systems.



1 Introduction

Distributed systems comprise multiple software and hardware components that are bound to eventually fail [10]. Such failures can cause service malfunction or unavailability, incurring significant costs to mission-critical systems, e.g., automation systems and on-line transactions. The failures’ impact can be minimized by protocols that let systems agree on values and actions despite failures. As a consequence, many variants of the agreement or the consensus problem [29] under different assumptions have been studied. Of particular importance are synchrony and failure model assumptions, as they determine the problem’s complexity.

In the simplest failure model, often called the crash-stop model, a process fails by halting the execution of its protocol, and never recovers. In an asynchronous system, i.e., a system without bounds on execution delays or message latency, the crash-stop failure model makes it impossible to distinguish a crashed process from a very slow one. This renders consensus-like problems deterministically unsolvable [16], even in this very simple failure model. To circumvent this impossibility, previous works have investigated ways to relax the underlying asynchrony assumption, either explicitly, e.g., by using partial synchrony [12], or implicitly, by defining oracles that encapsulate time, e.g., failure detectors [8]. The result is a large and rich body of literature that builds on these two techniques to solve consensus-like problems in the presence of crash-stop failures. Typically, the respective proofs rely on assumptions of the “eventually forever” form: correct nodes stay up forever, incorrect nodes eventually crash and remain down forever, and failure detectors produce wrong output in the beginning, but eventually provide correct results forever.

However, such "eventually forever" assumptions are not met by real distributed systems. In reality, processes may crash, but their processors reboot and the recovered processes rejoin the computation. Communication might also fail at any point in time, but be restored later. Hence, the failure and recovery modes of processes as well as communication links are in reality probabilistic and temporary [13, 15, 31], especially in systems incorporating many unreliable off-the-shelf low-cost devices and communication technologies. This led to the development of crash-recovery models, where processes repeatedly leave and join the computation unannounced. This requires new failure detector definitions and new consensus algorithms built on top of these failure detectors [1, 24, 11, 21], as well as completely new solutions (without failure detectors) that consider different classes of failures, classified according to how many times a process can crash and recover [25]. However, such solutions eliminate the "eventually forever" assumptions only at the level of processes, not for communication links and failure detectors. Moreover, they do not tell us whether crash-stop algorithms can be “ported unchanged” to crash-recovery settings.

To this end, this paper investigates how to re-use consensus algorithms defined for the crash-stop model with reliable links and failure detectors in a more realistic crash-recovery model, where processes and links can crash and recover probabilistically and for an unbounded number of times. Our models allow unstable nodes, i.e., nodes that fail and recover infinitely often. These are often excluded, or limited in number, in other models. In contrast, we explicitly allow unstable behavior of any number of processes and links, by modeling communication problems and crash-recovery behaviors as probabilistic and temporary, rather than deterministic and perpetual. Our system model, like existing models that rely on probabilistic factors, e.g., coin flips, comes with the trade-off of solving consensus (namely its termination property) with probability 1, rather than deterministically.

However, unlike existing solutions that incorporate probabilistic behavior, our approach does not aim at inventing new consensus algorithms, but rather focuses on using existing deterministic ones to solve consensus with probability 1. Our approach is modular: we build a wrapper that interacts with a crash-stop algorithm as a black box, exchanges messages with other wrappers, and transforms these messages into messages that the crash-stop algorithm understands. We then formally define classes of algorithms and safety properties for which we prove that our wrapper constructs a system that preserves these properties. Additionally, we show that termination with probability 1 is guaranteed for wrapped algorithms of this class. Moreover, this class is wide and includes the celebrated Chandra-Toueg algorithm [8] as well as the instantiation of the indulgent framework of [20] with failure detectors. Our work allows such algorithms to be ported unchanged to our crash-recovery model. Hence, applications built on top of such algorithms can run in real systems with crash-recovery behavior by simply using our wrapper.

Contributions: To summarize, our main contributions are:

  • New system models that capture probabilistic and temporary failures and recoveries of processes and communication links in real distributed systems (described in Section 3)

  • A wrapper framework that allows a wide class of crash-stop consensus algorithms to be used unchanged in our more realistic models (described in Section 4)

  • Formal properties describing which crash-stop consensus algorithms benefit from our framework and hence can be reused to solve consensus in crash-recovery settings (described in Sections 5 and 6)

In addition to the sections presenting our contributions, we discuss related work in Section 2 and conclude the paper in Section 7.

2 Related Work

Several works addressed the impossibility of asynchronous consensus. One direction exploits the concept of partial synchrony [12], in which an asynchronous system becomes synchronous after some unknown global stabilization time (GST) for a bounded number of rounds. For the same model, ASAP [3] is a consensus algorithm in which every process decides within an optimal number of rounds. Another direction augments asynchronous systems with failure detector oracles, and builds asynchronous consensus algorithms on top [8]. These detectors typically behave erratically at first, but eventually start behaving correctly forever. As with partial synchrony, in practice the failure detectors must behave correctly only for "sufficiently long" instead of forever [8]; however, quantifying "sufficiently long" is not expressible in a purely asynchronous model [9]. Both lines of work initially investigated crash-stop failures of processes. In real systems, processes as well as network links crash and recover multiple times, and sometimes even indefinitely. This gave rise to a large body of literature that studied how to adapt the two lines of work to the crash-recovery behavior of processes and links. We next survey some of this literature.

Failure detectors and consensus algorithms for crash recovery: Dolev et al. [11] consider an asynchronous environment where communication links first lose messages arbitrarily, but eventually communication stabilizes such that a majority of processes forms a strongly connected component forever. Processes belonging to such a strongly connected component are termed correct, and the others faulty. Process state is always (fully) persisted in stable storage. The authors propose a failure detector that allows the correct processes to reach a consensus decision and show that the rotating coordinator algorithm [8] works unchanged in their setting, as long as all messages are constantly retransmitted. This relies on piggybacking all previous messages onto the last message, and regularly retransmitting the last message. As this yields very large messages, they also propose a modification of [8] for which no piggybacking is necessary. While our results also rely on strongly connected components, we require their existence to be neither deterministic nor perpetual. We also do not require piggybacking in order for algorithms like [8] to be used unchanged.

Oliveira et al. [28] consider a crash-recovery setting with correct processes that may crash only finitely many times (and thus eventually stay up forever) and faulty processes that permanently crash or crash infinitely often. As in [8], the authors note that, in practice, correct processes only need to stay up for long enough periods (rather than forever), but this cannot be expressed in the asynchronous model. The authors take the consensus algorithm of [30], which uses stubborn links, and transform it to work in the crash-recovery setting by logging every step into stable storage and adding a fast-forward mechanism for skipping rounds. Hurfin et al. [22] describe a consensus algorithm using an unreliable failure detector in the crash-recovery case. The notions of correct/faulty processes and of failure detectors are the same as in Oliveira et al. [28]. Their algorithm is, however, more efficient in its use of stable storage compared to [28]: there is only one write per round (of multiple data), and the required network buffer capacity for each channel (connecting a pair of processes) is one. Compared to [28] and [22], our system does not regard processes that crash and recover infinitely often as faulty, and hence we allow such “unstable” processes to implement consensus.

Aguilera et al. [1] consider a crash-recovery system with lossy links. They show that previously proposed failure detectors for the crash-recovery setting have anomalous behaviors even in synchronous systems when considering unstable processes, i.e., processes that crash and recover infinitely often. The authors propose new failure detectors to mitigate this drawback. They also determine the necessary conditions regarding stable storage that allow consensus to be solved in the crash-recovery model, and provide two efficient consensus algorithms: one with, and one without using stable storage. Unlike [1], we do not exclude unstable processes from implementing consensus, thus our model tolerates a wider variety of node behavior. Furthermore, our wrapper requires no modifications to the existing crash-stop consensus algorithms, as it treats them as black-boxes.

Modular Crash-Recovery Approaches: Similar to [1], Freiling et al. [18] investigate the solvability of consensus in the crash-recovery model under varying assumptions, regarding the number of unstable and correct processes and what is persisted in stable storage. They reuse existing algorithms from the crash-stop model in a modular way (without changing them) or semi-modular way, with some modifications to the algorithm (as in the case of [8]). Similar to our work, they provide algorithms to emulate a crash-stop system on top of a crash-recovery system. Our work, however, always reuses algorithms in a fully modular way, and we define a wide class of algorithms for which such reuse is possible. Furthermore, as we model message losses, processes crashes, and process recoveries probabilistically, our results also apply if processes are unstable, i.e., crash and recover infinitely often.

Randomized Consensus Algorithms: Besides the literature that studied deterministic consensus algorithms, existing works have also explored randomized algorithms to solve “consensus with probability 1”. These include, for example, techniques based on random coin-flips [2, 5, 17] or probabilistic schedulers [7]. In systems with dynamic communication failures, multiple randomized algorithms [27, 26] addressed a weaker variant of consensus that requires only a subset of the processes to eventually decide. Moniz et al. [27] considered a system with correct processes and a bound on the number of faulty transmissions. In a wireless setting, where multiple processes share a communication channel, Moniz et al. [26] devise an algorithm tolerating Byzantine processes, requiring a bound on the number of omission faults affecting correct processes. In comparison, our work in this paper does not use randomization in the algorithm itself: we focus on using existing deterministic algorithms to solve consensus (with probability 1) in networks with probabilistic failure and recovery patterns of processes and links.

3 System Models

We first define the notation we use, then general concepts common to all of our models, and finally each of our models in turn.

Notation: Given a set S, we define S⊥ to be the set S ∪ {⊥}, where ⊥ is a distinguished element not present in S. The set of finite sequences over a set S is denoted by S*. We also call sequences words, when customary. The empty sequence is denoted by ε. Given a non-empty sequence, we use the standard operations selecting its first element, the remainder of the sequence, and, if the sequence is finite, its last element. Given two sequences w and v, where w is finite, w·v denotes their concatenation. For a word w, |w| denotes the length of w. Letting w_i be the i-th letter of w, we say that w is a subword of v if there exists a strictly monotone function f such that w_i = v_f(i), for all i ≤ |w| if w is finite, and for all i if w is infinite. Analogously, v is then a superword of w.
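The subword relation can be illustrated with a short sketch (Python; the function name is ours, not from the paper):

```python
def is_subword(w, v):
    """Return True iff w is a subword of v: the letters of w occur in v
    in order (a strictly monotone embedding of positions exists)."""
    it = iter(v)
    # `letter in it` advances the iterator past the first match, so
    # successive letters of w must be found at strictly later positions.
    return all(letter in it for letter in w)
```

For example, "ace" is a subword of "abcde" (and "abcde" a superword of "ace"), while "aec" is not, since no order-preserving embedding exists.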

We denote the space of partial functions between sets A and B by A ⇀ B. Note that every total function from A to B is also a partial one, i.e., (A → B) ⊆ (A ⇀ B). Given any function f, its graph is the relation {(a, f(a)) | a ∈ dom(f)}. The range of f, written ran(f), is the set {f(a) | a ∈ dom(f)}.

Common concepts: We consider a fixed finite set of processes Π and a fixed countable set of values V. For each algorithm there is an algorithm-specific countable set of local states, one per process. For simplicity, we restrict ourselves to algorithms where all processes share the same local state space. Note that this does not exclude algorithms that take decisions based on identifiers. We define the global state space as the set of assignments of a local state to each process; given a global state s and a process p, we write s_p for the projection of s to p’s component.

A property over an alphabet Σ is a set of infinite words over Σ. We use standard definitions of liveness and safety properties [4]. A property P is a safety property if, for every infinite word w not in P, there exists a finite prefix u of w such that the concatenation u·v is not in P for any infinite word v. Intuitively, the prefix u is “bad” and cannot be recovered from. A property P is a liveness property if for any finite word u there exists an infinite word v such that u·v is in P. Intuitively, “good” things can always happen later.

In this paper, we are interested in preserving properties over the alphabet of global states between the crash-stop and crash-recovery versions of an algorithm. In particular, we assume that the local states are records with two distinguished fields: a decision field ranging over V⊥ and an input field ranging over V. Intuitively, a decision of ⊥ indicates that the process has not decided yet. For an infinite word over the alphabet of global states, let s_p^i denote the local state of process p at the i-th letter of the word. Let us state the standard safety properties of consensus in our notation.

Validity.

Decided values must come from the set of input values. Formally, validity describes the set of words such that, whenever s_p^i.decision ≠ ⊥, there is a process q with s_p^i.decision = s_q^0.input.

Integrity.

Processes do not change their decisions. Formally, integrity describes the set of words such that, whenever s_p^i.decision ≠ ⊥, then s_p^(i+1).decision = s_p^i.decision.

(Uniform) Agreement.

No two processes ever make different non-⊥ decisions. Formally, agreement describes the set of words such that, whenever s_p^i.decision ≠ ⊥ and s_q^j.decision ≠ ⊥, then s_p^i.decision = s_q^j.decision.
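These three safety properties can be checked on any finite prefix of a state trace. A small sketch (Python; representing ⊥ as None and local states as dictionaries with a decision field is our encoding, not the paper's):

```python
BOT = None  # represents the undecided value ⊥

def all_decisions(trace):
    """All decision values appearing in a (finite prefix of a) state trace."""
    return [st["decision"] for state in trace for st in state.values()]

def violates_validity(trace, inputs):
    # Validity: a non-⊥ decision must come from the set of input values.
    return any(d is not BOT and d not in inputs for d in all_decisions(trace))

def violates_integrity(trace, p):
    # Integrity: once p decides, it never changes its decision.
    ds = [s[p]["decision"] for s in trace if s[p]["decision"] is not BOT]
    return any(a != b for a, b in zip(ds, ds[1:]))

def violates_agreement(trace):
    # Agreement: no two non-⊥ decisions ever differ.
    return len(set(all_decisions(trace)) - {BOT}) > 1
```

A violation found on a finite prefix can never be repaired by any continuation, which is exactly what makes these properties safety properties.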

To simplify our preservation results for safety properties, our models store information about process failures separately from . As a consequence, the standard crash-stop termination property cannot be expressed as a property over : it is conditioned on a process not failing. However, we do not directly use the crash-stop notion of termination and we omit this definition here. Instead, we will prove the following property for the algorithms in our probabilistic crash-recovery model:

Probabilistic crash-recovery termination.

With probability 1, all processes eventually decide.

3.1 The crash-stop model

Our definition of the crash-stop model is standard and closely follows [8]. We assume an asynchronous environment, with processes taking steps in an interleaved fashion. Processes communicate using reliable links, and can query failure detectors.

Failure detectors: A failure pattern is an infinite word whose letters are sets of processes. Intuitively, each letter is the set of failed processes in a transition step of a run of a transition system. A failure detector with range R is a function from failure patterns to properties over the alphabet R. (This definition does not distinguish which process received the output, which suffices here; it can easily be extended to other failure detectors.) A failure detector is unreliable if its output property is a liveness property for every failure pattern. Intuitively, a detector constrains how the failure detector outputs (the values in R) must depend on the failure pattern of a run, and unreliable detectors can produce arbitrary outputs in the beginning. We write D(R) for the set of all detectors with range R.

Algorithms and algorithm steps: The type of crash-stop steps over a message space M and a failure detector range R is defined as a pair of functions: a transition function taking a process state, an optional received message (from M⊥), and a detector output (from R) to a new state, and a sending function taking a state to a partial map from processes to messages.

Intuitively, given zero or one message received from some other process and an output of the failure detector, a step maps the current process state to a new state, and maps the new state to a set of messages to be sent, with zero or one messages sent to each process.

A crash-stop algorithm over a message space M and a failure detector range R is a tuple consisting of:

  • the finite set of initial states,

  • the step function,

  • a failure detector, and

  • the resilience condition, i.e., the number of process failures tolerated by the algorithm (recall that we consider a fixed set of processes Π).

We refer to the components of an algorithm by name as above. As the set of initial states is finite, it admits a uniform distribution in our probabilistic model in Section 3.3.
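The shape of a crash-stop algorithm can be sketched as a record of its components (Python; the field names and the toy instance are ours, not the paper's):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass(frozen=True)
class CrashStopAlgorithm:
    initial_states: frozenset                             # finite set of initial local states
    transition: Callable[[Any, Optional[Any], Any], Any]  # (state, msg or None, fd output) -> state
    send: Callable[[Any], Dict[str, Any]]                 # state -> messages keyed by destination
    detector: Any                                         # the failure detector (left abstract here)
    resilience: int                                       # number of tolerated process failures

# A toy instance: the state counts received messages; every step broadcasts a ping.
toy = CrashStopAlgorithm(
    initial_states=frozenset({0}),
    transition=lambda s, msg, fd: s + (0 if msg is None else 1),
    send=lambda s: {q: "ping" for q in ("p", "q", "r")},
    detector=None,
    resilience=1,
)
```

The determinism of the two functions matters: all non-determinism of an execution is pushed into the step labels defined next.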

Configurations: As noted earlier, we focus on preserving state properties between the crash-stop and crash-recovery models. However, global states alone contain insufficient information to model the algorithm’s crash-stop executions (runs). In particular, to account for asynchronous message delivery and process failures, we must extend states to configurations. A crash-stop configuration is a triple consisting of:

  • the (global) state,

  • the set of in-flight messages, where a triple (p, m, q) represents a message m that was sent by p to q. For simplicity, the in-flight messages form a set, i.e., we assume that each message is sent at most once between any pair of processes during the entire execution of the algorithm. This suffices for round-based algorithms that tag messages with their round numbers and exchange messages once per round.

  • the set of failed processes.

As with algorithms, we refer to the components of a configuration by name: its state, its in-flight messages, and its failed set.

Step labels and transitions: While the algorithm steps are deterministic, the asynchronous transition system is not: any (non-failed) process can take a step at any point in time, with different possible received messages, and different failure detector outputs. Accessing this non-determinism information is useful in proofs, so we extract it as follows.
A crash-stop step label is a quadruple (p, m, F, d), where:

  • p is the process taking the step,

  • m is the message p receives in the step (⊥ modeling a missing message),

  • F is the set of processes failed at the end of the step,

  • d is p’s output of the failure detector.

A crash-stop step of the algorithm is a triple (c, l, c′), where c and c′ are configurations and l = (p, m, F, d) is a label. Crash-stop steps must satisfy the following properties:

  • p is not failed in c, i.e., p is not failed at the start of the step.

  • With s and s′ denoting process p’s state in c and c′ respectively, s′ is the result of applying the algorithm’s transition function to s, m, and d, and the states of all other processes are unchanged. That is, p takes a step according to the label and the algorithm’s rules, and the other processes do not move.

  • If m ≠ ⊥, then m was in flight in c, i.e., if a message was received, then it was in flight.

  • The received message (if any) is removed from the set of in-flight messages, and the messages produced by the sending function from s′ are added.

  • Every process failed in c is also failed in c′, and the failed set of c′ is exactly F. That is, failed processes do not recover in the crash-stop setting.
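The step conditions above can be captured executably. A sketch (Python; configurations follow the (state, in-flight messages, failed) triple, messages are (sender, payload, receiver) triples, and all names are ours):

```python
from dataclasses import dataclass

@dataclass
class Config:
    state: dict       # process -> local state
    msgs: frozenset   # in-flight messages as (sender, payload, receiver)
    failed: frozenset # failed processes

def is_crash_stop_step(transition, send, c, label, d):
    """Check the conditions for a crash-stop step (c, label, d).
    label = (p, msg, failed_after, fd_out); msg is None or (sender, payload)."""
    p, msg, failed_after, fd_out = label
    if p in c.failed:                                  # p must be up at the start
        return False
    new_sp = transition(c.state[p], None if msg is None else msg[1], fd_out)
    if d.state[p] != new_sp:                           # p follows the algorithm
        return False
    if any(d.state[q] != c.state[q] for q in c.state if q != p):
        return False                                   # other processes do not move
    if msg is not None and (msg[0], msg[1], p) not in c.msgs:
        return False                                   # received message was in flight
    consumed = set() if msg is None else {(msg[0], msg[1], p)}
    produced = {(p, m, q) for q, m in send(new_sp).items()}
    if set(d.msgs) != (set(c.msgs) - consumed) | produced:
        return False                                   # message set updated correctly
    return c.failed <= failed_after == d.failed        # failures only grow
```

The checker mirrors the bullets one for one, which is also how a model checker would enumerate valid successors of a configuration.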

Algorithm runs: A finite, respectively infinite, crash-stop run of an algorithm is a finite, respectively infinite, alternating sequence of configurations and labels, ending in a configuration if finite, such that

  • the initial state is allowed by the algorithm,

  • each configuration, together with the following label and configuration, forms a step of the algorithm,

  • for each configuration in the sequence, the number of failed processes does not exceed the resilience condition, and

  • there is a failure pattern such that the run’s sequence of failed sets is a prefix of it, and the sequence of failure detector outputs is a prefix of a word allowed by the detector for that pattern. That is, the output of the failure detector satisfies the condition of the failure detector.

Such a run has reliable links if every in-flight message is eventually delivered, unless the sender or the receiver is faulty. The crash-stop system of the algorithm is the set of all crash-stop runs of the algorithm. The crash-stop system with reliable links is the set of all crash-stop runs with reliable links.

As mentioned before, we are interested in properties that are sequences of global states. In this sense, runs contain too much information (e.g., in-flight messages). Thus, given a run, we define its state trace, obtained by removing the labels and projecting the configurations onto just the states. We introduce the notion of a state property: a property whose letters are (global) states. The crash-stop system (with or without reliable links) satisfies a state property if the state trace of every run of the system belongs to the property. We later show that our crash-recovery wrappers for crash-stop protocols preserve important state properties of crash-stop algorithms. Lastly, we note some simple properties of crash-stop runs.
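Projecting a run onto its state trace is then a one-liner. A sketch (Python; the attribute-based configuration record is our stand-in for the formal triple):

```python
from types import SimpleNamespace  # stand-in for a configuration record

def state_trace(run):
    """Drop the labels (every other element of the alternating sequence)
    and keep only the global-state component of each configuration."""
    return [config.state for config in run[::2]]

# A run alternates configurations and labels and ends in a configuration.
example_run = [
    SimpleNamespace(state={"p": 0}),
    ("p", None),                      # a step label
    SimpleNamespace(state={"p": 1}),
]
```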

Lemma (Reliable links irrelevant for prefixes). Every finite crash-stop run of an algorithm can be extended to an infinite crash-stop run of the same algorithm that has reliable links, by extending it with steps that eventually deliver all sent messages.

Summary of time and failure assumptions

Time. Processes are asynchronous and have no notion of time. Links are asynchronous.

Failures. Processes can fail by halting forever, while links interconnecting them do not fail.

3.2 The lossy synchronous crash-recovery model

We next define our first crash-recovery model. Formally, this is a lossy synchronous crash-recovery model, with non-deterministic, but not probabilistic, losses, crashes, and recoveries. We use it to prove the preservation of safety properties without taking probabilities into account, since they are not used in such arguments. In this model, we will not distinguish between volatile and persistent memory of a process. Instead, we assume that all memory is persistent. This can be emulated in practice by persisting all volatile memory before taking any actions with side-effects (such as sending network messages). Finally, while the model is formally synchronous, in that all processes take steps simultaneously, it also captures processing delays, as a slow process behaves like a process that crashes and later recovers.

Algorithms and algorithm steps: A crash-recovery step over a message space M is defined as a pair of functions: a transition function taking a process state and the set of received messages to a new state, and a sending function taking a state to one message per process.

In other words, a step determines the new state based on the current state and the set of received messages. Given the new state, a process sends a message to every process (including itself). Compared to the asynchronous setting (Section 3.1), in this model:

  1. A process can receive multiple messages simultaneously (rather than receiving at most one message in a step).

  2. Every process sends a message to every other process at each step. We use this in later sections to send heartbeat messages, if there is nothing else to exchange. We do not require any guarantees on the delivery of the sent messages in this model.

  3. No failure detector oracle is specified. The synchrony assumption of this model inherently provides spurious failure detection: each process can suspect all peers it did not hear from in the last message exchange. This is in fact exactly what we will use to provide failure detector outputs to the “wrapped” crash-stop algorithms run in this setting (similar to Gafni’s round-by-round fault detectors [19]; in our case, the detectors are “step-by-step”).
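The synchronous step structure, including its implicit "step-by-step" failure detection, can be sketched as follows (Python; all names are ours, and message delivery is given externally as the set of channels that succeed in this round):

```python
def sync_round(transition, send, states, up, delivered):
    """One lossy synchronous round.
    states: process -> local state; up: set of non-crashed processes;
    delivered: set of (sender, receiver) channels that succeed this round."""
    # Every up process sends to every process; a message arrives only if
    # the sender is up and the channel delivers.
    inboxes = {p: {} for p in states}
    for q in up:
        for p, m in send(states[q]).items():
            if (q, p) in delivered:
                inboxes[p][q] = m
    new_states = dict(states)
    suspected = {}
    for p in up:
        new_states[p] = transition(states[p], inboxes[p])
        # Implicit failure detection: suspect every peer you did not
        # hear from in this round (possibly spuriously, due to losses).
        suspected[p] = set(states) - set(inboxes[p])
    return new_states, suspected
```

Crashed processes keep their state unchanged, which matches the assumption that all memory is persistent.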

A crash-recovery algorithm over a message space M is a pair consisting of:

  • is a finite set of initial states, and

  • is the step function.

Configurations: As in the crash-stop case, we require more than just the global states to model algorithm executions; we hence introduce configurations. These, however, differ from those of the crash-stop setting. As communication is synchronous, we need not store the in-flight messages; they are either delivered by the end of a step, or they are gone. Furthermore, as processes take steps synchronously, we can introduce a global step number. (We use global step numbers later in the probabilistic model, to assign failure probabilities to processes and links, e.g., the probability of a message from one process to another getting through if sent in a given step.)

A crash-recovery configuration is a tuple (t, s, F) where

  • t is the step number,

  • s is the (global) state,

  • F is the set of failed processes.

The set of all crash-recovery configurations is countable. This will allow us to impose a Markov chain structure on the system in the later model.

Step labels and transitions: As in the crash-stop setting, we use labels to capture all sources of non-determinism in a step. We will use these labels to assign probabilities to different state transitions in the probabilistic model of the next section.

A crash-recovery step label is a pair (r, F), where:

  • r denotes the messages received in the step; r(q, p) is the message received by p on the channel from q to p. Note that the function r is partial. As we assume that every process always attempts to send a message to every process, if r(q, p) is undefined, then either the message on this channel was lost in the step, or the sender has failed.

  • F is the set of processes that are failed at the end of the step.

The crash-recovery steps (or transitions) of an algorithm form the set of all triples (c, l, c′) where, letting l = (r, F) and letting s and s′ denote the state of a process p in c and c′ respectively, we have:

  • if p is not in F, then s′ is the result of applying the step function to s and the messages p received; that is, processes that are up handle their messages. We assume that the state change is atomic; this can be implemented, since we assume that all process memory is persistent.

  • if p is in F, then s′ = s. Failed processes do not change their state.

  • if r(q, p) is defined, then it equals the message q sent to p in this step, and q is not in F. That is, only the sent step messages are received (no message corruption), and messages from failed senders are not received.

  • the step number of c′ is that of c plus one, and the failed set of c′ is F.

Algorithm runs: A finite, resp. infinite crash-recovery run of is a finite, resp. infinite alternating sequence of (crash-recovery) configurations and labels, ending in a configuration if finite, such that:

  • , i.e., the initial state is allowed by the algorithm

  • for all , that is, each step is a valid crash-recovery transition of .

The crash-recovery system of an algorithm is the set of all its crash-recovery runs.

Summary of time and failure assumptions

Time. Processes are synchronous and operate in a time-triggered fashion. Links are synchronous (all delivered messages respect a timing upper bound on delivery).

Failure. Processes can fail and recover infinitely often. In every time-step, a link can be either crashed or correct. A crashed link drops all sent messages (if any).

3.3 The probabilistic crash-recovery model

We now extend the lossy synchronous crash-recovery model to a probabilistic model, where both the successful delivery of messages and failures follow a distribution that can vary with time. A probabilistic network assigns to each ordered pair of processes and each time step a delivery probability: intuitively, the probability that a message sent from q to p at time (step number) t is delivered. A probabilistic failure pattern assigns to each process p and time t the probability of p being up at time t. Both kinds of probabilities are bounded away from 0 and 1. (Considering infinite time, the upper and lower bounds ensure that, with probability 1, there is a time when any given process is up.)
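One way to read these definitions operationally: at each step, sample which processes are up and which channels deliver. A sketch (Python; the function name and sampling interface are ours):

```python
import random

def sample_round(t, processes, delivery_prob, up_prob, rng=random):
    """Sample one step of the probabilistic model.
    delivery_prob(q, p, t): chance that a message from q to p gets through
    at step t; up_prob(p, t): chance that p is up at step t. Both are
    assumed bounded away from 0 and 1, so every behavior keeps recurring."""
    up = {p for p in processes if rng.random() < up_prob(p, t)}
    delivered = {(q, p) for q in up for p in processes
                 if rng.random() < delivery_prob(q, p, t)}
    return up, delivered
```

Because `random.random()` draws from the half-open interval [0, 1), a probability of 1.0 always succeeds and 0.0 never does, which makes the boundary cases easy to check.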

Given a crash-recovery algorithm, a probabilistic network, and a probabilistic failure pattern, the probabilistic crash-recovery system is the Markov chain [6] with:

  • The set of states given by the crash-recovery configuration set.

  • The transition probabilities, defined by giving each step label a weight and summing these weights over all labels that lead from one configuration to the next, normalized per source configuration. The weight of a label is the product of, for each process, the probability of its up/failed status in that step, and, for each channel whose sender is up, the probability that its message was delivered or lost, as recorded by the label. (Here, the Boolean values true and false are mapped to 1 and 0, respectively.)

    Intuitively, a transition between configurations is only possible if it is possible in the lossy synchronous crash-recovery model, and its probability is calculated by summing over all labels that lead between them, giving each such label a weight. Note that the only non-determinism in the transitions comes exactly from the behavior we deem probabilistic: messages being dropped by the network, and process failures.

    It is easy to see that the normalization factor is well defined, as for a fixed source configuration, the summed weight is non-zero for only finitely many target configurations.

  • The distribution over the initial states: the uniform distribution over the algorithm’s set of initial states. Normalization is possible, since we assumed that each algorithm comes with a finite set of initial states.
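Our reading of the label weights can be sketched concretely (Python; the decomposition into per-process and per-channel factors is our interpretation of the construction, and all names are ours):

```python
def label_weight(label, t, processes, delivery_prob, up_prob):
    """Weight of one crash-recovery step label under the probabilistic model.
    label = (received, failed): received maps (q, p) to the message p got
    from q; failed is the set of processes down at the end of the step."""
    received, failed = label
    w = 1.0
    for p in processes:                 # process failure/recovery factors
        pr_up = up_prob(p, t)
        w *= (1.0 - pr_up) if p in failed else pr_up
    for q in processes:                 # per-channel delivery factors
        if q in failed:
            continue                    # failed senders send nothing
        for p in processes:
            pr = delivery_prob(q, p, t)
            w *= pr if (q, p) in received else (1.0 - pr)
    return w
```

Summing such weights over all labels between two configurations, and normalizing per source configuration, yields the transition probabilities of the Markov chain.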

Summary of time and failure assumptions

Time. Processes are synchronous and operate in a time-triggered fashion. Links are synchronous (all delivered messages respect a timing upper bound on delivery).

Failure. Processes and links can fail and recover infinitely often. At the beginning of any time-step a crashed process/link can recover with positive probability and a correct process/link can fail with positive probability.

Important: All our results (Sections 4, 5, and 6) hold for both crash-recovery models (Sections 3.2 and 3.3). However, probabilities are only needed to prove the results of Section 6.

4 Wrapper for Crash-Stop Algorithms

We now define the transformation of a crash-stop algorithm into a crash-recovery algorithm . Intuitively, the transformation works as follows (illustrated in Figure 1):

Figure 1: Wrapper concept.

  • Generating a synchronous crash-recovery step using a series of crash-stop steps. Each step in the series handles one individual received message, allowing us to iteratively handle multiple simultaneously incoming messages and bridge the synchrony mismatch between the crash-stop and crash-recovery models.

  • Using round-by-round failure detectors to produce the failure detector outputs to be fed to the crash-stop algorithm. These outputs are from the set . (We could instead produce outputs that never suspect anyone, since no process crashes forever in our probabilistic model. However, for a weaker model considered later, where processes are allowed to crash forever, we need failure detectors that suspect processes.)

  • Providing reliable links, as required by the crash-stop algorithm. During each crash-recovery step, we buffer all outgoing messages of a process, and send them repeatedly in the subsequent crash-recovery steps, until an acknowledgment is received.
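The reliable-link emulation in the last point can be sketched as follows. `HEARTBEAT` and the function names are our own illustrative choices, not the paper's notation; the buffer is a Python list used as a LIFO stack with its head at index 0.

```python
HEARTBEAT = "hb"  # stands in for the special no-payload message

def to_send(buffer):
    """Retransmit the head of the LIFO outgoing buffer every step;
    fall back to a heartbeat when there is nothing to send."""
    return buffer[0] if buffer else HEARTBEAT

def on_ack(buffer, ack):
    """Drop the head of the buffer once the peer acknowledges it;
    ignore acks that do not match (e.g., from an older retransmission)."""
    if buffer and ack == buffer[0]:
        return buffer[1:]
    return buffer
```

A message is thus resent in every subsequent crash-recovery step until its acknowledgment arrives, which emulates a reliable link despite message losses.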

We first define the message and state spaces of the crash-recovery version  of a given crash-stop algorithm  as follows:

  • In , we send a pair of messages to each process in each step: the actual payload message (from ), or a special heartbeat message when no payload needs to be sent; and an acknowledgment message, confirming receipt of the last message on the channel in the opposite direction.

  • The local state of a process has three components : (1) stores the state of in the target crash-stop algorithm; (2) buff represents ’s outgoing message buffers, with one buffer for each process (including one for ); and (3) records the last message that received from . The buffers are LIFO, a choice which proves crucial for our termination proof (Section 6).
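The three state components can be sketched as a record. Field names follow the paper's prose description (they are not its formal notation), and pushing new messages at the head is what makes each buffer LIFO.

```python
from dataclasses import dataclass

@dataclass
class WrapperState:
    """Per-process state of the wrapped algorithm (a sketch).

    state -- the embedded crash-stop algorithm's local state
    buff  -- one LIFO outgoing buffer per process (head = most recent)
    acks  -- per peer, the last message received from that peer (or None)
    """
    state: object
    buff: dict
    acks: dict

def push(ws, dest, msg):
    """New outgoing messages go to the head of the buffer (LIFO)."""
    ws.buff[dest] = [msg] + ws.buff.get(dest, [])
```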

Next, given a crash-recovery state , a process , and the messages received by in the given round, we define ’s local step unfolding for and : the sequence of intermediate steps takes. Said differently, is a sequence of crash-recovery states and crash-stop labels , where the intermediate state represents the state of after processing the message of the -th process. In a crash-recovery run, does not actually transition through the states ; they are listed separately to show how the next “real” state to which will transition is computed. The unfolding also allows us to relate traces of and more easily in our proofs, as we produce a crash-stop run from a crash-recovery run when proving properties of the wrapper. The contents of ’s buffers change as we progress through the states of , as the wrapper routes the messages to and receives new ones from it. The failure detector output (recorded in the labels ) remains constant throughout the unfolding: all processes from whom no message was received in the crash-recovery step are suspected. Finally, the set of failed processes in each label is defined to be empty: we emulate the process recovery that is possible in the crash-recovery model by crash-stop runs in which no processes fail.

Finally, given a crash-stop algorithm with an unreliable failure detector and with the per-process state space , we define its crash-recovery version where:

  • the initial states of constitute the set of crash-recovery configurations such that there exists a crash-stop configuration satisfying the following: (i) the initial states of and correspond to each other and all buffers are empty, (ii) no messages are acknowledged, and (iii) the failed processes in and are the same.

  • the next state of a process is computed by unfolding, based on the messages received in this round.

  • the message that sends to a process pairs the first element of ’s (LIFO) buffer for with the acknowledgment for the last message that received from .

  • the execution is short-circuited as soon as a process decides. This is achieved by having the deciding process broadcast a message to all other processes, announcing that it has decided. When a process receives such a message, it immediately decides and short-circuits its own execution.

Short-circuiting behavior is a common pattern for consensus algorithms [8, 20]. It can be applied in a black-box way and it is sound for any crash-stop consensus algorithm.
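The short-circuiting rule can be sketched as a small message handler. The `DECIDED` tag and the function are illustrative assumptions of ours; soundness rests on the underlying algorithm's agreement property, as noted above.

```python
DECIDED = "decided"

def handle(my_decision, msg):
    """Short-circuiting (a sketch): a process that has not yet decided
    adopts any decision value announced by another process; a process
    that has already decided keeps its own value."""
    if my_decision is None and isinstance(msg, tuple) and msg[0] == DECIDED:
        return msg[1]  # adopt the announced decision immediately
    return my_decision
```

The first process to decide broadcasts `(DECIDED, v)`; since agreement guarantees any two decided values are equal, adopting `v` is always safe.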

A more formal description of the wrapper can be found in Appendix A. Given a run of the crash-recovery version of an algorithm , we define its state trace, , as the sequence of configurations obtained by removing the labels, projecting each configuration onto , and projecting each local state onto . We overload the function symbol to work on both crash-stop and crash-recovery runs. Note that both crash-stop and crash-recovery state traces are sequences of states from the same state space .

5 Preservation Results

As our first main result, we show that crash-recovery versions of algorithms produced by our wrapper preserve a wide class of safety properties. The class includes the safety properties of consensus: validity, integrity and agreement (Section 3). In other words, if a trace of a crash-recovery version of an algorithm violates a property, then some crash-stop trace of the same algorithm also violates that property. We show this in the non-probabilistic crash-recovery model. However, the result also translates to the probabilistic model, since all allowed traces of the probabilistic model are also traces of the non-probabilistic one.

Preserving all safety properties for all algorithms and failure detectors would be too strong a requirement, for two reasons. First, as our crash-recovery model assumes nothing about link or process reliability, in finite runs we can give no guarantees about the accuracy of the simulated failure detectors. Second, the crash-recovery model is synchronous, meaning that different processes take steps simultaneously. This is impossible in the crash-stop model, which is asynchronous. Thus, the simple safety property defined by “the local state of at most one process changes between two successive states in a trace” holds in the crash-stop model, but not in the crash-recovery model (equivalently, we can find a crash-recovery trace, but not a crash-stop trace, that violates the property).

We work around the first problem by assuming that the crash-stop algorithms use unreliable failure detectors. For the second problem, we restrict the class of safety properties that we wish to preserve as follows. Consider a property . Let and be runs with the same initial states (i.e., ) such that is a subword of (recall that we defined subwords earlier). Then belongs to the class of properties not repairable through detours if for all such and . Intuitively, this means that the sequence of states represented by inherently violates ; so adding “detours” by means of additional intermediate states (forming ) does not help satisfy .

The property is an example of a safety property that is repairable through detours: we can take any word that violates and extend it to a word that does not violate . However, we can easily show that the following safety properties are not repairable through detours:

  • The safety properties of consensus. E.g., consider the validity property: given a word such that is a non-initial and a non- value, adding further states between the initial state and does not change the fact that is neither initial nor .

  • State invariant properties, defined by a set of “good” states, such that for a trace , only if . Equivalently, these properties rule out traces which reach the “bad” states in the complement of . Intuitively, if a bad state is reached, we cannot fix it by adding more states before or after the bad state.
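Detour-irreparability of a state invariant can be checked concretely: once a trace reaches a bad state, every superword of that trace still contains it. A minimal sketch (function names are ours):

```python
def is_subword(u, w):
    """True iff u can be obtained from w by deleting letters (a subword)."""
    it = iter(w)
    return all(any(x == y for y in it) for x in u)

def holds_invariant(trace, good):
    """State-invariant property: every state of the trace lies in `good`."""
    return all(s in good for s in trace)
```

For `good = {0, 1}`, the trace `[0, 2]` violates the invariant; so does any superword of it, such as `[0, 1, 2, 1]` — no detour repairs the violation.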

We establish the following lemma, which is essential to the ensuing preservation theorem.

[Crash-recovery traces have crash-stop superwords] Let be a crash-stop algorithm with an unreliable failure detector. Let be a finite run of the crash-recovery wrapper . Then, there exists a finite run of such that is a subword of , , , no processes fail in , and in-flight messages of match the messages in the buffers of .

[Preservation of detour-irreparable safety properties] Let be a crash-stop algorithm with an unreliable failure detector, and let be a safety state property that is not repairable through detours. If satisfies , then so does .

Proof.

We prove the theorem’s statement by proving its contrapositive. Assume violates . By the definition of safety properties [4], there exists a finite run of such that no continuation of is in . By Lemma 5, there exists a run of such that is a subword of . By Lemma 3.1, can be extended to an infinite run of . By the choice of , we have that . As is a subword of , and since is not detour repairable, then also . Thus, also violates . ∎

Proving that the safety properties of consensus are detour-irreparable means that these properties are preserved by our wrapper. Since state invariants are also detour-irreparable, they too are preserved by our wrapper. This makes our wrapper potentially useful for reusing other kinds of crash-stop algorithms in a crash-recovery setting, not just consensus algorithms.

If satisfies the safety properties of consensus, then so does . If satisfies a state invariant, then so does .

6 Probabilistic Termination

Termination of consensus algorithms depends on stable periods, during which communication is reliable and no crashes or recoveries occur. In this section, we first state a general result about so-called selective stable periods for our probabilistic crash-recovery model. We then define a generic class of crash-stop consensus algorithms, which we call bounded algorithms. We prove that termination for these algorithms is guaranteed in our probabilistic model when run under the wrapper. Namely, we prove that, with probability 1, all processes eventually decide. We also show that the class of bounded algorithms covers a wide spectrum of existing algorithms including the celebrated Chandra-Toueg [8] and the instantiation of the indulgent framework of [20] that uses failure detectors.

6.1 Selective Stable Periods

Similar to [11], our proofs will rely on forming strongly-connected communication components between particular sets of processes. However, we will require their existence only for bounded periods of time, which we call selective stable periods.

[Selective stable period] Fix a crash-recovery algorithm . A selective-stable period of of length for a crash-recovery configuration and a set of processes , written , is the set of all sequences of crash-recovery configurations such that we have and there exist such that is a step of and is defined .

Such selective stable periods must occur in runs of a crash-recovery algorithm .

[Selective stable periods are mandatory] Fix a crash-recovery algorithm , a positive integer and a selection function , mapping crash-recovery configurations to process sets. Then, the set of crash-recovery runs that contain a selective stable period of length for some reached configuration and the process set selected by has probability 1.

The next section shows how our wrapper exploits such periods to construct crash-recovery algorithms from existing crash-stop ones in a blackbox manner. For future work, it might also be interesting to devise consensus algorithms directly on top of this property.

6.2 Bounded Algorithms

We next define the class of bounded crash-stop algorithms for which our wrapper guarantees termination in the crash-recovery setting. This class comprises algorithms which operate in rounds, with an upper bound on the number of messages exchanged per round as well as on the number of rounds that correct processes can be apart. More formally, they are defined as follows. [Bounded algorithms] A crash-stop consensus algorithm (using reliable links and a failure detector [8]) is said to be bounded if it satisfies all properties below:

  1. Communication-closed rounds: processes operate in rounds. The rounds must be communication-closed [14]: only the messages from the current round are considered.

  2. Externally triggered state changes: After the first step of every round, the processes change state only upon receipt of round messages, or on a change of the failure detector output.

  3. Bounded round messages: There exists a bound such that, in any round, a process sends at most messages to any other process.

  4. Bounded round gap: Let , that is, the number of correct processes according to the algorithm’s resilience criterion. Then, there exists a bound , such that the fastest processes are at most rounds apart.

  5. Bounded termination: There exists a bound such that for any reachable configuration where any fastest processes in are correct, the other processes are faulty, and the failure detector output at these processes is perfect after , all of these processes decide before any of them reaches the round .
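Two of these properties can be checked mechanically on concrete traces. The helpers below are a sketch with names of our own choosing: a round filter for communication-closed rounds (property 1) and a round-gap measure for the bounded round gap (property 4).

```python
def current_round_msgs(inbox, round_no):
    """Communication-closed rounds (property 1): only messages tagged
    with the process's current round number are considered."""
    return [m for (r, m) in inbox if r == round_no]

def fastest_gap(rounds, k):
    """Bounded round gap (property 4): the spread among the k fastest
    processes, i.e. max minus min over the k largest round numbers."""
    top = sorted(rounds, reverse=True)[:k]
    return top[0] - top[-1]
```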

We can check that an algorithm satisfies (i) property B4, by checking under what condition(s) a process increments its round number, and (ii) property B5, by observing the algorithm’s termination under perfect failure detection given a quorum of correct processes. Section 6.3 shows an example of how to check that properties B4 and B5 hold. Using this definition, we can prove that bounded algorithms terminate in selective stable periods of sufficient length.

[Bounded selective stable period termination] Let be a bounded algorithm and its wrapped crash-recovery version. Let be a reachable crash-recovery configuration of , and let be some set of fastest processes in . Then, there exists a bound , such that, for any selective stable period of length for and , all processes in decide in . Moreover, the bound is independent of the configuration .

Proof.

We partition the processes from into the set (initially ) and (initially ). The processes in will advance their rounds further, and the processes in will not advance, but will have already decided. Let be some slowest process from in . We first claim that advances or decides in any selective stable period for and of length at most , defined as . Denote the period configurations by . Then, using Lemma 5, we obtain a crash-stop configuration from that satisfies the conditions of bounded termination, with the processes from being correct, and the others faulty. We consider two cases.

First, if advances in the crash-stop model after receiving all the in-flight round messages in from other processes in , then it also advances in the crash-recovery model after receiving these messages. Moreover, our wrapper delivers all such messages within steps, as (a) it uses LIFO buffers; (b) the bounds and apply to by (iii) and (iv); and (c) by Lemma 5, the same bounds also apply to .

Second, if does not advance in the crash-stop model, then, since the failure detector output remains stable, and since no further round messages will be delivered to in the crash-stop model, requirement (ii) ensures that will not advance further in the crash-stop setting. Moreover, requirements (v) and (ii) ensure that must decide after receiving all of its round messages; we move to the set .

We have thus established that the slowest process from can either move to or advance its round after steps. Next, we claim that we can repeat this procedure by picking the slowest member of again. This is because the procedure ensures that the processes in always have round numbers lower than the processes in . Thus, due to (i) and (ii), the processes in cannot rely on those from for changing their state.

Lastly, we note that this procedure needs to be repeated at most times before all processes move to the round , by which point (v) guarantees that all processes terminate. Thus gives us the required bound . ∎

The main result of this paper shows that the wrapper guarantees all consensus properties for wrapped bounded algorithms, including termination.

[CR consensus preservation] If a bounded algorithm solves consensus in the crash-stop setting, then also solves consensus in the probabilistic crash-recovery setting.

Proof.

By Corollary 5, we conclude that solves the safety properties of consensus in the crash-recovery setting. For (probabilistic) termination, the result follows from Theorem 6.2 and Lemma 6.1, using as the selection function for Lemma 6.1 any function that selects some fastest processes in a configuration if no process has decided yet, and all processes if some process has decided. The latter allows us to propagate the decision to all processes, due to short-circuiting in . ∎

6.3 Examples of Bounded Algorithms

We next give two prominent examples of bounded algorithms: the Chandra-Toueg (CT) algorithm [8] and the instantiation of the indulgent framework of [20] that uses failure detectors. For these algorithms, rounds are composite, consisting of the combination of what the authors refer to as rounds and phases. Checking that the algorithms satisfy conditions (i)–(iii) is straightforward. Condition (iv) holds for the CT algorithm with : take the fastest process in a crash-stop configuration; if its CT round number is or less, the claim is immediate. Otherwise, must have previously moved out of phase 2 of the last round in which it was the coordinator, which implies that at least processes have also already executed . Since CT uses the rotating coordinator paradigm, ; as each round consists of phases, . For the algorithm from [20], processes only advance to the next round (which consists of two phases) when they receive messages from other processes. Thus, . Finally, proving requirement (v) is similar to, but simpler than, the original termination proofs for the algorithms, since it only requires termination under conditions that include perfect failure detector output. For space reasons, we do not provide the full proofs here, but we note that for both algorithms, where is the number of phases per algorithm round. Intuitively, within this many rounds the execution hits a round where a correct process is the coordinator; since we assume perfect failure detection for this period, no process will suspect this coordinator, and thus no process will move out of this round without deciding. The wrapped versions of Chandra-Toueg’s algorithm [8] and the instantiation of the indulgent framework of [20] using failure detectors solve consensus in the probabilistic crash-recovery model.

7 Concluding Remarks

This paper introduced new system models that closely capture the messy reality of distributed systems. Unlike the usual distributed computing models, we account for failure and recovery patterns of processes and links in a probabilistic and temporary, rather than a deterministic and perpetual, manner. Our models allow an unbounded number of processes and communication links to probabilistically fail and recover, potentially infinitely often. We showed how and under what conditions we can reuse existing crash-stop distributed algorithms in our crash-recovery systems. We presented a wrapper that allows crash-stop algorithms to be deployed unchanged in our crash-recovery models. The wrapper preserves the correctness of a wide class of consensus algorithms.

Our work opens several new directions for future investigation. First, we currently model failures of processes as well as communication links individually and independently, with a non-zero probability of failing/recovering at any point in time. In Appendix D, we sketch how our results can be extended to systems where some processes may never recover from failure. It would be interesting to investigate what results can be established with more complicated probability distributions, e.g., if the model is weakened to allow processes and links to fail/recover on average with some non-zero probability [15]. Second, our wrapper fully persists the processes’ state. Studying how to minimize the amount of persisted state while still allowing our results (or similar ones) to hold is another promising direction. Finally, we focus on algorithms that depend on the reliability of message delivery. Some algorithms, notably Paxos [23], do not. Finding a modular link abstraction for the crash-stop setting that identifies these algorithms is another interesting topic. For those algorithms, we speculate that preserving termination in the crash-recovery model is simpler.

References

  • [1] Marcos Kawazoe Aguilera, Wei Chen, and Sam Toueg. Failure detection and consensus in the crash-recovery model. Distributed computing, 13(2):99–125, 2000.
  • [2] Dan Alistarh, James Aspnes, Valerie King, and Jared Saia. Communication-efficient randomized consensus. In Distributed Computing, pages 61–75, 2014.
  • [3] Dan Alistarh, Seth Gilbert, Rachid Guerraoui, and Corentin Travers. How to solve consensus in the smallest window of synchrony. In DISC. Springer, 2008.
  • [4] Bowen Alpern and Fred Schneider. Defining Liveness. Information Processing Letters, 21:181–185, June 1985.
  • [5] James Aspnes, Hagit Attiya, and Keren Censor. Combining shared-coin algorithms. J. Parallel Distrib. Comput., 70(3):317–322, 2010.
  • [6] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT Press, 2008.
  • [7] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. J. ACM, 32(4), 1985.
  • [8] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. J. ACM, 43(2):225–267, 1996.
  • [9] Bernadette Charron-Bost, Martin Hutle, and Josef Widder. In search of lost time. Information Processing Letters, 110(21), 2010.
  • [10] Flavin Cristian. Understanding fault-tolerant distributed systems. Commun. ACM, 34(2):56–78, 1991.
  • [11] Danny Dolev, Roy Friedman, Idit Keidar, and Dahlia Malkhi. Failure detectors in omission failure environments. In PODC, pages 286–, 1997.
  • [12] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. J. ACM, 35(2):288–323, 1988.
  • [13] Dacfey Dzung, Rachid Guerraoui, David Kozhaya, and Yvonne Anne Pignolet. Never say never - probabilistic and temporal failure detectors. In IEEE International Parallel and Distributed Processing Symposium, IPDPS, pages 679–688, 2016.
  • [14] Tzilla Elrad and Nissim Francez. Decomposition of distributed programs into communication-closed layers. Science of Computer Programming, 2:155–173, 1982.
  • [15] Christof Fetzer, Ulrich Schmid, and Martin Susskraut. On the possibility of consensus in asynchronous systems with finite average response times. In 25th IEEE International Conference on Distributed Computing Systems, pages 271–280, 2005.
  • [16] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.
  • [17] Pierre Fraigniaud, Mika Göös, Amos Korman, Merav Parter, and David Peleg. Randomized distributed decision. LNCS, 27, 2014.
  • [18] Felix C. Freiling, Christian Lambertz, and Mila Majster-Cederbaum. Modular Consensus Algorithms for the Crash-Recovery Model. In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 287–292, December 2009.
  • [19] Eli Gafni. Round-by-round fault detectors: Unifying synchrony and asynchrony. In PODC, pages 143–152, 1998.
  • [20] Rachid Guerraoui and Michel Raynal. A generic framework for indulgent consensus. In ICDCS, pages 88–, 2003.
  • [21] Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. Consensus in asynchronous systems where processes can crash and recover. In Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems, pages 280–, 1998.
  • [22] Michel Hurfin, Achour Mostéfaoui, and Michel Raynal. A versatile family of consensus protocols based on chandra-toueg’s unreliable failure detectors. IEEE Transactions on Computers, 51(4):395–408, 2002.
  • [23] Leslie Lamport. The Part-time Parliament. ACM Trans. Comput. Syst., 16(2):133–169, 1998.
  • [24] Mikel Larrea, Cristian Martín, and Iratxe Soraluze. Communication-efficient leader election in crash–recovery systems. Journal of Systems and Software, 84(12):2186 – 2195, 2011.
  • [25] Neeraj Mittal, Kuppahalli L. Phaneesh, and Felix C. Freiling. Safe termination detection in an asynchronous distributed system when processes may crash and recover. Theor. Comput. Sci., 410(6-7):614–628, February 2009.
  • [26] H. Moniz, N.F. Neves, and M. Correia. Turquois: Byzantine consensus in wireless ad hoc networks. In DSN, 2010.
  • [27] Henrique Moniz, NunoFerreira Neves, Miguel Correia, and Paulo Veríssimo. Randomization can be a healer: Consensus with dynamic omission failures. In LNCS, volume 5805. 2009.
  • [28] Rui Oliveira, Rachid Guerraoui, and André Schiper. Consensus in the crash-recover model. 1997.
  • [29] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228–234, 1980.
  • [30] André Schiper. Early consensus in an asynchronous system with a weak failure detector. Distrib. Comput., 10(3):149–157, 1997.
  • [31] Ulrich Schmid, Bettina Weiss, and Idit Keidar. Impossibility results and lower bounds for consensus under link failures. SIAM Journal on Computing, 38(5):1912–1951, 2009.

Appendix A Wrapper Description

In this section we provide the formal description of the wrapper that produces the crash-recovery version of a crash-stop algorithm . The message and local state spaces are denoted by and respectively.

The message space of is defined as . is the message set of which we extend with a special heartbeat message being sent when no message needs to be sent. The second part of each message represents an acknowledgment, confirming the receipt of the last message on the channel in the opposite direction. The message indicates that no messages (including acknowledgment messages) have been received so far.

The types of the components of a record are

  • of type storing the state of in the target crash-stop algorithm.

  • buff of type , where we recall that stands for the set of finite sequences of messages from . These are ’s outgoing message buffers, with one buffer for each process in the system (including one for ).

  • acks of type . This records, for each process , the last message that received from (if any). This will be used for acknowledgments.

Given a crash-recovery state , a process , and the partial function of messages received by in the given round, we define ’s local step unfolding for and , written as follows. First, let:

  • be if , and let be otherwise. That is, is the message from that receives, using to indicate that no message was received.

  • Let . That is, all processes from whom no message was received in this step are suspected.

Then, unless has already decided (in which case it simply broadcasts the decision), the following sequence of intermediate steps is taken. I.e., defines a sequence of crash-recovery states and crash-stop labels , where the intermediate state represents the state record of after processing the message of the -th process, defined as follows:

  1. Recalling that we number the processes from to , and are computed as follows, for :

    1. Unpack the message from , if one has been received. Let if , and let and otherwise. If , we check that the message has not been acknowledged yet, , in which case we feed the message to ’s next function: . We also need the next state if no message (or a duplicate) has been received, as needs to be a transition of the crash-stop system. Hence, if or , we feed in and not , i.e., .

    2. We set or accordingly. In both cases, the set of failed processes in this label is empty.

    3. We remove the acknowledged message (if any) from the head of the outgoing buffer for this process. More precisely, let be the buffer obtained as follows. First, copy the buffer of all other processes except .
      for all .

      Next, if there are messages to send to process , i.e., , and if and is equal to , then we let be ; otherwise, let .

      To add a potential new message to the buffer, let . These are the new messages that wishes to send, at most one destined to each process in the system. We define , if is undefined, and otherwise. Notice that we add the new message at the head of the list; as we also remove messages from the head in the function, our buffers are LIFO.

    4. If and , then , and for .
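One intermediate step of this unfolding can be sketched as follows. Here `next_fn` stands in for the crash-stop algorithm's transition function, and all names are assumptions of ours rather than the paper's notation: a message is fed to the algorithm only if it is present and not a duplicate (per the acks record), a transition is taken either way, and the last received message is remembered for acknowledgment.

```python
def unfold_one(state, acks, sender, msg, next_fn, suspected):
    """One intermediate step of p's local step unfolding (a sketch).

    Feed msg from `sender` to the crash-stop transition function unless
    it is absent or a duplicate (already recorded in acks); in that case
    we still take a transition, but with no message. The failure set of
    the emulated crash-stop step is always empty.
    """
    duplicate = msg is None or acks.get(sender) == msg
    new_state = next_fn(state, None if duplicate else msg, suspected)
    new_acks = dict(acks)
    if msg is not None:
        new_acks[sender] = msg  # remember it, so we can acknowledge it
    return new_state, new_acks
```

Iterating this over all peers, in process-number order, yields the full unfolding of one crash-recovery step.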

Finally, given a crash-stop algorithm using an unreliable failure detector with range and with the per-process state space , we define its crash-recovery version where:

  • is the set of crash-recovery configurations such that there exists a crash-stop configuration such that:

    1. for each :

      1. for all . That is, initially, no messages are buffered.

      2. for all . Initially, no messages are acknowledged.

  • .

  • is:

    • if , and

    • , otherwise, i.e., if we have nothing else to send.

Appendix B Markov chains

Our probabilistic crash-recovery model uses Markov chains [6]. We recall the basic notions here which are relevant for our proofs.

A (discrete-time) Markov chain is a tuple $(S, P, \iota_{\mathrm{init}})$ where:

  • $S$ is a countable, non-empty set of states,

  • $P \colon S \times S \to [0,1]$ is the transition probability function such that for all states $s \in S$: $\sum_{s' \in S} P(s, s') = 1$,

  • $\iota_{\mathrm{init}} \colon S \to [0,1]$ is the initial distribution, such that $\sum_{s \in S} \iota_{\mathrm{init}}(s) = 1$.

For a Markov chain, the cylinder set $\mathrm{Cyl}(\pi)$ spanned by a finite word $\pi$ over $S$ is defined as the set of all infinite words over $S$ that have $\pi$ as a prefix. These sets serve as the basis events of the $\sigma$-algebras of Markov chains. If $\pi = s_0 s_1 \ldots s_n$, then $\Pr(\mathrm{Cyl}(\pi)) = \iota_{\mathrm{init}}(s_0) \cdot \prod_{i=0}^{n-1} P(s_i, s_{i+1})$.

Appendix C Omitted Proofs

c.1 Proof of Lemma 5

Informally, we obtain by unfolding the crash-recovery steps of . In other words, we extend each crash-recovery step to its corresponding crash-stop steps. Recall that a single step in the crash-recovery setting may abstract the receipt of multiple messages, while every received message in the crash-stop setting corresponds to a separate step of the algorithm.

The only difficulty is handling process recovery, which does not exist in the crash-stop system. Here, we exploit the fact that, in asynchronous systems, crashed processes are indistinguishable from delayed ones. This enables us to keep free from failed processes.

We now state our claim more formally, and then prove it by induction. We claim that there exists a finite run such that:

  1. is a subword of , with the same first and last character: and .

  2. No failures occur in , for all .

  3. In-flight messages of correspond to messages in the buffers of ,
    . Here, we overload , writing even though buff is a sequence.

Proving the base case is easy, as is an initial state of , and both the in-flight set of messages and all the buffers are empty for initial configurations (follows from our wrapper’s definition).

We now prove the step case. Let and be as above, and let:

  • ,

  • ,

  • , and

  • be the extension of with .

In other words, is the next “letter” we could add to a valid run of . Thus, we will prove that we can extend with a series of transitions, to obtain a run such that the claim holds for and . The extension is obtained by concatenating the unfoldings of each process’s local step. More precisely, we define the sequences (fragments) where is the unfolding of process ’s local steps:

  • if , then

  • if , then is the sequence such that:

    • is

    • for and (the state of other processes does not change)

    • , for (messages delivered to are removed from the in-flight message set and new messages sent by are added)

    • for

Since the successive fragments overlap at one state, we define