Fault-Tolerant Consensus with an Abstract MAC Layer

In this paper, we study fault-tolerant distributed consensus in wireless systems. In more detail, we produce two new randomized algorithms that solve this problem in the abstract MAC layer model, which captures the basic interface and communication guarantees provided by most wireless MAC layers. Our algorithms work for any number of failures, require no advance knowledge of the network participants or network size, and guarantee termination with high probability after a number of broadcasts that is polynomial in the network size. Our first algorithm satisfies the standard agreement property, while our second accepts a looser agreement property, in which most nodes agree on the same value, in exchange for a faster termination guarantee. These are the first known fault-tolerant consensus algorithms for this model. In addition to our main upper bound results, we explore the gap between the abstract MAC layer and the standard asynchronous message passing model by proving that consensus is impossible in the latter in the absence of information regarding the network participants, even if we assume no faults, allow randomized solutions, and provide the algorithm a constant-factor approximation of the network size.


1 Introduction

Consensus provides a fundamental building block for developing reliable distributed systems [24, 23, 25]. Accordingly, it is well studied in many different system models [36]. Until recently, however, little was known about solving this problem in distributed systems made up of devices communicating using commodity wireless cards. Motivated by this knowledge gap, this paper studies consensus in the abstract MAC layer model, which abstracts the basic behavior and guarantees of standard wireless MAC layers. In recent work [43], we proved deterministic fault-tolerant consensus is impossible in this setting. In this paper, we describe and analyze the first known randomized fault-tolerant consensus algorithms for this well-motivated model.

The Abstract MAC Layer. Most existing work on distributed algorithms for wireless networks assumes low-level synchronous models that force algorithms to directly grapple with issues caused by contention and signal fading. Some of these models describe the network topology with a graph (c.f., [8, 28, 32, 39, 16, 20]), while others use signal strength calculations to determine message behavior (c.f., [40, 38, 21, 26, 27, 17]).

As also emphasized in [43], these models are useful for asking foundational questions about distributed computation on shared channels, but are not so useful for developing algorithmic strategies suitable for deployment. In real systems, algorithms typically do not operate in synchronous rounds and they are not provided unmediated access to the radio. They must instead operate on top of a general-purpose MAC layer which is responsible for many network functions, including contention management, rate control, and co-existence with other network traffic.

Motivated by this reality, in this paper we adopt the abstract MAC layer model [34], an asynchronous broadcast-based communication model that captures the basic interfaces and guarantees provided by common existing wireless MAC layers. In more detail, if you provide the abstract MAC layer a message to broadcast, it will eventually be delivered to nearby nodes in the network. The specific means by which contention is managed—e.g., CSMA, TDMA, uniform probabilistic routines such as DECAY [8]—is abstracted away by the model. At some point after the contention management completes, the abstract MAC layer passes back an acknowledgment indicating that it is ready for the next message. This acknowledgment contains no information about the number or identities of the message recipients.

(In the case of the MAC layer using CSMA, for example, the acknowledgment would be generated after the MAC layer detects a clear channel. In the case of TDMA, the acknowledgment would be generated after the device’s turn in the TDMA schedule. In the case of a probabilistic routine such as DECAY, the acknowledgment would be generated after a sufficient number of attempts to guarantee successful delivery to all receivers with high probability.)
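To make this interface concrete, here is a minimal node-side sketch in Python. The class and method names (bcast, on_recv, on_ack) are illustrative stand-ins, not the model's formal notation; the point is that bcast is blind and on_ack carries no information about the receivers.

from abc import ABC, abstractmethod

class AbstractMacProcess(ABC):
    """Sketch of the node-side interface suggested by the abstract MAC layer
    model; all names here are illustrative, not taken from the paper."""

    def __init__(self, mac_layer):
        self._mac = mac_layer  # hypothetical handle used only to hand messages down

    def bcast(self, message):
        # Hand the message to the MAC layer; contention management (CSMA,
        # TDMA, DECAY, ...) is hidden below this call.
        self._mac.submit(self, message)

    @abstractmethod
    def on_init(self):
        """Scheduled once at the beginning of the execution."""

    @abstractmethod
    def on_recv(self, message):
        """Scheduled when a message broadcast by another node is delivered."""

    @abstractmethod
    def on_ack(self):
        """Scheduled after the last broadcast reached every non-crashed node.
        Note: no receiver count or identities are reported."""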

The abstract MAC abstraction, of course, does not attempt to provide a detailed representation of any specific existing MAC layer. Real MAC layers offer many more modes and features than are captured by this model. In addition, the variation studied in this paper assumes messages are always delivered, whereas more realistic variations would allow for occasional losses.

This abstraction, however, still serves to capture the fundamental dynamics of real wireless application design in which the lower layers dealing directly with the radio channel are separated from the higher layers executing the application in question. An important goal in studying this abstract MAC layer, therefore, is attempting to uncover principles and strategies that can close the gap between theory and practice in the design of distributed systems deployed on standard layered wireless architectures.

Our Results. In this paper, we study randomized fault-tolerant consensus algorithms in the abstract MAC layer model. In more detail, we study binary consensus and assume a single-hop network topology. Notice that our use of randomization is necessary, as deterministic consensus is impossible in the abstract MAC layer model in the presence of even a single fault (see our generalization of FLP from [43]).

To contextualize our results, we note that the abstract MAC layer model differs from standard asynchronous message passing models in two main ways: (1) the abstract MAC layer model provides the algorithm no advance information about the network size or membership, requiring nodes to communicate with a blind broadcast primitive instead of using point-to-point channels; and (2) the abstract MAC layer model provides an acknowledgment to the broadcaster at some point after its message has been delivered to all of its neighbors. This acknowledgment, however, contains no information about the number or identity of these neighbors (see above for more discussion of this fundamental feature of standard wireless MAC layers).

Most randomized fault-tolerant consensus algorithms in the asynchronous message passing model strongly leverage knowledge of the network. A strategy common to many of these algorithms, for example, is to repeatedly collect messages from at least n - f nodes in a network of size n with at most f crash failures (e.g., [9]). This strategy does not work in the abstract MAC layer model, as nodes do not know n.

To overcome this issue, we adapt an idea introduced in early work on fault-tolerant consensus in the asynchronous shared memory model: counter racing (e.g., [12, 5]). At a high level, this strategy has nodes with initial value 0 advance a shared-memory counter associated with 0, while nodes with initial value 1 advance a counter associated with 1. If a node sees one counter get ahead of the other, it adopts the initial value associated with the larger counter, and if a counter gets sufficiently far ahead, then nodes can decide.
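As a toy illustration of the counter-racing idea (a minimal single-threaded sketch under assumed conventions, not the algorithm of [12, 5]): two shared counters race, a node adopts the leading value, and a fixed lead, here the placeholder constant MARGIN, triggers a decision.

MARGIN = 2  # placeholder decision margin; the real algorithms fix their own constants

def counter_race_step(counters, my_value):
    """One illustrative step against shared counters[0] and counters[1]:
    adopt the leading value, decide if it leads by MARGIN, else advance it."""
    c0, c1 = counters[0], counters[1]
    if c1 > c0:
        my_value = 1          # adopt the winning value (ties change nothing)
    elif c0 > c1:
        my_value = 0
    if counters[my_value] >= counters[1 - my_value] + MARGIN:
        return my_value, my_value          # (current value, decision)
    counters[my_value] += 1                # advance the counter we support
    return my_value, None                  # no decision yet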

Our first algorithm (presented in Section 3) implements a counter race of sorts using the acknowledged blind broadcast primitive provided by the model. Roughly speaking, nodes continually broadcast their current proposal and counter, and update both based on the pairs received from other nodes. Proving safety for this type of strategy in shared memory models is simplified by the atomic nature of register accesses. In the abstract MAC layer model, by contrast, a broadcast message is delivered non-atomically to its recipients, and in the case of a crash, may not arrive at some recipients at all. (Footnote: We note that register simulations are also not an option in our model for two reasons: standard simulation algorithms require knowledge of n and a majority of correct nodes, whereas we assume no knowledge of n and require wait-freedom.) Our safety analysis, therefore, requires novel analytical tools that tame a more diverse set of possible system configurations.

To achieve liveness, we use a technique loosely inspired by the randomized delay strategy introduced by Chandra in the shared memory model [12]. In more detail, nodes probabilistically decide to replace certain sequences of their counter updates with placeholders. We show that if these probabilities are adapted appropriately, the system eventually arrives at a state where it becomes likely for only a single node to be broadcasting real updates, allowing progress toward termination.

Formally, we prove that with high probability in the network size n, the algorithm terminates after a number of scheduled broadcasts that is polynomial in n. This holds regardless of which broadcasts are scheduled (i.e., we do not impose a fairness condition), and regardless of the number of faults. The algorithm, as described, assumes nodes are provided unique IDs that we treat as comparable black boxes (to prevent them from leaking network size information). We subsequently show how to remove that assumption by describing an algorithm that generates unique IDs in this setting with high probability.

Our second algorithm (presented in Section 4) accepts a looser agreement guarantee in exchange for more efficiency. In more detail, we describe and analyze a solution to almost-everywhere agreement [18], which guarantees that most nodes agree on the same value. This algorithm terminates after a number of broadcasts that is a linear factor (in n) smaller than our first algorithm's bound, ignoring log factors. The almost-everywhere consensus algorithm consists of two phases. The first phase is used to ensure that almost all nodes obtain a good approximation of the network size. In the second phase, nodes use this estimate to perform a sequence of broadcasts meant to help spread their proposal to the network. Nodes that did not obtain a good estimate in Phase 1 will leave Phase 2 early. The remaining nodes, however, can leverage their accurate network size estimates to probabilistically sample a subset to actively participate in each round of broadcasts. To break ties between simultaneously active nodes, each chooses a random rank using the estimate obtained in Phase 1. We show that with high probability, after not too long, there exists a round of broadcasts in which the first node receiving its acknowledgment is both active and has the minimum rank among the active nodes, allowing its proposal to spread to all remaining nodes.
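The following sketch illustrates the Phase 2 pattern just described. The sampling probability and the rank range are hypothetical stand-ins keyed to the Phase 1 estimate; the paper's exact parameters differ.

import random

def phase2_round_choices(estimate_n):
    """Illustrative per-round choices: decide whether to actively broadcast this
    round, and draw a tie-breaking rank. Both the sampling probability and the
    rank range are placeholder choices, not the paper's parameters."""
    active = random.random() < min(1.0, 10.0 / estimate_n)  # expect only a few active nodes
    rank = random.randrange(estimate_n ** 3)                # large range makes collisions unlikely
    return active, rank

def round_winner(round_choices):
    """Among nodes that were active this round, the minimum rank wins and its
    proposal spreads; returns None if nobody was active."""
    active_ranks = [(rank, node) for node, (act, rank) in round_choices.items() if act]
    return min(active_ranks)[1] if active_ranks else None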

Finally, we explore the gap between the abstract MAC layer model and the related asynchronous message passing model. We prove (in Section 5) that consensus is impossible in the asynchronous message passing model in the absence of knowledge of the network participants, even if we assume no faults, allow randomized algorithms, and provide a constant-factor approximation of the network size n. This differs from the abstract MAC layer model, in which we solve this problem without network participant or network size information, and in the presence of crash failures. This result implies that the acknowledgments provided by the abstract MAC layer model are crucial to overcoming the difficulties induced by limited network information.

Related Work. Consensus provides a fundamental building block for reliable distributed computing [24, 23, 25]. It is particularly well-studied in asynchronous models [35, 46, 42, 2].

The abstract MAC layer approach to modeling wireless networks was introduced in [33] (later expanded to a journal version [34]), and has been subsequently used to study several different problems [14, 29, 30, 15, 43]. (Footnote: There is no one abstract MAC layer model. Different studies use different variations. They all share, however, the same general commitment to capturing the types of interfaces and communication/timing guarantees that are provided by standard wireless MAC layers.) The most relevant of this related work is [43], which was the first paper to study consensus in the abstract MAC layer model. This previous paper generalized the seminal FLP [19] result to prove that deterministic consensus is impossible in this model even in the presence of a single failure. It then goes on to study deterministic consensus in the absence of failures, identifying the pursuit of fault-tolerant randomized solutions as important future work, the challenge taken up here.

We note that other researchers have also studied consensus using high-level wireless network abstractions. Vollset and Ezhilchelvan [47], and Alekeish and Ezhilchelvan [4], study consensus in a variant of the asynchronous message passing model where pairwise channels come and go dynamically—capturing some behavior of mobile wireless networks. Their correctness results depend on detailed liveness guarantees that bound the allowable channel changes. Wu et al. [48] use the standard asynchronous message passing model (with unreliable failure detectors [13]) as a stand-in for a wireless network, focusing on how to reduce message complexity (an important metric in a resource-bounded wireless setting) in solving consensus.

A key difficulty for solving consensus in the abstract MAC layer model is the absence of advance information about network participants or size. These constraints have also been studied in other models. Ruppert [45], and Bonnet and Raynal [10], for example, study the amount of extra power needed (in terms of shared objects and failure detection, respectively) to solve wait-free consensus in anonymous versions of the standard models. Attiya et al. [6] describe consensus solutions for shared memory systems without failures or unique ids. A series of papers [11, 22, 3], starting with the work of Cavin et al. [11], study the related problem of consensus with unknown participants (CUPs), where nodes are only allowed to communicate with other nodes whose identities have been provided by a participant detector formalism.

Closer to our own model is the work of Abboud et al. [1], which also studies a single hop network where nodes broadcast messages to an unknown group of network participants. They prove deterministic consensus is impossible in these networks without knowledge of the network size. In this paper, we extend these existing results by proving this impossibility still holds even if we allow randomized algorithms and provide the algorithm a constant-factor approximation of the network size. This bound opens a sizable gap with our abstract MAC layer model, in which consensus is solvable without this network information.

We also consider almost-everywhere (a.e.) agreement [18], a weaker variant of consensus, where a small number of nodes are allowed to decide on conflicting values, as long as a sufficiently large majority agrees. Recently, a.e. agreement has been studied in the context of peer-to-peer networks (c.f. [31, 7]), where the adversary can isolate small parts of the network, thus rendering (everywhere) consensus impossible. We are not aware of any prior work on a.e. agreement in the wireless setting.

2 Model and Problem

In this paper, we study a variation of the abstract MAC layer model, which describes a system consisting of a single-hop network of computational devices (called nodes in the following) that communicate wirelessly using communication interfaces and guarantees inspired by commodity wireless MAC layers.

In this model, nodes communicate with a bcast(m) primitive that guarantees to eventually deliver the broadcast message m to all the other nodes (i.e., the network is single hop). At some point after a given bcast(m) has succeeded in delivering m to all other nodes, the broadcaster receives an ack(m) informing it that the broadcast is complete (as detailed in the introduction, this captures the reality that most wireless contention management schemes have a definitive point at which they know a message broadcast is complete). This acknowledgment contains no information about the number or identity of the receivers.

We assume a node can only broadcast one message at a time. That is, once it invokes bcast(m), it cannot broadcast another message until receiving the corresponding ack(m) (formally, overlapping messages are discarded by the MAC layer). We also assume any number of nodes can permanently stop executing due to crash failures. As in the classical message passing models, a crash can occur during a broadcast, meaning that some nodes might receive the message while others do not.

This model is event-driven, with the relevant events scheduled asynchronously by an arbitrary scheduler. In more detail, for each node u, there are four event types relevant to u that can be scheduled: init (which occurs at the beginning of an execution and allows u to initialize), recv(m) (which indicates that u has received a message m broadcast by another node), ack(m) (which indicates that the message m broadcast by u has been successfully delivered), and crash (which indicates that u is crashed for the remainder of the execution).

A distributed algorithm specifies for each node a finite collection of steps to execute for each of the non-crash event types. When one of these events is scheduled by the scheduler, we assume the corresponding steps are executed atomically at the point that the event is scheduled. Notice that one of the steps that a node can take in response to these events is to invoke a bcast(m) primitive for some message m. When an event includes a bcast primitive we say it is combined with a broadcast. (Footnote: Notice, we can assume, without loss of generality, that the steps executed in response to an event never invoke more than a single bcast primitive, as any additional broadcasts invoked at the same time would lead to the messages being discarded, due to the model constraint that a node must receive an ack for its current message before broadcasting a new one.)

We place the following constraints on the scheduler. It must start each execution by scheduling an init event for each node; i.e., we study the setting where all participating nodes are activated at the beginning of the execution. If a node u invokes a valid bcast(m) primitive, then for each node v other than u that is not crashed when the broadcast primitive is invoked, the scheduler must subsequently schedule either a single recv(m) event or a crash event at v. At some point after these events are scheduled, it must then eventually schedule an ack(m) event at u. These are the only recv and ack events it schedules (i.e., it cannot create new messages from scratch or cause messages to be received or acknowledged multiple times). If the scheduler schedules a crash event at a node, it cannot subsequently schedule any future events at that node.
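A small bookkeeping sketch of these scheduler obligations (illustrative names and data structures, not from the paper): each bcast owes one recv to every node that was not crashed when it was invoked, the ack may be scheduled only after those recvs (or crashes) happen, and a crashed node gets no further events.

import collections

class SchedulerState:
    """Tracks what the asynchronous scheduler still owes each node."""

    def __init__(self):
        self.crashed = set()
        self.owed_recvs = collections.defaultdict(set)  # (sender, msg_id) -> receivers still owed a recv

    def on_bcast(self, sender, msg_id, all_nodes):
        # One recv owed to every other node not crashed at broadcast time.
        self.owed_recvs[(sender, msg_id)] = {
            v for v in all_nodes if v != sender and v not in self.crashed}

    def deliver(self, sender, msg_id, receiver):
        # Schedules the single recv event owed to this receiver.
        self.owed_recvs[(sender, msg_id)].discard(receiver)

    def crash(self, node):
        # A crashed node receives no further events.
        self.crashed.add(node)

    def ack_allowed(self, sender, msg_id):
        # The ack at the sender may be scheduled only once every owed recv
        # was either delivered or its receiver has crashed.
        return all(v in self.crashed for v in self.owed_recvs[(sender, msg_id)])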

We assume that in making each event scheduling decision, the scheduler can use the schedule history as well as the algorithm definition, but it does not know the nodes' private states (which include the nodes' random bits). When the scheduler schedules an event that triggers a broadcast (making it a combined event), it is provided this information so that it knows it must now schedule receive events for the message. We assume, however, that the scheduler does not learn the contents of the broadcast message. (Footnote: This adversary model is sometimes called message oblivious, and it is commonly considered a good fit for schedulers that control network behavior. This follows because it allows the scheduler to adapt the schedule based on the number of messages being sent and their sources, enabling it to model contention and load factors. On the other hand, there is no good justification for the idea that this schedule should somehow also depend on the specific bits contained in the messages sent.) Notice, our liveness proof specifically leverages the message-oblivious assumption, as it prevents the scheduler from knowing which nodes are broadcasting real counter updates and which are broadcasting placeholder messages.

Given an execution, we say its message schedule is the sequence of events scheduled in that execution. We assume that a message schedule includes indications of which events are combined with broadcasts.

The Consensus Problem. In this paper, we study binary consensus with probabilistic termination. In more detail, at the beginning of an execution each node is provided an initial value from {0, 1} as input. Each node has the ability to perform a single irrevocable decide action for either value 0 or 1. To solve consensus, an algorithm must guarantee the following three properties: (1) agreement: no two nodes decide different values; (2) validity: if a node decides value b, then at least one node started with initial value b; and (3) termination (probabilistic): every non-crashed node decides with probability 1 in the limit.
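For concreteness, a minimal sketch (a hypothetical helper, not from the paper) of how one might check the two safety properties on the recorded outcome of a single execution:

def check_safety(initial_values, decisions):
    """initial_values: dict node -> 0/1 input; decisions: dict node -> decided value
    (crashed or still-undecided nodes are simply absent from decisions)."""
    decided = set(decisions.values())
    agreement = len(decided) <= 1                                   # no two nodes decide differently
    validity = all(v in initial_values.values() for v in decided)   # every decision traces back to an input
    return agreement, validity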

Studying finite termination bounds is complicated in asynchronous models because the scheduler can delay specific nodes taking steps for arbitrarily long times. In this paper, we circumvent this issue by proving bounds on the number of scheduled events before the system reaches a termination state in which every non-crashed node has: (a) decided; or (b) will decide whenever the scheduler gets around to scheduling its next event.

Finally, in addition to studying consensus with standard agreement, we also study almost-everywhere agreement, in which only a specified majority fraction of the total nodes must agree.

Algorithm 1: Counter Race Consensus (for node u with UID and initial value). The pseudocode specifies the initialization broadcast, the update and decision rules applied during each ack event, and the handling of received counter and decide messages; Section 3.1 summarizes this behavior.

Algorithm 2: The updateEstimate subroutine called by Counter Race Consensus during a recv event, which updates a node's network size estimate using the UID and estimate contained in the received message.

3 Consensus Algorithm

Here we describe and analyze our randomized binary consensus algorithm, counter race consensus (see Algorithms 1 and 2 for pseudocode, and Section 3.1 for a high-level description of its behavior). This algorithm assumes no advance knowledge of the network participants or network size. Nodes are provided unique IDs, but these are treated as comparable black boxes, preventing them from leaking information about the network size. (We will later discuss how to remove the unique ID assumption.) It tolerates any number of crash faults.

3.1 Algorithm Description

The counter race consensus algorithm is described in pseudocode in the figures labeled Algorithm 1 and Algorithm 2. Here we summarize the behavior formalized by this pseudocode.

The core idea of this algorithm is that each node u maintains a counter (initialized to zero) and a proposal (initialized to its consensus initial value). Node u repeatedly broadcasts its counter and proposal, updating these values before each broadcast. That is, during the ack event for its last broadcast of these values, node u applies a set of update rules to them. It then concludes the event by broadcasting the updated values. This pattern repeats until u arrives at a state where it can safely commit to deciding a value.

The update rules and decision criteria applied during the ack event are straightforward. Each node u first calculates the largest counter value it has sent or received in a message containing proposal value 0, and the largest counter value it has sent or received in a message containing proposal value 1.

If the largest counter associated with 1 is strictly larger than the largest counter associated with 0, then u sets its proposal to 1, and if the largest counter associated with 0 is strictly larger, it sets its proposal to 0. That is, u adopts the proposal that is currently "winning" the counter race (in case of a tie, it does not change its proposal).

Node u then checks to see if either value is winning by a large enough margin to support a decision. In more detail, if the counter associated with one value is ahead of the counter associated with the other value by at least the fixed margin used by the algorithm, then u commits to deciding that value.

What happens next depends on whether or not u committed to a decision. If u did not commit to a decision, then it must update its counter value. To do so, it compares its current counter to the largest counters it has seen for 0 and for 1. If its counter is smaller than one of these, it sets its counter to the largest counter it has seen. Otherwise, if its counter is already the largest counter that u has sent or received so far, it increments its counter by one. Either way, its counter increases. At this point, u can complete the ack event by broadcasting a message containing its newly updated proposal and counter values.
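Since the pseudocode itself is only summarized here, the following is a hedged Python sketch of the ack-event logic just described. The class and variable names, the MARGIN constant, and the message tuple formats are illustrative stand-ins rather than the paper's notation, and the placeholder/group logic discussed next (as well as the initial broadcast and the halt that follows the acknowledgment of a decide message) is omitted.

MARGIN = 2  # illustrative decision margin; the paper fixes its own constant

class CounterRaceNode:
    def __init__(self, uid, initial_value):
        self.uid = uid
        self.value = initial_value      # current proposal in {0, 1}
        self.counter = 0                # this node's race counter
        self.max_seen = [0, 0]          # largest counter sent/received per proposal value
        self.committed = None           # value we have committed to deciding, if any

    def on_ack(self):
        """Update rules applied when the previous broadcast is acknowledged;
        returns the next message to broadcast."""
        if self.committed is not None:
            return ("decide", self.committed)   # node decides and halts once this is acked
        c0, c1 = self.max_seen
        # Adopt the value that is currently winning (ties change nothing).
        if c1 > c0:
            self.value = 1
        elif c0 > c1:
            self.value = 0
        # Commit if one value is far enough ahead.
        if self.max_seen[self.value] >= self.max_seen[1 - self.value] + MARGIN:
            self.committed = self.value
            return ("decide", self.committed)
        # Otherwise advance the counter: catch up to the largest counter seen,
        # or increment it if we already hold the largest.
        if self.counter < max(c0, c1):
            self.counter = max(c0, c1)
        else:
            self.counter += 1
        self.max_seen[self.value] = max(self.max_seen[self.value], self.counter)
        return ("counter", self.counter, self.value)

    def on_recv(self, message):
        # Record the largest counter seen for each proposal value.
        if message[0] == "counter":
            _, counter, value = message
            self.max_seen[value] = max(self.max_seen[value], counter)
        elif message[0] == "decide":
            self.committed = message[1]

Note that on_recv only records information; all state changes that drive the race happen during ack events, mirroring the description above.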

On the other hand, if u committed to deciding value b, then it will send a decide message to inform the other nodes of its decision. On subsequently receiving an ack for this message, u will decide b and halt. Similarly, if u ever receives a decide message for a value b from another node, it will commit to deciding b. During its next ack event, it will send its own decide message and then decide and halt on its corresponding ack. That is, node u will not decide a value until it has broadcast its commitment to do so and received an ack for that broadcast.

The behavior described above guarantees agreement and validity. It is not sufficient, however, to achieve liveness, as an ill-tempered scheduler can conspire to keep the race between 0 and 1 too close for a decision commitment. To overcome this issue we introduce a random delay strategy that has nodes randomly step away from the race for a while by replacing their broadcast values with placeholders that are ignored by those who receive them. Because our adversary does not learn the contents of broadcast messages, it does not know which nodes are actively participating and which nodes are taking a break (as in both cases, nodes continually broadcast messages), thwarting its ability to effectively manipulate the race.

In more detail, each node partitions its broadcasts into groups of a fixed constant size; call this size c. At the beginning of each such group, the node flips a weighted coin to determine whether or not to replace the counter and proposal values it broadcasts during this group with placeholders, eliminating its ability to affect other nodes' counter and proposal values. As we will later elaborate in the liveness analysis, the goal is to identify a point in the execution at which a single node u is broadcasting its real values while all other nodes are broadcasting placeholders, allowing u to advance its proposal sufficiently far ahead to win the race.

To be more specific about the probabilities used in this logic, each node u maintains an estimate of the number of nodes in the network. It replaces its values with placeholders in a given group with a probability determined by this estimate. (In the pseudocode, a flag indicates whether or not u is using placeholders in the current group.) Node u initializes its estimate to a small default value. It then updates the estimate by calling the updateEstimate routine (described in Algorithm 2) for each message it receives.

There are two ways for this routine to update the estimate. The first is if the number of unique IDs from which u has received messages so far is larger than its current estimate. In this case, it sets the estimate to this count. The second is if it learns another node has a larger estimate. In this case, it adopts that larger estimate. Node u learns about other nodes' estimates because the algorithm has each node append its current estimate to all of its messages (with the exception of decide messages). In essence, the nodes run a network size estimation routine in parallel with the main counter race logic; as nodes refine their estimates, their probability of taking useful breaks improves.
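A hedged sketch of the estimation and group bookkeeping just described. The names, the default initial estimate, the default group length, and the use of 1/estimate as the probability of broadcasting real values in a group are illustrative assumptions, not the paper's exact parameters.

import random

class EstimateState:
    def __init__(self, group_length=5):         # group_length stands in for the constant c
        self.ids_seen = set()
        self.estimate = 1                        # assumed small default initial estimate
        self.group_length = group_length
        self.broadcasts_done = 0
        self.active = True                       # True: broadcast real values; False: placeholders

    def update_estimate(self, sender_uid, reported_estimate):
        """Two ways to grow the estimate: count distinct UIDs heard from,
        or adopt a larger estimate reported by another node."""
        self.ids_seen.add(sender_uid)
        self.estimate = max(self.estimate, len(self.ids_seen), reported_estimate)

    def before_next_broadcast(self):
        """At each group boundary, re-flip the weighted coin deciding whether this
        group broadcasts real values (probability assumed to be 1/estimate here)."""
        if self.broadcasts_done % self.group_length == 0:
            self.active = random.random() < 1.0 / self.estimate
        self.broadcasts_done += 1
        return self.active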

3.2 Safety

We begin our analysis by proving that our algorithm satisfies the agreement and validity properties of the consensus problem. Validity follows directly from the algorithm description. Our strategy to prove agreement is to show that if any node sees the counter for a value b get sufficiently far ahead of the counter for the other value (causing it to commit to deciding b), then b is the only possible decision value. Race arguments of this type are easier to prove in a shared memory setting where nodes work with objects like atomic registers that guarantee linearization points. In our message passing setting, by contrast, in which broadcast messages arrive at different receivers at different times, we will require more involved definitions and operational arguments. (Footnote: We had initially hoped there might be some way to simulate linearizable shared objects in our model. Unfortunately, our nodes' lack of information about the network size thwarted standard simulation strategies, which typically require nodes to collect messages from a majority of the nodes in the network before proceeding to the next step of the simulation.)

We start with a useful definition. We say a value b dominates at a given point in the execution if every (non-crashed) node at this point believes b is winning the race, and none of the messages in transit can change this perception.

To formalize this notion we need some notation. In the following, we say at point e (or at e), with respect to an event e from the message schedule of an execution, to describe the state of the system immediately after event e (and any associated steps that execute atomically with e) occurs. We also say a message is in transit at e if it has been broadcast but not yet received by every non-crashed receiver at e.

Definition 3.1.

Fix an execution, an event e in the corresponding message schedule, a consensus value b, and a counter value c. We say the execution is (b, c)-dominated at e if the following conditions are true:

  1. For every node u that is not crashed at e: the largest counter value that u has sent or received in a counter message containing consensus value b is at least c, and the largest counter value that u has sent or received in a counter message containing consensus value 1 − b is strictly less than c. (If u has not sent or received any counter messages containing b, resp. 1 − b, then by default it treats the corresponding largest value as 0 in making these comparisons.)

  2. For every counter message containing consensus value 1 − b that is in transit at e: the counter it contains is strictly less than c.

The following lemma formalizes the intuition that once an execution becomes dominated by a given value, it remains dominated by this value.

Lemma 3.2.

Assume some execution is (b, c)-dominated at point e. It follows that it is (b, c)-dominated at every point that comes after e.

Proof.

In this proof, we focus on the suffix of the message schedule that begins with event e. For simplicity, we label these events e_0, e_1, e_2, and so on, with e_0 = e. We will prove the lemma by induction on this sequence.

The base case (e_0) follows directly from the lemma statement. For the inductive step, we must show that if the execution is (b, c)-dominated at point e_{i-1}, then it is dominated at e_i as well. By the inductive hypothesis, we assume the execution is dominated immediately before e_i occurs. Therefore, the only way the inductive step can be violated is if e_i transitions the system from dominated to non-dominated status. We consider all possible cases for e_i and show none of them can cause such a transition.

The first case is if e_i is a crash event for some node. It is clear that a crash cannot transition a system into non-dominated status.

The second case is if e_i is a recv(m) event for some node u. This event can only transition the system into non-dominated status if m is a counter message that includes value 1 − b and a counter of size at least c. For u to receive this message, however, means that the message was in transit immediately before e_i occurs. Because we assume the system is dominated at e_{i-1}, however, no such message can be in transit at this point (by condition 2 of the domination definition).

The third and final case is if e_i is an ack event for some node u that is combined with the broadcast of a message m, where m is a counter message that includes value 1 − b and a counter of size at least c. Consider the largest counters for b and for 1 − b computed by node u early in the steps associated with this event. By our inductive hypothesis, which tells us that the execution is dominated right before this event occurs, it must follow that the counter for b is strictly larger than the counter for 1 − b (as the former is at least c and the latter is less than c). In the steps that immediately follow, therefore, node u will set its proposal to b. It is therefore impossible for u to then broadcast a counter message with value 1 − b. ∎

To prove agreement, we are left to show that if a node commits to deciding some value b, then it must be the case that b dominates the execution at this point, making b the only possible decision going forward. The following helper lemma, which captures a useful property about counters, will prove crucial for establishing this point.

Lemma 3.3.

Assume event e in the message schedule of some execution is combined with a bcast(m), where m is a counter message containing counter c and consensus value b, for some counter c of size at least 1. It follows that prior to e, every node that is non-crashed at e received a counter message with counter at least c − 1 and value b.

Proof.

Fix some e, m, c, and b as specified by the lemma statement. Let e' be the first event in the execution such that, at the end of e' and its associated steps, some node has a local counter of size at least c and consensus value b. We know at least one such event exists, as the node broadcasting at e satisfies these conditions, so the earliest such event, e', is well-defined. Furthermore, because e' must modify local counter and/or consensus values, it must be an ack event.

For the purposes of this argument, let u be the node at which e' occurs, and let c_old and v_old be u's counter and consensus value, respectively, immediately before e' is scheduled. Similarly, let c_new and v_new be these values immediately after e' and its associated steps complete (i.e., these values at point e'). By assumption: c_new is at least c and v_new = b. We proceed by studying the possibilities for c_old and v_old and their relationships with c_new and v_new.

We begin by considering v_old. We want to argue that v_old = b. To see why this is true, assume for contradiction that v_old = 1 − b. It follows that early in the steps for e', node u switches its consensus value from 1 − b to b. By the definition of the algorithm, it only does this if, at this point in the steps, the largest counter u has seen for b (call it x) is strictly larger than the largest counter it has seen for 1 − b (call it y), and y is at least c_old (the last term follows because c_old is included in the values considered when defining y). Note, however, that x must be less than c. If it was greater than or equal to c, this would imply that a node ended an earlier event with counter at least c and value b, contradicting our assumption that e' was the earliest such event. If x is less than c and c_new is at least c, then u must increase its counter beyond x during this event. But because c_old is at most y, which is strictly smaller than x, the only allowable change to u's counter would be to set it to x, which is less than c. This contradicts the assumption that c_new is at least c.

At this checkpoint in our argument we have established that v_old = b. We now consider c_old. If c_old were at least c, then u would start the event with a sufficiently big counter and value b, contradicting the assumption that e' is the earliest such event. It follows that c_old is less than c, and u must increase this value during this event.

There are two ways to increase a counter; i.e., the two conditions in the if/else-if statement that follows the decision check. We start with the second condition: if u's counter is smaller than the largest counter it has seen, u can set its counter to this maximum. For this route to yield a counter of at least c, the maximum counter u has seen must be at least c. If this maximum is the largest counter seen for b, then, as argued above, it would follow that a node had a counter of at least c and value b before e', a contradiction. If this is not true, then the maximum is associated with value 1 − b and strictly exceeds the largest counter seen for b. If this were the case, however, u would have adopted value 1 − b earlier in the event, contradicting the assumption that v_new = b.

At this next checkpoint in our argument we have established that v_old = b, c_old is less than c, and u increases its counter to at least c through the first condition of the if/else-if; i.e., it must find that e' acknowledges an actual message and that c_old is the largest counter u has sent or received so far. Because this condition only increases the counter by one, we can further refine our assumption to c_old = c − 1.

To conclude our argument, consider the implications of the first component of this condition. It follows that e' is an ack for an actual message m'. It cannot be the case that m' is a decide message, as u will not increase its counter on acknowledging a decide message. Therefore, m' is a counter message. Furthermore, because counter and consensus values are not modified after broadcasting a counter message but before receiving its subsequent acknowledgment, we know m' contains counter c_old = c − 1 and value v_old = b (we treat the network size estimate included in the message as a wildcard here, as these estimates could change during this period).

Because u has an acknowledgment for this broadcast, by the definition of the model, prior to e' every non-crashed node received a counter message with counter c − 1 and consensus value b. Since e' occurs no later than e, this is exactly the claim we are trying to prove. ∎

Our main safety theorem leverages the above two lemmas to establish that committing to decide b means that b dominates the execution. The key idea is that counter values cannot become too stale. By Lemma 3.3, if some node has a counter c associated with proposal value b, then all nodes have seen a counter of size at least c − 1 associated with b. It follows that if some node thinks b is far ahead, then all nodes must think b is far ahead in the race (i.e., b dominates). Lemma 3.2 then establishes that this dominance is permanent, making b the only possible decision value going forward.

Theorem 3.4.

The Counter Race Consensus algorithm satisfies validity and agreement.

Proof.

Validity follows directly from the definition of the algorithm. To establish agreement, fix some execution that includes at least one decision. Let e be the first event in this execution that is combined with the broadcast of a decide message. We call such a step a pre-decision step, as it prepares nodes to decide in a later step. Let u be the node at which this occurs and b be the value it includes in the decide message. Because we assume at least one process decides in the execution, we know e exists. We also know it occurs before any decision.

During the steps associated with e, u commits to deciding b. By the decision rule, this indicates that the largest counter u has seen associated with b (call it x) is ahead of the largest counter it has seen associated with 1 − b (call it y) by at least the required margin. Based on this condition, we establish two claims about the system at e, expressed with respect to these values:

  • Claim 1. The largest counter included with value 1 − b in a counter message broadcast before e is no more than y + 1. (Footnote: Notice, in these claims, when we say a message is "broadcast" we only mean that the corresponding bcast event occurred. We make no assumption on which nodes have so far received this message.)

    Assume for contradiction that before e some node broadcast a counter message with value 1 − b and counter at least y + 2. By Lemma 3.3, it follows that before e every non-crashed node received a counter message with value 1 − b and counter at least y + 1. This set of nodes includes u. This contradicts our assumption that at e the largest counter u has seen associated with 1 − b is y.

  • Claim 2. Before e, every non-crashed node has sent or received a counter message with value b and counter at least x − 1.

    By assumption on the values u has seen at e, we know that before e some node broadcast a counter message with value b and counter x. By Lemma 3.3, it follows that before e, every non-crashed node has sent or received a counter message with value b and counter at least x − 1.

Notice that Claim 1 combined with Claim 2 implies that the execution is (b, c)-dominated before e for an appropriate counter value c: the decision margin ensures that the b-counters guaranteed by Claim 2 are strictly larger than any counter paired with 1 − b that is accounted for by Claim 1, whether already delivered or still in transit. By Lemma 3.2, the execution will remain dominated from this point forward. We assumed e was the first pre-decision step, and it will lead u to tell other nodes to decide b before doing so itself. Other pre-decision steps might occur, however, before all nodes have received u's preference for b. With this in mind, let e' be any other pre-decision step. Because e' comes after e, it will occur in a b-dominated system. This means that during the first steps of e', the corresponding node will adopt b as its value (if it has not already done so), meaning it will also promote b.

To conclude, we have shown that once any node reaches a pre-decision step for a value b, the system is already dominated in favor of b, and therefore b is the only possible decision value going forward. Agreement follows directly. ∎

3.3 Liveness

We now turn our attention to liveness. Our goal is to prove the following theorem:

Theorem 3.5.

With high probability, within a number of scheduled events that is polynomial in the network size n, every node executing counter race consensus has either crashed, decided, or received a decide message. In the limit, this termination condition occurs with probability 1.

Notice that this theorem does not require a fair schedule. It guarantees its termination criteria (with high probability) after sufficiently many scheduled events, regardless of which nodes these events occur at. Once the system arrives at a state in which every node has either crashed, decided, or received a decide message, the execution is univalent (only one decision value is possible going forward), and each non-crashed node u will decide after at most two additional ack events at u. (Footnote: In the case where u receives a decide message, the first ack might correspond to the message it was broadcasting when the decide message arrived, and the second corresponds to the decide message that u itself will then broadcast. During this second ack, u will decide and halt.)

Our liveness proof is longer and more involved than our safety proof. This follows, in part, from the need to introduce multiple technical definitions to help identify the execution fragments sufficiently well-behaved for us to apply our probabilistic arguments. With this in mind, we divide the presentation of our liveness proof into two parts. The first part introduces the main ideas of the analysis and provides a road map of sorts to its component pieces. The second part, which contains the details, can be found in the full paper [44].

3.3.1 Main Ideas

Here we discuss the main ideas of our liveness proof. A core definition used in our analysis is the notion of a k-run. Roughly speaking, for a given integer k and node u, we say an execution fragment is a k-run for u if it starts and ends with an ack event for u, it contains k total ack events for u, and no other node has more than k ack events interleaved. We deploy a recursive counting argument to establish that an execution fragment that contains sufficiently many ack events (on the order of n times k) must contain a sub-fragment that is a k-run for some node u.

To put this result to use, we focus our attention on k-runs where k is a sufficiently large fixed multiple of the group-length constant c from the algorithm definition (see Section 3.1 for a reminder of what a group is and how it is used by the algorithm). A straightforward argument establishes that such a k-run for some node u must contain at least one complete group for u; that is, it must contain all broadcasts of one of u's groups.

Combining these observations, it follows that if we partition an execution into segments of this sufficient length, each such segment contains a k-run for some node u, and each such run contains a complete group for u. We call this complete group the target group for the segment (if there are multiple complete groups in the run, choose one arbitrarily to be the target).

These target groups are the core unit to which our subsequent analysis applies. Our goal is to arrive at a target group that is clean in the sense that its owner u is active during the group (i.e., it sends its actual values instead of placeholders), and all broadcasts that arrive at u during this group come from non-active nodes (i.e., these received messages contain placeholders instead of values). If we achieve a clean group, then it is not hard to show that u will advance its counter sufficiently far ahead of all other counters, pushing all other nodes into the termination criteria guaranteed by Theorem 3.5.

To prove clean groups are sufficiently likely, our analysis must overcome two issues. The first issue concerns network size estimates. Fix some target group g belonging to a node u. Let P be the set of nodes from which u receives at least one message during g. If all of these nodes have a sufficiently large network size estimate at the start of g (as formalized later in Definition 3.11), we say the group is calibrated. We prove that if g is calibrated, then it is clean with a probability large enough for our purposes (lower bounded in Lemma 3.12).

The key, therefore, is proving most target groups are calibrated. To do so, we note that if some target group is not calibrated, it means at least one node used an estimate strictly below the calibration threshold when it probabilistically set its flag at the beginning of the relevant group. During the target group, however, enough broadcasts are received that every network size estimate in the system is pushed up to at least this threshold. (Footnote: This summary is eliding some subtle details tackled in the full analysis concerning which broadcasts are guaranteed to be received during a target group. But these details are not important for understanding the main logic of this argument.) Therefore, each target group that fails to be calibrated increases the minimum network size estimate in the system, and because estimates never exceed n, at most n target groups can be non-calibrated.

The second issue concerns probabilistic dependencies. Let C1 be the event that target group g1 is clean and C2 be the event that some other target group g2 is clean. Notice that C1 and C2 are not necessarily independent. If a node has a group that overlaps both g1 and g2, then its probabilistic decision about whether or not to be active in this group impacts the potential cleanliness of both g1 and g2.

Our analysis tackles these dependencies by identifying a subset of target groups that are pairwise independent. To do so, roughly speaking, we process our target groups in order. Starting with the first target group, we mark as unavailable any future target group that overlaps this first group (in the sense described above). We then proceed until we arrive at the next target group not marked unavailable and repeat the process. Each available target group marks only a bounded number of future groups as unavailable. Therefore, given a sufficiently large set of target groups, we can identify a subset T, still containing a large fraction of them, such that all groups in T are pairwise independent.
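The marking procedure just described is a greedy selection; a minimal sketch, where overlaps is assumed to implement the non-overlapping test formalized later (Definition 3.13):

def select_independent_targets(target_groups, overlaps):
    """Greedily keep a target group, then skip every later target group that
    overlaps one already kept (the marking procedure described above)."""
    selected = []
    for g in target_groups:
        if all(not overlaps(g, kept) for kept in selected):
            selected.append(g)
    return selected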

We can now pull together these pieces to arrive at our main liveness claim. Consider a sufficiently long (polynomial in n) prefix of the scheduled events in an execution. We can divide these events into segments of the length required above. We now consider the target groups defined by these segments. By our above argument, there is a subset T of these groups, containing a large fraction of them, such that all target groups in T are mutually independent. At most n of these remaining target groups are not calibrated. If we discard these, we are left with a slightly smaller set that still contains a large number of calibrated and pairwise independent target groups.

We argued that each calibrated group has a good probability of being clean. Leveraging the independence between our identified groups, a standard concentration analysis establishes, with high probability in n, that at least one of these groups is clean, satisfying the theorem statement.
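For reference, the standard calculation behind this last step, stated generically: if each of k mutually independent calibrated target groups is clean with probability at least p, then

Pr[at least one of the k groups is clean] >= 1 - (1 - p)^k >= 1 - e^(-p*k).

Choosing k to be a sufficiently large multiple of (1/p) * ln(n) therefore drives the failure probability below any desired inverse polynomial in n.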

3.3.2 Full Analysis

Our proof of Theorem 3.5 proceeds in two steps. The first step introduces useful notation for describing parts of message schedules, and proves some deterministic properties regarding these concepts. The second step leverages these definitions and properties in making the core probabilistic arguments.

Definitions and Deterministic Properties

Each node u keeps a counter, initialized to zero, that it increments with each ack event. Given a message schedule and node u, we can divide the schedule into phases with respect to u based on this local counter. In more detail, label u's ack events in the schedule a_1, a_2, a_3, and so on. For each i, we define phase i (with respect to u) to be the schedule fragment that starts with acknowledgment a_i and includes all events up to but not including a_{i+1}. If no such a_{i+1} exists (i.e., if a_i is the last ack event for u in the execution), we consider phase i undefined and consider u to only have i − 1 phases in this schedule. Notice, by our model definition, during a given phase i, all non-crashed nodes receive the message whose broadcast was combined with the ack that starts the phase.

We partition a given node ’s phases into groups, which we define with respect to the constant used in the algorithm definition as part of the logic for resetting the nodes’ flag. In particular, we partition the phases into groups of size . For a given node , phases to define group , phases to define group , and, more generally, for all , phases to define group . Notice, by the definition of our algorithm, a node only updates its flag at the beginning of each group. Therefore, the messages sent by a give node during a given one of its groups are either all messages, or all non- messages.

We now introduce the higher level concept of a run, which will prove useful going forward.

Definition 3.6.

Fix an execution with corresponding message schedule S, an integer k of size at least 1, and a node u. We call a subsequence R of S a k-run for u if it satisfies the following three properties:

  1. R starts and ends with an ack event for u,

  2. R contains k total ack events for u, and

  3. no other node has more than k ack events in R.

We now show that for any k, any sufficiently long (defined with respect to k and the network size) fragment from a message schedule will contain a k-run for some node:

Lemma 3.7.

Fix an execution and an integer k of size at least 1. Let S' be any subsequence of the corresponding message schedule that includes at least n times k ack events. There exists a subsequence R of S' that is a k-run for some node u.

Proof.

Because S' contains at least n times k total ack events, a straightforward counting argument provides that at least one node u has at least k ack events in S'. Consider the subsequence R of S' that starts with u's first ack event and ends with its k-th such event. (That is, we remove the prefix of S' before u's first ack and the suffix after u's k-th ack.)

It is clear that R satisfies the first two properties of our definition of a k-run for u. If it also satisfies the third property (that no other node has more than k ack events in R), then we are done: R satisfies the lemma statement.

On the other hand, if R does not satisfy the third property, there must exist some node v that has more than k ack events in R. In this case, we can apply the above argument recursively to R and v, identifying a subsequence R' of R that starts with v's first ack event in R and ends with its k-th such event. The resulting R' satisfies the first two properties of the definition of a k-run for v. If it also satisfies the third property, we are done. Otherwise, we can recurse again on R'.

Because each such recursive application of this argument strictly reduces the size of the subsequence (at the very least, it trims off the first and last ack events of the previous subsequence), and the original S' has a bounded number of events, the recursion must eventually arrive at a subsequence that satisfies all three properties of the k-run definition. ∎
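The recursive trimming argument above translates into a simple search procedure; a hedged sketch, where acks is the ordered list of node ids receiving each ack event in the fragment, and the fragment is assumed to contain at least n * k ack events (matching the lemma's hypothesis as stated here):

def find_k_run(acks, k):
    """Return (node, start, end) delimiting a k-run inside the list acks,
    following the recursive trimming argument of Lemma 3.7."""
    frag_start, frag_end = 0, len(acks)
    while True:
        window = acks[frag_start:frag_end]
        # Pick a node with the most acks in the current fragment (pigeonhole
        # guarantees at least k of them on the first iteration).
        counts = {}
        for node in window:
            counts[node] = counts.get(node, 0) + 1
        u = max(counts, key=counts.get)
        if counts[u] < k:
            raise ValueError("fragment too short to contain a k-run")
        # Trim to the span from u's first ack to its k-th ack.
        positions = [i for i, node in enumerate(window) if node == u]
        frag_start, frag_end = frag_start + positions[0], frag_start + positions[k - 1] + 1
        # If no other node exceeds k acks in the trimmed span, we are done;
        # otherwise the loop recurses on the violating node.
        trimmed = acks[frag_start:frag_end]
        if all(trimmed.count(v) <= k for v in set(trimmed) if v != u):
            return u, frag_start, frag_end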

We next prove an additional useful property of k-runs. In particular, a k-run defined for some node u, with k the multiple of the group length fixed above, is long enough that it must contain all phases of at least one of u's groups. Identifying complete groups of this type will be key to the later probabilistic arguments.

Lemma 3.8.

Let R be a k-run for some node u, where k is the run length fixed above as a sufficiently large multiple of the group length c. It follows that R contains all of the phases of at least one of u's groups (i.e., a complete group for u).

Proof.

Because of the choice of k, R must contain enough ack events for u. It follows that it contains at least 2c − 1 of node u's complete phases (the extra final ack of the run ensures that all of the events that define the last of these phases are included in the run). Because each node group consists of c consecutive phases, any sequence of at least 2c − 1 consecutive phases must include all c phases of at least one full group. ∎

We next introduce the notion of a clean group, and establish that the occurrence of a clean group guarantees that we arrive at the termination state from our main theorem.

Definition 3.9.

Let g be a complete group for some node u. We say g is clean if the following two properties are satisfied:

  1. Node u chooses to broadcast its real counter and proposal values (rather than placeholders) at the beginning of the group described by g.

  2. For every recv(m) event at u that occurs before the final phase of g, m is a placeholder message. (We do not restrict the messages received during the final phase of the clean group.)

Lemma 3.10.

Fix some execution. Assume fragment g of its message schedule is a clean group for some node u. It follows that by the end of g, all nodes have either crashed, decided, or received a decide message.

Proof.

Fix some execution, g, and u as specified by the lemma statement. Let b be the consensus value u adopts at the start of the first phase of the clean group. Because u only receives placeholder messages during all but the last phase of a clean group, we know u will not change this value again in this group until (potentially) the last phase. As we will now argue, however, it will have already decided before this last phase, so the fact that u might receive non-placeholder values in that phase is inconsequential.

In more detail, let A and B be the largest counter values that u has seen for b and 1 − b, respectively, by the time it completes the ack that begins the first phase. Because we just assumed that u adopts b at this point, we know A is at least B. Furthermore, because u only receives placeholder messages, we know that in every subsequent phase of the group, u will either increase the counter associated with b or send a decide message. The largest counter u has seen associated with 1 − b will not increase beyond B during these phases.

It follows that if u has not yet sent a decide message by the start of the second-to-last phase of the group, it will see during the ack event that starts this phase that its largest counter for b is sufficiently far ahead of the largest counter for 1 − b. Accordingly, during this phase u will send a decide message. During the ack event that starts the final phase, u will receive the acknowledgment of this decide message and decide. At this point, all other nodes have received its decide message as well. Because this is the last phase of the group, it is possible that u receives non-placeholder messages from other nodes, but at this point this is too late to have an impact, as u has already decided and halted. (It is here that we see why the chosen group length c is the right value.) ∎

Randomized Analysis

In Part 1 of this analysis we introduced several useful definitions and execution properties. These culminated with the argument in Lemma 3.10 that if we ever get a clean group in an execution, then we have achieved the desired termination property. Our goal in this second part of the analysis is to leverage the tools from the preceding part to prove, with high probability, that the algorithm will generate a clean group after not too many events are scheduled.

On Network Size Assumptions. If n = 1, then all that is required for the single node u to experience a clean group is for it to choose to broadcast its real values at the beginning of some group. By Lemma 3.10, it will then decide and halt in the group that follows. By the definition of the algorithm, this occurs with constant probability at the beginning of each group, as u initializes its network size estimate to its default value and this estimate will never change. Therefore: with high probability, u will decide within a small number of groups (and therefore a small number of scheduled acks), and with probability 1, it will decide in the limit. This satisfies our liveness theorem. In the analysis that follows, therefore, we will assume n is larger than 1.

On Independence Between the Schedule and Random Choices. According to our model assumptions (Section 2), the scheduler is provided no advance information about the nodes' states or the contents of the messages they send. All the scheduler learns is the input assignment, and whether or not a given node sent some message (but not the message contents) as part of the steps it takes for a given init or ack event. By the definition of our algorithm, however, until it halts, each node sends a message when initialized and after every ack, regardless of its random choices or the specific contents of the messages it receives. It follows that the scheduler learns nothing new about the nodes' states beyond their input values until the first node halts, at which point some additional information might be inferred. For a node to halt, however, means it has already sent a decide message and received an ack for this message, meaning that we have already satisfied the desired termination property at this point. Accordingly, in the analysis that follows, we can treat the scheduler's choices as independent of the nodes' random choices. This allows us to fix the schedule first and then reason probabilistically about the messages sent during the schedule, without worrying about dependence between the schedule and those choices. (Footnote: Technically speaking, in the analysis above, we imagine, without loss of generality, that the scheduler creates an infinite schedule that describes how it wants the execution to unfold until it learns the first node halts. At that point, it can modify the schedule going forward.)

In analyzing the probability of a group ending up clean, a key property is whether or not the nodes participating in that group all have good estimates of the network size (i.e., the estimates used in setting their placeholder flags). We call a group with good estimates a calibrated group. The formal definition of this property requires some care to ensure it exactly matches how we later study it:

Definition 3.11.

Fix an execution. Let g be a complete group for some node u in the message schedule. Let P be the set of nodes that have at least one of their messages received by u before the final phase of u's group g, let p = |P|, and for each v in P, let e_v be the ack event in the schedule that starts the node group of v that sends the first of v's messages received by u in g. We say that group g is calibrated if for every v in P: the network size estimate used in event e_v to probabilistically set v's flag is of size at least p + 1 (i.e., at least the number of senders u hears from in g, counting u itself).

Notice in the above that if P is empty then the property is vacuously true. Another key property of calibration is that it is determined entirely by the message schedule. That is, given a prefix of a message schedule, you can correctly determine the network size estimates of all nodes at the end of that prefix without needing to know anything about their input values or random choices. This follows because network size estimates are based on two things: the number of UIDs from which a node has received messages (of any type), and other nodes' reported estimates (which are included on the messages they send). As argued above, the only things impacted by the nodes' random choices and inputs are the types of messages they send, not when they send them.

Therefore, given a message schedule and a group within the message schedule, we can determine whether or not that group is calibrated independent of the nodes’ random choices, supporting the following:

Lemma 3.12.

Let S be a message schedule generated by the scheduler. Let R be a k-run for some node u in S, and let g be a complete group for u in R. If g is calibrated, then the probability that g is clean is bounded below by a positive quantity depending only on the network size n.

Proof.

Fix some S, u, R, and g as specified by the lemma statement. Fix P, p, and the e_v events, as specified in our definition of calibrated (Definition 3.11).

We note that if P is empty, then the only condition that must hold for g to be clean is for u to broadcast its real values during g. This occurs with probability at least 1/n (as n is the largest possible network size estimate), satisfying the lemma.

Continuing, we consider the case where P is non-empty. Fix any v in P. We begin by bounding the total number of v's groups that might send a message that is received by u in g. To do so, we note that because R is a k-run, v cannot have more than k ack events in R. Therefore, only a constant number of v's groups can overlap R (as each of v's groups requires c ack events at v, and k is a fixed multiple of c), and therefore there are at most this constant number of v's groups that both overlap g and deliver a message from v to u in g.

We now lower bound the probability that v broadcasts only placeholder messages at the beginning of (and therefore throughout) all of these groups. We consider two cases based on the value of p. If p is very small, then the fact that this group is calibrated gives only a weak bound on v's estimate, which is not directly useful. In this case, however, we note that the definition of the algorithm guarantees a minimum value for v's estimate, as the estimate is initialized to a fixed default and never decreases. We can therefore crudely lower bound the probability that v chooses placeholders in all of its overlapping groups by a positive constant, satisfying the lemma.

We now consider the case where p is larger. In this case, we leverage the definition of calibrated, which tells us that at the beginning of the first of these overlapping groups, v has a network size estimate of size at least p + 1, and that this remains true for all of the overlapping groups, as these estimates never decrease. Therefore, the probability that the nodes in P deliver only placeholder messages to u before the final phase of g is bounded below by a constant: each of the p nodes must choose placeholders in each of its constantly many overlapping groups, and the calibrated estimates make each such choice sufficiently likely.

Combining the above probability with the straightforward observation that u broadcasts its real values during g with probability at least 1/n (as n is the largest possible network size estimate) yields the claimed lower bound on the probability that g is clean, as required by the lemma statement. ∎

We have established that if a group is calibrated then it has a good chance of being clean and therefore of ensuring termination. To leverage this result, however, we must overcome two issues. The first is proving that calibrated groups are sufficiently common in a given schedule. The second is dealing with dependencies between different groups. Assume, for example, that we want to calculate the probability that at least one group from among a collection of target groups is clean. Assume some node v has a group that overlaps multiple groups in this collection. If v chooses to broadcast its real values in this group, this reduces the probability of cleanliness for several groups in the collection. In other words, cleanliness is not necessarily independent between different target groups.

On Good Target Groups. We overcome these challenges by proving that any sufficiently long message schedule must contain a sufficient number of calibrated and pairwise independent target groups.

To do so, let S be some message schedule generated by the scheduler that contains sufficiently many events: polynomially many in n, with the exponent chosen to support the desired high-probability guarantee. Partition this schedule into segments, each containing enough events for Lemma 3.7 to apply with the run length k fixed above. Label these segments S_1, S_2, and so on.

By Lemma 3.7, each segment S_i contains a k-run for some node u. Applying Lemma 3.8, it follows that this k-run contains at least one complete group for u. We call this complete group the target group for S_i, and label it g_i. (If there is more than one complete group for u in the k-run, then we set g_i to the first such group in the run.) Let G be the complete set of these target groups.

We turn our attention to this set of target groups. To study their usefulness for inducing termination, we will use the notion of calibration introduced earlier, as well as the following formal notion of non-overlapping:

Definition 3.13.

Fix two target groups g_i and g_j. We say g_i and g_j are non-overlapping if there does not exist a group (belonging to any node) that has at least one event in g_i and at least one event in g_j. If g_i and g_j are not non-overlapping, then we say they overlap.

Our goal is to identify a subset of these target groups that are good—a property which we define with respect to calibration and non-overlapping properties as follows:

Definition 3.14.

Let T be a subset of the target groups in G. We say the groups in T are good if: (1) every group in T is calibrated; and (2) every pair of distinct groups in T is non-overlapping.

Notice that both the calibration and non-overlapping status of groups are a function entirely of the message schedule. Therefore, given a message schedule, we can partition it into segments and target groups as described above, and label the status of these target groups without needing to consider the nodes’ random bits.

To do so, we first prove a useful bound on the prevalence of calibration in G. The core idea in the following is that every time a target group fails calibration, all nodes increase their network size estimates. Clearly, this can only occur a bounded number of times before all estimates reach the maximum possible value of n, after which calibration is trivial. We then apply this result in making a more involved argument about the frequency of good groups.

Lemma 3.15.

At most n groups in G are not calibrated.

Proof.

Fix some target group g_i in G, belonging to some node u, that is not calibrated. Let P be the set of nodes that deliver at least one message to u before the final phase of g_i, and let p = |P|. By the definition of calibration, if g_i is not calibrated, then at least one node in P starts its relevant group with a network size estimate smaller than p + 1.

During the phases of g_i before the final one, node u receives a message from every node in P (this is the definition of P). This means that by the start of the final phase of g_i, u's network size estimate is of size at least p + 1. The message that u sends in the final phase therefore will be labelled with this estimate. By the end of this final phase, all non-crashed processes will have received this estimate. Therefore, all these processes will update their network size estimates to be at least p + 1 at the beginning of their next phases.

At this po