The latest gossip on BFT consensus

07/13/2018 ∙ by Ethan Buchman, et al.

The paper presents Tendermint, a new protocol for ordering events in a distributed network under adversarial conditions. More commonly known as Byzantine Fault Tolerant (BFT) consensus or atomic broadcast, the problem has attracted significant attention in recent years due to the widespread success of blockchain-based digital currencies, such as Bitcoin and Ethereum, which successfully solved the problem in a public setting without a central authority. Tendermint modernizes classic academic work on the subject and simplifies the design of the BFT algorithm by relying on a peer-to-peer gossip protocol among nodes.

I Introduction

Consensus is one of the most fundamental problems in distributed computing. It is important because of its role in State Machine Replication (SMR), a generic approach for replicating services that can be modeled as a deterministic state machine [1, 2]. The key idea of this approach is that service replicas start in the same initial state and then execute requests (also called transactions) in the same order, thereby guaranteeing that replicas stay in sync with each other. The role of consensus in the SMR approach is to ensure that all replicas receive transactions in the same order. Traditionally, deployments of SMR-based systems are in data-center settings (local area network), have a small number of replicas (three to seven), and are typically part of a single administration domain (e.g., Chubby [3]); therefore they handle benign (crash) failures only, as more general forms of failure (in particular, malicious or Byzantine faults) are considered to occur with only negligible probability.

The success of cryptocurrencies or blockchain systems in recent years (e.g., [4, 5]) poses a whole new set of challenges for the design and deployment of SMR-based systems: reaching agreement over a wide area network, among a large number of nodes (hundreds or thousands) that are not part of the same administration domain, and where a subset of nodes can behave maliciously (Byzantine faults). Furthermore, contrary to the previous data-center deployments where nodes are fully connected to each other, in blockchain systems a node is only connected to a subset of other nodes, so communication is achieved by gossip-based peer-to-peer protocols. The new requirements demand designs and algorithms that are not necessarily present in the classical academic literature on Byzantine fault-tolerant consensus (or SMR) systems (e.g., [6, 7]), whose primary focus was a different setup.

In this paper we describe a novel Byzantine fault-tolerant consensus algorithm that is the core of the BFT SMR platform called Tendermint (the Tendermint platform is available open source at https://github.com/tendermint/tendermint). The Tendermint platform consists of a high-performance BFT SMR implementation written in Go, a flexible interface for building arbitrary deterministic applications above the consensus, and a suite of tools for deployment and management.

The Tendermint consensus algorithm is inspired by the PBFT SMR algorithm [8] and the DLS algorithm for authenticated faults (Algorithm 2 from [6]). Similar to the DLS algorithm, Tendermint proceeds in rounds, where each round has a dedicated proposer (also called coordinator or leader) and a process proceeds to a new round as part of normal processing (not only in case the proposer is faulty or suspected as being faulty by enough processes, as in PBFT). (Tendermint is not presented in the basic round model of [6]. Furthermore, we use the term round differently than in [6]: in Tendermint a round denotes a sequence of communication steps instead of a single communication step as in [6].) The communication pattern of each round is very similar to the "normal" case of PBFT. Therefore, in favorable conditions (correct proposer, timely and reliable communication between correct processes), Tendermint decides in three communication steps (the same as PBFT).

The major novelty and contribution of the Tendermint consensus algorithm is a new termination mechanism. As explained in [9, 10], the existing BFT consensus (and SMR) algorithms for the partially synchronous system model (for example PBFT [8], [6], [11]) typically rely on the communication pattern illustrated in Figure 1 for termination. Figure 1 illustrates the messages exchanged during the proposer change when processes start a new round. (There is no consistent terminology in the distributed computing literature for naming a sequence of communication steps that corresponds to a logical unit; it is sometimes called a round, a phase, or a view.) This pattern guarantees that eventually (i.e., after some Global Stabilization Time, GST), there exists a round with a correct proposer that will bring the system into a univalent configuration. Intuitively, in a round in which the proposed value is accepted by all correct processes, and communication between correct processes is timely and reliable, all correct processes decide.

Fig. 1: Proposer (coordinator) change, showing the messages exchanged when a new proposer takes over.

To ensure that a proposed value is accepted by all correct processes, a proposer will 1) build the global state by receiving messages from other processes, 2) select the safe value to propose and 3) send the selected value together with the signed messages received in the first step to support it. (The proposed value is not blindly accepted by correct processes in BFT algorithms. A correct process always verifies whether the proposed value is safe to accept, so that the safety properties of consensus are not violated.) The value that a correct process sends to the next proposer normally corresponds to a value the process considers acceptable for a decision:

  • in PBFT [8] and DLS [6] it is not the value itself but a set of signed messages with the same value id,

  • in Fast Byzantine Paxos [11] the value itself is being sent.

In both cases, using this mechanism in our system model (i.e., a high number of nodes over a gossip-based network) would have high communication complexity that increases with the number of processes: in the first case because the message sent depends on the total number of processes, and in the second case because the value (block of transactions) is sent by each process. The set of messages received in the first step is normally piggybacked on the proposal message (as shown in Figure 1) to justify the choice of the selected value. Note that sending this message also does not scale with the number of processes in the system.

We designed a novel termination mechanism for Tendermint that better suits the system model we consider. It does not require additional communication (neither sending new messages nor piggybacking information on the existing messages) and it is fully based on the communication pattern that is very similar to the normal case in PBFT [8]. Therefore, there is only a single mode of execution in Tendermint, i.e., there is no separation between the normal and the recovery mode, which is the case in other PBFT-like protocols (e.g., [8], [12] or [13]). We believe this makes Tendermint simpler to understand and implement correctly.

Note that an orthogonal approach to reducing message complexity, in order to improve the scalability and decentralization (number of processes) of BFT consensus algorithms, is to use advanced cryptography (for example, Boneh-Lynn-Shacham (BLS) signatures [14]), as done for example in SBFT [15].

The remainder of the paper is organized as follows: Section II defines the system model and gives the problem definitions. The Tendermint consensus algorithm is presented in Section III and the proofs are given in Section IV. We conclude in Section V.

II Definitions

II-A Model

We consider a system of processes that communicate by exchanging messages. Processes can be correct or faulty, where a faulty process can behave in an arbitrary way, i.e., we consider Byzantine faults. We assume that each process has some amount of voting power (the voting power of a process can be $0$). Processes in our model are not part of a single administration domain; therefore we cannot enforce direct network connectivity between all processes. Instead, we assume that each process is connected to a subset of processes called peers, such that there is an indirect communication channel between all correct processes. Communication between processes is established using a gossip protocol [16].

Formally, we model the network communication using the partially synchronous system model [6]: in all executions of the system there is a bound $\Delta$ and an instant GST (Global Stabilization Time) such that all communication among correct processes after GST is reliable and $\Delta$-timely, i.e., if a correct process $p$ sends message $m$ at time $t \ge GST$ to a correct process $q$, then $q$ will receive $m$ before $t + \Delta$. (Note that as we do not assume direct communication channels among all correct processes, this implies that before the message $m$ reaches $q$, it might pass through a number of correct processes that will forward the message using the gossip protocol towards $q$.) Messages among correct processes can be delayed, dropped or duplicated before GST. Spoofing/impersonation attacks are assumed to be impossible at all times due to the use of public-key cryptography. The bound $\Delta$ and GST are system parameters whose values are not required to be known for the safety of our algorithm. Termination of the algorithm is guaranteed within a bounded duration after GST. In practice, the algorithm will work correctly in the slightly weaker variant of the model where the system alternates between (long enough) good periods (corresponding to the period after GST, in which the system is reliable and $\Delta$-timely) and bad periods (corresponding to the period before GST, during which the system is asynchronous and messages can be lost), but considering the GST model simplifies the discussion.

We assume that process steps (which might include sending and receiving messages) take zero time. Processes are equipped with clocks so they can measure local timeouts. All protocol messages are signed, i.e., when a correct process receives a signed message from its peer, the process can verify who the original sender of the message was.

The details of the Tendermint gossip protocol will be discussed in a separate technical report. For the sake of this report it is sufficient to assume that messages are being gossiped between processes and the following property holds (in addition to the partial synchrony network assumptions):

  • Gossip communication: If a correct process $p$ receives some message $m$ at time $t$, all correct processes will receive $m$ before $\max\{t, GST\} + \Delta$.

II-B State Machine Replication

State machine replication (SMR) is a general approach for replicating services modeled as a deterministic state machine [1, 2]. The key idea of this approach is to guarantee that all replicas start in the same state and then apply requests from clients in the same order, thereby guaranteeing that the replicas’ states will not diverge. Following Schneider [2], we note that the following is key for implementing a replicated state machine tolerant to (Byzantine) faults:

  • Replica Coordination. All [non-faulty] replicas receive and process the same sequence of requests.

Moreover, as Schneider also notes, this property can be decomposed into two parts, Agreement and Order: Agreement requires all (non-faulty) replicas to receive all requests, and Order requires that the order of received requests is the same at all replicas.
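As a minimal sketch of why Agreement and Order together keep replicas in sync, the following toy state machine applies a log of non-commuting operations. The `set`/`add` operations, the tuple format and the function names are illustrative assumptions, not part of Tendermint.

```python
def apply(state, tx):
    """Deterministically apply one transaction to a copy of the state."""
    op, key, amount = tx  # hypothetical transaction format
    new_state = dict(state)
    if op == "set":
        new_state[key] = amount
    elif op == "add":
        new_state[key] = new_state.get(key, 0) + amount
    else:
        raise ValueError(f"unknown operation: {op}")
    return new_state


def replay(log):
    """Start from the same initial (empty) state and apply the log in order."""
    state = {}
    for tx in log:
        state = apply(state, tx)
    return state


log = [("set", "x", 1), ("add", "x", 2)]
replica_a = replay(log)                   # same requests, same order
replica_b = replay(log)
reordered = replay(list(reversed(log)))   # same requests, different order
```

Two replicas that replay the same log end in the same state, while a replica that sees the same requests in a different order can diverge, which is why the order must be agreed upon.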

There is an additional requirement that needs to be ensured by Byzantine tolerant state machine replication: only requests (called transactions in the Tendermint terminology) proposed by clients are executed. In Tendermint, transaction verification is the responsibility of the service that is being replicated; upon receiving a transaction from the client, the Tendermint process will ask the service if the request is valid, and only valid requests will be processed.

II-C Consensus

Tendermint solves state machine replication by sequentially executing consensus instances to agree on each block of transactions that are then executed by the service being replicated. We consider a variant of the Byzantine consensus problem called Validity Predicate-based Byzantine consensus that is motivated by blockchain systems [17]. The problem is defined by an agreement, a termination, and a validity property.

  • Agreement: No two correct processes decide on different values.

  • Termination: All correct processes eventually decide on a value.

  • Validity: A decided value is valid, i.e., it satisfies the predefined predicate denoted valid().

This variant of the Byzantine consensus problem has an application-specific valid() predicate to indicate whether a value is valid. In the context of blockchain systems, for example, a value is not valid if it does not contain an appropriate hash of the last value (block) added to the blockchain.
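The hash-chaining example can be sketched as follows. This is only an illustration under stated assumptions: the block layout (`prev_id`, `txs`), the JSON serialization and the helper `block_id` are hypothetical, and a real application would define its own valid() predicate.

```python
import hashlib
import json


def block_id(block):
    """Hash of a canonical (sorted-key JSON) serialization of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def valid(block, last_block):
    """A value is valid only if it references the hash of the last block."""
    return block.get("prev_id") == block_id(last_block)


genesis = {"prev_id": None, "txs": []}
good = {"prev_id": block_id(genesis), "txs": ["alice->bob:5"]}
bad = {"prev_id": "not-the-right-hash", "txs": ["alice->bob:5"]}
```

Here `valid(good, genesis)` holds and `valid(bad, genesis)` does not, so only `good` could ever be a decided value for the next height.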

III Tendermint consensus algorithm

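Algorithm 1 Tendermint consensus algorithm

 1: Initialization:
 2:   h_p := 0   /* current height, or consensus instance we are currently executing */
 3:   round_p := 0   /* current round number */
 4:   step_p ∈ {propose, prevote, precommit}
 5:   decision_p[] := nil
 6:   lockedValue_p := nil
 7:   lockedRound_p := −1
 8:   validValue_p := nil
 9:   validRound_p := −1

10: upon start do StartRound(0)

11: Function StartRound(round):
12:   round_p ← round
13:   step_p ← propose
14:   if proposer(h_p, round_p) = p then
15:     if validValue_p ≠ nil then
16:       proposal ← validValue_p
17:     else
18:       proposal ← getValue()
19:     broadcast ⟨PROPOSAL, h_p, round_p, proposal, validRound_p⟩
20:   else
21:     schedule OnTimeoutPropose(h_p, round_p) to be executed after timeoutPropose(round_p)

22: upon ⟨PROPOSAL, h_p, round_p, v, −1⟩ from proposer(h_p, round_p) while step_p = propose do
23:   if valid(v) ∧ (lockedRound_p = −1 ∨ lockedValue_p = v) then
24:     broadcast ⟨PREVOTE, h_p, round_p, id(v)⟩
25:   else
26:     broadcast ⟨PREVOTE, h_p, round_p, nil⟩
27:   step_p ← prevote

28: upon ⟨PROPOSAL, h_p, round_p, v, vr⟩ from proposer(h_p, round_p) AND 2f+1 ⟨PREVOTE, h_p, vr, id(v)⟩ while step_p = propose ∧ (vr ≥ 0 ∧ vr < round_p) do
29:   if valid(v) ∧ (lockedRound_p ≤ vr ∨ lockedValue_p = v) then
30:     broadcast ⟨PREVOTE, h_p, round_p, id(v)⟩
31:   else
32:     broadcast ⟨PREVOTE, h_p, round_p, nil⟩
33:   step_p ← prevote

34: upon 2f+1 ⟨PREVOTE, h_p, round_p, ∗⟩ while step_p = prevote for the first time do
35:   schedule OnTimeoutPrevote(h_p, round_p) to be executed after timeoutPrevote(round_p)

36: upon ⟨PROPOSAL, h_p, round_p, v, ∗⟩ from proposer(h_p, round_p) AND 2f+1 ⟨PREVOTE, h_p, round_p, id(v)⟩ while valid(v) ∧ step_p ≥ prevote for the first time do
37:   if step_p = prevote then
38:     lockedValue_p ← v
39:     lockedRound_p ← round_p
40:     broadcast ⟨PRECOMMIT, h_p, round_p, id(v)⟩
41:     step_p ← precommit
42:   validValue_p ← v
43:   validRound_p ← round_p

44: upon 2f+1 ⟨PREVOTE, h_p, round_p, nil⟩ while step_p = prevote do
45:   broadcast ⟨PRECOMMIT, h_p, round_p, nil⟩
46:   step_p ← precommit

47: upon 2f+1 ⟨PRECOMMIT, h_p, round_p, ∗⟩ for the first time do
48:   schedule OnTimeoutPrecommit(h_p, round_p) to be executed after timeoutPrecommit(round_p)

49: upon ⟨PROPOSAL, h_p, r, v, ∗⟩ from proposer(h_p, r) AND 2f+1 ⟨PRECOMMIT, h_p, r, id(v)⟩ while decision_p[h_p] = nil do
50:   if valid(v) then
51:     decision_p[h_p] ← v
52:     h_p ← h_p + 1
53:     reset lockedRound_p, lockedValue_p, validRound_p and validValue_p to initial values and empty message log
54:     StartRound(0)

55: upon f+1 ⟨∗, h_p, round, ∗, ∗⟩ with round > round_p do
56:   StartRound(round)

57: Function OnTimeoutPropose(height, round):
58:   if height = h_p ∧ round = round_p ∧ step_p = propose then
59:     broadcast ⟨PREVOTE, h_p, round_p, nil⟩
60:     step_p ← prevote

61: Function OnTimeoutPrevote(height, round):
62:   if height = h_p ∧ round = round_p ∧ step_p = prevote then
63:     broadcast ⟨PRECOMMIT, h_p, round_p, nil⟩
64:     step_p ← precommit

65: Function OnTimeoutPrecommit(height, round):
66:   if height = h_p ∧ round = round_p then
67:     StartRound(round_p + 1)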

In this section we present the Tendermint Byzantine fault-tolerant consensus algorithm. The algorithm is specified by the pseudo-code listing in Algorithm 1. We present the algorithm as a set of upon rules that are executed atomically (in case several rules are active at the same time, the first rule to be executed is picked randomly; the correctness of the algorithm does not depend on the order in which rules are executed). We assume that processes exchange protocol messages using a gossip protocol and that both received and sent messages are stored in a local message log for every process. An upon rule is triggered once the message log contains messages such that the corresponding condition evaluates to true. A condition that assumes reception of $X$ messages of a particular type and content denotes reception of messages whose senders have aggregate voting power at least equal to $X$. For example, the condition $2f+1$ $\langle \mathrm{PRECOMMIT}, h_p, r, id(v) \rangle$ evaluates to true upon reception of PRECOMMIT messages for height $h_p$, a round $r$ and with value equal to $id(v)$ whose senders have aggregate voting power at least equal to $2f+1$. Some of the rules end with a "for the first time" constraint to denote that the rule is triggered only the first time the corresponding condition evaluates to true. This is because those rules do not always change the state of algorithm variables, so without this constraint the algorithm could keep executing those rules forever. The variables with index $p$ are process-local state variables, while variables without index $p$ are value placeholders. The sign $*$ denotes any value.

We denote with $n$ the total voting power of processes in the system, and we assume that the total voting power of faulty processes in the system is bounded by a system parameter $f$. The algorithm assumes that $n > 3f$, i.e., it requires that the total voting power of faulty processes is smaller than one third of the total voting power. For simplicity we present the algorithm for the case $n = 3f + 1$.
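A small sketch of how thresholds are counted by voting power rather than by number of messages. It assumes the usual BFT bound in which the total voting power exceeds three times the faulty voting power and a quorum is $2f+1$; the validator names, weights and function names are hypothetical.

```python
def max_faulty_power(total_power):
    """Largest f that satisfies total_power > 3 * f."""
    return (total_power - 1) // 3


def has_quorum(senders, voting_power, f):
    """True if the distinct senders hold voting power of at least 2f + 1."""
    power = sum(voting_power[p] for p in set(senders))
    return power >= 2 * f + 1


voting_power = {"a": 4, "b": 3, "c": 2, "d": 1}  # hypothetical validator set
n = sum(voting_power.values())
f = max_faulty_power(n)
```

With these weights, two processes (`a` and `b`) already form a quorum, while three lighter ones (`b`, `c`, `d`) do not; duplicate messages from the same sender are counted once.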

The algorithm proceeds in rounds, where each round has a dedicated proposer. The mapping of rounds to proposers is known to all processes and is given as a function $proposer(h, round)$, returning the proposer for the round $round$ in the consensus instance $h$. We assume that the proposer selection function is weighted round-robin, where processes are rotated proportionally to their voting power (a validator with more voting power is selected more frequently, proportionally to its power; more precisely, during a sequence of rounds of size $n$, every process is proposer in a number of rounds equal to its voting power). The internal protocol state transitions are triggered by message reception and by expiration of timeouts. There are three timeouts in Algorithm 1: $timeoutPropose$, $timeoutPrevote$ and $timeoutPrecommit$. The timeouts prevent the algorithm from blocking and waiting forever for some condition to be true, ensure that processes continuously transition between rounds, and guarantee that eventually (after GST) communication between correct processes is timely and reliable so they can decide. The last role is achieved by increasing the timeouts with every new round $r$, i.e., $timeoutX(r) = initTimeoutX + r \cdot timeoutDelta$; they are reset for every new height (consensus instance).
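The following sketch shows one simple way to obtain a weighted round-robin with the stated property, together with a timeout that grows with the round number. It is an illustration under assumptions, not the selection procedure used by the Tendermint implementation: the slot-expansion scheme, the use of `height + round` as the index, the linear timeout growth and all names are hypothetical.

```python
from collections import Counter


def proposer(height, round_, voting_power):
    """Pick the proposer from a cyclic schedule with one slot per unit of power."""
    schedule = [p for p in sorted(voting_power) for _ in range(voting_power[p])]
    return schedule[(height + round_) % len(schedule)]


def timeout(init_timeout, timeout_delta, round_):
    """Timeout that increases linearly with the round number."""
    return init_timeout + round_ * timeout_delta


voting_power = {"a": 2, "b": 1, "c": 1}   # hypothetical validator set, n = 4
n = sum(voting_power.values())
turns = Counter(proposer(0, r, voting_power) for r in range(n))
```

Because the schedule is cyclic with period $n$, any window of $n$ consecutive rounds gives each process exactly as many turns as it has voting power.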

Processes exchange the following messages in Tendermint: PROPOSAL, PREVOTE and PRECOMMIT. The PROPOSAL message is used by the proposer of the current round to suggest a potential decision value, while PREVOTE and PRECOMMIT are votes for a proposed value. According to the classification of consensus algorithms from [10], Tendermint, like PBFT [7] and DLS [6], belongs to class 3, so it requires two voting steps (three communication exchanges in total) to decide a value. The Tendermint consensus algorithm is designed for the blockchain context where the value to decide is a block of transactions (i.e., it is potentially quite large, consisting of many transactions). Therefore, in Algorithm 1 (similarly to [7]) we are explicit about sending a value (block of transactions) and a small, constant-size value id (a unique value identifier, normally a hash of the value, i.e., if $id(v) = id(v')$, then $v = v'$). The PROPOSAL message is the only one carrying the value; PREVOTE and PRECOMMIT messages carry the value id. A correct process decides on a value $v$ in Tendermint upon receiving the PROPOSAL for $v$ and $2f+1$ voting-power equivalent PRECOMMIT messages for $id(v)$ in some round $r$. In order to send a PRECOMMIT message for $v$ in a round $r$, a correct process waits to receive the PROPOSAL and $2f+1$ of the corresponding PREVOTE messages in the round $r$. Otherwise, it sends a PRECOMMIT message with a special $nil$ value. This ensures that correct processes can PRECOMMIT only a single value (or $nil$) in a round. As proposers may be faulty, the proposed value is treated by correct processes as a suggestion (it is not blindly accepted), and a correct process tells others whether it accepted the PROPOSAL for value $v$ by sending a PREVOTE message for $id(v)$; otherwise it sends a PREVOTE message with the special $nil$ value.

Every process maintains the following variables in Algorithm 1: $step$, $lockedValue$, $lockedRound$, $validValue$ and $validRound$. The $step$ denotes the current state of the internal Tendermint state machine, i.e., it reflects the stage of the algorithm execution in the current round. The $lockedValue$ stores the most recent value (with respect to a round number) for which a PRECOMMIT message has been sent. The $lockedRound$ is the last round in which the process sent a PRECOMMIT message that is not $nil$. We also say that a correct process locks a value $v$ in a round $r$ by setting $lockedValue = v$ and $lockedRound = r$ before sending a PRECOMMIT message for $id(v)$. As a correct process can decide a value $v$ only if $2f+1$ PRECOMMIT messages for $id(v)$ are received, this implies that a possible decision value is a value that is locked by at least $f+1$ voting-power equivalent of correct processes. Therefore, any value $v$ for which the PROPOSAL and $2f+1$ of the corresponding PREVOTE messages are received in some round $r$ is a possible decision value. The role of the $validValue$ variable is to store the most recent possible decision value; the $validRound$ is the last round in which $validValue$ was updated. Apart from those variables, a process also stores the current consensus instance ($h_p$, called height in Tendermint) and the current round number ($round_p$), and attaches them to every message. Finally, a process also stores an array of decisions, $decision_p$ (Tendermint assumes a sequence of consensus instances, one for each height).

Every round starts with a proposer suggesting a value with the PROPOSAL message (see line 19). In the initial round of each height, the proposer is free to choose the value to suggest. In Algorithm 1, a correct process obtains a value to propose using an external function $getValue()$ that returns a valid value to propose. In the following rounds, a correct proposer will suggest a new value only if $validValue = nil$; otherwise $validValue$ is proposed (see lines 15-18). In addition to the value proposed, the PROPOSAL message also contains the $validRound$, so other processes are informed about the last round in which the proposer observed $validValue$ as a possible decision value. Note that if a correct proposer $p$ sends $validValue$ with the $validRound$ in the PROPOSAL, this implies that the process $p$ received the PROPOSAL and the corresponding $2f+1$ PREVOTE messages for $validValue$ in the round $validRound$. If a correct process sends a PROPOSAL message with $validValue$ ($validRound > -1$) at time $t > GST$, by the Gossip communication property, the corresponding PROPOSAL and the PREVOTE messages will be received by all correct processes before time $t + \Delta$. Therefore, all correct processes will be able to verify the correctness of the suggested value, as it is supported by the PROPOSAL and the corresponding $2f+1$ voting-power equivalent PREVOTE messages.

A correct process $p$ accepts the proposal for a value $v$ (sends PREVOTE for $id(v)$) if an external $valid$ function returns $true$ for the value $v$, and if $p$ hasn't locked any value ($lockedRound_p = -1$) or $p$ has locked the value $v$ ($lockedValue_p = v$); see line 23. In case the proposed pair is $(v, vr \ge 0)$ and a correct process $p$ has locked some value, it will accept $v$ if it is a more recent possible decision value, $vr > lockedRound_p$, or if $lockedValue_p = v$ (see line 29). (As explained above, the possible decision value in a round $r$ is the one for which the PROPOSAL and the corresponding $2f+1$ PREVOTE messages are received for the round $r$.) Otherwise, a correct process will reject the proposal by sending a PREVOTE message with the $nil$ value. A correct process will also send a PREVOTE message with the $nil$ value in case $timeoutPropose$ expired (it is triggered when a correct process starts a new round) and the process has not sent a PREVOTE message in the current round yet (see line 57).
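The acceptance rule described above can be sketched as a pure function. This is a simplified illustration, not the implementation: it returns the value itself rather than its id, uses `None` for nil and `-1` for "no round", assumes that for a re-proposed value the supporting prevotes from its valid round have already been received, and uses hypothetical parameter names. The strict comparison follows the "more recent" wording of the paragraph.

```python
def prevote_for(value, valid_round, locked_value, locked_round, is_valid):
    """Return the value a correct process prevotes for, or None for nil."""
    if not is_valid(value):
        return None
    if valid_round == -1:
        # Fresh proposal: accept if nothing is locked or the lock matches.
        if locked_round == -1 or locked_value == value:
            return value
        return None
    # Re-proposed value: accept if it is a more recent possible decision
    # value than the lock, or if it is the locked value itself.
    if valid_round > locked_round or locked_value == value:
        return value
    return None


def always_valid(_value):
    return True
```

For example, a process locked on `"B1"` in round 1 rejects a fresh proposal for `"B2"`, but accepts `"B2"` when it is re-proposed with valid round 2.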

If a correct process $p$ receives a PROPOSAL message for some value $v$ and $2f+1$ PREVOTE messages for $id(v)$, then it sends a PRECOMMIT message with $id(v)$. Otherwise, it sends PRECOMMIT $nil$. A correct process will also send a PRECOMMIT message with the $nil$ value in case $timeoutPrevote$ expired (it is started when a correct process has sent a PREVOTE message and received any $2f+1$ PREVOTE messages) and the process has not sent a PRECOMMIT message in the current round yet (see line 61). A correct process decides on some value $v$ if it receives in some round $r$ a PROPOSAL message for $v$ and $2f+1$ PRECOMMIT messages with $id(v)$ (see line 49). To prevent the algorithm from blocking and waiting forever for this condition to be true, Algorithm 1 relies on $timeoutPrecommit$. It is triggered after a process receives any set of $2f+1$ PRECOMMIT messages for the current round. If the $timeoutPrecommit$ expires and a process has not decided yet, the process starts the next round (see line 65). When a correct process $p$ decides, it starts the next consensus instance (for the next height). The Gossip communication property ensures that the PROPOSAL and $2f+1$ PRECOMMIT messages that led $p$ to decide are eventually received by all correct processes, so they will also decide.

III-A Termination mechanism

Tendermint ensures termination by a novel mechanism that benefits from the gossip-based nature of communication (see the Gossip communication property). It requires managing two additional variables, $validValue$ and $validRound$, that are then used by the proposer during the propose step as explained above. The $validValue$ and $validRound$ are updated to $v$ and $r$ by a correct process in a round $r$ when the process receives a valid PROPOSAL message for the value $v$ and the corresponding $2f+1$ PREVOTE messages for $id(v)$ in the round $r$ (see the rule at line 36).

We now briefly give the intuition for how managing and proposing $validValue$ and $validRound$ ensures termination. The formal treatment is left for Section IV.

The first thing to note is that during a good period, because of the Gossip communication property, if a correct process $p$ locks a value $v$ in some round $r$, all correct processes will update $validValue$ to $v$ and $validRound$ to $r$ before the end of the round $r$ (we prove this formally in Section IV). The intuition is that the messages that led to locking a value $v$ in the round $r$ will be gossiped to all correct processes before the end of the round $r$, so every correct process will update $validValue$ and $validRound$ (line 36). Therefore, if a correct process locks some value during a good period, $validValue$ and $validRound$ are updated by all correct processes so that the value proposed in the following rounds will be acceptable to all correct processes. Note that it could happen that during a good period no correct process locks a value, but some correct process $q$ updates $validValue$ and $validRound$ during some round. As no correct process locks a value in this case, $validValue_q$ and $validRound_q$ will also be acceptable to all correct processes, as $validRound_q > lockedRound_c$ for every correct process $c$ and as the Gossip communication property ensures that the corresponding PREVOTE messages that $q$ received in the round $validRound_q$ are received by all correct processes $\Delta$ time later.

Finally, it could happen that after GST there is a long sequence of rounds in which no correct process either locks a value or updates $validValue$ and $validRound$. In this case, during this sequence of rounds, the value suggested by correct proposers was not accepted by all correct processes. Note that this sequence of rounds is always finite, as at the beginning of every round there is at least a single correct process $c$ such that $validValue_c$ and $validRound_c$ are acceptable to every correct process. This is true as there exists a correct process $c$ such that for every other correct process $p$, $validRound_c > lockedRound_p$ or $validValue_c = lockedValue_p$. This holds because $c$ is the process that has locked a value in the most recent round among all correct processes (or no correct process has locked any value). Therefore, eventually $c$ will be the proposer in some round and the proposed value will be accepted by all correct processes, thereby terminating this sequence of rounds.

Therefore, updating the $validValue$ and $validRound$ variables, together with the Gossip communication property, ensures that eventually, during the good period, there exists a round with a correct proposer whose proposed value will be accepted by all correct processes, and all correct processes will terminate in that round. Note that this mechanism, contrary to the common termination mechanism illustrated in Figure 1, does not require exchanging any information in addition to the messages already sent as part of what is normally called the "normal" case.

IV Proof of Tendermint consensus algorithm

Lemma 1. For all $f \ge 0$, any two sets of processes with voting power at least equal to $2f+1$ have at least one correct process in common.

Proof.

As the total voting power is equal to $n = 3f+1$, we have $2(2f+1) = n + f + 1$. This means that the intersection of two sets with voting power equal to $2f+1$ contains at least $f+1$ voting power in common, i.e., at least one correct process (as the total voting power of faulty processes is $f$). The result follows directly from this. ∎
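The counting argument can be checked by brute force for small systems. The sketch below assumes unit voting power per process and a total of $3f+1$ processes with quorums of $2f+1$, which is the simplified setting of the proof; the function name is hypothetical.

```python
from itertools import combinations


def min_quorum_overlap(f):
    """Smallest intersection between any two quorums of size 2f + 1."""
    n = 3 * f + 1
    quorums = [set(q) for q in combinations(range(n), 2 * f + 1)]
    return min(len(a & b) for a in quorums for b in quorums)


overlaps = {f: min_quorum_overlap(f) for f in range(4)}
```

For every `f` tried, the smallest overlap is `f + 1`, so even if `f` of the shared processes are faulty, at least one process in the intersection is correct. Larger sets can only overlap more, so checking sets of exactly `2f + 1` processes is enough.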

Lemma 2. If $f+1$ correct processes lock value $v$ in round $r_0$ ($lockedValue = v$ and $lockedRound = r_0$), then in all rounds $r > r_0$, they send PREVOTE for $id(v)$ or $nil$.

Proof.

We prove the result by induction on .

Base step $r_0 + 1$: Let's denote with $C$ the set of correct processes with voting power equal to $f+1$. By the rules at line 22 and line 28, the processes from the set $C$ can't accept a PROPOSAL for any value different from $v$ in round $r_0 + 1$, and therefore can't send a $\langle \mathrm{PREVOTE}, h_p, r_0 + 1, id(v') \rangle$ message if $v' \ne v$. Therefore, the Lemma holds for the base step.

Induction step from $r_1$ to $r_1 + 1$: We assume that no process from the set $C$ has sent PREVOTE for values different from $id(v)$ or $nil$ until round $r_1 + 1$. We now prove that the Lemma also holds for round $r_1 + 1$. As processes from the set $C$ send PREVOTE for $id(v)$ or $nil$ in rounds $r_0 \le r \le r_1$, by Lemma 1 there is no value $v' \ne v$ for which it is possible to receive $2f+1$ PREVOTE messages in those rounds (i). Therefore, we have for all processes from the set $C$, $lockedValue = v$ and $lockedRound \ge r_0$. Let's assume by contradiction that a process $q$ from the set $C$ sends PREVOTE in round $r_1 + 1$ for value $id(v')$, where $v' \ne v$. This is possible only by line 30. Note that this implies that $q$ received $2f+1$ $\langle \mathrm{PREVOTE}, h_q, r, id(v') \rangle$ messages, where $r \ge r_0$ and $r \le r_1$ (see line 29). A contradiction with (i) and Lemma 1. ∎

Lemma 3. Algorithm 1 satisfies Agreement.

Proof.

Let round $r_0$ be the first round of height $h$ such that some correct process $p$ decides $v$. We now prove that if some correct process $q$ decides $v'$ in some round $r \ge r_0$, then $v = v'$.

In case $r = r_0$, $q$ has received at least $2f+1$ $\langle \mathrm{PRECOMMIT}, h, r_0, id(v') \rangle$ messages at line 49, while $p$ has received at least $2f+1$ $\langle \mathrm{PRECOMMIT}, h, r_0, id(v) \rangle$ messages. By Lemma 1, two sets of messages of voting power $2f+1$ intersect in at least one correct process. As a correct process sends a single PRECOMMIT message in a round, $v = v'$.

We prove the case $r > r_0$ by contradiction. By the rule at line 49, $p$ has received at least $2f+1$ voting-power equivalent of $\langle \mathrm{PRECOMMIT}, h, r_0, id(v) \rangle$ messages, i.e., at least $f+1$ voting-power equivalent correct processes have locked value $v$ in round $r_0$ and have sent those messages (i). Let $C$ denote this set of correct processes. On the other side, $q$ has received at least $2f+1$ voting-power equivalent of $\langle \mathrm{PRECOMMIT}, h, r, id(v') \rangle$ messages. As the voting power of all faulty processes is at most $f$, some correct process $c$ has sent one of those messages. By the rule at line 36, $c$ has locked value $v'$ in round $r$ before sending $\langle \mathrm{PRECOMMIT}, h, r, id(v') \rangle$. Therefore $c$ has received $2f+1$ PREVOTE messages for $id(v')$ in round $r > r_0$ (see line 36). By Lemma 1, a process from the set $C$ has sent a PREVOTE message for $id(v')$ in round $r$. A contradiction with (i) and Lemma 2. ∎

Lemma 4. Algorithm 1 satisfies Validity.

Proof.

Trivially follows from the rule at line 50, which ensures that only valid values can be decided. ∎

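Lemma 5. If we assume that:

  1. a correct process $p$ is the first correct process to enter a round $r$ at time $t > GST$ (for every correct process $c$, $round_c \le r$ at time $t$)

  2. the proposer of round $r$ is a correct process $q$

  3. for every correct process $c$, $lockedRound_c \le validRound_q$ at time $t$

  4. $timeoutPropose(r) > 2\Delta + timeoutPrecommit(r-1)$, $timeoutPrevote(r) > 2\Delta$ and $timeoutPrecommit(r) > 2\Delta$,

then all correct processes decide in round $r$ before $t + 4\Delta + timeoutPrecommit(r-1)$.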

Proof.

As $p$ is the first correct process to enter round $r$, it executed line 67 after $timeoutPrecommit(r-1)$ expired. Therefore, $p$ received $2f+1$ PRECOMMIT messages in the round $r-1$ before time $t$. By the Gossip communication property, all correct processes will receive those messages at the latest at time $t + \Delta$. Correct processes that are in rounds $< r-1$ at time $t$ will enter round $r-1$ (see the rule at line 55) and trigger $timeoutPrecommit(r-1)$ (see the rule at line 47) by time $t + \Delta$. Therefore, all correct processes will start round $r$ by time $t + \Delta + timeoutPrecommit(r-1)$ (i).

In the worst case, the process $q$ is the last correct process to enter round $r$, so $q$ starts round $r$ and sends a PROPOSAL message for some value $v$ at time $t + \Delta + timeoutPrecommit(r-1)$. Therefore, all correct processes receive the PROPOSAL message from $q$ at the latest by time $t + 2\Delta + timeoutPrecommit(r-1)$. Therefore, if $timeoutPropose(r) > 2\Delta + timeoutPrecommit(r-1)$, all correct processes will receive the PROPOSAL message before $timeoutPropose(r)$ expires.

By (3) and the rules at lines 22 and 28, all correct processes will accept the PROPOSAL message for value $v$ and will send a PREVOTE message for $id(v)$ by time $t + 2\Delta + timeoutPrecommit(r-1)$. Note that by the Gossip communication property, the PREVOTE messages needed to trigger the rule at line 28 are received before time $t + \Delta$.

By time $t + 3\Delta + timeoutPrecommit(r-1)$, all correct processes will receive the PROPOSAL for $v$ and $2f+1$ corresponding PREVOTE messages for $id(v)$. By the rule at line 36, all correct processes will send a PRECOMMIT message (see line 40) for $id(v)$ by time $t + 3\Delta + timeoutPrecommit(r-1)$. Therefore, by time $t + 4\Delta + timeoutPrecommit(r-1)$, all correct processes will have received the PROPOSAL for $v$ and $2f+1$ PRECOMMIT messages for $id(v)$, so they decide at line 51 on $v$.

This scenario holds if every correct process $q$ sends a PRECOMMIT message before $timeoutPrevote(r)$ expires, and if $timeoutPrecommit(r)$ does not expire before $t + 4\Delta + timeoutPrecommit(r-1)$. Let's assume that a correct process $c_1$ is the first correct process to trigger $timeoutPrevote(r)$ (see the rule at line 34) at time $t_1 > t$. This implies that before time $t_1$, $c_1$ received a PROPOSAL ($step_{c_1}$ must be $prevote$ by the rule at line 34) and a set of $2f+1$ PREVOTE messages. By time $t_1 + \Delta$, all correct processes will receive those messages. Note that even if some correct process was in a smaller round before time $t_1$, at time $t_1 + \Delta$ it will start round $r$ after receiving those messages (see the rule at line 55). Therefore, all correct processes will send their PREVOTE message for $id(v)$ by time $t_1 + \Delta$, and all correct processes will receive those messages by time $t_1 + 2\Delta$. Therefore, as $timeoutPrevote(r) > 2\Delta$, this ensures that all correct processes receive PREVOTE messages from all correct processes before their respective local $timeoutPrevote(r)$ expires.

On the other hand, $timeoutPrecommit(r)$ is triggered in a correct process $c_2$ after it receives any set of $2f+1$ PRECOMMIT messages for the first time. Let's denote with $t_2 > t$ the earliest point in time at which $timeoutPrecommit(r)$ is triggered in some correct process $c_2$. This implies that $c_2$ has received at least $f+1$ PRECOMMIT messages for $id(v)$ from correct processes, i.e., those processes have received the PROPOSAL for $v$ and $2f+1$ PREVOTE messages for $id(v)$ before time $t_2$. By the Gossip communication property, all correct processes will receive those messages by time $t_2 + \Delta$, and will send PRECOMMIT messages for $id(v)$. Note that even if some correct processes were at time $t_2$ in a round smaller than $r$, by the rule at line 55 they will enter round $r$ by time $t_2 + \Delta$. Therefore, by time $t_2 + 2\Delta$, all correct processes will receive the PROPOSAL for $v$ and $2f+1$ PRECOMMIT messages for $id(v)$. So if $timeoutPrecommit(r) > 2\Delta$, all correct processes will decide before the timeout expires. ∎

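Lemma 6. If a correct process $p$ locks a value $v$ at time $t_0 > GST$ in some round $r$ ($lockedValue = v$ and $lockedRound = r$) and $timeoutPrecommit(r) > 3\Delta$, then all correct processes set $validValue$ to $v$ and $validRound$ to $r$ before starting round $r + 1$.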

Proof.

In order to prove this Lemma, we need to prove that if the process $p$ locks a value $v$ at time $t_0$, then no correct process will leave round $r$ before time $t_0 + \Delta$ (unless it has already set $validValue$ to $v$ and $validRound$ to $r$). It is sufficient to prove this, since by the Gossip communication property the messages that $p$ received at time $t_0$ and that triggered the rule at line 36 will be received by time $t_0 + \Delta$ by all correct processes, so all correct processes that are still in round $r$ will set $validValue$ to $v$ and $validRound$ to $r$ (by the rule at line 36). To prove this, we need to compute the earliest point in time a correct process could leave round $r$ without updating $validValue$ to $v$ and $validRound$ to $r$ (we denote this time with $t_1$). The Lemma is correct if $t_0 + \Delta < t_1$.

If the process $p$ locks a value $v$ at time $t_0$, this implies that $p$ received the valid PROPOSAL message for $v$ and $2f+1$ $\langle \mathrm{PREVOTE}, h, r, id(v) \rangle$ messages at time $t_0$. At least $f+1$ of those messages are sent by correct processes. Let's denote this set of correct processes as $C$. By Lemma 1, any set of $2f+1$ PREVOTE messages in round $r$ contains at least a single message from the set $C$.

Let's denote as time $t$ the earliest point in time a correct process, $c_1$, triggered $timeoutPrevote$. This implies that $c_1$ received $2f+1$ PREVOTE messages (see the rule at line 34), where at least one of those messages was sent by a process $c_2$ from the set $C$. Therefore, process $c_2$ had received the PROPOSAL message before time $t$. By the Gossip communication property, all correct processes will receive the PROPOSAL and $2f+1$ PREVOTE messages for round $r$ by time $t + \Delta$. The latest point in time $p$ will trigger $timeoutPrevote$ is $t + \Delta$ (note that even if $p$ was in a smaller round at time $t$, it will start round $r$ by time $t + \Delta$). So the latest point in time $p$ can lock the value $v$ in round $r$ is $t_0 = t + \Delta + timeoutPrevote(r)$ (as at this point $timeoutPrevote(r)$ expires, so a process sends PRECOMMIT $nil$ and updates $step$ to $precommit$, see line 61).

Note that according to the Algorithm 1, a correct process can not send a