Bernoulli Meets PBFT: Modeling BFT Protocols in the Presence of Dynamic Failures

07/27/2020 ∙ by Martin Nischwitz, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

The publication of the pivotal state machine replication protocol PBFT laid the foundation for a large body of BFT protocols. While many successors to PBFT have been developed, there is no general technique to compare these protocols under realistic network conditions such as unreliable links. In this paper, we introduce a probabilistic model for evaluating BFT protocols in the presence of dynamic link and crash failures. Based on modeling techniques from communication theory, the network state of replicas is captured and used to derive the success probability of the protocol execution. To this end, we examine the influence of link and crash failure rates as well as the number of replicas. The model is derived from the communication pattern, making it implementation-independent and facilitating an adaptation to other BFT protocols. The model is validated with a simulation of PBFT and BFT-SMaRt. Further, a comparison in protocol behavior of PBFT, Zyzzyva and SBFT is performed and critical failure thresholds are identified.

1 Introduction

The rapidly increasing connectivity of devices, as for example envisioned by the Internet of Things, entices the development of large-scale, globally distributed systems. The European Metrology Cloud project [33], which aims to coordinate the digital transformation of legal metrology, is a prime example. The scale and complexity of such systems lead to a higher risk of failure and/or malicious behavior. The demand for trust and reliability, however, remains unchanged.

A technique to offer higher fault tolerance and availability for distributed systems is state machine replication (SMR). It requires processes to find agreement on the order of state transitions and, thus, consensus on the system state. One of the most prominent Byzantine fault tolerant (BFT) protocols is Practical Byzantine Fault Tolerance (PBFT) [10]. Many modern systems, including the recent surge of blockchain applications, utilize PBFT, or a variation of it, as their core consensus algorithm, e.g., BFT-SMaRt [4], Tendermint [7], RBFT [3], CheapBFT [18], and Hyperledger Fabric v0.6 [2].

While many advanced BFT protocols exist, the performance impact of dynamic failures in general and unreliable links in particular is often ignored. Many protocols, however, require that messages arrive within a defined timespan, i.e., there is a bound on the message delay. If that bound is not met, the performance of the protocols might deteriorate. To quantify performance, protocols are often compared with benchmarks on either real systems or simulations. To the best of our knowledge, there is unfortunately no technique to assess the impact of unreliable links on the performance of BFT protocols without requiring a comprehensive implementation.

In this paper, we fill the gap and present a probabilistic modeling approach for BFT protocols to measure the impact of dynamic link and crash failures on their performance. The model is derived from the communication pattern and therefore transferable to many BFT protocols. It predicts the system state assuming the so-called dynamic link failure model [26], that is, unreliable communication links with message losses and high delays. More specifically, we assume a constant failure probability for all links and processes, model state transitions as Bernoulli trials, and express the resulting system state as probability density functions. Thus, our model can provide feedback already during the design and development phase of BFT protocols as well as support to parametrize timeouts.

Our model validation confirms that the model accurately captures the behavior of PBFT as well as BFT-SMaRt [4] and allows predicting the probability of successful protocol executions. Moreover, we apply our model to Zyzzyva [19] and SBFT [16] to showcase its applicability to other BFT protocols. To validate the approach with an independent source, we use BFT-SMaRt’s public library [4]. The paper’s contributions can be summarized as follows:

  • We develop a probabilistic model for PBFT to quantify the performance impact in the presence of dynamic link and crash failures. Since the model is based on communication patterns, it is implementation independent.

  • We generalize our modeling approach and show that it can be applied to other BFT protocols.

  • We validate the approach in a study by comparing it to a simulation of PBFT and BFT-SMaRt and apply it to Zyzzyva and SBFT.

  • We identify critical values for dynamic link failure and crash failure rates at which the previously mentioned protocols become unstable.

The remainder of the paper is organized as follows. In Section 2, we discuss related work with a focus on BFT modeling and failures. In Section 3, we define the system model. Next, we describe the detailed derivation of our modeling approach for PBFT in Section 4, and present a simulation-based model validation in Section 5. In Section 6, we use our model to reveal structural differences between PBFT, BFT-SMaRt, Zyzzyva, and SBFT, before we conclude the paper in Section 7.

2 Related Work

2.1 Preliminaries

The main properties of BFT protocols are described by the notions of safety and liveness [10]. Safety indicates that the protocol satisfies serializability, i.e., it behaves like a centralized system. Liveness, on the other hand, indicates that the system will eventually respond to requests. In order to tolerate $f$ faulty processes, at least $3f+1$ processes are necessary [6]. Aside from process-related failures, the network, i.e., the communication between processes, also impacts the performance of BFT protocols and is often overlooked.

The network can be described as either synchronous or asynchronous. To bridge the gap between completely synchronous/asynchronous systems, the term partially synchronous was introduced [12]. A partially synchronous system may start in an asynchronous state but will, after some unspecified time, eventually return to a synchronous state. This captures temporary link failures, for example. A different perspective on a partially synchronous system is to assume a network with fixed upper bounds on message delays and processing times, where both are unknown a priori. The partially synchronous system model is utilized by many BFT protocols, e.g., PBFT, to circumvent the FLP impossibility [15] and guarantee liveness during the synchronous states of the system, without requiring it at all times. Deterministic BFT protocols guarantee safety, even in the asynchronous state, but require synchronous periods to guarantee liveness.

To detect Byzantine behavior, most BFT protocols utilize timeouts (and signatures). If the happy path of a protocol fails to make progress, a sub-protocol, e.g., a view change protocol, is triggered to recover [20]. To optimize performance, the timeout values should depend on the bounded message delay in the synchronous periods of the network, which plays an important role for deployments [11]. In addition to message delay characteristics, some networks, e.g., wireless networks, might be susceptible to link failures, leading to message omissions or corruptions. These failures are formally captured by the so-called dynamic link failure model [26], where the authors prove that consensus is impossible in a synchronous system with an unbounded number of transmission failures. Schmid et al. [27, 28] introduced a hybrid failure model to capture process and communication failures and derived bounds on the number of failures for synchronous networks.

2.2 BFT Models

Other modeling techniques to analyze the performance of distributed systems and PBFT have previously been proposed in the literature. The framework HyPerf [17] combines model checking and simulation techniques to evaluate BFT protocols. While model checking usually proves correctness, their framework uses simulations to explore the possible paths in the model checker and evaluate the performance of the protocol. The model is validated against an implementation of PBFT to predict latencies and throughput.

A method to model PBFT with Stochastic Reward Nets was proposed in [32]. The authors deployed Hyperledger Fabric v0.6, which implements PBFT, on a cluster of four nodes and compare the collected data to the distributions of their SRNs. Afterwards, the number of nodes in the model is scaled and evaluated regarding the mean time to consensus.

Singh et al. [30] provided a more direct approach to evaluate BFT protocols with the simulation framework BFTSim. The framework utilizes the high-level declarative language P2 [21] to implement three different BFT protocols [10, 19, 1]. Built upon ns-2 [24], the simulation can explore various network conditions.

While the previously listed models and/or simulators offer the possibility to evaluate the performance of BFT protocols, they all require comprehensive implementations of the respective protocol. The model presented in this paper, however, is derived from the communication pattern. Moreover, no simulations or measurements are required to employ our model; all system states can be evaluated with closed-form expressions at low computational cost. Finally, the main focus of our work is to present a model for BFT protocols that captures the impact of unreliable communication and varying message delays, which the other models only considered as a minor aspect.

2.3 Link and Crash Failures

In this paper, we analyze the consequences of message omissions in PBFT caused by dynamic link failures, which also covers misconfigured timeouts. Fathollahnejad et al. [14] examined the impact of dynamic link failures on their leader election algorithm in a traffic control system. They assume a constant failure probability and present techniques to calculate the probability for disagreement. As in our paper, the number of received messages of an all-to-all broadcast is modeled with Bernoulli trials. Since their use case does not consider Byzantine failures, their protocol does not require a consecutive collection of message quorums which is, however, implemented in most BFT protocols and thus the main focus of the model presented in this paper.

Xu et al. presented RATCHETA [34], a consensus protocol which was designed for embedded devices in a wireless network that might be prone to dynamic link failures. They included an evaluation with artificially induced packet losses, measuring the number of failing consensus instances. RATCHETA requires a trusted subsystem that prevents a process from casting differing votes during the same consensus instance, eliminating the possibility of equivocations. It therefore yields a resilience of $n \geq 2f+1$, allowing $f$ Byzantine failures.

In addition, there is a body of literature that covers the theoretical limits of consensus protocols [29, 28, 5], which are based on a hybrid failure model [27] and therefore also capture link failures. Existing models, however, rarely consider performance aspects resulting from unreliable network conditions such as dynamic crash and link failures. Since all BFT protocols have a built-in protocol to recover from crashed processes, e.g., view changes, their impact on the performance is tied to the frequency of the recovery algorithm execution.

3 System Model

3.1 Process Model

A distributed system consists of a fixed number of $n$ processes (we use the terms process, node, and replica interchangeably), each with a unique ID. Typically, no more than $f$ processes are allowed to be subject to Byzantine faults. To guarantee safety, the system requires $n \geq 3f+1$ replicas.
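The relationship between the number of replicas and the number of tolerated Byzantine faults can be sketched with a few lines of arithmetic. The function names below are hypothetical and only illustrate the well-known bound for PBFT-style protocols (at least $3f+1$ replicas for $f$ faults, with quorums of $2f+1$ in the minimal configuration).

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1 (illustrative helper, not from the paper)."""
    if n < 1:
        raise ValueError("need at least one replica")
    return (n - 1) // 3


def min_replicas(f: int) -> int:
    """Smallest number of replicas that tolerates f Byzantine faults."""
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Quorum size in the minimal configuration n = 3f + 1."""
    return 2 * f + 1


if __name__ == "__main__":
    for n in (4, 7, 10):
        f = max_faulty(n)
        print(n, f, quorum_size(f))
```

For example, ten replicas tolerate three Byzantine faults, while adding an eleventh replica does not raise that number.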

In order to capture diverse failure cases, e.g., congestion due to high traffic load, we introduce the term dynamic crash failures. That is, processes can become unavailable in each phase for a short period of time. It is assumed that every crashed process will recover almost immediately, upholding its pre-crash state, and may thus be available in the next phase of the protocol. Since most BFT protocols (including PBFT) are based on consecutive phases, a crashed replica will remain inactive until the protocol-specific recovery algorithm, e.g., the view-change protocol for PBFT, has recovered crashed replicas. In this paper, crash failures are assumed to be independent and identically distributed (i.i.d.) random variables for all processes.

Since our model is derived from the communication pattern of the protocol, special roles such as the primary in PBFT, which follow a different communication pattern, are incorporated into the model.

3.2 Network Model

We assume that each network node has a peer-to-peer connection to all other nodes. The network model in this paper allows for (1) messages to be delayed indefinitely, i.e., past the configured timeout parameter of PBFT, and (2) message omissions as well as corruptions. The former case acknowledges BFT protocols that rely on synchronous periods to guarantee liveness and are based on timeouts to detect process and/or link failures. PBFT, for example, makes use of timeouts to detect if progress is being made and as a consequence to initiate the view-change protocol, which elects a new primary. Messages that arrive after a configured timeout can therefore be considered as message omissions. The same applies to invalid or corrupted messages. The resulting failure model can be described with the dynamic link failure model, introduced in [26]. While in practice many BFT protocols rely on other protocols to guarantee reliable communication, e.g., TCP, they should implement means to handle lost messages due to crashed/malicious processes. We therefore assume unidirectional links, which implies unreliable communication.

If assumptions made regarding the bound of message delays fail, i.e., the timeouts are not configured appropriately, the protocol can be considered to operate in an asynchronous network with unbounded message delays. This does not apply if an attacker is considered to have control over the scheduling of messages, as this could easily lead to stopping a BFT protocol altogether [22]. As with process failures, the link failures are assumed to be i.i.d. for all links.

4 Modeling PBFT

The model presented in this section offers means to evaluate PBFT in the presence of dynamic link failures and crash failures. In particular, a performance impact assessment becomes possible. For the sake of clarity, we provide an overview of our modeling approach and introduce our notation first. Next, we unroll our model for various failure types step-by-step starting with dynamic crash failures, before we incorporate dynamic link failures.

4.1 Overview

Figure 1: Modeled view on PBFT’s happy path communication pattern. Each phase is modeled via alternating predictions for crash () and link failures ().

In PBFT, the happy path consists of five phases of message exchanges, as depicted in Figure 1. The first and last phase consist of transmissions from and to the client. In the first phase, the leader of the current view will collect and serialize client requests. This is followed by a second phase in which the primary disseminates the requests to all other replicas in a so-called pre-prepare message. If a replica receives and accepts a pre-prepare message, it stores that message and enters the third phase, broadcasting prepare messages. Replicas wait for a quorum of prepare messages which match the stored pre-prepare message. In the fourth phase, replicas broadcast commit messages to all other replicas. Finally, if a replica collects another quorum of commit messages, which match a previously collected quorum of prepare messages, it will commit (and execute) the state transition. In the fifth and last phase of the protocol, replicas reply to the client, confirming that the client’s request was executed by the replicated system.

Disregarding the client interactions, PBFT’s happy path can be reduced to three phases by omitting the first and last phase. The remaining three phases can be summarized as a broadcast phase and two quorum collection phases. In the following, we assume that the primary is ready to initiate the consensus algorithm. Consequently, only the communication between the replicas is captured in our model.

In each phase of the protocol, the communication between and the availability of replicas is modeled as a combination of Bernoulli trials. More specifically, we model link and crash failures in alternating rounds for each of PBFT’s phases, as depicted at the bottom of Figure 1. To this end, we introduce random variables and to express the success probabilities for the respective failure type in round . We assume failures to be i.i.d. for all links and processes.

In a first step, only faulty nodes are modeled as crash failures in a series of interdependent Bernoulli trials, i.e., . In a second step, we extend the model by incorporating link failures. The communication is modeled along the lines of the three transmission phases , , and . Combined with the node failures, our model yields an interleaving series of dependent system states, i.e., .

In summary, the system state of all replicas at each protocol phase is captured by a series of probability density functions (PDFs), each serving as the basis for calculating the next. Note that each PDF allows for a precise prediction of the protocol behavior and can be transformed into more common performance metrics, e.g., latency, with statistics or other models that predict the duration of individual phases.

4.2 Notation

In Table 1, we summarize relevant probabilities, events, and random variables, which are used in our model. For ease of comprehension, the link and crash failure distributions are now reduced to single probabilities, i.e., and , respectively. The assumption to have identical link failure probabilities for all links is not an uncommon practice in this field of research [13, 30, 34]. The impact of arbitrary failure distributions on the model are explained in Section 4.5. The system state of the protocol is modeled by calculating PDFs that describe each replica’s state. To this end, the random variables and events listed in Table 1 are indexed according to the different phases in PBFT. In particular, , , and represent the number of replicas that received a pre-prepare message, received a quorum of prepare messages and received a quorum of commit messages, respectively. Additionally, the number of active replicas after each phase is described with , , and . Due to the nature of the PBFT algorithm, the distributions are dependent on each other, i.e., a replica that crashed or failed to collect the required messages will not be able to complete the happy path. The performance of the protocol can be assessed by calculating the previously mentioned PDFs that describe the states of each replica.

A key building block of our model are Bernoulli trials. Therefore, we use the notation to express the probability to get exactly successes in a Bernoulli experiment with trials and a success probability of . Furthermore, we define as the sum over all Bernoulli trials with at least and up to successes. Finally, the notations and are used to abbreviate (conditional) probabilities.
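The two building blocks described in this paragraph, the probability of exactly a given number of successes and the sum over a range of success counts, correspond to the binomial probability mass function and a partial sum over it. A minimal sketch follows; the function and parameter names are assumptions chosen for illustration and are not the paper's notation.

```python
from math import comb


def bernoulli_exact(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials
    with success probability p (binomial PMF)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def bernoulli_range(n: int, lo: int, hi: int, p: float) -> float:
    """Probability of at least lo and at most hi successes in n trials."""
    first = max(lo, 0)
    last = min(hi, n)
    return sum(bernoulli_exact(n, k, p) for k in range(first, last + 1))


if __name__ == "__main__":
    # Example: chance that at least 2 of 3 messages arrive
    # when each arrives with probability 0.9.
    print(bernoulli_range(3, 2, 3, 0.9))
```

Clamping the range to valid success counts keeps the helper safe to call with thresholds that exceed the number of trials, in which case the result is zero.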

Symbol Description Range
Probabilities
Probability for a crash failure.
Probability for a link failure.
Events
Indicates a successful reception of a quorum of prepare messages at the current primary.
Random Variables
The number of replicas that have received a pre-prepare message.
The number of replicas, excluding the primary, that did not crash in the pre-prepare phase.
The number of replicas, excluding the primary, that received a pre-prepare message as well as collected a quorum of prepare messages.
The number of replicas that received a pre-prepare message as well as collected a quorum of prepare messages.
The number of replicas that did not crash in the prepare phase.
The number of replicas that have received a pre-prepare message as well as collected a quorum of both prepare and commit messages.
The number of replicas that did not crash in the commit phase and successfully executed the algorithm.
Table 1: Model notation for PBFT.

4.3 Modeling crash failures

We start by modeling one of the most prominent failures, crash failures. In particular, a crash failure implies no participation of the crashed process in the phase in which the crash occurred. Hence, the replica will neither receive nor send any messages. In the following, the three random variables , and are derived for a crash failure probability , assuming reliable communication links.

The happy path of PBFT is initiated with the primary broadcasting a pre-prepare message to all other replicas. For the sake of simplicity, it is assumed the primary cannot crash during the pre-prepare phase. Since the probability for a crash is uniform across all nodes, the PDF of the still active replicas is given by . The distribution of describes the number of replicas that will now broadcast prepare messages to all other replicas in the second phase of the protocol.

Following this procedure, the distribution of active nodes in further phases is calculated conditioned on the previous phase, meaning

(1)

Adding up the values of allows to predict the success probability for the happy path of PBFT. As replicas cannot skip a phase in PBFT, a crashed replica will not recover during the happy path rendering dynamic crashes similar to permanent ones.
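The chaining of phases under crash failures alone can be sketched as repeated conditioning: the distribution of active replicas after a phase is obtained from the previous distribution via the law of total probability. The sketch below makes simplifying assumptions that go beyond the text: it tracks only the non-primary replicas, treats the primary as never crashing in any phase, and ignores quorum requirements. All names are hypothetical.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def survive_one_phase(dist: list, p_crash: float) -> list:
    """Map P(active = a) to the distribution after one more phase in which
    each active replica independently stays up with probability 1 - p_crash."""
    out = [0.0] * len(dist)
    for a, prob_a in enumerate(dist):
        for j in range(a + 1):
            out[j] += binom_pmf(a, j, 1.0 - p_crash) * prob_a
    return out


def active_after_phases(backups: int, p_crash: float, phases: int = 3) -> list:
    """Distribution of still-active backups after the given number of phases."""
    dist = [0.0] * (backups + 1)
    dist[backups] = 1.0  # all backups are active before the first phase
    for _ in range(phases):
        dist = survive_one_phase(dist, p_crash)
    return dist


if __name__ == "__main__":
    print(active_after_phases(backups=3, p_crash=0.1))
```

Because a crashed replica never rejoins in this sketch, the result matches a single binomial distribution in which each backup survives all three phases with probability `(1 - p_crash) ** 3`.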

4.4 Modeling crash and link failures

We now extend our model by introducing link failures, i.e., the links are no longer considered reliable and are subject to a link failure probability . The three random variables , and are introduced to model the behavior of the protocol during the three communication phases. Due to the special behavior of the primary in the second phase, is divided into the event and random variable to capture the communication of the primary and other replicas, respectively. Assuming the dynamic link failure model, all links are subject to the same failure probability and can be, as with crash failures before, described with Bernoulli trials. In the following, we therefore start alternating between and (cf. Figure 1) to model the success of the message delivery and node availability, respectively.

Calculating

After receiving a request from the client, the first phase is initiated with the primary broadcasting a pre-prepare message to all other replicas. Again, we assume that the primary cannot crash while receiving requests from a client. Since the success probability for each message transmission is equal to and independent of other transmissions, the number of successful transmissions can be calculated with a Bernoulli trial. The PDF of is given by

(2)

and describes the number of replicas that have received a pre-prepare message from the primary.

Calculating

Given the distribution of , some of replicas might crash in this phase. Based on , the PDF of the remaining active replicas is

(3)
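The two steps of the pre-prepare phase, message delivery followed by possible crashes, can be sketched numerically. The code below assumes that the primary sends one pre-prepare message to each backup, that each message is lost independently with probability `p_link`, and that each receiving backup crashes independently with probability `p_crash`. The names are illustrative assumptions rather than the paper's symbols.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def received_pre_prepare(backups: int, p_link: float) -> list:
    """PMF of the number of backups that receive the pre-prepare message."""
    return [binom_pmf(backups, k, 1.0 - p_link) for k in range(backups + 1)]


def active_after_crashes(received: list, p_crash: float) -> list:
    """PMF of backups that received the message and did not crash,
    obtained with the law of total probability over `received`."""
    active = [0.0] * len(received)
    for k, prob_k in enumerate(received):
        for j in range(k + 1):
            active[j] += binom_pmf(k, j, 1.0 - p_crash) * prob_k
    return active


if __name__ == "__main__":
    received = received_pre_prepare(backups=3, p_link=0.2)
    print(active_after_crashes(received, p_crash=0.1))
```

The resulting distribution feeds into the next phase in the same way, which is what makes the overall model a chain of dependent distributions.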

Calculating

The communication in the second phase is composed of two cases: (1) whether the primary can collect prepare messages (i.e., ), and (2) the number of non-primary replicas that collect at least prepare messages (i.e., ).

The primary can only collect at least prepare messages if at least active replicas have received the previous pre-prepare message, i.e., . In this case, at least transmissions of prepare messages of the replicas have to successfully reach the primary. This can be expressed as the sum over all favorable Bernoulli trials, i.e., all trials with at least successes out of . The conditional probability for the primary to collect the prepare message is given by

(4)

For a non-primary node, i.e., a replica, to advance to , two requirements need to be met: (1) the replica has received a respective pre-prepare message, and (2) the replica has collected a quorum of matching prepare messages. For a quorum, only prepare messages are required, since a replica’s own prepare message and the primary’s pre-prepare message count towards the required messages. The previous requirements translate to

  1. there cannot be more replicas that receive prepare messages than replicas that have previously received a pre-prepare message, i.e., , and

  2. a replica can only receive prepare messages if at least replicas, including itself, have received a pre-prepare, i.e., .

The calculation of can thus be divided into the following cases, assuming that replicas have received a pre-prepare message. First, for , no replica will be able to gather the required quorum of prepare messages, thus, the probability for is always one. Second, if , the probability has to be zero. Finally, for all other cases, of the replicas that broadcast prepare messages, excluding the primary, the probability for replicas to receive of those messages can be modeled as another Bernoulli trial. The probability of success in that Bernoulli trial is identical to a replica receiving at least messages of the possible. Thus, the conditional PDF of for and is given by

(5)

with being the probability that a replica will receive at least prepare messages, given that replicas, including the replica itself, are broadcasting that message, which implies they have received the pre-prepare message as well. This can be calculated with another Bernoulli trial to get receptions from messages of the other replicas,

(6)
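The quorum collection step can be sketched in two layers: first the probability that a single replica receives enough messages from the other broadcasting replicas, then the distribution of how many replicas succeed. The threshold is passed in as the hypothetical parameter `needed` rather than hard-coded, and all names are assumptions for illustration. As in the description above, each replica's success is treated as an independent trial with the same success probability.

```python
from math import comb


def binom_pmf(n: int, k: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def prob_enough_messages(senders: int, needed: int, p_link: float) -> float:
    """Probability that at least `needed` of `senders` messages arrive
    when each is lost independently with probability p_link."""
    deliver = 1.0 - p_link
    first = max(needed, 0)
    return sum(binom_pmf(senders, k, deliver) for k in range(first, senders + 1))


def replicas_with_quorum(broadcasters: int, needed: int, p_link: float) -> list:
    """Conditional PMF of the number of broadcasting replicas that collect
    enough messages, given that each hears from broadcasters - 1 others."""
    if broadcasters == 0:
        return [1.0]
    q = prob_enough_messages(broadcasters - 1, needed, p_link)
    return [binom_pmf(broadcasters, j, q) for j in range(broadcasters + 1)]


if __name__ == "__main__":
    print(prob_enough_messages(senders=3, needed=2, p_link=0.1))
    print(replicas_with_quorum(broadcasters=4, needed=2, p_link=0.1))
```

If fewer replicas broadcast than the threshold requires, the inner probability is zero and all probability mass falls on zero successful replicas, which mirrors the first special case described in the text.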

Combining (4) and (5) yields the conditional PDF of . The calculation is split into multiple cases as follows

(7)

The final PDF is given by applying the law of total probability to (7), which yields

(8)

Calculating

As with and (3), the distribution of replicas that are still active, based on , is

(9)

Calculating

Now, let us turn to the states and . In the third phase, the primary behaves in the same way as every other replica, simplifying many calculations regarding the communication as we do not need to mind so many exceptions. As with , there are two requirements necessary for a replica to reach : (1) the replica must be in state and (2) it must have received at least commit messages, not counting its own. Thus, we can conclude that

  1. there cannot be more replicas that have received commit messages than replicas that have reached , i.e., , and

  2. a replica can only receive commit messages if at least replicas, including itself, have reached state , i.e., .

Deriving is similar to with (5), (6), and (7). The conditional probability of for and is accordingly

(10)

where is the probability that a replica will receive at least commit messages if replicas, including itself, are broadcasting that message, i.e.,

(11)

Applying the law of total probability yields

(12)

Calculating

Finally, the PDF of all active nodes after the last phase is similar to (3) and (9)

(13)

Combining (2), (3), (8), (9), (12) and (13) yields the system state at any time during execution of PBFT’s happy path. If more than replicas have completed the last phase, i.e. , the happy path of PBFT was successful. For the system to provide liveness in regards to the current request, only replicas are sufficient.

4.5 Generalization

For the sake of simplicity, we so far assumed constant failure probabilities for links and processes. We also assumed in our calculations that those probabilities are constant for each phase of PBFT. This is not a requirement and could be expanded to reflect more sophisticated failure models that include time-based correlations, for example, as long as they remain i.i.d. for each phase.

Since the model is derived solely from communication patterns, it can be adapted to other BFT protocols, which exhibit similarities to PBFT. This is facilitated by the modular design of the model, i.e., the expression of communication phases, e.g., broadcast, quorum, as PDFs which can be combined to describe the overall system state. To demonstrate the adaptability, we show the application of the model to BFT-SMaRt, Zyzzyva and SBFT in Appendices A, B, and C, respectively. These adaptations showcase how the model can be applied to a variety of communication patterns, including client interaction and the possibility to branch into a fast or slow path.

We deliberately chose to highlight the model derivation in this section, leaving the formal definition of the modular components for future work.

5 Model Validation

To verify the correctness of the model, a discrete-event simulator was written in Rust. The codebase of the simulation is publicly available on GitHub (https://github.com/mani2416/bft_simulation). In a first instance, the simulator implements the happy path of PBFT for single requests without batching. The dynamic link failure model is realized by discarding each message reception event with a configurable probability (except the messages from the client). In addition, each node will miss all messages belonging to a certain communication phase with the crash failure probability, simulating a crash failure. The state transitions of all processes are logged, i.e., receiving a pre-prepare, collecting a quorum of prepare and commit messages, as well as any crashes that occurred during each phase. By doing this, we can compare the simulated state with the predictions of our model. The simulation can easily scale to larger numbers of nodes (above 100) since only the state transitions in the happy path are of interest and no actual SMR is implemented, i.e., requests are not executed.
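The idea of comparing simulated state with model predictions can be illustrated with a small Monte Carlo sketch of the pre-prepare phase only. This is not the authors' Rust simulator; it is a simplified stand-in with hypothetical names that drops each pre-prepare message with probability `p_link`, lets each backup miss the phase with probability `p_crash`, and compares the empirical distribution with the corresponding binomial prediction.

```python
import random
from math import comb


def simulate_pre_prepare(backups, p_link, p_crash, runs, seed=0):
    """Empirical PMF of backups that receive the pre-prepare and stay active."""
    rng = random.Random(seed)
    counts = [0] * (backups + 1)
    for _ in range(runs):
        active = 0
        for _ in range(backups):
            crashed = rng.random() < p_crash
            delivered = rng.random() >= p_link
            if delivered and not crashed:
                active += 1
        counts[active] += 1
    return [c / runs for c in counts]


def model_pre_prepare(backups, p_link, p_crash):
    """Predicted PMF: each backup is active with prob (1-p_link)*(1-p_crash)."""
    ok = (1.0 - p_link) * (1.0 - p_crash)
    return [comb(backups, k) * ok**k * (1.0 - ok) ** (backups - k)
            for k in range(backups + 1)]


if __name__ == "__main__":
    sim = simulate_pre_prepare(3, 0.2, 0.1, runs=20000, seed=1)
    model = model_pre_prepare(3, 0.2, 0.1)
    for k, (s, m) in enumerate(zip(sim, model)):
        print(k, round(s, 4), round(m, 4))
```

With a fixed seed the run is reproducible, and increasing `runs` narrows the gap between the simulated and predicted values.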

In order to validate our model with an independent source, we also deployed the Java-based public BFT SMR library BFT-SMaRt [4] as a reference implementation. Recent work considers the library as a stable implementation of BFT SMR [9, 25, 31]. It implements a consensus protocol [8] that bears a high resemblance to PBFT: it utilizes epochs, an equivalent to the views in PBFT, and operates in three phases with respective message types. For simplicity, we stick to PBFT’s terminology when discussing BFT-SMaRt. While mostly similar, the communication pattern of BFT-SMaRt differs from PBFT in two details, which required minor model adaptations. Firstly, nodes do not count the primary’s pre-prepare message as a prepare message for the second phase. Secondly, nodes are allowed to skip the second phase if a quorum of other nodes was able to complete that phase so that sufficiently many commit messages are broadcast in the third phase. We describe the model adaptations in Appendix A. To apply dynamic link and crash failures to BFT-SMaRt, artificial message omission probabilities were implemented into the library. Accordingly, all messages associated with one of the three phases are dropped with the link failure probability, and each node discards all messages of a whole phase with the crash failure probability. The changes necessary to implement the aforementioned failures affected only the class that is responsible for handling incoming messages and consisted of fewer than 50 lines of additional code. Since BFT-SMaRt was not implemented to cope with dynamic link failures, the library relies on TCP to handle message omissions. Therefore, the introduced message losses cause the library to halt during the view-change protocol. Consequently, to gather enough data and reliably evaluate the protocol, the protocol is restarted after each request. By doing this, we are still able to log all states and state transitions as before. The library was executed on a single computer, and up to 10 replicas and one client were instantiated to execute the requests.

Increasing the number of processes while keeping the maximum number of faulty processes constant leads to an increased robustness of both protocols against link and crash failures, because more messages are available to build a quorum while the required quorum size remains the same. We therefore evaluate both protocols for the most interesting scenario $n = 3f + 1$, i.e., the minimum number of processes required to tolerate $f$ faulty processes.

Figure 2: Model validation results for PBFT (a-c) and BFT-SMaRt (d-f).

To validate the model for a larger parameter space, we evaluated PBFT for different numbers of processes, link failure rates, and crash failure rates. Figures 2(a) to 2(c) show the probability of a single (representative) process to successfully reach a given phase for these parameters. The simulation results for 5,000 requests are plotted with confidence intervals, and model predictions are depicted as crosses. Increasing the number of processes in the network can have, depending on the number of processes and failure rates, either a stabilizing or a destabilizing effect on the performance. A more detailed analysis of this behavior is given in Section 6.2. Increasing the failure rate of either links or processes causes a steady decrease of this probability.
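A confidence interval for a measured success proportion can be computed as sketched below. The text does not state which interval or confidence level was used, so this sketch assumes a 95% Wilson score interval; the function name `wilson_interval` and the example counts are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an assumed confidence level of about 95%.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical example: 4,500 of 5,000 requests reached the phase.
low, high = wilson_interval(4500, 5000)
print(f"{low:.4f} .. {high:.4f}")
```

A model prediction that falls inside such an interval is consistent with the simulation at the assumed confidence level.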

The comparison between model predictions and experimental results for BFT-SMaRt is shown in Figures 2(d) to 2(f). Since BFT-SMaRt implements actual SMR and the execution was unstable due to the previously mentioned halts during the view-change protocol, we executed only 1,000 requests for each parameter combination (without batching). Figure 2(d) depicts the measured PDF and the PDF predicted by the model. The impact of link and crash failures on BFT-SMaRt is similar to that on PBFT. The small deviations visible between Figures 2(b) and 2(c) and Figures 2(e) and 2(f) stem from the algorithmic differences described above. The overall results confirm that our model predictions for PBFT and BFT-SMaRt align accurately with the simulations and experimental results.

6 Evaluation

6.1 Protocol stability

The all-to-all broadcast quorum collection phase in BFT protocols, in which each process has to collect a quorum of messages out of those sent by all other processes (not counting its own), is inherently resilient against link failures. A node fails to collect a quorum only if so many of its incoming links fail that fewer messages than the quorum size arrive. Consequently, even in the worst case, the quorum collection phase can only fail if sufficiently many nodes each suffer at least that many incoming link failures; the overall number of link failures required is the product of these two thresholds.
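The thresholds in this argument can be made concrete with a small worked sketch. It rests on assumptions that are not taken from the text: a system of n = 3f + 1 processes, a quorum of 2f + 1 messages that includes the node's own message, and a phase that is only at risk once fewer than 2f + 1 nodes can complete it. The function name `quorum_failure_thresholds` is hypothetical.

```python
def quorum_failure_thresholds(f):
    """Worst-case link failure thresholds under the stated assumptions."""
    n = 3 * f + 1              # assumed minimum system size
    needed = 2 * f             # messages needed from other nodes (own one counts)
    incoming = n - 1           # incoming links per node
    # A node misses the quorum once fewer than `needed` messages arrive.
    links_per_node = incoming - needed + 1
    # Assumed: the phase is at risk once fewer than 2f + 1 nodes succeed.
    nodes = n - (2 * f + 1) + 1
    return links_per_node, nodes, links_per_node * nodes


for f in (1, 2, 3):
    print(f, quorum_failure_thresholds(f))
```

Under these assumptions both thresholds equal f + 1, so the minimum number of link failures that can break the phase grows quadratically with f.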

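Figure 3: Model predictions for PBFT for increasing (a) link failure rates and (b) crash failure rates.
Figure 4: Contour plot of the success probability as predicted by our model for PBFT; the gradient is shown as a vector field over the link and crash failure rates.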

We can validate this theoretical boundary for PBFT with our model. Since the number of processes that partake in each quorum phase is dependent on previous phases, the boundary for each phase is calculated as

(14)

where the boundary depends on the expected number of nodes that are still active after the previous phase. Figure 3 depicts the predicted success probabilities for increasing link failures (Figure 3(a)) and crash failures (Figure 3(b)). The line labeled "stable" marks the boundary, given by (14), past which the average number of link failures can potentially cause the quorum phase to fail. The linear decrease to the left of the boundary originates from the previous phases of the protocol. In Figure 3(a), the number of processes that receive the pre-prepare message in the first phase limits the overall number of processes that can complete the protocol. Since this number decreases linearly with the link failure rate, so too do the success probabilities of the subsequent phases. Past the boundary, the quorum phase may fail due to the increased number of link failures, resulting in an abrupt decline of these probabilities. The same behavior is visible for increasing crash failures in Figure 3(b), albeit with an even steeper linear phase. Because processes cannot recover within the happy path of PBFT, each successive phase with crash failures decreases the number of available nodes for further phases, leading to the steeper decline before the boundary.

The evaluation methodology and the respective results can be used to parametrize the protocol such that the protocol execution remains stable for a given failure rate. Since most BFT protocols treat delayed messages as link failures, the model can be utilized to fine-tune timeouts. That is, for a known (or assumed) delay distribution, a timeout parameter can be translated to a failure rate. A small timeout accordingly leads to a higher failure rate, but at the same time allows the protocol to quickly detect (genuinely) lost messages and make progress. For instance, let us assume that the message delay on all links can be described with a normal distribution of known mean and standard deviation. Further, we assume that some arbitrary protocol remains stable up to a certain link failure rate. The timeout that keeps the protocol in the stable region is derived by finding the upper bound at which the integrated PDF of the delays equals the complement of that failure rate. To conclude, our model makes it possible to evaluate various failure scenarios and adjust parameters accordingly.
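The translation from a tolerable failure rate to a timeout can be sketched with the inverse CDF of the delay distribution. The numbers below (mean delay 50 ms, standard deviation 10 ms, tolerable failure rate 0.1) are hypothetical values chosen for illustration, and the function name `timeout_for_failure_rate` is likewise an assumption.

```python
from statistics import NormalDist


def timeout_for_failure_rate(mean_delay, std_delay, max_failure_rate):
    """Smallest timeout for which P(delay > timeout) <= max_failure_rate.

    Assumes normally distributed, independent message delays.
    """
    if not 0 < max_failure_rate < 1:
        raise ValueError("max_failure_rate must lie strictly between 0 and 1")
    delays = NormalDist(mu=mean_delay, sigma=std_delay)
    # The CDF at the timeout must equal 1 - max_failure_rate.
    return delays.inv_cdf(1 - max_failure_rate)


# Hypothetical example values, in milliseconds.
timeout = timeout_for_failure_rate(50.0, 10.0, 0.1)
print(f"timeout = {timeout:.2f} ms")
```

Any timeout at or above the returned value keeps the share of messages that are treated as lost at or below the tolerable failure rate, under the stated distribution assumption.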

6.2 Impact of number of processes, link, and node failures in PBFT

To better demonstrate the predictive capabilities of the model, Figure 4 provides a contour plot of the success probability for PBFT for varying link and crash failure rates. Additionally, the gradient is displayed, derived from the operating points for different link and crash failure rates as predicted by the model. The orientation of the arrows indicates the impact of variations in either failure rate on the success probability. The more pronounced the horizontal component of a vector, the more dominant the impact of crash failures; the same applies to the vertical component and link failures. The contour plot makes it easy to discern the impact of either failure rate on the protocol performance. For example, for low link failure rates, changes in the link failure rate have a mostly negligible impact compared to changes in the crash failure rate. However, for very low crash failure rates and a moderate link failure rate, the link failure rate dominates the performance of PBFT.

Figure 4 also confirms the observations made in Figure 3(a). While the link failure rate is below the boundary, i.e., within the linear slope, the crash failure rate mostly dominates the impact on the performance. For higher link failure rates, their impact increases, corresponding to the steeper decline that follows after the boundary.

The influence of the number of processes on PBFT for a given link failure rate is plotted in Figure 5(a). While it is well known that most BFT protocols do not scale well with the number of processes due to the quadratic message complexity, adding processes generally offers a means to increase stability in the presence of dynamic link failures. In Appendix D, we show that the probability to collect a quorum converges to either 0 or 1 for a growing number of processes, depending on the link failure rate and the quorum size. We also observe a dip in the success probability, visible for small numbers of nodes and a certain range of failure rates. The dip is caused by two effects: (1) for a larger number of processes, the variance of the binomial distribution increases, and (2) while the mean of the distribution narrows down on a value depending on the failure rates, the number of messages required to complete the quorum moves towards a discrete value. As a consequence, increasing the number of nodes can increase the success probability of the quorum collection phase for failure rates below a certain threshold.
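The convergence behavior can be checked numerically with the exact binomial sum. The sketch below assumes that a node needs 2f out of 3f incoming messages (a quorum of 2f + 1 including its own message in a system of 3f + 1 processes) and that messages are omitted independently; the function name `quorum_probability` and the two example omission probabilities are hypothetical.

```python
from math import comb


def quorum_probability(n_incoming, quorum, p_omit):
    """P(at least `quorum` of `n_incoming` messages arrive).

    Each message is assumed to be omitted independently with p_omit.
    """
    return sum(
        comb(n_incoming, k) * (1 - p_omit) ** k * p_omit ** (n_incoming - k)
        for k in range(quorum, n_incoming + 1)
    )


for p_omit in (0.2, 0.4):
    for f in (1, 5, 20, 50):
        prob = quorum_probability(3 * f, 2 * f, p_omit)
        print(f"p_omit={p_omit} f={f:2d} P(quorum)={prob:.4f}")
```

With the assumed two-thirds quorum, the probability tends towards 1 when fewer than a third of the messages are omitted on average and towards 0 otherwise, which is the two-sided behavior described above.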

6.3 Comparison between PBFT, BFT-SMaRt, Zyzzyva, and SBFT

Figure 5: Predictions made by our model: (a) depicts the impact of the number of replicas for different link failure rates in PBFT, (b) compares the happy path success probabilities of PBFT, BFT-SMaRt, Zyzzyva and SBFT, and (c) and (d) give a more detailed analysis of SBFT.

To showcase the adaptability of our model, we applied it to Zyzzyva [19] and SBFT [16]. The detailed adaptations to the model are listed in Appendices B and C.

An exemplary comparison of all protocols for different crash failure rates is given in Figure 5(b). Depicted are the overall success rates, i.e., for Zyzzyva and SBFT the combination of fast and slow paths. To better demonstrate the capabilities of the model, the individual success probabilities for the fast and slow path of SBFT for different link failure rates are plotted in Figures 5(c) and 5(d). Since SBFT allows for optional, additional replicas, denoted as $c$, the model makes it possible to quickly assess the protocol behavior for different failure rates and configurations of $c$. The plots show that SBFT outperforms PBFT for increasing values of $c$ and higher failure rates, while PBFT is more stable if SBFT transitions from the fast path to the slow path.

7 Conclusion

The probabilistic predictions of the presented model were validated with implementations of PBFT and BFT-SMaRt for various numbers of processes and dynamic link and crash failure rates. It was demonstrated with BFT-SMaRt, Zyzzyva and SBFT that the model can be adapted with little effort to other communication patterns of BFT protocols. The model predicts the distribution of process states during execution, allowing prediction of protocol behavior (e.g., how many view changes will occur) and therefore performance evaluation. Additionally, if the message delay statistics are known, the model can be used to tune the timeouts of BFT protocols: most protocols cannot differentiate between a delayed and an omitted message, so both have the same impact on the performance of the algorithm. Finally, the model makes it possible to assess the impact of crash and link failures for various operating points of a protocol in order to identify key boundaries regarding protocol stability.

As was demonstrated with BFT-SMaRt, Zyzzyva and SBFT, the model can be applied to different BFT protocols by modifying the respective equations for the distributions or by adding further random variables should the protocol consist of more phases (as is the case with SBFT). Further adaptations are facilitated by the fact that a large body of BFT protocols is derived from the core structure of PBFT and consists of interdependent phases.

In future work, we plan to apply the model to more BFT protocols and evaluate their performance regarding dynamic failures. Furthermore, we are exploring possibilities to extend the model to predict more sophisticated key performance indicators, such as throughput and latency. Lastly, we will consider adaptations to our model in order to account for correlated link failures, e.g., as proposed in the model by Nguyen et al. [23].

References

  • [1] Michael Abd-El-Malek, Gregory R. Ganger, Garth R. Goodson, Michael K. Reiter, and Jay J. Wylie. Fault-scalable Byzantine Fault-tolerant Services. SIGOPS Oper. Syst. Rev., 39(5):59–74, October 2005. doi:10.1145/1095809.1095817.
  • [2] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, Srinivasan Muralidharan, Chet Murthy, Binh Nguyen, Manish Sethi, Gari Singh, Keith Smith, Alessandro Sorniotti, Chrysoula Stathakopoulou, Marko Vukolić, Sharon Weed Cocco, and Jason Yellick. Hyperledger Fabric: A Distributed Operating System for Permissioned Blockchains. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, pages 30:1–30:15, New York, NY, USA, 2018. ACM. doi:10.1145/3190508.3190538.
  • [3] P. Aublin, S. B. Mokhtar, and V. Quéma. RBFT: Redundant Byzantine Fault Tolerance. In 2013 IEEE 33rd International Conference on Distributed Computing Systems, pages 297–306, July 2013. doi:10.1109/ICDCS.2013.53.
  • [4] A. Bessani, J. Sousa, and E. E. P. Alchieri. State Machine Replication for the Masses with BFT-SMART. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 355–362, June 2014. doi:10.1109/DSN.2014.43.
  • [5] Martin Biely, Ulrich Schmid, and Bettina Weiss. Synchronous consensus under hybrid process and link failures. Theoretical Computer Science, 412(40):5602–5630, 2011. Stabilization, Safety and Security. URL: http://www.sciencedirect.com/science/article/pii/S0304397510005359, doi:10.1016/j.tcs.2010.09.032.
  • [6] Gabriel Bracha and Sam Toueg. Asynchronous Consensus and Broadcast Protocols. J. ACM, 32(4):824–840, October 1985. doi:10.1145/4221.214134.
  • [7] Ethan Buchman, Jae Kwon, and Zarko Milosevic. The latest gossip on BFT consensus. CoRR, abs/1807.04938, 2018. URL: http://arxiv.org/abs/1807.04938, arXiv:1807.04938.
  • [8] Christian Cachin. Yet another visit to Paxos, April 2011.
  • [9] Carlos Carvalho, Daniel Porto, Luís Rodrigues, Manuel Bravo, and Alysson Bessani. Dynamic Adaptation of Byzantine Consensus Protocols. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, SAC ’18, pages 411–418, New York, NY, USA, 2018. ACM. doi:10.1145/3167132.3167179.
  • [10] Miguel Castro and Barbara Liskov. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst., 20(4):398–461, November 2002. doi:10.1145/571637.571640.
  • [11] Alírio Santos de Sá, Allan Edgard Silva Freitas, and Raimundo José de Araújo Macêdo. Adaptive Request Batching for Byzantine Replication. SIGOPS Oper. Syst. Rev., 47(1):35–42, January 2013. doi:10.1145/2433140.2433149.
  • [12] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the Presence of Partial Synchrony. J. ACM, 35(2):288–323, April 1988. doi:10.1145/42282.42283.
  • [13] N. Fathollahnejad, R. Barbosa, and J. Karlsson. A Probabilistic Analysis of a Leader Election Protocol for Virtual Traffic Lights. In 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pages 311–320, January 2017. doi:10.1109/PRDC.2017.56.
  • [14] N. Fathollahnejad, E. Villani, R. Pathan, R. Barbosa, and J. Karlsson. On reliability analysis of leader election protocols for virtual traffic lights. In 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W), pages 1–12, June 2013. doi:10.1109/DSNW.2013.6615529.
  • [15] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of Distributed Consensus with One Faulty Process. J. ACM, 32(2):374–382, April 1985. doi:10.1145/3149.214121.
  • [16] G. Golan Gueta, I. Abraham, S. Grossman, D. Malkhi, B. Pinkas, M. Reiter, D. Seredinschi, O. Tamir, and A. Tomescu. SBFT: A Scalable and Decentralized Trust Infrastructure. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 568–580, 2019. doi:10.1109/DSN.2019.00063.
  • [17] R. Halalai, T. A. Henzinger, and V. Singh. Quantitative Evaluation of BFT Protocols. In 2011 Eighth International Conference on Quantitative Evaluation of SysTems, pages 255–264, September 2011. doi:10.1109/QEST.2011.40.
  • [18] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: Resource-efficient Byzantine Fault Tolerance. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 295–308, New York, NY, USA, 2012. ACM. doi:10.1145/2168836.2168866.
  • [19] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: Speculative Byzantine Fault Tolerance. SIGOPS Oper. Syst. Rev., 41(6):45–58, October 2007. doi:10.1145/1323293.1294267.
  • [20] Barbara Liskov. From Viewstamped Replication to Byzantine Fault Tolerance. In Replication: Theory and Practice, number 5959 in Lecture Notes in Computer Science, 2010. URL: http://pmg.csail.mit.edu/pubs/liskov10__from_views_replic_byzan_fault_toler-abstract.html.
  • [21] Boon Thau Loo, Tyson Condie, Joseph M. Hellerstein, Petros Maniatis, Timothy Roscoe, and Ion Stoica. Implementing Declarative Overlays. In Proceedings of the Twentieth ACM Symposium on Operating Systems Principles, SOSP ’05, pages 75–90, New York, NY, USA, 2005. ACM. doi:10.1145/1095810.1095818.
  • [22] Andrew Miller, Yu Xia, Kyle Croman, Elaine Shi, and Dawn Song. The Honey Badger of BFT Protocols. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 31–42, New York, NY, USA, 2016. ACM. doi:10.1145/2976749.2978399.
  • [23] H. H. Nguyen, K. Palani, and D. M. Nicol. Extensions of Network Reliability Analysis. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 88–99, June 2019. doi:10.1109/DSN.2019.00023.
  • [24] The NS-2 Project. Ns-2, January 2016. URL: https://www.isi.edu/nsnam/ns/.
  • [25] Vincent Rahli, Ivana Vukotic, Marcus Völp, and Paulo Esteves-Verissimo. Velisarios: Byzantine Fault-Tolerant Protocols Powered by Coq. In Amal Ahmed, editor, Programming Languages and Systems, pages 619–650, Cham, 2018. Springer International Publishing. doi:10.1007/978-3-319-89884-1_22.
  • [26] Nicola Santoro and Peter Widmayer. Time is not a healer. In B. Monien and R. Cori, editors, STACS 89, pages 304–313, Berlin, Heidelberg, 1989. Springer Berlin Heidelberg. doi:10.1007/BFb0028994.
  • [27] U. Schmid. How to model link failures: a perception-based fault model. In 2001 International Conference on Dependable Systems and Networks, pages 57–66, July 2001. doi:10.1109/DSN.2001.941391.
  • [28] U. Schmid, B. Weiss, and I. Keidar. Impossibility Results and Lower Bounds for Consensus under Link Failures. SIAM Journal on Computing, 38(5):1912–1951, 2009. doi:10.1137/S009753970443999X.
  • [29] U. Schmid, B. Weiss, and J. Rushby. Formally verified Byzantine agreement in presence of link faults. In Proceedings 22nd International Conference on Distributed Computing Systems, pages 608–616, July 2002. doi:10.1109/ICDCS.2002.1022311.
  • [30] Atul Singh, Tathagata Das, Petros Maniatis, Peter Druschel, and Timothy Roscoe. BFT Protocols Under Fire. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI’08, pages 189–204, Berkeley, CA, USA, 2008. USENIX Association. URL: http://dl.acm.org/citation.cfm?id=1387589.1387603.
  • [31] J. Sousa, A. Bessani, and M. Vukolic. A Byzantine Fault-Tolerant Ordering Service for the Hyperledger Fabric Blockchain Platform. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 51–58, June 2018. doi:10.1109/DSN.2018.00018.
  • [32] H. Sukhwani, J. M. Martínez, X. Chang, K. S. Trivedi, and A. Rindos. Performance Modeling of PBFT Consensus Process for Permissioned Blockchain Network (Hyperledger Fabric). In 2017 IEEE 36th Symposium on Reliable Distributed Systems (SRDS), pages 253–255, September 2017. doi:10.1109/SRDS.2017.36.
  • [33] F. Thiel, Marko Esche, Federico Grasso Toro, Alexander Oppermann, Jan Wetzlich, and Daniel Peters. The European Metrology Cloud. In Proceedings of the 18th International Congress of Metrology, January 2017. doi:10.1051/metrology/201709001.
  • [34] W. Xu and R. Kapitza. RATCHETA: Memory-Bounded Hybrid Byzantine Consensus for Cooperative Embedded Systems. In 2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS), pages 103–112, October 2018. doi:10.1109/SRDS.2018.00021.

Appendix A Model adaptations for BFT-SMaRt

In order to accurately capture the communication pattern of BFT-SMaRt, two adaptations to the model, compared to PBFT, are necessary. Firstly, the collection of prepare messages is not optimized to count the primary’s pre-prepare as a prepare message. Secondly, in PBFT, a process can only successfully complete a consensus round if it has reached all three stages of the protocol, i.e., received a pre-prepare message and collected a quorum of both prepare and commit messages. In contrast, in BFT-SMaRt, a process can complete a consensus round if it has received a pre-prepare message and collected a quorum of commit messages. Since the collection of a quorum of commit messages implies that sufficiently many processes have collected a quorum of prepare messages, the process can commit safely.

Applying these differences to the model results in changes to two equations: the calculations in (7) and (10).
Second phase: The replicas send out prepare messages.

(15)
(16)

Third phase: The replicas send out commit messages.

(17)
(18)
(19)

As (17) is now conditioned on two previous phases, the calculation of the distribution has to be adapted with

(20)
(21)

Appendix B Model adaptations for Zyzzyva

Zyzzyva [19] uses a primary to order requests; afterwards, the client communicates directly with the replicas to collect either $3f + 1$ messages (fast path) or two quorums of $2f + 1$ messages (slow path). Due to the direct integration of the client into the communication pattern, the link and crash failure rates are assumed to apply to the client as well. The impact of crash failures, i.e., its calculation, is identical for all phases.
First phase: the primary broadcasts a prepare to all replicas, identical to PBFT.
Second phase: the replicas respond to the client (quorum of $3f + 1$ for the fast path or at least $2f + 1$ for the slow path).

(22)
(23)
(24)
(25)

Third phase: the client broadcasts to all replicas.

(26)
(27)

Fourth phase: the replicas respond to the client (quorum of $2f + 1$).

(28)
(29)

The success probabilities for the fast and slow path are obtained from the second and fourth phase, respectively.

Appendix C Model adaptations for SBFT

SBFT [16] combines the properties of optimistic execution, i.e., all replicas execute in a fast path, and redundancy, i.e., additional, optional replicas. In SBFT, the network consists of $n = 3f + 2c + 1$ replicas, with $c$ being the number of redundant replicas. The algorithm utilizes threshold signatures to reduce the message complexity for collecting the required quorum sizes ($3f + c + 1$ for the fast path and $2f + c + 1$ for the slow path). SBFT uses a primary to order requests. The impact of crash failures, i.e., its calculation, is identical for all phases.
First phase: the primary broadcasts a prepare to all replicas, identical to PBFT.
Second phase: the replicas respond to the collectors.

(30)
(31)
(32)
(33)

Third phase: the collectors broadcast to all replicas (fast and slow path identical).

(34)
(35)

Fourth phase: the replicas respond to the collectors (fast path end).

(36)
(37)
(38)
(39)

Fifth phase: the collectors broadcast to all replicas.

(40)
(41)

The conditional probability is calculated in the same manner as (21).
Sixth phase: the replicas respond to the collectors (slow path end).

(42)
(43)

Appendix D Convergence of "binomial quorums"

The probability of a replica to collect a quorum of messages, if each of the incoming messages has an independent probability to be omitted, can be formulated as

(44)

which, considering the complementary event, can be rewritten as

(45)

The binomial distribution can be approximated with the central limit theorem or, in this case, the De Moivre-Laplace theorem. The theorem states that a binomial distribution converges to a normal distribution as the number of trials grows to infinity. The normal distribution has the same mean and standard deviation as the binomial distribution.
The binomial distribution can thus be approximated as

(46)

This leads to a cumulative distribution function of

(47)

Applying this formula to (45) leads to

(48)

If the number of incoming messages grows to infinity, we get

(49)

Of interest here is the development of the numerator and denominator of this term. The behavior of the numerator for a growing number of incoming messages depends on the omission probability and the quorum size. Considering that the size of the quorum has to be a value between 1 and the number of incoming messages, we can express it as a fraction of that number, i.e., as the relative size of the quorum.

(50)

For the numerator, the following applies:

(51)

Meanwhile, the denominator will always converge towards infinity, albeit slower than the numerator

(52)

Since the error function has the properties

(53)
(54)
(55)

we can conclude that

(56)

with a relative quorum size of approximately $2/3$ for BFT.
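For orientation, the line of argument of this appendix can be summarized in the following sketch. The notation is an assumption made here and may differ from the numbered equations above: $n$ incoming messages, omission probability $p$, quorum size $q = \alpha n$ with relative quorum size $\alpha$, and $X$ the number of messages received.

```latex
\begin{align*}
P(X \ge q) &= \sum_{k=q}^{n} \binom{n}{k} (1-p)^{k} p^{\,n-k}
            = 1 - \sum_{k=0}^{q-1} \binom{n}{k} (1-p)^{k} p^{\,n-k} \\
           &\approx 1 - \frac{1}{2}\left[ 1 + \operatorname{erf}\!\left(
              \frac{q - n(1-p)}{\sqrt{2\,n\,p\,(1-p)}} \right) \right]
              && \text{(normal approximation)} \\
\frac{q - n(1-p)}{\sqrt{2\,n\,p\,(1-p)}}
           &= \frac{\bigl(\alpha - (1-p)\bigr)\sqrt{n}}{\sqrt{2\,p\,(1-p)}}
              && \text{(with } q = \alpha n \text{)} \\
\lim_{n \to \infty} P(X \ge q) &=
  \begin{cases}
    1           & \text{if } \alpha < 1 - p, \\
    \tfrac{1}{2} & \text{if } \alpha = 1 - p, \\
    0           & \text{if } \alpha > 1 - p.
  \end{cases}
\end{align*}
```

The numerator grows linearly in $n$ and the denominator only with $\sqrt{n}$, so the argument of the error function diverges with the sign of $\alpha - (1-p)$; the three limits then follow from $\operatorname{erf}(\pm\infty) = \pm 1$ and $\operatorname{erf}(0) = 0$.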