Twins: White-Glove Approach for BFT Testing

by Shehar Bano, et al.

Byzantine Fault Tolerant (BFT) systems have seen extensive study for more than two decades, yet we lack a principled strategy for testing BFT implementations. This paper presents Twins, a new approach for testing BFT systems. The main idea of Twins is that we can emulate Byzantine behavior by running two (or generally up to k) instances of a node with the same identity. Each of the two instances (or Twins) runs unmodified, correct code. The Twins approach requires only a thin network wrapper that delivers messages to/from both Twins. To the rest of the system, the Twins appear indistinguishable from a single node behaving in a ‘questionable’ manner. Twins generates ‘interesting’ Byzantine behaviors, including equivocation, double voting, and losing internal state, while forgoing ‘uninteresting’ behaviors that are trivially rejected by honest nodes, such as producing semantically invalid messages. Building on this idea, Twins can systematically generate Byzantine attack scenarios at scale, execute them in a controlled manner, and check for desired protocol properties. The paper demonstrates that Twins successfully reinstates several famous attacks on BFT protocols. In all cases, protocols break within fewer than a dozen protocol steps, hence it is realistic for the Twins approach to expose the problems. In two of these attacks, it took the community more than a decade to discover protocol flaws that Twins would have surfaced within minutes. Additionally, Twins testing was successfully incorporated into a production setting in which Twins executed 3M Twins-generated scenarios, and exposed (self-injected) subtle safety bugs within minutes of testing.



1 Introduction

Traditionally in the area of security, defenses are tested by evaluating their resilience against relevant attacks. This is, however, not the case for Byzantine Fault Tolerant (BFT) protocols introduced in the seminal work of Lamport et al. [lps82]: (i) Byzantine behavior is unconstrained, hence one can only implement a subset of such behaviors; and (ii) the subset of Byzantine behaviors to be tested is chosen by system developers, who are naturally biased by having designed the system with certain limited Byzantine behaviors in mind. Similar challenges arise when testing BFT protocols via formal specification and verification methods. Here, too, branching over Byzantine (arbitrary) behavior is unconstrained, leading to state explosion when modeling and model checking. Last, as a pragmatic consideration, developing test code that implements Byzantine attacks might be risky.

We propose Twins, a new approach for systematically testing BFT systems. Instead of coding incorrect behavior, Twins runs faulty nodes in two (or generally, k) parallel universes in tandem. Both instances have the same credentials/signing-keys and run autonomously. Thus, for example, both nodes can send messages in the same protocol round, but these messages will carry conflicting information; to the rest of the system, this twins behavior will appear indistinguishable from an equivocating behavior by a single node. In another example, a node may ‘vote’ for something and its twin will ‘forget’ and contradict the vote; again, to the rest of the system, this will appear indistinguishable from a single node behaving in a ‘questionable’ manner.

Twins is based on the insight that most interesting Byzantine attacks can rely on a correct implementation of the protocol, such that a Byzantine node appears to be honest. Therefore, Twins never produces a message that the correct implementation would not produce. This eliminates trivial issues such as semantically invalid messages as well as more complex scenarios, e.g., sending a message without justification. Thus, leveraging existing code, Twins can automatically cover most material Byzantine behaviors.

Indeed, in Section 3, we demonstrate that several famous attacks on BFT protocols are reinstated in the Twins approach. Furthermore, in all cases, protocols break within fewer than a dozen protocol steps, hence it is realistic for the Twins approach to expose the problem. In two of these attacks, it took the community more than a decade to discover protocol flaws that Twins would have surfaced within minutes. Additionally, we have built Twins testing into a production setting in which Twins exposed (self-injected) subtle safety bugs within minutes of testing.

We refer to Twins as a “white glove” approach: It is neither “black-box”, since it does modify the internal behavior of the tested system, nor is it “white-box”, because it does not open internal code modules. Twins minimally interacts with existing code to control message delivery and schedule various coarse steps such as protocol rounds. Most importantly, Twins is practical to deploy in real systems as it uses existing correct node code. Twins can be implemented by thinly wrapping twin nodes with a network-scheduler acting as an adversary, easily keeping up with an evolving software project.

Twins testing for BFT replication. Our work on Twins arises in the context of BFT replication protocols. In this domain, several worrisome safety and liveness vulnerabilities were exposed recently [abraham2018revisiting, momose2020force] in both known protocols [fab, zyzzyva] and in new ones [synchs].

One reason that BFT replication is particularly suitable for Twins testing is as follows. A common paradigm underlying practical BFT replication protocols is a view-by-view design. Each view is driven by a designated leader proposing to the nodes and going through voting rounds by the nodes. If a leader is successful, a consensus decision is reached in the view. If not, nodes give up after a timeout and move to the next view. Transitioning to the new view/leader is tricky: A new leader must discover if the previous leader was successful, but it may be able to communicate only with a subset of the nodes. The transition logic turns out to be the source of problems in all the above cases, hence exposing the flaw requires only one or two leader rotations.

Twins implementation. We built a unit-testing apparatus based on the Twins approach in the LibraBFT open-source project, the BFT replication core of the Libra payment system [librabft]. To gain trust in safety-critical blockchains, protocols like LibraBFT require much scrutiny.

Implementing Twins in LibraBFT consists of two principal parts. The first is a test executor that deploys a network configuration where some nodes have twins. The test executor hides twins behind a thin multiplexing wrapper; to the rest of the system, each pair of twins appears as a single entity. The test executor controls the scheduling of message deliveries according to a prescribed scenario. This is accomplished through a transport emulator in the LibraBFT repository called Network Playground.
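To make the multiplexing idea concrete, the following is a minimal sketch of such a wrapper. It is not the Network Playground API: the class `TwinsNetwork`, its methods, and the instance labels are hypothetical names chosen for illustration. Messages are addressed to an identity, and the wrapper delivers them to whichever instances behind that identity share the sender's partition.

```python
from collections import defaultdict


class TwinsNetwork:
    """Hypothetical in-memory transport: routes by identity, filters by partition."""

    def __init__(self):
        self.instances = defaultdict(list)  # identity -> instances sharing that identity
        self.identity_of = {}               # instance -> identity
        self.partition_of = {}              # instance -> index of its current partition
        self.inboxes = defaultdict(list)    # instance -> list of (sender identity, payload)

    def register(self, identity, instance):
        self.instances[identity].append(instance)
        self.identity_of[instance] = identity

    def set_partitions(self, partitions):
        """partitions: iterable of sets of instances that may talk to each other."""
        self.partition_of = {
            inst: idx for idx, group in enumerate(partitions) for inst in group
        }

    def send(self, sender_instance, to_identity, payload):
        """Deliver to every instance behind `to_identity` in the sender's partition."""
        delivered = []
        src = self.partition_of.get(sender_instance)
        for inst in self.instances[to_identity]:
            if src is not None and self.partition_of.get(inst) == src:
                # Receivers only ever see the sender's identity, never which twin sent.
                self.inboxes[inst].append((self.identity_of[sender_instance], payload))
                delivered.append(inst)
        return delivered


net = TwinsNetwork()
for identity, inst in [("A", "A"), ("A", "A'"), ("B", "B"), ("C", "C")]:
    net.register(identity, inst)
net.set_partitions([{"A", "B"}, {"A'", "C"}])

reached_from_b = net.send("B", "A", "vote")     # only twin A shares B's partition
reached_from_c = net.send("C", "A", "vote")     # only twin A' shares C's partition
reached_from_twin = net.send("A'", "C", "proposal")
```

In this sketch, B and C both believe they are talking to the single node A, while each reaches a different twin.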

The second part is a test generator. The test generator enumerates scenarios by varying the number of nodes and the message delivery schedule, then feeding the scenarios to the test executor. We describe in the paper several strategies for drastically reducing the number of scenarios through aggressive trimming of symmetrical scenarios. Among these strategies, one minimally ‘opens’ the LibraBFT implementation and lets the test executor determine when a node acts as a leader in the consensus protocol. This removes duplicate scenarios that differ only in their leaders.

Section 7 reports on our experience with the Twins tester in LibraBFT.

Coverage. What attacks does the Twins approach capture? Developing a rigorous theory that answers this question is an intriguing problem left for future work. Indeed, test coverage has not seen much study even in the relatively more scrutinized area of crash faults [Niksic2019]. Here, we provide anecdotal evidence of coverage in three forms:

(i) We bring intuition and experience of several decades of work in the field. There are only a handful of ways in which a Byzantine attacker can materially deviate from the safety rules imposed by its protocol. For example, it can equivocate and send different proposals to different groups of recipients, or it can pretend it did not send/receive a message and propose or vote in a manner that conflicts with such a message. In Section 2, we provide insight on high-level attack behaviors that Twins emulates.

(ii) Our real-life implementation provides compelling evidence of coverage. In our evaluation in Section 7, we created a simple safety-violating setting by deploying (instead of ) validators with Twins. This led to a consistency violation. We further injected three subtle logic errors into LibraBFT, each of which deviated only slightly from the original spec (similar to mutation testing, a well-known technique to evaluate the effectiveness of existing tests). In all three cases, with only twins (faults), Twins successfully exposed safety violations.

(iii) We demonstrate in Section 3 that several known attacks on BFT consensus protocols are reinstated by the Twins approach. These attacks cover a broad spectrum of vulnerabilities, e.g., safety, liveness, timing, and responsiveness.

Limitations & Scope. We recognize that (setting aside trivial, easily filtered formatting violations) some Byzantine behaviors are not covered by Twins.

For one, in some protocol steps, a node may wait for messages to determine its next action. Under Twins, the node is forced to act according to the messages it received, as if the node provided a justification for each step in form of the history of messages it received. Deviating from this behavior was not required to reinstate any of the attacks discussed in Section 3, though in principle, various deviating behaviors would not be covered by Twins. Another coverage challenge emerges in synchronous protocols because a node behavior may be based on real time. In such protocols, Twins essentially forces a node to behave in a timely manner. We tackle this case in one of the attacks investigated in Section 3 and demonstrate that nonetheless, a slight adaptation of the original attack reinstates the attack in Twins. However, we do not know yet which timing attacks may not be covered.

Additional challenges stem from pragmatic needs. In particular, while the Twins approach allows creating up to k instances of a node, throughout this paper we set k = 2. The Byzantine behavior emulated by Twins involves equivocation or duplication, which needs only two messages to diverge or match. This choice helps us cover interesting cases, while pruning potentially redundant testcases due to higher values of k. It is not yet known which safety violations can only be covered by k > 2.

Finally, this paper explores the use of Twins to expose safety and liveness violations—the principal properties of BFT systems. There are other properties of BFT systems such as transaction inclusion and fairness [fair-ordering, censorship] and performance bounds [prime], that Twins testing can be extended to verify.

Increasing coverage of Twins in the settings we explore as well as others, and providing a formal treatment of coverage remain interesting open challenges; we discuss some concrete future directions in Section 9.

2 Motivating the Twins Approach

We open this section with a quick primer on the Byzantine Fault Tolerant (BFT) replication problem, and describe the notation that will be used to describe attacks through the rest of this paper. We then provide high-level intuition on why Twins is a viable approach by showing the different kinds of Byzantine behaviors that can be captured by Twins. (Concrete attacks using Twins are described later in Section 3 and Section 7.1.)

BFT Replication. The goal of BFT replication is for a group of nodes to provide a fault-tolerant service through redundancy. Clients submit requests to the service. These requests are collectively sequenced by the nodes; this enables all nodes to execute the same chain of requests and hence agree on their (deterministic) output.

Except when specifically noted, we consider protocols that maintain safety against arbitrary delays in message transmissions. That is, we assume an asynchronous network setting. The main challenge is to drive agreement on a chain of requests (and their output) among all non-faulty nodes despite node failures. It is common to rely on leaders to populate the network with a unique proposal. During periods in which the leader is non-faulty and communication among the leader and non-faulty nodes is timely, this regime can drive consensus quickly. This approach is called partial synchrony, indicating that it maintains safety at all times and progress only during periods of synchrony.

In the Byzantine fault model, a node may crash or arbitrarily deviate from the protocol. In this setting, a BFT replication system implements a fault tolerant service via nodes, of which a threshold may be Byzantine. As Byzantine behavior is defined rather vaguely, there is no principled way to evaluate BFT systems. Twins is a new approach to systematically test the safety and liveness of BFT systems. The main idea of Twins is the following: running two (generally, up to k) autonomous instances of a node that both use correct code and share the same identity allows us to emulate most interesting Byzantine attacks. Two nodes share the same identity when they share the same credentials and signing keys.

Notation. Nodes are represented by capital letters (e.g., ) and the twin of a node is represented by the same letter with the prime symbol (e.g., ). When referring to a set of nodes, we enclose them in parentheses e.g., . We underline a node that is serving as the leader, e.g., . The adversary can delay and filter messages between nodes. We denote partitions of nodes by enclosing them in braces, e.g., and , and reserve the capital letter to denote them. Additionally, to show messages allowed in a given direction, we use the symbols and . For example, means can send messages to and ; similarly, means can send messages to and receive messages from any node of the partition . The scenarios described below use a network configuration of 7 nodes, . Byzantine nodes have twins denoted with , as in , . To experiment with any of the deviating behaviors described below, one can increase the number of Byzantine faults to (say have twins ) and expect to see conflicting commits.

Equivocation. A quintessential Byzantine behavior is for a node to equivocate. That is, in the same step, a Byzantine node might send different messages to different recipients.

Twins covers equivocation by splitting honest nodes between two partitions, each one communicating with only one twin of each pair. For example, we can split the system into . The leader(s) and execute correct leader code but nevertheless may generate conflicting proposals due to different inputs or randomness seeds. If there is a protocol flaw then these conflicting proposals could respectively commit in and , hence safety breaks.
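As a small illustration of why twins running identical code still equivocate, the sketch below has two instances call the same `propose` routine with different inputs under one shared key. All names here are hypothetical, and an HMAC tag stands in for the digital signature a real implementation would use. Each proposal verifies on its own; only an observer that sees both can tell that the identity equivocated.

```python
import hashlib
import hmac

SHARED_KEY = b"shared-signing-key"  # hypothetical key held by both twins


def propose(round_number, payload, key):
    """'Correct' leader code: build a proposal and authenticate it."""
    body = f"{round_number}:{payload}".encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"round": round_number, "payload": payload, "tag": tag}


def verify(proposal, key):
    """Check that the proposal was authenticated under `key`."""
    body = f"{proposal['round']}:{proposal['payload']}".encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proposal["tag"])


def is_equivocation(p1, p2, key):
    """Two valid proposals, same identity and round, different content."""
    return (
        verify(p1, key)
        and verify(p2, key)
        and p1["round"] == p2["round"]
        and p1["payload"] != p2["payload"]
    )


# Each twin runs the same code but was initialized with a different input.
from_twin = propose(1, "block-x", SHARED_KEY)
from_other_twin = propose(1, "block-y", SHARED_KEY)
conflict = is_equivocation(from_twin, from_other_twin, SHARED_KEY)
```

Because the network wrapper keeps the two partitions apart, honest nodes in each partition see only one of the two proposals and have no reason to reject it.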

Amnesia. An important role that nodes have in agreement protocols is to vote for a single proposal per view. However, a Byzantine node might vote for a proposal and then ‘forget’ that it has voted and vote again. Twins covers amnesia by letting one of the twins vote on one proposal. Since the other twin is oblivious to the vote happening, it may nevertheless vote on a different proposal, even though it executes correct code.

More concretely, as in the scenario above, we can split the nodes into two partitions, . If there is a protocol flaw then this double-voting behavior may result in conflicting commits in and , hence safety breaks.

Losing internal states. Another notable deviation for Byzantine nodes is to lose their internal state, particularly a lock that guards a value they voted for. Twins covers this deviation by letting one of the twins get locked on a value in one view and, in some subsequent view, bringing in the other twin, which is unaware that a lock exists.

More concretely, we can split the nodes into two partitions . In one view, the adversary relays messages only among . In the next view, it switches to , causing —albeit executing correct code—to ignore their ‘previous’ actions. This can repeat any number of times. If there is a protocol flaw then conflicting proposals may commit in different views, hence safety breaks.

3 Known Attacks

In this section, we demonstrate several attacks on BFT replication protocols with known vulnerabilities, expressed as Twins scenarios. We provide insight into the attacks and defer the details of all but the linear leader-replacement attack to an appendix, due to space constraints.

3.1 Reinstated Attacks

We present several known attacks on BFT protocols, expressed as Twins scenarios. In all cases, exposing vulnerabilities requires only a small number of nodes, partitions, rounds and leader rotations. It is worth noting that our later evaluation (Section 7) of LibTwins, the Twins implementation for LibraBFT, shows that running an automated scenario generator (Section 4.2) with these configurations would cover the described attacks within minutes. (Footnote 1: We did not undertake to re-implement all these protocols and apply a Twins test generator to them; our implementation covers only LibraBFT [librabft].)

Safety attack on Zyzzyva. Zyzzyva broke new ground in BFT replication with the introduction of an optimistic single phase “fast track” commit. Eleven years elapsed from its publication until a safety flaw in Zyzzyva was discovered [abraham2018revisiting], during which numerous research projects and systems were built on it. Twins generates a scenario that exposes the flaw with 4 nodes and two leader rotations: the first leader equivocates via a twin, and the next two leaders drop messages to/from some nodes. The details of this attack using Twins are described in Appendix B.

Liveness attack on FaB. FaB [fab], a precursor to Zyzzyva, is a view-based protocol with an optimistic fast track. Not surprisingly, a similar problem arises in FaB due to a flawed leader replacement protocol [abraham2018revisiting], albeit manifesting as a liveness bug. Twins exposes this bug in a short scenario with and three leader rotations, leading to a complete absence of leader proposals. The detailed attack using Twins is described in Appendix C.

Timing attack on Sync HotStuff. Force-Locking Attack [momose2020force] is a timing attack on a preliminary version of a synchronous BFT protocol named Sync HotStuff [synchs] (which was subsequently updated to resist the attack). As before, Twins captures this attack with only a small system size, , and two leader rotations. However, in order to create timing attacks, Twins needs to be aware of timing information for protocol steps and message deliveries. Extending Twins with timing data is left for future work. In the specific attack at hand, coarse-grained timing at fixed intervals (fewer than ten) suffices to reinstate the attack. The detailed attack using Twins is described in Appendix D.

Non-Responsiveness attack on linear leader-replacement. Practical Byzantine Fault Tolerance (PBFT) [pbft] is a seminal work that was designed to work efficiently in the asynchronous setting. Carrying the classical PBFT solution to the blockchain world, Tendermint [tendermint-2018] and Casper [casper] introduced a simplified linear strategy for leader-replacement. However, it has been observed [tendermint-2016, hotstuff-2018] that this strategy forgoes an important property of asynchronous protocols, Responsiveness: the ability of a leader to advance as soon as it receives messages from nodes. (Footnote 2: Tendermint is a precursor to HotStuff [hotstuff-2019] and LibraBFT [librabft] which operates in two-phase views, but has no Responsiveness. HotStuff/LibraBFT solve this by adding a third phase.) Indeed, bringing the linear leader-replacement approach into PBFT, we demonstrate a liveness attack using a Twins scenario. Lack of progress is detected by observing that two consecutive views with honest leaders whose communication with a quorum is timely do not produce a decision. We present the details of this attack using Twins in the next section.

3.2 Non-Responsiveness Attack

We now describe in more detail the non-Responsiveness attack above on linear leader-replacement. The seminal PBFT solution operates in two-phase views. A simplified, linear leader-replacement works as follows. A leader proposes to extend the highest quorum certificate (QC) it knows. A QC is formed on a proposed value if it gathers votes from nodes. Nodes vote on the leader proposal if it extends the highest QC they know. A commit decision on the leader proposal forms if nodes form a QC, and then nodes vote for the QC. Progress hinges on leaders obtaining the highest QC from the system; otherwise liveness is broken.
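The QC-formation step can be sketched as follows. The vote threshold is elided in the text above, so this sketch assumes the conventional quorum of 2f + 1 votes out of n = 3f + 1 nodes; that threshold, and the names `quorum_size` and `form_qc`, are assumptions made for illustration. Votes are counted per identity, so a node and its twin voting for the same value contribute a single vote.

```python
def quorum_size(n):
    """ASSUMED threshold: 2f + 1 votes, where f = (n - 1) // 3."""
    f = (n - 1) // 3
    return 2 * f + 1


def form_qc(value, votes, n):
    """Return a QC for `value` if enough distinct identities voted for it.

    votes: iterable of (identity, voted_value) pairs.
    Returns a dict describing the QC, or None if the quorum is not reached.
    """
    voters = {identity for identity, voted in votes if voted == value}
    if len(voters) >= quorum_size(n):
        return {"value": value, "voters": frozenset(voters)}
    return None


# Four identities; D appears twice because both twins voted for "x".
votes_for_x = [("A", "x"), ("B", "x"), ("D", "x"), ("D", "x")]
votes_for_y = [("C", "y"), ("D", "y")]

qc_x = form_qc("x", votes_for_x, n=4)   # three distinct voters reach the quorum
qc_y = form_qc("y", votes_for_y, n=4)   # two voters do not
```

Under this assumption, a twin can help one proposal reach a QC while its sibling votes elsewhere, which is the situation the attack below exploits.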

Using the notation from Section 2, the liveness attack here uses 4 replicas , where has a twin . In the first view, and generate equivocating proposals. Only receive a QC for ’s proposal. The next leader is who re-proposes the proposal by , which and do not vote for because they already have a QC for that height. Only and receive a QC for ’s proposal. This scenario repeats indefinitely, resulting in loss of liveness. More specifically, this attack works as follows:

View 1:

Initialize and with different inputs and .

  • Create the partitions , .

  • Let and run as leaders for one round. proposes to and gathers votes from creating (). proposes to and gathers votes but not a QC.

  • Create the following partitions: , , . broadcasts (), which only reaches i.e., .

View 2:

Drop all proposals from and until View 2 starts.

  • Remove all partitions, i.e., .

  • Let run as leader for one round. re-proposes (i.e., ’s proposal in the previous round) to . do not vote as they already have () for that height. gathers votes from the other nodes and forms ().

  • Create partitions , , .

  • broadcasts (), which only reaches .

View 3:

Drop all proposals from until View 3 starts.

  • Create the partitions , .

  • Let run as leader for one round. proposes which extends the highest QC it knows, (). As before, manages to form (), but as a result of a partition, the QC will only reach . Next, there is a view-change, is the new leader, and there are no partitions. proposes which extends (), the highest QC it knows. However, do not vote because does not extend their highest QC i.e., (). This scenario can repeat indefinitely, resulting in the loss of liveness.

4 Systematic Unit-Test Generation

Figure 1: Twins high-level design.

Whereas the previous section demonstrated manually crafted Twins attack scenarios, this section presents a strategy for systematically generating Twins scenarios.

Systematically and efficiently generating unit-test scenarios that provide good coverage requires tailoring to the specific BFT protocol settings. The Twins tester generates and executes scenarios, or testcases, which describe the node and network configurations. Specifically, the Twins tester is comprised of two components as shown in Figure 1: (i) the test executor, and (ii) the test generator. The test executor runs a single testcase and generates output logs, while the test generator produces various testcases that are fed to the test executor to check for violations. The following design goals underlie the Twins tester:

  • Generic & Modular. Twins is modular with respect to the particular BFT protocol implementation being tested, imposes as little complexity as possible on development, and easily keeps up with code changes.

  • Parametrizable. The test network setup (i.e., the number of nodes, leaders per round, and network configuration per round) and adversarial assumptions (i.e., how many Byzantine faults are tolerated) are configurable.

  • Feasible. Twins allows pruning duplicate scenarios in order to provide coverage of material attacks.

  • Customizable Coverage. The coverage of testcases, i.e., the subset of all possible testcases chosen for test execution, is configurable by randomly sampling the tests to run from the full enumeration.

  • Reproducible. Twins writes logs to persistent storage, containing sufficient information to detect and reproduce any safety violations.

Next, we describe the two main components (Figure 1) of Twins—the test executor and the test generator—in detail.

4.1 Test Executor

In every Twins test, a threshold of the nodes are ‘misconfigured’ to have a twin instance with identical transport endpoint credential and secret keys. The Twins test executor gets as input a testcase consisting of a node-set, a subset of which are marked compromised (representing Byzantine nodes); and a round-by-round message delivery schedule. The test executor sets up a network of nodes with a given number of compromised nodes and per round partitions and leaders. The compromised nodes correspond to the nodes for which the test executor creates twins (i.e., identical instances with the same credentials and signing keys), thereby emulating misbehavior.

As mentioned above, we address BFT replication protocols that proceed in rounds initiated by a designated leader, each round representing a state transition in the protocol’s state machine replicated on each node. For each round, the test executor creates a given network partition and assigns given leaders to the round. The test executor runs the BFT protocol among nodes for a pre-specified number of rounds, at the end of which, the test executor checks for violations. Specifically, protocol guarantees can be violated in two principal ways, safety and liveness. A safety violation is detected if two nodes commit to conflicting decisions. A liveness violation can be detected if the protocol fails to commit within a certain number of steps or a certain duration bound.
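The two end-of-run checks can be sketched as below, assuming the executor's logs yield, for each node, the ordered list of block identifiers it committed. The function names and the log format are hypothetical. Two nodes are treated as conflicting when neither commit sequence is a prefix of the other, and progress is judged by whether any node reached a minimum number of commits within the run.

```python
def is_prefix(a, b):
    """True if sequence `a` is a prefix of sequence `b`."""
    return len(a) <= len(b) and b[:len(a)] == a


def find_safety_violations(commit_logs):
    """Return pairs of nodes whose committed sequences conflict.

    commit_logs: dict mapping node name -> list of committed block ids, in order.
    """
    nodes = sorted(commit_logs)
    conflicts = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            a, b = commit_logs[u], commit_logs[v]
            if not (is_prefix(a, b) or is_prefix(b, a)):
                conflicts.append((u, v))
    return conflicts


def made_progress(commit_logs, min_commits):
    """Liveness proxy: some node committed at least `min_commits` blocks."""
    return any(len(log) >= min_commits for log in commit_logs.values())


logs = {
    "A": ["b1", "b2"],
    "B": ["b1"],          # merely lagging behind A, not conflicting
    "C": ["b1", "b3"],    # diverges from A after b1
}
violations = find_safety_violations(logs)
```

A lagging node is not flagged, since its log is a prefix of the others; only genuinely diverging commits are reported.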

4.2 Test Generator

We build a test generator of round-by-round scenarios: for each round, the test generator enumerates possible leaders and message delivery schedules among nodes. The test generator produces various testcases to be fed into the test executor. Each testcase represents a unique instance of executor configuration parameters, i.e., the compromised nodes and per round network partitions and leaders. Testcases are generated systematically as follows (see notations in Section 2):

  • Step 1. The test generator first produces the set of all possible partitions of nodes (called partition scenarios). For example, for a network of 4 nodes (), possible partition scenarios () include , and . This problem relates to the Stirling Number of the Second Kind [stirling] which enumerates the ways in which a set of nodes can be divided up into non-empty partitions, where ranges from (i.e., each node is self-isolated) to (i.e., fully connected network without partitions).

  • Step 2. Next the test generator assigns each partition scenario to all possible leaders i.e., the set of nodes assuming any of those can be a potential leader. For example, for the example partition scenario above for a network of nodes , possible leader-partition combinations include , , , . Each leader-partition combination fully describes the Twins configuration required for each round.

  • Step 3. The test generator lists testcases by enumerating all possible ways in which the leader-partition pairs generated in the previous step can be arranged over rounds (i.e., permutation, with or without replacement).

The test generator iterates over the generated testcases linearly, and invokes the test executor for each testcase. For safety tests, usually a small number of rounds () suffices to expose logical bugs in the protocol. Test generators therefore need to enumerate a reasonable number of combinations.

Pruning testcases. Important to the success of the approach is for the test generator to avoid duplicate testcases (e.g., in symmetry or node label rotation) and generate only materially different scenarios. (Footnote 3: Nodes can have designated roles in the protocol, referred to as node labels. Twins incorporates the label ‘leader’, which is the case for standard BFT protocols. Extensions of these protocols might have further hierarchy e.g., primary and secondary leaders. This is currently not supported, but the test generator can be easily extended to support different node labels.) The implementation we describe in the Evaluation section of this paper (Section 7) employs such pruning aggressively. Certain heuristics further substantially reduce the number of scenario configurations. For example, in most safety violations the set of honest parties is split into two, hence it suffices to play with two or three partitions per round. These optimizations make it feasible to cover a broad range of meaningful testcases. For liveness tests, many testcases will obviously fail to make progress because there does not exist a super-majority quorum that has reliable and timely communication among its members. Hence, for liveness testing the test generator must guarantee that eventually such a quorum exists.
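Two of the heuristics just described can be sketched as simple filters over partition scenarios. This is a hedged illustration, not the pruning code used in the evaluation: the function names are hypothetical, and the quorum of 3 identities out of 4 is an assumed value for the example.

```python
def prune_by_partition_count(scenarios, max_partitions=3):
    """Keep scenarios that split the network into at most `max_partitions` groups."""
    return [p for p in scenarios if len(p) <= max_partitions]


def has_quorum_partition(scenario, identity_of, quorum):
    """True if some partition holds instances of at least `quorum` distinct identities.

    identity_of maps an instance label to its identity, so a node and its
    twin in the same partition count once.
    """
    return any(
        len({identity_of[inst] for inst in group}) >= quorum
        for group in scenario
    )


# Instances A, B, C, D plus a twin D' that shares D's identity.
identity_of = {"A": "A", "B": "B", "C": "C", "D": "D", "D'": "D"}
scenarios = [
    [["A", "B", "D"], ["C", "D'"]],
    [["A"], ["B"], ["C"], ["D"], ["D'"]],
    [["A", "D", "D'"], ["B", "C"]],
]

kept = prune_by_partition_count(scenarios, max_partitions=3)
live_candidates = [
    p for p in kept if has_quorum_partition(p, identity_of, quorum=3)  # assumed quorum
]
```

A liveness test would then need at least one round drawn from `live_candidates`, so that a quorum able to communicate eventually exists.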

Message delays and timeouts. We note that the test generator does not address message delays and timeouts, only the dropping of messages and their relative delivery order. Because the BFT protocol may employ timers, dropping messages implies that the relevant endpoints incur a violation of presumed bounds on transmission delays. Future work may incorporate explicit message delays into the test generator to test for specific timing violations and also to test BFT protocols in the synchronous model (Section 9).

5 Overview of LibraBFT

We now shift our attention to utilizing Twins for testing BFT replication in LibraBFT [librabft]. We discuss our implementation and evaluation of Twins for LibraBFT in Sections 6 and 7. In this section, we provide an overview of LibraBFT (for details, see the technical report [librabft-techpaper]).

LibraBFT operates in a round-by-round manner, electing leaders in each round among the nodes to balance node participation. Rounds are slightly different from conventional “views” because it takes multiple rounds to reach a decision, but leaders are rotated in each round. The leader protocol is quite simple. A leader proposes an extension to the longest chain of requests that it knows already. Usually leaders collect batches of requests to propose, referred to as blocks, hence the LibraBFT protocol forms a chain of blocks (or a blockchain). Nodes vote for a proposed block, unless it conflicts with a longer chain that they believe may have reached consensus already. Nodes send their votes to the next leader to help the leader learn the longest safe chain. If there are three consecutive blocks in the chain, , , , which are proposed in consecutive rounds, , , , and each block has votes from nodes (gathered in a data structure called the quorum certificate, or QC), then the protocol has reached consensus on block .

If a quorum of nodes send their votes to the next leader in a timely manner, the leader forms a QC and sends the next proposal. Nodes maintain a timer to track progress. When the timer expires and a node still has not received a proposal, it broadcasts a timeout vote on a Nil block. When a node gathers enough timeout votes to form a timeout certificate, it advances its round. Every time a round fails, timeout periods are increased, allowing lagging nodes to catch up and enabling the protocol to eventually reach a decision.

As briefly alluded to in the Introduction, the trickiest part of BFT replication is managing leader transitions. LibraBFT maintains five parameters to ensure safety while facilitating progress: (i) the node’s current round; (ii) the last round for which the node voted; (iii) the round of the block certified by the QC attached to the block being processed; (iv) the parent of the block certified by the QC; and (v) the highest known grandparent round. Note that because a QC serves as a pointer to the previous certified block, (iii) and (iv) do not need to be explicitly tracked; they can be derived from the QC carried by a block.

Upon Receiving a Proposal.

Upon receiving proposal for a block, a node processes the certificates it carries, and votes for the proposed block if it satisfies a simple voting rule: If a node voted for , it prefers the sub-tree of proposals rooted at block (regardless of round numbers). A node will not vote for a block that does not belong to its preferred sub-tree rooted at , unless ’s parent has votes from nodes at a higher round than . Concretely:

  • Safety Rule 1. The round of the proposed block is greater than the last round for which the node voted.

  • Safety Rule 2. The block’s is greater than or equal to .

If the node decides to vote for the proposed block, it updates its state as follows:

  • Update Rule 1. Update the node’s last voted round to the round of the proposed block.

  • Update Rule 2. Update the node’s to the proposed block’s if the latter is higher.

  • Update Rule 3. Update the node’s to the , if the latter is higher.

Upon Receiving a Vote.

For every round, the nodes send their votes to the leader of the next round. When the leader receives a vote, it performs the following safety checks:

  • Safety Rule 3. If a vote from the same node was previously received for the same block and round, the leader rejects the vote and generates a ‘duplicate vote’ warning.

  • Safety Rule 4. If a vote from the same node was previously received for a different block but same round, the leader rejects the vote and generates an ‘equivocating vote’ warning.

If a vote passes both these checks, the leader considers it as valid and checks if it has enough votes to form a QC. When a QC has been formed, the leader generates a new round event, broadcasts a new block proposal and updates its state.

  • Update rule 4. When a leader gathers enough votes to form a QC, it broadcasts a new proposal and increments .
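The leader-side checks in Safety Rules 3 and 4 can be illustrated with a small sketch. The `VoteCollector` type, its fields and the `VoteOutcome` enum are hypothetical names introduced here, and the quorum size is passed in as a plain number rather than derived from the protocol parameters.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum VoteOutcome {
    Accepted,
    Duplicate,    // Safety Rule 3: same voter, same round, same block
    Equivocating, // Safety Rule 4: same voter, same round, different block
    QuorumFormed, // enough accepted votes to form a QC
}

struct VoteCollector {
    quorum: usize,
    // (round, voter identity) -> block id the voter voted for
    seen: HashMap<(u64, u32), u64>,
    // (round, block id) -> number of accepted votes
    tally: HashMap<(u64, u64), usize>,
}

impl VoteCollector {
    fn new(quorum: usize) -> Self {
        VoteCollector { quorum, seen: HashMap::new(), tally: HashMap::new() }
    }

    fn on_vote(&mut self, voter: u32, round: u64, block: u64) -> VoteOutcome {
        if let Some(&prev) = self.seen.get(&(round, voter)) {
            return if prev == block {
                VoteOutcome::Duplicate
            } else {
                VoteOutcome::Equivocating
            };
        }
        self.seen.insert((round, voter), block);
        let count = self.tally.entry((round, block)).or_insert(0);
        *count += 1;
        if *count == self.quorum {
            VoteOutcome::QuorumFormed
        } else {
            VoteOutcome::Accepted
        }
    }
}
```

Because a node and its twin share one identity, votes from both for different blocks in the same round reach the second branch and are flagged as equivocating, which is the behavior the Twins tester relies on.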

Spoiler alert: In our evaluation in Section 7.1, we are going to deliberately modify the above rules. We will see that this enables safety violations that the Twins tester will expose.

Figure 2: Consensus and preferred sub-trees in LibraBFT.
Figure 3: Design of Libra’s Network Playground for writing unit tests for LibraBFT protocol.

6 Implementation

We implemented a Twins tester for LibraBFT, which we call LibTwins. As described in Section 4, an implementation consists of two principal ingredients, a test generator and a test executor (Figure 1). We first describe the test executor implementation, which leverages a network emulator in LibraBFT referred to as the network playground. We then proceed to describe the test generator implementation. For completeness, the Rust code and interfaces for the main functions of LibTwins, execute_test and test_generator, are provided in Appendix A. We are open sourcing the Rust implementation of LibTwins.

6.1 Test Executor

The LibTwins test executor leverages the test framework of LibraBFT, network playground. Network playground provides an apparatus for unit tests and for running single-host LibraBFT deployments, emulating a network and intercepting all messages exchanged between nodes. Tests can be written to manipulate the intercepted messages (e.g., by dropping certain messages) and observe node responses. Figure 3 shows the design of the network playground. Nodes are represented by processes that run on different threads (and run the full consensus protocol), and network links between them are expressed as Rust channels that provide asynchronous unidirectional communication between threads. In LibraBFT, nodes are identified by their Account Address (a public key that uniquely identifies a node). Channels are associated with their respective account addresses (nodes). When a node starts a new round, it checks whether it is the leader for this round; if so, it generates a block to propose on the fly using a mock block generator. Each call to the mock block generator produces a different block. This has an important implication for LibTwins, as we require a node and its twin to propose different blocks in the same round to emulate equivocation.

The test executor component (Section 4) of LibTwins is built on top of network playground. This required the following modifications and extensions to the original library:

  • Adding twins. We wrote a new method to add nodes to the network that supports twins. The method takes ‘compromised nodes’ as a parameter to refer to the nodes for which to create twins. For each target node, a duplicate instance is created with the same credentials and signing keys. Consequently, in the eyes of the other nodes the compromised node and its twin are indistinguishable.

  • Inferring rounds. LibTwins needs to apply a number of filtering policies at the round level. Network playground does not have a notion of rounds; it only supports static configurations that remain unchanged throughout protocol execution. There is no global notion of rounds in a distributed system with partial synchrony; instead, nodes have their own view of which round they are in, which they include in their messages. We enable network playground to extract the round from intercepted messages and apply filtering criteria accordingly.

  • Round-based message filtering. Network playground allows writing rules to drop intercepted messages that meet certain criteria, i.e., messages to or from specified nodes and messages of specified types (e.g., votes or proposals). LibTwins extends network playground to drop intercepted messages per round, which allows emulating different network partitions per round. The message dropping rules treat compromised nodes and their twins differently: the rules apply to account addresses (which uniquely identify nodes), not public keys (which are the same for a target node and its twins).

  • Deterministic multi-leader election. LibraBFT currently uses a non-deterministic leader election algorithm. LibTwins requires leader election at a finer granularity, i.e., assigning a specified leader to each round, and potentially assigning multiple leaders to a round (because if a compromised node is elected as a round leader, its twin becomes a leader too). We wrote a new leader election algorithm for LibraBFT that supports these requirements.
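To make the round-based filtering concrete, the following is a minimal sketch of a per-round drop rule. It does not use the network playground API; `RoundPartitions`, `InstanceId` and `should_drop` are hypothetical names, and the round is assumed to have already been extracted from the intercepted message.

```rust
use std::collections::HashMap;

/// Identifier of a running instance. In this sketch a node and its twin
/// get distinct ids even though they share credentials and signing keys.
type InstanceId = u32;

struct RoundPartitions {
    // round -> partitions for that round, each a list of instance ids
    by_round: HashMap<u64, Vec<Vec<InstanceId>>>,
}

impl RoundPartitions {
    /// A message is dropped when a partition layout exists for `round`
    /// and the sender and the receiver do not share a partition.
    /// Rounds without a configured layout deliver everything.
    fn should_drop(&self, round: u64, from: InstanceId, to: InstanceId) -> bool {
        match self.by_round.get(&round) {
            None => false,
            Some(parts) => !parts
                .iter()
                .any(|p| p.contains(&from) && p.contains(&to)),
        }
    }
}
```

Keying the rule on instance ids rather than on the shared keys is what lets a node and its twin be placed in different partitions within the same round.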

To emulate running the protocol for a given number of rounds, we approximate rounds by the number of messages emitted by nodes. Note that in a system with partial synchrony, we can only make guesses about rounds as there is no global notion of rounds. Using message-count per-round (without partitions) as an ‘over-guesstimate’, we let the nodes vote for extra rounds. Over-running a test has no consequence on the results of LibTwins (other than longer test execution time) because any safety violations would have already been detected in earlier rounds.

6.2 Test Generator

The test generator produces testcases in three main steps. First, it generates all the possible ways in which a set of nodes can be split into partitions (partition scenarios). Second, it generates all possible ways in which leaders can be combined with the partitions generated in the previous step. Finally, it generates all the possible ways in which the partition-leader combinations can be permuted over rounds of consensus protocol execution. The test generator can operate in online or offline modes. In the online mode, testcases are generated on the fly, and fed to the test executor. The test generator can be configured to write the generated testcases to a file. In the offline mode, the test generator reads previously generated testcases from a file and feeds them to the test executor. For debugging purposes, the test generator can also operate in a ‘dry run’ mode—testcases are generated with the given parameters, without running them, and statistics are printed at the end.
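The first two generation steps can be sketched as below. This is an illustration under stated assumptions rather than the LibTwins generator: `partitions` and `with_leaders` are hypothetical names, instances (nodes and twins) are numbered `0..n`, and partitions are treated as unlabeled and non-empty.

```rust
/// Step 1: every way to split instances `0..n` into exactly `k`
/// non-empty, unlabeled partitions.
fn partitions(n: usize, k: usize) -> Vec<Vec<Vec<usize>>> {
    fn rec(
        i: usize,
        n: usize,
        k: usize,
        cur: &mut Vec<Vec<usize>>,
        out: &mut Vec<Vec<Vec<usize>>>,
    ) {
        if i == n {
            if cur.len() == k {
                out.push(cur.clone());
            }
            return;
        }
        // Place instance i into each existing partition in turn.
        for p in 0..cur.len() {
            cur[p].push(i);
            rec(i + 1, n, k, cur, out);
            cur[p].pop();
        }
        // Or open a new partition, as long as fewer than k exist.
        if cur.len() < k {
            cur.push(vec![i]);
            rec(i + 1, n, k, cur, out);
            cur.pop();
        }
    }
    let mut out = Vec::new();
    rec(0, n, k, &mut Vec::new(), &mut out);
    out
}

/// Step 2: pair each partition scenario with each candidate leader.
fn with_leaders(
    scenarios: &[Vec<Vec<usize>>],
    leaders: &[usize],
) -> Vec<(usize, Vec<Vec<usize>>)> {
    let mut out = Vec::new();
    for s in scenarios {
        for &l in leaders {
            out.push((l, s.clone()));
        }
    }
    out
}
```

For example, 4 nodes plus 1 twin (5 instances) split into 2 partitions yields 15 scenarios. The third step, assigning one partition-leader pair to each round, is omitted here because its output grows exponentially in the number of rounds.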

A naïve enumeration of all combinations of partitions, leaders, and rounds may explode quickly (see Table 1). In order to constrain the number of generated testcases in a particular run, we provide hooks to control the number of partitions, the number of leader-partition pairs, and the number of leader-partition configuration assignments to rounds. For all three cases, we specify whether the selection is deterministic—first —or randomized—an sample. In the third case—configuration assignment to rounds—the total combination space to select from is large. Therefore, the test generator allows randomizing the per-round configuration selection, rather than sampling over the entire space of assignments.

7 Evaluation

We validate the capability of LibTwins to model and detect attacks, present microbenchmarks for the main components of LibTwins, and describe our experiments at scale using Amazon Web Services (AWS) [aws]. We are open sourcing the Rust implementation of LibTwins, AWS orchestration scripts, and microbenchmarking scripts and data to enable reproducible results.

All our evaluations correspond to 4–7 nodes, 4–7 rounds and 2–3 partitions. Intuitively, these configurations seem sufficient to expose any safety violations. Indeed, the known attacks on BFT protocols described in Section 3 were exposed with only a small number of nodes, partitions and leader rotations. A recent work [Niksic2019] on the coverage of random testing to detect crash faults shows that coverage depends on the number of partitions and node labels (in our case, the leaders), but not on the number of nodes. For Jepsen [jepsen], all the bugs that provide meaningful coverage have a small number of rounds, and 2–3 partitions and roles [Niksic2019]. Using higher values for these parameters leads to a very large number of testcases, which cannot be feasibly tested without some sort of filtering (Section 6.2). It is an interesting open question whether increasing the value of these parameters has a higher chance of exposing safety violations.

7.1 Validation

We deliberately introduce bugs to LibraBFT, and validate that LibTwins is able to model and detect attacks that exploit the injected vulnerabilities. This approach is similar to mutation testing, a well-known technique to evaluate the quality of existing tests in terms of whether they can detect programs with deliberately injected modifications (called “mutants”). While approaches such as automated mutation testing can help us to exhaustively introduce mutants, this is computationally expensive and not practical for large, complex systems. We select bugs to inject into LibraBFT based on their ability to compromise the program’s functional correctness. We note that this choice is based on our intuition and experience, and does not provide any coverage guarantees. The validation approach we use is to: (i) inject the bug into LibraBFT; and (ii) generate testcases using the LibTwins test generator, checking for any safety violations. We instantiate the test generator with different configurations, starting with small parameter values that we increase until a safety violation is exposed.

We begin with the base case: can LibTwins generate a testcase that violates safety when the BFT threshold is exceeded (i.e., Byzantine nodes)? We discovered a safety violation with 4 nodes and 2 twins , 7 rounds, and static scenario configuration (i.e., each partition-leader combination is run for all rounds). LibTwins executed 62 testcases of which 8 led to safety violation within 86s.

Changing quorum size to . BFT protocols consider a state transition safe if it receives votes from an honest majority of nodes (i.e., quorum). We change LibraBFT’s quorum size from to . LibTwins detects a safety violation with 4 nodes and 1 twin, 7 rounds, and static scenario configuration (i.e., where each partition-leader combination is run for all the rounds). Within 20s, LibTwins executes 14 testcases, of which 6 lead to a safety violation. These testcases have the same pattern: nodes are split into two partitions of size 2 and 3, with A in one partition and A’ in the other. As nodes in both partitions can form a quorum, they continue, oblivious to each other, to generate quorum certificates on blocks proposed by A and A’, respectively. Ultimately, nodes in the two partitions commit two different blocks. In the log excerpt below from one of the test executions, honest nodes A, B and C commit block 9be3486f extending parent block 4ce7be08, while another honest node, D, extends the same parent but commits a different block, a9ff86ce.

[A] Commit [id: 9be3486f, round: 1, parent_id: 4ce7be08]
[B] Commit [id: 9be3486f, round: 1, parent_id: 4ce7be08]
[C] Commit [id: 9be3486f, round: 1, parent_id: 4ce7be08]
[D] Commit [id: a9ff86ce, round: 1, parent_id: 4ce7be08]
[A’] Commit [id: a9ff86ce, round: 1, parent_id: 4ce7be08]
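A safety check over such commit records can be sketched as follows. The `Commit` struct and `find_conflict` are hypothetical names, and the check is simplified to "no two different blocks committed in the same round"; a full checker would compare the committed branches at every height.

```rust
use std::collections::HashMap;

/// One commit record, mirroring the fields printed in the log excerpt.
struct Commit<'a> {
    node: &'a str,
    id: &'a str,
    round: u64,
}

/// Returns the first pair of nodes that committed different block ids
/// in the same round, or None if all commits agree.
fn find_conflict<'a>(commits: &[Commit<'a>]) -> Option<(&'a str, &'a str)> {
    // round -> (first node seen, block id it committed)
    let mut first_seen: HashMap<u64, (&'a str, &'a str)> = HashMap::new();
    for c in commits {
        match first_seen.get(&c.round).copied() {
            Some((node, id)) if id != c.id => return Some((node, c.node)),
            Some(_) => {}
            None => {
                first_seen.insert(c.round, (c.node, c.id));
            }
        }
    }
    None
}
```

Applied to the five records above, the check reports a conflict between A and D, since both commit in round 1 on top of the same parent but with different block ids.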

Accepting conflicting votes. When a node receives a block proposal, it votes for the block only if the block’s round is greater than the last round for which the node voted (Safety Rule 1, Section 5). We introduce a subtle bug to LibraBFT by changing this rule, so that a node votes for a block if the block’s round is greater than or equal to its last voted round. LibTwins detects the safety violation within a few seconds, with 4 nodes and 1 twin, and 7 rounds. This safety bug was detected in one shot, with 0 partitions. Nodes vote on proposals from both A and A’; after a few rounds, they end up committing two different proposals for the same round.

Figure 4: Time taken by the test generator to produce LibTwins testcases for (a) 4 nodes, 2 partitions, 4 rounds and (b) 4 nodes, 2 partitions, 7 rounds. Each data point is the average of 10 runs; error bars represent one standard deviation.

Forgetting to update preferred round. When a node receives a block proposal, it votes for the block if the is greater than , and the block’s is greater than or equal to (Safety rules 1 and 2, Section 5). We disable the first check, and bypass the second check by never updating so it permanently remains at 0 (Update rule 2, Section 5). The main ingredient of an attack that exploits the bug described above is to propose a block in an old round, and get the nodes to over-write committed blocks (safety violation). The challenge for LibTwins is that as a twin node runs correct code, it cannot be made to propose blocks in arbitrary rounds. One option is to partition the twin node in an old round, and bring it back up in a later round, so it starts proposing blocks from where it left. This is, however, not possible in a ‘full disclosure’ protocol like LibraBFT where each quorum certificate (or timeout certificate) contains the full history of previous messages that led to the certificate. That is, as soon as recovers from the partition, it receives a quorum certificate (or timeout certificate) from other nodes and advances its round.

To emulate going back in time and proposing a block for an older round, we let it run as leader for a few rounds, crash it, and then recover it again as leader. When comes back up again it starts from round 0, proposing a block that builds on the genesis block (the first committed block). Because of our modifications to the and checks, the nodes re-write history.

Nodes  Twins  Partitions  Rounds  Step 1  Step 2  Step 3
                                                  Without Replacement  With Replacement  Static
4      1      2           4       15      15                                             15
4      1      3           4       25      25                                             25
4      1      2           7       15      15                                             15
4      1      3           7       25      25                                             25
7      2      2           4       255     510                                            510
7      2      3           4       3,025   6,050                                          6,050
7      2      2           7       255     510                                            510
7      2      3           7       3,025   6,050                                          6,050
Table 1: The number of LibTwins testcases generated for various configurations. Steps 1, 2 and 3 correspond to the testcase generation pipeline described in Section 4. Step 1: The number of ways in which nodes can be distributed among partitions. Step 2: The number of ways in which the partitions generated in Step 1 can be combined with leaders. Step 3: The number of ways in which the partition-leader pairs generated in Step 2 can be permuted (with and without replacement) over rounds. Static configuration refers to the case where each partition-leader pair is statically configured for all the rounds.
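The Step 1 and Step 2 counts in Table 1 can be reproduced with a short calculation. The sketch below rests on two assumptions that are consistent with the table but are not taken from the LibTwins code: Step 1 counts splits of all instances (nodes plus twins) into non-empty unlabeled partitions, i.e., a Stirling number of the second kind, and Step 2 multiplies that by the number of twinned nodes, since only those are configured as leaders. The names `stirling2` and `scenario_counts` are hypothetical.

```rust
/// Stirling number of the second kind: ways to split `n` labeled items
/// into exactly `k` non-empty, unlabeled groups.
fn stirling2(n: u64, k: u64) -> u64 {
    if n == k {
        return 1; // includes S(0, 0) = 1
    }
    if k == 0 || k > n {
        return 0;
    }
    k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
}

/// Returns (Step 1, Step 2, Step 3 with replacement) for a configuration.
/// The last value picks one partition-leader pair independently per round.
fn scenario_counts(nodes: u64, twins: u64, partitions: u64, rounds: u32) -> (u64, u64, u128) {
    let step1 = stirling2(nodes + twins, partitions);
    let step2 = step1 * twins;
    let with_replacement = (step2 as u128).pow(rounds);
    (step1, step2, with_replacement)
}
```

Under these assumptions, 4 nodes, 1 twin, 2 partitions and 7 rounds give 15 scenarios in Steps 1 and 2, and 15^7 = 170,859,375 round assignments with replacement, which illustrates why non-static configurations quickly become too large to enumerate.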

7.2 Microbenchmarks

We present microbenchmarks for the two main components of LibTwins: test generator (Section 6.2) and test executor (Section 6.1). The microbenchmarks are run on an Apple laptop (MacBook Pro) with a 2.9 GHz Intel Core i9 (6 physical and 12 logical cores), and 32 GB 2400 MHz DDR4 RAM.

Test generator microbenchmarks. The test generator incurs a one-time computational cost: once the testcases are generated, the test generator feeds them one by one to the test executor. Table 1 shows the number of testcases generated with different configurations. We observe that the number of nodes and the number of partitions significantly increase the output of Step 1, and that the output of Step 2 grows proportionally with the number of twins (as we only configure nodes with twins to become leaders). We find that non-static configurations in Step 3 cause the number of testcases to explode. Therefore, of the various filters implemented for the test generator (Section 6.2), we find the filter at Step 2 to be most useful. We use this filter to make our at-scale Twins testing (Section 7.3) feasible. Note that this inevitably comes at the cost of completeness of coverage, a trade-off that we cannot completely eliminate. Figure 4 shows how long the test generator takes to produce testcases for the same number of nodes (4) and partitions (2), and 4 (Figure 4(a)) and 7 (Figure 4(b)) rounds. We observe that while it expectedly takes longer to generate testcases for 7 rounds than for 4 rounds due to a larger number of permutations, in each case the time taken increases linearly in the number of testcases. We observe a similar linear trend in our microbenchmarks for other configurations with varying numbers of nodes and partitions (figures not included due to space constraints).

Rounds 4 Nodes 7 Nodes
Mean (ms) Std. (ms) Mean (ms) Std. (ms)
4 239 314 547 1,286
5 250 87 555 1,059
6 284 88 555 802
7 296 87 559 752
8 334 209 647 810
9 363 175 643 557
10 398 222 653 539
11 433 168 718 570
12 465 179 748 223
Table 2: The time test executor takes to execute a testcase for 4 and 7 nodes, over varying number of rounds and fixed partitions (=2). Each measurement is repeated for 100 randomly selected testcases.

Test executor microbenchmarks. Table 2 shows the time the test executor takes to execute a testcase. We repeat each measurement over 100 randomly selected testcases from a configuration with 2 partitions, and varying number of nodes (4 and 7) and rounds (4–12). We observe that for 4 nodes, the mean execution time ranges from 239–465ms for 4–12 rounds, with a maximum standard deviation of 314ms. For 7 nodes, the mean execution time ranges from 547–748ms for 4–12 rounds, with a maximum standard deviation of 1,286ms.

The variation observed above in execution times is expected because of how LibraBFT handles timeouts (Section 5). For each testcase, LibTwins runs LibraBFT until it has observed a given number of messages (proposals and votes), which roughly corresponds to the number of rounds. In some testcases, LibTwins can quickly pull out the given number of messages and finish the testcase in a timely manner. In other testcases, we might end up with partitions where the nodes are not able to make progress and advance rounds, due to frequent round failures and increased timeout values. The implication of this for LibTwins is that some testcases may take longer to run, waiting for the network to emit enough messages to conclude the test. The execution of testcases has negligible () memory and CPU footprints.

7.3 Running Tests at Scale

We evaluate LibTwins at scale, by running it against the correct code of LibraBFT. Specifically, we executed 3M testcases which were randomly selected from the 200M testcases corresponding to the third row of Table 1 (that is, with 4 nodes, 2 partitions, 7 rounds, permuted with replacement). We first generated all the 200M testcases and randomly selected 3M samples. We ran the test generator in offline mode so the testcases are written to file rather than being passed to the test executor. We then split the generated testcases into 20 shards. The testcases can be easily sharded, as the tests are independent of each other—this implies that subject to the availability of computing power to generate and execute testcases, LibTwins can be scaled up arbitrarily via sharding. We execute the sharded testcases over 20 parallel instances of LibTwins on AWS. We use t3.2xlarge instances with 8 vCPUs, 2.5 GHz, Intel Skylake P-8175; 32 GB of RAM, and 300 GB of SSD storage. All machines run a fresh installation of Ubuntu 18.04. We did not observe any safety violations.
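Since testcases are independent, sharding can be as simple as distributing testcase indices round-robin. The sketch below shows one such scheme; `shard_indices` is a hypothetical helper, and the text does not state which splitting strategy was actually used.

```rust
/// Splits testcase indices `0..total` round-robin into `shards` buckets,
/// so that bucket sizes differ by at most one.
fn shard_indices(total: usize, shards: usize) -> Vec<Vec<usize>> {
    assert!(shards > 0, "at least one shard is required");
    let mut out = vec![Vec::new(); shards];
    for i in 0..total {
        out[i % shards].push(i);
    }
    out
}
```

With 3M testcases and 20 shards, each shard receives 150,000 testcases that one LibTwins instance can execute on its own.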

8 Related Work

There are two typical approaches to testing distributed systems. The first approach is to offer strong guarantees by building a fully verified system from the ground up [lamport:1994, prabhu2020plankton], or to show the absence or presence of bugs [wu2014diagnosing, chen2016good, chen2014detecting, lin2013defined] by exhaustively enumerating the space of system behaviors [model-checking, yabandeh2010predicting] under systematically injected faults [LDFI].

Fully verified systems do not scale to systems deployed in the real world. Model checking and exhaustive enumeration of distributed system faults (especially arbitrary Byzantine behavior) lead to state explosion (despite partial order reduction techniques [partial-order]), resulting in low performance. This motivates the second approach of random testing, which underlies the discipline of Chaos Engineering, exemplified by systems like Chaos Monkey [chaos-monkey]. The main idea is to test the resiliency of a distributed system by randomly injecting faults (e.g., terminating processes). Jepsen [jepsen] is a blackbox testing framework that runs processes with a random, auto-generated workload and randomly injected network partitions. A related approach is to subject the system being evaluated to trials by fire such as Cosmos Game of Stakes [cosmos2018games], i.e., financially incentivizing the community to attack the test network, and analyzing successful attacks to harden the network. Random testing is effective and scalable, but it is not comprehensive or reproducible, and cannot be used to evaluate distributed systems in an ongoing fashion.

Prior work (with the exception of Jepsen, a random testing framework) has focused on crash faults. Twins is a new, principled approach to test BFT systems by emulating Byzantine behavior via twins, i.e., copies of ‘compromised’ nodes that can send duplicate or conflicting messages. Twins advances the state of the art in two ways. First, it provides a framework to systematically generate tests with configurable coverage while modeling only correct executions (thus avoiding the state explosion problem associated with formal methods). We show with extensive evaluations (Section 7) that Twins is suitable for evaluating real-world systems, and can be scaled up arbitrarily for larger test coverage. The second contribution of Twins is to characterize what we call a ‘white glove’ testing approach, which occupies the middle ground between existing black-box and white-box testing approaches. Twins can automatically generate unit tests that modify the interaction of components with the environment (unlike black-box testing), without opening the code (as in white-box testing).

9 Future Work & Conclusion

This paper presented Twins, a novel approach to systematically test BFT systems. The new approach provides coverage for many, but not all, Byzantine attacks. The paper demonstrated anecdotal evidence of coverage with respect to several known Byzantine attacks, and an implementation of Twins for LibraBFT that exposes misconfiguration and purposely injected logical bugs within minutes. Many directions are left open for future extensions.

Theory of Twins coverage. As mentioned in the Introduction, it is left open to rigorously characterize the attacks that Twins can cover. In particular, we conjecture that Twins covers all Byzantine behaviors in a class of protocols that have ‘full disclosure’: each message includes a reference to its entire causal past and any source of non-determinism (such as local coin flips), and nodes act deterministically according to their causal past. It would seem that this class of protocols is fully covered by Twins since the only possible attack by Byzantine nodes is to select different subsets of messages to report to different targets. Similarly, we conjecture that Twins can cover timing violations in a class of ‘lock-step’ synchronous protocols.

Additional Twins mechanisms. As mentioned in the “Limitations” part of the Introduction, for other classes of protocols, it is left open to increase the coverage of the Twins approach. One potential extension is to cover more Byzantine behaviors that do not adhere to ‘full disclosure’. Another extension is to cover timing violations.

Checking additional properties. A different dimension for extension is the type of guarantees which Twins tests. While this paper focused squarely on safety of the core consensus protocol, the Twins testing approach can be extended to test ancillary components of BFT systems. For example, LibraBFT switches to a new set of nodes by committing a special block that includes the new set of nodes and signals the reconfiguration event. It would be useful to investigate if Twins can cause a safety violation by creating an inconsistent node change (i.e., parts of the network believe in different nodes). Similarly, LibraBFT’s smart contract execution engine is re-instantiated via a similar mechanism, and can be subjected to a similar Twins-based attack.

Extending Twins implementation. With respect to the concrete LibraBFT Twins implementation presented in Section 6, several extensions are left for future work, including: (i) tackling more than a pair of twins; (ii) detecting liveness violations; and (iii) implementing process-level twins over TCP/IP.


This work is funded by Calibra. The authors would like to thank Ben Maurer, Avery Ching, Zekun Li, David Dill, Daniel Xiang, Kartik Nayak, and Ling Ren for feedback on a late version of the manuscript, and George Danezis for comments on an early version. We also thank the Calibra Research and Engineering teams for valuable feedback.


Appendix A LibTwins Implementation of Test Executor and Test Generator

This section provides the Rust code for the two main functions of Twins, execute_test and test_generator. The code listings in Figure 5 and Figure 6 present simplified Twins interfaces, i.e., we omit Rust-specific features such as explicit typing, details of error messages returned, de-referencing, and managing variable ownership.

fn execute_scenario(
    num_nodes, // number of nodes
    target_nodes, // the nodes for which to create twins
    round_partitions, // Vector of partitions for each round
    round_leaders // Vector of leaders for each round
) {
    let runtime = consensus_runtime();
    let playground = NetworkPlayground::new(runtime.handle());
    // Start nodes and twins
    let nodes = SMRNode::start_num_nodes_with_twins(
    // Create partitions
    create_partitions(&playground, round_partitions);
    // Start running the protocol and sending messages
    block_on(async move {
        let proposals = playground
            .wait_for_messages(2, NetworkPlayground::proposals_only::<TestPayload>)
        // Pull enough votes to get a commit on the first block
        let votes: Vec<VoteMsg> = playground
            .wait_for_messages(num_nodes * num_of_rounds, NetworkPlayground::votes_only::<TestPayload>))
    // Check that the branches are consistent at all heights
    let all_branches = vec![];
    for i in 0..nodes.len() {
        let node_commits = vec![];
        while let node_commit_id = nodes[i].commit_cb_receiver.try_next() {
    // Stop all nodes
    for each_node in nodes {
Figure 5: The execute_scenario function which executes test cases.
fn twins_no_quorum_test() {
    let runtime = consensus_runtime();
    let playground = NetworkPlayground::new(runtime.handle());
    let num_nodes = 4;
    // 4 honest nodes
    let n0 = 0, n1 = 1, n2 = 2, n3 = 3;
    // twin of n0
    let twin0 = node_to_twin.get(n0);
    // twin of n1
    let twin1 = node_to_twin.get(n1);
    // Index #s of nodes for which we will create twins
    let target_nodes = vec![0];
    // Specify round leaders
    let round_leaders = HashMap::new();
    for i in 1..10 {
        // Insert (n0, twin0, n3) as leaders for round i
        round_leaders.insert(i, vec![n0, twin0, n3]);
    // Specify round partitions
    let round_partitions = HashMap::new();
    for r in 0..10 {
        // Insert partitions for round r
                vec![n0, twin0, n1],
                vec![n2, n3],
Figure 6: Twins ‘No Quorum’ test.

The test executor, implemented by execute_scenario (Figure 5), executes test cases generated by the test generator. This function takes as input the number of nodes and twins, and the leaders and partitions for each round. It creates a network with the given inputs, and runs the protocol until the nodes have emitted a given number of messages, which approximates the number of rounds for which the protocol has been run.

The function exposes a simple interface, abstracting complex underlying network and SMR configurations. To demonstrate the simplicity and flexibility of execute_scenario, we show how to implement a simple test (Figure 6) where no quorum can be formed, and therefore no block gets committed. We set up a network with 4 honest nodes (n0, n1, n2, n3) and 1 twin (twin0). We split the network into two partitions, {n0, twin0, n1} and {n2, n3}. For each round, n0 and twin0 (in partition 1) and n3 (in partition 2) are leaders. We then run the protocol for enough rounds (at least 3 in LibraBFT) to get a commit on a block. In partition 1, both n0 and twin0 propose different blocks for the same rounds. n1 will only vote for one of the two proposals because the second proposal is for a round that is not greater than its last voted round (Safety Rule 1, Section 5). The second partition does not have enough nodes to form a quorum. Consequently, no blocks are committed.

Appendix B Detailed Safety Attack on Zyzzyva

We present a summary of Zyzzyva, and use Twins to reinstate a known safety attack [abraham2018revisiting] on Zyzzyva [zyzzyva]. We use the notation described in Section 2.

B.1 Summary of Zyzzyva

Zyzzyva is an SMR protocol in the same settings as LibraBFT (partial synchrony and ). It operates in a view-by-view manner. Each view has a designated leader. Nodes vote on the leader proposal if they consider it valid (we describe the validity criteria below, which has a flaw that enables the safety attack). A commit decision on the leader proposal forms in either of two tracks, fast and two-phase. In the fast track, all nodes vote for the leader proposal to commit it. In the two-phase track, nodes form a commit-certificate (), then nodes vote for the to commit the proposal.

At the beginning of the view, nodes send the new leader a signed NEW-VIEW status message. The leader’s first proposal carries the status of nodes at the beginning of the view to prove the proposal validity. The (flawed) definition in Zyzzyva for a valid proposal upon view change is as follows. For each sequence slot:

  • Validity Rule 1 The leader picks among the states of nodes, the from the highest view, if one exists.

  • Validity Rule 2 Otherwise, the leader picks a proposal that has votes from the highest view, if one exists.

  • Validity Rule 3 Finally, if none of the above exist, the leader creates a Nil proposal.

The flaw is that Validity Rule 1 takes priority over Validity Rule 2, which causes the leader to prefer even if it was generated in a lower view than votes.
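The three rules and their ordering can be illustrated with a short Python sketch. It is a simplified model rather than Zyzzyva's actual message format: the `Status` record, the function name `flawed_choice`, the values `"A"` and `"B"`, and the `vote_threshold` parameter are all hypothetical, and the threshold is left as an explicit argument.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Status:
    """One node's NEW-VIEW report for a single sequence slot (hypothetical shape)."""
    voted_value: Optional[str] = None  # value the node last voted for, if any
    voted_view: int = 0
    cc_value: Optional[str] = None     # value of a commit-certificate it holds, if any
    cc_view: int = 0


def flawed_choice(statuses: List[Status], vote_threshold: int) -> str:
    # Validity Rule 1: a commit-certificate from the highest view wins outright.
    certified = [s for s in statuses if s.cc_value is not None]
    if certified:
        return max(certified, key=lambda s: s.cc_view).cc_value
    # Validity Rule 2: otherwise, a value with enough votes from the highest view.
    tally: Dict[Tuple[int, str], int] = {}
    for s in statuses:
        if s.voted_value is not None:
            key = (s.voted_view, s.voted_value)
            tally[key] = tally.get(key, 0) + 1
    supported = [key for key, count in tally.items() if count >= vote_threshold]
    if supported:
        return max(supported)[1]       # highest view first
    # Validity Rule 3: nothing to preserve.
    return "Nil"


# A certificate for "A" from view 1 overrides two votes for "B" from view 2.
reports = [
    Status(voted_value="A", voted_view=1, cc_value="A", cc_view=1),
    Status(voted_value="B", voted_view=2),
    Status(voted_value="B", voted_view=2),
]
choice = flawed_choice(reports, vote_threshold=2)  # "A"
```

Because Rule 1 is checked first, the stale certificate is chosen even though the votes come from a later view; dropping the first report from `reports` makes the same function return `"B"`.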

B.2 Safety Attack on Zyzzyva

The safety violation in Zyzzyva demonstrated in [abraham2018revisiting] unfolds over a succession of three views. In the first view, a faulty leader generates conflicting proposals and splits honest nodes between that vote for and that vote for . The faulty leader gathers a on but does not send it to other nodes. In the second view, a good leader adopts and drives agreement in the fast track. In the third view, faulty nodes join the honest nodes that voted for in the first view. They send the leader a for , hence the protocol proceeds with , in conflict with the commit. The attack on Zyzzyva needs only nodes, of which is faulty, and it is fairly easy to reinstate using the Twins framework. There are four nodes, . To model the case that is Byzantine, it has a twin initialized with different input. We drive the execution by creating partitions and electing leaders at each step, according to the attack described above. We describe below the detailed attack using Twins.

Step 1

Initialize and with different inputs and .

Step 2

During View 1:

  • Create the following partitions: , .

  • Let run as leader for one round. proposes to and gathers votes from creating a .

  • Create the following partitions: , , .

  • As a result, does not get to share on with and .

  • Similarly, for one round let propose to and gather votes from .

Step 3

Delay all messages until a new view starts. View 2:

  • Create the following partitions: , .

  • Run as leader, and let it collect (NEW-VIEW) messages from and . Using Validity Rule 2 (Section B.1), decides to propose for .

  • Remove all partitions, i.e., .

  • proposes , and collects votes from everyone. This leads to a commit of .

Step 4

Delay all further messages until a new view starts. View 3:

  • Create the following partitions: , .

  • Run as leader, and collect (NEW-VIEW) messages from and . Note that sends the on (from view 1) to . Using Validity Rule 1 (Section B.1), decides to propose .

  • proposes to , and gathers votes from , and (who empty their local logs, undoing ). This leads both and to commit , a safety violation.
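A safety check that would flag the outcome of this scenario can be sketched as follows. This is a minimal illustration under assumptions: commits are modeled as hypothetical `(node, slot, value)` records, and the node and value labels in the example are placeholders rather than the paper's notation.

```python
from typing import Dict, Iterable, List, Tuple

Commit = Tuple[str, int, str]  # (node, sequence slot, committed value)


def find_safety_violations(commits: Iterable[Commit]) -> List[Tuple[int, str, str]]:
    """Return (slot, first_value, conflicting_value) for every conflicting commit."""
    first_value: Dict[int, str] = {}
    violations: List[Tuple[int, str, str]] = []
    for _node, slot, value in commits:
        if slot not in first_value:
            first_value[slot] = value
        elif first_value[slot] != value:
            violations.append((slot, first_value[slot], value))
    return violations


# Hypothetical trace: value "A" is committed in view 2, then two nodes commit
# the conflicting value "B" for the same slot in view 3.
trace = [
    ("node_1", 0, "A"),
    ("node_2", 0, "B"),
    ("node_3", 0, "B"),
]
violations = find_safety_violations(trace)
```

Any non-empty result means two different values were committed for the same slot, which is the agreement violation the attack produces.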

Appendix C Detailed Liveness Attack on FaB

We present a summary of FaB, and use Twins to reinstate a known liveness attack on FaB [abraham2018revisiting]. We use the notation described in Section 2.

C.1 Summary of FaB

FaB is a single-shot consensus protocol for the partial synchrony setting with . (Footnote 7: FaB is actually designed for a parameterized model with , with safety guaranteed against Byzantine failures and fast track guaranteed against . For brevity and uniformity, we ignore here and set .)

A precursor to Zyzzyva, FaB is a view-based protocol with an optimistic fast track. A leader drives a decision in the fast track if all nodes vote for it, and in the two-phase track if nodes vote for a commit-certificate (). When a new leader is elected, it picks a valid proposal, that is, one that conflicts with neither votes nor a in the previous view.

C.2 Liveness Attack on FaB

The (flawed) selection criterion above causes an execution to become stuck in the following scenario. A faulty leader equivocates and proposes to and honest nodes, respectively. In transitioning to the next view, there is a commit-certificate for and votes for (including an equivocation by one faulty node), hence neither is safe, and the new leader is stuck. The attack on FaB needs only nodes, of which is faulty, and it can be easily reinstated using Twins. There are four nodes, with as a Byzantine node, for which we create a twin initialized with different input. We describe below the attack using Twins.

Step 1

Initialize and with different inputs and .

Step 2

During View 1:

  • Create the following partitions: ,

  • Run as leader for one round. proposes to which decides to vote on .

  • Insert the following rule in : . That is, the only messages allowed are those from and , to .

  • , and send their votes which only reach . Thus, only produces a for .

  • Meanwhile, the leader proposes to .

Step 3

Delay all further messages until a new view starts. Create the partitions: , . Let the new leader collect NEW-VIEW status messages from . These status messages block from proposing either or due to the FaB proposal validity rule. The rule states that a proposal is valid only if it conflicts with neither votes nor a in the previous view. This does not hold for (has a ) or (has votes), as described below:

  • From , the NEW-VIEW message contains the value , and a for it.

  • From , the NEW-VIEW message contains the value , and no .

  • From , the NEW-VIEW message contains the value , and no .
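The deadlock can be reproduced with a small Python sketch of the validity rule. It is a simplified model under stated assumptions: each status report is a hypothetical `(value, has_certificate)` pair, the values `"A"` and `"B"` are placeholders, the two certificate-free reports are assumed to name the other value, and `vote_threshold` is an explicit parameter rather than a value taken from the protocol.

```python
from collections import Counter
from typing import List, Tuple

StatusReport = Tuple[str, bool]  # (reported value, holds a commit-certificate for it)


def valid_proposals(candidates: List[str], reports: List[StatusReport],
                    vote_threshold: int) -> List[str]:
    """Keep candidates that conflict with neither enough votes nor a certificate."""
    votes = Counter(value for value, _ in reports)
    certified = {value for value, has_cc in reports if has_cc}
    heavily_voted = {value for value, count in votes.items() if count >= vote_threshold}
    protected = certified | heavily_voted
    # A candidate is blocked if any *other* value is protected.
    return [c for c in candidates if not (protected - {c})]


# One report carries "A" with a certificate; two reports carry "B" without one.
reports = [("A", True), ("B", False), ("B", False)]
options = valid_proposals(["A", "B"], reports, vote_threshold=2)
```

With these inputs `options` is empty: `"A"` conflicts with the votes for `"B"`, and `"B"` conflicts with the certificate for `"A"`, so the new leader has nothing it may propose.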

Appendix D Detailed Safety Attack on Sync HotStuff

We present a summary of Sync HotStuff, and use Twins to reinstate the force-locking attack [momose2020force] on a preliminary version of Sync HotStuff (which was fixed in an updated version). We use the notation described in Section 2.

D.1 Summary of Sync HotStuff

The preliminary version of Sync HotStuff [synchs] is an SMR solution in the synchronous model with parties. (Footnote 8: The description here covers the first of three variants in that paper; two other variants are designed for slightly different synchrony assumptions, but the attacks on them are similarly covered by the Twins approach.)

In synchronous protocols like Sync HotStuff, nodes execute the protocol in terms of , which is the known bound assumed on maximal network transmission delay. Sync HotStuff operates in a view-by-view regime: in each view there is a designated leader which proposes values to nodes. If a node accepts the proposed value, it broadcasts its vote. A node creates a commit certificate () for a proposed value if it receives votes on it. Nodes track the highest , and only vote on a proposed value if it: (i) extends the highest known to the node, and (ii) does not equivocate with another value proposed for the same height.

A node creates and broadcasts a blame against a leader if the leader (i) does not propose a value for , or (ii) proposes an equivocating value. If a node observes blames against the leader in the current view, it broadcasts the blames, then waits (to allow the blames to reach all honest nodes), and moves to the new view. In the new view, it immediately sends the new leader the highest it knows of.
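The two voting conditions can be sketched in Python as follows. This is an illustrative model, not Sync HotStuff's implementation: blocks are hypothetical string labels, the chain is a `parent` map, and `seen_at_height` records the first proposal a node observed at each height.

```python
from typing import Dict, Optional


def extends(parent: Dict[str, Optional[str]], block: str, ancestor: str) -> bool:
    """True if `ancestor` is `block` itself or reachable through parent links."""
    current: Optional[str] = block
    while current is not None:
        if current == ancestor:
            return True
        current = parent.get(current)
    return False


def should_vote(proposal: str, height: int, parent: Dict[str, Optional[str]],
                highest_certified: str, seen_at_height: Dict[int, str]) -> bool:
    # (i) The proposal must extend the highest certified block known to the node.
    if not extends(parent, proposal, highest_certified):
        return False
    # (ii) No different value may have been proposed for the same height.
    earlier = seen_at_height.get(height)
    return earlier is None or earlier == proposal


# Hypothetical chain: b1 and b2 are siblings, b3 extends b1.
parent = {"genesis": None, "b1": "genesis", "b2": "genesis", "b3": "b1"}
vote_fork = should_vote("b2", 1, parent, highest_certified="b1", seen_at_height={})
vote_child = should_vote("b3", 2, parent, highest_certified="b1", seen_at_height={})
vote_equiv = should_vote("b3", 2, parent, highest_certified="b1",
                         seen_at_height={2: "other"})
```

A node whose highest certificate is for `b1` refuses the sibling `b2`, accepts the child `b3`, and refuses `b3` once a different value has been seen at the same height.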

After a view change, the new leader waits for to receive node status messages (carrying the highest known to them). The leader then proposes a value that extends the highest from among the received status messages. Nodes proceed in the new view as previously described.

D.2 Implementing Synchrony Attacks in Twins

Due to the synchronous settings and the nature of the attack which heavily leverages synchrony assumptions, in this case a Twins scheduler must control message delivery timing. More precisely, rather than only specifying whether a message is delivered to a party or dropped, attacks on synchronous protocols require the Twins scheduler to deliver messages to specific parties at specified times. While this is captured by the Twins approach, our current implementation (Section 6) does not support this feature (this will be implemented in future Twins extensions).

Generally, we expect that the granularity of the scheduler timing can be fairly coarse. In particular, there is a known parameter , the bound presumed by the algorithm on message transmission delays and hard-coded into it. Indeed, the force-locking attack needs to deliver messages at increments, e.g., at times . Therefore, a Twins network emulator could operate in discrete lock-step at increments. With this capability in place, the force-locking attack can be re-instated in the Twins approach as described below.
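A discrete lock-step emulator of this kind can be sketched in Python. This is a hypothetical design, not part of the current Twins implementation: the class `LockStepNetwork` and its methods are invented for illustration, and one tick stands for one scheduler increment, whatever fraction of the delay bound is chosen.

```python
import heapq
from typing import Any, Dict, List, Tuple


class LockStepNetwork:
    """Delivers each message to its recipient at an exact discrete tick."""

    def __init__(self) -> None:
        self.now = 0                                   # current tick
        self.inbox: Dict[str, List[Any]] = {}          # recipient -> delivered payloads
        self._queue: List[Tuple[int, int, str, Any]] = []
        self._seq = 0                                  # tie-breaker keeps FIFO order

    def send(self, recipient: str, payload: Any, delay_ticks: int) -> None:
        """Schedule delivery `delay_ticks` ticks from now (0 means immediately)."""
        if delay_ticks < 0:
            raise ValueError("delay_ticks must be non-negative")
        heapq.heappush(self._queue,
                       (self.now + delay_ticks, self._seq, recipient, payload))
        self._seq += 1
        self._deliver_due()

    def advance(self) -> None:
        """Move the global clock forward by one tick and deliver what is due."""
        self.now += 1
        self._deliver_due()

    def _deliver_due(self) -> None:
        while self._queue and self._queue[0][0] <= self.now:
            _, _, recipient, payload = heapq.heappop(self._queue)
            self.inbox.setdefault(recipient, []).append(payload)


# Hypothetical usage: a proposal that arrives late and a vote with zero delay.
net = LockStepNetwork()
net.send("node_x", "proposal", delay_ticks=2)
net.send("node_y", "vote", delay_ticks=0)
net.advance()                                          # tick 1: nothing new arrives
inbox_after_one_tick = dict(net.inbox)
net.advance()                                          # tick 2: the proposal arrives
```

Because the scheduler chooses `delay_ticks` per recipient, it can make a proposal reach only some nodes before their timers expire, which is the capability the force-locking attack relies on.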

D.3 Safety Attack on Sync HotStuff

We now rebuild the force-locking attack on the preliminary version of Sync HotStuff using Twins. The crux of the attack is for a faulty leader to generate a last-minute proposal that reaches only half of the honest nodes. The other half triggers a view change, and now the system becomes split. The first half continues to commit the first leader’s proposal with “help” from Byzantine nodes. The second half starts a new view and forks the chain.

Notation. We extend the notation described in Section 2 to capture message transmission in the synchronous setting as follows: denotes the transmission of a value from a set of nodes that generate the value at time , to a set of nodes that receive the value at time . If a value is broadcast, we use the symbol instead of a set. For example, means that broadcasts a value at time . Additionally, to highlight the ‘send’ or ‘receive’ action on a value, we use bold text on the left or right side of the arrow, respectively. For example, means that sends to (message arrival time is not known).

To reinstate this attack with Twins, we deploy 5 nodes , of which are faulty and have twins . Here, , , and quorum size is (since synchronous BFT protocols tolerate Byzantine nodes for ). We describe below the detailed attack using Twins.

At time


  • is the leader, and broadcasts a proposal with for the value which extends .

At time


  • receives , and broadcasts its vote.

At time


  • blames since it did not receive a proposal from within . Twins also did not receive a proposal from , hence they also blame with . broadcast their blames with , receive blames from each other, and start waiting for .

At time


  • receives ’s vote on , but it cannot create a on since it has fewer than votes.

  • broadcast their votes on , which arrive at with delay 0. As a result, gathers votes on and creates ().

At time


  • receives blame messages from , broadcasts all blame messages, and starts waiting for .

  • has waited for since it quit the old view with leader , so it starts the next view and sends its highest commit certificate () along with blames on to the next leader , with .

  • The new leader receives () from and blames on , and broadcasts a proposal for value extending . Note that does not know about ().

  • receives the proposal from , and broadcasts its vote with delay , then it sets its commit timer to and starts counting down.

At time


  • receives votes on from ; as it has now gathered votes on it creates (). However, this certificate is too late, as we will see in the following steps.

At time


  • has waited for since it quit the old view with leader , so it starts the next view and sends its highest certificate () to the new leader .

  • receives ’s vote on but does not vote since (which extends ()) does not extend its highest certificate ().

At time


  • commits since it finished waiting for and observed no equivocation or blame in the view . However, ’s highest certificate is () (see time ).

  • Now if the current leader goes offline, this will result in a view change to view and the new leader will extend the blockchain from the highest certificate from the previous view, (). But has committed conflicting with , hence safety is violated.