RCanopus: Making Canopus Resilient to Failures and Byzantine Faults

10/22/2018
by   S. Keshav, et al.
University of Waterloo

Distributed consensus is a key enabler for many distributed systems including distributed databases and blockchains. Canopus is a scalable distributed consensus protocol that ensures that live nodes in a system agree on an ordered sequence of operations (called transactions). Unlike most prior consensus protocols, Canopus does not rely on a single leader. Instead, it uses a virtual tree overlay for message dissemination to limit network traffic across oversubscribed links. It leverages hardware redundancies, both within a rack and inside the network fabric, to reduce both protocol complexity and communication overhead. These design decisions enable Canopus to support large deployments without significant performance degradation. The existing Canopus protocol is resilient in the face of node and communication failures, but its focus is primarily on performance, so it does not respond well to other types of failures. For example, the failure of a single rack of servers causes all live nodes to stall. The protocol is also open to attack by Byzantine nodes, which can cause different live nodes to conclude the protocol with different transaction orders. In this paper, we describe RCanopus (‘resilient Canopus’), which extends Canopus in two ways. First, it adds liveness, that is, it allows live nodes to make progress, when possible, despite many types of failures; this requires RCanopus to accurately detect and recover from failures despite using unreliable failure detectors, and to tolerate Byzantine attacks. Second, RCanopus guarantees safety, that is, agreement amongst live nodes on transaction order, in the presence of Byzantine attacks and network partitioning.


1 Introduction

Distributed consensus is a key enabler for many distributed systems including distributed databases and blockchains [3]. Canopus [12] is a scalable distributed consensus protocol that ensures that live nodes in a system agree on an ordered sequence of operations (called transactions). Unlike most prior consensus protocols, Canopus does not rely on a single leader. Instead, it uses a virtual tree overlay for message dissemination to limit network traffic across oversubscribed links. It leverages hardware redundancies, both within a rack and inside the network fabric, to reduce both protocol complexity and communication overhead. These design decisions enable Canopus to support large deployments without significant performance degradation.

The existing Canopus protocol is resilient in the face of node and communication failures, but its focus is primarily on performance, so it does not respond well to other types of failures. For example, the failure of a single rack of servers causes all live nodes to stall. The protocol is also open to attack by Byzantine nodes, which can cause different live nodes to conclude the protocol with different transaction orders.

In this paper, we describe RCanopus (‘resilient Canopus’), which extends Canopus in two ways. First, it adds liveness, that is, it allows live nodes to make progress, when possible, despite many types of failures; this requires RCanopus to accurately detect and recover from failures despite using unreliable failure detectors, and to tolerate Byzantine attacks. Second, RCanopus guarantees safety, that is, agreement amongst live nodes on transaction order, in the presence of Byzantine attacks and network partitioning.

Importantly, the algorithms presented in this document are compatible with pipelining. That is, they do not assume that only one consensus cycle (a term defined in more detail in Section 2) is in progress at a given moment in time. This allows RCanopus to maintain high performance in terms of transaction throughput and median transaction delay.

Our design for RCanopus relies on several key ideas:

  • RCanopus uses a hierarchy of Byzantine-fault-tolerant (BFT) committees with signed decisions from one committee further approved by higher-level committees

  • To deal with high-latency inter-datacenter links, consensus cycles are pipelined, which allows multiple consensus cycles to be executed in parallel, increasing throughput

  • Nodes are geographically grouped. The first level of nodes is grouped by server rack, allowing fast intra-group communication. The second level of nodes, which forms a BFT committee called a Byzantine Group (BG), is grouped by geographical region. Finally, BGs are hierarchically clustered and execute the RConsensus protocol in parallel.

The remainder of the paper is structured as follows. In Section 2, we outline the Canopus protocol, deferring details to Reference [12]. Section 3 states our assumptions. We lay out some building blocks for RCanopus in Section 4 and Section 7 discusses how we categorize different types of faults. Subsequent sections address each category of fault and Section 13 summarizes the mitigation mechanisms.

2 Background

Canopus is a distributed coordination protocol implemented using a globally distributed set of servers called nodes, each of which periodically collects a set of transactions, called a transaction block from its clients. The nodes execute a consensus protocol to decide a global order on their transaction blocks.

The Canopus protocol divides execution into a sequence of consensus cycles. At the end of every consensus cycle, all nodes achieve agreement or finality, that is, a total order on their inputs. (Some optimizations discussed later in this document postpone the commitment of transactions until a later cycle.) Each cycle is labeled with a monotonically increasing cycle ID. During a consensus cycle, the protocol determines the order of pending write requests within transaction blocks received by nodes from clients before the start of the cycle and performs the write requests in the same order at every node in the group. Read requests are responded to by the node receiving them. Canopus provides linearizable consistency while allowing any node to service read requests and without needing to disseminate read requests.

Canopus determines the ordering of the write requests by having each node, for each cycle, independently choose a large random number, then ordering transaction blocks based on these random numbers. Ties are expected to be rare and are broken deterministically using the unique IDs of the nodes. Requests received by the same node are ordered by their order of arrival, which maintains request order for a client that sends multiple outstanding requests in one session with a server during the same consensus cycle.
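To make this ordering rule concrete, here is a minimal Python sketch (our own illustration; the function names are not taken from the Canopus implementation). Each node independently draws a large random number for the cycle; blocks are then sorted by that number, ties are broken deterministically by node ID, and transactions within a block keep their arrival order.

import random

def choose_block_tag():
    # Each node independently draws a large random number for the current cycle.
    return random.getrandbits(128)

def order_blocks(tagged_blocks):
    # tagged_blocks: list of (random_tag, node_id, transactions) triples, where
    # each node chose its own random_tag and `transactions` preserves per-node
    # arrival order. Sort by tag; break (rare) ties by the unique node ID.
    ordered = sorted(tagged_blocks, key=lambda b: (b[0], b[1]))
    # The global order is the concatenation of the blocks in this order.
    return [tx for _, _, txs in ordered for tx in txs]

# Example: two nodes, each with its own tag and arrival-ordered transactions.
print(order_blocks([(7, "n2", ["t4", "t5"]), (3, "n1", ["t1", "t2"])]))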

Figure 1: An example of a leaf-only tree (LOT).

During each consensus cycle, each Canopus node disseminates the write requests it received during the previous cycle to every other node in a series of rounds. Instead of directly broadcasting requests to every node in the group, which can create significant strain on oversubscribed links in a datacenter network or wide-area links in a multi-datacenter deployment, message dissemination follows paths on a topology-aware virtual tree overlay. Specifically, Canopus uses a Leaf-Only Tree (LOT) overlay [2], which allows nodes arranged in a logical tree to compute an arbitrary global aggregation function. Each round computes the state (ordered set of transactions) at successively higher tiers of the overlay tree.

A LOT has three distinguishing properties:

i. Physical and virtual nodes: Only leaf-nodes exist physically in the form of a dedicated process running in a physical machine. Internal nodes are virtual. When necessary to distinguish between the two, we denote a leaf node as a pnode and an internal node as a vnode; the term ‘node,’ however, always refers to a pnode.

ii. Node emulation: Each pnode emulates all of its ancestor vnodes; that is, each pnode is aware of and maintains the state corresponding to all of its ancestor vnodes. Thus, the current state of a vnode can be obtained by querying any one of its descendants, making vnodes inherently fault tolerant and making vnode state access parallelizable.

iii. Super-leaves: LOT is topology aware. Moreover, all the pnodes located within the same rack are grouped into a single super-leaf (SL) for two reasons. First, this reduces the number of messages exchanged between any two super-leaves; instead of all-to-all communication between all the pnodes in the respective super-leaves, only a subset of the SL nodes, called its representatives, communicate with another SL on behalf of their peers. Second, because all the pnodes in a SL replicate their common parents’ state, a majority of the SL members need to simultaneously fail to cause the SL to fail.

Figure 1 shows an example of a leaf-only tree consisting of 27 pnodes. There are three pnodes per SL. Each pnode emulates all of its ancestor vnodes; for example, a pnode in the leftmost super-leaf emulates vnodes 1.1.1, 1.1, and 1. The root vnode 1 is emulated by all of the pnodes in the tree.
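The emulation property is easy to see in code. The following Python sketch (our own illustration; the dotted labels follow Figure 1, the helper and class names are ours) derives the ancestor vnodes that a pnode must emulate from its position in the tree and keeps one state entry per emulated vnode.

def ancestor_vnodes(pnode_label):
    # A pnode labelled '1.1.1.2' (a pnode in super-leaf 1.1.1) emulates the
    # vnodes '1.1.1', '1.1', and '1'.
    parts = pnode_label.split(".")
    return [".".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

class PNode:
    # A physical node keeps a copy of the state of every ancestor vnode, so the
    # state of any vnode can be read from any one of its descendant pnodes.
    def __init__(self, label):
        self.label = label
        self.vnode_state = {v: [] for v in ancestor_vnodes(label)}

print(ancestor_vnodes("1.1.1.2"))   # ['1.1.1', '1.1', '1']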

SLs are grouped together in a topology-aware fashion, with SLs in the same geographical area forming a higher-level vnode, and so on.

3 Assumptions

We now state the assumptions made in the design of RCanopus.

  • A1. Crash-stop failures: Nodes fail by crashing: there are no ‘transient’ failures. A failed node rejoins the system only through a node-join protocol (see Section 5.6). In contrast, network failures can be transient, so that a network partition may recover due to extraneous recovery actions.

  • A2: Reliable communication channel: We assume the existence of communication channels between all pairs of nodes, and between clients and nodes, that are not intermediated by other nodes, but only by tamper-proof routers and links. Hence, we assume that there are no message failures, such as losses, corruption, duplication, or reordering. In practice, this is easily achieved using a combination of TCP and dynamic network routing.

  • A3. Synchrony within an SL: We assume that nodes within an SL, which, by definition, are connected by the same switch and in the same rack, run in a synchronous environment in which the communication and processing delays are bounded by a known value δ. Formally,

    1. in the absence of failure, the maximum communication delays between the nodes are known and bounded

    2. the maximum processing time required to execute each step of a deterministic algorithm is known and bounded

    and these bounds hold despite message, node, and link failures. Thus, we assume the existence of an atomic broadcast primitive SLBroadcast(value) that satisfies the following properties [7]:

    1. Validity: If a correct process broadcasts a message m, then it eventually delivers m.

    2. Agreement: If a correct process delivers a message m, then all correct processes eventually deliver m.

    3. Integrity: For any message m, every correct process delivers m at most once, and only if m was previously broadcast by its sender.

    4. Total order: If correct processes p and q both deliver messages m and m′, then p delivers m before m′ if and only if q delivers m before m′.

    Such a primitive can be built using an approach such as AllConcur [11]. Moreover, using an approach such as Zoolander [13], we assume that if a call is made by a node in some SL to SLBroadcast(value) at time t, then it is received by all other live nodes in the same SL by time t + δ.

  • A4. Asynchronous inter-SL communication: To make progress despite the FLP impossibility result[6], which states that consensus cannot be guaranteed in asynchronous environments, our design gives up liveness in executions where the environment is unstable (e.g., messages are not delivered in a timely manner or processing of requests at servers is unusually slow). A system that does not guarantee liveness in all executions is exempt from the FLP result, and is able to maintain consistency even during network partitions.

    Some parts of the system, such as the top-level group membership, do require weak synchrony assumptions. However, these components operate on long time scales and can therefore be configured with conservatively long timeouts for failure detection without compromising performance. Practically speaking, this means that in the absence of a network partition, we assume that there is an upper bound Δ on the sum of inter-SL message delivery and server response times, such that it is possible for a node n1 to send a message to another node n2 and set a timeout Δ such that, if the timeout expires, then n1 can assume with high confidence that n2 is dead (though there is some possibility that n2 is actually only slow, not dead, or alive but on the other side of a network partition; in this case, n1 coordinates with other nodes to establish consensus on n2’s status). The timeout value Δ should be chosen long enough to allow transient network partitions to recover; a partition lasting longer than Δ is equivalent to a permanent partition. (During a network partition, nodes on one side of the partition are unable to communicate with nodes on the other side. This does not make the network asynchronous: in an asynchronous network, any network communication may be subject to failure.) Moreover, we do not require clocks at different nodes to be synchronized or to run at the same rate, since the protocols are self-synchronizing.

  • A5: PKI: We assume that every client and every Canopus node has its own ID and own private/public key pair, so that every signed communication is non-repudiable and non-falsifiable. The public key of a node can be viewed as its unique node identifier.

  • A6: Byzantine failure of a super-leaf: We assume that if even one node in an SL suffers from a Byzantine failure, the entire SL is Byzantine, a conservative assumption. This is based on the pragmatic observation that all the nodes in an SL are likely to be homogeneous and in the same rack. Hence, if one of them has been maliciously taken over, it is very likely the rest of the nodes on the rack have also been similarly compromised.

  • A7: Byzantine groups: We assume that we can partition S, the set of SLs, into a number of Byzantine groups (BGs) B_1, B_2, ..., B_m, where the B_i are mutually non-overlapping and their union is S, such that (a) sibling SLs in each Byzantine group are geographically ‘close’ and (b) if there can be f_i Byzantine failures in BG B_i, then |B_i| ≥ 3f_i + 1. In other words, each BG is resilient to up to f_i Byzantine failures. Thus, the only failure mode for a BG is a network partition that prevents it from achieving quorum. From outside the BG, this appears as an atomic failure of the entire BG.

    Note that to prevent Byzantine faults, all communications that are sent on behalf of the BG must be signed with a quorum certificate, which proves that 2f_i + 1 members of the BG (including at least f_i + 1 correct nodes) agree on the communication.

    In practice, we expect each member of a Byzantine group to be hosted by a different cloud service provider. Thus, a Byzantine failure models a security breach in a cloud service provider. We expect that in most practical cases there will be only one such breach ongoing, so that f_i = 1.

  • A8: Number of BG leaders: Each BG elects a BG leader that participates in a BFT consensus protocol. We assume that if the number of BG leaders who are subject to Byzantine failure is at most g, then the number of BGs exceeds 3g.

4 Building blocks

This section describes the major building blocks in our design.

4.1 Intra-SL building blocks

Based on the assumption of a synchronous environment within a SL, we use standard approaches for crash-fault tolerant group membership, leader election, and reliable broadcast protocols. As an illustrative example, we discuss how they can be implemented using the well-known ZooKeeper [8] system. However, this can be replaced by any equivalent system, such as Raft [10] or AllConcur [11].

In a typical ZooKeeper installation, a subset of nodes within the SL (called ZooKeeper servers) provide a ZooKeeper service that is used by the other nodes. If one or more of these nodes fails, then the ZooKeeper service needs to replace the failed node with a live node, if such nodes are available. If the number of live nodes in an SL drops below the minimum required for a ZooKeeper quorum, then the ZooKeeper service stalls and the SL is considered to have failed.

The proposed ZooKeeper znode hierarchy in our design is shown in Figure 2. Recall that we assume that nodes fail by crashing (Assumption A1) and explicitly rejoin, which means that there are no transient node failures. So, once a node is marked dead, it cannot be marked alive until it explicitly rejoins.

Figure 2: Local ZooKeeper hierarchy. There is one such hierarchy per SL. It is used to establish consensus on membership, leader election, and reliable broadcast of client transactions.

4.1.1 Group membership

A group membership algorithm allows a set of nodes participating in a distributed protocol to learn of each other. Membership in the group changes atomically when a node joins or leaves. This means that when a node leaves, the remaining nodes agree that the node has left, and when a node joins, all other nodes agree that the new node is a member.

We use ZooKeeper to maintain intra-SL membership. Participants join the group by creating ephemeral znodes (ZooKeeper nodes), and leave the group when such znodes are deleted, which happens automatically when a node dies and its ZooKeeper session is terminated. Group membership updates are disseminated asynchronously using ZooKeeper notifications, and may be received by different nodes at different times but in the same order.

Specifically, live nodes in an SL establish their membership and liveness by adding sequential ephemeral znodes under path /local/members/ (see Figure 2). At any point, a node calls getChildren on the path /local/members/ to learn the current set of live members in the SL (though this may subsequently change).
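As an illustrative sketch of this recipe using the kazoo ZooKeeper client (the connection string and the 'member-' name prefix are our assumptions, not details from the paper), a node joins the SL by creating an ephemeral sequential znode under /local/members/ and watches that path for membership changes:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed local ZooKeeper ensemble
zk.start()

# Join the SL: an ephemeral, sequential znode that disappears automatically
# if this node's ZooKeeper session dies.
my_znode = zk.create("/local/members/member-", b"",
                     ephemeral=True, sequence=True, makepath=True)

# Learn the current set of live members and re-read it on every change;
# different nodes may see updates at different times, but in the same order.
@zk.ChildrenWatch("/local/members")
def on_membership_change(children):
    print("live SL members:", sorted(children))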

4.1.2 Leader election

This involves agreeing on the identity of a distinguished node from a set of eligible nodes. In practice, multiple rounds of leader election may need to be executed before a leader is chosen, because a chosen leader may fail and need to be replaced. In our proposed solution, all nodes in the SL create sequential ephemeral znodes as children of /local/members/. A node becomes the leader, called a monitor, when its znode becomes the lowest-numbered child under the designated root. Uniqueness of znode sequence numbers ensures that there is at most one leader for each round. If a node fails and its znode is deleted before the corresponding leader election round becomes effective (i.e., before all lower-numbered leaders are presumed dead), then no leader is elected for that round. Leader updates are disseminated using asynchronous notifications.
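Continuing the same kazoo sketch (again with assumed paths and name prefixes), a node decides whether it is the monitor by checking whether its znode carries the lowest sequence number under /local/members; the check is re-run whenever membership changes, for example when the current monitor's ephemeral znode disappears:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed ensemble address
zk.start()

me = zk.create("/local/members/member-", b"",
               ephemeral=True, sequence=True, makepath=True)
my_name = me.rsplit("/", 1)[1]

def am_i_monitor():
    # ZooKeeper zero-pads sequence numbers, so a lexicographic sort of the
    # children is also a numeric sort; the lowest-numbered znode is the leader.
    children = sorted(zk.get_children("/local/members"))
    return bool(children) and children[0] == my_name

@zk.ChildrenWatch("/local/members")
def on_change(children):
    if am_i_monitor():
        print("this node is now the SL monitor")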

4.1.3 Atomic broadcast

We assume that nodes in an SL form a group with atomic broadcast, meaning that all nodes can broadcast messages to each other and receive messages in the same (arbitrarily decided) order. An atomic broadcast can be implemented using ZooKeeper or a purpose-built high-performance atomic broadcast primitive such as AllConcur [11].

When using ZooKeeper, messages are disseminated by node n in cycle c by adding state to a persistent sequential znode with the path /local/state/c/this/n. Note that by using a persistent znode, the message persists even if its transmitter subsequently fails. Messages are received by fetching this znode’s state. Asynchronous notifications are used to disseminate send events, and recipients are responsible for remembering which messages they have processed already. Note that recipients must store this state locally, and cannot store it in ZooKeeper, since a node can fail after processing a message and before recording that fact in ZooKeeper.

At the end of a cycle, message data in ZooKeeper can be garbage collected, to prevent unbounded growth in the size of the ZooKeeper state. This garbage collection keeps the local ZooKeeper state relatively small.
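A sketch of this broadcast-over-ZooKeeper idea, again using kazoo, is shown below. The znode layout under /local/state is our reading of Figure 2 and the payload encoding is an assumption; the two properties that matter are that messages are persistent (they survive the sender's crash) and sequentially numbered (all receivers see them in the same order).

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed ensemble address
zk.start()

def sl_broadcast(cycle, payload: bytes):
    # Publish a message for this cycle as a persistent, sequential znode.
    # Persistence means the message survives even if the sender crashes
    # immediately afterwards; the sequence number fixes the delivery order.
    return zk.create(f"/local/state/{cycle}/msg-", payload,
                     sequence=True, makepath=True)

def sl_deliver(cycle, already_processed: set):
    # Fetch this cycle's messages in sequence order, skipping ones already
    # processed. The `already_processed` set must be kept locally, not in
    # ZooKeeper, because a node may crash after processing a message but
    # before recording that fact.
    delivered = []
    for name in sorted(zk.get_children(f"/local/state/{cycle}")):
        if name in already_processed:
            continue
        data, _stat = zk.get(f"/local/state/{cycle}/{name}")
        delivered.append(data)
        already_processed.add(name)
    return delivered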

4.2 Byzantine groups

Geographically close SLs are grouped together to form a Byzantine Group (BG). We expect these SLs to be hosted at a variety of hosting providers, so that compromise of a single SL in a BG would not result in the compromise of its peer SLs in the same BG.

We make a BG resilient to Byzantine faults in its member SLs by relying on existing Byzantine consensus protocols such as PBFT [5], BFT-SMART [4], or HotStuff [1]. Such a scheme is immune to up to f simultaneous node failures as long as the number of nodes in the BG is at least 3f + 1. Importantly, each of these schemes delivers a quorum certificate that guarantees that the consensus value is valid. The certificate contains signatures from a quorum of nodes that can be verified by the recipient and guarantees that the information contained in any message is valid. Also, to prevent replay attacks, the certificate contains the current cycle number.
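The sketch below illustrates what checking such a certificate could involve; the field names and the verify_sig callback are our illustrative assumptions rather than the wire format used by PBFT, BFT-SMART, or HotStuff.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class QuorumCertificate:
    cycle: int                    # included to prevent replay of old certificates
    value_digest: bytes           # digest of the agreed-upon value
    signatures: Dict[str, bytes]  # signer id -> signature over (cycle, digest)

def verify_quorum_certificate(cert: QuorumCertificate, expected_cycle: int,
                              quorum_size: int, known_signers: set,
                              verify_sig: Callable[[str, bytes, bytes], bool]):
    # Accept only if the certificate is for the expected cycle and carries at
    # least `quorum_size` valid signatures from distinct, known signers.
    # `verify_sig` stands in for the deployment's signature scheme.
    if cert.cycle != expected_cycle:
        return False
    message = cert.cycle.to_bytes(8, "big") + cert.value_digest
    valid = sum(1 for signer, sig in cert.signatures.items()
                if signer in known_signers and verify_sig(signer, message, sig))
    return valid >= quorum_size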

BFT consensus is also used to maintain SL membership status in the BG: it is the set of SLs reporting transactions in the latest completed BG consensus. If there is a network partition, SLs in the minority partition cannot submit their transactions, so are automatically excluded from membership.

4.3 Global coordination

In addition to consensus on membership within an SL and in a BG, we also need to achieve BFT consensus on a system-wide (‘global’) scale for several reasons, including: (a) obtaining the set of emulators of a vnode at the BG level or higher; (b) dealing with apparent failures of BGs/vnodes arising from a network partition; (c) learning the quorum size in each BG (the number of signatures in each Byzantine Group’s quorum certificate that is necessary to validate it); and (d) creating global quorum certificates to certify the set of BGs participating in each cycle (these are discussed in more detail in Section 5.5.3).

To do so, we deploy a global BFT consensus service provided by a set of BG leaders, each elected from amongst the monitors of every live BG. Note that this distributed service also needs to be BFT since BG leaders may suffer from Byzantine failures. Thus, this service is likely to have high latency in performing updates. Nevertheless, it can serve read requests relatively quickly since each BG leader participating in this service can cache membership state along with its quorum certificate.

4.3.1 Convergence Module-based global coordination

An alternative approach is for global coordination to be done “in-band” with respect to the consensus protocol, using a specialized transaction type that must be endorsed by an authorized system administrator. This approach associates group membership data with each cycle in a precise way: a group membership change proposed in cycle c becomes effective in cycle c + d, where d is the pre-defined and constant depth of the processing pipeline.

Convergence of each consensus cycle despite network partitions and BG failures is achieved using a service called the Convergence Module (CM); details can be found in Appendix B. The CM internally keeps track of BGs that are suspected of being faulty, but operates orthogonally to the mechanism responsible for recording group membership. In particular, a BG can participate in the consensus protocol and yet its input may be excluded from a particular consensus cycle by the CM because the BG was deemed faulty. Repeated exclusion of a BG from consecutive consensus cycles may nevertheless prompt administrators to apply manual group membership changes.

Roughly speaking, the protocol functions as follows: a BG that stalls during a particular consensus cycle because it is unable to retrieve the inputs of one or more other BGs reports the situation to the CM, which then determines the output of the cycle under consideration as the union of the inputs of a carefully selected subset of BGs. For example, the subset can be determined in a manner that ensures sufficient replication, meaning that a BG’s input is included in the output of a given cycle only if it has been replicated at k or more BGs, for some administrator-defined threshold k. To reduce overhead, optimizations are defined to bypass the CM entirely in the absence of failures, and also to minimize interaction with the CM during network partitions that last many consensus cycles. These optimizations deal with complex scenarios involving Byzantine failures and concurrency.

5 RCanopus description

We now describe the RCanopus system. We begin with actions taken by a client, which sends transactions to RCanopus nodes. We then discuss operation of a node during each RCanopus cycle. Some nodes play the role of a representative or monitor, and some monitors are also BG leaders. We discuss the actions taken by each such role. Note that although we discuss actions in each cycle independently, RCanopus is pipelined, that is, multiple cycles are in progress simultaneously, as shown in Figure 3.

Figure 3: RCanopus pipeline. b is chosen to be longer than the intra-SL consensus time. The long latency for inter-BG exchanges results in multiple simultaneous incomplete cycles outstanding at any point in time.

5.1 Client actions

Clients send their transactions, each accompanied by a client-generated transaction id and a client-generated nonce, both signed with the client’s private key, to RCanopus nodes. To prevent Byzantine nodes from simply ignoring their transactions, clients send their transactions to f+1 nodes in different SLs in the same BG. When they are ready to commit results from a cycle, nodes respond to clients with proof that their transactions were incorporated into the global transaction order. The proof consists of two items:

  1. A Merkle tree branch connecting the client’s transaction to the root for a transaction block (TB).

  2. A quorum certificate from a BG certifying that TB was committed at cycle c. Specifically, the certificate should include TB’s Merkle tree root.

The proof does not need to include a global quorum certificate showing that the certifying BG was a member of the system during cycle c. This is because a BG only generates and returns a quorum certificate if it knows that it is live (this is discussed in more detail in Section 5.5.2) and has f or fewer malicious nodes. Our system does not provide safety if there are more than f malicious nodes in a single BG.
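A client-side check of this proof might look like the following sketch; the hash function, the left/right encoding of the Merkle branch, and the verify_cert callback are our assumptions.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_merkle_branch(tx: bytes, branch, root: bytes) -> bool:
    # `branch` lists (sibling_hash, sibling_is_left) pairs from the leaf up to
    # the root of the transaction block's Merkle tree.
    node = h(tx)
    for sibling, sibling_is_left in branch:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

def verify_commit_proof(tx: bytes, branch, tb_root: bytes, cert, verify_cert):
    # Item 1: the transaction is contained in the TB whose Merkle root is tb_root.
    # Item 2: a BG quorum certificate binds tb_root to the claimed cycle;
    # `verify_cert` stands in for certificate verification (cf. Section 4.2).
    return verify_merkle_branch(tx, branch, tb_root) and verify_cert(cert, tb_root)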

A Byzantine client could try to confuse the system by sending inconsistent transactions to different nodes. This is, however, easy to detect because a valid client must send the same transaction to all nodes with the same client-generated transaction id. Any deviation from this immediately identifies the client as being malicious. Therefore, we do not discuss this failure case any further.

5.2 Node actions

We now discuss the actions taken by all RCanopus nodes, independent of their role.

5.2.1 Collecting client transactions

In the first round of each consensus cycle, all nodes within an SL share their state (i.e., set of client transactions) with each other. Let the batch time b be a constant such that b > δ, where δ is the maximum round-trip communication delay in the synchronous SL environment. Then, the first round lasts b + δ seconds (see Figure 3).

During the first b seconds of the round, a node collects transactions from its clients. It atomically broadcasts these transactions to its peers at the end of the interval. By Assumption A3, this state stabilizes within δ more seconds. Hence, when the round ends, all live nodes agree on the ordered list of transactions received by nodes in the SL during the first half of the round. (Note that some of these transactions may originate from servers that failed after writing their state to ZooKeeper as persistent znodes; this is silently ignored since it does not affect safety.) Because of pipelining, consensus cycles start every b + δ seconds, as shown in Figure 3.

5.2.2 Transaction numbering

A node assigns a transaction number to each transaction block of client transactions that it receives. This number identifies the position of the block in the global order. It would be possible for a Byzantine node to deliberately choose this number in a way that increases the chance of its TB ending up near the top or the bottom of the global order. To prevent this attack, the transaction number of a TB is chosen as the Merkle hash of its transactions. Moreover, when TBs are further aggregated, the newly-created parent node orders its child subtrees by their Merkle roots, forming a unique and non-manipulable total order.
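The following sketch shows one way such a Merkle-based numbering could be computed; the hash function and tree construction are illustrative assumptions, since any deterministic construction agreed on by all nodes works.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(transactions):
    # Merkle root of a list of byte strings; an odd node is carried up unchanged.
    level = [h(tx) for tx in transactions] or [h(b"")]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

def aggregate(child_blocks):
    # A parent vnode orders its children's blocks by their Merkle roots, which
    # no single node can bias, then derives its own root from the ordered roots.
    ordered = sorted(child_blocks, key=merkle_root)
    return ordered, merkle_root([merkle_root(b) for b in ordered])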

5.2.3 Delayed commit

Nodes commit the state computed at the end of cycle c only at the end of cycle c+1, i.e., delayed by one cycle (b + δ seconds). This is because, by that time, a node can be sure that every other live node in the system will make the same decision about whether or not to commit a given BG’s state, guaranteeing safety. This is discussed in more detail in Section 12.

5.3 Representative and emulator actions

5.3.1 Representative election

A certain number of nodes in an SL, for example, those corresponding to znodes with the first few sequence numbers in /local/members/, are elected as SL representatives to fetch remote state. In the third and subsequent RCanopus rounds of each cycle, representatives from each SL fetch vnode state from the vnode’s emulators (i.e., any pnode in the vnode’s subtree).

5.3.2 State request and response

At the start of each cycle, representatives learn the list of vnodes (at the BG level and higher), the IP addresses of their emulators, and BG quorum sizes from the global consensus service. (Alternatively, these calls can be made by the monitor and then disseminated to the representatives.) Subsequently, representatives send state requests to these emulators to learn the state of one of their ancestor vnodes. To avoid a denial-of-service attack by Byzantine emulators, each representative contacts at least f+1 emulators (from different SLs) in each Byzantine group. Moreover, a representative starts a retry timer (on the order of Δ; see Assumption A4) before sending the state request message. Finally, a representative also registers a watch on changes to data associated with the local znode /local/state/c/v, which is where the fetched state would be stored by other SL representatives.

On receiving a request, the emulator sends either a state response with the requested state, along with a quorum certificate, or a null response, indicating that it does not yet have the desired state. If the quorum certificate has at least as many signatures as each underlying BG’s quorum size, then the representative accepts the response and stores it in the local znode /local/state/c/v (unless this state has already been stored by some other representative).

There are two possible failure cases:

  • The emulator may send a null response. If so, the representative resends its request to the same emulator after waiting for a retry interval.

  • The emulator may have failed, causing expiry of the retry timer. If so, the representative retries with a different emulator of the vnode. Of course, this emulator may also fail. Hence, the representative tries to contact each of the vnode’s emulators in turn. If all emulators for the vnode are inaccessible, the representative adds its own node ID as a persistent child znode to /local/state/stalled/v.

In both cases, it is possible that some other representative may have already obtained the vnode’s state. If so, and the desired state can be found in /local/state/c/v, the representative copies this state and takes no further action.
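The overall fetch logic on the representative side can be summarized by the following simplified sketch; the callables stand in for the RPCs and local-ZooKeeper operations described above, and the control flow (retry the same emulator on a null response, move to the next emulator on a timeout, record a stall if all fail) follows the two failure cases just listed.

import time

def fetch_vnode_state(vnode, emulators, retry_interval,
                      request_state, read_local, store_local, mark_stalled):
    # `emulators` lists emulators of `vnode` drawn from at least f+1 distinct
    # SLs of the owning BG, so at least one of them is in a correct SL.
    for emulator in emulators:
        while True:
            cached = read_local(vnode)        # another representative may have
            if cached is not None:            # already fetched and stored it
                return cached
            try:
                response = request_state(emulator, vnode, timeout=retry_interval)
            except TimeoutError:
                break                         # emulator presumed dead: try the next one
            if response is None:              # null response: state not ready yet
                time.sleep(retry_interval)
                continue                      # retry the same emulator
            if response.has_valid_quorum_certificate():
                store_local(vnode, response)
                return response
            break                             # bad certificate: try another emulator
    mark_stalled(vnode)                       # all emulators inaccessible
    return None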

5.3.3 Recovery from representative failure

All nodes in an SL keep track of the status of representatives. On the failure of a representative, leader election is used to choose a new representative, which takes on the responsibilities of the failed representative. This is particularly simple with ZooKeeper.

5.3.4 Emulator behaviour on loss of consensus within the SL

When ZooKeeper is used for intra-SL consensus, the service becomes unavailable when it loses quorum. If an emulator detects that the ZooKeeper service has lost quorum, for example, by receiving a ‘loss of quorum’ message from one of the live ZooKeeper servers, it stalls, that is, stops responding to state requests. This is to prevent a ‘split brain’ in case of an SL partition (see Section 9.2). When service is restored, the nodes need to rejoin using a rejoin process, discussed in Section 5.6.

5.4 Monitor actions

The monitor is responsible for several distinct sets of actions, as discussed in this section.

5.4.1 Byzantine consensus on transaction order

In the second round of each cycle, monitors in the SLs of a BG come to a consensus about transaction order. Since entire SLs can be Byzantine, it is necessary to use a BFT consensus protocol. At the end of the consensus, each monitor obtains a total ordering of client transactions, and a quorum certificate that certifies this ordering with 2f+1 signatures. Monitors also learn which other monitors have failed, whether due to crash failure of their SL, Byzantine failure of their SL, or a BG partition.

Recall that clients send their transactions to multiple nodes in multiple SLs, which causes duplicate transactions to enter the system. Duplicate transactions are detected using the per-transaction client-generated nonce, and must be removed during the computation of the BG quorum certificate. It is possible for client-initiated transactions to be received by different nodes at different times, and hence to be part of different RCanopus cycles. To deal with this situation, during the BG consensus protocol, monitors buffer a hash of recently-submitted transactions, removing duplicate transactions as necessary.
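Deduplication during BG consensus can be sketched as follows; the field names and the shape of the recently-seen buffer are our assumptions.

def deduplicate(transaction_lists, recently_seen: set):
    # Merge per-SL transaction lists, dropping duplicates. Each transaction is
    # assumed to carry a client id and a client-generated nonce, so
    # (client_id, nonce) identifies a submission uniquely; copies sent to
    # multiple SLs, or resubmitted in a later cycle, collapse to one entry.
    merged = []
    for txs in transaction_lists:
        for tx in txs:
            key = (tx["client_id"], tx["nonce"])
            if key in recently_seen:
                continue
            recently_seen.add(key)
            merged.append(tx)
    return merged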

5.4.2 Identifying and removing failed nodes from the list of BG emulators

Monitors are responsible for updating the global consensus service to remove failed nodes from their SL (detected using the intra-SL consensus protocol) from the list of emulators for the enclosing BG. This requires a quorum certificate from the BG to prevent Byzantine monitors from causing trouble. Hence, the monitor first contacts its peer monitors in its BG and proposes to them that a specific node in its SL has failed. Each monitor sets a timer and independently tries to contact the potentially failed node. If a quorum of monitors agrees that the node has indeed failed, then they generate a quorum certificate, which is then submitted to the BG leader to update the global consensus service, removing this node from the list of the BG’s emulators.

5.4.3 Initiating a response to system partition

Monitors are responsible for initiating the response to a network partition. Recall that at the start of each consensus cycle, a representative from each SL obtains a list of emulators for each vnode at the BG level and above from their BG leader, who is their local representative for the global consensus service. If there is a system partition, BG leaders in the minority partition(s) will (eventually) fail to achieve quorum and SL representatives will receive a ‘loss of quorum’ message. (Even if they do obtain a stale list of emulators, they will stall when trying to contact these emulators, preserving safety.) This results in the stalling of all nodes in all BGs in the minority partition, as desired. These nodes should rejoin the system only by using the node re-join process, where they are asked to rejoin by their BG leader (accompanied by a global quorum certificate).

We now consider the situation where, during a network partition, representatives may obtain a stale list of emulators corresponding to inaccessible BGs, stalling representatives who are waiting for a response from them. These stalled representatives add their ID to a list of stalled representatives in /local/stalled/I. When the monitor in the same SL as the representatives detects that all the representatives in its SL are stalled waiting for state from a remote BG I, it initiates a Byzantine consensus on the status of this BG. On invocation, monitors in the BG try to achieve consensus on the state of the stalled BG. Specifically, each monitor checks its local /local/stalled/I path to decide whether or not its representatives think that BG I has become uncontactable. If so (because all of its representatives were unable to contact BG I), then it agrees to the proposal that the BG be marked as failed.

The outcome of the consensus is either agreement that the BG has failed, or a quorum of nodes respond with the state of the BG thought to have failed, which is then shared with the monitors in the BG. If all the monitors in the BG agree that BG I is indeed inaccessible, then the BG leader initiates consensus on this at the global level, as discussed in Section 5.5.

5.4.4 Recovery from monitor failure

Nodes in each SL keep track of the status of the monitor. On monitor failure, leader election is used to choose a new monitor, which takes on the responsibilities of the failed monitor. In particular, the new monitor replaces the old monitor as a member of the BG.

5.4.5 Maintaining a quorum of ZooKeeper servers

To increase resilience, an SL’s monitor watches all of its ZooKeeper followers. If a follower fails, the monitor requests one of the non-follower live nodes to become a member of the ZooKeeper service.

5.5 BG leader actions

In each BG, a leader election amongst the monitors is used to elect a BG leader. The set of BG leaders is collectively responsible for providing the global consensus service discussed in Section 4.3. The alternative CM service is discussed in Appendix B.

5.5.1 Responding to membership requests

BG leaders respond to BG membership requests from representatives with a list of vnodes at the BG level and higher in the system and the IP addresses of their emulators. For each BG, the BG leader also responds with the size of its quorum, so that it is possible to verify that a response from an emulator has the requisite number of signatures. The response itself is signed by a quorum of BG leaders.

5.5.2 Dealing with and recovering from a system partition

When a BG comes to BFT consensus that some set of other BGs are unreachable, its BG leader contacts its BG leader peers to initiate a BFT consensus on membership. At the end of this consensus, BG leaders in the super-majority know of the set of BGs unreachable from their partition, and they inform the monitors in their own BGs (along with a quorum certificate) to not stall waiting for inputs from these inaccessible BGs. This allows recovery of liveness. BG leaders in the minority partition(s), however, stall, and this causes their BGs to stall, as desired.

This process is expensive, but is only resorted to rarely: only in case of a BG failure or system partition. Moreover, it works correctly even if a BG’s failure detectors are unreliable: what is important is not that a BG’s failure be perfectly known, but that the rest of the BGs agree to ignore it.

To preserve safety, once emulators in some BG have been informed by their BG leader that some other BG has failed, they should not respond to requests from representatives in the failed BG until their BG leader informs them that the formerly failed BG has re-joined using the periodic global BFT consensus, described next. Moreover, nodes in the failed BG do not respond to clients, also to preserve safety.

5.5.3 Periodic resynchronization

BG leaders periodically establish consensus on the global set of BGs and their emulators using a BFT consensus protocol. Specifically, during each such global consensus, the BG leaders agree on (a) the set of BGs in the system, (b) the set of emulators for each BG, and (c) the quorum size for each BG. BGs and SLs are only allowed to become part of the system at this synchronization point. Thus, in the period between synchronizations, which we call an epoch, BG membership is static, other than the removal of BGs following the detection of, and consensus on, a BG failure or system partition event.

One outcome of the synchronization is a global quorum certificate per epoch that certifies the set of participating BGs for that epoch. This must be signed by at least 2g + 1 BG leaders (see Assumption A8). The global quorum certificate, along with a corresponding set of BG-specific quorum certificates, fully describes the set of transactions in a cycle and the BGs that participated in the cycle, and this description is both non-manipulable and immutable. Hence, this certificate can be replayed to newly joined BGs to resynchronize them.

While the synchronization process is expensive, doing it once every, say, 100,000 cycles amortizes the cost. (In practice, it is likely that BG additions will be rare. It may hence be sufficient to require BG additions to be done out-of-band, and manually, with periodic consensus used only for refreshing the list of BG emulators or dealing with system-wide network partitions.)

Note that because the set of emulators for each BG is only updated during the periodic global consensus, the list of emulators could be stale, missing a newly live potential emulator or leading to the global service responding with an emulator that is actually not live. These errors are benign, due to the use of multiple emulators for each vnode. If a live node is missed, other live nodes are available to respond to queries from representatives. Symmetrically, if a node thought to be live is actually dead, or behind a partition, this is identical to the case when the node dies after the membership response. So, this level of staleness does not pose a problem.

5.6 Node/SL/BG Join/Re-join Protocol

A new or newly-live node must first make its presence known to the intra-SL ZooKeeper (or equivalent membership service). It does so by adding a sequential ephemeral znode to /local/members/. If the node happens to be the lowest-numbered node, it becomes both a monitor and a representative; if it is among the lowest-numbered nodes designated as representatives, it serves as a representative but not a monitor. If the number of ZooKeeper servers is fewer than required for a quorum, then this node is also made a ZooKeeper server by the current ZooKeeper leader.

Special care must be taken when a node joins an SL that is thought by its BG peer SLs to be dead due to either a crash failure or a network partition. If a newly-joined node discovers that it has restored quorum to the SL’s ZooKeeper, then it knows that the SL is recovering from a crash failure. Thus, its (perhaps newly-elected) monitor sends announcements to monitors in peer SLs proposing that it now be considered alive. If a quorum of monitors, using a BFT consensus algorithm, agree to this, then the SL’s status is updated by other monitors in the BG. Since every RCanopus cycle requires one round of BFT consensus within each BG, this is synchronized with BFT consensus on transaction order. The new SL is given all the state that it missed by one of its peers, which is the set of its missed transactions along with their quorum certificates.

Similarly, if a BG wishes to rejoin the system after a system-wide network partition, it must wait for the next periodic global consensus. At this time, newly-live BGs recover their state from peer BGs by obtaining missing transactions (with corresponding BG quorum certificates), which they can verify using global quorum certificates for missed epochs.

6 RCanopus safety and liveness

In this section, we prove that RCanopus is always safe and live when the situation permits. Specifically, we prove the following theorem:

Theorem 1 (Safety and liveness).

The RCanopus system provides the following guarantees:

  1. Safety: At the end of every consensus cycle, all live nodes agree on the same order of write transactions from all clients.

  2. Liveness: In the absence of network partitions, every live node completes every consensus cycle, despite Byzantine failures in up to f_i SLs in each BG B_i, up to g Byzantine-failed BG leaders, and BG crash failures. If there is a partition, SLs/BGs in the super-majority partition, if such a super-majority partition exists, remain live and the other SLs/BGs stall.

Proof.

(sketch) It has already been established that Canopus is safe [12]. Thus, this proof is in two parts. First, in Section 7, we enumerate all possible faults in the system. Then, in Sections 8-12, we consider the impact of each fault on safety and liveness. We show that, despite faults, the safety provided by Canopus continues to hold. Moreover, liveness is lost only under the conditions enumerated in the statement of the theorem. ∎

7 Fault categories

This section enumerates the potential faults that may occur in the RCanopus system and their potential impacts. Over and above standard message failures, such as losses, duplication, and corruption, faults arise from three causes: crashes, Byzantine failures, and network partitions, and they can affect a node, an SL, a Byzantine group, or the entire system. (We ignore message failures based on our assumption of reliable communication channels.) Hence, we can enumerate all fault categories as follows:

Entity   | Crash                                      | Partition | Byzantine                             | Masked by
Node     | Several cases (see below)                  | -         | Several cases (see below)             | Peer nodes in SL
SL       | Possible, due to node failures             | Possible  | Several cases (see below)             | Peer SLs in BG
BG (B_i) | If membership ≥ quorum then no, else yes   | Possible  | If ≤ f_i failures then no, else yes   | CM
System   | -                                          | Possible  | -                                     | Majority partition

Table 1: A broad categorization of all possible faults in the system. “-” indicates the fault is either not possible or ignored.

Nodes can fail by crashing. If nodes fail during the first round of a cycle before they share their state with at least one peer, their state will not be accessible by other nodes and these transactions are lost to the system. However, if they are able to communicate their transactions to at least one other node before failing, these transactions may become available to the rest of the system. Node failures are masked by their SL peers, unless there are too many node failures within an SL, in which case the SL itself fails. Nodes can also launch several types of Byzantine attacks. The failure of a node that acts as a representative or a monitor has other consequences, depending on the nature of the role, as discussed in more detail below.

If too many nodes in an SL fail, then the SL ZooKeeper loses quorum, and the entire SL is stalled. This manifests itself as an SL crash failure. It is unlikely, but possible, that an SL partitions. This may also result in loss of quorum in the minority partition or in both partitions. If so, the net result is either a set of node failures (similar to the node case above) or an SL crash failure. In case of partition, to prevent loss of safety due to a split brain, the minority partition must stall. Moreover, to preserve liveness with safety, the majority partition needs to achieve consensus on the set of failed/unreachable nodes. Finally, SLs can manifest Byzantine failures, and these are masked by their Byzantine group.

By definition, a Byzantine group with more than 3f SL members is resilient to Byzantine failures of up to f SLs. If there are more than f failures, to preserve safety, the entire BG should stall, causing a BG crash failure. As with an SL, in case of partition, to prevent loss of safety due to a split brain, the minority partition must stall. Moreover, to preserve liveness with safety, the majority partition needs to achieve consensus on the set of failed/unreachable SLs. We use the same BFT consensus protocol both for consensus on transaction order and on membership.

Finally, the entire system is susceptible to partition failure. These are due to crashes of multiple BGs or a network partition. These failures need to be handled by the surviving BGs, if possible. As with SLs and BGs, in case of partition, to prevent loss of safety due to a split brain, the minority partition must stall. Moreover, to preserve liveness with safety, the majority partition needs to achieve consensus on the set of failed/unreachable BGs/SLs.

Finally, we need to deal with a special case, where a subtree of the system subtended by the root completes a cycle, then fails, before communicating with the rest of the system. It should be obvious that this can cause a violation of safety.

Given this discussion, we enumerate the potential set of faults in the system in Table 2. In the next section, we discuss the impact of each class of failure on system safety and liveness.

Failure class         | Fault                                                              | Impact on safety or liveness
F1: Node crash        | Node failure in first round                                        | Safety: inconsistent views on failed node’s state
F2: Node crash        | Emulator failure                                                   | Liveness: no response to representative
F3: Node crash        | Representative failure                                             | Liveness: SL missing state updates
F4: Node crash        | Monitor failure                                                    | Liveness: SL may lose liveness
F5: Node crash        | ZooKeeper node failure                                             | Liveness: may lose ZooKeeper quorum
F6: Node Byzantine    | Attack on transaction numbering                                    | Safety: transactions from this node likely to be first or last in global order
F7: Node Byzantine    | Emulators non-responsive                                           | Liveness: same as SL crash failure
F8: Node Byzantine    | Broker ignores client                                              | Safety: client DoS
F9: Node Byzantine    | Emulator lying                                                     | Safety: any message from emulator may be a lie
F10: SL crash         | SL failure                                                         | Safety: unable to learn SL’s state; Liveness: peer SLs may stall
F11: SL partition     | SL splits into partitions                                          | Safety: possibility of split brain; Liveness: no SL partition may have quorum, leading to nodes stalling
F12: SL Byzantine     | Messages from SL not trustworthy                                   | Safety: cannot rely on any message from the failed SL
F13: BG crash         | BG fails                                                           | Safety: unable to learn BG’s state; Liveness: peer BGs may stall
F14: BG partition     | SLs on different sides cannot communicate                          | Safety: inconsistent views of SL’s state
F15: System partition | BGs on opposite sides cannot communicate                           | Safety: inconsistent views of BG’s state
F16: Early exit       | A subtree of BGs subtended by the root completes a cycle and fails | Loss of safety

Table 2: Potential system faults.

8 Impact of node failures on safety and liveness

Note that, independent of its type, one outcome of a node failure is to cause the eventual removal of its corresponding ephemeral znode in its local ZooKeeper. The node’s peers, who have access to this ZooKeeper service, can thus achieve consensus on its state.

8.1 F1: Node failures during the first round

In the first round of each consensus cycle, nodes share their state (i.e., list of client transactions) with each other. For safety, all nodes in the SL need to agree on the same set of transactions. This is achieved by the one-cycle delay in sharing transactions, and by the fact that the duration of a round is much larger than the time taken to establish consensus within an SL. Consequently, all nodes that are live at the end of the first round of a given cycle reach consensus on the set of transactions that belong to that cycle. Only these transactions are shared within the BG. This procedure results in safety despite node failures. Liveness of the SL is achieved as long as there are enough live nodes within the SL to achieve quorum for the intra-SL consensus protocol.

8.2 Node failures during the second round

In the second round of each cycle, monitors use a BFT consensus protocol to achieve consensus, as discussed in Section 9.3. Such a protocol is tolerant of node failures, treating them similarly to Byzantine failures. Specifically, as long as there are enough live nodes and SLs to obtain a quorum certificate, the BG maintains both safety and liveness despite multiple node and SL faults. (Of course, network partitions can cause the BG to lose liveness; this is discussed later.)

8.3 Node failures during subsequent rounds

We need to deal with the possibility of a node failure causing either an emulator or a representative to fail. We deal with each case in turn.

8.3.1 F2: Emulator crash failure

If a vnode v’s emulator fails, this has no impact on safety. If a representative that relies on this emulator is able to make progress by obtaining state from one of the failed emulator’s peers, it does so, maintaining liveness. An SL only stalls if all nodes descending from v fail, which corresponds to the failure of one or more BGs. The impact of such a failure on liveness is discussed in Section 10.1.

8.3.2 F3: Representative crash failure

If one of the representatives in an SL fails, then its ephemeral znode will be removed from the path /local/members/ in the local ZooKeeper. On detecting this, the znode with the next sequence number (if one exists) is promoted to become a representative, and takes over the responsibility of fetching remote state from the failed representative, maintaining liveness.

Note that if the number of live nodes in the SL is too small, the SL may have fewer representatives than intended. If all the representatives in an SL fail, this indicates that the SL itself has failed. The impact of SL failure on liveness is discussed in Section 9.1.

8.4 F4: Monitor failure

If the SL’s monitor fails, then its ephemeral znode will be removed from the path /local/members/ in the local ZooKeeper. On detecting this, the znode with the next sequence number (if one exists) is promoted to become a monitor, and takes over its responsibilities, preserving liveness.

8.5 F5: Dealing with ZooKeeper server crash failure

A subset of nodes in each SL serve as ZooKeeper servers and elect a leader amongst themselves. It is possible for one of these server nodes to fail (or become unreachable due to a network partition). However, as long as a majority of the ZooKeeper servers stay alive, the service remains available. Thus, if failed servers are detected and replaced with due diligence, the ZooKeeper service should be resilient to server failures. If multiple server failures cause ZooKeeper to lose quorum, then the SL stalls and is considered to have crashed; the impact of this on liveness is discussed in Section 9.1.

8.6 F6: Byzantine attack on transaction numbering

With the Merkle-root based computation, neither Byzantine nodes nor clients can game the system. Transaction ordering is completely dependent on other transactions in the current cycle. Furthermore, the information is hidden until the time of aggregation, which happens in parallel with batch sorting. Lastly, there is no overhead involved in this procedure, because the Merkle root is needed for the blockchain verification anyway.

8.7 F7: Byzantine attack: emulators non-responsive

Another type of Byzantine attack by a node serving as an emulator is to not respond to representatives. If this happens, then, from the representative’s perspective, the situation is identical to node failure, and the mitigation approach is also the same.

8.8 F8: Byzantine node ignores client

A failed node can only ignore messages from its clients; it cannot create fraudulent transactions on behalf of its clients because it does not have the client’s private key. To prevent transaction loss, recall that a client sends its transactions to at least f+1 nodes in distinct SLs in the same BG.

8.9 F9: Byzantine emulator lies

Due to the potential for Byzantine failures, any message coming from an emulator is inherently unreliable. To mitigate this problem, every message from every emulator is accompanied by a quorum certificate and is checked by the receiving representative.

9 Impact of SL failures on safety and liveness

9.1 F10: SL crash failures

It is possible that enough nodes in the SL fail that there is a loss of quorum in its ZooKeeper service. In this case, the SL will suffer from a crash failure, and, to preserve safety, all live nodes in the SL should stall. This will appear to other SLs in the BG as a network partition. We discuss how this is handled in Section 10.2, which deals with BG partitions.

9.2 F11: SL partition

An SL typically comprises servers in the same rack connected by a single switch, so it is very unlikely to partition. Nevertheless, for completeness, we consider the impact of an SL partition on system safety and liveness. Two cases are possible: either one of the SL partitions has a quorum of ZooKeeper servers (i.e., it is the majority partition), or no SL partition has a quorum of ZooKeeper servers.

If there is no majority partition, all nodes in the SL stall, and the situation is identical to the SL crashing (Section 9.1): stalled emulators do not respond to state requests when ZooKeeper quorum is lost (not even with a null response).

If there is a majority partition, then, by definition, it has a quorum of ZooKeeper servers. These servers establish consensus that all nodes in the minority partitions are unreachable, because their znodes will be missing from /local/members, which implies that these nodes are no longer members of the SL.

When such a node is able to re-establish its connection with the ZooKeeper service, it must rejoin the SL using the join protocol of Section 5.6; until then, other nodes in the SL should not respond to excluded nodes.

9.3 F12: SL Byzantine failure

It is possible for a Byzantine SL to attack the other SLs in its BG by:

  1. giving inconsistent responses to other SLs when asked for transactions that it received from its clients, which form its own state

  2. lying about membership in its own SL

We consider each in turn.

9.3.1 Byzantine attacks on transactions

Recall that each client submits its transactions to f + 1 SLs. Moreover, these transactions are signed with the client's private key, and the set of transactions is ordered by a hash on each transaction. In the second round of each cycle, these SLs use a BFT consensus algorithm to compute a consistent order of client transactions (Section 4.2). Thus, even with up to f Byzantine failures, a malicious SL cannot tamper with, reorder, or create fraudulent client transactions. This implies that, at the end of round 2 in each cycle, a Byzantine fault-tolerant ordering of the write transactions is obtained, along with a quorum certificate, mitigating Byzantine attacks on transactions.

In subsequent rounds, state queries received by emulators for their ancestor vnodes are answered with the state computed using the BFT algorithm, along with a quorum certificate computed for each SL. Thus, in subsequent rounds, the only possible attack by a malicious node is to omit the transactions from one or more Byzantine groups. However, there can be at most f malicious SLs in a Byzantine group, and representatives from other BGs contact at least f + 1 emulators (from different SLs) in each Byzantine group, so they are guaranteed at least one valid response. This prevents the attack.

9.3.2 Byzantine attacks on membership

Within a BG, the only Byzantine attack on membership is for a malicious SL to lie about the nodes in its SL when participating in the BG membership service (Section 4.2). It could claim nodes that it does not actually have, or fail to list all of its members. However, there is no advantage to lying: if the SL claims a node that does not exist, requests to this node simply go unanswered; if the SL hides one of its nodes, that node is not available for use as an emulator. In either case, the SL can at worst deny certain requests, but cannot cause harm.

10 Impact of BG failures on safety and liveness

In this section, we discuss how to deal with the case where there are more than f SL crash failures in a BG, stalling its BFT consensus protocol, or where the BG partitions.

10.1 F13: BG crash failure

It is possible that enough SLs in a BG fail that the BG membership service loses quorum. In this case, the BG fails. Live nodes in the BG detect this as a loss of the BG's BFT consensus service and, to preserve safety, must stall. This appears to other BGs in the system as a network partition and is handled accordingly, as discussed in Section 11.

10.2 F14: BG partition

RCanopus tolerates a BG partition (or, equivalently, SL crash failures) by design. Recall that the monitors of the SLs in a BG must achieve BFT consensus in each cycle. During a network partition, the SL monitors in the super-majority partition (if one exists) view the inaccessible SLs as having failed and create a quorum certificate that includes only transactions from nodes in the SLs of the super-majority partition. Monitors, and hence nodes, in the minority partition(s) stall, since they lack the quorum needed to achieve BFT consensus.

When the partition heals, all nodes in the minority partition(s) must explicitly rejoin the system, and nodes in the super-majority partition should ignore messages received from these nodes until they have done so.

Note that if the BG partitions such that all partitions have less than a super-majority of SLs, then the BG loses liveness.

11 F15: Impact of system partition on safety and liveness

We now discuss the case where a network partition separates some set of BGs from the others. Such a partition is nearly identical to the crash failure of one or more BGs: from the perspective of the other BGs, some BGs are (perhaps temporarily) unreachable and therefore potentially dead. What is important, however, is that nodes in a BG that is actually alive but in a minority partition must stall, to avoid a 'split brain' situation that would compromise safety.

One approach is to use the CM to deal with partitions; we show in Appendix B that this guarantees safety always and liveness when possible.

The alternative is to use the global consensus service to deal with partitions. Recall that BG leaders use a BFT consensus protocol to update their view of global BG membership when the monitors in a BG come to BFT consensus about the inaccessibility of a remote BG. This consensus is possible only in a super-majority partition, so stalled representatives and nodes can be unstalled only in such a partition. Thus, all nodes in minority partitions stall (guaranteeing safety), and nodes in the super-majority partition (should one exist) eventually regain liveness. Note that if the system partitions such that no partition contains a super-majority of BGs, then the system loses liveness.

12 F16: Early exit

The problem of early exit is easiest to understand from a concrete example. Consider an RCanopus system consisting of four BGs subtended by the root. Suppose that one of the BGs obtains all the information it needs from the other BGs and sends results back to its clients, but then fails before communicating any data to the other BGs. In this case, the remaining BGs will exclude the failed BG's transactions, so clients can receive inconsistent results, compromising safety. This situation arises whenever a subtree of BGs subtended by the root fails (or is partitioned) before communicating with the remainder of the system.

To prevent this, recall that nodes only commit the state computed in cycle c at the end of their cycle c + 1. To see that this preserves safety, note that for a BG B to reach the end of its cycle c + 1, it must receive input from every other live BG B' earlier in that cycle. But this is possible only if B' was live at the end of the c-th cycle. Thus, if B has reached the end of the (c + 1)-th cycle, it (and all other live BGs that have reached the end of that cycle) can be sure that B' was live at the end of the c-th cycle, which means that the contribution of B' to global state in cycle c can be safely committed.

Note that if B' did indeed fail, then B will stall before it reaches the end of the (c + 1)-th cycle, triggering a BFT consensus on excluding B' in the (c + 1)-th cycle. In this case, all BGs need to achieve consensus on removing the state of B' from the system to regain liveness while maintaining safety. However, we expect this to happen rarely, so the recovery mechanism can be somewhat complex. In the common case, by contrast, consensus is achieved with only a slight delay and no loss of throughput. In short, a one-cycle delay allows a node to establish consensus about safety without having to consult an additional out-of-band consensus algorithm.
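The commit rule can be stated compactly in code. The following is a minimal sketch, assuming hypothetical deliver() and cycle_complete() hooks; these names are illustrative and not part of the RCanopus specification.

```python
class CommitBuffer:
    """Buffers the output of cycle c until cycle c + 1 has completed."""

    def __init__(self, deliver):
        self.pending = {}        # cycle number -> ordered transactions
        self.deliver = deliver   # callback that releases results to clients

    def cycle_complete(self, cycle, ordered_txs):
        # Reaching the end of cycle c + 1 proves that every live BG finished
        # cycle c, so the state computed in cycle c can now be committed.
        self.pending[cycle] = ordered_txs
        if cycle - 1 in self.pending:
            self.deliver(cycle - 1, self.pending.pop(cycle - 1))

# Usage sketch:
#   buf = CommitBuffer(print)
#   buf.cycle_complete(1, ["tx_a"])   # nothing delivered yet
#   buf.cycle_complete(2, ["tx_b"])   # cycle 1's output is now committed
```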

13 Summary of mitigation mechanisms

Table 3 summarizes the mitigation mechanisms presented in this document.

Failure class | Fault | Mitigation
F1: Node crash | Node failure in first round | Atomic broadcast, batch timers, one-cycle delay
F2: Node crash | Emulator failure | Global consensus service updates emulator list; on timeout, retry after a delay
F3: Node crash | Representative failure | Promote a live node to representative
F4: Node crash | Monitor failure | Another node promoted to monitor status
F5: Node crash | ZooKeeper node failure | Promote a node to ZooKeeper server; stall on quorum loss
F6: Node Byzantine | Attack on transaction numbering | Random nonce computed as hash on transaction; BG orders on nonces
F7: Node Byzantine | Emulators non-responsive | Global consensus service updates emulator list; on timeout, retry after a delay
F8: Node Byzantine | Node launches DoS attack | Standard DoS defenses
F9: Node Byzantine | Node ignores client | Client sends transactions to f + 1 distinct SLs
F10: Node Byzantine | Emulator lying | All emulator responses must be accompanied by a quorum certificate
F11: SL crash | SL failure | BGs tolerate up to f SL crashes due to the use of BFT consensus
F12: SL partition | SL splits into partitions | Access to ZooKeeper quorum gives the majority partition liveness; minority stalls
F13: SL Byzantine | Messages from SL not trustworthy | All emulator responses must be accompanied by a quorum certificate
F14: BG crash | BG fails | Global consensus on the exclusion of the failed BG
F15: BG partition | SLs on different sides cannot communicate | BGs tolerate partition due to the use of BFT consensus within the BG; minority partition stalls, super-majority stays live
F16: System partition | BGs on opposite sides cannot communicate | Consensus on the exclusion of BGs in minority partitions reached by BGs in the super-majority
F17: Early exit | A subtree of BGs subtended by the root completes a cycle and fails | One-cycle delay before commitment
Table 3: Mitigation techniques

14 Conclusion

This document presents the design of the RCanopus system, which retains the essential aspects of the Canopus protocol but adds several mechanisms to make it resilient to a variety of faults. A detailed analysis of potential faults shows that RCanopus can tolerate Byzantine failures, network partitioning, node crashes, and message loss. We show how this can be achieved without sacrificing pipelining or Canopus's round-based, massively parallel communication pattern.

In future work, we plan to implement and test RCanopus. We also plan to integrate it into well-known permissioned blockchain systems such as Hyperledger Fabric and Parity's Substrate.

References

  • [1] Abraham, I., Gueta, G., and Malkhi, D. HotStuff: the linear, optimal-resilience, one-message BFT devil. arXiv preprint arXiv:1803.05069 (2018).
  • [2] Allavena, A., Wang, Q., Ilyas, I., and Keshav, S. LOT: A robust overlay for distributed range query processing. Tech. Rep. CS-2006-21, University of Waterloo, 2006.
  • [3] Bano, S., Sonnino, A., Al-Bassam, M., Azouvi, S., McCorry, P., Meiklejohn, S., and Danezis, G. Consensus in the age of blockchains. arXiv preprint arXiv:1711.03936 (2017).
  • [4] Bessani, A., Sousa, J., and Alchieri, E. E. State machine replication for the masses with BFT-SMaRt. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on (2014), IEEE, pp. 355–362.
  • [5] Castro, M., and Liskov, B. Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS) 20, 4 (2002), 398–461.
  • [6] Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM) 32, 2 (1985), 374–382.
  • [7] Hadzilacos, V., and Toueg, S. A modular approach to fault-tolerant broadcasts and related problems. Tech. rep., 1994.
  • [8] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference (2010), vol. 8, Boston, MA, USA.
  • [9] Junqueira, F., and Reed, B. ZooKeeper: Distributed Process Coordination. O'Reilly Media, Inc., 2013.
  • [10] Ongaro, D., and Ousterhout, J. K. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference (2014), pp. 305–319.
  • [11] Poke, M., Hoefler, T., and Glass, C. W. AllConcur: Leaderless concurrent atomic broadcast. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (2017), ACM, pp. 205–218.
  • [12] Rizvi, S., Wong, B., and Keshav, S. Canopus: A scalable and massively parallel consensus protocol. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies (2017), ACM, pp. 426–438.
  • [13] Stewart, C., Chakrabarti, A., and Griffith, R. Zoolander: Efficiently meeting very strict, low-latency SLOs. In ICAC (2013), vol. 13, pp. 265–277.

Appendix A Appendix: ZooKeeper

ZooKeeper [8, 9] is a robust coordination service. It exposes a simple API, inspired by the file system API, that allows developers to implement common coordination tasks. Applications use the ZooKeeper API to manipulate data nodes, called znodes, which are organized in the form of a tree. Znodes can contain uninterpreted data in the form of a byte array. Figure 4 illustrates a znode tree. In the figure, the ID of the master is stored under the znode /master and the ID of each worker (such as foo.com:2181) is stored in znodes under /workers.

Figure 4: ZooKeeper tree (from  [9])

The ZooKeeper API is simple:

  • create /path data: Creates a znode named /path containing data

  • delete /path: Deletes the znode /path

  • exists /path: Checks whether /path exists

  • setData /path data: Sets the data of znode /path to data

  • getData /path: Returns the data in /path

  • getChildren /path: Returns the list of children under /path

A znode can be either persistent or ephemeral. Ephemeral znodes are automatically deleted if the client that created them deletes them, crashes, or closes its connection to a ZooKeeper server. A znode can also be marked sequential, in which case, during creation, ZooKeeper appends to its name a unique, monotonically increasing integer, enforcing consensus on creation order.

Clients can register with ZooKeeper to receive notifications of changes to znodes (this is also called setting a watch). Notifications are one-shot, which means that after each notification a new watch needs to be set. Moreover, while ZooKeeper guarantees that clients observe changes to the ZooKeeper state in a consistent global order, the state of a znode may change between the time of notification and the subsequent read, so the definitive state of a znode is known only at the time of the read.
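For concreteness, the calls listed above map directly onto common client libraries. Below is a minimal sketch using the Python kazoo client (an assumption; the document does not prescribe a client library), showing an ephemeral, sequential znode and a watch.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/workers")

# Ephemeral + sequential: deleted automatically if this session dies,
# and suffixed with a unique, monotonically increasing counter.
path = zk.create("/workers/worker-", b"foo.com:2181",
                 ephemeral=True, sequence=True)

data, stat = zk.get("/master")            # getData (assumes /master exists, as in Figure 4)
children = zk.get_children("/workers")    # getChildren

# ZooKeeper watches are one-shot; kazoo's ChildrenWatch helper re-arms the
# underlying watch after every callback.
@zk.ChildrenWatch("/workers")
def on_workers_changed(children):
    print("workers are now:", children)
```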

ZooKeeper is implemented as a distributed system running on a set of servers. One of these servers is the leader and the rest are followers. If the leader fails, the followers elect a new leader.

The ZooKeeper reconfiguration API allows the set of servers to be changed dynamically, which is necessary to handle failures. Note that the minimum ZooKeeper configuration has three servers, of which at least two must be live. If the number of live servers drops below two, the service becomes unavailable; however, in that case there is no other server left for the remaining one to talk to, so this does not pose a practical problem.

Appendix B The Convergence Module (CM)

The Convergence Module (CM) is a mechanism for ensuring convergence in all-to-all communication despite network partitions and failures of components. It offers the following advantages:

  1. It maintains safety in an asynchronous environment.

  2. An administrator can define a policy that determines how to deal with transactions proposed by unavailable BGs.

The high-level approach is applicable to all three layers of RCanopus (intra-SL, inter-SL/intra-BG, and inter-BG), but for concreteness this section describes the variation applicable to the inter-BG layer, where the CM is implemented using a Byzantine fault-tolerant replicated state machine (RSM) deployed on the same physical infrastructure as the BGs. For example, one replica of the CM can be hosted in each BG as long as the total number of BGs is at least 3f + 1, where f is an upper bound on the number of simultaneous replica failures.

More concretely, the CM service is a collection of CM nodes, each comprising an RSM replica for fault-tolerant distributed coordination, and a remote procedure call (RPC) server for interaction with BGs. The RPC server is able to read the state of the RSM replica, but cannot write it directly. State changes are instead accomplished by submitting commands from the service handler of the RPC server to the RSM by way of the co-located RSM replica.

One of the difficulties with using replicated state machines and consensus in a Byzantine environment is that Byzantine processes can in some cases propose invalid commands or inputs. That is, even though agreement is guaranteed on some decision or sequence of decisions, these decisions themselves may be invalid because they are based on erroneous or malicious inputs. We deal with this issue using a combination of techniques that ensure the following guarantees with respect to decisions made using the intra-BG BFT consensus and the CM replicated state machine:

  1. Integrity: each decision was agreed to by a sufficiently large quorum of processes (i.e., a supermajority).

  2. Validity: each decision is consistent with the protocol, regardless of whether it was proposed by an honest node or a Byzantine node.

The correctness properties are ensured using various forms of certificates, which typically contain a collection of signatures from a quorum of nodes. Details are provided in Section B.3.

We begin by describing a simplified version of the protocol in which the BGs communicate with the CM (i.e., with the local CM node) as they approach the end of round 3 of every consensus cycle. Specifically, this occurs when the BG has received inputs from a quorum of BGs (including itself), and has either received inputs from or timed out on the remaining BGs. Each BG sends a BG_REPORT message to all the CM replicas indicating the set of other BGs from which transactions were successfully received up to this point in the current cycle, as well as a hash of the transactions for each peer BG.

For each cycle, a CM node waits until it has received BG_REPORT messages from a super-majority of the BGs, and then waits further until each remaining BG either sends its message or is suspected to have failed due to a timeout. (In this context, "failed" means unreachable rather than crashed.) Suspicions of failure are corroborated by other CM nodes as follows, to prevent false alarms by Byzantine nodes: failure is declared when a quorum of CM nodes all suspect (i.e., have timed out on) a particular BG. Messages relaying failure suspicion are signed by CM nodes and contain both the BG's ID and the cycle number, to prevent replay attacks. A collection of such signatures from distinct CM nodes for the same BG and cycle number comprises a failure certificate. If a failure certificate cannot be computed for some BG, then the CM node eventually receives a BG_REPORT for that BG indirectly from some other CM node.
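A minimal sketch of failure-certificate assembly at a CM node, assuming a generic verify() primitive and treating the quorum size as a parameter; all names here are illustrative, not part of the protocol specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Suspicion:
    cm_id: str          # CM node raising the suspicion
    bg_id: str          # BG suspected to be unreachable
    cycle: int          # cycle number, bound in to prevent replay
    signature: bytes    # signature over (bg_id, cycle) by cm_id

def failure_certificate(suspicions, bg_id, cycle, quorum_size, verify):
    """Return a certificate (list of suspicions) or None if no quorum exists."""
    valid = [s for s in suspicions
             if s.bg_id == bg_id and s.cycle == cycle
             and verify(s.cm_id, (s.bg_id, s.cycle), s.signature)]
    # Count at most one signature per distinct CM node.
    distinct = {s.cm_id: s for s in valid}
    if len(distinct) >= quorum_size:
        return list(distinct.values())
    return None
```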

As a running example, consider the case of three BGs: BG1, BG2, and BG3. Suppose that BG1 in cycle C1 receives inputs from BG2 and BG3. If BG2 becomes unreachable in cycle C1 with respect to BG3 and the CM, then the CM node might receive the following two BG_REPORT messages from BG1 and BG3, respectively:

[C1, BG1:hash1, {BG2:hash2, BG3:hash3}]

[C1, BG3:hash3, {BG1:hash1}]

The BG_REPORT is called complete in the first case, indicating that the BG has received transaction inputs from all BGs, and incomplete otherwise.
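A minimal sketch of one way to represent a BG_REPORT and its completeness check, matching the running example above; the field names and wire format are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class BGReport:
    cycle: str                 # e.g. "C1"
    reporter: str              # e.g. "BG1"
    reporter_hash: str         # hash of the reporter's own transactions
    received: Dict[str, str]   # peer BG -> hash of that BG's transactions

    def is_complete(self, all_bgs: Set[str]) -> bool:
        # Complete iff transactions were received from every other BG.
        return set(self.received) >= all_bgs - {self.reporter}

# The two reports from the running example:
r1 = BGReport("C1", "BG1", "hash1", {"BG2": "hash2", "BG3": "hash3"})
r3 = BGReport("C1", "BG3", "hash3", {"BG1": "hash1"})
members = {"BG1", "BG2", "BG3"}
print(r1.is_complete(members), r3.is_complete(members))   # True False
```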

Suppose that a CM node eventually computes a failure certificate for BG2, and creates a graph representation of cycle C1 as shown in Figure 5. The vertices of the graph represent BGs and the edges indicate that one BG has a copy of another BG’s transactions. The direction of the edge is from the BG that issued the transactions to the BG that received a copy of the transactions.

Figure 5: Graph representation of CM's inputs for cycle C1 (vertices BG1, BG2, BG3).

Next, the CM node computes the outcome of C1 using a configurable policy. In one variation, it computes a maximum subset of vertices with out-degree greater than or equal to n − 1, where n is the total number of non-failed vertices (i.e., vertices representing live BGs). Such vertices represent proposals that are fully replicated across all non-failed BGs. For C1, n = 2 because BG2 is unavailable, which is indicated in Figure 5 using a dotted circle around BG2. Any such failed vertex is excluded from the computation, along with any edges incident on it. The outcome of the CM's graph analysis in C1 is therefore the maximum subset of non-failed vertices with out-degree at least one: {BG1, BG3}. The remaining BG, namely BG2, is then added to the set of faulty nodes (FN) for cycle C1. This means that BG2's inputs are excluded from C1; it does not imply BG2's deletion from the group membership of the system. Next, the CM node proposes a command to the RSM that associates two records with cycle C1: the mapping {BG1:hash1, BG3:hash3}, representing the collection of transactions committed in cycle C1, and the mapping FN = {BG2:failcert2}, representing the BGs deemed to have failed and their failure certificates. Each CM node eventually receives the command and is able to report the outcome of cycle C1 to its own BG. The CM node then delivers to the BG a CM_REPLY message that contains the command and the corresponding certificate (see Section B.3). A BG that receives such a message from the local CM node stops waiting for any BG in FN, and continues to contact other BGs to retrieve any missing transaction inputs.
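A minimal sketch of this graph analysis, reduced to out-degree counting over the received-from sets in the BG_REPORT messages and applied to cycle C1 of the running example; the data layout and the full-replication threshold are illustrative.

```python
def analyze_cycle(reports, failed):
    """reports: {reporting BG -> set of BGs whose transactions it holds};
    failed: BGs with a failure certificate for this cycle."""
    live = set(reports) - failed
    out_degree = {bg: 0 for bg in live}   # copies of bg's input held by other live BGs
    for receiver, holds in reports.items():
        if receiver in failed:
            continue                      # failed vertices and their edges are excluded
        for issuer in holds:
            if issuer in out_degree and issuer != receiver:
                out_degree[issuer] += 1
    threshold = len(live) - 1             # fully replicated across all non-failed BGs
    committed = {bg for bg, deg in out_degree.items() if deg >= threshold}
    fn = (set(reports) | failed) - committed
    return committed, fn

# Cycle C1 of the running example: BG2 has a failure certificate.
committed, fn = analyze_cycle({"BG1": {"BG2", "BG3"}, "BG3": {"BG1"}}, failed={"BG2"})
print(committed, fn)   # {'BG1', 'BG3'} {'BG2'}
```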

The graph analysis is performed in parallel by different CM nodes for each cycle, which may lead to differing views on the liveness of a particular BG and hence to different outputs. The certification scheme described in Section B.3 ensures that the decision of the CM is well-defined for each cycle despite this: in the event that multiple decisions are committed in the CM RSM, the first state transition is treated as authoritative and the others are ignored. Duplicate decisions should nevertheless be avoided in the interest of performance, and several optimizations can be used for this purpose:

  • The CM nodes submit a command to the RSM for a given cycle only if they have not yet executed a state transition command for that cycle.

  • Randomized timeouts can be used prior to submitting an RSM command.

  • A distinguished leader CM node can be responsible for submitting the RSM command, in which case duplicate state transitions for the same cycle occur only if the leader suffers a Byzantine failure or if two leaders exist temporarily because of inaccurate failure detection. A Raft-style term-based leader election algorithm can be used in this context, with extensions for Byzantine fault tolerance (e.g., a node cannot start a leader election until it has computed a failure certificate for the current leader).

Now suppose that BG2 was merely slow and not faulty, and continues to participate in the next cycle C2. Since BG2 was added to FN specifically for cycle C1, BG1 and BG3 continue to attempt communication with BG2 in cycle C2. (An optimization that avoids this is described in Section B.2.) The CM node may therefore receive the following messages in C2:

[C2, BG1:hash1, {BG2:hash2, BG3:hash3}]

[C2, BG2:hash2, {BG1:hash1, BG3:hash3}]

[C2, BG3:hash3, {BG1:hash1, BG2:hash2}]

Figure 6 shows the corresponding graph computed by the CM. The outcome of the graph computation for C2 is different from C1 because BG2 participates fully. The CM node then associates {BG1:hash1, BG2:hash2, BG3:hash3} and FN = {} with cycle C2 by issuing a command to the RSM.

Figure 6: Graph representation of CM's inputs for cycle C2 (vertices BG1, BG2, BG3).

Next, the CM may receive the following messages in cycle C3 if BG2 is removed from the system's group membership by the administrator in cycle C2:

[C3, BG1:hash1, {BG3:hash3}]

[C3, BG3:hash3, {BG1:hash1}]

Figure 7 shows the corresponding graph computed by the CM. The CM then associates {BG1:hash1, BG3:hash3} and FN = {} with cycle C3 by issuing a command to the RSM, similarly to cycles C1 and C2.

Figure 7: Graph representation of CM's inputs for cycle C3 (vertices BG1, BG3).

Next, suppose that BG2 comes back as BG4 in cycle C4. Then in cycle C4 the CM may receive the following messages:

[C4, BG1:hash1, {BG3:hash3, BG4:hash4}]

[C4, BG3:hash3, {BG1:hash1, BG4:hash4}]

[C4, BG4:hash4, {BG1:hash1, BG3:hash3}]

This scenario indicates a return to steady-state operation, in which each BG is able to communicate with every other BG. The CM computes the graph shown in Figure 8 and, since all vertices have out-degree two, associates {BG1:hash1, BG3:hash3, BG4:hash4} and FN = {} with cycle C4 using the RSM.

Figure 8: Graph representation of CM's inputs for cycle C4 (vertices BG1, BG3, BG4).

In the running example above, the output of the RCanopus protocol for a given cycle is determined by the set of nodes recorded in the CM RSM. For cycles C2 and C4, the output is the union of the inputs of all the BGs in the system. For C1 and C3, the output is the union of the inputs of BG1 and BG3 only.

In a variation of the protocol, the graph analysis selects vertices whose out-degree is sufficient to achieve an administrator-defined replication factor r. For example, to ensure a replication factor of at least r under the assumption that a BG never crashes permanently, the CM would select a maximal subset of vertices with out-degree at least r − 1. (We assume that a BG stores its own inputs, hence only r − 1 additional copies are required at other BGs.) In this optimized version of the protocol, the CM may associate additional information with each consensus cycle to identify the BGs that hold copies of the transaction inputs that constitute the output.

To summarize, the CM uses communication successes in a cycle, along with a graph algorithm, to determine the set of BGs that can be considered live for the cycle. This information is broadcast to all the BGs, who then exclude transactions from any BGs that were identified by the CM as failed (meaning unavailable). The cycle completes only at this point. Section B.1 discusses how the CM can be removed from the critical path in failure-free operation.

B.1 Bypassing the CM in the absence of failures

In the interest of performance, the BGs must avoid coordinating with the CM in every cycle in the common case of failure-free operation. The protocol can be modified to remove the CM from the critical path in this case, so that the CM is invoked only when needed to settle the outcome of a consensus cycle affected by a failure. Letting d denote the depth of the processing pipeline, this is accomplished by having each BG broadcast meta-data related to the outcome of cycle c to the other BGs as part of its input in cycle c + d, which begins only after cycle c completes. (This optimization requires that a BG participate for one additional cycle after it is deleted from the system's group membership. In particular, a BG that is deleted in cycle c must still offer its meta-data, but no transactions, in cycle c + d before disengaging entirely from the protocol.) At the completion of cycle c + d, each BG can retroactively analyze the meta-data received for cycle c to classify that cycle as belonging to one of two disjoint categories:

  1. A CM-assisted cycle is one in which at least one BG received assistance from the CM; in this case the outcome of the cycle is decided explicitly by the CM. ("Received assistance" means that the BG reported to the CM that it was stalled on another peer BG and was told the outcome of the cycle by the CM.)

  2. A cycle is unassisted otherwise, in which case its outcome is the union of the inputs of all the BGs in the system.

We say that the commitment of cycle c is delayed by d cycles because the outcome of cycle c is not known until the end of cycle c + d.

As hinted earlier, the protocol is modified by including meta-data regarding cycle c in the input for cycle c + d. The meta-data is packaged as a BG_META message indicating whether or not this BG received assistance from the CM in cycle c. The BG_META message comes in three flavors:

  1. A BG_META-NO_ASST message indicates that the BG decided not to seek assistance from the CM, and hence did not receive assistance.

  2. A BG_META-ASSISTED message indicates that the BG sought and received assistance from the CM.

  3. A BG_META-DENIED message indicates that the BG sought assistance from the CM and was denied because its request came too late (i.e., at a time when the CM had already begun assisting some BG with a later cycle). This case is discussed in more detail shortly.

To ensure a correct categorization of cycles, the meta-data reported by BGs must be in agreement with the state of the CM, meaning that a cycle is CM-assisted if and only if at least one BG generates a BG_META-ASSISTED message for it. To that end, BGs must follow certain rules:

  1. The decision of a BG to seek (or not seek) assistance from the CM is committed in a Byzantine fault-tolerant manner and certified using the techniques described in Section B.3. For performance, this decision can be combined with consensus on the BG's input for some future cycle. The certificate corresponding to this decision is attached to the BG_META message.

  2. Once a BG records a decision to seek assistance, it cannot complete the cycle without submitting a request to the CM and receiving a response. The request message is an incomplete BG_REPORT. The response message is either a CM_REPLY, which was discussed earlier, or a CM_DENY, which is explained shortly.

  3. A CM_REPLY message is attached to a BG_META-ASSISTED meta-data message to prove that the BG received assistance. Similarly, a CM_DENY message is attached to a BG_META-DENIED meta-data message to prove that the BG was denied. The certification scheme discussed in Section B.3 ensures that a BG can obtain only one of CM_REPLY or CM_DENY for a given cycle, not both.

As an example, suppose that there are three BGs – BG1, BG2, BG3 – in a system with a pipeline of depth d = 1. Suppose further that the proposals of these three BGs are exchanged successfully in cycle C1 without the CM's assistance. Then each BG adds a BG_META-NO_ASST message to its input for cycle C2. At the end of cycle C2, presuming no failures, each BG computes the following mapping from BGs to their meta-data:

C1: {BG1 → BG_META-NO_ASST, BG2 → BG_META-NO_ASST, BG3 → BG_META-NO_ASST}

The number of keys in this mapping (three), and the fact that all values are BG_META-NO_ASST, indicate collectively that cycle C1 was unassisted. All BGs know this once C2 is complete because in this case all BGs receive meta-data from all other BGs.

Next, consider a failure scenario in which BG1 is unable to receive data from BG2 in cycle C3, and requests (as well as receives) assistance from the CM in cycle C3. BG1 then includes a BG_META-ASSISTED message in its input to cycle C4. The classification of cycle C3 depends on the meta-data received by a given BG at the end of cycle C4. If all the meta-data messages are BG_META-NO_ASST or BG_META-DENIED, then C3 is unassisted (not the case in this example). If at least one meta-data message is a BG_META-ASSISTED, then C3 is CM-assisted. In the current example, this occurs if a BG computes the following mapping for C3:

C3: {BG1 → BG_META-ASSISTED, BG2 → BG_META-NO_ASST, BG3 → BG_META-NO_ASST}

The same case applies if a BG computes a subset of this mapping as follows:

C3: {BG1 → BG_META-ASSISTED, BG2 → BG_META-NO_ASST}

Finally, if one or more meta-data messages are missing and no BG_META-ASSISTED is received, then further investigation is required to resolve the outcome of the cycle. For example, this occurs if a BG computes the following mapping for C3 in C4:

C3: {BG2 → BG_META-NO_ASST, BG3 → BG_META-NO_ASST}

In this case, cycle C4 itself is CM-assisted because the input of BG1 for C4 is not received by the BG that computes the above mapping, and the status of C3 (i.e., assisted vs. unassisted) is already known to the CM provided that it maintains the following invariant: the CM cannot assist with cycle c if it has already begun assisting with cycle c + d, where d is the depth of the pipeline. Maintaining this invariant ensures that the CM can be queried directly to determine the outcome of C3 once C4 is known to be CM-assisted. Enforcing the invariant is difficult because it leads to a race condition in which a slow BG requests assistance for an earlier cycle c after the CM has already started assisting a later cycle c'. A CM node deals with this race condition as follows: before replying to any request for assistance with cycle c', it ensures that a supermajority of BGs have progressed past the earlier cycle c, after either receiving assistance from the CM or receiving transaction inputs from all other BGs in cycle c. More concretely, the CM waits until it either has a complete BG_REPORT for cycle c from a supermajority of BGs, or else until some BG issues an incomplete BG_REPORT for cycle c, indicating a request for assistance. In the latter case, the CM offers assistance by returning a CM_REPLY message. Any request for assistance (i.e., incomplete BG_REPORT) with cycle c received after this point is answered with the precomputed CM_REPLY if c was CM-assisted, and with a CM_DENY message otherwise. The denial indicates to the requesting BG that c is unassisted and that any missing inputs can be obtained by querying any supermajority quorum of BGs.
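A minimal sketch of the retroactive classification a BG performs at the end of cycle c + d, using the rules above; the three-valued result and the message-kind strings are illustrative.

```python
ASSISTED, UNASSISTED, ASK_CM = "cm-assisted", "unassisted", "ask-cm"

def classify_cycle(meta, all_bgs):
    """meta: {bg_id -> 'ASSISTED' | 'NO_ASST' | 'DENIED'} for cycle c,
    gathered from the inputs of cycle c + d; all_bgs: current membership."""
    if any(kind == "ASSISTED" for kind in meta.values()):
        return ASSISTED                 # at least one BG_META-ASSISTED
    if set(meta) == set(all_bgs):
        return UNASSISTED               # complete meta-data, and no one was assisted
    # Missing meta-data and no BG_META-ASSISTED: the outcome must be resolved
    # by querying the CM directly (the revealing cycle is itself CM-assisted).
    return ASK_CM

print(classify_cycle({"BG2": "NO_ASST", "BG3": "NO_ASST"},
                     all_bgs=["BG1", "BG2", "BG3"]))   # -> ask-cm
```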

Theorem 2 (safety).

If two BGs determine the outcome of a cycle c, then either both decide that c is CM-assisted or both decide that c is unassisted.

Proof.

Suppose for contradiction that two BGs, say BG1 and BG2, reach conflicting decisions regarding cycle c: BG1 decides that c is CM-assisted, and BG2 decides that c is unassisted. Since BG1 decides that c is CM-assisted, one of three events occurred:

  1. BG1 requested and received assistance from the CM in cycle c, and produced a BG_META-ASSISTED; or

  2. BG1 received a BG_META-ASSISTED from another BG in a later cycle; or

  3. BG1 received incomplete meta-data for cycle c and determined that c was CM-assisted by querying the CM directly (i.e., it received a CM_REPLY and not a CM_DENY).

In cases (i) and (iii), BG1 reaches the correct conclusion since it communicates directly with the CM, and since the certification scheme described in Section B.3 ensures that the CM nodes produce a uniform decision for a given cycle. In case (ii), the BG_META-ASSISTED includes a corresponding CM_REPLY, which is certified and proves that the cycle was CM-assisted.

Next, consider how BG2 could reach the incorrect conclusion that c is unassisted. According to the protocol, BG2 must have received only BG_META-NO_ASST and BG_META-DENIED meta-data messages from all BGs for cycle c, and yet one of these BGs did in fact receive assistance in cycle c (i.e., was offered a CM_REPLY for cycle c). A BG_META-NO_ASST includes a certificate that proves the BG committed not to seek assistance from the CM in cycle c. A BG_META-DENIED, on the other hand, includes a CM_DENY message, which proves that the cycle was unassisted. Both cases contradict the observation that the BG under consideration received assistance from the CM in cycle c. ∎

Theorem 3 (liveness).

Suppose that the protocol begins to process cycle c and that the pipeline depth is d. Suppose further that each BG individually maintains safety and liveness. If there exists a quorum Q, containing at least a supermajority of the BGs, whose members are able to exchange and process messages sufficiently quickly to avoid triggering timeouts in the protocol, then eventually every BG in Q progresses to a cycle c' > c.

Proof.

The protocol begins with each BG collecting inputs from peer BGs. Every BG in Q is able to receive such inputs from all other BGs in Q, and then either receives inputs from or times out on every BG outside of Q. If the BG receives inputs from all BGs, then it proceeds directly to cycle c + 1 with a BG_META-NO_ASST meta-data message for cycle c, and the theorem holds. On the other hand, if any of the inputs is missing, then the BG requests assistance from the CM by sending a BG_REPORT message to its local CM node. (This case can be avoided if an unavailable BG was excluded by the CM for cycle c while assisting with some earlier cycle, or was temporarily removed from the group by the administrator.) At this point, the CM node follows one of several execution paths.

Case A: the CM node has already computed a CM_REPLY or CM_DENY for cycle c. Then the CM node returns that same response to the BG under consideration.

Case B: the CM node has not yet computed a CM_REPLY or CM_DENY for cycle c. In this case, the CM node first ensures that the outcome of cycle c has been determined, which involves contacting a quorum of BGs, analyzing their BG_REPORT messages, and in some cases computing one or more failure certificates. Timely communication with the BGs in Q and their corresponding CM nodes ensures that this part of the protocol eventually completes. Next, the CM node either attempts to assist the BG, or replies with a denial if it has already computed a certified CM_REPLY (i.e., provided assistance) for cycle c + d or higher.

Subcase B1: the CM node has already computed a certified CM_REPLY for cycle c + d or higher. Then the CM node replies to the BG with a CM_REPLY message if such a message has been processed by the RSM and certified for cycle c, or else it replies with a CM_DENY message, which is also certified. In the former case, the reply to the BG is pre-computed. In the latter case, the CM_DENY command can be certified using the CM nodes corresponding to Q.

Subcase B2: the CM node has not yet computed a CM_REPLY for cycle c + d or higher. Then the CM node proceeds with the graph analysis based on the inputs of the BGs in Q and possibly other BGs outside of Q. Every BG in Q reports its BG_REPORT to the CM node, and for every BG outside of Q, either a failure certificate is computed by the CM nodes or a BG_REPORT message is retrieved for cycle c. Next, the CM node proposes a command to its RSM based on the computed CM_REPLY. This command is accepted by the RSM, in particular by the CM nodes corresponding to the BGs in Q, but there is no guarantee that it will be certified, because a conflicting command for cycle c may already have been proposed. If the command is certified successfully, then the CM node returns a certified CM_REPLY response to the BG. Otherwise, the CM node discovers that an earlier RSM command containing a CM_REPLY was already processed and successfully certified, either for cycle c or for cycle c + d or higher. In the former case, the CM node replies with the earlier CM_REPLY command after ensuring that it is certified. In the latter case, the CM replies with a CM_DENY message to the BG. ∎

B.2 Optimization for long network partitions

A network partition may cause some BGs to lose contact with other BGs in the system for an extended period of time. The CM bypass mechanism, as described earlier, is ineffective during such a partition because every cycle must be CM-assisted, which introduces a substantial performance overhead. The problem can be remedied by excluding the inputs of the BGs deemed failed in the graph analysis for multiple consensus cycles. That is, in a CM-assisted cycle c, the decision to exclude the inputs of some BG persists beyond cycle c for one or more additional cycles starting at cycle c + d, where d is the depth of the pipeline. The additional information regarding excluded BGs can be embedded in the CM_REPLY message, which is attached to the BG_META-ASSISTED messages in the CM bypass mechanism. An excluded BG resumes computation later on without having to change its ID and execute a join protocol. Peer BGs continue to send messages to the excluded BG until then, but do not expect replies. Alternatively, a temporary group membership change can be made using the same mechanism as for ordinary group membership changes, which remains available from inside a sufficiently large partition (i.e., as long as a supermajority of BGs can be reached).

B.3 Certification of consensus decisions

This section pertains to two types of decisions:

  1. The BG’s choice of transaction inputs for a given consensus cycle, which may include some meta-data regarding a past cycle (see Section B.1).

  2. The CM’s decision on the outcome of a cycle, which is based on graph analysis.

Since both decisions are reached using a black-box BFT consensus, it is possible to ensure integrity by computing a quorum certificate for a decision: a collection of signatures from a supermajority (2f + 1 out of 3f + 1) quorum of servers. However, a quorum certificate created in an application-agnostic manner inside the BFT black box does not by itself ensure validity in a Byzantine environment, because an invalid decision may be committed based on the input of a Byzantine node. Worse yet, multiple conflicting decisions may be committed.

One solution to the problem is to modify the implementation of the BFT consensus to filter out both invalid and duplicate decisions before they are committed; in this case, a conventional quorum certificate is sufficient to ensure both integrity and validity. If the BFT implementation cannot be modified, for example because it is closed source or managed by a third-party provider, then we solve the problem by applying additional signatures on top of the quorum certificate that attest to the fact that the decision is singular and intended by the protocol. The number of additional signatures must be large enough that at least one signature comes from an honest node (i.e., at least f + 1). This node must be aware of all prior decisions reached by the BFT consensus so that it can detect invalid or duplicate decisions. An invalid decision must not be signed, and only the first decision in a sequence of duplicate decisions may be signed.
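A minimal sketch of checking such a certified decision, assuming n = 3f + 1 servers, a supermajority quorum of 2f + 1 signatures on the decision itself, at least f + 1 additional attestations, and a generic verify() primitive; all of these names and the message encoding are illustrative assumptions.

```python
def check_certified_decision(decision, quorum_sigs, attest_sigs, f, verify):
    """decision: the committed value (e.g., a CM_REPLY with its cycle number).
    quorum_sigs / attest_sigs: {signer_id -> signature}."""
    # Integrity: a supermajority of servers agreed on this decision.
    quorum_ok = sum(1 for signer, sig in quorum_sigs.items()
                    if verify(signer, decision, sig)) >= 2 * f + 1
    # Validity: f + 1 attestations guarantee that at least one honest node
    # checked that the decision is valid and is the first one for its cycle.
    attest_ok = sum(1 for signer, sig in attest_sigs.items()
                    if verify(signer, ("attest", decision), sig)) >= f + 1
    return quorum_ok and attest_ok
```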

Example: A representative from a BG proposes that the BG's transaction input in the inter-BG layer for some cycle c is a set T. A second representative proposes a different set T' (e.g., the same transactions but different meta-data for an earlier cycle in the CM bypass optimization). An honest server proposes T to the intra-BG BFT consensus, and a Byzantine node subsequently proposes T'. The BFT consensus records both decisions: T followed by T'. The additional signatures ensure that T is identified correctly as the BG's input for cycle c. If the quorum certificates produced for T and T' include the index number of the corresponding decision, then any honest node identifies T as the first valid decision computed by the BFT consensus for cycle c, signs T, and refuses to sign T'.