With the recent explosion in popularity of decentralized digital currencies, it is more important than ever to have algorithms that are fast, efficient, easy to run, and quantifiably safe. These digital currencies typically rely on some “consensus” mechanism to ensure that everyone has a consistent record of which transactions occurred, preventing malicious actors from sending the same money to two different honest actors (referred to as “double spending”). More traditional digital currencies that rely on proof-of-work consensus, such as Bitcoin and Ethereum, struggle to achieve low transaction latency and high throughput, and theoretical results show that proper scaling is impossible without fundamental changes to these protocols. Meanwhile, XRP has since its inception been both relatively fast and scalable. Rejecting such proof-of-work algorithms, XRP uses a consensus algorithm in the sense of the research literature, where a group of nodes collaborates to agree on an ordering of transactions in the face of arbitrary asynchrony and some tolerated number of arbitrarily behaving parties. It has long been known that such consensus protocols can be made very efficient.
For XRP the concern is thus less about how to improve the efficiency of the protocol, and more about how to enable easy “decentralization”. Traditional consensus algorithms assume a complete network where all nodes agree on who is participating in consensus. However, in a real scenario where a consensus network is run by truly independent parties with their own beliefs, regulations, and motivations, it would be effectively impossible to guarantee that everyone agrees on the same network participants. Further, trying to make such a system amenable to open participation would immediately open the door to a Sybil attack, wherein a single entity gains control of a substantial fraction of the network and wreaks havoc. Thus these classical consensus algorithms are a poor choice for use in a decentralized network.
The XRP Ledger Consensus Protocol (XRP LCP) resolves this issue by allowing partial disagreement on the participants in the network while still guaranteeing that all nodes come to agreement on the ledger state. The set of participants that a node considers in the network is referred to as that node’s unique node list or UNL. In this setting the consistency of the network state is guaranteed by an overlap formula that prescribes a lower bound for the intersection of any two correct nodes’ UNLs. As described in the original whitepaper, this lower bound was originally thought to be roughly 20% of the UNL size. An independent response paper later suggested that the true bound was roughly 40% of the UNL size. Unfortunately, both of these bounds turned out to be naive, and in a sister paper to this one Chase and MacBrough prove that the correct bound is actually roughly 90%. Although this bound allows some variation, we would prefer a bound somewhat closer to the original expectation, to allow as much flexibility as possible. Chase and MacBrough also show that when there is not universal agreement on the participants, it is possible for the network to get “stuck” even with near-total UNL agreement and no faulty nodes, so that no forward progress can ever be made without manual intervention.
To solve these issues, this paper proposes a new consensus protocol called Cobalt, which can be used to power decentralized digital currencies such as XRP. Cobalt reduces the overlap bound to only 60%, which gives much more flexibility to support painless decentralization without the fear of coming to an inconsistent ledger state. Further, unlike the previous algorithm, Cobalt cannot get stuck when the overlap bound is satisfied between every pair of honest nodes.
Another advantageous property of Cobalt is that the overlap condition for consistency is local. This means that two nodes that have sufficient overlap with each other cannot arrive at inconsistent ledger states, regardless of the overlaps between other pairs of nodes. This property makes it much easier to analyze whether the network is in a safe condition. For a network that can potentially be (mis-)configured by humans, it is very important to be able to easily recognize when the network is unsafe.
Further, Cobalt always makes forward progress fully asynchronously. Similar to the well-known consensus algorithm PBFT, the previous algorithm, XRP LCP, required assuming a form of “weak asynchrony”, where throughput could be dropped to zero by slightly-higher-than-expected delays or a few faulty nodes. But in practice, it is difficult to quantify what level of delay is “expected” in a decentralized open setting, where nodes can be in arbitrary locations around the globe and have arbitrarily poor communication speed. With Cobalt, however, performance simply degrades smoothly as the average message delay increases, even with the maximal number of tolerated faulty nodes and an actively adversarial network scheduler. In a live network, breaking forward progress could do a lot of damage to businesses that rely on being able to execute transactions on time, so this extra property is very valuable.
Decentralization is important primarily for two reasons: first, it gives redundancy, which protects against individual node failures and gives much higher uptime; second, it gives adaptability, so that even in the face of changing human legislation, the network can conform to those changes without needing a trusted third party that can exert singular control over the network. One of the core insights of Cobalt is that these two properties of decentralization can be separated to give better efficiency while maintaining redundancy and adaptability. Like many other decentralized consensus mechanisms, Cobalt performs relatively slowly when used as a consensus mechanism for validating transactions directly. Thus instead of using Cobalt for transactions directly, we only use it for proposing changes to the system (“amendments” in the XRP Ledger terminology). Meanwhile a separate network with universal agreement on its participants can run a faster consensus mechanism to agree on a total ordering for the transactions. Changes to the members of this “transaction network” are executed as amendments through Cobalt. In this setup, the transaction network running a fast consensus algorithm gives both speed and redundancy, while the governance layer running Cobalt gives adaptability.
Using Cobalt together with a fast, robust transaction processing algorithm like Aardvark or Honeybadger gives all the same benefits of full decentralization while vastly improving the optimal efficiency. Further, in appendix A we present a simple protocol addition that reduces the security requirements of the transaction processing algorithm to the security requirements of Cobalt; thus even if every single transaction processing node fails, as long as the consistency requirements of Cobalt are met then every node will continue to agree on the ledger state. Other ideas for using a decentralized algorithm to delegate a consensus group, such as dBFT, do not share this property, and instead require additional assumptions about the delegated group to guarantee consistency, weakening the system’s overall security. The proposed addition adds only a slight latency overhead to the transaction processing algorithm.
We stress that this does not reduce the benefits of decentralization, as the transaction processing nodes only have the role of ordering transactions. Cobalt nodes still validate transactions on their own and are guaranteed to accept the same transactions. And since client transactions are broadcast over the peer-to-peer network, the transaction processing nodes cannot even effectively censor transactions: the Cobalt nodes would identify such behavior and eventually elect a new group of transaction processing nodes that do not censor transactions. Delegating the job of ordering transactions to a dedicated group is purely an optimization, and does not harm the robustness of the network in any way.
In section 2 we describe our network model and the problem we’re trying to solve. In section 3 we summarize the existing results in the area and justify the need for a new protocol. In section 4 we present the details of the Cobalt algorithm and prove that it satisfies all the properties we require of it. In appendix A we describe an extension that can be used to reduce the security requirements of other consensus algorithms to Cobalt’s security requirements, and in appendix C we include an extra proposition which shows that Cobalt is actually reasonably efficient, but which doesn’t fit into the flow of the rest of the paper.
2 Network Model and Problem Definition
Let $\mathcal{P}$ be the set of all nodes in the network. An individual node in $\mathcal{P}$ is referred to as $P_i$, where $i$ is some unique identifier, such as a cryptographic public key. We do not assume all parties (or any party) know the identities of every node in $\mathcal{P}$, nor even the size of $\mathcal{P}$. We assume that every pair of nodes has a reliable authenticated communication channel between them. This can be implemented in a reasonable way by using a peer-to-peer overlay network and cryptographically signing messages. Clearly, nodes cannot be made to respond to requests from arbitrary parties, since this immediately opens up an avenue for distributed denial of service attacks. We assume however that any node has some way of making requests of every other node if it is willing to “put in some effort”. For instance, nodes might charge a modest fee or require some proof-of-work to respond to a request from an untrusted node. This makes DDoSing the network infeasible while allowing untrusted nodes to make requests of other nodes.
A node that is not crashed and behaves exactly according to the protocol defined in section 4 is said to be correct. Any node that is not correct is Byzantine. Byzantine behavior can include not responding to messages, sending incorrect messages, and even sending different messages to different parties. Note that in the original analysis of XRP LCP, it was assumed that Byzantine nodes cannot send different messages to different nodes, since it was implicitly assumed that in a peer-to-peer network such behavior would be easily identifiable. However, in our subsequent re-analysis we dispensed with this assumption, since a network partition could potentially allow irreversible damage to be done before such behavior is correctly identified. Not making this assumption is standard in the research literature on consensus algorithms, so we do not make it here either.
We further make the following nonstandard definition: a node is actively Byzantine if it sends some message to another node that it would not have sent had it been correct. A node can be Byzantine without being actively Byzantine; for example, a node that crashes is Byzantine but not actively Byzantine. A node which is not actively Byzantine is honest.
Every node $P_i$ has a unique node list or UNL, denoted $\text{UNL}_i$. A node’s UNL is thought of as the set of nodes that it partially trusts and listens to for making decisions. $\text{UNL}_i$ may or may not include $P_i$ itself. The UNLs give structure to the network and allow a layered notion of trust, where a node that is present in more UNLs is implicitly considered more trustworthy and is more influential. We sometimes say that $P_i$ listens to $P_j$ if $P_j \in \text{UNL}_i$.
For most of the Cobalt protocol, we further assume that every honest node only has a single communication function, called broadcast. The statement “$P_i$ broadcasts the message $m$” means that $P_i$ sends $m$ to every node that listens to $P_i$. While not strictly necessary, this assumption makes the protocol analysis slightly simpler and is powerful enough on its own to develop the Cobalt protocol. The only exception to this rule is in section 4.1 for distributing threshold shares, which requires sending different messages to different nodes.
We also require that if an honest node $P_i$ broadcasts a message $m$, then even if $P_i$ crashes or otherwise behaves incorrectly in any way, either it eventually sends $m$ to every node that listens to it, or else no node receives $m$ from $P_i$. This is reasonable from an implementation standpoint if messages are routed over a peer-to-peer network: as long as a node doesn’t send contradictory messages, a message sent to one party should eventually be received by all listening parties. We note that this requirement is needed only for guaranteeing liveness, not consistency.
We define the extended UNL $\text{UNL}_i^\infty$ to be the “closure” of $P_i$’s UNL, which recursively contains the nodes in the UNL of any honest node already in it. Formally, this is defined inductively by defining $\text{UNL}_i^0 = \text{UNL}_i$ and then defining $\text{UNL}_i^{k+1}$ to be the set of all nodes in the UNL of any honest node in $\text{UNL}_i^k$. We then define the extended UNL of $P_i$ to be the set $\text{UNL}_i^\infty = \bigcup_{k \ge 0} \text{UNL}_i^k$. Intuitively, a node’s extended UNL represents the entire network from the perspective of $P_i$; any node that could possibly have an effect on $P_i$, either directly or indirectly, is in $\text{UNL}_i^\infty$.
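The inductive definition above is just a fixpoint computation, which can be sketched as follows (the paper itself contains no code; the flat-dictionary representation and all names here are our own illustrative assumptions):

```python
# Hypothetical sketch: computing a node's extended UNL as a fixpoint.
# `unl` maps each node id to the set of ids in its UNL; `honest` maps
# node ids to a boolean honesty flag (an assumption of this sketch).

def extended_unl(i, unl, honest):
    """Return the closure UNL_i^infty: start from UNL_i and repeatedly
    add the UNLs of honest nodes already in the set until it stabilizes."""
    closure = set(unl[i])
    changed = True
    while changed:
        changed = False
        for j in list(closure):
            if honest.get(j, False):
                new = unl.get(j, set()) - closure
                if new:
                    closure |= new
                    changed = True
    return closure
```

Note that a Byzantine node in the closure contributes itself but not its UNL, matching the restriction to honest nodes in the inductive step.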
A node $P_i$ also maintains a set of essential subsets, denoted $\mathcal{E}_i$, where $S \subseteq \text{UNL}_i$ for every $S \in \mathcal{E}_i$. Intuitively, whereas a node’s UNL is the set of all nodes that it listens to for making decisions, its essential subsets refine how it makes decisions based on the messages it receives from those nodes. The original XRP Ledger consensus algorithm had no notion of essential subsets, and instead used a predefined “quorum” $q_i$ defining how many nodes in $\text{UNL}_i$ the node $P_i$ needs to hear from to make a decision. The direct analogue of this model would loosely be to let $\mathcal{E}_i$ be the set of all subsets of $\text{UNL}_i$ of size at least $q_i$. It follows immediately from proposition 25 that using this model with quorums as suggested in the XRP whitepaper, Cobalt guarantees consistency for all nodes with roughly 60% pairwise UNL overlaps.
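The “direct analogue” described above, in which the essential subsets are simply all sufficiently large subsets of the UNL, can be made concrete with a short sketch (purely illustrative, with invented names; note the exponential blow-up, which is why this construction is a conceptual bridge rather than a practical encoding):

```python
# Hedged sketch: under the original quorum model, a node's essential
# subsets would be all subsets of its UNL with at least q members.
from itertools import combinations

def quorum_essential_subsets(unl, q):
    """All subsets of `unl` of size >= q. Exponentially many in general,
    so this only illustrates the correspondence between the quorum model
    and the essential-subset model; it is not a practical implementation."""
    nodes = sorted(unl)
    subsets = []
    for k in range(q, len(nodes) + 1):
        subsets.extend(frozenset(c) for c in combinations(nodes, k))
    return subsets
```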
Despite the fact that the original UNL formalism can be transferred to the essential subset model, in our model we consider the essential subsets as central and the UNL as more or less incidental. We expect a node’s UNL to typically be derived automatically from its essential subsets rather than the other way around, and it is used only for bookkeeping and making some results about the algorithm easier to express.
If $S \in \mathcal{E}_i$ for some node $P_i$, we define $n_S = |S|$ and define two additional parameters, $t_S$ and $q_S$. These latter two parameters must always satisfy the following inequalities:

$$0 \le t_S < q_S \tag{1}$$

$$2 q_S > n_S + t_S \tag{2}$$

$$q_S \le n_S \tag{3}$$
Effectively, $t_S$ represents the maximum allowed number of actively Byzantine nodes in $S$ for guaranteeing safety, while $q_S$ represents the number of correct nodes in $S$ required for guaranteeing liveness. $t_S$ and $q_S$ can be specified by node operators individually for each $S$ as configuration parameters; however, if two essential subsets contain the same nodes but different values of $t_S$ or $q_S$, we consider them to be distinct essential subsets. Equation 1 is just parameter sanity; equation 2 enforces that unless more than $t_S$ nodes in $S$ are actively Byzantine, any two subsets of $q_S$ nodes in $S$ must intersect in some honest node, which is used to guarantee consistency; without equation 3, forward progress cannot be guaranteed to hold for any node listening to $S$ even when every single node is correct. Note that if $t_S < n_S/5$ and $q_S = \lceil 4 n_S / 5 \rceil$, then all of these inequalities hold.
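For concreteness, the constraints on an essential subset’s parameters can be checked mechanically. The following sketch is illustrative only; the exact form of the three inequalities (sanity, honest quorum intersection, and quorum attainability) is our reading of the surrounding text and should be treated as an assumption:

```python
def valid_parameters(n_s, t_s, q_s):
    """Check the constraints on an essential subset's parameters
    (the inequalities here are an assumption of this sketch):
      (1) sanity:              0 <= t_s < q_s
      (2) quorum intersection: 2*q_s > n_s + t_s
          (with at most t_s actively Byzantine nodes, any two sets of
           q_s nodes share an honest member)
      (3) attainability:       q_s <= n_s
          (a quorum is reachable when every node is correct)
    """
    sanity = 0 <= t_s < q_s
    intersection = 2 * q_s > n_s + t_s
    attainable = q_s <= n_s
    return sanity and intersection and attainable
```

For example, a subset of five nodes with a quorum of four and no tolerated safety faults satisfies all three conditions, while tolerating two safety faults with a quorum of only three violates the intersection condition.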
We make no implicit assumptions about the actual number of faulty nodes in any given essential subset , nor about the total number of faulty nodes in the network. Nor do we implicitly assume any common structure to the arrangement of the essential subsets between nodes. Instead, we will explicitly show which assumptions about the allowed Byzantine nodes and the allowed essential subset configurations are needed to guarantee each result. Doing this is useful because it turns out that certain properties like consistency require much weaker assumptions than other properties like liveness. In particular, we will show that consistency is actually a “local” property, which makes it very easy to analyze when consistency holds, and if the stronger assumptions required for liveness are ever violated, the network can at least eventually reconfigure itself to a new live configuration without having ever become inconsistent.
We call the problem we would like to solve democratic atomic broadcast, or DABC. DABC formalizes exactly the properties that are needed to implement a decentralized “governance layer” that can be used to agree in a fair and safe way on a set of protocol rules that evolves over time.
Formally, a protocol that solves DABC allows an arbitrary (but finite) number of proposers – whose identities may be unknown in advance or not universally agreed upon, and an arbitrary number of which can be Byzantine – to broadcast amendments to the network. Each node can choose to either support or oppose each amendment it receives, and then each node over time ratifies some of those amendments and assigns each ratified amendment an activation time, according to the following properties:
DABC-Agreement: If any correct node ratifies an amendment $A$ and assigns it the activation time $\tau$, then eventually every other correct node also ratifies $A$ and assigns it the activation time $\tau$.
DABC-Linearizability: If any correct node ratifies an amendment $A$ before ratifying some other amendment $A'$, then every other correct node ratifies $A$ before $A'$.
DABC-Democracy: If any correct node ratifies an amendment $A$, then for every correct node $P_i$ there exists some essential subset $S \in \mathcal{E}_i$ such that the majority of all honest nodes in $S$ supported $A$, and further supported $A$ being ratified in the context of all the amendments ratified before $A$.
DABC-Liveness: If all correct nodes support some unratified amendment $A$, then eventually some new amendment will be ratified.
DABC-Full-Knowledge: For every time $\tau$, a correct node can run a “waiting protocol” which always terminates in a finite amount of time, and afterwards know every amendment that will ever be ratified with an activation time before $\tau$.
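The interface implied by these five properties can be sketched as an abstract class; this is purely expository, and every name and type in it is our own invention rather than part of the protocol:

```python
# Illustrative-only sketch of a DABC node's interface; all names here
# are invented for exposition and are not taken from the protocol.
from abc import ABC, abstractmethod

class DABCNode(ABC):
    @abstractmethod
    def receive_amendment(self, amendment):
        """Called when a proposer broadcasts an amendment."""

    @abstractmethod
    def support(self, amendment):
        """Local policy: support or oppose a received amendment."""

    @abstractmethod
    def ratified(self):
        """Ratified amendments in ratification order (Linearizability),
        each paired with its activation time (Agreement)."""

    @abstractmethod
    def wait_until(self, t):
        """The Full-Knowledge 'waiting protocol': terminates in finite
        time and returns every amendment that will ever be ratified
        with an activation time before t."""
```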
We will expand on these properties in section 4.4.1 with the appropriate network conditions required for each individual property to hold. Although Agreement and Linearizability are clear and familiar from traditional atomic broadcast definitions, some explanation may be needed for the remaining three properties.
Democracy formalizes the idea that any amendment should be supported by a reasonable portion of the network. One might hope that Democracy could be strengthened to require that the majority of correct nodes in all of a node’s essential subsets must have supported the amendment. Unfortunately, since we don’t assume universal agreement on participants, it might not be possible for a node to wait until it knows that every essential subset of every correct node has sufficient support for the amendment, since there might be essential subsets that the node doesn’t know about. The Democracy condition we do use seems like a reasonable compromise, and additionally it implicitly weights a node’s voting power by the number of nodes that trust it. For example, if some essential subset is maintained by every single node, then that subset alone could potentially pass amendments, whereas a subset only maintained by a few nodes would need to work together with other subsets to pass amendments. The stronger Democracy property does hold in complete networks.
Most atomic broadcast algorithms use a “Validity” or “Censorship Resilience” property in place of Liveness that ensures a correct proposer (or client in the usual terminology) will eventually have its amendment (transaction) ratified (accepted). Unfortunately, this doesn’t work in our case, since not every proposer may be able to broadcast its transaction to the entire network, and further an amendment might become invalid if a contradictory amendment is ratified before it. The latter issue could be solved by removing invalidated amendments post-facto, but doing so would be unnecessarily inefficient with our protocol. Instead we use Liveness, which is equivalent to these stronger properties as long as the proposer can broadcast throughout the network and no amendments which contradict it are ratified first.
For plain transaction processing, Agreement and Linearizability are the only properties needed by an atomic broadcast algorithm to guarantee consistency. Amendment processing adds a further layer of complexity though: nodes need to start acting according to the specifications of a ratified amendment at some point. Very subtle and difficult-to-detect bugs could surface if two nodes are running different versions of a protocol due to asynchronous knowledge of the set of ratified amendments. We rectify this issue by guaranteeing Full Knowledge, which gives nodes a way to always synchronize their active amendments. Note though that for a globally distributed network, synchronized clocks can’t be assumed to exist, so each protocol built on top of a Cobalt network should first run consensus to agree on a starting time. Then every Cobalt node can agree on exactly which version of the protocol to run. This is done, for example, in the XRP Ledger by agreeing on a “ledger close time” for each block, which can be used as a starting time for the consensus protocol that agrees on the next block.
To model correctness of the algorithm, we consider a network adversary that is allowed to behave arbitrarily. The network adversary controls delivery of all messages as well as all Byzantine nodes. The only restrictions we place on the adversary are that it cannot break commonly accepted cryptographic protocols and that it eventually delivers every message sent between correct parties.
Due to the FLP result, a consensus algorithm (and in particular a DABC algorithm, which solves a special type of consensus) cannot be guaranteed to make forward progress in the presence of arbitrary asynchrony. Thus the established convention is to ensure that consistency holds even in the presence of arbitrary asynchrony, but to weaken the liveness property somehow. Two common variants are to assume liveness only holds during periods with stronger synchrony requirements, or to only make liveness hold eventually with probability 1.
The former technique seems unsuitable for a wide-area network whose success is critical. Regardless of the heuristic likelihood of an attack breaking liveness for an extended period of time, it would be best to be mathematically confident that such an attack is infeasible. Thus we opt for the latter option for Cobalt. Although older randomness-based consensus protocols use local random values to guarantee termination, these protocols are highly inefficient in practice, requiring either exponential expected time to terminate or asymptotically fewer tolerated faults. Newer protocols typically use a “cryptographic common coin” that uses threshold signatures to generate a common random seed that cannot be predicted in advance by a computationally bounded adversary. Cryptographic common coins are very efficient, but do not immediately extend to the open network model, where the notion of a “threshold” is undefined. We thus begin section 4.1 by defining and implementing a suitable adaptation to our model which is almost as efficient and sufficiently powerful to develop Cobalt.
3 Other Work
In complete networks where all nodes trust each other equally, there has been much research on Byzantine fault tolerant consensus algorithms, both weakly asynchronous and fully asynchronous. Notable examples include PBFT, SINTRA, Aardvark, and more recently Honeybadger. Most of these algorithms can be made democratic using a democratic modification of reliable broadcast similar to the one presented in section 4.2.2.
PBFT and Aardvark are both very fast and seem to have basic adaptations to our model, although the view change protocol requires some modification, since the cryptography it uses is not fully expressive in our model (for an idea of how these changes might look, see appendix A, where we develop a “view change” protocol that works in our model). However, leader-based algorithms like PBFT and Aardvark require agreement on a set of possible leaders, and if all of these leaders were to fail at once there would obviously be no way to guarantee forward progress, so these algorithms require stronger network assumptions than Cobalt. Additionally, neither of these protocols is guaranteed to make forward progress fully asynchronously, so they satisfy weaker properties than Cobalt. The protocol extension presented in appendix A, though, is loosely modeled after a simplified form of PBFT; to avoid the previously mentioned issue of needing an extra security assumption, we use Cobalt to agree on the set of possible leaders, so that even if every leader fails at once, Cobalt can eventually find new leaders to suggest transactions.
Meanwhile, adapting asynchronous leaderless algorithms like SINTRA and Honeybadger presents another difficulty in our model since we can’t assume any specific number of honest nodes are capable of reliably broadcasting, so the reduction to asynchronous common subset used in these algorithms doesn’t work. Adapting SINTRA seems especially difficult because of its significant use of threshold cryptography, for which it’s not clear what an adaptation to the open model would even look like.
Alchieri et al. designed an early attempt to weaken the complete-network restrictions of classical algorithms, resulting in a Byzantine consensus algorithm that works when not all nodes know the identities of all the participants. However, in their model every node is still trusted equally, so trying to use their algorithm in an open network would immediately allow a single entity to gain unreasonable control over the network, commonly known as a Sybil attack.
Schwartz et al. developed an algorithm that works in a model similar to ours. It guarantees safety based on “overlap conditions” that require that every pair of nodes trust enough nodes in common. Unfortunately, Chase and MacBrough later showed that the real safety condition is much tighter than originally thought, and further that the algorithm can get stuck in certain networks where two UNLs disagree by only a single node. Further, safety is a global condition: if two nodes have sufficient overlap with each other but some other nodes don’t have sufficient overlaps, then those two nodes might end up in inconsistent states anyway. This is problematic both from a usability perspective (checking safety requires checking $O(n^2)$ overlaps rather than $O(n)$ overlaps) and from a pragmatic perspective (my safety should not depend on the bad decisions of other nodes). Schwartz’s protocol is also only weakly asynchronous, and is not “robust”, in the sense that a small number of Byzantine nodes can prevent the protocol from ever terminating. In a live network where businesses depend on forward progress, this could be a serious problem.
More recently, Mazières described a novel protocol for solving consensus in incomplete networks. Mazières uses a network model which is similar to ours (in particular, the “quorum slices” of Mazières’s paper appear very similar to our definition of “essential subsets”; however, the way in which Mazières’s algorithm uses quorum slices to determine support is different from the way Cobalt uses essential subsets: in fact, the “quorum slices” in our model would be all the sets of nodes in $\text{UNL}_i$ whose intersection with every essential subset $S \in \mathcal{E}_i$ has size at least $q_S$) and enables very loosely-coupled network topologies to remain consistent by utilizing trust-transitivity to dynamically expand the set of nodes listened to for making decisions.
However, the concrete condition for safety is again a global condition, and seems very difficult to analyze in practice. Although the author provides a way to decide whether a given Byzantine fault configuration is safe for a given topology, the condition is difficult to check in networks where each node has many quorum slices, and further there is no obvious way to input a topology and get a clear metric of how tolerant it is to Byzantine faults. This could lead to building up under-analyzed, frail topologies that seem safe but spontaneously break as soon as a single Byzantine node starts behaving dishonestly. Mazières justifies the safety of the system by comparing it to the Internet, which is a robust system that similarly takes advantage of transitive connections. In practice though, the Internet suffers transient failures due to accidental misconfigurations relatively frequently. This is not a serious problem for the Internet, since it can only fail by temporarily losing connectivity; in contrast, a consensus network cannot be repaired after forking without potentially stealing money from honest actors. We therefore prefer an algorithm that is more restrictive but easier to analyze clearly; and regardless, if a node desires the greater flexibility of Mazières’ protocol, it can transitively add its peers’ essential subsets out-of-protocol and get the exact same benefits. Finally, Mazières’ protocol is again only weakly asynchronous and not robust.
In an attempt to resolve the inefficiency of proof-of-work, many decentralized currencies are moving towards proof-of-stake, in which a node’s “mining power” is tied to the amount of funds it locks up as collateral. Although traditional proof-of-stake algorithms only guarantee asymptotic consensus and so are not applicable to our problem definition (in particular, their safety depends on synchrony assumptions), another interesting avenue is to use a proof-of-stake algorithm to give nodes weighted voting power and develop a distributed consensus algorithm that is safe as long as enough of the total weighted voting power belongs to honest nodes. This idea is explored in Kwon’s Tendermint protocol. These protocols make decentralization easy because there is no fear of becoming inconsistent due to a misconfiguration, while avoiding Sybil attacks by tying voting power to a limited resource.
Tendermint is again not robust and requires weak asynchrony, but it seems likely that a fully asynchronous algorithm like SINTRA or Honeybadger could be adapted to this setting. However, assuming the system uses hierarchical threshold secrets in the sense proposed by Shamir for instantiating common coins, then making the set of possible voting power weights even moderately fine-grained would rapidly degrade the performance of the system, until just reconstructing a single coin value might take minutes to compute, regardless of how many participants the network has. Further, Tendermint-like protocols require listening to every node in the network, which quickly becomes inefficient in very large networks, and this is only made worse when trying to adapt to full asynchrony, which typically requires $O(n^2)$ messages to be exchanged to reach consensus.
Another issue is that stake in a system’s success is not necessarily correlated with understanding how best to improve the system. For verifying transactions – the use case Tendermint was designed for – it is easy to justify tying authority to stake, since the behavior that best benefits the system is obvious and undebatable: simply run the protocol exactly as specified. For application to a governance system however, it is entirely possible for actors with good intentions to make poor decisions about how the system should operate. By allowing participants to explicitly delegate who they believe to be trustworthy, Cobalt can give authority to those who are best at making good decisions for the future of the network, rather than those who are simply incentivized against attacking the network.
Perhaps most importantly though, using proof-of-stake for determining voting power would be a poor decision for the XRP Ledger, since at the time of writing this paper, Ripple the company owns a majority of the XRP in existence, putting a dangerous amount of authority in a single location. Although Ripple is highly incentivized not to abuse this power since a loss of faith in XRP could render Ripple’s XRP holdings worthless, if nothing else this gives hackers a single point of entry with which they could take over the entire network due to a careless human error.
4 The Cobalt Protocol
In this section we describe the details of Cobalt, a protocol that solves democratic atomic broadcast in the open network model presented in section 2. Before describing the full Cobalt protocol, we first detail certain lower-level primitives that are used as part of the Cobalt algorithm. Although most of these primitives are familiar tools in the complete network model, to the author’s knowledge they have not previously been adapted to fit our model, so we present novel instantiations of them. Since none of these protocols have been presented in our network model before, we prove by hand that every protocol is correct.
In all proofs, we make no implicit assumptions about the network connectivity or the number of Byzantine faults controlled by the adversary. If we need to assume some network connectivity or limitation on the tolerated Byzantine faults, we will state that assumption in the proposition.
Before delving into the protocols, we first develop some definitions and describe two mechanics that we use repeatedly in our protocols. These two mechanics underlie most of the basic techniques for developing consensus protocols in the complete network model, so adapting them to our model will allow us to easily adapt protocols for two of our lower level primitives, reliable broadcast and binary agreement.
Two nodes n_i and n_j are said to be linked if there is some essential subset S ∈ E_i ∩ E_j such that at most t_S nodes in S are actively Byzantine faulty. (Following the notation of section 2, E_i denotes the collection of essential subsets of node n_i, U_i denotes its UNL, and q_S and t_S denote the quorum threshold and fault tolerance associated with an essential subset S.) We say some property is local if the property holds between two nodes iff those two nodes are linked, regardless of whether any other nodes in the network are linked. Local properties are nice because they ensure that poorly configured nodes cannot harm correctly configured nodes. We will later prove that consistency is a local property, which we stress is very important for making the network topology easy to analyze. To the author’s knowledge, Cobalt is the first incomplete network consensus algorithm for which consistency is a local property; for instance, locality holds for neither the original XRP Ledger Consensus Protocol nor the protocol of Mazières.
Similarly, two nodes n_i and n_j are fully linked if there is some essential subset S ∈ E_i ∩ E_j such that at least q_S nodes in S are correct and at most t_S nodes in S are actively Byzantine faulty. Note that if |S| is greater than q_S, then we still allow up to |S| − q_S nodes to be faulty, as long as they are not actively Byzantine (e.g., they can be crashed). Also note that full linkage implies linkage. While linkage is important for consistency, full linkage is important for forward progress.
A node n_i is healthy if it is honest and at most t_S nodes in each of its essential subsets S ∈ E_i are not healthy. This definition can be made non-cyclical by considering a sequence of sets U_0, U_1, …, starting with U_0 as the set of actively Byzantine nodes and taking U_{k+1} to be the set of nodes with more than t_S nodes of some essential subset S in U_k, then taking the unhealthy nodes to be the union across the U_k. Healthy nodes are exactly the nodes that cannot be made to accept and/or broadcast random messages at the suggestion of actively Byzantine nodes. A node n_i is unblocked if it is healthy and correct, and at most |S| − q_S nodes in each of its essential subsets S ∈ E_i are not unblocked. Blocked nodes can be arbitrarily prevented from terminating by the Byzantine nodes.
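The iterative construction of the unhealthy set described above can be made concrete with a small sketch. The data structures and function names here are our own illustrative assumptions, not notation from the paper:

```python
# Toy sketch of the non-cyclic "unhealthy" computation: start from the
# actively Byzantine nodes and repeatedly mark any node that has too
# many already-unhealthy nodes in one of its essential subsets.

def unhealthy_nodes(byzantine, essential, t):
    """Compute the set of unhealthy nodes.

    byzantine: set of actively Byzantine node ids (the set U_0)
    essential: dict mapping node id -> list of (subset_of_node_ids, name)
    t: dict mapping subset name -> tolerance t_S
    """
    unhealthy = set(byzantine)
    changed = True
    while changed:                        # iterate to a fixed point
        changed = False
        for node, subsets in essential.items():
            if node in unhealthy:
                continue
            for S, name in subsets:
                if len(S & unhealthy) > t[name]:
                    unhealthy.add(node)   # too many unhealthy nodes in S
                    changed = True
                    break
    return unhealthy
```

The healthy nodes are then exactly the honest nodes not in the returned set; an analogous fixed-point computation works for the unblocked set with the threshold |S| − q_S.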
A node n_i is strongly connected if every pair of healthy nodes in U_i are fully linked with each other (we also say the set U_i itself is strongly connected). Strong connectivity represents the weakest equivalent of “global full linkage”: from n_i’s perspective, everyone in the network is fully linked. With a bit of effort, nonlocal properties can usually still be salvaged as only requiring strong connectivity rather than actually requiring that every pair of correct nodes in the network be fully linked. This is still somewhat nicer than requiring global full linkage, as at least no poorly configured nodes that you don’t know about can harm you.
The final definition we need is weak connectivity. A node n_i is weakly connected if n_i is fully linked with every healthy node in U_i. Weak connectivity is in general much easier to guarantee than strong connectivity, since it doesn’t place any requirements on how other pairs of nodes are fully linked with each other. Note though that strong connectivity only technically implies weak connectivity for healthy nodes. Generally weak connectivity is needed to guarantee that the network “treats you fairly” and doesn’t come to decisions that seem wrong to you based on what you receive from your essential subsets.
The following two lemmas provide the fundamental basis underpinning our algorithms.
Let n_i be any honest node, and let n_j be any correct node which is fully linked with n_i. Then if n_i receives some message m from at least q_S nodes in every essential subset S ∈ E_i, then eventually n_j will receive m from more than t_{S′} nodes in some essential subset S′ ∈ E_j.
Since n_i and n_j are fully linked, by definition there is some essential subset S ∈ E_i ∩ E_j with at least q_S correct nodes and at most t_S actively Byzantine nodes. Thus if n_i receives some message m from at least q_S nodes in every essential subset in E_i, then in particular it receives m from at least q_S nodes in S. At most |S| − q_S of these nodes could have been faulty, so at least

q_S − (|S| − q_S) = 2q_S − |S| > t_S

correct nodes in S must have broadcast m, where the bound on the number of faulty nodes uses the definition of full linkage and the last inequality uses equation 2. Since we assume that honest nodes can only communicate by sending the same message to everyone that listens to them, these correct nodes must have also sent m to n_j, so eventually n_j will receive m from more than t_S nodes in S. ∎
Let n_i be any correct node, and let n_j be any correct node which is linked to n_i. Then if n_i receives some message m from at least q_S nodes in every essential subset S ∈ E_i, then n_j cannot receive a message m′ that contradicts m from at least q_{S′} nodes in every essential subset S′ ∈ E_j.
By definition of linkage, there must be some S ∈ E_i ∩ E_j such that at most t_S nodes in S are actively Byzantine. By the same equations as in lemma 1 (minus the last bound, which requires full linkage), if n_i receives m from at least q_S nodes in S, then at least q_S − t_S of those nodes are honest, and by equation 2, q_S − t_S > |S| − q_S, so more than |S| − q_S honest nodes in S sent m. Since honest nodes cannot broadcast both m and m′, fewer than |S| − (|S| − q_S) = q_S nodes in S can send m′ to n_j. ∎
In light of the previous lemmas, we make two more definitions. A node n_i sees strong support for a message m if n_i receives m from at least q_S nodes in every essential subset S ∈ E_i. Similarly, n_i sees weak support for a message m if n_i receives m from more than t_S nodes in some essential subset S ∈ E_i.
Using these definitions, lemma 1 can be phrased as “fully linked nodes have enough overlap that if one node sees strong support then the other will eventually see weak support”, and lemma 2 can be phrased as “linked nodes have enough overlap that they cannot simultaneously both see strong support for contradictory messages”. It turns out that relating nodes in these two ways is enough to recover most of the techniques used in developing BFT algorithms in the complete network case, allowing us to easily adapt many algorithms to our model.
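As a toy illustration of these two support predicates (the function names and data structures here are assumed for illustration, not part of the protocol specification):

```python
# Toy model of the support predicates from this section. A message has
# strong support when a quorum from EVERY essential subset sent it, and
# weak support when more than t_S nodes of SOME essential subset did.

def strong_support(received, essential_subsets, q):
    """True iff the message came from >= q[S] nodes in every subset.

    received: set of node ids that sent the message
    essential_subsets: dict mapping subset name -> set of node ids
    q: dict mapping subset name -> quorum threshold q_S
    """
    return all(len(received & S) >= q[name]
               for name, S in essential_subsets.items())

def weak_support(received, essential_subsets, t):
    """True iff the message came from > t[S] nodes in some subset."""
    return any(len(received & S) > t[name]
               for name, S in essential_subsets.items())
```

With a single five-node essential subset, quorum 4, and tolerance 1, four senders give strong support while two senders give only weak support.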
4.1 Cryptographic Randomness
Before we can define the Cobalt protocol, one remaining piece needs to be developed. As mentioned at the end of section 2, Cobalt uses cryptography to generate common pseudorandom values that are unpredictable by the network adversary in order to sidestep the FLP result.
Let (Ω, μ) be a probability space with probability measure μ. We define a common random source or CRS to be a protocol where each node can sample at any time, and then eventually outputs some value o ∈ Ω, according to the following properties:
CRS-Consistency: If any honest node outputs o, then no honest node linked to it ever outputs o′ ≠ o.
CRS-Termination: If U_i is strongly connected and every unblocked node in U_i samples the CRS, then every unblocked node in U_i eventually produces an output.
CRS-Randomness: Suppose n_i is correct and weakly connected, at most t_S nodes in every essential subset S ∈ E_i are controlled by the adversary, and n_i eventually outputs o. Then for any value x produced by the adversary before any healthy node in U_i has sampled the CRS, with overwhelming probability Pr[o = x] ≤ μ({x}) + ε for negligible ε.
The last property formalizes the idea that the adversary cannot get a significantly better prediction of the random output than it would get by just randomly picking a value from Ω according to μ.
We postpone describing the concrete details of this protocol until appendix B.
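For intuition about the interface a CRS exposes, here is a deliberately naive stand-in. This is not the appendix B construction — a pre-shared seed offers none of the required unpredictability against adversarial insiders — it only sketches the sampling API that the later protocols assume:

```python
# Purely illustrative stand-in for a common random source: every node
# hashes a pre-shared seed together with the sampling round to obtain
# a shared pseudorandom bit. A real CRS must remain unpredictable even
# to participants until sampling time; this toy version does not.

import hashlib

def sample_crs(seed: bytes, r: int) -> int:
    """Return the shared bit for round r derived from the seed."""
    digest = hashlib.sha256(seed + r.to_bytes(8, "big")).digest()
    return digest[0] & 1
```

Every node holding the same seed computes the same bit for a given round, which models CRS-Consistency; the appendix B construction is what actually provides CRS-Randomness against the adversary.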
4.2 Reliable Broadcast
Reliable broadcast, or RBC, is a basic primitive that allows a specified broadcaster to send a message to the network, and guarantees that even if the broadcaster is Byzantine faulty, it must send the same message to every node. For the protocol definition, the broadcaster may or may not be a node within the network; however, when using RBC within Cobalt we only ever use it in the context where the broadcaster is a node in the network.
More formally, a reliable broadcast protocol is any protocol where a specified broadcaster entity n_b inputs an arbitrary message, and every node can accept some message, subject to the following properties:
RBC-Consistency: If any honest node accepts a message m, then no honest node linked to it ever accepts any message m′ ≠ m.
RBC-Reliability: If U_i is strongly connected and any healthy node in U_i accepts a message m, then every unblocked node in U_i eventually accepts m.
RBC-Validity: If n_b is honest and inputs the message m, then any healthy node that accepts a message must accept m.
RBC-Non-Triviality: If n_b is honest, inputs the message m, and can broadcast to every correct node in the network, then eventually every unblocked node will accept m.
Most researchers combine Consistency and Reliability into one property, but we keep them separate since the network assumptions needed for Consistency are so much weaker. Most researchers also combine Validity and Non-Triviality, since it is typically assumed that every node can broadcast to the entire network. Since in our network model we do not assume that all nodes have communication channels between them, n_b might be isolated from the rest of the network, so combining these properties doesn’t work.
In the complete network model, the canonical reliable broadcast protocol is due to Bracha. Our protocol is closely modeled after Bracha’s protocol, and behaves exactly the same in the complete network case.
The protocol begins by having n_b broadcast init(m) to everyone listening to it. After that, each node (including n_b, if n_b is a member of the network) runs the following protocol. (In our protocol descriptions, we use the underscore notation, as in echo(_), to refer to “any possible value”.)
1. Upon receiving an init(m) message directly from n_b, broadcast echo(m) if we have not yet broadcast echo(_).
2. Upon receiving weak support for echo(m), broadcast echo(m) if we have not yet broadcast echo(_).
3. Upon receiving strong support for echo(m), broadcast ready(m) if we have not yet broadcast ready(_).
4. Upon receiving weak support for ready(m), broadcast ready(m) if we have not yet broadcast ready(_).
5. Upon receiving strong support for ready(m), accept m.
When multiple instances of reliable broadcast might be running at the same time, we tag each message with a unique instance id to differentiate them.
Step 2 is not technically necessary, but it makes it somewhat easier to reliably broadcast to the network. Note that since we assume that every message is cryptographically signed by the sender, if we also include the public key of n_b (which may not be known to all nodes) in the instance tag, then in step 1 we could actually broadcast echo(m) even if we only receive init(m) from a single node, as long as we also include n_b’s signature with it. This would make it even easier for nodes to reliably broadcast to the network. The only security risk of allowing more nodes to reliably broadcast is the possibility of allowing spam to congest the network; since spammers can eventually be excluded, there is little value in trying to make it harder for nodes to reliably broadcast.
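The message handlers above can be sketched as follows, with the support predicates and network layer stubbed out (all names here are illustrative assumptions, not part of the specification):

```python
# Sketch of the reliable-broadcast handlers (echo and ready phases).
# strong/weak are callbacks testing strong or weak support for a
# labeled message; broadcast sends to everyone listening to this node.

class RBCNode:
    def __init__(self, broadcast_fn, strong_support_fn, weak_support_fn):
        self.broadcast = broadcast_fn
        self.strong = strong_support_fn    # strong(label, m) -> bool
        self.weak = weak_support_fn        # weak(label, m) -> bool
        self.sent_echo = None              # message we echoed, if any
        self.sent_ready = None             # message we readied, if any
        self.accepted = None

    def on_init(self, m):
        # Step 1: echo the broadcaster's message, at most once.
        if self.sent_echo is None:
            self.sent_echo = m
            self.broadcast(("echo", m))

    def on_message_progress(self, m):
        # Steps 2-5: re-evaluate support thresholds for message m.
        if self.sent_echo is None and self.weak("echo", m):
            self.sent_echo = m
            self.broadcast(("echo", m))
        if self.sent_ready is None and (self.strong("echo", m)
                                        or self.weak("ready", m)):
            self.sent_ready = m
            self.broadcast(("ready", m))
        if self.accepted is None and self.strong("ready", m):
            self.accepted = m
```

In a real deployment the support callbacks would track counts of signed messages per essential subset, as in the support definitions earlier in this section.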
Reliable broadcast can be split into two phases: the “echo” phase and the “ready” phase, distinguished by the labels on the messages from each phase. Roughly speaking, the echo phase serves to guarantee that everyone accepts the same message (consistency), while the ready phase guarantees that if anyone accepts a message then so does everyone else (reliability).
Suppose two correct nodes n_i and n_j are linked and they accept the messages m and m′, respectively. Then m = m′.
Although consistency is local as the previous proposition shows, unfortunately the stronger property of reliability is not local.
Suppose U_i is strongly connected and two healthy nodes in U_i broadcast ready(m) and ready(m′), respectively. Then m = m′.
By steps 3 and 4 of the reliable broadcast protocol, an honest node can only broadcast ready(m) for some message m if either it received strong support for echo(m), or it received weak support for ready(m). In the latter case, if the node is healthy then the weak support implies in particular that some healthy node in one of its essential subsets broadcast ready(m) before it did. Since there are only a finite number of nodes in the network, there must exist some healthy node in U_i that broadcast ready(_) before any other healthy node in its UNL. In particular, that node must have broadcast its ready message due to having received strong support for the corresponding echo message.
Thus if two healthy nodes in U_i broadcast ready(m) and ready(m′), respectively, then we can assume that there are two healthy nodes n_j, n_k ∈ U_i such that n_j received strong support for echo(m) while n_k received strong support for echo(m′). Since U_i is strongly connected by assumption, n_j and n_k are linked, so by lemma 2, m = m′. ∎
If U_i is strongly connected and any healthy node n_j ∈ U_i accepts the message m, then every unblocked node in U_i will eventually accept m.
Since every pair of healthy nodes in U_i are fully linked by assumption, if n_j accepts m then by lemma 1, eventually every unblocked node in U_i will see weak support for ready(m). By lemma 4, no healthy node in U_i can have previously broadcast ready(m′) for any m′ ≠ m, so by step 4 of the RBC protocol, eventually every healthy and correct node in U_i broadcasts ready(m). In particular, for any unblocked node n_k ∈ U_i, every healthy and correct node in each essential subset S ∈ E_k eventually broadcasts ready(m), so n_k eventually receives strong support for ready(m). Thus n_k accepts m by step 5 of the protocol. ∎
If n_b is honest, then no healthy node can accept a message not broadcast by n_b.
This follows from a simple analysis of the protocol, by noting that a healthy node can’t broadcast echo(m) without either receiving init(m) from n_b or receiving echo(m) from another healthy node. Thus if n_b only broadcasts m, then no healthy node will broadcast echo(m′) for any m′ ≠ m. By similar logic, no healthy node will broadcast ready(m′) for any m′ ≠ m, so no healthy node will ever see enough ready(m′) messages to accept m′. ∎
If n_b is correct, inputs the message m, and can broadcast to every correct node in the network, then eventually every unblocked node will accept m.
Since every node can receive m from n_b, every healthy and correct node will broadcast echo(m), so eventually every healthy and correct node will broadcast ready(m), so eventually every unblocked node will accept m. ∎
The RBC protocol defined in section 4.2.2 satisfies the properties of a reliable broadcast algorithm in the open network model.
4.2.4 Democratic Reliable Broadcast
We will also find useful a slight variation on RBC called democratic reliable broadcast or DRBC.
A DRBC protocol is similar to RBC except it allows nodes to choose whether to support or oppose messages that are broadcast, and replaces non-triviality with the following properties:
DRBC-Democracy: If any healthy node n_i is weakly connected and accepts the message m, then there exists some essential subset S ∈ E_i such that the majority of all honest nodes in S supported m.
DRBC-Censorship-Resilience: If a correct broadcaster n_b can broadcast to every correct node in the network, and all correct nodes support its message m, then eventually every unblocked node will accept m.
One can easily transform the above RBC protocol into a DRBC protocol by specifying that each node only broadcasts an echo(m) message iff it supports m (note though that a node may still need to broadcast ready(m) even if it doesn’t support m).
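The modification amounts to a small guard on the echo step, which can be sketched as follows (the function and parameter names are illustrative assumptions):

```python
# Sketch of the DRBC tweak: only echo messages that this node supports.
# The surrounding RBC machinery (support tracking, broadcast) is stubbed.

def maybe_echo(m, supports, sent_echo, broadcast):
    """Echo m only if this node supports it (the DRBC restriction).

    supports: predicate giving this node's vote on the message
    sent_echo: the message already echoed, or None
    Returns the (possibly updated) sent_echo value.
    """
    if sent_echo is None and supports(m):
        broadcast(("echo", m))
        return m
    return sent_echo

# Note: ready messages are NOT filtered this way. Once enough other
# nodes have echoed m, a node must still relay ready(m) even if it
# opposes m, so that reliability is preserved.
```

The asymmetry between the filtered echo step and the unfiltered ready step is exactly what lets the democracy property coexist with reliability.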
If any healthy node n_i is weakly connected and accepts the message m, then there is some essential subset S ∈ E_i such that the majority of honest nodes in S supported m.
If any healthy node broadcasts ready(m), there must have been a healthy node n_j that was the first healthy node to broadcast ready(m). Then n_j must have seen strong support for echo(m). By weak connectivity, n_i and n_j are fully linked (and in particular, linked), so there must be some essential subset S ∈ E_i ∩ E_j such that at least q_S − t_S honest nodes in S broadcast echo(m), while at most |S| − q_S honest nodes in S did not broadcast echo(m). By equation 2, q_S − t_S > |S| − q_S, so the majority of honest nodes in S must have supported m. ∎
The above modification of the protocol defined in section 4.2.2 satisfies the properties of a democratic reliable broadcast algorithm in the open network model.
Consistency, reliability, and validity all still hold with the modified algorithm, since none of the proofs for those properties in theorem 8 assume that any nodes are guaranteed to broadcast an echo message. Democracy is proven in proposition 9.
The proof of Censorship Resilience is identical to the proof of RBC-Non-Triviality, since if every correct node supports m then eventually every healthy and correct node will broadcast echo(m). ∎
4.3 Binary Agreement
The other low level primitive we need is asynchronous binary Byzantine agreement or ABBA. ABBA is the most basic consensus primitive: every node inputs some bit, and then all the nodes agree on a single bit that was input by some honest node.
More formally, an ABBA protocol allows each node to input a single bit, and then every node outputs a single bit according to the following properties:
ABBA-Consistency: Two honest, linked nodes cannot output different values.
ABBA-Termination: If U_i is strongly connected and every unblocked node in U_i provides some input to the algorithm, then eventually every unblocked node in U_i terminates with probability 1.
ABBA-Validity: If any unblocked node outputs v, then some unblocked node must have input v.
The above definition of Validity is common in the complete network model, but it turns out to be too weak for our purposes. Indeed, an algorithm that only satisfies the above Validity property could decide v even if some totally isolated honest node were the only node that voted v. We thus actually need a stronger notion of validity to guarantee correctness of Cobalt:
ABBA-Strong-Validity: If any unblocked node n_i outputs v, then there is some chain of unblocked nodes n_{k_0}, n_{k_1}, …, n_{k_l} = n_i, where n_{k_{a−1}} ∈ U_{k_a} for all 0 < a ≤ l, and the node n_{k_0} input v.
Although rather awkward, the Strong Validity property turns out to be just strong enough for our purposes.
Our ABBA protocol is based on a binary agreement protocol designed for complete networks by Mostéfaoui et al. The protocol by Mostéfaoui et al. is fully asynchronous and uses a CRS in the form of a “common coin”. It takes longer on average to terminate compared to an earlier protocol in the same model developed by Cachin et al.; unfortunately, it seems impossible to develop a simple adaptation of Cachin et al.’s protocol, since the cryptographic proofs it uses to justify messages don’t seem to work in our model. (Of course, threshold signatures as used in Cachin et al.’s original specification don’t work in our model. But even replacing threshold signatures with multisignatures, if a node n_i broadcasts a “main message” voting v after seeing valid “pre messages” voting v from every essential subset S ∈ E_i, then because not all nodes know each other’s essential subsets, the validity proof attached to this main message only proves to a recipient n_j that some nodes sent valid pre messages voting v to n_i; n_j still doesn’t know whether there might be some node n_k for which no essential subset sent valid pre messages voting v. Thus a Byzantine node could send opposite valid main messages to two nodes that don’t know about each other, and guarantee that they never agree.)
For the protocol, we use a sequence of common random sources CRS_r, one for every round r ≥ 0, each of which samples uniformly from {0, 1}.
The protocol works as follows, run from the perspective of a node n_i:
1. Upon receiving weak support for finish(b) for some binary value b, broadcast finish(b) if we haven’t yet broadcast finish(_).
2. Upon receiving strong support for finish(b), output b and terminate.
3. Set values_r = ∅ for all r. Upon providing an input value v, set est_0 = v and r = 0.
4. Broadcast init_r(est_r).
5. Upon receiving weak support for init_r(v), broadcast init_r(v).
6. Upon receiving strong support for init_r(v), add v to values_r and broadcast aux_r(v) if we have not already broadcast aux_r(_).
7. For every essential subset S ∈ E_i, wait until there exists some subset T ⊆ S, such that |T| ≥ q_S and from every node in T we received aux_r(v) for some v ∈ values_r (possibly different for different nodes in T). Then broadcast conf_r(values_r).
8. For every essential subset S ∈ E_i, wait until there exists some subset T ⊆ S, such that |T| ≥ q_S and from every node in T we received conf_r(V) for some V ⊆ values_r (possibly different for different nodes in T).
9. Sample from CRS_r and place its value in c_r.
10. If values_r = {0, 1}, then set est_{r+1} = c_r. If values_r = {v} for some single value v, then set est_{r+1} = v; if in fact v = c_r, then additionally broadcast finish(v) if we have not yet broadcast finish(_).
11. Set r = r + 1 and return to step 4.
The above protocol is defined asynchronously, so that once you get to some step in the protocol you keep running that step forever if its logic has not been satisfied by the time you get to the next step. So, for instance, the logic involving the finish messages in steps 1 and 2 should be continuously checked even after you get to the later steps.
The original protocol of Mostéfaoui et al. did not use the finish messages or the conf messages. The finish messages are necessary for guaranteeing that consistency is a local property. The conf messages are necessary because our definition of a CRS is weaker than a true common coin as assumed in the original protocol. The use of conf messages in step 8 ensures that if any node gets to step 10 with values_r = {v}, then the value of v is practically independent of the value of c_r.
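The end-of-round logic just discussed can be sketched as follows (a toy sketch; the function and variable names are our own illustrative assumptions, not the protocol specification):

```python
# Toy sketch of the end-of-round estimate update: how the next-round
# estimate is chosen and when a finish message is broadcast.

def end_of_round(values, coin, sent_finish, broadcast):
    """Return the estimate for the next round.

    values: set of binary values confirmed this round ({0}, {1}, or {0, 1})
    coin: the bit c_r sampled from this round's common random source
    sent_finish: True if this node has already broadcast a finish message
    """
    if values == {0, 1}:
        # Both values survived the round: fall back on the common coin.
        return coin
    (v,) = values            # exactly one value survived
    if v == coin and not sent_finish:
        # Estimate and coin agree: signal readiness to decide.
        broadcast(("finish", v))
    return v
```

A node then outputs v only once it sees strong support for the corresponding finish message, per the first two steps of the protocol.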
If two honest nodes n_i and n_j are linked, then they cannot output different binary values.
The above proposition shows why we use the finish messages. Note that the part of the protocol involving the finish messages is not present in Mostéfaoui et al.’s algorithm. The original version instead has nodes that get values_r = {v} for some round r wait until they sample some CRS_{r′} with r′ ≥ r that returns v. This change is not fundamental to the open network model (indeed, the original version works fine in our model, and our version works fine in Mostéfaoui et al.’s model). However, as shown in proposition 11, adding the finish messages makes agreement a local property, which is a great bonus in the open network model. Thus we prefer the modified version, even though it incurs an extra communication round. Without the finish message step, the above proposition does not hold, since nodes can realize ABBA has terminated in different rounds, and unlinked nodes in a late terminator’s UNL can shift their opinions to the opposite value after the earlier node has already terminated.
We now move on to proving termination and validity. These properties are significantly more involved than agreement, so we try to break the proofs into the smallest chunks possible.
Each round of the binary agreement protocol described in section 4.3.2 breaks roughly into three phases. Similar to the case of RBC, the phases can be divided by the labels on the messages involved in each phase: the first phase is the “initialization” phase, comprising steps 5 and 6, which involve the init messages; the second phase is the “auxiliary” phase in steps 6 and 7, which involves the aux messages; the third phase is the “confirmation” phase in steps 7 and 8, which involves the conf messages.
We begin by proving lemmas representing the correctness of the initialization phase.
If n_i is unblocked and adds v to values_r, then there is some chain of unblocked nodes n_{k_0}, n_{k_1}, …, n_{k_l} = n_i, where n_{k_{a−1}} ∈ U_{k_a} for all 0 < a ≤ l, and the node n_{k_0} began round r with est_r = v.
If n_i adds v to values_r, then certainly some unblocked node must have broadcast init_r(v), by the logic in step 6 for adding a value to values_r. But an unblocked node only broadcasts init_r(v) if either its own estimate satisfies est_r = v, or there was some unblocked node in its UNL that broadcast init_r(v) before it did. By repeating, we successively build up the chain of unblocked nodes until we eventually reach some unblocked node that had est_r = v: since the network is finite, at some point we must reach an unblocked node that sent init_r(v) before any other unblocked node in its UNL. ∎
If U_i is strongly connected and any healthy node in U_i adds v to values_r, then every unblocked node in U_i will eventually add v to values_r.
Identical to the proof of proposition 5. ∎
If U_i is strongly connected, every unblocked node in U_i gets to step 4 for round r, and no unblocked nodes in U_i terminate in round r, then eventually every unblocked node in U_i adds some value to values_r.
For convenience, given some essential subset S, define the majority input v_S to be the binary value set as est_r by the majority of unblocked nodes in S. Then once all these unblocked nodes get to step 4 in round r, if any unblocked node n_j listens to S there must be at least q_S unblocked nodes in S, so n_j will eventually receive init_r(v_S) messages from more than t_S nodes in S, causing n_j to broadcast init_r(v_S) according to the condition in step 5.
Let n_j be some unblocked node, and suppose first that every essential subset S ∈ E_j has the same majority input v. Then since U_i is strongly connected, n_j is fully linked with every unblocked node in U_i, so by the preceding paragraph eventually every unblocked node in each essential subset S ∈ E_j broadcasts init_r(v). Thus n_j adds v to values_r in step 6, and by lemma 13 every unblocked node also eventually adds v to values_r.
It remains to show the case where every unblocked node in U_i maintains two essential subsets with differing majority inputs. But in this case, by the first paragraph, every unblocked node in U_i eventually broadcasts both init_r(0) and init_r(1). Thus every unblocked node eventually adds both 0 and 1 to values_r. ∎
Note that in the previous lemma the reason why we needed to specify that no unblocked nodes in U_i terminate in round r is because a node can possibly terminate at any time if it receives enough finish messages, and therefore stop participating before adding a value to values_r.
We now move on to the auxiliary phase.
If U_i is strongly connected, every unblocked node in U_i gets to step 4 for round r, and no unblocked nodes in U_i terminate in round r, then every unblocked node in U_i eventually progresses to step 8 in round r.
By lemma 14, eventually every unblocked node in U_i broadcasts an aux message in round r. Further, by lemma 13, if any unblocked node broadcasts aux_r(v) then eventually every unblocked node adds v to values_r. Thus for any unblocked node n_j ∈ U_i, every unblocked node in each essential subset S ∈ E_j will broadcast aux_r(v) for some v which is eventually added to values_r, so eventually n_j can progress to step 8, since there are at least q_S unblocked nodes in every essential subset S ∈ E_j. ∎
Finally, we make three quick lemmas about the confirmation phase.
If U_i is strongly connected, every unblocked node in U_i gets to step 4 for round r, and no unblocked nodes in U_i terminate in round r, then every unblocked node in U_i eventually progresses to step 9 in round r.
Identical to the proof of lemma 15. ∎
The final lemma for this phase shows why the confirmation phase is needed. It prevents the adversary from “gaming” the CRS by learning the value it returns in advance and using that information to artificially coordinate the system to prevent termination.
If U_i is strongly connected and some healthy node n_j ∈ U_i progresses to step 10 in round r with values_r = {v}, then Pr[c_r = v] ≤ 1/2 + ε for some negligible ε.
In order for n_j to progress to step 10 in round r with values_r = {v}, n_j must have received strong support for conf_r({v}). By strong connectivity of U_i, any healthy node that samples CRS_r in step 9 must have done so after receiving conf_r messages from some healthy node in U_i. By lemma 15, it cannot be the case that one healthy node in U_i broadcast conf_r({v}) while another healthy node broadcast conf_r({v′}) with v′ ≠ v; thus the value of v must have been determined before any healthy node sampled CRS_r. Since CRS_r samples uniformly from {0, 1}, by CRS-Randomness Pr[c_r = v] ≤ 1/2 + ε for negligible ε. ∎
We need two more quick lemmas that don’t tie into either of the above “phases”, but rather deal with the correctness of the overall algorithm.
If n_i is unblocked and outputs the value v, then there is some chain of unblocked nodes n_{k_0}, n_{k_1}, …, n_{k_l} = n_i, where n_{k_{a−1}} ∈ U_{k_a} for all 0 < a ≤ l, and the node n_{k_0} broadcast finish(v) due to the logic in step 10.
Identical to the proof of lemma 12. ∎
If U_i is strongly connected, and in some round r a healthy node n_j ∈ U_i gets to step 10 with values_r = {v}, where v = c_r is the value obtained from the random oracle CRS_r, then for every r′ > r, any healthy node that begins round r′ does so with est_{r′} = v.
Suppose that in some round r′ ≥ r, every healthy node that begins round r′ does so with est_{r′} = v. By taking the contrapositive of lemma 12, one finds that every healthy node that gets to step 10 in round r′ must do so with values_{r′} = {v}. Thus every healthy node that begins round r′ + 1 does so with est_{r′+1} = v.
Therefore, by induction it suffices to show that if in some round r a healthy node n_j gets to step 10 with values_r = {v} and v = c_r, then every healthy node that begins round r + 1 does so with est_{r+1} = v. But by lemma 15 and the assumption that U_i is strongly connected, any healthy node that gets to step 10 in round r must do so with either values_r = {v} or values_r = {0, 1}. In the former case, it continues to round r + 1 with est_{r+1} = v. In the latter case, it takes the value obtained from CRS_r as est_{r+1}; but by CRS-Consistency, CRS_r outputs the same random value v to it as to n_j, so again it continues to round r + 1 with est_{r+1} = v. ∎
Now with all of those lemmas out of the way, we can finally prove the correctness of the overall algorithm.
If n_i is unblocked and outputs v, then there is some chain of unblocked nodes n_{k_0}, n_{k_1}, …, n_{k_l} = n_i, where n_{k_{a−1}} ∈ U_{k_a} for all 0 < a ≤ l, and the node n_{k_0} input v.
We work backwards from n_i to extend the chain until it reaches an unblocked node that input v.