Cross-Chain State Machine Replication

06/14/2022
by Yingjie Xue, et al.

This paper considers the classical state machine replication (SMR) problem in a distributed system model inspired by cross-chain exchanges. We propose a novel SMR protocol adapted for this model. Each state machine transition takes O(n) message delays, where n is the number of active participants, of which any number may be Byzantine. This protocol makes novel use of path signatures to keep replicas consistent. This protocol design cleanly separates application logic from fault-tolerance, providing a systematic way to replace complex ad-hoc cross-chain protocols with a more principled approach.


1. Introduction

In the state machine replication (SMR) problem, a service, modeled as a state machine, is replicated across multiple servers to provide fault tolerance. SMR has been studied in models of computation subject to crash failures (Lamport1998; OngaroO2014) and Byzantine failures (PBFT), in both synchronous (synchot) and asynchronous (duan2018beat) timing models.

This paper proposes an SMR protocol for a model of computation inspired by, but not limited to, transactions that span multiple blockchains. The service’s state is replicated across multiple automata. These replicas model smart contracts on blockchains: they are trustworthy, responding correctly to requests, but passive, meaning they undergo state changes only in response to outside requests. Like smart contracts, replicas cannot communicate directly with other replicas or observe their states. Active agents initiate replica state changes by communicating with the replicas over authenticated channels. Agents model blockchain users: any number of them may be Byzantine, eager to cheat other agents in arbitrary (but computationally-bounded) ways.

This SMR protocol guarantees safety, meaning that Byzantine agents cannot victimize honest agents, and liveness, meaning that if all agents are honest, then all replicas change state correctly.

Although cross-chain SMR and conventional SMR have (essentially) the same formal structure, their motivations differ in important ways. Conventional SMR embraces distribution to make services fault-tolerant. By contrast, individual blockchains are already fault-tolerant. Instead, cross-chain SMR is motivated by the need for interoperability across multiple independent chains. For example, suppose Alice and Bob have euro accounts on a chain run by the European Central Bank, and dollar accounts on a chain run by the Federal Reserve. They agree to a trade: Alice will transfer some euros to Bob if Bob transfers some dollars to Alice. Realistically, however, Alice and Bob will never be able to execute their trade on a single chain because the dollar chain and the euro chain will always be distinct for political reasons. They could use an ad hoc cross-chain swap protocol (Herlihy2018; tiersnolan), but an SMR protocol has a cleaner structure, and generalizes more readily to more complex exchanges. So Alice and Bob codify their trade as a simple, centralized state machine that credits and debits their accounts. They place a state machine replica on each chain, and execute the trade through an SMR protocol that keeps those replicas consistent. While both replicas formally execute the same steps, the euro chain replica actually transfers the euros, the dollar chain replica actually transfers the dollars, and the SMR protocol ensures these transfers happen atomically.

A conceptual benefit of SMR over ad-hoc protocols is separation of concerns. Expressing a complex financial exchange as a (non-distributed) state machine frees the protocol designer to focus on the exchange’s incentives, payoffs, and equilibria, without simultaneously having to reason about timeout durations or missing or corrupted communication.

Prior SMR protocols assume some fraction of the participants (usually more than one-half or two-thirds) to be non-faulty. By contrast, for cross-chain applications it does not make sense to assume a limit on the number of Byzantine agents. Instead, this model’s SMR protocol protects agents who honestly follow the protocol from those who don’t, all while ensuring progress when enough agents are honest.

This paper makes the following contributions. We are the first to consider the classical SMR coordination problem in a distributed system model inspired by cross-chain exchanges. The model itself is a formalization of models implicit in earlier, more applied works (bitcoin; GarayKL2015; Herlihy2018; HerlihyLS2021). Fundamental coordination problems in this model have received little formal analysis. We propose a novel SMR protocol adapted for this model. Each state machine transition takes $O(n)$ message delays, where $n$ is the number of agents, of which any number may be Byzantine. This protocol makes novel use of path signatures (Herlihy2018) to keep replicas consistent. This SMR structure cleanly separates application logic from fault-tolerance, providing a systematic way to replace complex ad-hoc cross-chain protocols with a more principled approach.

This paper is organized as follows. Section 2 describes the cross-chain model of computation, Section 3 gives examples of automata representing various kinds of cross-chain exchanges, Section 4 describes our cross-chain SMR protocol, Section 5 discusses optimizations and extensions, and Section 6 surveys related work.

2. Model of Computation

Our model is motivated by today’s blockchains and smart contracts, but it does not assume any specific blockchain technology, or even blockchains as such. Instead, we focus on computational abstractions central to any systematic approach to exchanges of value among untrusting agents, whatever technology underlies the shared ledger.

The system consists of a set of communicating automata. An automaton is either an active, untrusted agent, or a passive, trusted replica. An agent automaton models a blockchain client such as a person or an organization. Agents are untrusted because they model untrusted blockchain clients. A replica automaton models a smart contract (or contract), a chain-resident program that manipulates ledger state. Contract code and state are public, and that code is reliably executed by validators who reach consensus on each call. Replicas are trusted because they model trusted contracts.

Reflecting the limitations of today’s blockchains, agents communicate only with replicas (clients can only call contract functions), and replicas do not communicate with other replicas (contracts on distinct chains cannot communicate). Replica $A$ can learn of a state change at replica $B$ only if some agent explicitly informs $A$ of $B$’s new state. Of course, $A$ must decide whether that agent is telling the truth.

Like prior work (Herlihy2018; HerlihyLS2021; tiersnolan; ZakharyAE2019), we assume a synchronous network model where communication time is known and bounded. There is a time bound $\Delta$, such that when an agent initiates a state change at a replica, that change will be observed by all agents within time $\Delta$. In our pseudocode examples, a clock function returns the current time. We do not assume clocks are perfectly synchronized, only that clock drifts are kept small in comparison to $\Delta$.

We make standard cryptographic assumptions. Each agent has a public and a private key, with public keys known to all. Messages are signed so they cannot be forged, and they include single-use labels (“nonces”) so they cannot be replayed.

The agents participating in an exchange agree on a common protocol: rules that dictate when to request replica state changes. Instead of distinguishing between faulty and non-faulty agents, as in classical SMR models, we distinguish only between compliant (i.e. honest) agents who honestly follow the common protocol, and deviating (i.e. Byzantine) agents who do not. Unlike prior SMR models, which require some fraction of the agents to be compliant, we tolerate any number of Byzantine agents. (If all agents are Byzantine, then correctness becomes vacuous.)

3. State Machines

Because applications such as cross-chain auctions or swaps are typically structured as multi-step protocols where agents take turns transferring assets in and out of escrow accounts (HerlihyLS2021; tiersnolan; ZakharyAE2019), the state machine is structured as a multi-agent game. For simplicity, agents make moves in round-robin order. (In practice, agents can sometimes skip moves or move concurrently.)

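Formally, a game $G$ is defined by a decision tree

$$G = (A, M, S, F, \mathit{enabled}, \mathit{player}, \mathit{succ}, \mathit{util}),$$

where $A$ is a set of agents, $M$ is a set of moves, $S$ is a set of non-final states, and $F$ is a set of final states disjoint from $S$. $S$ includes a distinguished initial state $s_0$. The function $\mathit{enabled}: S \to 2^M$ defines which moves are enabled at each non-final state, $\mathit{player}: S \to A$ defines which agent chooses the next move at each non-final state, and $\mathit{succ}: S \times M \to S \cup F$ defines which state is reached following a move in a non-final state. This successor function induces a tree structure on states: for $s, s' \in S$ and $\mu, \mu' \in M$, if $\mathit{succ}(s, \mu) = \mathit{succ}(s', \mu')$ then $s = s'$ and $\mu = \mu'$. Finally, the utility function

$$\mathit{util}: F \to \mathbb{R}^{|A|}$$

is given by a vector of real-valued functions on final states, indexed by agent:

$$(\mathit{util}_a: F \to \mathbb{R} \mid a \in A).$$

For each agent $a$ and state $f \in F$, $\mathit{util}_a(f)$ measures $a$’s preference for $f$ compared to its preference for the initial state. Informally, $\mathit{util}_a(f)$ is negative for states where $a$ ends up “worse off” than it started, positive for states where $a$ ends up “better off”, and zero for states where $a$ is indifferent. An execution from state $s_0$ is a sequence $(s_0, a_1, \mu_1, s_1, \ldots, a_k, \mu_k, s_k)$ where each $s_{i-1} \in S$, $a_i = \mathit{player}(s_{i-1})$, $\mu_i \in \mathit{enabled}(s_{i-1})$, and $s_i = \mathit{succ}(s_{i-1}, \mu_i)$. We divide executions into rounds: move $\mu_i$ takes place at round $i$. Game trees are finite and deterministic, hence so are executions.

Not all game trees make sense as abstract state machines. We are not interested in games like chess or poker where one agent’s gain is another agent’s loss. Instead, we are interested in games where all agents stand to gain. A protocol $\Pi$ is a rule for choosing among enabled moves: $\Pi(s) \in \mathit{enabled}(s)$ for each non-final state $s$. As mentioned, agents that follow the protocol are compliant, while those who do not are deviating. More precisely, agent $a$ is compliant in an execution if it follows the protocol: if $a_i = a$, then $\mu_i = \Pi(s_{i-1})$. An execution is compliant if every agent follows the protocol: for all $i$, $\mu_i = \Pi(s_{i-1})$.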

A mutually-beneficial protocol guarantees:

  • Liveness: Every compliant execution leads to a final state $f$ where $\mathit{util}_a(f) > 0$ for all agents $a$.

  • Safety: Every execution in which agent $a$ is compliant leads to a final state $f$ where $\mathit{util}_a(f) \geq 0$.

The first condition says that if all agents are compliant, they all end up strictly better off. The second says that a compliant agent will never end up worse off, even if others deviate. Establishing these properties is the responsibility of the game designer, and preserving them is the responsibility of the SMR protocol.
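The two conditions can be checked mechanically on a small game tree. The following is a minimal sketch, not the paper's formalism: the `Node` encoding, the `check` helper, and both example games are hypothetical and chosen only for illustration. Each non-final node records who moves and which move the protocol prescribes, and each final node carries per-agent utilities.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One state of a hypothetical game tree."""
    player: str = ""             # agent who moves here (non-final states only)
    protocol_move: str = ""      # move the protocol prescribes at this state
    children: dict = field(default_factory=dict)  # move name -> successor Node
    util: Optional[dict] = None  # agent -> utility (final states only)


def leaf(**util):
    """Build a final state from per-agent utilities."""
    return Node(util=util)


def executions(node, path=()):
    """Yield (steps, utilities) for every execution from node to a final state."""
    if node.util is not None:
        yield path, node.util
        return
    for move, child in node.children.items():
        # Each step records who moved and whether the move followed the protocol.
        step = (node.player, move == node.protocol_move)
        yield from executions(child, path + (step,))


def check(root, agents):
    """Return (liveness, safety) for the game tree rooted at root."""
    live, safe = True, True
    for path, util in executions(root):
        compliant = {a: all(ok for p, ok in path if p == a) for a in agents}
        if all(compliant.values()):
            live = live and all(util[a] > 0 for a in agents)
        safe = safe and all(util[a] >= 0 for a in agents if compliant[a])
    return live, safe


AGENTS = ("alice", "bob")

# Escrowed swap: nothing changes hands unless both agents agree.
escrowed = Node("alice", "agree", {
    "agree": Node("bob", "agree", {
        "agree": leaf(alice=1, bob=1),
        "skip": leaf(alice=0, bob=0),
    }),
    "skip": leaf(alice=0, bob=0),
})

# Naive swap: Alice pays first, so Bob can keep the payment and skip.
naive = Node("alice", "pay", {
    "pay": Node("bob", "pay", {
        "pay": leaf(alice=1, bob=1),
        "skip": leaf(alice=-1, bob=1),
    }),
    "skip": leaf(alice=0, bob=0),
})

print(check(escrowed, AGENTS))
print(check(naive, AGENTS))
```

In this sketch the escrowed game satisfies both conditions, while the naive game satisfies liveness but not safety, because a compliant Alice can end up worse off when Bob skips.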

Both agents and the state machine itself can own and exchange assets. We keep track of ownership using addresses: each agent has an address, and the state machine has an address of its own. Addresses and assets are each drawn from a fixed domain.

We represent state machines in procedural pseudocode. The block marked State defines the machine’s state components. The state includes an account map, accounts, mapping addresses and assets to account balances. We will often abuse notation by writing an agent’s name in place of its address when there is no danger of confusion. Other state components may include counters, flags, or other bookkeeping structures.

At the start of the state machine execution, the agents initialize the state by executing the block marked Initialize. An agent triggers a state transition by issuing a move, which may take arguments. Each move has an implicit Sender argument that keeps track of which agent originated that move. In examples, a move is defined by a Move block, which checks preconditions and enforces postconditions. To capture the Byzantine nature of agents, every non-final state has an enabled Skip move, which leaves the state unchanged, except for moving on to the next turn. (Usually, Skip deviates from the protocol.) A dedicated keyword ends the execution for the sender.

The example state machines illustrated in this section favor readability over precision when meanings are clear. For clarity and brevity, we omit some routine sanity checks and error cases. Our examples are all applications that exchange assets because these are the applications that make the most sense for the cross-chain model.

3.1. Example: Simple Swap

1:
2:
3: , , : bool := , ,
4: Agree()   ▷ Each agent agrees to swap
5: if  then
6:
7: else if  then
8:
9: Complete()   ▷ Any agent can complete the swap
10: if  then   ▷ Not yet completed?
11:     if  then   ▷ Both agents agreed?
12:
13:
14:
15:
16:
17:
State Machine 1 Simple Swap

Algorithm 1 shows pseudocode for a simple swap state machine, where Alice and Bob swap one of her florins for one of his ducats. The block marked State defines the state components: the accounts map and various control flags. Each agent agrees to the swap (Line 5, Line 7), checking that the caller has sufficient funds. After both have agreed, either agent can complete the transfers (Line 9). If either agent tries to complete the transfer before both have agreed, the transfer fails, and no assets are exchanged.
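The description above can be sketched as a small (non-replicated) state machine. This is a hedged illustration rather than the paper's pseudocode: the class name, the `agreed` and `completed` flags, the `OFFER` table, and the choice that a premature `complete` call also ends the swap are all assumptions.

```python
class SimpleSwap:
    """Sketch of the simple swap; flag and asset names are assumptions."""

    OFFER = {"alice": "florin", "bob": "ducat"}  # what each agent gives up

    def __init__(self, accounts):
        # accounts[agent][asset] -> balance (copied so callers keep their data)
        self.accounts = {a: dict(b) for a, b in accounts.items()}
        self.agreed = {"alice": False, "bob": False}
        self.completed = False

    def agree(self, sender):
        """Record the sender's agreement if it can cover its side of the swap."""
        asset = self.OFFER[sender]
        if self.accounts[sender].get(asset, 0) >= 1:
            self.agreed[sender] = True

    def _move(self, asset, src, dst):
        """Transfer one unit of asset from src to dst."""
        self.accounts[src][asset] -= 1
        self.accounts[dst][asset] = self.accounts[dst].get(asset, 0) + 1

    def complete(self, sender):
        """Any agent may complete; returns True only if assets were exchanged."""
        if self.completed:
            return False
        self.completed = True  # assumption: a premature attempt also ends the swap
        if not all(self.agreed.values()):
            return False       # transfer fails, no assets exchanged
        self._move("florin", "alice", "bob")
        self._move("ducat", "bob", "alice")
        return True


start = {"alice": {"florin": 1}, "bob": {"ducat": 1}}
swap = SimpleSwap(start)
swap.agree("alice")
swap.agree("bob")
print(swap.complete("bob"), swap.accounts)
```

Keeping the machine centralized like this is the point of the SMR approach: the swap logic contains no timeouts or relaying, which are left to the replication protocol.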

3.2. Example: Decentralized Autonomous Organization (DAO)

1:
2:   ▷ Initially 0
3:   ▷ Initially 0
4:   ▷ Initially
5:
6: if  then   ▷ Make sure DAO has funds
7:
8: VoteYes()   ▷ LP casts votes
9: if  is enabled then
10:     if  then   ▷ Sender has enough tokens
11:
12:
13: VoteNo()   ▷ Symmetric with VoteYes()
14:
15: do nothing
16: Resolve()
17: if  is enabled and threshold voted yes then   ▷ Fund 100 to Alice
18:
19:
20:
State Machine 2 DAO State Machine

Consider a venture fund organized as a decentralized autonomous organization (DAO), where liquidity providers (LPs) vote on how to invest their funds. Algorithm 2 shows a state machine where the DAO’s LPs vote on whether to fund Alice’s request for 100 florins. Each LP holds some number of governance tokens, each of which can be converted to a vote. After the LPs vote, a director tallies their votes, and if there are enough yes votes, transfers the funds. The state consists of the accounts map and two maps counting yes and no votes.

Initialization (Line 6) ensures that the DAO’s own account is funded. Each LP votes in turn whether to approve Alice’s request (Lines 9-12). (As discussed in Section 5, these votes could be concurrent.) If an LP skips its turn, the tallies are unchanged (Line 15). After every LP has had a chance to vote, the director can ask for a resolution (Lines 17-20). If the caller is authorized and if a threshold number of votes were yes (Line 17), the funds are transferred to Alice from the DAO’s account. In either case, the execution ends.

3.3. Example: Sealed-Bid Auction

1:
2:
3:   ▷ Hash of bid + nonce
4:   ▷ Unsealed bid
5: SealedBid
6: if  is enabled then   ▷ No moves out of turn
7:
8: Unseal
9: if  is enabled then   ▷ No moves out of turn
10:     if  then   ▷ Sealed bid checks out
11:         if  then   ▷ Bidder has enough money?
12:               ▷ Transfer bid
13:
14:
15: Resolve()   ▷ Bidder collects NFT or refund
16: if  is enabled and all bids unsealed then
17:     if
18:     then
19:         transfer NFT to Sender   ▷ Sender won
20:     else   ▷ Sender lost, refund bid
21:
22:
State Machine 3 Sealed-Bid Auction State Machine

Consider a simple auction, where bids are placed on one ledger for an NFT asset maintained on another. As in the swap state machine, each agent funds its account before the execution starts. (In Section 4, we explain how bidders might move funds to the state machine while the execution is in progress.)

The state consists of the accounts, a set of sealed bids, and a set of (plain-text) bids. Each bidder conducts a simple commit-reveal protocol. First, the bidder issues a sealed bid constructed by hashing its bid together with a nonce (Line 5). After all bidders have submitted their sealed bids, each bidder submits their actual bid and nonce, which is checked by the state machine (Line 8). After all bids have been received and unsealed, any bidder can query the outcome by calling Resolve() (Line 15). If the sender’s bid was highest (after breaking ties) (Line 18), the NFT is transferred, and otherwise the sender’s bid is refunded from the replica’s account (Line 21).
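The commit-reveal flow can be sketched as follows. This is an illustrative sketch under stated assumptions: SHA-256 as the hash, the `seal` encoding of bid and nonce, the tie-break rule, and all class and method names are hypothetical, and turn-taking checks are omitted.

```python
import hashlib


def seal(bid: int, nonce: bytes) -> str:
    """Commitment to a bid: SHA-256 over the bid and a secret nonce."""
    return hashlib.sha256(str(bid).encode() + b"|" + nonce).hexdigest()


class SealedBidAuction:
    def __init__(self, balances):
        self.balances = dict(balances)  # bidders' coin balances
        self.sealed = {}                # bidder -> commitment
        self.bids = {}                  # bidder -> unsealed bid held in escrow
        self.resolved = set()           # bidders who already collected

    def sealed_bid(self, sender, commitment):
        self.sealed.setdefault(sender, commitment)  # first commitment stands

    def unseal(self, sender, bid, nonce):
        if sender in self.bids or self.sealed.get(sender) != seal(bid, nonce):
            return False                # already unsealed, or commitment mismatch
        if self.balances[sender] < bid:
            return False                # bidder cannot cover the bid
        self.balances[sender] -= bid    # transfer bid into escrow
        self.bids[sender] = bid
        return True

    def resolve(self, sender):
        if len(self.bids) < len(self.sealed):
            return None                 # some bids are still sealed
        if sender not in self.bids or sender in self.resolved:
            return None
        self.resolved.add(sender)
        # Assumed tie-break: highest bid, then lexicographically largest name.
        winner = max(self.bids, key=lambda a: (self.bids[a], a))
        if sender == winner:
            return "nft"
        self.balances[sender] += self.bids[sender]  # refund the losing bid
        return "refund"


auction = SealedBidAuction({"alice": 200, "bob": 100})
auction.sealed_bid("alice", seal(101, b"n1"))
auction.sealed_bid("bob", seal(90, b"n2"))
auction.unseal("alice", 101, b"n1")
auction.unseal("bob", 90, b"n2")
print(auction.resolve("alice"), auction.resolve("bob"), auction.balances)
```

The nonce keeps a sealed bid from being guessed by hashing candidate amounts, which matters here because the range of plausible bids is small.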

4. State Machine Replication Protocol

In this section, we define an SMR protocol by which multiple replica automata emulate a (centralized) state machine as defined in the previous section.

There are $n$ agents and $m$ assets, where each asset is managed by its own replica. Each replica maintains its own copy of the shared state. The SMR protocol’s job is to keep those copies consistent. We assume that agents have some way to find one another, to agree on the state machine defining their exchange, and to initialize replicas that begin execution with synchronized clocks.

The core of the SMR protocol is a reliable delivery service that ensures that the moves issued by the agents are delivered to the replicas reliably, in order. Reliable ordered delivery in the presence of Byzantine failures is well-studied (Bracha1987; guerraouiKMPSV2020; MendesHT2012; SrikanthT1987), but the cross-chain model requires new protocols because the rules are different. The principal difference is the asymmetry between agents, active automata who cannot be trusted, and replicas, purely reactive automata who can observe only their own local states, but who can be trusted to execute their own transitions correctly.

For example, suppose the protocol calls for Alice to send a move to replicas $R_1$ and $R_2$, instructing them to transition to state $s$. Each replica that receives the move validates Alice’s signature, checks that it is Alice’s turn, and checks that the move is enabled in the current state.

There are several ways Alice might deviate. First, she might send her move to replica $R_1$ but not $R_2$. In the SMR protocol, however, agents monitor one another. Another compliant agent, Bob, will notice that $R_2$ has not received Alice’s move. Bob will sign and relay that move from $R_1$ to $R_2$, causing $R_2$ to receive that move at most $\Delta$ later than $R_1$. As long as there is at least one compliant agent, each move will be delivered to each replica within a known duration.

Second, Alice might deviate by sending conflicting moves, such as “transition to $s$” to $R_1$, but “transition to $s'$” to $R_2$. Here, too, Bob will notice the discrepancy and relay both moves to $R_1$ and $R_2$, presenting each replica with proof that Alice deviated. Each replica will discard the conflicting moves, acting as if Alice had skipped her turn.

Third, Alice might send a move to $R_1$ that is not enabled in the current state. Replica $R_1$ simply ignores that move, acting as if Alice had skipped her turn.

Finally, Alice might not send her move to either replica. Each replica that goes long enough without receiving a move will act as if Alice had skipped her turn. In short, reliable delivery has only two outcomes: a valid move from Alice delivered to every replica, or no valid move delivered, interpreted as a Skip, all within a known duration.
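The two possible outcomes can be captured in a few lines. The sketch below is an assumption-laden illustration of the decision a replica makes once a round's delivery window has closed; the function name `deliver` and the string `"Skip"` are hypothetical stand-ins.

```python
def deliver(received, enabled):
    """Return the move a replica applies at the end of a round.

    received: moves found in the sender's buffer (possibly conflicting)
    enabled:  moves enabled for the sender in the current state
    """
    distinct = set(received)
    if len(distinct) == 1:
        (move,) = distinct
        if move in enabled:
            return move        # exactly one valid move was delivered
    return "Skip"              # missing, conflicting, or disabled move


print(deliver(["pay"], {"pay"}))
print(deliver(["pay", "refund"], {"pay", "refund"}))
```

Missing, conflicting, and disabled moves all collapse to the same outcome, so every replica that sees the same buffer contents reaches the same decision.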

To summarize, the SMR protocol consists of three modules.

  • The front-end automata (Algorithm 6), one for each agent, provide functions called by agents, including initial asset transfers into the state machine, the moves, and final asset transfers out of the state machine. (Every compliant agent is in charge of ensuring that final asset transfers take place.)

  • The relay service (Algorithm 4) guarantees that moves issued by front-ends are reliably delivered to the replicas as long as at least one agent is compliant.

  • The replicas (Algorithm 5), one for each asset, process function calls sent by agents through their front-ends, maintain copies of the state, and manage individual assets.

4.1. Path Signatures

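A request is a triple $(a, \mu, r)$, used to indicate that agent $a$ requests move $\mu$ at the start of round $r$. Let $\mathit{sig}(z, a)$ denote the result of signing a message $z$ with $a$’s secret key. A path $p$ of length $k$ is a sequence $[a_1, \ldots, a_k]$ of distinct agents. We use $[\,]$ for the empty sequence, and $p + a$ to append $a$ to the sequence $p$: $[a_1, \ldots, a_k] + a = [a_1, \ldots, a_k, a]$. A path signature (Herlihy2018; HerlihyLS2021) $p(z)$ for a message $z$ is defined inductively:

$$[\,](z) = z, \qquad (p + a)(z) = \mathit{sig}(p(z), a).$$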

Informally, path signatures work as follows. Round $r$ starts at time $t_r$ after initialization. Within time $\Delta$ after the start of round $r$, a receiver accepts the path signature directly from Alice. Within time $2\Delta$, a receiver accepts a message originating from Alice and relayed through Bob, and within time $k\Delta$, a receiver accepts a message originating from Alice and relayed through $k-1$ distinct agents. A path signature of length $k$ is live for a duration of $k\Delta$ after $t_r$. If no message is received for a duration of $n\Delta$ after $t_r$, then no message was sent.
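A toy version of nested signing and the liveness rule is sketched below. It makes several loud assumptions: HMAC tags under per-agent keys stand in for real public-key signatures (so this is not secure against the verifier), `DELTA`, `KEYS`, and all function names are hypothetical, and a path of length $k$ is taken to be live for $k \cdot \Delta$ after the round starts.

```python
import hashlib
import hmac

DELTA = 10  # assumed message delay bound, in arbitrary time units
KEYS = {"alice": b"key-a", "bob": b"key-b", "carol": b"key-c"}  # toy keys


def sign(message: bytes, agent: str) -> bytes:
    """Stand-in for a signature: an HMAC tag under the agent's key."""
    return hmac.new(KEYS[agent], message, hashlib.sha256).digest()


def path_signature(request: bytes, path):
    """Nest signatures along the path: each agent signs the previous result."""
    tags, current = [], request
    for agent in path:
        current = sign(current, agent)
        tags.append(current)
    return tags


def well_formed(request: bytes, path, tags) -> bool:
    """Signers must be distinct and every nested tag must verify."""
    if len(set(path)) != len(path) or len(tags) != len(path):
        return False
    expected = path_signature(request, path)
    return all(hmac.compare_digest(t, e) for t, e in zip(tags, expected))


def live(path, elapsed) -> bool:
    """A path of length k is live for k * DELTA after the round starts."""
    return elapsed <= len(path) * DELTA


request = b"alice|pay|round-1"
direct = path_signature(request, ["alice"])
relayed = path_signature(request, ["alice", "bob"])
print(live(["alice"], 15), live(["alice", "bob"], 15))
```

Each relay hop buys one more $\Delta$ of validity, which is why a message that arrives late must carry enough distinct signatures to explain its delay.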

Several functions and predicates are defined on path signatures of length $k$. An elapsed-time function gives the time elapsed since the start of the current round. The SMR protocol uses a liveness predicate to determine whether a message should be accepted by replicas, and an executability predicate to determine whether the accepted message’s move can be applied. A path signature is well-formed if the signatures are valid and the signers are distinct. For brevity, replica pseudocode omits well-formedness checks. Requests and path signatures are each drawn from their own domain.

4.2. Reliable Delivery

The agents act as senders (indexed by $i$) and the replicas act as receivers (indexed by $j$).

Each receiver $j$ has a buffer component, where $\mathit{buffer}_j[i]$ holds path signatures for moves originally issued by agent $i$ and received at chain $j$.

Property 1.

Here is the specification for the reliable delivery protocol.

  • Authenticity: Every move contained in a path signature in receiver $j$’s buffer for sender $i$ was signed by $i$.

  • Consistency: If any receiver receives a path signature indicating sender $i$’s move, then, within $\Delta$, so does every other receiver.

  • Fairness: If a compliant sender $i$ issues a move, then every receiver receives a path signature containing that move within $\Delta$.

Note that a deviating sender may deliver multiple moves to the same receiver.

Before issuing a move, a compliant agent waits until that move is enabled at some replica’s state (not shown in pseudocode). To issue the move, the agent sends the path signature for its request, signed by itself, to every replica (Algorithm 6, Line 13). When a replica receives a live path signature with a move originally issued by some agent, it places that message in its buffer for that agent (Algorithm 5, Line 15). From that point, the relay service is in charge of delivery.

A natural way to structure the relay service is to have each (compliant) sender run a dedicated thread that repeatedly reads replica buffers and selectively relays messages from one replica to the others (Algorithm 4). Lines 3-8 show the pseudocode for relaying moves. Each relaying agent reads each receiver’s buffer (Line 4), and selects messages (more specifically, requests of moves) that it has not already relayed (Line 5). Each such message is sent to the other receivers (Line 6) after the relaying agent appends its own signature to the path, producing a path signature one signer longer, and the message is recorded to avoid later duplication (Line 7). After reading the buffers, the relaying agent calls a function at each replica (Line 8) that causes the replica to check whether it can execute a move (see below).
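One pass of such a relay thread might look like the sketch below. It is a simplified illustration, not Algorithm 4 itself: buffers are plain in-memory lists, signatures are represented only by the list of signer names, liveness checks and the final wake-up call are omitted, and the name `relay_once` is hypothetical.

```python
def relay_once(agent, buffers, relayed):
    """One pass of a hypothetical relay thread run by `agent`.

    buffers: replica name -> list of (request, path) pairs
    relayed: set of requests this agent has already relayed
    Returns the number of messages sent.
    """
    sent = 0
    for source, messages in buffers.items():
        for request, path in list(messages):   # snapshot: buffers grow below
            if request in relayed or agent in path:
                continue                        # nothing new to relay
            extended = path + (agent,)          # append our own signature
            for target in buffers:
                if target != source:
                    buffers[target].append((request, extended))
                    sent += 1
            relayed.add(request)                # do not relay again
    return sent


request = ("alice", "pay", 1)
buffers = {"euro": [(request, ("alice",))], "dollar": []}
already_relayed = set()
print(relay_once("bob", buffers, already_relayed), buffers["dollar"])
```

Recording relayed requests keeps the traffic bounded: each compliant agent forwards a given request at most once, so a path never needs more signers than there are agents.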

Theorem 4.1 (Authenticity).

Every move in $\mathit{buffer}_j[i]$ was signed by agent $i$.

Proof.

Every move issued by agent $i$ is wrapped in a path signature signed by $i$ (Algorithm 6, Line 13). ∎

Theorem 4.2 (Consistency).

If a path signature of the form is placed in any before , then a path signature for that request will be placed in every before .

Proof.

The first time any replica places in where , then must be live (Algorithm 5, Line 14). If that receipt occurred before , then liveness implies . If any signer in was compliant, then already relayed that message to every replica. Otherwise, if no compliant agent has signed that message, then , and within some compliant agent will relay the path signature to every replica. Each replica receives before , so the path signature is live, and will be placed in , for all . ∎

Theorem 4.3 (Fairness).

If a compliant agent $i$ issues a request at round $r$, then a path signature containing that request appears in $\mathit{buffer}_j[i]$ for all $j$ within time $\Delta$.

Proof.

Each compliant agent $i$ sends the path signature to each replica (Algorithm 6, Line 13). That path signature is received by each replica before $\Delta$ has elapsed, so the path signature is live, and is placed in $\mathit{buffer}_j[i]$ (Algorithm 5, Line 14). ∎

1:   ▷ Initially empty
2: while Exchange is in progress do
3:     for all  do
4:         for all  do   ▷ Inspect every path signature
5:              if  then   ▷ Does it need to be relayed?
6:                  for all  do   ▷ Append signature and relay
7:                     ▷ Don’t relay again
8:     for all  do   ▷ Wake up replicas
State Machine 4 Relay Protocol

4.3. Initialization, Moves, and Settlement

Each replica manages a unique asset. On the replica that manages a given asset, each agent has a long-lived account that records how many units of that asset the agent owns. While an execution is in progress, each agent also has a short-lived account at that replica, tracking how many units of the asset have been tentatively assigned to the agent there. Each replica has its own address and its own long-lived account. Long-lived and short-lived accounts are related by the following invariant: the balance of the replica’s own long-lived account equals the sum of the agents’ short-lived balances for that replica’s asset.

We assume each replica is authorized to transfer assets, in either direction, between the calling agent’s long-lived account and the replica’s own long-lived account.

At the start of the execution, each agent escrows funds by transferring some quantity of each asset from its long-lived account to the replica’s long-lived account (Algorithm 5, Line 8). If that transfer is successful, the replica credits the agent’s short-lived accounts: for each asset, the replica sets the agent’s short-lived balance equal to the amount funded (Line 10). The agent is then marked as funded (Line 11). Only properly funded agents can execute moves (Line 14).

What could go wrong? The transfer from an agent’s long-lived account to the replica’s account might fail because the agent has insufficient funds. The replica for that asset can detect and react to such a failure, but the other replicas cannot. To protect against such failures, each agent calls a verification function (Algorithm 6, Lines 8-10), which checks that all replicas’ account balances are consistent. Finally, each agent checks that every other agent has transferred the agreed-upon amounts (Line 4). This last test is application-specific: for the swap example, agents would check that the others transferred a specified amount of coins, while in the DAO example, LPs can transfer as many governance tokens as they like. If either test fails, the front-end refunds that agent’s assets (by invoking the redemption function in Algorithm 6). In this way, if some agents drop out before initialization, they are marked as unfunded, and the remaining funded agents may or may not choose to continue. In the meantime, safety is preserved since each compliant agent who continues sees a consistent state across all replicas. Each compliant agent who leaves the execution gets their funds back, ensuring they end up no worse off.

This funding step takes time at most . Each agent then verifies that replicas are funded consistently. If not, the agent calls () to reclaim its funding, and drops out. This verification takes time , like any other move. For an execution starting at time , initialization completes before .

While the execution is in progress, transfers of an asset between two agents are expressed as transfers between those agents’ short-lived accounts, leaving the balance of the replica’s long-lived account unchanged. Each replica also tracks every agent’s balances for other assets: it keeps its own view of the agent’s current short-lived balance for each asset.

When the execution ends, each agent calls each replica’s redemption function (Algorithm 6, Line 26) to get its assets back. This function transfers the agent’s short-lived balance of the replica’s asset from the replica’s long-lived account to the agent’s long-lived account (Algorithm 5, Line 38). Once an agent’s funds are redeemed, that agent is marked as not funded (Line 39). The function serves two roles: it can refund an agent’s original assets if the exchange fails, or it can claim an agent’s new assets if the exchange succeeds. If all agents are compliant and no one drops out, the execution proceeds and every agent ends up with a better payoff, ensuring liveness. Any compliant agent can drop out, either early with a refund (say, if it observes inconsistent funding), or at the execution’s end. Both choices ensure that all assets in short-lived accounts are moved to long-lived accounts.
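The escrow life cycle for a single asset can be sketched as below. This is a simplified, hypothetical model: the class and method names, the `"replica"` account key, and the `invariant` helper are assumptions, and the invariant checked is the reading that the replica's own long-lived balance equals the sum of the short-lived balances it holds.

```python
class AssetReplica:
    """Hypothetical single-asset replica with long- and short-lived accounts."""

    ESCROW = "replica"  # hypothetical key for the replica's own account

    def __init__(self, long_lived):
        self.long = dict(long_lived)       # long-lived balances
        self.long.setdefault(self.ESCROW, 0)
        self.short = {}                    # short-lived balances
        self.funded = set()

    def fund(self, agent, amount):
        """Escrow `amount` from the agent's long-lived account."""
        if agent in self.funded or self.long.get(agent, 0) < amount:
            return False
        self.long[agent] -= amount
        self.long[self.ESCROW] += amount
        self.short[agent] = amount
        self.funded.add(agent)
        return True

    def transfer(self, src, dst, amount):
        """In-execution transfer: only short-lived balances change."""
        if not {src, dst} <= self.funded or self.short[src] < amount:
            return False
        self.short[src] -= amount
        self.short[dst] += amount
        return True

    def redeem(self, agent):
        """Move the agent's short-lived balance back to its long-lived account."""
        if agent not in self.funded:
            return 0
        amount = self.short.pop(agent)
        self.long[self.ESCROW] -= amount
        self.long[agent] = self.long.get(agent, 0) + amount
        self.funded.discard(agent)         # no moves allowed after cashing out
        return amount

    def invariant(self):
        return self.long[self.ESCROW] == sum(self.short.values())


replica = AssetReplica({"alice": 5, "bob": 0})
replica.fund("alice", 3)
replica.fund("bob", 0)
replica.transfer("alice", "bob", 2)
print(replica.invariant(), replica.redeem("bob"), replica.redeem("alice"))
```

Because moves touch only short-lived balances, an agent who redeems early with a refund and one who redeems at the end go through the same code path.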

1:
2:   ▷ Initially all
3:   ▷ Initially all
4:   ▷ Replica of state machine
5:   ▷ Next move to execute
6:   ▷ Each round’s start time
7:   ▷ Initial funding
8: transfer units of from to
9: if transfer was successful then
10:     for all  do
11:
12:   ▷ Set start time for round 1
13: send   ▷ Forward move to relay protocol
14: if  and  then   ▷ Relay only live moves by funded agents
15:
16:   ▷ Relayer: check for executable move
17:   ▷ Execute if enough time has elapsed
18: if   then   ▷ Unique executable move?
19:       ▷ Execute it
20:       ▷ Set the start time of next round
21: else if  then   ▷ Did we time out?
22:       ▷ Agent chose not to move
23:       ▷ Set the start time of next round
24:   ▷ Dynamically add funding
25: if  then
26:     transfer units of from to
27:     if transfer was successful then
28:           ▷ Credit accounts
29:     else
30:           ▷ Freeze accounts
31:
32: if  then   ▷ Check authorization
33:     for all  do
34:         if  then
35:
36:   ▷ Settle accounts at end
37: if  then
38:     transfer units of from to
39:       ▷ No moves allowed after cashing out
State Machine 5 Replica for asset

Each replica (Algorithm 5) has its own copy of the state machine (Line 4). The replica can apply moves to the copy (Line 19), and the replica can determine whether a proposed move by a particular agent is currently enabled (Line 17).

A replica’s move-checking function determines whether there is a unique move in its buffer to execute. Each time a replica starts a round, it records the time (Line 20). If the round’s time limit then elapses without delivering the next move (Line 21), the missing move is deemed to be a Skip (Line 22).

4.4. Dynamic Funding

In the auction state machine (Algorithm 3), each agent bids without knowing the others’ bids, but each agent does know the others’ maximum possible bids, because short-lived account balances are public. For example, if Alice initially funded her short-lived account with 200 coins and Bob with 100, then Alice knows she can win by bidding 101 coins.

What if agents could provide additional funding while the execution is in progress? For example, an English auction with multiple bidding rounds might pause between rounds to allow agents to add more coins to their short-lived accounts. If Alice bids 101 coins in the first round, Bob might respond by adding enough coins to his short-lived account to respond in the next round.

Such flexibility, however, comes at a cost. Consider two scenarios. In the first, Bob transfers a million coins from his long-lived account to his short-lived account. This transfer happens on the coin replica. When he honestly reports that transfer to the auction replica, Alice and Carol consistently and falsely report to the auction replica that the transfer never occurred. In the second scenario, Bob fails to transfer the million coins for lack of funds. When he dishonestly reports a successful transfer to the auction replica, Alice and Carol consistently and honestly report the transfer never occurred.

An ideal mechanism would ensure that in the first scenario, where Bob’s transfer succeeded on the coin replica, his claim would be accepted on the auction replica, and the auction would continue with Bob’s new funding. In the second scenario, however, where Bob’s transfer failed on the coin replica, his claim would be rejected on the auction replica, and the auction would continue without him. Unfortunately, these scenarios are indistinguishable to the auction replica, which cannot directly observe the transfer’s outcome on the coin replica. Alice and Carol know whether Bob is lying, but they have no way to “convince” the auction replica.

An imperfect way to support dynamic funding is to make it expensive for Bob to cheat. As part of initialization, each agent pays a deposit at each replica. After each round of bidding, the bidders execute a top-up round, similar to the initialization round. Each bidder may transfer additional coins from his long-lived account to the state machine’s coin account, a sum reflected in his short-lived account. If, for example, Bob’s transfer fails, he forfeits his deposit, along with any coins transferred in earlier rounds. At the end of the top-up round, the bidders call , withdrawing and quitting if accounts are incorrect. This mechanism is imperfect because it gives Bob the power to wreck the auction through a (perhaps deliberately) failed transfer, but if he does so, he pays a penalty.

Algorithm 6, Lines 14-16 shows pseudocode for the front-end’s function that dynamically adds funding to an execution. Just as for initialization, each agent specifies, for each asset , how many additional units of to transfer to the execution (Line 15). Each agent then checks that the new global funding state is consistent, and if not, it redeems and quits (Line 16).

Algorithm 5, Lines 25-30 shows pseudocode for replica ’s function. The function transfers the additional units of from to the replica’s account (Line 26). If the transfer is successful, the short-lived accounts are credited (Line 28). If the transfer is not successful, the agent is marked as unfunded (Line 30), causing the caller to forfeit his deposit, along with any previous asset transfers.
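The replica-side behavior described above can be sketched as follows. This is a minimal illustration and not the paper's Algorithm 5: the class `Replica`, its fields, and the choice to seize forfeited assets immediately are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    long_lived: dict                                  # agent -> long-lived balance
    short_lived: dict = field(default_factory=dict)   # agent -> short-lived balance
    deposits: dict = field(default_factory=dict)      # agent -> deposit paid at initialization
    unfunded: set = field(default_factory=set)        # agents whose top-up failed
    forfeited: int = 0                                # assets seized from unfunded agents

    def top_up(self, agent: str, amount: int) -> bool:
        """Move `amount` from the agent's long-lived account into the execution."""
        if agent in self.unfunded:
            return False
        if amount < 0 or self.long_lived.get(agent, 0) < amount:
            # Failed transfer: mark the agent unfunded; it forfeits its
            # deposit along with any assets transferred in earlier rounds.
            self.unfunded.add(agent)
            self.forfeited += (self.deposits.pop(agent, 0)
                               + self.short_lived.pop(agent, 0))
            return False
        # Successful transfer: credit the short-lived account.
        self.long_lived[agent] -= amount
        self.short_lived[agent] = self.short_lived.get(agent, 0) + amount
        return True


replica = Replica(long_lived={"Alice": 500, "Bob": 50},
                  short_lived={"Alice": 200, "Bob": 100},
                  deposits={"Alice": 10, "Bob": 10})
ok_alice = replica.top_up("Alice", 100)      # enough funds: succeeds
ok_bob = replica.top_up("Bob", 1_000_000)    # insufficient funds: Bob is penalized
```

Here Bob's failed top-up costs him his deposit of 10 and the 100 coins he transferred earlier, which is the penalty that makes cheating expensive.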

A second imperfect way to support dynamic funding is to limit Bob’s potential for mischief by giving Carol, the auctioneer, authority to expel bidders whose transfers fail. (Carol’s role could also be filled by a committee.) This mechanism is imperfect because it grants Carol the power to expel Bob under false pretexts, although she has no incentive to do so, since higher bids mean higher profits for her. Note that Bob does not lose any assets if he is unfairly expelled, so he ends up no worse off.

At initialization, Carol is installed as the leader at each replica (not shown). After each round of bidding, the bidders execute a verified top-up round (Algorithm 6, Lines 18-25) as an extension of the top-up round described earlier. Each agent executes a top-up round as before (Line 18). The leader (Carol) then inspects the accounts, records which agents’ transfers failed (Line 25), and instructs the replicas to mark those accounts as defunded (Line 22 and Algorithm 5, Lines 31-35), ensuring they will henceforth be ignored by the rest of the agents. Finally, all agents call before moving on to the next round of bidding.
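A rough sketch of the leader's role in the verified top-up round follows. The names `failed_transfers`, `LeaderReplica`, and `defund` are hypothetical, and the sketch assumes that each replica simply checks the sender's identity against the leader installed at initialization.

```python
def failed_transfers(outcomes: dict) -> set:
    """Leader-side step: collect the agents whose top-up transfers failed.

    `outcomes` maps each agent to True if its transfer succeeded.
    """
    return {agent for agent, ok in outcomes.items() if not ok}


class LeaderReplica:
    """Hypothetical replica that accepts defund instructions only from its leader."""

    def __init__(self, leader: str, agents):
        self.leader = leader
        self.active = set(agents)

    def defund(self, sender: str, agent: str) -> bool:
        if sender != self.leader:
            return False          # only the leader may expel a bidder
        self.active.discard(agent)
        return True


agents = ["Alice", "Bob", "Carol"]
replicas = [LeaderReplica("Carol", agents), LeaderReplica("Carol", agents)]

# Carol inspects the accounts and instructs every replica to defund the failures.
to_defund = failed_transfers({"Alice": True, "Bob": False})
for replica in replicas:
    for agent in to_defund:
        replica.defund("Carol", agent)

# An instruction from a non-leader is ignored.
rejected = replicas[0].defund("Alice", "Carol")
```

The sketch also makes the mechanism's weakness visible: nothing prevents Carol from passing an arbitrary set of agents to `defund`.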

State Machine 6 Front-end for agent
1:
2: for all  do
3:
4: if  is not the agreed-upon amount then
5:      ▷ Get assets refunded
6:
7: ▷ Check for malformed funding data
8: if  then
9:      ▷ Ask for asset refund
10:
11: send
12: for all  do
13:      ▷ Add path sig and broadcast
14:
15: for all  do ▷ Send path sig
16:
17:
18: ▷ Call regular top-up
19: if  then ▷ Authorized to accept or reject top-up
20:     for all  do
21:
22:     for all  do
23: else
24:     wait for leader to deliver defund votes
25:
26: ▷ Reclaim assets
27: for all  do ▷ Redeem assets from each replica

5. Remarks

Although it is mostly straightforward to implement replica automata as smart contracts, there are practical, blockchain-specific details (such as analyzing gas prices) that are beyond the scope of this paper.

We model abstract state machines in adversarial environments as sequential games, where agents take turns. In some applications, agents can make moves in parallel, as long as certain conditions hold. In the simple swap example, Alice and Bob can agree in parallel. In the DAO example, votes can be cast in parallel, and in the auction example, sealed bids can be submitted in parallel, bids can be unsealed in parallel, and outcomes can be resolved in parallel.

When is it safe to execute moves in parallel? First, parallel moves must commute, meaning that the moves’ order does not matter. Commutativity is essential because those moves may appear in different orders at different replicas. Second, the moves should be strategically independent, meaning that no agent would change its move if it were aware of a parallel move. For example, Alice and Bob should not issue concurrent plain-text bids, because a corrupt validator might leak Alice’s bid to Bob, and allow Bob to change his bid in response. By contrast, Alice and Bob are free to issue sealed bids in parallel, because neither agent could benefit from observing the other’s sealed bid.
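The commutativity condition can be checked mechanically for a given starting state, as in the following sketch. The helper names (`commute`, `sealed_bid`, `overwrite_high_bid`) and the dictionary-based state are assumptions for illustration. The check covers only commutativity from one state; strategic independence is a game-theoretic property that it does not capture.

```python
from itertools import permutations


def commute(state: dict, moves) -> bool:
    """Return True if applying `moves` in every order yields the same state."""
    results = []
    for order in permutations(moves):
        current = dict(state)
        for move in order:
            current = move(current)
        results.append(current)
    return all(result == results[0] for result in results)


def sealed_bid(agent: str, commitment: str):
    """Each agent writes only its own entry, so sealed bids do not interfere."""
    def move(state: dict) -> dict:
        new_state = dict(state)
        new_state[agent] = commitment
        return new_state
    return move


def overwrite_high_bid(agent: str, amount: int):
    """Last writer wins, so the final state depends on delivery order."""
    def move(state: dict) -> dict:
        new_state = dict(state)
        new_state["high"] = (agent, amount)
        return new_state
    return move


sealed_ok = commute({}, [sealed_bid("Alice", "c1"), sealed_bid("Bob", "c2")])
overwrite_ok = commute({}, [overwrite_high_bid("Alice", 101),
                            overwrite_high_bid("Bob", 120)])
```

With these definitions, the sealed bids commute while the overwriting moves do not, so only the former could safely be delivered in different orders at different replicas.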

The principal cost of our SMR protocol is that emulating a state machine transition requires time . One way to reduce this cost is to proceed optimistically: each replica executes each enabled move as soon as it is received, and overlaps subsequent execution with waiting to detect duplicates. When the execution completes, if there are still moves waiting to be finalized, the protocol simply postpones termination until the absence of conflicts is confirmed. If a conflict is detected, then the execution must be rolled back, either to the point of conflict, or to initialization, and retried, perhaps without speculation. Speculation transforms an -move execution from time to time , for failure-free executions.
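The optimistic strategy can be sketched as a replica that keeps one snapshot per speculatively executed move. The class `SpeculativeReplica` and its methods are hypothetical, and conflict detection itself is assumed to happen elsewhere and to report the index of the first conflicting move.

```python
class SpeculativeReplica:
    """Hypothetical replica that executes moves before they are finalized."""

    def __init__(self, initial_state: dict):
        # snapshots[i] is the state before move i; the last entry is current.
        self.snapshots = [dict(initial_state)]
        self.finalized = 0  # number of leading moves confirmed conflict-free

    @property
    def state(self) -> dict:
        return self.snapshots[-1]

    def executed(self) -> int:
        return len(self.snapshots) - 1

    def execute(self, move) -> None:
        """Apply a move immediately, without waiting for finalization."""
        self.snapshots.append(move(dict(self.state)))

    def finalize(self, count: int) -> None:
        """Record that the first `count` moves are confirmed conflict-free."""
        self.finalized = max(self.finalized, min(count, self.executed()))

    def can_terminate(self) -> bool:
        """Termination is postponed until every executed move is finalized."""
        return self.finalized == self.executed()

    def rollback(self, conflict_index: int) -> None:
        """Discard move `conflict_index` and every move executed after it."""
        if conflict_index < self.finalized:
            raise ValueError("cannot roll back a finalized move")
        del self.snapshots[conflict_index + 1:]


def increment(state: dict) -> dict:
    return {**state, "count": state["count"] + 1}


replica = SpeculativeReplica({"count": 0})
for _ in range(3):
    replica.execute(increment)
count_before = replica.state["count"]
replica.finalize(1)                       # only the first move is confirmed
terminate_before = replica.can_terminate()
replica.rollback(1)                       # conflict detected at the second move
count_after = replica.state["count"]
terminate_after = replica.can_terminate()
```

Keeping every snapshot is the simplest design; a practical implementation would more likely discard snapshots of finalized moves.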

Recall Algorithm 1, where Alice trades her florin for Bob’s ducat. If Alice’s final move (Line 5) is delivered to the ducat replica but not the florin replica, then Alice collects both coins. This injustice cannot happen as long as Bob is compliant, but what if Bob’s machine crashes, or he is the victim of a denial-of-service attack? Technically, Bob would be at fault, but he could protect himself by enlisting additional agents, who do not appear in the state machine specification, simply to relay messages. (A similar observation applies to the Lightning payment network (lightning), where one can hire a watchtower service (watchtower) to act on behalf of an agent that might inadvertently go off-line.)

How does the SMR protocol compare to specialized protocols? In compliant executions, the Nolan two-agent swap (tiersnolan) completes in time at most . Algorithm 1 requires to initialize, for each agreement (which could be done in parallel), and to complete the transfers. An ad-hoc multi-agent swap protocol (Herlihy2018) has latency, as does a comparable SMR protocol. Latency is harder to compare for executions in which agents deviate, since the SMR protocol may succeed in some scenarios where the ad-hoc protocols fail.

Our generic SMR protocol may be useful in other contexts. Recall that we replicate computation as well as data because one contract cannot directly observe another’s state. Suppose instead that there is one main chain whose contracts are capable of producing proofs of its current state, and that these proofs can be checked by contracts at the other chains. (Examples in the literature include ZK rollups (buterin2021), and the certifying blockchain of Herlihy et al. (HerlihyLS2021) used to commit atomic cross-chain transactions.) The execution would take place only on the main chain, while the other chains need only track the state by checking proofs, using the generic SMR protocol to ensure that ephemeral accounts are replicated consistently, so assets are distributed correctly at the end of the execution.

Instead of producing a proof, the main chain could produce an optimistic rollup (buterin2021; Kalodner2018), a summary listing of the agents’ final ephemeral account balances. This summary is replicated via the generic SMR protocol at the chains, but asset distribution is delayed long enough to allow agents time to detect fraud and, if fraud is found, to publish a proof and collect a reward.
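The delayed-distribution rule can be sketched as follows. The class `OptimisticSummary`, the integer timestamps, and the modeling of a fraud proof as a recomputed balance table are all simplifying assumptions; real optimistic rollups use considerably more elaborate fraud proofs, and rewards are omitted here.

```python
class OptimisticSummary:
    """Hypothetical summary of final ephemeral balances with a challenge window."""

    def __init__(self, balances: dict, posted_at: int, challenge_period: int):
        self.balances = dict(balances)
        self.deadline = posted_at + challenge_period
        self.rejected = False

    def challenge(self, now: int, recomputed_balances: dict) -> bool:
        """Accept a challenge only inside the window and only if it shows a mismatch."""
        if now >= self.deadline or self.rejected:
            return False
        if recomputed_balances == self.balances:
            return False
        self.rejected = True
        return True

    def can_distribute(self, now: int) -> bool:
        """Assets are released only after the window closes without proven fraud."""
        return not self.rejected and now >= self.deadline


correct = {"Alice": 0, "Bob": 101}

honest = OptimisticSummary(correct, posted_at=0, challenge_period=10)
honest_challenged = honest.challenge(5, correct)   # no mismatch, so no effect
honest_early = honest.can_distribute(9)
honest_late = honest.can_distribute(10)

fraudulent = OptimisticSummary({"Alice": 101, "Bob": 101},
                               posted_at=0, challenge_period=10)
fraud_caught = fraudulent.challenge(5, correct)
fraud_late = fraudulent.can_distribute(10)
```

The challenge period trades latency for security: a longer window gives agents more time to detect fraud but delays asset distribution for honest executions.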

In these contexts, our SMR protocol serves as a trustless token bridge (Sidhu2020), effectively transferring tokens or coins from each replica chain to the main chain at the execution’s start, then transferring them back (to their new owners) at the end.

6. Related Work

State machine replication is a classic problem in distributed computing. Protocols such as Paxos (Lamport1998), Raft (OngaroO2014), and their immediate descendants were designed to tolerate crash failures. Later protocols (see Distler’s survey (distler2021)) tolerate Byzantine failures. As noted, these protocols are not applicable to cross-chain exchanges because of differences in the underlying trust and communication models.

Prior Byzantine fault-tolerant (BFT) SMR protocols (distler2021) assume replicas may be Byzantine but clients are honest. By contrast, in our cross-chain model, replicas are correct (because they represent blockchains), but clients can be Byzantine (because they may try to steal one another’s assets).

Except for some randomized protocols that do not have explicit leaders (moniz2008ritas; duan2018beat; miller2016), most prior BFT-SMR protocols assign one replica to be the leader, and the rest to be followers (sometimes “validators”). These protocols tolerate only a certain fraction of faulty replicas. Our cross-chain SMR protocol, by contrast, does not distinguish between leaders and followers, and tolerates any number of faulty agents.

An individual blockchain’s consensus protocol can be viewed as an SMR protocol, where the ledger state is replicated among the validators (miners). Validators are typically rewarded for participating (amoussouguenou; bitcoin; roughgarden2020transaction). Validators might deviate in various ways, including selfish mining (eyal2014selfishMining), front-running (daian2019flash), or exploiting the structure of consensus rewards (abraham2018; amoussou2020rational; amoussou2020rational2; bano2019sok; buterin2020incentives; daian2019snow; kiayias2017ouroboros; kothapalli2017smartcast; pass2017fruitchains; rocsu2021evolution; saleh2021blockchain; sliwinski2020blockchains). Individual blockchain SMR protocols are not applicable for cross-chain SMR because of fundamental differences in models and participants’ incentives.

An alternative approach to cross-chain interoperability is allowing blockchains to communicate states to one another. As pointed out by a recent survey (qasse2019inter), most such solutions work only for homogeneous blockchains. These protocols usually adopt external relayers/validators to relay/validate messages across chains. Any failure of those external players can harm the safety of the cross-chain system. Incentivizing external operators remains a challenge (robinson2021survey).

In failure models of some prior work, parties are classified as either rational, seeking to maximize payoffs, or Byzantine, capable of any behavior. First proposed for distributed systems (moscibroda2006selfish), this classification has been used for chain consensus protocols (lev2019fairledger; sliwinski2020blockchains; mcmenamin2021achieving). The rational vs Byzantine classification is equivalent to our compliant vs deviating classification for cross-chain exchanges where compliance is rational, a property one would expect in practice.

The notion of replacing an ad-hoc protocol with a generic, replicated state machine was anticipated by Miller et al. (MillerBKM17), who propose generic state channels as a cleaner, systematic replacement for prior payment channels of the kind used in the Lightning network.

References