Coded Merkle Tree: Solving Data Availability Attacks in Blockchains

10/02/2019 ∙ by Mingchao Yu, et al. ∙ University of Southern California University of Illinois at Urbana-Champaign University of Washington 0

In this paper, we propose coded Merkle tree (CMT), a novel hash accumulator that offers a constant-cost protection against data availability attacks in blockchains, even if the majority of the network nodes are malicious. A CMT is constructed using a family of sparse erasure codes on each layer, and is recovered by iteratively applying a peeling-decoding technique that enables a compact proof for data availability attack on any layer. Our algorithm enables any node to verify the full availability of any data block generated by the system by just downloading a Θ(1) byte block hash commitment and randomly sampling Θ(log b) bytes, where b is the size of the data block. With the help of only one honest node in the system, our method also allows any node to verify any tampering of the coded Merkle tree by just downloading Θ(log b) bytes. We provide a modular library for CMT in Rust and Python and demonstrate its efficacy inside the Parity Bitcoin client.



There are no comments yet.


page 21

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Blockchains (e.g., Bitcoin [26] and Ethereum [34]

) maintain a ledger of ordered transactions, organized into a chain of blocks. Starting from the genesis block, network nodes extend the ledger by creating and appending more blocks, following specific block generation rules (e.g., the longest-chain rule is used in Bitcoin

[26]). The transactions in the received blocks are validated by full nodes which download the entire block tree. However, for better scalability, it is imperative for a blockchain to allow light nodes, which may only be interested in verifying some specific transactions.

In Bitcoin [26, 1], light nodes are implemented using the Simple Payment Verification (SPV) technique: a Merkle tree is constructed for each block using the transactions as the leaf nodes, and the Merkle root is stored in the block header. Utilizing the Merkle root, a light node can verify the inclusion of any transaction in a block through a Merkle proof. Light nodes and SPV have been leveraged extensively to scale computation and storage of blockchain systems over resource-limited nodes (e.g., smartphones) [3, 8, 4, 5, 25, 35, 18, 16, 19].

Besides inclusion, what is more important for a light node is to validate the transaction based on the ledger state. Due to limited resources, a light node cannot download the entire ledger. Instead, it could use the depth of the block that contains this transaction as a proxy. That is, the deeper this block is buried into the chain, the more confident the light node is about the validity of the transaction. However, for it to work, a majority of full nodes must be honest and must follow protocol. Further, there is a significant tradeoff between confirmation latency (due to the depth) and the security about transaction validity.

Therefore, efforts to study 1) the scenario of a light node being connected to dishonest majority of full nodes, and 2) how to achieve faster confirmation at light nodes are becoming a major research direction [26, 10, 17, 11]. The overall idea is to allow the light nodes to receive and verify fraud proofs of invalid transactions provided by full nodes. This allows us to make a minimal assumption on the connectivity of a light node: it is connected to at least one honest full node. Now, once an honest full node finds out that a transaction requested by some light node is invalid, the full node constructs a fraud proof which suffices the light node to infer the invalidity. One efficient construction that utilizes the roots of the intermediate state Merkle trees after executing a subset of transactions is proposed in [10]. While fraud proof is appealing, it is vulnerable to the so-called “data availability attack”, discussed next. Stating the data availability attack formally and solving it comprehensively is the main goal of this paper.

Data availability attack. A malicious block producer 1) publishes a block header, so that light nodes can check transaction inclusion; but 2) withholds a portion of the block (e.g., invalid transactions), so that it is impossible for honest full nodes to validate the block and generate the fraud proof.

Although the honest full nodes are aware of the data unavailability, there is no good way to prove it. The best they can do is to raise an alarm without a proof. However, this is problematic because the malicious block producer can release the hidden parts after hearing the alarm. Due to network latency, other nodes may receive the missing part before receiving the alarm and, thus, cannot distinguish who is prevaricating. Due to this, there is no reward and punishment mechanism that can properly reward honest full nodes while also deterring false alarms and denial-of-service attacks.

Therefore, for fraud proofs to work, light nodes must determine data availability by themselves. This leads to the following key question: when a light node receives the header of some block, how can it verify that the content of that block is available to the network by downloading the least possible portion of the block?

Need to encode the block. Since a transaction is much smaller than a block, a malicious block producer only needs to hide a very small portion of a block. Such hiding can hardly be detected by light nodes unless the entire block is downloaded. However, by adding redundancy to the data through appropriate erasure codes [22]

), any small hiding on the origin block will be equivalent to making a significant portion of the coded block unavailable, which can be detected by light nodes through randomly sampling the coded block with exponentially increasing probability. As a counter measure, a malicious block produce could instead conduct coding incorrectly to prevent correct decoding. Light nodes rely on honest full nodes to detect such attacks and prove it through an incorrect-coding proof.

For example, an Reed-Solomon (1D-RS) code [31] offers a detection probability of after sampling distinct coded symbols uniformly at random. But an incorrect-coding proof will consist of coded symbols, which is of the same size as the original block and thus is too large. This cost is alleviated to in [10] by using two-dimensional RS codes (2D-RS), at the costs of reduced sampling efficiency, and increased block hash commitments of Merkle roots to verify the coding correctness within each dimension. In addition, 1D-RS and 2D-RS codes have a high decoding complexity of and , respectively.

In summary, with erasure coding, a light node pays 3 download costs for data availability, including block hash commitments, symbol sampling, and incorrect-coding proofs. Among them, the incorrect-coding proof cost must be minimized to defend fake proofs, for which both 1D-RS and 2D-RS are sub-optimal.

Our contributions. In this paper, we propose SPAR (SParse frAud pRotection), the first data availability solution that promises order-optimal performance on all the metrics, including 1) a constant block hash commitment size; 2) a constant sampling cost for a given confidence level on data availability; 3) a constant incorrectly-coding proof size; and 4) linear decoding complexity (Table 1).

hash commitment
size (bytes)
# of samples to gain
certain confidence
about data availability
proof size (bytes)
Uncoded - -
2D-RS [10]
Table 1: Light node download costs and full node decoding complexity (: block size in bytes).

At the core of SPAR is a novel cryptographic hash accumulator called coded Merkle tree (CMT). Starting from the bottom, CMT iteratively encodes layers of the tree and uses the hashes of the coded layer as the data for the next layer. A light node can detect the availability of the entire tree through the Merkle proofs of bottom layer leaves. With the entire tree available, SPAR uses a novel hash-aware peeling decoder and a special ensemble of random LDPC codes to maximize sampling efficiency, minimize incorrect-coding proof to one parity equation, and achieves linear decoding complexity.

SPAR and CMT implementation. We have developed a complete and modular CMT library in Rust [2] and Python [7]. We have also implemented an instance of SPAR for the Bitcoin Parity client [33], where SPAR outperforms state of the art [10] by more than 10-fold in hash commitments, incorrect coding proof, and decoding speed.

Related works. This work was inspired by pioneering research in [10]. Besides this work, coding also improves scalability of blockchains in other areas: [28] studies the coding efficiency of distributed storage systems [9, 15, 14, 30]. In a related vein, [29] uses a combination of Shamir’s secret sharing [32] (for storing the headers and the encryption keys) and private distributed storage (for the blocks) to reduce the storage overhead while guaranteeing data integrity and confidentiality. [21] uses Lagrange coding to simultaneously scale storage, computation, and security in a sharded blockchain [24, 20], via coding across different shards.

2 Security Model

The core responsibility of a blockchain is to produce, verify, and accept/store valid data blocks in a consistent but decentralized manner. A data block, denoted by , is a byte string of length that carries a batch of transactions. is valid for acceptance if and only if every single transaction in it is valid (e.g., enough balance, no double spending). Thus incomplete data blocks are tantamount to being unacceptable. Data incompleteness is not a threat to a node that fully downloads the block. However, state-of-the-art blockchain systems also run light nodes which do not download the blocks in entirety. We next describe these two types of nodes formally (see Fig. 1).

Full nodes are able to produce blocks (e.g., by batching submitted transactions), and to download and verify blocks produced by other full nodes. Upon acceptance, they store the entire block locally. Upon rejection, they broadcast a fraud proof to alert the other nodes. We note, however, that malicious full nodes do not necessarily follow such requirements, and can act arbitrarily.

Light nodes can only afford to download a small amount of data from each block and perform simple computations such as hash checks and fraud proof verification, but not to operate on whole data blocks. By accepting a block , they only store its hash commitment . Here is a hash accumulator such as Merkle tree generator, which will allow it to use to verify the inclusion of any transaction in through a Merkle proof. Without loss of generality, we assume light nodes are honest, as they are not able to deceive full nodes.

Figure 1: System model showing how full nodes and light nodes are connected. Full nodes store all symbols of the block that they had access to (these symbols are shown without being crossed out). There is at least one honest full node but potentially many adversarial full nodes (labeled Adv). Light nodes query some random samples from the block producer and forward them to the other full nodes.

We assume the following model of connectivity between the nodes:

1. Connectivity graph: We assume the following connectivity model (a) Each light node is connected to a set of full nodes, with at least one honest full node. (b) There is at least one honest full node that is connected to all light nodes. (c) Light nodes need not be connected to each other. (d) There is a broadcast group that sends messages (usually hash commitments) to all nodes, full and light.

2. The network is synchronous (any message is received by all intended recipients within a fixed delay), and the communication is lossless and secure.

3. The network allows users to send messages anonymously.

Importantly, we allow dishonest majority, i.e., each light node can be connected to more malicious full nodes than honest ones. Due to this, a light node cannot determine the completeness of a block through its connected full nodes, via a majority vote for instance. Due to complete network coverage, once a block is produced and broadcast to the network, an honest full node will be able to download and verify it. In case it detects any fraud (e.g., double-spends), it will broadcast a fraud proof, so that all the other nodes will eventually receive this proof and reject the block. On the other hand, the light nodes only download the hash commitment , and will rely on the fraud proof to reject the block.

A malicious block producer is motivated to conduct a data availability attack, where it 1) does not fully disclose , so that honest full nodes are not able to verify ; and 2) broadcasts , so that itself and its colluding full nodes can forward to their connected light nodes and deceive them that the that satisfies is valid for acception. Thus, the key for a light node to protect itself from accepting a fraudulent block is to make sure that is fully available. This gives rise to the main problem we try to address in this paper:

Data availability problem: Upon receiving a hash commitment , how can a light node efficiently verify that a data block that satisfies is fully available to the system?

A simple strategy for a light node is to randomly sample portions of , and reject it if it does not receive all requested portions. Since the size of a transaction is usually much smaller than the block, a malicious block producer only needs to hide a very small portion (say, e.g., a few hundred bytes) of a fraudulent block, which can hardly be detected through random sampling.

A malicious block producer could also conduct selective disclosure: when requested by light nodes, the malicious block producer may select a subset of the light nodes and fully disclose their requested portions, as long as the total disclosed portions do not reveal . These light nodes will be deceived about the availability of and will accept it, as no fraud proof of can be produced.

Thus, as similarly done in [10], we characterize the security of the above described system using the following measures:

Soundness: If a light node has determined that a data block is fully available, then at least one honest full node will be able to fully recover this data block within a constant delay.

Agreement: If a light node has determined that a data block is fully available, then all the other light nodes in the system will determine that the data block is fully available within a constant delay.

Recently, an erasure coding-assisted approach was proposed in [10] to improve sampling efficiency and suppress the data availability attack. In the next section, we will motivate this approach and overview the challenges it faces.

3 Overview of Erasure Coding Assisted Approach

An erasure code evenly partitions a block of bytes into data symbols of bytes each as , and linearly combines them to generate a coded block with coded symbols, . The hashes of these coded symbols are accumulated to obtain the hash commitment of , which is published with . With a good erasure code, a block producer’s hiding of one data symbol is equivalent to making the value of many coded symbols unavailable to the system. In general, a pair of good erasure code and decoding algorithm yields a large undecodable ratio , which is the minimum fraction of coded symbols a malicious block producer needs to make unavailable to prevent full decoding. Such hiding can be caught by a light node with an exponentially increasing probability of through randomly sample coded symbols when is large, indicating that samples are sufficient. Below is an example.

Example 1

Uncoded v.s. coded sampling efficiency. Given a block of 4 data symbols , a block producer generates 8 coded symbols as follows:


To prevent decoding through hiding, a malicious block producer must either publish no more than 3 data symbols or no more than 5 coded symbols. Both will make at least 3 coded symbols unavailable to the system (). Such unavailability can be caught with a probability of after randomly sampling 2 distinct coded symbols. In contrast, without coding, the hiding of one data symbol can be caught with a probability of only .

To counter erasure coding assisted random sampling, a malicious block producer could conduct an incorrect-coding attack: It generates coded symbols that fail the parity equations (the equations describing the linear relations between coded symbols in Example 1) specified by the erasure code, and generates using these invalid coded symbols. This way, it can pass light node random sampling through hiding only one data symbol and publishing most of the coded symbols, which will not allow honest full nodes to correctly recover .

Fortunately, this attack can be detected by honest full nodes by comparing the decoded block with the commitment . Upon detection, an honest full node can generate an incorrect-coding proof, which consists of the coded symbols of failed parity equation(s) and appropriate hash commitments, so that light nodes can verify them and reject the block. Using Example 1, an incorrect coding proof about could be and with matching Merkle proofs, plus the Merkle proof of , which, however, does not match the value of .

To keep incorrect coding proofs small, [10] applies 2D-RS (2-dimensional Reed-Solomon) code. The data symbols are placed as a square, then a RS code is applied to every row/column. Merkle roots are then generated from the resulted rows/columns. With these Merkle roots as the block header, a light node can verify any incorrect row/column by decoding this row/column using coded symbols of it (from an incorrect-coding proof) and reproducing its root. Thus, 2D-RS offers light nodes 1) a header cost of , 2) a sampling cost of (due to growing Merkle proof size), and 3) an incorrect-coding proof size of , where is the block size in bytes.

In this paper, we propose SPAR (SParse frAud pRotection), the first solution to the data-availability problem that is order-optimal in all the above three metrics: a header cost of , a sampling cost of , and an incorrect-coding proof size of . To this end, SPAR leverages four core components:

  1. a novel hash accumulator named coded Merkle tree (CMT), which encodes every layer of the tree to protect the availability of the entire tree. This way, the Merkle proof of every coded symbol will be available, which will enable every parity equation to be committed and verified alone;

  2. a dedicated sampling mechanism that enables a light node to check the availability of the entire CMT by sampling bytes plus one Merkle root;

  3. a hash-aware decoding algorithm that is able to detect and prove any single failed parity equation, provided the Merkle proofs of all the coded symbols;

  4. a special ensemble of random LDPC (low-density parity check) codes with a constant parity equation size and a constant undecodable ratio under hash-aware decoding algorithm, which protects all CMT layers equally.

4 Detailed Description of SPAR

In this section, we describe the four core components of SPAR: the construction of the coded Merkle tree by the (honest) block producer, the sampling mechanism of the light nodes, the decoding and alerting operations of the honest full nodes, and the erasure codes used by SPAR. At the end of this section, we will summarize the action space of each node in the network.

4.1 Construction of coded Merkle tree

In SPAR, an honest full node detects and proves incorrect-coding using the membership proofs of all the coded symbols in one parity equation and the values of at least of these coded symbols. Since any parity equation can be compromised, a light node needs to make sure the membership proofs of all the coded symbols are available at honest full nodes. In other words, it needs to make sure the entire Merkle tree is available.

To this end, we propose CMT. At a high level, CMT applies erasure coding to every layer of the tree, where the data symbols of a layer are generated using the hashes of the coded symbols of its child layer. This way, a light node can check the availability of every layer through random samplings, whilst an honest full node can detect and prove the incorrect coding at any layer, with the help of the hashes of this layer provided at its parent layer.

More specifically, given a block of data symbols, a rate- () systematic erasure code with an undecodable ratio of is applied to generate coded symbols, where the first are the original data symbols and the remaining are called parity symbols (hence the name systematic). Then the the hashes of every coded symbols are batched as one data symbol for the next (parent) layer. This yields a total of data symbols for the next layer, which will be encoded using a smaller (in terms of ) rate- systematic code with the same undecodable ratio. This iterative encoding and batching process stops once there are only () hashes in a layer. These hashes are the root of the CMT, and will be included in the block header and published with the original data block.

CMT layer size reduces at a rate of . Thus, for CMT to converge. In addition, to enable efficient sampling of both data and parity symbols (will discuss next), batching is interleaved, namely, the coded symbols whose hashes are batched together consist of data symbols and parity symbols. An example of CMT with , , , and is illustrated in Fig. 2. Indeed, a classic Merkle tree is a special CMT with and .

Figure 2: Coded Merkle tree for data symbols using rate erasure codes. Each data symbol of a higher layer is constructed by batching the hashes of data symbols and parity symbols of its child layer.

4.2 Sampling mechanism of light nodes

Similar to a classic Merkle tree, the Merkle proof of a base layer symbol in CMT consists of all the sibling hashes between this symbol and the root. The only difference is that the number of sibling hashes per layer is now instead of 1, which effectively provides the light node one data symbol from every intermediate layer. Thus, when a light node randomly samples distinct base layer coded symbols, the associated Merkle proofs will automatically sample, at no extra cost, distinct data symbols from every intermediate layer w.h.p.

To properly check the availability of an intermediate layer, a light node should also randomly sample about parity symbols from this layer. To avoid downloading extra Merkle proofs for these parity symbols and to minimize the correlation between the samplings, SPAR samples parity symbols of intermediate layers probabilistically: For every pair of parent and child intermediate layer, if a parent layer data symbol is sampled, then with probability , one of its child parity symbols (thanks to interleaved batching) will be sampled uniformly at random. Thus, the response size of one sampling request will be:


where is the base layer symbol size, is the size of a hash (e.g., 32 bytes), is the size of the partial data symbol from an intermediate layer for Merkle proof, is the average size of probabilistically sampled parity symbol from an intermediate layer, and is the number of layers. See Fig. 3 for sampling response of a coded symbol on the based layer of the CMT in Fig. 2.

Finally, to counter selective disclosure conducted by the malicious block producer, a light node will make the requests separately, anonymously, with replacement, and with some delay between every two requests. This will prevent the malicious block producer from selectively deceiving any particular light node, or deceiving the set of light nodes that make requests at the beginning. Therefore, every light node will have the same chance to catch data availability attack.

Figure 3: Merkle proof in CMT, and probabilistic sampling of parity symbols in intermediate layers. The solidly circled symbols constitute a base layer coded symbol and its Merkle proof. For the two yellow parity symbols circled in dash, with a probability of , one of them (with equal chance) will be included in the proof for light node sampling purpose. So will the two blue parity symbols circled in dash.

4.3 Hash-aware peeling decoder and incorrect-coding proof

A hash-aware peeling decoder is similar to conventional LDPC peeling decoder. Given the hashes of all the coded symbols and coded symbols of a layer, it iteratively solves degree-1 parity equations and check each decoded symbol against its hash and associated parity equations (Algorithm 1). This way, the detection and proof of incorrect-coding is minimized to one parity equation.

The key condition for the peeling decoder to work is that the hashes of all the coded symbols are available. This is assured by CMT: By first downloading the root, the decoder will have all the hashes needed to decode the previous layer. Once this layer is successfully decoded, the decoded data symbols will provide all the hashes needed to decode its child layer. This top-down decoding continues until the data block is decoded, or incorrect-coding is detected at one of the layers. To prove a failed parity equation that consists of coded symbols, the decoder only needs to provide the Merkle proofs of these coded symbols, and the value of coded symbols. Note that the higher the failed layer, the shorter the Merkle proof of each symbol in the incorrect-coding proof.

In addition, the peeling decoder only works if 1) there are coded symbols available, and 2) that these coded symbols allow the recovery of all the data symbols. While the first condition is checked by light nodes through random sampling, the second condition requires us to find, for every layer, a erasure code whose undecodable ratio is under peeling decoding. The best performance is achieved if the codes are extremely sparse (with a small ) and have a large . We now present such an ensemble of LDPC codes.

4.4 Construction of Erasure Code

An erasure code can be described by an parity check matrix , where each column of describes a parity equation, such that for any valid codeword . In addition, every stopping set of corresponds to a set of coded symbols whose hiding will prevent the full recover of data symbols using peeling decoder. For an parity check matrix , a set of rows is called a stopping set if no column in has one non-zero element. Here is the submatrix of that only consists of the rows in .

1 Inputs: the hashes of all coded symbols and coded symbols;
2 Initial check: checks all the degree-0 parity equations (i.e., those whose coded symbols are all known). If any parity equation is failed, report an incorrect-coding proof and exit;
3 while not all the data symbols are recovered do
4       Find a degree-1 parity equation, which only has one unknown coded symbol;
5       Recover this coded symbol and verify it with its hash. If failed, report an incorrect-coding proof and exit;
6       Check all the associated degree-0 parity equations. If any parity equation is failed, report an incorrect-coding proof and exit.
7 end while
Algorithm 1 Hash-aware peeling decoding algorithm

Correspondingly, there is no parity equation that includes exactly one coded symbol among those indexed by . Thus, if this set of coded symbols are hidden, there is no degree-1 parity equation to recover them. Since the peeling decoder is essential for us to construct small incorrect-coding proof, the undecodable ratio of a block is equivalent to the stopping ratio of , which is the size of the smallest stopping set divided by .

While CMT admits any erasure codes, SPAR generates randomly using the methods introduced in [12, 23] and analyzed in [27]. Given two integers and that satisfies , we first generate a binary matrix that is a permutation of an by identity matrix (note that ). We then partition into slice, where each slice is a sub-matrix. Then if and only if slice-

contains an odd number of 1s, for

and . Such a random has the following three critical properties:

  1. It has a maximum row weight of , and a maximum column weight of ;

  2. It has a non-zero probability to have a stopping ratio of at least , where is a critical stopping ratio inherent to this method and is independent of ;

  3. It is NP-hard to find the minimum stopping set and determine the stopping ratio of .

Property 1 implies that the corresponding LDPC code has a maximum parity equation size of . Property 2 implies that we could provide the same undecodable ratio (thus same sampling requirements) for all layers. Both are desirable.

Nevertheless, Property 2 and 3 together imply that we, as the developers, are not able to determine whether the LDPC codes we generate are good () or not, for any reasonably large (e.g., ).

Fortunately, this problem can be easily solved through a bad-code proof. If an honest full node cannot fully decode the data symbols after receiving coded symbols, then this code is bad, and its small undecodable set has been found and hidden by a (very strong) malicious block producer. In this case, the honest full node can prove this bad code by broadcasting the indices of the coded symbols it is missing. Upon receiving a valid bad-code proof, all the nodes in the system reject the associated block, and regenerate a code for the failed layer using an agreed random seed. This seed can be drawn from a pre-defined number sequence or the block header of a previous block, so that no consensus protocol is needed.

In other words, we solve the NP-hard problem of finding good codes by exploiting the computational resources of malicious party. Once it finds a small undecodable set and hide it, the system can easily detect this, reject the block, and update the code. This way, the system will settle at a good code for every layer eventually. As we will show in the next section, the probability of generating good codes is extremely high, so that SPAR can settle at good codes very quickly without having light nodes accept any fraud blocks. In addition, since bad code is a rare event, a heavy incentive/punishment scheme can be applied to deter false bad-code proof. Thus, the download and verification cost of bad-code proof is amortized to negligible throughout the course of the system.

4.5 Protocol Summary: Actions of Different Node Types

  • Block producer (full node): (a) It generates CMT and broadcasts the CMT root to all nodes, as well as broadcasts the entire original block (not CMT, as it can be retrieved using the original) to the full nodes only. (b) On receiving sample requests from the light nodes, respond to them.

  • Light node: (a) On receiving a new CMT root (or a CMT root of a pending block from a new full node) , it makes separate, anonymous, and intermittent sampling requests with replacement to full nodes who claim that the block is available, as described in Section 4.2. (b) On receiving a sample, it broadcasts it to all connected full nodes. (c) If a node receives all requested samples, it assumes the block is available. (d) If a node does not receive all requested samples within a fixed time, it “pends” the block (i.e., keeps it in pending status). (e) If a node receives an incorrect-coding proof or bad-code proof, it rejects the block. In case of bad-code proof, it will also update the erasure code of the failed layer.

  • Other full node. (a) On receiving samples, it tries to recover the data block through both downloading the original data block from the other full nodes and collecting coded symbols forwarded by the light nodes. It will decode the tree from top to bottom using a hash-aware peeling decoder. (b) If an incorrect coding or a bad code has been detected, it will send the corresponding proof and reject this block. (c) If it has received/fully decoded a data block and verified it, it will declare the availability of this block to all other nodes and respond to sample requests from light nodes.

5 Performance Analysis

5.1 Security

Theorem 5.1

In SPAR, a block producer cannot cause the soundness and agreement to fail with a probability lower than

Here and are the number of symbols and undecodable ratio on the th layer of CMT, , and is the number of coded symbols each light node samples from the base layer.


Soundness: Soundness fails if a light node thinks that a block is available, and no full node is able to reconstruct the entire coded Merkle tree. We note that the reconstruction fails if any layer of the CMT cannot be recovered correctly. Let us focus on a single layer with coded symbols and an undecodable ratio of , and assume that the malicious block producer hides fraction of the coded symbols (and does not respond to requests for those symbols).

Case-1: Consider the case of . The probability of soundness failure for a node is given by the probability that a node receives all symbols that it samples, this probability is .

Case-2: Consider the case of . The soundness failure occurs if a full node cannot decode the entire block or is unable to furnish a incorrect-coding proof. The full node will fail to accomplish these tasks only when it is able to receive fewer than fraction of symbols. Define to be the total number of distinct symbols collected by the honest full node (). Let be the total number of light nodes, then is the total number of i.i.d. samples. Now we have


Here (3) is by counting the number of sampling instances that provide less then distinct symbols. is the binary entropy function. It is apparent that we would need large to make the above bound vanish exponentially with .

The probability of soundness failure is smaller than the maximum probability of the two cases.

Agreement: We will argue here that soundness implies agreement for our protocol. As defined, soundness ensures that a honest full node is able to decode the block. Once a honest full node decodes the block, it will let all light nodes know that it has that block. The light nodes have either already accepted the block or have “pend”-ed the block (the light nodes could not have rejected the block since it is a valid block). The light nodes that pended the block will query with the honest full node and eventually accept the block. Thus soundness implies agreement (since now every light node agrees on the availability of the block).

Theorem 5.1 implies that the security of SPAR increases exponentially with the number of samples each light node takes (), when the number of light nodes () is linear with the block size.

5.2 Costs and Complexity

A light node has three download costs: 1) the header, which is the CMT root of size ; 2) the random sampling, and 3) the incorrect-coding proof. In CMT, the header is the CMT root of size . The sampling cost can be computed using the average parity-symbol-sampled Merkle proof size given in (2) to be:


where is the size of a block, and the equations hold due to that 1) is a constant; and 2) is the base layer symbol size, which is a constant. The incorrect-coding proof size can be similarly computed as


where the first term is the size of coded symbols, and the second term is the size of Merkle proofs. Finally, since the hash-aware peeling decoder decodes one coded symbol using coded symbols in one parity equation, the decoding complexity is per symbol and, thus, is in total.

5.3 Choice of parameters

Our first key parameter is the coding rate . A smaller means more parity symbols and thus a potentially larger undecodable ratio and less sampling. But it will also increase the height of CMT, the Merkle proof size, and the decoding complexity. For a reasonable tradeoff, we choose .

Given , the next two parameters we should decide are a pair that satisfies for the random LDPC code generator, where is the maximum parity equation size. This gives us and requires us to find the critical undecodable ratio of the ensemble as a function of , which is provided in Table 2 based on the analysis in [27].

4 8 12 16 20
0.0795 0.124 0.111 0.0981 0.0877
Table 2: The critical undecodable ratio as a function of

Evidently, maximizes the critical undecodable ratio. In addition, it also admits a small incorrect coding proof that only requires only 7 coded symbols and 8 Merkle proofs. As a result, we choose .

5.4 How quickly does SPAR settle at a good erasure code?

Due to random code generation, each layer of SPAR eventually settles at a good code (with an undecodable ratio of at least ) after a few bad codes have been deployed by the system and potentially utilized by malicious block producer to hide the data. We study the number of such attacks (note that they will never succeed) and updates before SPAR can settle: intuitively, this number can be computed as , where is the probability that a randomly generated code has an undecodable ratio smaller than . Using an upper bound on characterized in [27] we can derive the settlement speed of SPAR as below. We note that most of the layers of CMT will immediately settle at a good code upon launching. The only exception is the layer with , which will settle after 7.7 bad codes, but without any fraudulent blocks been accepted. The proof is in Appendix 0.A.

Theorem 5.2

Using the random LDPC code generation technique described in Section 4.4, the expected number of bad erasure codes () a CMT layer with coded symbols will use before it settles at a good code is approximated in Table 3.

256 512 1024 2048 4096
0.886 5.3e-2 2.0e-3 1.3e-3 3.2e-4 3.2e-4
# bad codes 7.7 0.06 0.002 0.001 0.0003
Table 3: Number of bad code before settlement.

6 Implementation for Bitcoin and Experiments

We developed in Rust a Coded Merkle Tree library [2] for Parity Bitcoin [33] clients (see Appendix 0.B for more details). Our library integrates seamlessly into the current mainnet implementation, and requires minimal change on the block data structure (only need to add the root of CMT into the block header). Due to the change to block data structure, despite being minimal, our protocol is not directly compatible with existing Bitcoin clients. Development of a Bitcoin testnet with complete data availability proof functionalities enabled by CMT and SPAR, together with a Bitcoin Improvement Proposal (for Bitcoin Core), is an activity of current research; these activities are outside the scope of this paper.

We can combine the CMT library with the performance analysis in Section 5, and numerically evaluate SPAR’s light node download costs (header, sampling, and incorrect-coding proof) and full node decoding speed, for the typical Bitcoin block size of 1MB. We also compare the performance of SPAR with that of the 2D-RS based solution proposed in [10] using its C++/Python implementation [13, 6], for a wide range of block sizes as specified in Table 4:

parameter value notes
symbol size (B) 256
base layer to block size is thus 1 to 1024 MB
coding rate 0.25 thus
hash size (B) 32 SHA256 is used
target confidence 99% each light node keeps sampling until it is 99% confident
SPAR specific parameters
LDPC sparsity (6, 8) each parity equation has at most 8 coded symbols
stopping ratio 0.124 symbols must be hidden to prevent decoding
batching factor 8 CMT layer size reduction rate is as ordinary trees
CMT root size 256 hashes the same as 2D-RS header size for 1MB blocks
Table 4: Experiment Parameter Configuration

Header (Fig. 4(a)): A SPAR light node only downloads fixed hashes in header, whilst 2D-RS requires . Thus, the header download cost of SPAR becomes much smaller than 2D-RS with growing block size. For a 64MB block, the cost is only 0.01% of the block size in SPAR, but is 0.1% in 2D-RS.

Incorrect-coding proof (Fig. 4(b)): A SPAR incorrect-coding proof only involves coded symbols and their Merkle proofs, whilst 2D-RS requires . Thus, the incorrect-coding proof download cost of SPAR becomes much smaller than 2D-RS with growing block size . For a 64MB block, the cost is only 0.051% of the block size in SPAR, but is 0.48% in 2D-RS.

Sampling cost (Fig. 4(c)): 2D-RS has a higher undecodable ratio of compared to SPAR’s . Thus, for 99% confidence, distinctive samples are enough in 2D-RS, whilst SPAR requires if the adversary is strong enough to find, with NP-hardness, the size- stopping set. But under a realistically weak adversary that randomly selects CMT symbols to hide, SPAR only requires because our LDPC ensemble can tolerate an average of missing symbols. On the other hand, the over-sampling of each layer increases SPAR’s sampling cost. Thus, although both techniques offer sampling costs that quickly reduces with growing block size, the cost of SPAR is about 1016 (resp. 2.54) times of 2D-RS under strong (resp. weak) adversaries. However, in practice, one can further reduce SPAR sampling cost by increasing the header size , thus reducing the size of the Merkle proof of each symbol.

Decoding speed (Fig. 4(d)): SPAR’s sparse and binary encoding, at its current implementation level is already over 10 times faster than 2D-RS for all the tested block sizes.

(a) Header cost
(b) Incorrect-coding proof size
(c) Sampling cost
(d) Decoding speed
Figure 4: Communication costs and decoding speed of SPAR and 2D-RS.

7 Conclusion and Discussions

By iteratively applying a special ensemble of LDPC codes to every layer of a Merkle tree and batching the hashes of each coded layer into the data symbols of the next layer, we invented a novel hash accumulator called coded Merkle tree (CMT). Built upon CMT, we proposed a novel data availability verification system called SPAR, which allows the availability and integrity of the entire tree to be checked at constant costs.

SPAR can play a key role in scaling blockchain systems that incorporate light nodes because it empowers these nodes with real-time verification of data availability and integrity at small and constant costs. SPAR can also be used to scale the communication of sharded blockchain systems (e.g., [24, 20, 21]), where full nodes of one shard operate as light nodes of other shards, as SPAR allows them to efficiently check the availability and integrity of blocks in other shards.

Integrating SPAR into existing blockchain systems requires minimum changes and no extra bandwidth consumption. An honest block producer only needs to broadcast the original data block as usual and attach the CMT root in the block header. This is sufficient for other full nodes to reproduce the CMT and offer sampling services for light nodes. Our library for CMT in Rust for Parity Bitcoin clients maintains the same API as the standard Merkle tree module. Noting that classic Merkle trees are indeed special CMTs with coding rate and batching factor , our library readily replaces the standard module and is backward compatible.


This research was conducted under the auspices of Trifecta Blockchain Inc.


Appendix 0.A Proof of Theorem 5.2

Based on the proof of Theorem 8 of [27], we know that for an LDPC code that is randomly chosen from a ensemble, described in Section 4.4, the probability that the stopping distance of the code is smaller than is upper-bounded by


where for , , and as the only positive solution to ,

For , the above upper bound for small (e.g., ) becomes degenerated (i.e., reduces to the trivial bound of ). In order to obtain a good approximation for all the values of , we approximate the upper bounds on using a slightly smaller undecodable ratio of . Then, we evaluate the upper bounds on in (8), for all the considered values of in Theorem 5.2 to obtain the probabilities in the second row of Table 3.

We note that since 0.116 is very close to 0.124, SPAR’s inherent oversampling of intermediate layers will provide sufficient protection for data availability on these layers, so that the light node sampling cost will not increase.

Appendix 0.B Coded Merkle Tree Library

We developed in Rust a Coded Merkle Tree library [2] for Parity Bitcoin [33] clients. We modify the data structure of the block header to add a new field coded_merkle_roots_hashes, which are the hashes of the coded symbols on the last level of the coded Merkle tree constructed from this block.

To use the Coded Merkle Tree library on a block, we require the following input parameters from the users:

  • BASE_SYMBOL_SIZE: size of a symbol on the base level, and hence the number of systematic symbols on the base level.

  • AGGREGATE: number of hashes to aggregate into a symbol on the next level.

  • HEADER_SIZE: number of hashes stored in the block header. This also decides the total number of levels in the coded Merkle tree.

  • Codes for all levels of coded Merkle tree, in forms of sparse representations of their parity-check matrices.

Given the above parameters, Coded Merkle Tree implements the following key functionalities:

  • coded_merkle_roots: construction of the coded Merkle tree from the block content.

  • merkle_proof: generating the Merkle proof for any symbol on any level of coded Merkle tree. By design, this returns a set of symbols on the higher level.

  • sampling_to_decode: sampling symbols on the base level, together with their Merkle proofs.

  • run_tree_decoder: decode the entire coded Merkle tree level by level from the roots. Each level is decoded by running a hash-aware peeling decoder, using the decoded symbols on the previous level as the hash commitments.

  • generate_incorrect_coding_proof: 1) when a coding error is detected, this function returns symbols in a parity equation, and Merkle proofs for all symbols in that equation; 2) when the peeling process gets stuck before all symbols are decoded, this function returns the indices of the missing symbols as a stopping set.