
Analysis of Indexing Structures for Immutable Data (Full Version)

by   Cong Yue, et al.

In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. This leads to three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree). Although these structures have been adopted in real applications, there is no systematic evaluation of their pros and cons in the literature. This makes it difficult for practitioners to choose the right index structure for their applications, as there is only a limited understanding of the characteristics of each index. To alleviate the above deficiency, we present a comprehensive analysis of the existing index structures for immutable data, evaluating both their asymptotic and empirical performance. Specifically, we show that MPT, MBT, and POS-Tree are all instances of a recently proposed framework, dubbed Structurally Invariant and Reusable Indexes (SIRI). We propose to evaluate the SIRI instances based on five essential metrics: their efficiency for four index operations (i.e., lookup, update, comparison, and merge), as well as their deduplication ratios (i.e., the size of the index with deduplication over the size without deduplication). We establish the worst-case guarantees of each index in terms of these five metrics, and we experimentally evaluate all indexes in a large variety of settings. Based on our theoretical and empirical analysis, we conclude that POS-Tree is a favorable choice for indexing immutable data.



1. Introduction

In high-stakes applications and highly regulated industries, an accurate history of data is required for auditing and tracking purposes. Data in the network is also often vulnerable to malicious tampering. To support data lineage verification and mitigate malicious data manipulation, data immutability is essential for applications such as banking transactions and for emerging decentralized applications such as blockchains, digital banking, and collaborative analytics. From the data management perspective, data immutability leads to two major challenges.

First, it is challenging to cope with the ever-increasing volume of data caused by immutability. For instance, any full node in the Bitcoin blockchain network needs to record all Bitcoin transactions in the history, which are currently more than 200GB in size and increasing day by day (7). As a consequence, a new node joining the network would need to spend hours or even days to download the entire history of transactions, which is a significant overhead. Another example is the sharing and storage of healthcare data for healthcare analytics. Data scientists and clinicians often make relevant copies of current and historical data in the process of data analysis, cleansing, and curation. Such replicated copies could consume an enormous amount of space and network resources. To illustrate, let us consider a dataset that (i) has 100,000 records initially, (ii) receives 1,000 record updates in each modification, and (iii) has a version created on each commit. Figure 1 shows the space and time required to handle all versions as functions of the total number of versions, as well as the space and time required if we disregard duplicate records among different versions (measured on an Intel(R) Xeon(R) E5-1620 v3 CPU with a 1 Gigabit Ethernet card). It can be observed that (i) the space and time overheads are significant if all versions are to be stored separately, and (ii) such overheads could be considerably reduced if we can deduplicate the records in different versions.

Figure 1. Data storage and transmission time improved by deduplication
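
The storage side of this illustration can be approximated with simple arithmetic. The sketch below is not the measurement behind Figure 1; it assumes that every modification overwrites existing records (so the dataset stays at 100,000 records) and that the first version is the initial dataset. The function name `storage_records` is hypothetical.

```python
def storage_records(initial: int, updates_per_version: int, versions: int) -> tuple[int, int]:
    """Return (records stored without dedup, records stored with dedup)."""
    # Without deduplication, every version keeps a full copy of the dataset.
    full_copies = initial * versions
    # With record-level deduplication, the first version stores everything and
    # each later version adds only the records it changed (assumption: updates
    # overwrite existing records, so the dataset size stays constant).
    deduplicated = initial + updates_per_version * (versions - 1)
    return full_copies, deduplicated


for n in (1, 10, 100):
    full, dedup = storage_records(100_000, 1_000, n)
    print(f"{n:>4} versions: {full:>10} records vs {dedup:>8} deduplicated")
```

Under these assumptions the gap grows linearly with the number of versions, which matches the qualitative trend described above.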

The second challenge is that in case a piece of data is tampered with (e.g., malicious manipulation of crypto-currency wallets or unauthorized modifications of patients’ lab-test data), we have to detect it promptly. To address this challenge, the system needs to incorporate tamper-resistant techniques that support the authentication and recovery of data, so as to ensure data immutability. Towards this end, a typical approach is to adopt cryptographic methods for tamper mitigation, which, however, considerably complicates the system design.

Most existing data management solutions tackle the above two challenges separately, using independent methods that do not affect each other. In particular, they typically (i) ensure tamper evidence using cryptographic fingerprints and hash links (Nakamoto, 2009), and (ii) achieve deduplication with delta encoding (Maddox et al., 2016; Huang et al., 2017). Such a decoupled design, however, incurs extra overheads that could severely degrade system performance. For example, in state-of-the-art blockchain systems such as Ethereum (10) and Hyperledger (15), tamper evidence is externally defined and computed on top of the underlying key-value store (e.g., LevelDB (21) or RocksDB (33)), which leads to considerable processing costs. In addition, delta-encoding-based deduplication (e.g., in Decibel (Maddox et al., 2016)) requires a reconstruction phase before an object can be accessed, which renders data access rather inefficient.

Motivated by the above issues, recent work (43; 44; 15) has explored data management methods with native support for both tamper evidence and deduplication. This leads to three new index structures for immutable data, namely, Merkle Patricia Trie (MPT) (Wood, 2014), Merkle Bucket Tree (MBT) (15), and Pattern-Oriented-Split Tree (POS-Tree) (Wang et al., 2018), all of which have been adopted in blockchain applications. To our knowledge, however, there is no systematic comparison of these three index structures in the literature, and the characteristics of each structure are not fully understood. This makes it difficult for practitioners to choose the right index structure for their applications.

To alleviate the above deficiency, this paper presents a comprehensive analysis of MPT, MBT, and POS-Tree. Specifically, we make the following contributions:

  • We show that MPT, MBT, and POS-Tree are all instances of a recently proposed framework, named Structurally Invariant and Reusable Indexes (SIRI) (Wang et al., 2018). Based on this, we identify the common characteristics of MPT, MBT, and POS-Tree in terms of tamper evidence and deduplication.

  • We propose a benchmarking scheme to evaluate SIRI instances based on five essential metrics: their efficiency for four index operations (i.e., lookup, update, comparison, and merge), as well as their deduplication ratios, which is a new metric that we formulate to quantify each index’s deduplication effectiveness. We establish the worst-case guarantee of each index in terms of these five metrics.

  • We experimentally evaluate all three indexes in a large variety of settings. We demonstrate that they perform much better than conventional indexes in terms of the effectiveness of deduplication. Based on our experimental results, we conclude that POS-Tree is a favorable choice for indexing immutable data.

The rest of the paper is organized as follows. Section 2 presents the background and prior research on the core application properties. Section 3 presents SIRI, along with an extended discussion of its significance and an explanation of three SIRI representatives. A theoretical analysis is conducted in Section 4 to reveal the operational bounds of SIRI, while the experimental evaluation is reported in Section 5. We conclude this paper in Section 6.

2. Related Work

We first discuss the background and several primary motivations leading to the definition of SIRI.

2.1. Versioning & Immutability

Data versioning has been widely employed for tolerating failures, errors, and intrusions, as well as for analysis of data modification history. ElephantFS (Santry et al., 1999) is one of the first-generation file systems with built-in multi-version support. Successor systems like S4 (Strunk et al., 2000), CVFS (Soules et al., 2003), RepareStore (Zhu and Chiueh, 2003) and OceanStore (Kubiatowicz et al., 2000) improve the early design by maintaining all versions in full scope and upon each update operation. In databases, data versioning techniques are used for transactional data access. Postgres (Stonebraker and Rowe, 1986; Bernstein and Goodman, 1983), for example, achieved comparable performance to the database systems without versioning support. Fastrek (Chiueh and Pilania, 2005) enhanced Postgres with intrusion tolerance by maintaining an inter-transaction dependency graph based on the versioned data and relying on the graph to resolve data access conflicts. Multi-versioning is also used to provide snapshot isolation in database systems (Ports and Grittner, 2012; Berenson et al., 1995) although such systems usually do not store the full history versions. To directly access multi-versioned data, a number of multi-version data structures can be applied from the literature, such as multi-version B-tree (Lanka and Mays, 1991; Rodeh, 2008), temporal hashing (Kollios and Tsotras, 2002) and persistent data structures (R. Driscoll et al., 1989; Okasaki, 1999).

Immutable data are becoming common in emerging applications. For example, blockchains (10; 15; 27) maintain immutable ledgers, which keep all historical versions of the system status. Similarly, collaborative applications (6; 13) maintain the whole evolutionary history of datasets and derived analytic results, which enables provenance-related functionalities, such as tracking, branching, and rollback. A direct consequence of data immutability is that all stored data become inherently multi-versioned once they are amended. There exists a wide range of storage systems handling such data in either a linear manner, such as multi-version file systems (Santry et al., 1999; Strunk et al., 2000; Soules et al., 2003) and temporal databases (Ahn and Snodgrass, 1986; Salzberg and Tsotras, 1999; Tansel et al., 1993), or a non-linear manner, such as version control systems including git (11), svn (41) and mercurial (24), and collaborative management databases including Decibel (Maddox et al., 2016) and OrpheusDB (Huang et al., 2017). Git and Git-like systems are also used to manage the history and branches of datasets to achieve efficient query and space utilization (6; 37; 12).

2.2. Data-Level Deduplication

Deduplication approaches were invented to address the storage overhead of maintaining multi-versioned data. They are usually applied at the data level, meaning that they directly manipulate the raw datasets. For example, one of the most common approaches is record-level delta encoding. In this approach, the new version stores only the modified records, called the delta, against the previous version. For instance, in collaborative management databases like Decibel and OrpheusDB, the system first computes the delta between the current commit and the previous version, and then persists only this delta instead of the whole dataset to store a new version of the data. As a result, delta encoding manages data versions effectively when the deltas are small, despite the extra cost during data retrieval that is needed to reconstruct the specified version of the data. However, delta encoding is ineffective in removing duplicates among non-consecutive versions and different branches of the data. The specified version must always be reconstructed from all predecessor deltas, even if it is identical to a much older version of the data.
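
The reconstruction cost of delta encoding can be made concrete with a minimal sketch. The class `DeltaStore` below is a hypothetical illustration of record-level deltas over a linear history; it is not the storage format of Decibel or OrpheusDB.

```python
class DeltaStore:
    """Linear version history stored as record-level deltas (hypothetical helper)."""

    def __init__(self, base: dict):
        self.base = dict(base)   # version 0 is stored in full
        self.deltas = []         # deltas[i] turns version i into version i + 1

    def commit(self, new_version: dict) -> int:
        """Persist only the difference to the latest version; return the new version id."""
        prev = self.checkout(len(self.deltas))
        delta = {
            "put": {k: v for k, v in new_version.items() if k not in prev or prev[k] != v},
            "del": [k for k in prev if k not in new_version],
        }
        self.deltas.append(delta)
        return len(self.deltas)

    def checkout(self, version: int) -> dict:
        """Reconstruct a version by replaying every predecessor delta in order."""
        data = dict(self.base)
        for delta in self.deltas[:version]:
            data.update(delta["put"])
            for k in delta["del"]:
                data.pop(k, None)
        return data


store = DeltaStore({"a": 1, "b": 2})
store.commit({"a": 1, "b": 3})
store.commit({"a": 1, "b": 2})   # identical to version 0, yet stored as a new delta
```

In this sketch, version 2 has the same content as version 0, but the store still records a second delta and must replay both deltas to serve it, which is the weakness noted above for non-consecutive versions.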

To enable the removal of duplicates among arbitrary data versions, chunk-based deduplication can be applied. Unlike delta encoding, this approach works across independent objects. It is widely used in file systems (Paulo and Pereira, 2014; Xia et al., 2016), and is a core principle of git. In this approach, files are divided into chunks, each of which is given a unique identifier computed by an algorithm such as a collision-resistant hash, so that identical chunks can be detected. Chunk-based deduplication is highly effective in removing duplicates for large files that are rarely modified. In case an update leads to a change of all following chunks, i.e., the boundary-shifting problem (Eshghi and Tang, 2005), content-defined chunking (Muthitacharoen et al., 2001) can be leveraged to avoid expensive re-chunking.
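
A minimal sketch of content-defined chunking with a content-addressed chunk store follows. The window length, the boundary pattern, and the names `chunk` and `put_chunks` are assumptions made for illustration. For brevity the sketch recomputes a SHA-256 hash for every window instead of using a true rolling hash, and it omits minimum and maximum chunk sizes.

```python
import hashlib

WINDOW = 4    # sliding-window length in bytes (assumed value)
MASK = 0x0F   # boundary when the low 4 bits of the window hash are zero (assumed pattern)


def chunk(data: bytes) -> list[bytes]:
    """Cut data after every position whose trailing WINDOW bytes match the pattern."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1 : i + 1]
        if hashlib.sha256(window).digest()[-1] & MASK == 0:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):   # trailing bytes without a boundary form the last chunk
        chunks.append(data[start:])
    return chunks


def put_chunks(chunks: list[bytes], chunk_store: dict) -> list[str]:
    """Store each chunk under its SHA-256 identifier; identical chunks are kept once."""
    ids = []
    for c in chunks:
        cid = hashlib.sha256(c).hexdigest()
        chunk_store.setdefault(cid, c)
        ids.append(cid)
    return ids


data = bytes(range(256)) * 8
original = chunk(data)
shifted = chunk(b"X" + data)   # one byte inserted at the front
```

Because a boundary depends only on the bytes inside the window, inserting a byte at the front changes at most the region before the first original boundary; every later chunk of `original` reappears unchanged in `shifted` and is therefore stored only once.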

2.3. Tamper Evidence

Security-conscious applications demand data integrity against malicious modifications, not only from external attackers but also from malicious insiders. Examples include outsourced services like storage (Kallahalla et al., 2003) or file systems (Li et al., 2004), which often rely on cryptographic hash functions (e.g., SHA-256) and Merkle trees (Merkle, 1988) to achieve data integrity. However, without proper tamper-evidence support from storage, applications have to implement ad hoc solutions and maintain them at high cost. For example, emerging blockchain systems (27; 17; 10) adopt linked blocks and Merkle trees to guarantee the immutability of the ledger. To support both efficient access and tamper evidence for data stored on the ledger, current blockchain systems contain bespoke implementations of Merkle-tree-like data structures on top of a simple key-value store such as LevelDB (21) or RocksDB (33), which are neither general nor portable.
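
As a rough illustration of how a Merkle tree yields tamper evidence, the sketch below folds SHA-256 leaf hashes pairwise into a single root. Duplicating the last hash on odd-sized levels is one common convention and is an assumption here, as is the function name `merkle_root`.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold leaf hashes pairwise until a single root hash remains."""
    if not leaves:
        return h(b"")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:   # assumed convention: duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


records = [b"tx1", b"tx2", b"tx3"]
root = merkle_root(records)
tampered_root = merkle_root([b"tx1", b"tx2", b"txX"])
```

A verifier that keeps only `root` can detect the modified record, because any change to a leaf propagates to a different root hash.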

Emerging applications and systems have demonstrated the need for data immutability and tamper evidence. QLDB (2) builds a ledger-like immutable transaction log and uses cryptography for tamper evidence, so that an accurate history of application data can be maintained and verified. Azure Blockchain Service (3) is a ledger service that enables users to easily use blockchain services at scale to manage their data.

3. SIRI Indexes

Motivated by the aforementioned demands in emerging applications, a new family of indexes called Structurally Invariant and Reusable Indexes (SIRI) was proposed recently (Wang et al., 2018) to efficiently support tamper evidence and effective deduplication.

3.1. Background and Notations

As discussed previously, besides the basic lookup and update operations, the ultimate goal of SIRI is to provide native data-versioning and tamper-evidence features. Consequently, the data pages used by SIRI need deduplication capability because of the storage pressure introduced by the indexed multi-versioned data, while specific techniques, such as cryptographic hashing, should be applied to enable tamper evidence.

To better describe the SIRI candidates, we use the following notation in the rest of this paper. The indexed dataset is denoted as $D = \{D_0, D_1, \ldots, D_n\}$, where $D_i$ represents its $i$-th version. $\mathcal{I}$ is employed to represent SIRI structures, and $I$ stands for one of its instances. The key set stored in $I$ is denoted as $R(I) = \{r_1, r_2, \ldots, r_n\}$, where $r_i$ denotes the $i$-th key. $P(I) = \{p_1, p_2, \ldots, p_m\}$ stands for the internal page set of $I$, where $p_i$ represents the $i$-th page.

3.2. Formal Definition

We now provide the formal definition of SIRI, adapted from (Wang et al., 2018).

Definition 3.1.

An index class $\mathcal{I}$ belongs to SIRI if it meets the following criteria:

  1. Structurally Invariant. If $I$ and $I'$ are two instances of $\mathcal{I}$, then $R(I) = R(I') \iff P(I) = P(I')$.

  2. Recursively Identical. If $I$ and $I'$ are two instances of $\mathcal{I}$ and $R(I) = R(I') + r$, where $r \notin R(I')$, then $|P(I) \cap P(I')| \gg |P(I) - P(I')|$.

  3. Universally Reusable. For any instance $I$ of $\mathcal{I}$, there always exists a page $p \in P(I)$ and another instance $I'$ such that $|P(I')| > |P(I)|$ and $p \in P(I')$.

Definition 3.1 lists three features that a SIRI must possess. The first feature, Structurally Invariant, ensures that the order of update operations does not affect the internal structure of the index. The second feature, Recursively Identical, guarantees efficiency when constructing a large instance from small ones. The third feature, Universally Reusable, ensures that the pages of the index can be shared among different instances. This last feature has been adapted from the original definition, which contained a mis-claim: the shared page was quantified with “any” instead of “there exists”. In practice, it is impossible to guarantee that every page can be shared within a limited number of indexing operations.

Figure 2. Two B-trees containing same entries could have different internal structures.

3.3. Extended Discussion

Recursively Identical and Universally Reusable are both aimed at making pages shareable among various instances. However, they focus on different aspects. The former concentrates on performance when designing the indexes: updates do not degrade performance, since the cost is often dominated by accessing a vast number of shared pages. The latter secures the theoretical bound of SIRI’s performance. The higher the ratio of shared pages each instance gets, the better the deduplication that SIRI can achieve. In the limiting case, where the dataset and indexing operations are infinite, every page in a SIRI instance could find its copy used by other instances.

It is non-trivial to construct a SIRI from conventional structures. In all multi-way search trees, only a small part of the nodes is changed by an update operation, while the majority of the nodes can be reused in the new instance. Therefore, such structures are Recursively Identical. Further, a copy-on-write implementation naturally enables node sharing among versions and branches. Hence, multi-way search trees can be Universally Reusable when this technique is applied. However, conventional multi-way search trees are not Structurally Invariant. Taking the B-tree as an example, Figure 2 illustrates the absence of the first property through two insertion operations. As can be seen, the two B-trees contain identical sets of items but have different data and index pages. Meanwhile, hash tables are not Recursively Identical when they require periodic reconstruction, as the entire structure may be updated and none of the pages can be reused. Surprisingly, tries, or radix trees, can meet all three properties with a copy-on-write implementation. Firstly, they are Structurally Invariant, since the positions of the keys depend only on their own bits; consequently, the same set of keys always leads to the same set of paths, which are combined into the same tree nodes and the same tree structure. Secondly, being multi-way search trees, they are Recursively Identical. Lastly, the copy-on-write implementation makes them Universally Reusable. However, radix trees are unbalanced and may suffer from low performance in practice.
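
The claim that a copy-on-write trie is Structurally Invariant can be checked with a small sketch. The character trie below, its dictionary layout, and the names `insert`, `digest`, and `build` are assumptions for illustration; it is not MPT and omits path compression.

```python
import hashlib


def insert(node: dict, key: str, value: str) -> dict:
    """Copy-on-write insert into a character trie; returns the new root."""
    new = {"value": node.get("value"), "children": dict(node.get("children", {}))}
    if not key:
        new["value"] = value
    else:
        child = new["children"].get(key[0], {})
        new["children"][key[0]] = insert(child, key[1:], value)
    return new


def digest(node: dict) -> str:
    """Merkle-style hash covering the node's value and all child digests."""
    hsh = hashlib.sha256(repr(node.get("value")).encode())
    for ch in sorted(node.get("children", {})):
        hsh.update(ch.encode())
        hsh.update(digest(node["children"][ch]).encode())
    return hsh.hexdigest()


def build(items) -> dict:
    root = {}
    for k, v in items:
        root = insert(root, k, v)
    return root


items = [("car", "1"), ("cat", "2"), ("dog", "3")]
a = build(items)
b = build(reversed(items))      # same keys, opposite insertion order
v2 = insert(a, "dot", "4")      # new version; the untouched "c" subtree is shared
```

Here `a` and `b` end up with the same structure and root digest regardless of insertion order, and `v2` reuses the whole subtree under `"c"` from `a` by reference, which illustrates the page sharing that copy-on-write provides.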

Thanks to these three properties, the aforementioned data-level deduplication approaches can be applied seamlessly at the index level for SIRI. Identical pages from different index instances for multiple versions of the data can be shared, and therefore the system can persist only one copy to save space. Another benefit of index-level deduplication is that the system can access a version of the data directly through its index and data pages instead of going through a reconstruction phase from the deltas. Overall, the nature of SIRI enables effective detection and removal of duplicates without prohibitive effort, which conventional index structures can hardly offer.

Figure 3. Merkle Patricia Trie (MPT).

3.4. SIRI Representatives

In this section, we elaborate on three representatives of SIRI, namely MPT, MBT, and POS-Tree. As mentioned in Section 3.3, all representatives are Recursively Identical and Universally Reusable, since they are multi-way search trees with a copy-on-write implementation of their nodes. Meanwhile, they are Structurally Invariant, as explained in Sections 3.4.1, 3.4.2, and 3.4.3.

3.4.1. Merkle Patricia Trie

Merkle Patricia Trie (MPT) is a radix tree with cryptographic authentication, as shown in Figure 3. The key in the structure is a string representing the cryptographic hashes for integrity verification, while the value is a byte string. As in a traditional radix tree, the key is split into sequential characters, namely nibbles. There are four types of nodes in MPT: null, branch, leaf, and extension. The structures of those nodes are illustrated in Figure ???: (1) a null node includes an empty string, indicating that the node contains nothing; (2) a branch node consists of a 16-element array and a value, where each array element corresponds to one nibble and indexes the corresponding child node; (3) a leaf node contains a byte string called “encodedPath” and a value; (4) an extension node also contains an encodedPath and a pointer to the next node. As in a Merkle tree, the whole MPT can be rolled up to a single cryptographic hash for integrity verification. The most well-known usage of this data structure is in Ethereum (10), one of the largest blockchain systems in the world.

The radix tree, or prefix tree, is a popular indexing structure that organizes and retrieves data in its prefix order. Each path in a radix tree encodes a data key, where the node at level of a path represents the character in position of the corresponding key. Paths having the same parent share the same prefix encoded by ancestor nodes. The structure of a node at level can be represented as , where are possible characters at position of the data keys that share the same prefix until -th position. Each character represents a different path and is bound with a child node. In addition, the node stores the value of the data whose key has been fully represented until the parent node. To retrieve a record, it traverses the tree navigated by the characters of the target key. To insert a record, it traverses until the last node sharing the common prefix with previous data and inserts the new child nodes from the position.

The radix tree offers efficient prefix operations and invariant structures for the same content. However, the radix tree is not time- and space-efficient, especially when the tree is sparse, since it must store a node for each character even when there is no branch along the path. One optimization is path compression: nodes having only one child can be combined with the child. New node types are introduced for this purpose: a) a branch node contains characters for branching and an optional value field; b) an extension node contains the combined sequence of characters, and either the record value in case it reaches the end of the key or a link to the child branch node. Another optimization is to store binary-coded keys instead of characters. The direct advantage is that the branch node has a fixed number of children, and instead of scanning and comparing the characters, the traversal can be navigated using the encoded key as the position of the child. The structure of a branch node is , where is the position of the child whose current encoded byte equals . The structure of an extension node is . During data retrieval or insertion, the branch nodes behave similarly to those of the basic radix tree. However, for extension nodes, the whole sequence of encoded bytes is compared with the encoded target key at the same position during data retrieval. During data insertion, if a partial match is found, a branch node is created at the first divergent byte, and two child nodes are created for the original key and the inserted key.
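
The binary-coded navigation described above can be sketched as follows. This is a simplified illustration that uses only 16-slot branch nodes; it has no leaf or extension nodes, no path compression, and no hashing, and the names `to_nibbles`, `put`, and `get` are hypothetical.

```python
def to_nibbles(key: bytes) -> list[int]:
    """Split each byte into its high and low 4-bit halves."""
    out = []
    for b in key:
        out.extend((b >> 4, b & 0x0F))
    return out


def new_branch() -> dict:
    return {"children": [None] * 16, "value": None}


def put(node, nibbles: list[int], value):
    """Copy-on-write insert; the nibble value is used directly as the child slot."""
    if node is None:
        node = new_branch()
    else:
        node = {"children": list(node["children"]), "value": node["value"]}
    if not nibbles:
        node["value"] = value
    else:
        slot = nibbles[0]
        node["children"][slot] = put(node["children"][slot], nibbles[1:], value)
    return node


def get(node, nibbles: list[int]):
    """Follow one child slot per nibble; no character scanning or comparison is needed."""
    for n in nibbles:
        if node is None:
            return None
        node = node["children"][n]
    return None if node is None else node["value"]


root_v1 = put(None, to_nibbles(b"k1"), "v1")
root_v2 = put(root_v1, to_nibbles(b"k2"), "v2")   # root_v1 stays readable
```

Each level costs a single array access, which is the advantage of a fixed fan-out. Without extension nodes, however, every key of length L bytes produces a path of 2L nodes, which is the sparsity cost that path compression is meant to remove.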

3.4.2. Merkle Bucket Tree

Figure 4. Merkle Bucket Tree (MBT).

The Merkle Bucket Tree (MBT) is a Merkle tree built on top of a hash table, as shown in Figure 4. The bottom-most level of the MBT is a set of buckets, and the cardinality of the bucket set is called the capacity of the MBT. Data entries are hashed to those buckets, and the entries within each bucket are arranged in sorted order. To build a Merkle tree, the cryptographic hash is calculated for each bucket. The hash values of a number of buckets are collected together to form a new Merkle node at the penultimate level of the MBT. Apart from these two levels, the remaining levels of the MBT behave the same as a typical Merkle tree. The number of entries contained in each Merkle node is called the aggregation of the MBT. The structure of a bucket node can be represented as where is the records contained in the bucket as a byte sequence. The Merkle node can be represented as , where is the cryptographic hash of the i-th child, and m is the aggregation.

Lookup To perform an MBT index lookup, we first calculate the hash of the data key and obtain the index of the bucket in which the data resides, if it exists. Since the tree structure is static, we can calculate the path from the root to the bucket node using the index. Finally, the records in the bucket are searched using binary search to find the target key.

Insert, update and delete The insert, update and delete operations of MBT follow similar procedures. Each first performs a lookup of the target key to check whether the key exists. With copy-on-write, the insert operation creates a new bucket node containing the new key-value pair while preserving the original order. The update operation creates a new node with the updated value, while the delete operation excludes the target key and value from the new node. Finally, the cryptographic hash value of the newly created node is calculated and propagated recursively to the root node.
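
The lookup and copy-on-write update procedures above can be sketched as follows. The class `MBT` here is a simplified illustration, not the Hyperledger implementation: the default capacity and aggregation are assumed values, the aggregation must be at least 2, and `root` recomputes the whole Merkle tree rather than rehashing only the nodes on the update path.

```python
import bisect
import hashlib


def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MBT:
    def __init__(self, capacity: int = 8, aggregation: int = 2, buckets=None):
        self.capacity = capacity          # number of buckets (fixed before deployment)
        self.aggregation = aggregation    # child hashes per Merkle node, must be >= 2
        # Each bucket is a tuple of (key, value) pairs kept in key order.
        self.buckets = buckets if buckets is not None else tuple(() for _ in range(capacity))

    def _index(self, key: str) -> int:
        return int.from_bytes(sha(key.encode())[:8], "big") % self.capacity

    def get(self, key: str):
        bucket = self.buckets[self._index(key)]
        i = bisect.bisect_left(bucket, (key,))   # binary search inside the bucket
        if i < len(bucket) and bucket[i][0] == key:
            return bucket[i][1]
        return None

    def put(self, key: str, value: str) -> "MBT":
        """Copy-on-write upsert: returns a new MBT that shares all untouched buckets."""
        idx = self._index(key)
        entries = [e for e in self.buckets[idx] if e[0] != key]
        bisect.insort(entries, (key, value))
        buckets = self.buckets[:idx] + (tuple(entries),) + self.buckets[idx + 1:]
        return MBT(self.capacity, self.aggregation, buckets)

    def root(self) -> bytes:
        """Hash every bucket, then fold `aggregation` hashes at a time up to the root."""
        level = [sha(repr(b).encode()) for b in self.buckets]
        while len(level) > 1:
            m = self.aggregation
            level = [sha(b"".join(level[i:i + m])) for i in range(0, len(level), m)]
        return level[0]


a = MBT().put("x", "1").put("y", "2")
b = MBT().put("y", "2").put("x", "1")   # same content, different insertion order
c = a.put("x", "9")                     # new version; `a` is left untouched
```

In this sketch `a` and `b` have the same root hash because bucket placement and in-bucket order depend only on the keys, and `c` shares every bucket with `a` except the one holding `"x"`.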

The design of MBT combines the advantages of the Merkle tree and the hash table. On the one hand, MBT offers tamper evidence with a low update cost, since only the nodes lying on the lookup path need to be recalculated. On the other hand, the data entries can be evenly distributed owing to the hash buckets at the bottom level of the structure. Overall, capacity and aggregation are the key parameters affecting the performance of MBT, and they are determined before deployment. From the query efficiency perspective, aggregation and capacity jointly determine the trade-off among the number of nodes that need to be traversed, the number of entries that need to be scanned within a Merkle node, and the average number of records that need to be scanned or calculated per bucket. From the space consumption perspective, a higher capacity and a lower aggregation result in more nodes in the MBT but less additional space required for updates, and vice versa.

3.4.3. Pattern-Oriented-Split Tree

Pattern-Oriented-Split Tree (POS-Tree) is a probabilistically balanced search tree proposed in (Wang et al., 2018). The structure is in fact a customized Merkle tree built upon pattern-aware partitions of the dataset, as shown in Figure 5. The bottom-most layer of POS-Tree is a sequence of data records that are globally sorted on the keys. This layer is divided into the leaf nodes of POS-Tree using the content-defined chunking approach. Starting from the first byte of the data layer, the hash value of the byte sequence within a fixed-length sliding window is calculated using an incremental hash function. The hash value is compared with a predefined pattern. If there is a match, a node is created at the current element boundary, and the sliding window continues from the first byte of the next element. If there is no match, the sliding window shifts by one byte to calculate the next hash value. The procedure continues until it reaches the end of the data layer. Each internal layer of POS-Tree is a sequence of split-key and hash pairs, where the hash values are calculated from child nodes using cryptographic hash functions. As in the data layer, we detect the pattern within a fixed-length sliding window, starting from the first byte and ending at the last byte. Unlike in the data layer, the hash values are directly compared with the pattern instead of redundantly calculating a new hash using the incremental hash function. This design improves the performance of POS-Tree by reducing the number of hash calculations, while preserving the randomness of chunking. The last node created in each layer may not contain a pattern. Such pattern-aware partitioning enhances the deduplication capabilities, and the node structure enables efficient indexing by comparing the split keys to navigate the paths as in a B-tree.
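
The effect of pattern-aware partitioning on the leaf layer can be illustrated with a deliberately simplified sketch. Instead of sliding a byte window with an incremental hash, it tests one SHA-256 hash per record against the pattern; the pattern width and the names `is_boundary` and `leaf_nodes` are assumptions, and internal layers are omitted.

```python
import hashlib

PATTERN_BITS = 2   # assumed: a boundary occurs when the low 2 hash bits are zero


def is_boundary(key: str, value: str) -> bool:
    """Simplification: test a per-record hash instead of a sliding byte window."""
    last_byte = hashlib.sha256(f"{key}={value}".encode()).digest()[-1]
    return last_byte & ((1 << PATTERN_BITS) - 1) == 0


def leaf_nodes(records: dict) -> list[tuple]:
    """Partition key-sorted records into leaf nodes that end at pattern matches."""
    nodes, current = [], []
    for key in sorted(records):
        current.append((key, records[key]))
        if is_boundary(key, records[key]):
            nodes.append(tuple(current))
            current = []
    if current:   # the last node may not end with a pattern
        nodes.append(tuple(current))
    return nodes


base = {f"k{i:03d}": str(i) for i in range(200)}
reordered = dict(reversed(list(base.items())))   # same records, different insertion order
updated = dict(base)
updated["k100"] = "changed"

old_nodes = set(leaf_nodes(base))
new_nodes = set(leaf_nodes(updated))
```

Since node boundaries depend only on record content, `reordered` yields exactly the same leaf nodes as `base`, and the single update in `updated` produces at most two nodes that are not already present in `old_nodes`; all other leaf nodes can be shared between the two versions.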

Since the hash calculation is expensive, optimizations are made to mitigate the overhead. One optimization is to batch the update and insert operations. The positions of inserted keys are stored temporarily. When the commit function is called, it runs the pattern detection and the hash calculation from the bottom-most layer to the root in one run. The batching approach drastically reduces the POS-Tree building time with proper batch size. Another optimization is to create a node when the size exceeds a certain threshold. This is to prevent the occurrence of huge chunks, which may lead to significant I/O overhead scanning the node.

The nodes of POS-Tree are immutable, i.e., a copy of a node is created before an update is applied. Moreover, POS-Tree adopts shadowing among instances, i.e., common nodes can be reused instead of creating redundant nodes separately. Meanwhile, POS-Tree supports efficient diff and merge by adopting a three-way merge process, which relies on the structurally invariant property. Note that all the features above can also be applied to MPT and MBT.

Figure 5. Pattern-Oriented-Splitting Tree (POS-Tree) resembling a B-tree and Merkle tree.

4. Theoretical Analysis

In this section, we provide a comprehensive theoretical analysis of the three SIRI representatives discussed previously. We first calculate the theoretical bounds for common index operations like lookup and update, as well as complex operations needed by emerging applications including diff and merge. In addition, we define the deduplication ratio as a metric for measuring the efficiency of deduplication provided by SIRIs.

In the following subsections, denotes the maximum capacity of distinct keys in an index; denotes the maximum length of a key string; denotes the number of buckets in MBT. denotes the expected number of entries per page/node in MBT or POS-Tree. denotes the different records between two instances. denotes the average storage size for a sole record.

4.1. Operation Bounds

In this section, the bounds of common operations are calculated accordingly.

4.1.1. Index Lookup

We first evaluate the lookup complexity of the three candidates.

  • MPT – The traversal upon MPT is the same as a normal radix tree with path compaction. The computing bound for lookup is the maximum between and . Since is often larger than in the real systems ( equals to 64-byte in Ethereum’s zeroith variable), the lookup complexity in MPT is in most of the time.

  • MBT – Unlike other structures having constant leaf node scanning time, the size of MBT’s leaf node is and it therefore costs with binary search to scan the node. As a result, the total lookup cost, consisting of node traversing part and leaf node scanning part, is .

  • POS-Tree – The complexity in our implementation is , similar as a radix tree.

4.1.2. Index Update

In all candidates, an update operation firstly incurs a lookup for the updating keys, and then leads to the creation of the copy of affected nodes and the calculation of their hash values. In our analysis, we treat the number of entries in each node , the length of each record and the length of hashed value as constant values. That is to say, the size of internal nodes , and the size of leaf nodes are constant unless explicitly stated. Therefore, the cost of new node creation and the crypto-hash function is constant, and the cost of update mainly depends on the cost of the lookup.

The calculation and comparison among the three candidates are listed below:

  • MPT. In most cases, the complexity of the update in MPT is .

  • MBT – For leaf nodes, as the size increases linearly with , the complexity of hash function and node copying is . Additionally, there is a hash function to map keys to bucket nodes, which has a cost of O(L). Hence, the complexity of update in MBT is .

  • POS-Tree – Similar as crypto-hash function, the cost of the rolling hash function p for detecting node boundary is also constant. This results in the update complexity of .
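The update pattern shared by all three candidates (look up the key, copy the nodes on the path, recompute their hashes) can be sketched on a deliberately simple structure. The unbalanced binary search tree below is a hypothetical stand-in, not MPT, MBT or POS-Tree; the names `Node`, `make_node` and `put` are illustrative.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: str
    value: str
    left: Optional["Node"]
    right: Optional["Node"]
    digest: str  # hash over the entry and the digests of both children


def make_node(key, value, left, right) -> Node:
    """Create an immutable node and compute its Merkle-style digest."""
    h = hashlib.sha256()
    for part in (key, value,
                 left.digest if left else "",
                 right.digest if right else ""):
        h.update(part.encode())
        h.update(b"\x00")
    return Node(key, value, left, right, h.hexdigest())


def put(root: Optional[Node], key: str, value: str) -> Node:
    """Return a new root; only nodes on the lookup path are copied."""
    if root is None:
        return make_node(key, value, None, None)
    if key < root.key:
        return make_node(root.key, root.value,
                         put(root.left, key, value), root.right)
    if key > root.key:
        return make_node(root.key, root.value,
                         root.left, put(root.right, key, value))
    return make_node(key, value, root.left, root.right)
```

With per-node work treated as constant, the number of copied and re-hashed nodes equals the length of the lookup path, which is why the update cost follows the lookup cost.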

4.1.3. Indexes Diff

Diff is the operation that compares two index instances. It returns all records that are either present in only one index or present in both with different values. In a naive implementation of the three candidates, Diff can therefore be seen as multiple lookups. The following bounds are calculated under this assumption; since the derivation is straightforward, we give the results directly.

  • MPT or . As discussed previously, in most cases the complexity is the former.

  • MBT

  • POS-Tree –
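The naive Diff assumed above, one lookup per key in each instance, can be sketched as follows. Plain dictionaries stand in for the indexes, and `naive_diff` is a hypothetical name. The sketch also assumes that `None` is never stored as a value, since it is used to mark absence.

```python
def naive_diff(index_a: dict, index_b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every differing key.

    None marks a key that is absent from one side.
    """
    result = {}
    for key in index_a.keys() | index_b.keys():
        value_a = index_a.get(key)  # one lookup in the first instance
        value_b = index_b.get(key)  # one lookup in the second instance
        if value_a != value_b:
            result[key] = (value_a, value_b)
    return result
```

For example, `naive_diff({"a": 1, "b": 2}, {"b": 3, "c": 4})` reports `a` and `c` as present on one side only and `b` as changed.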

4.1.4. Indexes Merge

Merge is the operation that combines all records from both indexes. The process consists of two steps. The first step performs a Diff between the instance to merge and the original instance and marks the differing pages/nodes. The second step merges all the differing nodes into the original instance. If conflicts exist, namely a key present in both instances with different values, the process must be interrupted and the end user must supply a selection strategy before it can continue. The following calculation is based on the worst case in which the merge process finishes without interruption. Since the second step of the merge process is treated as operations in our analysis, the complexity of merge is dominated by the “diff” operation in the first step.

  • MPT or . In most cases, the complexity should be .

  • MBT

  • POS-Tree –
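The two-step merge with conflict interruption can be sketched as below. Dictionaries again stand in for the indexes; `MergeConflict`, `merge` and the `resolve` callback are hypothetical names that model the user-supplied selection strategy.

```python
class MergeConflict(Exception):
    """Raised when a key has different values and no strategy is given."""


def merge(original: dict, other: dict, resolve=None) -> dict:
    """Merge `other` into a copy of `original`.

    resolve(key, original_value, other_value) models the selection
    strategy that the end user supplies when a conflict occurs.
    """
    merged = dict(original)
    for key, value in other.items():
        if key in original and original[key] != value:
            if resolve is None:
                raise MergeConflict(key)  # interrupt the merge
            merged[key] = resolve(key, original[key], value)
        else:
            merged[key] = value
    return merged
```

Calling `merge({"a": 1}, {"a": 2})` raises `MergeConflict`, while passing `resolve=lambda k, old, new: new` lets the merge finish by keeping the incoming value.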

4.2. Deduplication Ratio

Persistent (or immutable) data structures demand a large amount of space for maintaining all historical versions of data. To alleviate space consumption pressure, the feasibility of detecting and removing duplicated data portions plays a critical role. In this section, we aim to quantify the effectiveness of such properties in indexes by defining a measurement called deduplication ratio.

4.2.1. Definition

Suppose there is a set of index instances , and each is composed of a set of pages . The byte size of a page is denoted as , we can derive byte count of set as:

The deduplication ratio of is defined as follows:


The quantifies the effectiveness of page-level data deduplication (i.e., sharing) among related indexes. It is the ratio between the overall bytes that can be shared between different page sets and the total bytes used for all the page sets. With a high , the storage is capable of managing massive amount of “immutable” data versions without bearing space consumption pressure. In the following subsections, we will use this metric to evaluate the three candidates accordingly.
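Following the prose description above (bytes shared between page sets over total bytes used by all page sets), one way to compute the ratio is one minus the byte size of the union of all page sets divided by the sum of the byte sizes of the individual page sets. This exact form is an assumption chosen to match the description, and `dedup_ratio` is a hypothetical helper.

```python
def dedup_ratio(instances: list) -> float:
    """Compute 1 - bytes(union of pages) / sum of bytes per instance.

    Each instance is a set of pages, and each page is a bytes object.
    Pages with identical content are stored only once.
    """
    total = sum(len(page) for pages in instances for page in pages)
    if total == 0:
        return 0.0
    unique_pages = set().union(*instances)
    stored = sum(len(page) for page in unique_pages)
    return 1.0 - stored / total


v1 = {b"aaaa", b"bbbb"}
v2 = {b"aaaa", b"cccc"}  # shares one 4-byte page with v1
ratio = dedup_ratio([v1, v2])
```

In this example 16 bytes are referenced in total but only 12 are stored, so a quarter of the referenced bytes are saved by sharing.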

4.2.2. Continuous Differential Analysis

In this part, we analyze a simple case that consists sequentially evolved indexes, i.e., the instance is derived from the instance. Each instance can be represented as a page set or a record set . The analysis of more complicated scenarios is treated as our future work. To ease our analysis, we assume that each instance differs its predecessor by ratio of a continuous key range , such that:

where denotes the record count in set , and denotes the minimum/maximum key in a set.

In the following analysis, we consider two scenarios:

  • Insertion of new records.

  • Update of existing records.

Merkle Bucket Tree. Since in MBT, the bucket size depends on the number of contained records, i.e.,

We denote the number of affected nodes on level in MBT as . Specially, the number of buckets (the leaf level) affected by differential is expressed as:

We can roughly summarize the total number of affected tree nodes, , as following:

The number of affected nodes in the continuous update can be calculated as:

Thus, the deduplication ratio for MBT is:

Surprisingly, the deduplication ratio is highly related to and has no direct connection with according to the analysis result.

Merkle Patricia Trie. In a simple MPT without any path compaction optimization, we have:

which indicates that


Inferred from the result, is affected by the distribution of stored keys since the length of the average length of the keys highly relate to the final deduplication ratio. In detail, the relationship between and determines whether the deduplication ratio of MPT is greater than or less than that of MBT.

POS-Tree. Similar to MBT, the calculation is as follows:

If we compare the analysis results of the three representatives, a conclusion can be made that MPT has the best deduplication ratio given the same in continuous differential scenario given proper query workloads and datasets (meaning ). Meanwhile, POS-Tree and MPT have equal bound for the deduplication ratio in this setting.

5. Experimental Benchmarking

Table 1. Parameter table for experiments

Parameter               Value
Dataset size (×10^4)    1, 2, 4, 8, 16, 32, 64, 128, 256
Batch size (×10^3)      1, 2, 4, 8, 16
Overlap ratio (%)       0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Write ratio (%)         0, 50, 100
Zipfian parameter       0, 0.5, 0.9

In this section, we evaluate three SIRI representatives, namely POS-Tree, MBT and MPT, through different experiments. First, the throughput and the latency of the indexes are measured to have an overview of how these structures perform in general cases. Second, the storage consumption, the deduplication ratio and the share ratio are evaluated to investigate the space efficiency among the candidates. Third, a breakdown analysis is given to show how each SIRI property affects the performance of the index. Finally, we integrate the structures in an existing database management system, Forkbase (Wang et al., 2018), to show how SIRI structures behave in real applications.

Our experiments are conducted on a single-node server running Ubuntu 14.04, equipped with an Intel Xeon E5-1650 processor (3.5GHz) and 32GB RAM. To fairly compare the efficiency of the index structures in terms of page quantity and size, we tune the size of each index node to be approximately 1 KB.

5.1. Dataset

We use a synthesized YCSB dataset and two real-world datasets, Wikipedia data dump and Ethereum transaction data, to conduct a thorough evaluation of SIRI.

5.1.1. YCSB

We generate the key-value dataset using YCSB according to the parameters shown in Table 1. The lengths of the keys range from 5 bytes to 15 bytes, while the values have an average length of 256 bytes. The total number of records varies from 10,000 to 2,560,000. The dataset contains three types of workloads: read, write, and a mixed workload with 50% write operations. We use a Zipfian distribution to simulate scenarios where the data in the workload is skewed to different degrees. A Zipfian parameter of 0 means that all records are equally likely to be selected into the workload, while a higher value means that a smaller range of records is selected with very high probability. We also generate overlapped workloads to test the capability of deduplication with increasing similarity in the contents, as described in Section 5.4.2.
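The effect of the Zipfian parameter on key selection can be illustrated with a small sampler. This is a simplified stand-in and not YCSB's own generator: it assumes the selection probability of the record at rank i is proportional to 1 / i^theta, and the function names are hypothetical.

```python
import random


def zipfian_weights(n: int, theta: float) -> list:
    """Unnormalised weights 1 / rank**theta for ranks 1..n."""
    return [1.0 / (rank ** theta) for rank in range(1, n + 1)]


def sample_keys(keys: list, theta: float, count: int, seed: int = 0) -> list:
    """Draw `count` keys; theta = 0 gives a uniform workload."""
    rng = random.Random(seed)
    weights = zipfian_weights(len(keys), theta)
    return rng.choices(keys, weights=weights, k=count)
```

With `theta = 0` every weight equals 1, so all records are equally likely; as `theta` grows, the first few ranks receive an increasing share of the operations.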

5.1.2. Wiki

The wiki dataset consists of real-world Wikipedia data dumps of extracted page abstracts. The key is the URL of the Wikipedia page, whose length ranges from 31 bytes to 298 bytes with an average of 50 bytes. The value is the extracted abstract in plain text format, whose length ranges from 1 byte to 1036 bytes with an average of 96 bytes. We collect 6 data dumps covering the data changes in three months and divide the data into 300 versions. Each version has an average size of 855MB. We generate the read and write workloads using keys uniformly selected from the dataset to test the throughput.

5.1.3. Ethereum Transactions

We use real-world Ethereum transaction data from Block 8900000 to 9200000, where the key is the 64-bytes block hash and the value is the RLP (Recursive Length Prefix) encoded raw transaction data. The length of the raw transaction ranges from 100 bytes to 57738 bytes with an average of 532 bytes. RLP is the main encoding method used to serialize objects in Ethereum, which is also used to encode raw transactions. In Ethereum, each block naturally makes a new version.

Figure 6. Throughput on YCSB. Panels: (a)-(c) Zipfian parameter 0 with write ratio 0, 0.5, 1; (d)-(f) Zipfian parameter 0.5 with write ratio 0, 0.5, 1; (g)-(i) Zipfian parameter 0.9 with write ratio 0, 0.5, 1.
Figure 7. Throughput on real-world dataset. Panels: (a) Wiki, (b) Ethereum Transaction.
Figure 8. Storage on real-world dataset. Panels: (a) Wiki, (b) Ethereum Transaction.

5.2. Implementation

In this section, we briefly describe the implementation of the selected indexes and the baseline. For MPT, we directly port Ethereum’s implementation (10) to our experiment environment, which adopts the path compaction optimization to improve space and time efficiency. The implementation of MBT is based on the source code provided in Hyperledger Fabric 0.6 (15). We further make it immutable and add index lookup logic, which is missing in the original implementation. For POS-Tree, we use the implementation in Forkbase (Wang et al., 2018) with the rolling hash optimization mentioned in Section 3.4.3. Moreover, we apply batching techniques, taking advantage of the bottom-up build order, to significantly reduce the number of tree traversals and hash calculations. Lastly, to compare SIRI and non-SIRI structures, we implement an immutable B+ tree with tamper evidence support, called Multi-Version Merkle B+ tree (MVMB+-Tree), as the baseline. We replace the pointers stored in index nodes with the hashes of their immediate children and maintain an additional table from each hash to the actual address. For all structures, we adopt node-level copy-on-write to achieve data immutability.
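The combination of hash-based child references and node-level copy-on-write can be sketched with a small content-addressed node table. This is an illustration only: the `NodeStore` class, the JSON serialisation, and the simplification of mapping each hash directly to the serialised node (rather than to an address) are assumptions, not details of the implementations above.

```python
import hashlib
import json


class NodeStore:
    """Maps the hash of a node to its serialised content."""

    def __init__(self):
        self.table = {}

    def put(self, node: dict) -> str:
        data = json.dumps(node, sort_keys=True).encode()
        digest = hashlib.sha256(data).hexdigest()
        self.table.setdefault(digest, data)  # identical nodes stored once
        return digest

    def get(self, digest: str) -> dict:
        return json.loads(self.table[digest])


store = NodeStore()
leaf_a = store.put({"entries": [["k1", "v1"], ["k2", "v2"]]})
leaf_b = store.put({"entries": [["k3", "v3"]]})
root_v1 = store.put({"children": [leaf_a, leaf_b]})

# Copy-on-write update of k3: only the touched leaf and the root are new.
leaf_b2 = store.put({"entries": [["k3", "v3-new"]]})
root_v2 = store.put({"children": [leaf_a, leaf_b2]})
```

Both roots remain readable, and the untouched leaf is referenced by both versions through the same hash.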

5.3. Throughput and Latency

We evaluate the three candidates and the baseline from a traditional view in this part, where throughput and latency are the major measurements.

5.3.1. Throughput

First, we evaluate the throughput using the YCSB dataset. We run the read, the write and the mixed workloads under diverse data sizes and skewness. The results are illustrated in Figure 6. The throughput of all indexes decreases as the total number of records grows. For POS-Tree, MPT and MVMB+-Tree, this is caused by traversing a larger tree, whose height grows logarithmically. For MBT, the reduction is caused by loading and scanning larger leaf nodes, whose size grows linearly. The result complies with the operation bounds formulated in Section 4.1 and can be further verified in other experimental results within this section. In Figures 6(a), 6(d) and 6(g), MBT outperforms the other three indexes in terms of read operations when the number of records does not exceed 640,000. This is natural since MBT has a fixed and smaller tree structure in our experiment setting. When the number of records grows beyond 640,000, the operations to load and scan large leaf nodes dominate the processing time. As shown in Figure 10, the time to traverse the tree and load the nodes stays constant, while the time to scan the leaf node keeps increasing. Consequently, the performance of MBT drops quickly and becomes the worst among all structures.

Figure 9. Latency distribution. Panels: (a) Read Balanced, (b) Read Skewed, (c) Write Balanced, (d) Write Skewed.
Figure 10. MBT cumulative latency.
Figure 11. Single dataset scenario. Panels: (a) Storage, (b) Number of chunks.

By comparing Figure 6 horizontally, we can observe that the throughput of all data structures decreases drastically as the ratio of write operations increases. This is due to the cost of node creation, memory copy and cryptographic function computation. On the largest dataset, the absolute throughput drops by over 9x for POS-Tree and the baseline, and by 24x for MBT. By comparing Figure 6 vertically, we can observe that there is no change in throughput for any index structure when the Zipfian parameter changes from 0 to 0.9. Therefore, we can conclude that they are all resistant to data skewness.

It is also worth noting that, compared with MBT, the other three structures perform much more steadily for both read and write workloads. Meanwhile, POS-Tree outperforms MPT in all cases and has performance comparable to our baseline index.

Next, we run the experiment on the Wiki dataset. The system first loads the entire dataset batched in 300 versions, and then executes read and write workloads whose keys are uniformly selected. Figure 7(a) shows the results, which align with those of the YCSB experiment.

Lastly, for experiments on Ethereum transaction data, we simulate the way a blockchain stores transactions. For each block, we build an index on the transaction hash for all transactions within that block and store the root hash of the tree in a global linked list. Versions are naturally created at the granularity of a block. For write operations, the system appends the new block of transactions to the global linked list. For lookup operations, it scans the linked list for the block containing the transaction and traverses the index to obtain the value. Figure 7(b) shows the results. We can observe that POS-Tree surprisingly outperforms the other indexes in write throughput. This is because we are building an index for each block instead of a global index: instead of insert/update operations, we perform batch loading from scratch. In this case, POS-Tree’s bottom-up building process is superior to the top-down building process of MPT and MVMB+-Tree, as it only traverses the tree and creates each node once. Another difference is that the read throughput is lower than the write throughput, because of the additional block scanning time.
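The per-block layout of this experiment can be sketched as follows. A Python list stands in for the global linked list, a dictionary stands in for each per-block index, and SHA-256 replaces Ethereum's actual transaction hashing; the class and method names are hypothetical.

```python
import hashlib


def tx_hash(raw_tx: bytes) -> str:
    """Stand-in transaction hash (SHA-256, an assumption for this sketch)."""
    return hashlib.sha256(raw_tx).hexdigest()


class Chain:
    def __init__(self):
        self.blocks = []  # one (root, index) pair per block, in order

    def append_block(self, raw_txs: list) -> str:
        """Write path: batch-build a fresh index for the new block."""
        index = {tx_hash(tx): tx for tx in raw_txs}
        root = hashlib.sha256("".join(sorted(index)).encode()).hexdigest()
        self.blocks.append((root, index))
        return root

    def lookup(self, digest: str):
        """Read path: scan the blocks, then search the matching index."""
        for _root, index in self.blocks:
            if digest in index:
                return index[digest]
        return None
```

The linear scan over blocks in `lookup` is the extra cost that makes reads slower than writes in this setting.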

5.3.2. Latency and path length

In this experiment, we measure the latency of each read and write operation and calculate the distribution with balanced and skewed data. For the YCSB dataset, read-only and write-only workloads are fed into the indexes with balanced (Zipfian parameter 0) and highly skewed (Zipfian parameter 0.9) distributions. The dataset used in this test contains 160,000 keys. We run 10,000 operations and plot the latency distribution; the results are displayed in Figure 9. The x-axis is the latency range and the y-axis is the number of records that fall in that latency range. The general ranking among the indexes coincides with the previous conclusion: POS-Tree performs well for both read and write workloads while MPT performs the worst. Meanwhile, MPT has several peak points representing operations accessing data stored at different levels of the tree. MBT experiences the most dramatic change between read and write workloads. It outperforms all the candidates in the read workloads but is worse than POS-Tree in the write workload.

To take a closer observation of how the workloads affect our candidates, we further gather the traversed tree height of each operation for read-only workload with the largest data set following the balanced distribution. The results are shown in Figure 18, where the x-axis represents the height of the lookup path and the y-axis indicates the number of operations. We omit the evaluation result for the skewed distribution here as it is similar to the balanced situation. Most operations have to visit 4-level nodes to reach the bottom-most level of POS-Tree whilst 5- or 7-level nodes are frequently traversed for MPT. The efficiency in MBT is also certified in the figure since all requests only need 3 levels to reach the bottom in the structure in both balanced and skewed scenarios.

Figure 12. Latency on Wiki data. Panels: (a) Read, (b) Write.
Figure 13. Latency on Ethereum transaction data. Panels: (a) Read, (b) Write.

We obtain similar results and conclusions on the Wiki dataset, as shown in Figure 12. In contrast, the experiment on Ethereum transaction data shows different results, depicted in Figure 13. All structures have similar read latency due to the dominant block scanning process, which is not necessary for the other two experiments. In addition, the write latency shows a larger deviation because the execution time depends on the number of transactions in each block, which is not fixed.

5.4. Storage

In this section, we evaluate the space consumption of the index structures under different scenarios.

5.4.1. Simple Case

We first demonstrate a simple case, where a single dataset is maintained by the same group of users. There is no sharing of data or collaborative editing in this setting. Therefore, the deduplication benefit of SIRI is limited. The results on the YCSB dataset are depicted in Figure 11(a). In the experiment, the dataset size is again varied to see the trend of space consumption. Two main factors affect the space efficiency: the size of the node and the height of the tree. On the one hand, a larger tree height results in more node creations for write operations, which increases the space consumption. As an example, MPT performs badly since it has the largest tree height in our experiment setting. Its storage consumption is up to 1.6x higher than that of the baseline and up to 1.4x higher than that of POS-Tree. On the other hand, a large node size means that even minor changes to a node result in creating a new substantial node, which leads to larger space consumption. As can be seen in Figure 11(a), MBT performs the worst due to having the largest node size in the implementation. It consumes up to 6.4x the space used by the baseline. Compared to the baseline, POS-Tree also has a larger node size variance due to content-defined chunking, leading to more large nodes. Therefore, it consumes more space than the baseline, as reflected in the figure.

Figure 14. 10 datasets scenario with respect to overlap ratio. Panels: (a) Storage, (b) Number of chunks, (c) Deduplication ratio, (d) Chunk sharing ratio.
Figure 15. 10 datasets scenario with respect to batch size. Panels: (a) Storage, (b) Number of chunks, (c) Deduplication ratio, (d) Chunk sharing ratio.
Figure 16. How Structurally Invariant affects POS-Tree deduplication. Panels: (a) Deduplication ratio, (b) Chunk sharing ratio.
Figure 17. How Recursively Identical affects POS-Tree deduplication. Panels: (a) Deduplication ratio, (b) Chunk sharing ratio.

To better analyze how the memory space is used by different pages, we further count the number of chunks/pages for all chosen indexes. The results are shown in Figure 11(b) for varying dataset sizes. Typically, they follow similar trends as Figure 11(a), except that MBT generates the fewest nodes. The reason is rooted in the nature of MBT, which has a fixed total number of nodes and a leaf node size that increases as more records are inserted. Therefore, the number of nodes created stays constant when updating or inserting, no matter how large the total number of records is. In contrast, the other structures have a fixed node size and an increasing number of nodes, so the number of nodes created, as well as the height of the tree, increases during updating or inserting.

The results for the Wiki dataset and the Ethereum transaction dataset are displayed in Figure 8(a) and Figure 8(b). Similar to the results of the YCSB experiment, MBT and MPT consume more space than POS-Tree and MVMB+-Tree. A difference is that the storage consumption of MPT increases very fast as more versions are loaded. This is because the key length of the Wiki dataset is much larger than that of YCSB, and the encoding method used by Ethereum further doubles the key length. This makes MPT a very sparse tree. For every insert/update operation, more nodes need to be re-hashed and created. Hence, the space efficiency is worse than in the YCSB experiment.

5.4.2. Collaborative Case

Next, we compare the storage consumption in the scenario where different groups of users collaborate on the same dataset. This often occurs in data cleansing and data analysis procedures, where different parties work on different parts of the same dataset. One significant phenomenon in this case is that duplicates can be frequently found. Therefore, the deduplication advantage naturally introduced by SIRI is critical to improving the space efficiency. We use the YCSB dataset to demonstrate the scenario. In the experiment, we simulate 10 groups of users, each of which initializes the same dataset of 40,000 records. We generate workloads of 160,000 records with overlap ratios ranging from 10% to 100% and feed them into the candidates. For example, a 10% overlap ratio means that 10% of the records have the same key and value. The execution is processed with the default batch size, i.e., 4,000 records. To make the experiment closer to the practical situation, each batch of the workload is randomly generated here instead of representing a continuous range of keys as theoretically analyzed in Section 4.2.2.

The changes of the deduplication ratio and the chunk sharing ratio are shown in Figure 14(c) and Figure 14(d), respectively. Both metrics become higher for all structures as the workload overlap ratio increases, since more duplicate nodes can be found when there are more similarities among the datasets. Benefiting from a smaller node size and a smaller portion of updated nodes, MPT achieves the highest deduplication ratio (up to 0.96) and chunk sharing ratio (up to 0.7). POS-Tree achieves a slightly better deduplication ratio than the baseline, though both have similar node sizes and tree heights. The actual ratios are 0.88 and 0.86, respectively. However, POS-Tree achieves a much better chunk sharing ratio than the baseline (0.48 vs. 0.27) because of its content-addressable strategy when chunking the data. By contrast, MBT’s fixed number of pages and growing leaf nodes limit the number of duplicates, and therefore it does not perform as well as the other two SIRI representatives. To be more precise, we further collect the storage usage and the number of pages created by each candidate and illustrate the results in Figure 14(a) and Figure 14(b). The trends in the figures match the corresponding deduplication ratio and chunk sharing ratio. As the overlap ratio increases, the storage reduction of POS-Tree and MPT is more pronounced than that of the baseline, and POS-Tree is the most space-efficient among them. MPT is the most sensitive to overlap ratio changes due to the high chunk sharing ratio introduced by its structural design. Though it consumes more space for non-overlapping datasets, MPT outperforms the baseline approach when the overlap ratio is above 90%.

We also evaluate the effect of query batch size on storage space. As in the previous experiments, we simulate 10 parties. Each of them initializes the same dataset containing 40,000 records and executes workloads containing 160,000 keys with the default overlap ratio of 50%. Figure 15(c) depicts how the deduplication ratio decreases as the query batch size increases. The reason is that larger batch sizes cause a larger portion of the index structure to be updated, resulting in fewer nodes being reused between versions. Figure 15(a) and Figure 15(b) show the storage usage and the number of chunks created with different batch sizes. Similar relationships across the structures can be observed as in Figure 14(a) and Figure 14(b), except that both metrics decrease with larger batch sizes because fewer versions are stored in the index.

5.4.3. Structure parameters

As mentioned in Section 4, the parameters of the indexes can affect the deduplication ratio. This experiment verifies the impact of the key parameters, namely chunk size for POS-Tree, number of buckets for MBT and mean key length for MPT. For POS-Tree, the boundary pattern is varied to change the chunk size from 512 to 4,096 bytes probabilistically. For MBT, the number of fixed buckets is set from 4,000 to 10,000. For MPT, the dataset is generated with different minimum key lengths, which leads to mean key lengths from 10.2 to 13.7. (The maximum key length in the dataset is fixed.) The results are shown in Table 2 and coincide with the conclusions in Section 4. The deduplication ratio of POS-Tree decreases as the average chunk size increases. This is expected, as identical large nodes are less common than identical small nodes, leading to fewer occurrences of duplicate pages. Similarly, the deduplication ratio of MBT increases as the number of buckets increases, because a larger number of buckets results in smaller leaf nodes. The deduplication ratio of MPT increases as the mean key length increases. This is because longer keys usually have more conflicting bits and result in a wider tree. Therefore, the portion of reusable nodes increases.

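Figure 18. Tree height

Table 2. Deduplication ratio with respect to parameters

Structure   Parameter            Value    Deduplication ratio
POS-Tree    Chunk size (bytes)   512      0.722
POS-Tree    Chunk size (bytes)   1024     0.6485
POS-Tree    Chunk size (bytes)   2048     0.5391
POS-Tree    Chunk size (bytes)   4096     0.4108
MBT         Number of buckets    4000     0.3301
MBT         Number of buckets    6000     0.4599
MBT         Number of buckets    8000     0.5433
MBT         Number of buckets    10000    0.6003
MPT         Mean key length      10.2     0.9685
MPT         Mean key length      12       0.9693
MPT         Mean key length      13.3     0.9806
MPT         Mean key length      13.7     0.9823

Figure 19. Throughput of Forkbase integrated with different structures. Panels: (a) Read, (b) Write.
Figure 20. Forkbase vs. Noms. Panels: (a) Read, (b) Write.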

5.5. Breakdown Analysis

In this section, we evaluate how each SIRI property affects the storage and deduplication performance. We select POS-Tree as the testing object and disable the properties one by one. For each property, we start with the explanation on how each property is disabled and provide the experimental results following closely. We noticed that the Universally Reusable property is common for all immutable tree indexes. Thus, it is ignored in this experiment.

5.5.1. Disabling Structurally Invariant

POS-Tree calculates a rolling hash on the raw bytes of the entries in a sliding-window manner, and entries are grouped into one node when the pattern is detected or the maximum size is reached. The structure therefore depends only on the current data when built in a bottom-up manner, which is the key insight behind the Structurally Invariant property. We disable the property by forcibly splitting the entries at half of the maximum size when no pattern is found within the maximum size. Since index structure rebuilding only starts from the nodes containing the updated key position and the nodes to their right on the same level, the resulting structure then depends on the data insertion order. We increase the probability of not finding the pattern by increasing the number of pattern bits and lowering the maximum size.
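The pattern-based node boundary can be illustrated with a simplified chunker. Instead of a rolling hash over a sliding byte window, this sketch hashes each entry on its own with SHA-256 and closes a node when the low bits of the hash are zero or when the node is full. The function names and parameter defaults are assumptions and do not reflect Forkbase's implementation.

```python
import hashlib


def is_boundary(entry: bytes, pattern_bits: int) -> bool:
    """An entry closes a node when the low bits of its hash are all zero."""
    digest = hashlib.sha256(entry).digest()
    mask = (1 << pattern_bits) - 1
    return (int.from_bytes(digest[-4:], "big") & mask) == 0


def chunk(entries, pattern_bits: int = 2, max_entries: int = 8) -> list:
    """Group sorted entries into leaf nodes in a single bottom-up pass."""
    nodes, current = [], []
    for entry in sorted(entries):
        current.append(entry)
        if is_boundary(entry, pattern_bits) or len(current) == max_entries:
            nodes.append(tuple(current))
            current = []
    if current:
        nodes.append(tuple(current))
    return nodes
```

Because every boundary decision depends only on the entries themselves, building from the same set of entries yields the same nodes regardless of the order in which the entries arrived.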

The results for POS-Tree with and without the Structurally Invariant property enabled are shown in Figure 16(a). We observe a decrease of up to 15% in the deduplication ratio when Structurally Invariant is disabled. For instance, the deduplication ratio drops from 0.67 to 0.52 when the workload overlap ratio equals 100%. This is expected, as the index performs the operations in a different order, resulting in different nodes and fewer shareable pages. Though the records stored are the same, the structurally variant POS-Tree cannot reuse the nodes. Similarly, Figure 16(b) shows that the share ratio decreases by up to 17%, i.e., from 0.53 to 0.36, when the Structurally Invariant property is disabled.

5.5.2. Disabling Recursively Identical

Originally, when an update operation is performed in POS-Tree, only the nodes lying on the path from the root node to the leaf node are copied and modified, while the rest of the nodes are shared between the two versions. We disable the Recursively Identical property by forcibly copying all nodes in the tree. Therefore, there is no sharing of nodes in the update process. The set of differing pages between two instances is then much larger than their intersection, whose cardinality is in fact zero.

Figure 17(a) shows that the deduplication ratio for POS-Tree with Recursively Identical disabled is 0, since the structure does not allow sharing of nodes among different versions. Obviously, the share ratio of non Recursively Identical POS-Tree shown in Figure 17(b) is also 0. Compared to the figures in the previous section, we can derive how this property accelerates the deduplication rate and ultimately influences the final storage performance.

Overall, we can conclude that Recursively Identical is the fundamental property for indexes allowing deduplication and sharing across different users and datasets. On top of this, Structurally Invariant property further enhances the level of deduplication and sharing by making the structure history-independent.

5.6. System Experiment

To further evaluate the performance of SIRI, we integrate the indexes into Forkbase (Wang et al., 2018), a storage engine for blockchain and forkable applications. In this experiment, we configure a single Forkbase servlet and a client to benchmark the system-level throughput. The evaluation results are shown in Figure 19.

For read operations, the main difference between index-level performance and system-integrated performance is the remote access due to the client-server architecture. The overhead of remote access becomes the dominant factor of performance. To mitigate this overhead, Forkbase caches nodes at the clients after they are retrieved from the servers. Hence, subsequent read operations on the same nodes benefit from performing only local access.

Figure 19(a) shows the throughput of the read workload. Similar to the index-level experiments, the throughput decreases as the total number of records grows. POS-Tree achieves performance comparable to our baseline MVMB+-Tree, and it outperforms the other two indexes when the total number of records is large (greater than 2,560,000). MPT performs the worst among all indexes due to its larger tree height, which complies with the operation bound in Section 4.1.1. Again, the system runs out of memory for MPT when the dataset size exceeds 2,560,000 in the experiment. Different from the index-level experiment, MBT performs worse than POS-Tree and MVMB+-Tree when the number of records is extremely small (10,000 records). This is because the hit ratio of cached nodes for MBT is lower than for the other indexes. Since all index nodes of MBT have a fixed number of entries, the number of repeated reads is lower than for POS-Tree and MVMB+-Tree, where large nodes contribute more repeated reads. When the number of records grows larger, the affected portion of POS-Tree and MVMB+-Tree decreases, and consequently so does their number of repeated reads. The structure of MBT, in contrast, stays unchanged, leading to a constant number of repeated reads. Therefore, MBT performs better when the number of records is greater than 20,000. When the number of records is greater than 2,560,000, the bottleneck for MBT becomes the loading time and the scanning time of leaf nodes, and its throughput drops below that of the other indexes.

The write operations will be performed at the server side completely. Hence they will not be affected by the hit ratio of cached nodes described above. Figure 19(b) shows the throughput of write workload. We can observe similar results as that of index level experiments.

Meanwhile, we run a comparison experiment between Forkbase and NOMS, two similar systems in terms of data versioning management. We directly use the code of NOMS from the official GitHub repository (28), which is implemented in Go, for the experiment. To make a fair comparison, we configure the node size of POS-Tree to 4K and use a window size of 67 bytes, which is the default setting of NOMS. The experiment is conducted as follows. First, we initialize the systems with 10K to 128K records. Then we execute read and write workloads of 10K records each to measure the throughput. The results are shown in Figure 20. Forkbase performs 2x better in read operations and up to 20x better in write operations compared with NOMS.

6. Conclusion

Tamper evidence and deduplication are two properties increasingly demanded in emerging applications (e.g., blockchain and collaborative analytics) on immutable data. Recent work (43; 44; 15) has proposed three index structures with these two properties but does not systematically compare them with one another. Motivated by this, we conduct a comprehensive analysis of all three indexes in terms of both theoretical and empirical performance. Our analysis provides insights into the pros and cons of each index, based on which we conclude that POS-Tree (43) is a favorable choice for indexing immutable data.


  • [1] I. Ahn and R. Snodgrass (1986) Performance evaluation of a temporal database management system. In SIGMOD Record, Vol. 15, pp. 96–107. Cited by: §2.1.
  • [2] Amazon quantum ledger database. Cited by: §2.3.
  • [3] Azure blockchain service. Cited by: §2.3.
  • [4] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O’Neil, and P. O’Neil (1995) A critique of ANSI SQL isolation levels. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, SIGMOD ’95, New York, NY, USA, pp. 1–10. ISBN 0-89791-731-6. Cited by: §2.1.
  • [5] P. A. Bernstein and N. Goodman (1983-12) Multiversion concurrency control-theory and algorithms. ACM Trans. Database Syst. 8 (4), pp. 465–483. ISSN 0362-5915. Cited by: §2.1.
  • [6] A. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, and A. Parameswaran (2015) DataHub: collaborative data science & dataset version management at scale. In CIDR, Cited by: §2.1.
  • [7] Bitcoin. Cited by: §1.
  • [8] T. Chiueh and D. Pilania (2005) Design, implementation, and evaluation of a repairable database management system. In ICDE, Cited by: §2.1.
  • [9] K. Eshghi and H. K. Tang (2005) A framework for analyzing and improving content-based chunking algorithms. Hewlett-Packard Labs Technical Report TR 30 (2005). Cited by: §2.2.
  • [10] Ethereum. Cited by: §1, §2.1, §2.3, §3.4.1, §5.2.
  • [11] Git. Cited by: §2.1.
  • [12] Git (and GitHub) for data. Cited by: §2.1.
  • [13] GoogleDocs. Cited by: §2.1.
  • [14] S. Huang, L. Xu, J. Liu, A. J. Elmore, and A. Parameswaran (2017) OrpheusDB: bolt-on versioning for relational databases. PVLDB 10 (10), pp. 1130–1141. Cited by: §1, §2.1.
  • [15] Hyperledger. Cited by: §1, §1, §2.1, §5.2, §6.
  • [16] M. Kallahalla, E. Riedel, R. Swaminathan, Q. Wang, and K. Fu (2003) Plutus: scalable secure file sharing on untrusted storage. In FAST, pp. 29–42. Cited by: §2.3.
  • [17] A. Kemper and T. Neumann (2011) HyPer: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE, pp. 195–206. Cited by: §2.3.
  • [18] G. Kollios and V. J. Tsotras (2002) Hashing methods for temporal data. IEEE Trans. on Knowl. and Data Eng. 14 (4), pp. 902–919. Cited by: §2.1.
  • [19] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao (2000) OceanStore: an architecture for global-scale persistent storage. In ASPLOS, Cited by: §2.1.
  • [20] S. Lanka and E. Mays (1991) Fully persistent B+-trees. In SIGMOD, pp. 426–435. Cited by: §2.1.
  • [21] LevelDB. Cited by: §1, §2.3.
  • [22] J. Li, M. Krohn, D. Mazieres, and D. Shasha (2004) Secure untrusted data repository. In OSDI, Cited by: §2.3.
  • [23] M. Maddox, D. Goehring, A. J. Elmore, S. Madden, A. G. Parameswaran, and A. Deshpande (2016) Decibel: the relational dataset branching system. PVLDB 9 (9), pp. 624–635. Cited by: §1, §2.1.
  • [24] Mercurial. Cited by: §2.1.
  • [25] R. C. Merkle (1988) A digital signature based on a conventional encryption function. In A Conference on the Theory and Applications of Cryptographic Techniques on Advances in Cryptology, pp. 369–378. Cited by: §2.3.
  • [26] A. Muthitacharoen, B. Chen, and D. Mazieres (2001) A low-bandwidth network file system. In SIGOPS Operating Systems Review, Vol. 35, pp. 174–187. Cited by: §2.2.
  • [27] S. Nakamoto (2009) Bitcoin: a peer-to-peer electronic cash system. Cited by: §1, §2.1, §2.3.
  • [28] NOMS. Cited by: §5.6.
  • [29] C. Okasaki (1999) Purely functional data structures. Cambridge University Press. Cited by: §2.1.
  • [30] J. Paulo and J. Pereira (2014) A survey and classification of storage deduplication systems. ACM Computing Surveys (CSUR) 47 (1), pp. 11. Cited by: §2.2.
  • [31] D. R. K. Ports and K. Grittner (2012-08) Serializable snapshot isolation in PostgreSQL. Proc. VLDB Endow. 5 (12), pp. 1850–1861. ISSN 2150-8097. Cited by: §2.1.
  • [32] J. R. Driscoll, N. Sarnak, D. Sleator, and R. Tarjan (1989-01) Making data structures persistent. Vol. 38, pp. 109–121. Cited by: §2.1.
  • [33] RocksDB. Cited by: §1, §2.3.
  • [34] O. Rodeh (2008) B-trees, shadowing, and clones. ACM Transactions on Storage (TOS) 3 (4). Cited by: §2.1.
  • [35] B. Salzberg and V. J. Tsotras (1999) Comparison of access methods for time-evolving data. ACM Computing Surveys (CSUR) 31 (2), pp. 158–221. Cited by: §2.1.
  • [36] D. J. Santry, M. J. Feeley, N. C. Hutchinson, and A. C. Veitch (1999) Elephant: the file system that never forgets. In HotOS, pp. 2–7. Cited by: §2.1, §2.1.
  • [37] A. Seering, P. Cudre-Mauroux, S. Madden, and M. Stonebraker (2012-04) Efficient versioning for scientific array databases. In ICDE, pp. 1013–1024. Cited by: §2.1.
  • [38] C. A. N. Soules, G. R. Goodson, J. D. Strunk, and G. R. Ganger (2003) Metadata efficiency in versioning file systems. In FAST, Cited by: §2.1, §2.1.
  • [39] M. Stonebraker and L. A. Rowe (1986) The design of the POSTGRES. In SIGMOD, pp. 340–355. Cited by: §2.1.
  • [40] J. D. Strunk, G. R. Goodson, M. L. Scheinholtz, C. A. N. Soules, and G. R. Ganger (2000) Self-securing storage: protecting data in compromised system. In OSDI, Cited by: §2.1, §2.1.
  • [41] Subversion. Cited by: §2.1.
  • [42] A. U. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R. Snodgrass (1993) Temporal databases: theory, design, and implementation. Benjamin-Cummings Publishing Co., Inc.. Cited by: §2.1.
  • [43] S. Wang, T. T. A. Dinh, Q. Lin, Z. Xie, M. Zhang, Q. Cai, G. Chen, B. C. Ooi, and P. Ruan (2018) Forkbase: an efficient storage engine for blockchain and forkable applications. PVLDB 11 (10), pp. 1137–1150. Cited by: 1st item, §1, §3.2, §3.4.3, §3, §5.2, §5.6, §5, §6.
  • [44] G. Wood (2014) Ethereum: a secure decentralised generalised transaction ledger. Cited by: §1, §6.
  • [45] W. Xia, H. Jiang, D. Feng, F. Douglis, P. Shilane, Y. Hua, M. Fu, Y. Zhang, and Y. Zhou (2016) A comprehensive study of the past, present, and future of data deduplication. Proceedings of the IEEE 104 (9), pp. 1681–1710. Cited by: §2.2.
  • [46] N. Zhu and T. Chiueh (2003) Design, implementation, and evaluation of repairable file service. In DSN, Cited by: §2.1.