Blockchains allow a set of parties to collaboratively maintain a collection of data on a tamper-resilient and decentralized ledger. This provides numerous benefits compared to traditional data storage models, where the administration of a shared database is delegated to either one or more trusted entities.
One particularly notable benefit is that a blockchain-based data storage solution mitigates security issues that can arise from malicious administrators. Moreover, it eliminates the potential for a single point of failure because data is replicated across multiple entities. Due to their immutability and auditability guarantees, blockchains are useful for storing access logs from multiple sites to different datasets (e.g., genomic research data). This is because access logs require strong auditability (e.g., auditing accesses to genomic information), transparency (e.g., publicly verifying that certain data is not misused by checking the logs) and tamper-resistance (e.g., preventing an attacker from manipulating the stored logs).
Furthermore, blockchains provide access to an uniform view of data that is logged from different sites. This, in conjunction with its other properties, can be beneficial in various contexts as demonstrated by previous investigations, such as those in the healthcare domain ozercan2018choudhury2018li2019dmms.
However, the underlying data structure of a blockchain, by default, does not provide for an efficient technique to query the stored data. To overcome this limitation, most existing blockchain implementations, such as Bitcoin Bitcoin and Ethereum ethereum, provide support for key-value stores to duplicate and store data on the blockchain. These key-value stores are then leveraged to support simple key-based retrieval queries that return the associated values.
In this paper, we show how to leverage the key-value store provided by a blockchain to run complex queries efficiently (e.g., range queries) on the blockchain data. We specifically report on the approach we designed for the blockchain track of iDASH Privacy & Security Workshop 2018 competition iDash2018. Our findings show that our approach induces reasonable overhead, in terms query-response time, in comparison to a traditional relational database management tool.
Blockchain was first introduced by Nakamato as the underlying ledger of the now famous Bitcoin cryptocurrency Bitcoin. Briefly, a blockchain is an append-only, distributed and replicated database. It allows the participants of a network to collectively maintain a sequence of data in a tamper-resilient way. More importantly, it does so without a requirement for a trusted third party by invoking a consensus mechanism.
Informally, a blockchain network operates as follows: participants broadcast their data and certain nodes called miners gather and store the data they receive in wrapper structures called blocks
. Through a consensus mechanism, the network elects a leader miner in a decentralized fashion for a sequence of epochs. The epoch leader broadcast his block to the network and, having received the leaders block, other nodes store it in their local memory where each block maintains a hash-link to the previous block.
The consensus algorithm that the blockchain network deploys may depend on whether or not the network is permissionless. For example, Bitcoin operates on a permissionless network, where anyone is free to join and there is no uniform view of the network across participants. It utilizes a cryptographic puzzle called Proof-of-Work PoW to achieve consensus. This makes tampering with the order of blocks computationally infeasible when the majority of the network participants follow the protocol honestly.
In permissioned networks however, participants can employ more efficient consensus algorithms, such as PBFT PBFT. This is because the identity and number of participants are known to every party.
Multichain is a platform to deploy permissioned blockchains Multichain. In this context, permissioned means that access to the blockchain network can be arbitrarily restricted. Such networks are usually initialized by a single party who, at a later point in time, allocates permissions to other nodes to join the network and participate in the consensus protocol. For consensus, Multichain deploys a variant of a classical Byzantine fault tolerance algorithm whose exact details are provided in the Mining in MultiChain section of the corresponding whitepaper MultiChainWhite.
To handle queries efficiently, Multichain provides a module called streams, which uses an abstraction of a dictionary (i.e., key-value store) on top of a blockchain Streams. The streams module allows a node to store an arbitrary datum and an associated key by submitting a key-value pair in a transaction to the blockchain. Multichain duplicates and indexes the data stored on the blockchain in LevelDB (a key-value store LevelDB), which is locally maintained by each node to serve queries submitted to the blockchain efficiently MultichainStreams. In other words, the streams module allows a node to interact with the underlying key-value store. It is possible to store multiple values with the same key, such that query results can be returned as lists.
The streams module supports the following methods (among others) on top of a blockchain.
createDictionary(dictionary-name): Creates a dictionary with the specified name.
insert(dictionary-name, value, key): Inserts the key-value pair to the specified dictionary.
retrieve(dictionary-name, key): Retrieves the value(s) corresponding to the given key from the specified dictionary.
We note that it is possible to create an arbitrary number of dictionaries on top of a blockchain, each of which stores data independently.
Overview of the task
The blockchain track of the iDASH Privacy & Security Workshop 2018 competition provided a genomic dataset access log trail iDASH2018Tracks in which each log consists of 7 fields: timestamp, node, ID, ref-ID, user, activity and resource. The activity and resource fields can take on arbitrary string values. The other fields can take on arbitrary positive integer values.
This trail is stored on a blockchain, where it is assumed that the trail arrives as a data stream (i.e., one log at a time). The competition rules dictated that the blockchain must be created using Multichain version 1.0.4 and the network must consist of four identical nodes, each representing a different site, initialized with default parameters per Multichain specifications. It should be possible to insert and query the data using any node. Also, the rules prohibited the use of using any off-chain mechanisms to handle the data other than what Multichain provides.
The goal is to develop a system that can query the blockchain efficiently under this setting while minimizing loading times and storage space. A viable solution must support two types of queries:
Conjunctive equality queries on selected fields.
A range query on the timestamp field.
Furthermore, the system should support the reporting of results in ascending or descending order on any field.
We now describe the techniques and the optimizations we deployed to handle queries efficiently in our system. We note that although our system is explicitly tuned for the blockchain track of iDASH 2018, our methods can be applied to support more general queries.
In what follows, we first describe how to handle conjunctive equality queries. Next, we describe how to handle range queries. We then explain how query response times can be further improved via batch loading.
Conjunctive Equality Queries
For each field, we create a dictionary that uses field values as keys and logs as values. For example, in the user dictionary at key , we have the logs whose user field’s value is . Similarly, in the ID dictionary at key , we have the logs whose ID is . Note that the logs are duplicated for each field.
When processing such a query, we first find the most restrictive field key and retrieve the logs from the corresponding dictionary with that key. Next, we filter the retrieved logs with other field keys. Finding the most restrictive field can be achieved efficiently. This is because Multichain keeps track of how many items are stored at a key in a dictionary, which can be accessed by a getCount(dictionary-name, key) method.
As an example, consider a query that requests logs with . We first compute = getCount(user-dictionary, 1) and = getCount(id-dictionary, 2). Then if , we retrieve the logs from the id-dictionary, via retrieve(id-dictionary, 2) and discard the logs whose user field is not equal to 1.
Range Queries on a Single Field
To handle range queries, we designed a bucketization technique. That is, we create intervals of a fixed size and assign each log to exactly one of those intervals depending on the queried field’s value. Each interval is referred as a bucket and identified by an unique value. Particularly, our bucketization technique works as follows: first, we create a separate dictionary, range-dictionary, in which we assign each log key , where is a predefined bucket size.
Next, given a range query , which requests logs whose timestamps are between and (inclusive), we retrieve all of the logs with keys . Finally, we retrieve and perform a linear scan of the logs at keys and and discard logs whose timestamps are not in .
Note that, for each log with a key in range , it is guaranteed that the log is in . As a result, we do not need to scan the logs stored at these keys.
Improving Retrieval Speed via Batch Loading
We observed that in the Multichain platform loading logs in batches can substantially improve retrieval speed. Here batch loading means that, instead of inserting one log in each transaction, we buffer and insert several logs in a single transaction.
We observed that if we load logs as batches of size , then retrieving these logs would be roughly times faster than storing them one at a time. The competition rules required the solution to be crash-resistant, so a straightforward way of buffering would have failed to meet this requirement. For example, if our buffer size is , then we load logs to the blockchain in batches of size 4. Yet if the system crashes after the first 2 logs arrive, then both of these logs would be lost due to the fact that the contents of the buffer was not loaded to the blockchain at the crash time.
To overcome this problem, we extend our solution to maintain two dictionaries per field, a batch dictionary and a regular dictionary. In the batch dictionary, we load logs in batches, while in a regular dictionary, we load the logs one at a time (i.e., a buffer size of 1). When retrieving data, we select from the batch dictionary, compare the size of the retrieved list with the corresponding list in the regular dictionary and execute a crash recovery (if needed).
For example, imagine a query that attempts to retrieve logs with . To support this query, we first retrieve the logs from the batch dictionary via batchLogs = retrieve(batch-node-dictionary, 1). We then compute the length of the corresponding list in the regular dictionary via logListSize = getCount(regular-node-dictionary, 1). Then, if the size of batchLogs is equal to logListSize, we simply return batchLogs as the result.
Otherwise, it becomes evident that we lost some data from the batch dictionary due to a crash. To recover, we compute the difference between the size of batchLogs and logListSize, i.e., = logListSize - size(batchLogs). We then retrieve the last items from the regular dictionary and append these to both batchLogs and the batch dictionary. Finally, we return batchLogs as the result.
In this section, we report on a set of experiments designed to characterize the performance of our system.
Implementation details and experiment setting
We implemented our solution using Python 3.5.2, Multichain 1.0.4 and used the Savoir wrapper to interact with Multichain API Savoir. Our code is available at MyEntry. Our test setup consisted of four identical virtual machines with the following specifications: 2-Core CPU (2.6. GHz Intel Xeon E5), 7.5 GB of RAM and 50 GB of storage with the Ubuntu 14.04 LTS operating system.
We used the dataset supplied by the competition organizers, which consisted of four files, one per node, in which each file has logs. The following is an illustration of the structure and content of the logs:
TIMESTAMPΨNODE ID REF-ID USER ACTIVITY RESOURCE 1522000002801Ψ1Ψ1Ψ1Ψ1ΨREQ_RESOURCEΨTOPMed 1522000008352Ψ1Ψ2Ψ1Ψ1ΨVIEW_RESOURCEΨGTEx 1522000016966Ψ1Ψ3Ψ3Ψ6ΨREQ_RESOURCEΨMOD_FlyBase 1522000019451Ψ1Ψ4Ψ1Ψ1ΨFILE_ACCESS TOPMed 1522000019566Ψ1Ψ5Ψ3Ψ6ΨVIEW_RESOURCEΨMOD_FlyBase 1522000024848Ψ1Ψ6Ψ6Ψ10ΨFILE_ACCESS GTEx
From the previous discussion regarding batch loading, it is evident that the larger the buffer size, the faster the retrieval speed. However, Multichain imposes a size limit on each transaction, such that it is not possible to increase buffer size arbitrarily. We observed that, for the given dataset, the transaction size limit is reached for a buffer size around . As a result, we set the buffer size accordingly.
Now, it can be seen that the number of buckets we have to retrieve decreases with the increasing bucket size. However, the number of individual logs we have to scan may increase. This is because it depends on the distribution of logs over the buckets and the query. Note that if one chooses bucket size poorly, it could be the case that all the logs would go to the same bucket.
To determine an appropriate bucket size, we ran several range queries of varying sizes on the given data and measured the average running time. During our empirical analysis, we observed an increase in the average running time as bucket size increased from to . After that point however, average running time started to decrease with the increasing bucket size. As a result, we selected a bucket size of
Finally, we compared our solution’s performance with a traditional relational database, namely SQLite 3.22 which we ran in one of the virtual machines. We report on the average measures over 10 runs and illustrate standard deviations by error bars in our plots. Also we note that
First, we measured the load time by using logs of various sizes concurrently at each node. Figure 1 depicts the average load time of a node with respect to number of logs loaded by it. We further plot the influence of file size on total load time, which corresponds to the slowest node. Those results are provided in Figure 2. These figures do not include the results from SQLite because our solution is substantially slower. For instance, loading all logs required merely about 4 seconds in SQLite.
Next, we investigated query-response times. We ran the test queries supplied by the competition organizers. The queries and the number of records returned by were as follows.
Query 1: SELECT * FROM Chain WHEREuser = 7 AND (timestamp BETWEEN1522257730000 AND 1522449160000).Returns 30489 records.
Query 2: SELECT * FROM Chain WHEREresource = ’MOD_WormBase’.Returns 67462 records.
Query 3: SELECT * FROM Chain WHEREuser = 1 AND resource = ’TOPMed’.Returns 17098 records.
Query 4: SELECT * FROM Chain WHEREnode = 3 AND ref-ID = 40345ORDER BY timestamp ASC.Returns 5983 records.
The running times for these queries are shown Figure 3 As the results indicate, the main bottleneck of our solution is the range query. We also note that SQLite’s internal representation and processing scheme is quite different than our method. As such, the SQLite running time is not always highly correlated with the blockchain time.
In Figure 4, we compare the range query performance of our solution with respect to SQLite’s performance. We observe both methods scale linearly where the difference is between 5-6 seconds.
Finally, Figure 5 illustrates how the number of records retrieved influences the query-response time. In this experiment, we ran queries without any restrictions (i.e., query returns every stored log) after loading appropriate number of logs. Given how our approach handles conjunctive equality queries, this plot also represents the performance of conjunctive equality queries. This is due to the fact that a conjunctive equality query makes a call to getCount(.) per field given in the query in addition to retrieving the data. This only adds a negligible overhead.
Finally, we considered the storage requirements. After loading all logs, the size of the blockchain was about 3 GBs per node, whereas the size of the SQLite database was several orders of magnitude smaller at 20 MBs.
In this section, we discuss certain limitations and highlight opportunities for improvement of our approach.
First, as mentioned earlier, bucket and buffer size was based on an empirical investigation. We did not conduct extensive studies on these parameters to optimize them. It might be possible to improve query-response times by fine-tuning these parameters.
Second, it is possible to map string fields (i.e., resource and activity) to the integer values to reduce the size of logs. This may improve the loading and query-response times.
Third, we did not consider parallelization. Although Multichain platform imposes some limitations on the parallelization (e.g., concurrently reading different parts of a stream is not possible) workarounds might exist Parallelization.
Further, per the rules of the competition, we were not permitted to modify the blockchain parameters. A straightforward way of improving performance might be to optimize these parameters. For example, the target-block-time parameter controls the average number of seconds between two blocks whose default value is 15. It might be possible to decrease loading times by letting the blockchain generates blocks more often.
Finally, we note that Multichain is expected to deploy some new features to support data handling more efficiently in future versions. For instance, in version 2, blockchain stores just the hashes of data Multichain2. Since transactions will be shortened, this will likely reduce loading and response times. One can simply compare the hashes of data after fetching them from the accompanying key-value store with the hash on blockchain to ensure immutability in this model.
In this paper, we demonstrated that blockchain technology can overcome inherent limitations on querying and, thus, can be a useful tool for managing data accross multiple sites, particularly in scenarios that require strong immutability and auditability. We showed how bucketization, simple data duplication and batch loading can be utilized to run complex complex queries efficiently over blockchains that provide support for only simple key-value stores. Particularly, we implemented these notions in the submission to the blockchain track of iDASH 2018 competition that supports efficient conjunctive equality and range queries over blockchains created with Multichain platform. We illustrated that our approach induced reasonable overhead, in terms query-response time, in comparison to a traditional relational database management tool.
SQL: Structured Query Language; PBFT: Practical Byzantine Fault Tolerance.
Ethics approval and consent to participate
Consent for publication
Availability of data and material
Dataset is available at iDASH2018Tracks and our implementation is available at MyEntry.
The authors declare that they have no competing interests.
The authors would like to thank the anonymous reviewers for their constructive suggestions and comments.
This publication was partly supported by NIH award 1R01HG006844, NSF awards CICI- 1547324, IIS-1633331, CNS-1837627, OAC-1828467 and ARO award W911NF-17-1-0356.
All authors contributed equally to this work. All authors have read and approved this manuscript.