Searchable symmetric encryption (SSE) has been extensively studied for a long time since it was first introduced in . Generally, it allows a data owner to outsource data to an untrusted server in the encrypted form and later search for the records matching a given query. During the entire search process, the private information about the database and the query is well protected from the semi-trusted server.
In most existing works, the remote server is modeled as an honest-but-curious entity [19, 17, 9, 14, 18, 11, 32, 38, 37] that never deviates from the prescribed protocol. In reality, however, a malicious server may return partial answers or even non-matching documents (e.g., due to random failures). More seriously, a security breach or an insider attacker may gain illegitimate access and alter the computations performed over the data. This could happen when a successful malware infection (e.g., via email attachments or infected P2P media) on one host gives an attacker elevated access privileges. To address these concerns, security designs against a malicious server are needed to facilitate the wide application of SSE.
Recently, a few works have been devoted to designing verifiable SSE schemes where a data owner is able to verify the integrity of search results. Nevertheless, their verification techniques (e.g., using MAC  or hash table ) are highly dependent on specific SSE schemes, and for now only support simple query expressions such as single-keyword search. It remains unclear how to generically impose verifiability on the many existing SSE schemes that support expressive queries and complex data structures (e.g., boolean queries [6, 10] or graph data ) without incurring expensive overheads on the data owner.
We observe that cheating is possible mainly because the centralized server takes full control of the data and executes protocols independently, without supervision. In light of this, we turn to smart contracts, an emerging decentralized computing paradigm in blockchain where all operations are transparent and reliable. With no central server involved, outsourcing search queries to a smart contract yields a correct and immutable result and requires no further verification by the data owner. This eliminates concerns about a malicious adversary as long as the security of the blockchain is guaranteed.
To this end, we first study public blockchain, a permissionless environment that everyone can get access to. It provides an off-the-shelf decentralized platform, enabling the data owner to directly make use of it. By utilizing the popular public blockchain environment Ethereum , we, for the first time, propose a decentralized SSE scheme . The smart contract running over Ethereum is carefully designed to circumvent various barriers (e.g., gas limitation, gas availability) in Ethereum. Considering some application scenarios where a set of permissioned service providers (i.e., peer nodes) is available, we further study private blockchain environment running among those service providers, and propose an alternative decentralized SSE scheme leveraging the popular private blockchain framework Hyperledger . The two proposed designs and have their own merits, leading to a trade-off between security and efficiency. To give an exemplary instantiation, both and are constructed on classic inverted index based searchable symmetric encryption schemes [5, 4]. We emphasize that our framework is a general one, and many other SSE solutions supporting complex expressiveness (e.g., boolean queries) and structured data (e.g., graph) fit for our setting as well and can be altered likewise to have their decentralized counterparts, as explicitly discussed in Section 9.
In order to further support practical applications, we investigate the multi-user setting, a more complex scenario of SSE [9, 17, 22] where an authorized user is allowed to search files shared by the data owner. For instance, in a traditional cloud-based picture or file sharing system (e.g., Dropbox), a data owner can upload its pictures or files to the cloud server such that they can be shared among friends or family. In our decentralized setting, instead of using the cloud server, we also aim to provide sharing services through the blockchain network. We study public and private blockchain environments and show how to enable users to search private database, and impose search control such as adding or revoking users. According to the characteristics of the underlying blockchain platforms, we use a straightforward extension for : letting the data owner search after receiving the user’s query. For the private blockchain scheme , we propose an alternative approach that enables the user to search keywords independently and efficiently without getting any help from the data owner.
In summary, we make the following key contributions:
By leveraging the smart contract, we propose two decentralized searchable symmetric encryption (SSE) schemes and , catering to the public and the private blockchain environments respectively, to guarantee that the data owner receives correct search results and has no need to perform verifications in the face of a malicious adversary.
We investigate the multi-user setting for both and where the authorized users are able to search shared files correctly and privately, and the data owner can add/revoke users flexibly.
We implement two prototypes of and . Extensive experiments and evaluations over local simulated network and official test network demonstrate the practicability of designing SSE schemes in a decentralized manner.
2 Related Work
Searchable Symmetric Encryption. SSE was first introduced in . Since then, great efforts have been devoted to developing secure and efficient SSE schemes. More than ten years ago,  for the first time formally considered leakage and designed a static SSE scheme that is secure against adaptive chosen-keyword attack. As a following work,  proposed the first dynamic SSE scheme that is also secure against adaptive chosen-keyword attack.
In recent years, most SSE works have focused on supporting more complex structures and queries and on improving efficiency with regard to search time and communication cost. One of the most notable examples is  that proposed the first SSE scheme to support conjunctive queries in sub-linear time. Then  extended this work to achieve much more complex queries including substring, range, wildcard and phrase queries. Besides, [11, 32] showed how to handle boolean formulas, ranges and stemming by using garbled circuits and Bloom filters.  then proposed several optimizations to handle very large datasets (e.g., tens of billions of record-keyword pairs). Recently,  proposed the first efficient disjunctive and boolean SSE scheme with worst-case sub-linear search complexity and optimal communication overhead. Along another line,  extended SSE to support arbitrarily structured data, such as graphs, labeled data or matrices. A recent work  presented a graph encryption scheme to support approximate shortest distance queries. All of these works, however, only address security against a semi-honest adversary. They are vulnerable to a malicious server that may return incorrect search results.
Verifiable Searchable Symmetric Encryption. To mitigate a malicious adversary, verifiable SSE schemes have attracted interest in recent years.  studied this problem and proposed a verifiable SSE scheme that is UC-secure. Then [35, 3] constructed dynamic and more efficient schemes. Based on these results, recently  used trapdoor permutations to construct a very simple forward secure searchable encryption scheme. To address the limitation of requiring specific SSE constructions,  proposed a generic verifiable scheme that uses a Merkle Patricia Tree (MPT) and Incremental Hash to create the proof index. Nevertheless, these works impose extra computation cost and storage overhead on a stateful data owner. Our preliminary work  proposed utilizing smart contracts in Ethereum to realize a decentralized and reliable SSE scheme, but it suffers from high overheads (e.g., gas and cryptocurrency consumption, time costs) due to some inherent characteristics of public blockchains (e.g., the PoW-based mining process), and does not fully consider the multi-user setting where adding/revoking users should be supported. We therefore propose a new scheme that makes use of a private blockchain to improve efficiency. We further investigate the multi-user setting for , and show how to enable authorized users to search the private database. We propose new secure protocols to flexibly add and revoke users. Moreover, several construction variants are proposed to address some security issues and strengthen our designs. Table I gives a comparison of our work and previous verifiable SSE schemes.
In this section, we briefly introduce traditional searchable symmetric encryption (SSE), the cryptographic tools we use, and the main technologies that support our decentralized design, namely smart contracts in Ethereum and Hyperledger.
3.1 Searchable Symmetric Encryption
We follow the formalization of Kamara et al.  with a slight modification. In our paper, is defined as the security parameter and denotes a negligible function in the security parameter. The set of all binary strings of length is denoted as , and the set of all finite binary strings is denoted as . We write to represent an element being sampled uniformly at random from a finite set . The algorithms and protocols are running in polynomial time. In particular, adversaries are polynomial-time algorithms.
A database is a list of identifier/keyword-set pairs where and . The set of keywords of the database DB is . The set of documents containing a given keyword is denoted as . We will always set and to be the number of distinct keywords and the total number of keyword/document pairs, respectively.
A traditional dynamic searchable symmetric encryption scheme consists of one algorithm Setup and two protocols Search and Update between a data owner and a server.
Setup(DB) takes as input a database DB and outputs a tuple where EDB is the encrypted database, is a secret key, and is the data owner’s state.
is an interactive protocol where the data owner takes as input the secret key , its state , and a search word , and the server takes as input the encrypted database EDB. The server outputs a set of identifiers while the data owner has no output.
is an interactive protocol between the data owner with inputs the key , the state , an operation , a file identifier id, and a set of distinct keywords, and the server with input EDB. These inputs represent the actions of adding a new file with identifier id and deleting the file with identifier id.
For simplicity, the formalization of SSE here does not model the storage of the actual document payloads. The SSE literature varies in how it deals with this issue. In our case, where a decentralized environment is considered, the encrypted documents can be stored in any decentralized file system, such as IPFS, discussed below.
Cryptographic Tools. In our constructions, we make use of variable-input-length pseudo-random functions (PRFs) which are polynomial-time computable functions that cannot be distinguished from random functions by any probabilistic polynomial-time adversary. Formal definitions of PRFs can be found in . Some of our constructions will be analyzed in the random oracle model . We use to denote the random oracle.
3.2 Gas System in Ethereum
The gas system is a distinctive feature of Ethereum. It is designed to mitigate Denial-of-Service (DoS) attacks on the Ethereum network. Specifically, the contract script is compiled into Ethereum opcodes and stored in the blockchain. Each opcode costs a certain pre-defined amount of gas . When initiating a smart contract by sending a transaction, the sender has to specify the gasLimit available for execution and the gasPrice that the sender is willing to pay for each unit of gas. The transaction is included in the blockchain successfully only when the balance of the sender is larger than gasLimit × gasPrice. Although useful in avoiding network abuse, the gas system also imposes some restrictions on the design of our schemes, as described in Section 5.
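As a small worked example of the balance condition, the following Python sketch computes the maximum fee a sender must be able to cover before a transaction is accepted. The function names and the sample values (a gas limit of 21000 and a gas price of 20 gwei) are illustrative assumptions, and any Ether value transferred with the transaction is ignored.

```python
def upfront_cost(gas_limit: int, gas_price_wei: int) -> int:
    """Maximum fee in wei the sender may be charged: gasLimit * gasPrice."""
    return gas_limit * gas_price_wei


def can_submit(balance_wei: int, gas_limit: int, gas_price_wei: int) -> bool:
    """True if the sender's balance covers the worst-case gas fee."""
    return balance_wei >= upfront_cost(gas_limit, gas_price_wei)


GWEI = 10**9
cost = upfront_cost(21_000, 20 * GWEI)  # 420_000_000_000_000 wei
print(cost, can_submit(10**15, 21_000, 20 * GWEI))
```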
3.3 Smart Contract in Ethereum
Ethereum is a new promising public blockchain platform . Its security is maintained by a cryptographic chain of puzzles (or blocks). Miners in the Ethereum network validate and approve transactions while mining new blocks. Mining a new block by successfully solving a designated cryptographic puzzle, a mechanism known as Proof-of-Work (PoW), rewards the miners with newly created cryptocurrency and thus incentivizes them to mine more blocks. The correctness of the network is guaranteed by this incentive mechanism. Anyone can join or leave the public blockchain at any time, and can read, write, or audit it. In general, Ethereum provides us with two appealing properties:
Consensus. The entire network agrees on the rules to verify each transaction and block. The data stored and computations executed on Ethereum must be consistent across miners and cannot be modified or denied.
Transparency. Ethereum is a public network. All the stored data and executed computations are transparent to any users.
Therefore, Ethereum acts as a trusted base that is relied upon for correctness and availability, but not for privacy.
Smart contracts in Ethereum are applications with a state stored in the blockchain. They can facilitate, verify, and enforce the process of a contract. Each smart contract, identified by a special address, consists of script code, a currency balance, and storage space in the form of a key/value store. Once created and deployed to Ethereum, the contract’s code can never be modified, even by its creator (the only exception is the special suicide opcode, which clears all of the contract’s data). The contract can be triggered by a transaction from an external account or a call from other contracts, and is executed in transaction form. Once a smart contract transaction gets included in the blockchain, all the nodes in the network are expected to verify its validity by repeating the contract script. The most distinguishing feature of smart contracts in Ethereum is their support for Turing-complete scripting, which makes it feasible for us to design various complex functions.
3.4 Smart Contract in Hyperledger
Hyperledger is a modular and extensible open-source system for deploying and operating private (or consortium) blockchains . It is a typical kind of permissioned blockchain, running among a set of known and identified participants who share a common goal but do not fully trust each other. Usually the consensus is guaranteed using traditional protocols like PBFT .
Smart contracts in Hyperledger, also called chaincodes, implement the application logic and are written in general-purpose programming languages (e.g., Go, Java, Node.js). The execution of smart contracts in Hyperledger is different from that in Ethereum. Instead of following the order-execute architecture, Hyperledger adopts a new execute-order-validate architecture to improve system efficiency and stability. Such a design enables us to deploy a more efficient application. More importantly, no cryptocurrency is needed to support the execution of smart contracts.
4 Security Definitions
In this section, we explicitly discuss the security goals our design aims to achieve.
Soundness. This property is derived from  which basically indicates that the server will get caught if it tries to deviate from the protocol. In other words, the data owner (and other users) will not accept a wrong search result. Existing works usually achieve this objective by letting the data owner conduct a series of verifications. In this paper, we extend this notion to claim that the received search results are always reliable and correct, so the data owner needs to perform no verification.
Confidentiality. The confidentiality of SSE evaluates the private information protected from the adversary. It follows the real/ideal simulation paradigm [19, 9, 5] and is parametrized by three leakage functions that describe what is allowed to leak to the adversary and are formalized as stateful algorithms. Formally, we have,
Definition: Let ,, be a dynamic SSE scheme and consider the following experiments with a stateful adversary , a stateful simulator and three stateful leakage functions , , :
chooses . The challenger runs to generate the key and gives to . Then repeatedly makes and queries where chooses challenger’s input . Meanwhile, the experiment runs or with challenger’s input and ’s input , and gives the transcript to . Finally, returns a bit as the output of the experiment.
chooses . The simulator is given and sends to . Then repeatedly makes and queries where chooses simulator’s input . Meanwhile, the experiment runs (resp. ) with simulator input (resp. ) and gives the simulated transcript to . Finally, returns a bit as the output of the experiment.
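We say that is -secure against adaptive attacks if for all probabilistic polynomial-time (PPT) adversaries , there exists a probabilistic polynomial-time simulator such that the probabilities that the two experiments above output 1 differ by at most a negligible function of the security parameter.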
The -secure against non-adaptive attacks can be defined in the same way, except that in both experiments must choose all of its queries at the start, and takes them all as input and gives the output to who generates EDB and the transcripts at the same time.
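For reference, the indistinguishability condition in the standard real/ideal simulation paradigm is typically written as below. The notation is assumed here, since the symbols are not reproduced in the text above: $\Pi$ for the scheme, $\mathcal{A}$ for the adversary, $\mathcal{S}$ for the simulator, $\lambda$ for the security parameter, $\mathbf{Real}$ and $\mathbf{Ideal}$ for the two experiments, and $\mathsf{negl}$ for a negligible function.

```latex
\left|
  \Pr\!\left[\mathbf{Real}^{\Pi}_{\mathcal{A}}(\lambda) = 1\right]
  - \Pr\!\left[\mathbf{Ideal}^{\Pi}_{\mathcal{A},\mathcal{S}}(\lambda) = 1\right]
\right| \le \mathsf{negl}(\lambda)
```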
5 Decentralized SSE in Public Blockchain
We first construct a decentralized SSE scheme with off-the-shelf public blockchains. To give an exemplary instantiation, is adapted from existing pioneering inverted index frameworks (such as [5, 4]) and modified to fit the decentralized environment. Therefore, soundness is automatically implied as long as the security of the underlying decentralized platforms is guaranteed. In Section 9, we show that the other SSE schemes with expressive queries or complex data types can also be extended to our settings similarly.
5.1 Design Challenges and Countermeasures
Intuitively, any traditional SSE scheme can be directly adapted to a decentralized environment by replacing the central server with a smart contract. Unfortunately, some of the features that guarantee the robustness and security of smart contracts become obstacles in this adaptation. Next we present the main design challenges and summarize the countermeasures at a high level.
Gas Limitation. In Ethereum, each transaction that calls a function of the smart contract has an upper bound on the gas it may consume, called gasLimit, as described in Section 3.2. Each operation, including sending/storing data and executing computations, has a fixed gas cost. This restricts a contract function to a very limited number of computation steps and a very limited amount of storage. Therefore, to make SSE over a large database feasible, we divide the database into smaller ones and conquer them individually. Simply speaking, in the setup phase where a large encrypted index is built, we partition the encrypted index into several blocks and upload them to the contract in as many transactions as needed, so that each transaction consumes less gas than gasLimit. To ensure correctness, the contract needs to assemble the uploaded data in order to return all matched results.
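The divide-and-conquer upload can be sketched as follows in Python. This is a simplified illustration and not the contract code: it assumes a linear gas model with a hypothetical fixed cost per transaction (`base_gas`) and per stored entry (`per_entry_gas`), and models the contract's key/value storage as a Python dictionary.

```python
def partition_index(entries, per_entry_gas, base_gas, gas_limit):
    """Split (label, value) entries into batches that fit within gas_limit."""
    max_per_tx = (gas_limit - base_gas) // per_entry_gas
    if max_per_tx < 1:
        raise ValueError("gas_limit too small to store even one entry")
    return [entries[i:i + max_per_tx]
            for i in range(0, len(entries), max_per_tx)]


class IndexContract:
    """Stand-in for the contract: merges uploaded batches into one dictionary."""

    def __init__(self):
        self.edb = {}

    def upload(self, batch):
        # One call models one transaction carrying one batch.
        self.edb.update(batch)


entries = [(bytes([i]), bytes([i + 100])) for i in range(10)]
contract = IndexContract()
for batch in partition_index(entries, per_entry_gas=20_000,
                             base_gas=21_000, gas_limit=100_000):
    contract.upload(batch)
```

Because the batches are merged into a single dictionary, a later search sees the same index as if it had been uploaded in one piece.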
Gas Availability. In the smart contract, each transaction is also associated with a gasPrice that specifies the price the sender is willing to pay for each unit of gas. The user who initiates the transaction must have an account balance larger than the gas cost of executing the transaction. Otherwise the transaction aborts immediately, and the gas already consumed is not refunded. Thus we should be very careful with the contract design with regard to gas cost. In particular, it is critical to ensure that each functionality (e.g., Search, Update) in the contract incurs a lower gas cost than the sender’s account balance.
The Verifier’s Dilemma. In Ethereum, miners are required to check the validity of transactions. However, verifying transactions may become significantly expensive when smart contracts contain many complex expressions. Rational miners are thus incentivized to skip the verification of expensive transactions so as to stay ahead in the race to mine the next block. This phenomenon is called the verifier’s dilemma . To mitigate this attack, we reduce the computation burden on the contract as much as possible. Our first observation is that the smart contract supports a dictionary data type, and the main computation overhead of SSE lies in the search phase. In light of this, we use a dictionary to store the encrypted index (i.e., EDB), which makes the search time complexity , where is the number of times that the keyword has been historically added to the database. Our second optimization is the utilization of a packing method inspired by . Specifically, we can pack multiple plaintexts and encrypt the output to obtain one ciphertext with the same size. The search result is thus returned in blocks instead of individual items. Besides, packing also helps us circumvent the above Gas Limitation since it greatly reduces the storage cost. We note that although  claimed to use the packing method as well, it did not describe explicitly how to implement it.
5.2 System Overview
In Fig. 1, we outline the architecture of our design. The data owner builds an encrypted index of keyword/identifier pairs and sends it to Ethereum, where complex computations are available via the smart contract. For ease of presentation, operations on the data documents are not shown in the framework, since the data owner can simply employ traditional symmetric key cryptography to encrypt documents and then outsource the encrypted data to any decentralized file storage network such as the InterPlanetary File System (IPFS). We do not put encrypted documents on Ethereum because storing data on it is very expensive. Offloading large data sets to another platform, while using Ethereum for computation with small data storage, greatly benefits the Ethereum network with regard to efficiency and robustness.
For each query, the data owner sends a transaction containing the search token to the designated smart contract. Note that each contract has a unique address in Ethereum. With the search token and the previously stored index, the smart contract executes the search algorithm and saves the search results (i.e., file identifiers) to its state. The data owner can later read the state and use the file identifiers to retrieve the actual documents from the file storage network. To add or delete files, the data owner only needs to send add/delete tokens to the contract and wait for the transactions to be mined into a block. For the add operation, our scheme requires the data owner to maintain a dictionary locally. This is not strictly necessary: the scheme can be modified slightly to make the data owner stateless, as shown in Section 8.
5.3 Our Detailed Construction
In Fig. 2 and Fig. 3, we give a formal description of our decentralized SSE scheme . For simplicity, let , be two pseudo-random functions (Note that there should be different PRFs for different input keys). We use to denote the concatenation operation. “” is a floor function, and “” denotes the number of elements in a list. For a dictionary data type, it includes two algorithms: Add and Delete. And we use term Get to fetch the specified data item in a dictionary. For example, given a dictionary data type and an input label , outputs the corresponding item and parses it into and .
In the Setup phase, the data owner divides into blocks, with each block of entries. Here is a system parameter chosen by the data owner. We use concatenation to pack multiple file identifiers into one. To ensure confidentiality, the bit length of should be less than that of the security parameter . Therefore, we have , where is the bit length of the file identifier. Note that before uploading the database, the list L should be placed in lexicographic order. Otherwise it will leak information about the order in which the input was processed. To avoid exceeding gasLimit, we partition the encrypted database into blocks and send them to the contract one by one with different transactions. At the contract side, they are received iteratively and placed together using dictionary data type. Similarly, the search process will be completed with transactions, each of which returns step items at most. Here , and step are public system parameters and experimentally determined.
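The packing step can be illustrated with the following Python sketch. It is an assumption-laden illustration rather than the scheme in Fig. 2: identifiers are 8 bytes long, the PRF is HMAC-SHA256 with a 32-byte output, so at most 4 identifiers fit in one block, an all-zero identifier is reserved as filler, and the names `pack_blocks`, `encrypt_block`, and `decrypt_block` are hypothetical.

```python
import hashlib
import hmac

PRF_BYTES = 32                 # output length of the stand-in PRF
ID_BYTES = 8                   # identifier length (hypothetical)
P = PRF_BYTES // ID_BYTES      # identifiers packed per block: 4
FILLER = b"\x00" * ID_BYTES    # assumes no real identifier is all-zero


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def pack_blocks(ids):
    """Sort identifiers, then concatenate them P at a time into fixed-size blocks."""
    ids = sorted(ids)
    blocks = []
    for i in range(0, len(ids), P):
        chunk = ids[i:i + P]
        chunk += [FILLER] * (P - len(chunk))  # pad the last block
        blocks.append(b"".join(chunk))
    return blocks


def encrypt_block(k2: bytes, counter: int, block: bytes) -> bytes:
    """Mask one packed block with a PRF-derived pad of the same length."""
    pad = prf(k2, counter.to_bytes(4, "big"))
    return bytes(a ^ b for a, b in zip(block, pad))


def decrypt_block(k2: bytes, counter: int, ct: bytes):
    """Unmask a block and split it back into identifiers, dropping filler."""
    block = encrypt_block(k2, counter, ct)  # XOR masking is its own inverse
    ids = [block[i:i + ID_BYTES] for i in range(0, len(block), ID_BYTES)]
    return [x for x in ids if x != FILLER]
```

With these parameters, six identifiers occupy two dictionary entries instead of six, which is where the storage saving comes from.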
In the Add phase, we encrypt each file id without packing. This is because encrypting several plaintexts into one ciphertext makes it hard for the contract to identify which file/keyword pair has been previously deleted, i.e., whether it exists in the set . In addition, in practice a change usually involves only one or a few documents at a time, so an update consumes far less gas than gasLimit. Therefore, dealing with each file id individually satisfies the system requirements for update operations.
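One way to picture unpacked updates is the Python sketch below, written in the style of dictionary-based dynamic SSE. It is a hypothetical illustration and not the protocol of Fig. 3: HMAC-SHA256 stands in for the PRFs, each added entry carries a deletion tag derived under a separate key, and the contract keeps a set of deleted tags. Re-adding a previously deleted pair is not handled.

```python
import hashlib
import hmac
import os

ID_LEN = 16  # identifier length in bytes (hypothetical)


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class Owner:
    def __init__(self):
        self.key = os.urandom(32)
        self.counters = {}  # local state: keyword -> number of adds so far

    def _keys(self, w: str):
        wb = w.encode()
        return (prf(self.key, b"add|" + wb),
                prf(self.key, b"mask|" + wb),
                prf(self.key, b"del|" + wb))

    def add_token(self, w: str, fid: bytes):
        k_add, k_mask, k_del = self._keys(w)
        c = self.counters.get(w, 0)
        self.counters[w] = c + 1
        ctr = c.to_bytes(4, "big")
        return prf(k_add, ctr), xor(fid, prf(k_mask, ctr)[:ID_LEN]), prf(k_del, fid)

    def delete_token(self, w: str, fid: bytes):
        return prf(self._keys(w)[2], fid)

    def search_token(self, w: str):
        k_add, k_mask, _ = self._keys(w)
        return k_add, k_mask


class Contract:
    def __init__(self):
        self.added = {}       # label -> (masked id, deletion tag)
        self.deleted = set()  # deletion tags received so far

    def add(self, token):
        label, value, tag = token
        self.added[label] = (value, tag)

    def delete(self, tag):
        self.deleted.add(tag)

    def search(self, token):
        k_add, k_mask = token
        out, c = [], 0
        while True:
            ctr = c.to_bytes(4, "big")
            entry = self.added.get(prf(k_add, ctr))
            if entry is None:
                break
            value, tag = entry
            if tag not in self.deleted:  # skip pairs that were deleted
                out.append(xor(value, prf(k_mask, ctr)[:ID_LEN]))
            c += 1
        return out
```

Keeping one entry per file/keyword pair is what lets the contract test each pair against the deleted set individually.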
For the protocol on the smart contract, we remark that a contract function triggered by a transaction does not return any result. Executing a function only changes the contract state, which is permanently stored on Ethereum. We therefore implement our scheme by saving search results into the state and later reading them on the data owner side.
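The pattern of writing results to state and reading them later, combined with a bounded number of items per transaction, can be sketched in Python as follows. This is a simulation under assumptions, not Solidity code: the class `SearchContract`, its `step` bound, and the single-search-at-a-time state are hypothetical simplifications, and HMAC-SHA256 again stands in for the PRF.

```python
import hashlib
import hmac


def prf(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()


class SearchContract:
    """Simulated contract: transactions mutate state and return nothing."""

    def __init__(self, index: dict, step: int):
        self.index = dict(index)  # label -> stored value
        self.step = step          # max items processed per transaction
        self.reset()

    def reset(self):
        self.results, self.cursor, self.done = [], 0, False

    def search_tx(self, k1: bytes) -> None:
        """One transaction: process at most `step` entries, save them to state."""
        for _ in range(self.step):
            label = prf(k1, self.cursor.to_bytes(4, "big"))
            if label not in self.index:
                self.done = True
                return
            self.results.append(self.index[label])
            self.cursor += 1

    def read_results(self):
        """Read-only call made by the data owner after the transactions are mined."""
        return list(self.results)


k1 = b"k" * 32
index = {prf(k1, i.to_bytes(4, "big")): b"v%d" % i for i in range(5)}
contract = SearchContract(index, step=2)
txs = 0
while not contract.done:
    contract.search_tx(k1)  # returns None; progress lives in contract state
    txs += 1
```

Here five matching entries with `step = 2` take three transactions, after which the data owner reads the accumulated results from the state.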
5.4 Multi-user Setting
In this work, we further address the issue of multi-user data sharing as considered in [17, 9, 22]. In such applications, the data owner is interested in allowing a third party (i.e., other users) to search the database, while the other users learn the information that the data owner authorizes them to learn but nothing else. The private information about the queries and search results should be protected from the adversary as well.
Using existing cryptographic tools such as broadcast encryption  is a possible solution to help the data owner add and revoke users. In a permissionless blockchain environment like Ethereum, however, anyone can participate in the network at any time and read/write history records, and everything on the smart contract is public. Such cryptographic schemes, which usually require the nodes in the network to store a private key and perform decryption operations, are therefore not applicable. Currently we propose to use the straightforward extension for as indicated in : the data owner receives the user’s query and generates the corresponding search tokens as if it were searching the database itself. Fig. 4 gives an overview of the multi-user design. Relying on cryptographic tools in a public blockchain environment to let users search efficiently and to add/revoke users flexibly is a challenging problem, and we leave it to future work.
6 Decentralized SSE in Private Blockchain
To expand the application scenarios, we construct with the private blockchain, where a set of known and identified service providers (i.e., peer nodes) is available. Although bearing a stronger assumption for the blockchain network, enjoys a higher efficiency than .
6.1 The Practical Concerns
The private blockchain, such as Hyperledger, runs among a set of participants who do not trust each other but have a common goal and try to provide a service collaboratively. We emphasize that the assumption of such a consortium holds in practice. Taking health information sharing as an example, a number of hospitals, research institutes, banks, and insurance companies may collaborate to maintain a shared medical database so as to provide a better user experience for patients. Typical examples include WorldCare , OMAHA , etc. Building a private blockchain among these participants creates a transparent and reliable environment for medical data. Clinics or individual patients can outsource their medical records, in encrypted form, to the consortium for ease of management. When necessary, any participant from the consortium, after being authorized by the data owner, can decrypt the database locally and obtain correct medical information with assurance. In such an application scenario, the participants in the consortium benefit from a trusted database whenever they access it. Privacy-preserving search services should also be supported by the consortium before the data owner releases private information to all the participants.
6.2 Our Construction
Although using different blockchain platforms with , we can regard blockchain as a black box and construct similarly. is constructed based on inverted index framework as well. The difference lies in the way we deal with the large data set.
In the Setup, also divides EDB into blocks. In Hyperledger Fabric, however, there is a limit on the size of the parameters. Generally speaking, we have , where denotes the limit on the parameter size. According to our experiments, we can include as many as 500 entries of L in one transaction, which is much more than that in .
In the Search step, since there is no gas limitation in the private blockchain, we can query as many records as needed. Therefore sets no limitation for step and sets . In other words, the search token is sent to the smart contract in one transaction, and the smart contract can execute the whole search operation in a single run.
supports update operations over a large-scale data set. Similar to the construction in Fig. 2, makes use of different secret keys to realize add or delete, i.e., using and to generate the add token and the delete token respectively. Besides, it is able to deal with a large data set by using the divide-and-conquer method, as done in the Setup phase. Our experiments will show that supports adding several hundred files.
6.3 Multi-user Setting
Different from public blockchain environment, only authenticated participants are allowed to join in the private blockchain network. In light of this, we propose making use of broadcast encryption  to facilitate multi-user data sharing for . A broadcast encryption system consists of three randomized algorithms . takes as input the number of users and outputs a public key and secret keys . Enc takes as input a subset and a public key, and outputs the broadcast ciphertext Hdr. Dec takes as input a subset , a user id , public key , the private key for user , and a broadcast ciphertext. It outputs the plaintext if . Our multi-user construction is illustrated in Fig. 5. Compared with the single-user scheme, the contract only needs to perform some extra simple operations (i.e., xor) in order to determine if the user has been revoked. It is very efficient in practice. Our multi-user design requires that the peers executing smart contracts maintain a private key . Such requirement is easy to realize since every participant in the private blockchain is identified and permitted by others to join the consortium. They are motivated to maintain their reputation and not likely to take the risk of colluding with users and revealing the secret key.
Table II. Comparison between our two designs (columns: Scheme, Consensus Algorithm, Mining, Scalability, Efficiency, Performance Bottleneck, Privacy Guarantee, Trustworthy).
7 Theoretical Analysis
7.1 Comparison Between Our Two Designs
Our two proposed decentralized schemes make use of two different kinds of blockchains, leading to a trade-off between security and efficiency. is constructed over a public blockchain, which already provides a decentralized computing platform. It enjoys high scalability since everyone can freely access the public platform and build their own SSE applications. However, its consensus is guaranteed through a costly PoW-based mining process, which becomes the main performance bottleneck for . Specifically, for each transaction that triggers a search or update function, we can have confidence in the correctness of the search results only after the transaction is included in a valid block. Currently it takes about to mine a block in Ethereum, which means that we have to wait until we can get the search results. A detailed explanation is presented in Section 10.
On the other hand, requires a stronger security assumption of a consortium, which has limited application scenarios. Unlike public blockchain that trusts the whole world, believes that the entire consortium is trusted and always generates correct data. Due to the high efficiency of private blockchain resulting from the fast consensus algorithm (e.g., PBFT), its performance is mainly affected by the database size, as shown in our experiments in Section 10.2. The time complexity of is for a search and for an update. Table II presents a concise comparison between them. We emphasize that has a higher trustworthy degree than since the public blockchain relies on the assumption that the majority of the whole world are honest, while the private blockchain assumes the majority of the involved participants to be honest. We believe that corrupting more users (i.e., 50% of the whole world vs. 50% of a set of participants) is much more difficult, since it needs to unite more network nodes for the collusion purpose.
7.2 Security Analysis
Soundness: It is straightforward to see that soundness can be achieved as long as the security of blockchain is guaranteed. This is because if smart contracts are correctly executed on blockchains, the search results will be stored as contract states permanently. Each node in the blockchain network can verify the states. The consensus property of blockchain ensures the correct execution of each search operation.
Confidentiality: Since and have a similar system model and design goal (i.e., protecting the database from the adversary), we present a security proof sketch only for ; the security of can be proved similarly. To prove confidentiality, we first give the formal definitions of the three stateful leakage functions , , considered in our construction. As part of the state, a list recording all queries that have been submitted is maintained. Specifically, each entry of the list is of the form , where denotes a counter, denotes the operation type, and the rest denote the inputs to the operation.
(Leakage function ). Given an initial input , . Meanwhile, it initializes a counter , an empty list , a set containing all the identifiers in , and saves them as the state.
(Leakage function ). Given a search input , , where denotes the search pattern, (resp. ) denotes the add (resp. deletion) pattern of the keyword with respect to and , all of which are defined below. Meanwhile, it increases and appends to .
(Leakage function ). Given an add update input , , where (resp. ) denotes the add (resp. deletion) pattern of with respect to , both of which are defined below. Meanwhile, it increases , appends to and adds to . For a delete update input, the only difference is that outputs instead of as the first component. Finally, if any of the search patterns was non-empty, then it also outputs .
Here, we define all the patterns mentioned above. The search pattern is a set of indices of queries where was searched for, i.e., . Namely, the search pattern reveals whether the keyword has been searched before. The add pattern is the set of indices where was added to the document , i.e., . The add pattern is the set of identifiers to which was added along with the indices showing when they were added, i.e., . Besides, the deletion patterns and can be defined analogously.
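The pattern definitions above can be illustrated with a minimal sketch. The flattened entry layout `(counter, operation, keyword, identifier)` and all function names are assumptions made for illustration only; the formal notation of the original definitions is not reproduced here.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical flattened entry of the query list:
# (counter, operation, keyword, identifier); identifier is None for searches.
Entry = Tuple[int, str, str, Optional[str]]


def search_pattern(queries: List[Entry], w: str) -> Set[int]:
    """Indices of the queries in which keyword w was searched."""
    return {j for (j, op, kw, _) in queries if op == "search" and kw == w}


def add_pattern_pair(queries: List[Entry], w: str, doc_id: str) -> Set[int]:
    """Indices at which keyword w was added to document doc_id."""
    return {j for (j, op, kw, d) in queries
            if op == "add" and kw == w and d == doc_id}


def add_pattern_keyword(queries: List[Entry], w: str) -> Set[Tuple[Optional[str], int]]:
    """Identifiers to which w was added, paired with the index of each addition."""
    return {(d, j) for (j, op, kw, d) in queries if op == "add" and kw == w}


# Illustrative query history (made-up keywords and identifiers).
QUERIES: List[Entry] = [
    (1, "search", "alice", None),
    (2, "add", "alice", "doc7"),
    (3, "search", "bob", None),
    (4, "search", "alice", None),
    (5, "add", "alice", "doc9"),
]
```

The deletion patterns would follow the same shape with `"delete"` as the operation type.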
Theorem: If and are pseudo-random, then our scheme is -secure against non-adaptive attacks.
Proof is deferred to Appendix 12 for ease of exposition.
8 Construction Variants
8.1 Adaptive Security
is proved to be secure against non-adaptive attacks. As noted in , using a random oracle enables us to achieve adaptive security easily. Specifically, in we replace the PRF with the random oracle . For an input with key , is replaced with . And is replaced with where is randomly chosen from . This variant has the same leakage function as . In the security proof, the simulator behaves similarly, except that needs to program the responses of the random oracle so that they match the query results that have already been revealed. For the label , can set the response of to be a random value bits in length. For the ciphertexts of id, can program the random oracle such that the ciphertexts decrypt to the revealed results.
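As a rough illustration of this substitution, the sketch below contrasts a keyed PRF with a hash-based stand-in for the random oracle. Everything here is an assumption for illustration: SHA-256 is used heuristically in place of the random oracle, the key-then-input concatenation and the masked-encryption form `(r, H(K || r) XOR m)` follow a common pattern for such variants, and the function names are hypothetical rather than taken from the scheme.

```python
import hashlib
import hmac
import os
from typing import Tuple


def prf(key: bytes, x: bytes) -> bytes:
    """Keyed PRF as used in the non-adaptive scheme (HMAC-SHA256 here)."""
    return hmac.new(key, x, hashlib.sha256).digest()


def random_oracle(key: bytes, x: bytes) -> bytes:
    """Heuristic random-oracle stand-in: hash of key concatenated with input.

    Illustration only; SHA-256 is not a true random oracle.
    """
    return hashlib.sha256(key + x).digest()


def ro_encrypt(key: bytes, message: bytes) -> Tuple[bytes, bytes]:
    """Mask the message with the oracle output on a fresh random value r."""
    if len(message) > 32:
        raise ValueError("this sketch only masks messages of up to 32 bytes")
    r = os.urandom(16)
    pad = random_oracle(key, r)[: len(message)]
    return r, bytes(m ^ p for m, p in zip(message, pad))


def ro_decrypt(key: bytes, r: bytes, masked: bytes) -> bytes:
    """Recompute the pad from r and remove it."""
    pad = random_oracle(key, r)[: len(masked)]
    return bytes(c ^ p for c, p in zip(masked, pad))
```

Because the pad depends only on the oracle's answer at `key || r`, a simulator that controls the oracle can choose that answer after the fact, which is what allows ciphertexts to be opened to the revealed results.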
8.2 Forward Privacy
Forward privacy is also an important security design goal in SSE. It means that the adversary does not learn whether a newly added document contains a keyword that has been searched before. Inspired by recent progress , our designs can be easily extended to achieve forward privacy as well. The key idea is to use a trapdoor permutation to make the search token unlinkable to the update token. Specifically, when generating a label for the -th entry in , instead of using a counter that increases itself, we use a trapdoor permutation in a way that and set the label as where is a randomly chosen integer. The smart contract can then compute only with the public key in polynomial time, but not , since it has no secret key. Therefore, the -th newly added entry to that has not yet been searched cannot be deduced from the previously leaked search token . This variant has the same communication complexity as (or ), and the computation overheads on the data owner and the contract increase slightly because of the permutation computation.
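The token chain behind this idea can be sketched as follows. This is a toy illustration under stated assumptions: textbook RSA with tiny, insecure parameters stands in for the trapdoor permutation, SHA-256 stands in for the label derivation, and all names (`next_token`, `prev_token`, `label`, `contract_walk`) are hypothetical.

```python
import hashlib
from typing import List

# Toy RSA trapdoor permutation (insecure sizes, illustration only):
# N = 61 * 53, and E * D = 1 mod lcm(60, 52).
N, E, D = 3233, 17, 2753


def next_token(st: int) -> int:
    """Data owner side: apply the inverse permutation with the secret key."""
    return pow(st, D, N)


def prev_token(st: int) -> int:
    """Contract side: apply the forward permutation with the public key."""
    return pow(st, E, N)


def label(keyword_key: bytes, st: int) -> bytes:
    """Derive an index label from a per-keyword key and a token."""
    return hashlib.sha256(keyword_key + st.to_bytes(2, "big")).digest()


def owner_labels(keyword_key: bytes, st0: int, num_adds: int) -> List[bytes]:
    """Labels for entries 0..num_adds-1, each using a fresh token."""
    labels, st = [], st0
    for _ in range(num_adds):
        labels.append(label(keyword_key, st))
        st = next_token(st)
    return labels


def contract_walk(keyword_key: bytes, latest: int, count: int) -> List[bytes]:
    """Given the latest token, walk backwards to recompute all earlier labels."""
    labels, st = [], latest
    for _ in range(count):
        labels.append(label(keyword_key, st))
        st = prev_token(st)
    return labels
```

Given the token for entry `c`, the contract can recompute the labels of entries `c, c-1, ..., 0`, but producing the token for entry `c+1` requires the secret exponent, which is why a later addition cannot be linked to an earlier search token.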
8.3 Stateless Data Owner
Currently our schemes require the data owner to maintain a local dictionary consisting of a counter for each keyword that is added after initialization. We could slightly modify the Add protocol to make the data owner stateless by encrypting and sending the ciphertext to a decentralized file storage system (e.g., IPFS). The data owner can then fetch the encrypted and decrypt it for each Add operation. The size of depends on the number of distinct keywords that have been added in the Add phase, which is much smaller than the total number of keywords. In this case, the adversary can learn how many new keywords were added to the database. As far as we can see, this leakage is acceptable in practice.
8.4 Security Against Malicious Data Owner
In the multi-user setting, is vulnerable to a malicious data owner who reveals an arbitrary random search token. To mitigate such an attack, we can use a zero-knowledge proof  to force the data owner to reveal a correct search token. Specifically, we first let the data owner generate a zero-knowledge proof for his search token. Then we use the smart contract to verify the proof, as done in . If the search token is invalid, we stop searching. In this way, the data owner gains nothing by cheating.
9 Generalization of our Framework
In this work, we use smart contracts to construct a decentralized SSE scheme based on the inverted index. We remark that many other SSE schemes fit our framework as well and can be extended to construct a wide range of decentralized SSE schemes with guaranteed soundness.
Recent works on SSE have focused on increasing expressiveness, such as supporting boolean queries [6, 10, 18], or on developing structured encryption such as graph encryption [8, 29]. All of them face the same serious security challenge: a malicious central server can output partial or even incorrect results whenever it wants. To address this concern, these works can likewise be adapted to our decentralized setting. The key observation behind this extension is that smart contracts provide us with a trusted and transparent “server”. The main obstacle lies in dealing with the various limitations of the gas system in smart contracts when using a public blockchain. The countermeasures we propose (e.g., dividing the encrypted index and conquering the parts individually, packing multiple identifiers) shed light on how to address these issues. Once constructed via smart contracts, a scheme is guaranteed to be sound, and there is no longer any need to worry about a malicious server.
Storing data and executing computations in blockchain-based decentralized environments are reliable and immutable. We strongly believe that using decentralized platforms instead of a central server greatly benefits the security requirements of SSE.
Table III: Key attributes of the datasets (columns: DB name, pairs, distinct keywords, EDB).
10 Implementation and Evaluations
We implement prototypes for both and . We first evaluate and in local simulated blockchain networks with TestRPC and Hyperledger Fabric, respectively. The multi-user design of is evaluated as well to demonstrate the performance of adding and revoking users. Considering the open nature of public blockchains, we further deploy to Rinkeby, an official Ethereum test network.
10.1 Implementation Details
For , we use the Hyperledger Fabric framework to construct a local private blockchain, and the smart contract (called chaincode in Hyperledger) is written in Go. There are two peers in our test network, belonging to different organizations, and we use the default 256-bit ECDSA scheme for signatures. We also use the built-in HMAC-SHA256 library provided by Go. Owing to the high scalability of the private blockchain, we set and include 500 entries from the list L in each transaction. One transaction is sufficient to complete a search query, and thus we set and place no limitation on step.
The experiments reported in this work use datasets derived from the Enron email corpus (https://www.cs.cmu.edu/~./enron/), which is a collection of plain text files. We extract a subset of emails and select increasing subsets from it as document collections with different numbers of (i.e., keyword/identifier) pairs. The key attributes of these datasets are summarised in Table III.
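The packing of index entries into transactions can be pictured with a short sketch. It is written in Python for illustration (the chaincode itself is in Go), and the function name `batch_entries` and the entry representation are assumptions; only the batch size of 500 comes from the text.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical representation of one encrypted index entry: (label, ciphertext).
IndexEntry = Tuple[bytes, bytes]


def batch_entries(entries: Sequence[IndexEntry],
                  batch_size: int = 500) -> Iterator[List[IndexEntry]]:
    """Split the entry list into consecutive batches, one batch per transaction."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(entries), batch_size):
        yield list(entries[start:start + batch_size])
```

With this layout, a list of `n` entries needs `ceil(n / 500)` transactions, so the number of setup transactions grows linearly with the number of keyword/identifier pairs.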
10.2 Experiments on Simulated Network
We first evaluate and on local simulated networks. We use TestRPC (https://github.com/ethereumjs/testrpc) to construct a simulated Ethereum network, and Fabric v1.3.0 (https://github.com/hyperledger/fabric-releases) for a local Hyperledger network. TestRPC is initialized with the default configuration, which closely resembles the real Ethereum environment except that its block time for mining is set to 0. This allows us to focus on the performance of the SSE part on the smart contract, irrespective of the time-consuming mining process and complex network conditions (e.g., broadcast latency, transaction mining delay) in Ethereum.
Table IV presents an overview of the time costs for each phase over different datasets. In the setup phase, unlike in existing centralized SSE schemes where the data owner side dominates the cost, the time cost on the smart contract is much higher than that on the data owner. This is because storing EDB in is completed with thousands of transactions, each costing 4 seconds on average, while needs only hundreds of transactions. We also observe that has a much higher efficiency in each step than . This again shows that the private blockchain leverages a faster consensus algorithm (e.g., PBFT vs. PoW), so that inevitably enjoys a higher efficiency than although they have the same encrypted index structure.
To show the core algorithm, Fig. 6(a) presents the search time per found document as the number of matching records varies. Due to the high efficiency of , we only evaluate it with the largest dataset DB4. We report average run times over 30 trials. For , the first thing to notice is that a larger result set yields a lower search overhead (on a per-matching-document basis). We attribute this to the constant cost of loading past mined blocks from disk into memory before each search runs. This also explains our second observation: the larger the dataset, the slower the search algorithm. A larger number of mined blocks leads to a longer loading time. For , we not only see that it has a lower time cost than , but also conclude that the number of matching documents has a negligible impact on the search overhead.
Fig. 6(b) shows the update performance for as the number of added files varies. Each added file includes 100 keyword/identifier pairs. We can see that adding about 2,200 files takes less than half an hour. is not presented since this is a high-throughput experiment (e.g., adding hundreds of files), which is not suitable for . The update experiments for over a small dataset are shown in Fig. 8(b).
To evaluate the performance of the multi-user setting, Table V presents the time costs of each algorithm described in Fig. 5. We vary the number of users over a large range to clearly demonstrate the efficiency. For the search process, we only present the additional time cost caused by , without including the time cost of . We can see that the time costs of Setup and RevokeUser increase with the number of users, and that they have similar overheads. This is because revoking users in requires generating new secret keys and performing broadcast encryption again in the same way as in the setup. In contrast, the other operations incur negligible time costs. Compared with the frequently executed search, revoking users can be regarded as a one-time operation. Therefore, we emphasize that our multi-user design is still practical in real-world applications.
10.3 Experiments on Official Test Network
To show the practicability of the decentralized SSE scheme, we deploy to the official Ethereum test network Rinkeby (https://www.rinkeby.io/), which mimics the real production network. Due to the limited balance, we only conduct experiments on the smallest database DB1. Our account and contract addresses in Rinkeby are
To illustrate the impact of mining process on the efficiency, we record the block number of each transaction generated in our setup phase and the corresponding gas usage, as shown in Fig. 7. In summary, it consists of 350 transactions, each of which is mined into one block with block number ranging from to . The average block time for mining is , resulting in to complete the entire setup phase. This again explains why the time cost of setup is dominated by the smart contract, instead of the data owner like in existing centralized SSE schemes. Besides, the average gas usage for a transaction is . Currently 1 gas costs about Ether, at the exchange rate of 89 USD at the time of writing. So each transaction costs about Ether (or USD).
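The arithmetic behind these figures can be sketched as below. The function name and parameters are hypothetical, and the sketch assumes that transactions are mined one per block in sequence, so the setup time is roughly the number of transactions times the average block time. Only the transaction count (350) and the exchange rate (89 USD per Ether) come from the text; any other inputs a reader plugs in are their own assumptions.

```python
from typing import Tuple


def setup_estimate(num_tx: int,
                   avg_block_time_s: float,
                   avg_gas_per_tx: float,
                   ether_per_gas: float,
                   usd_per_ether: float) -> Tuple[float, float, float]:
    """Return (total setup time in seconds, Ether per transaction, USD per transaction).

    Assumes one transaction per block, mined sequentially.
    """
    total_time_s = num_tx * avg_block_time_s
    ether_per_tx = avg_gas_per_tx * ether_per_gas
    usd_per_tx = ether_per_tx * usd_per_ether
    return total_time_s, ether_per_tx, usd_per_tx
```

For instance, `setup_estimate(350, t, g, p, 89)` gives the setup time and per-transaction cost for a measured block time `t`, gas usage `g`, and gas price `p`.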
Fig. 8(a) shows the total time needed to perform a search, given a search token (we neglect the cost of generating a search token since it is a small constant on the order of microseconds). Each point is the mean of 10 executions. It clearly demonstrates the performance bottleneck of decentralized SSE. Specifically, the search time grows with the number of matching documents, but the sharp growth comes from the increase in the number of transactions needed to complete the search step. This indicates that the time cost of mining each transaction dominates the overhead of each search, whereas the search algorithm itself has little impact on the efficiency. Generally speaking, the time cost of the mining process is dynamically adjustable. When the blockchain environment scales to allow a higher gas limit or a faster mining process, our search efficiency increases as well.
A similar situation occurs in Fig. 8(b), which describes how the time costs vary with the number of transactions needed to add or delete a file. By choosing files of different sizes, we obtain updates that complete with different numbers of transactions. It again shows that the mining process of each transaction is the dominant factor in the efficiency.
11 Future Work
11.1 Hardening Security with Trusted Processor
The trusted processor is an emerging security technology that protects private information through a hardware-assisted trusted execution environment. It can protect the integrity and confidentiality of private data from other applications and from privileged system software such as the operating system, hypervisor, and firmware, and has been widely used to provide privacy guarantees for various tasks, such as the Tor network or system log processing . Although and are designed to secure private data, some information leakage still exists, such as the search pattern and the access pattern. In light of this, integrating trusted processors with blockchain is a promising approach to address this issue. Prior works have explored the potential of applying trusted processors to encrypted search [30, 12, 15], but how to support blockchain-based decentralized encrypted search is still a challenging problem.
11.2 Improving Efficiency with Sharding
Sharding is an important technique for improving the efficiency and scalability of blockchain networks. It generally partitions a large blockchain network into separate subsets (i.e., shards), each of which deals with a disjoint set of transactions and runs an intra-shard consensus protocol independently [27, 24]. Building our schemes and atop sharded blockchains would clearly improve efficiency. In addition, a tailor-made search index that caters to the sharded structure of the blockchain is desirable. Parallel execution of search operations among shards can also improve efficiency greatly. However, how to design such a customized encrypted search index remains unclear.
Traditional searchable symmetric encryption relies on a central server to handle search jobs. In this work, we resort to public and private blockchain technologies and construct two decentralized SSE schemes that address a malicious adversary. Unlike existing verifiable SSE schemes, our search results are correct and immutable, and no verification is needed on the data owner side. Our framework can be applied to other SSE schemes with complex queries. Finally, we conduct extensive experiments in both locally simulated and official test networks to demonstrate the practicability of decentralized SSE schemes.
We first restate the security claim for . Theorem: If and are pseudo-random, define , , , then our scheme is -secure against non-adaptive attacks.
Proof Sketch: We describe a polynomial-time simulator such that for any PPT adversary , the outputs of and are computationally indistinguishable.
To prove non-adaptive security, the simulator must be given all the leakages before simulating the view of the adversary, which includes the encrypted database (, and ) and the messages sent by the data owner.
The simulator iterates over the queries and chooses the keys for each search at random, with repetitions specified by the search pattern. It then simulates the initial as follows. For all file ’s associated with each search keyword (i.e., ), computes , and as specified in the real (using and as and ), adds each pair to a list , then adds random pairs to (still maintained in lexicographic order) until it has total elements, and finally creates a dictionary . The simulator outputs the simulated and the simulated transcript for each search query. Note that and step are public system parameters and are deterministically computable from the state information, so they do not need to be simulated.
Next, to simulate add update queries, the simulator needs to simulate the message sent by the client, which contains multiple tuples. The simulator must determine whether each tuple sent is generated at random or should be computed with one of the keys used for a search query transcript. Intuitively, this can be done by leveraging both the add pattern and leakages, which include the to encrypt when the add updates contain a keyword that was previously searched. The simulator can further simulate based on all the messages sent by the client and the delete patterns. Note that the message sent back to the client is deterministically computable from the state information, so it does not need to be simulated.
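The padding step of the simulated index can be pictured with a minimal sketch. The function name `simulate_index`, the 32-byte label and value lengths, and the use of `os.urandom` for the random pairs are all assumptions for illustration; the sketch only shows the "insert revealed pairs, pad with random pairs up to the leaked size, keep lexicographic order" structure described above.

```python
import os
from typing import Dict


def simulate_index(revealed: Dict[bytes, bytes],
                   total_size: int,
                   label_len: int = 32,
                   value_len: int = 32) -> Dict[bytes, bytes]:
    """Build a simulated index of exactly total_size pairs.

    `revealed` holds the pairs computed for searched keywords; the rest are
    uniformly random pairs. The result is ordered lexicographically by label.
    """
    if len(revealed) > total_size:
        raise ValueError("more revealed pairs than the leaked index size")
    pairs = dict(revealed)
    while len(pairs) < total_size:
        # A label collision simply overwrites and the loop draws again.
        pairs[os.urandom(label_len)] = os.urandom(value_len)
    return dict(sorted(pairs.items()))
```

If the real labels and ciphertexts are pseudo-random, an index padded this way should be computationally indistinguishable from the real one, which is the intuition the proof sketch relies on.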
To simulate delete update queries, the simulator needs to simulate the message sent by the client like add. Thus by using the deletion patterns and leakages, the simulator can simulate the corresponding message in a similar way. Finally, the simulator can simulate based on all the messages sent by the client and the add patterns.
In summary, the theorem follows from the pseudo-randomness of and .
-  (1993) Random oracles are practical: a paradigm for designing efficient protocols. In Proc. of ACM CCS, pp. 62–73. Cited by: §3.1.
-  (2005) Collusion resistant broadcast encryption with short ciphertexts and private keys. In Proc. of Crypto, pp. 258–275. Cited by: §6.3.
-  (2016) Verifiable dynamic symmetric searchable encryption: optimality and forward security.. IACR Cryptology ePrint Archive 2016, pp. 62. Cited by: TABLE I, §2, §4.
-  (2016) Σoφoς: Forward secure searchable encryption. In Proc. of ACM CCS, pp. 1143–1154. Cited by: TABLE I, §1, §1, §2, §5, §8.2.
-  (2014) Dynamic searchable encryption in very-large databases: data structures and implementation.. In Proc. of NDSS, Vol. 14, pp. 23–26. Cited by: TABLE I, §1, §2, §4, §5.1, §5, §8.1.
-  (2013) Highly-scalable searchable symmetric encryption with support for boolean queries. In Proc. of CRYPTO, pp. 353–373. Cited by: §1, §2, §9.
-  (1999) Practical byzantine fault tolerance. In Proc. of OSDI, Vol. 99, pp. 173–186. Cited by: §3.4.
-  (2010) Structured encryption and controlled disclosure. In Proc. of ASIACRYPT, pp. 577–594. Cited by: §2, §9.
-  (2011) Searchable symmetric encryption: improved definitions and efficient constructions. Journal of Computer Security 19 (5), pp. 895–934. Cited by: §1, §1, §2, §4, §5.4.
-  (2015) Rich queries on encrypted data: beyond exact matches. In Proc. of ESORICS, pp. 123–145. Cited by: §1, §2, §9.
-  (2015) Malicious-client security in blind seer: a scalable private dbms. In Proc. of IEEE S&P, pp. 395–410. Cited by: §1, §2.
-  (2017) HardIDX: practical and secure index with sgx. In Proc. of DBSEC, pp. 386–408. Cited by: §11.1.
-  (2018) Searching an encrypted cloud meets blockchain: a decentralized, reliable and fair realization. In Proc. of IEEE INFOCOM, pp. 792–800. Cited by: Augmenting Encrypted Search: A Decentralized Service Realization with Enforced Execution, §2.
-  Securing SIFT: privacy-preserving outsourcing computation of feature extractions over encrypted image data. IEEE Transactions on Image Processing 25 (7), pp. 3411–3425. Cited by: §1.
-  (2019) Towards private and scalable cross-media retrieval. IEEE Transactions on Dependable and Secure Computing PP, pp. 1–1, DOI: 10.1109/TDSC.2019.2926968. Cited by: §11.1.
-  Hyperledger. blockchain technologies for business.. Cited by: §1, §3.4.
-  (2013) Outsourced symmetric private information retrieval. In Proc. of ACM CCS, pp. 875–888. Cited by: §1, §1, §5.4, §5.4.
-  (2017) Boolean searchable symmetric encryption with worst-case sub-linear complexity. In Proc. of EUROCRYPT, pp. 94–124. Cited by: §1, §2, §9.
-  (2012) Dynamic searchable symmetric encryption. In Proc. of ACM CCS, pp. 965–976. Cited by: §1, §2, §3.1, §4.
-  (2017) Sgx-log: securing system logs with sgx. In Proc. of ACM AsiaCCS, pp. 19–30. Cited by: §11.1.
-  (2014) Introduction to modern cryptography. CRC press. Cited by: §10.1, §3.1.
-  (2016) Efficient encrypted keyword search for multi-user data sharing. In Proc. of ESORICS, pp. 173–195. Cited by: §1, §5.4, §5.4.
-  (2017) Enhancing security and privacy of tor’s ecosystem by using trusted execution environments. In Proc. of NSDI, pp. 145–161. Cited by: §11.1.
-  (2018) Omniledger: a secure, scale-out, decentralized ledger via sharding. In Proc. of IEEE S&P, pp. 583–598. Cited by: §11.2.
-  (2016) Hawk: the blockchain model of cryptography and privacy-preserving smart contracts. In Proc. of IEEE S&P, pp. 839–858. Cited by: §8.4.
-  (2012) UC-secure searchable symmetric encryption. In Proc. of FC, pp. 285–298. Cited by: §2.
-  (2016) A secure sharding protocol for open blockchains. In Proc. of ACM CCS, pp. 17–30. Cited by: §11.2.
-  (2015) Demystifying incentives in the consensus computer. In Proc. of ACM CCS, pp. 706–719. Cited by: §5.1.
-  (2015) GRECS: graph encryption for approximate shortest distance queries. In Proc. of ACM CCS, pp. 504–517. Cited by: §1, §2, §9.
-  (2018) Oblix: an efficient oblivious search index. In Proc. of IEEE S&P, pp. 279–296. Cited by: §11.1.
-  OMAHA: open medical and healthcare alliance.. Cited by: §6.1.
-  (2014) Blind seer: a scalable private dbms. In Proc. of IEEE S&P, pp. 359–374. Cited by: §1, §2.
-  (2013) Pinocchio: nearly practical verifiable computation. In Proc. of IEEE S&P, pp. 238–252. Cited by: §8.4.
-  (2000) Practical techniques for searches on encrypted data. In Proc. of IEEE S&P, pp. 44–55. Cited by: §1, §2.
-  (2014) Practical dynamic searchable encryption with small leakage.. In Proc. of NDSS, Vol. 71, pp. 72–75. Cited by: TABLE I, §1, §2.
-  The worldcare consortium.. Cited by: §6.1.
-  Privacy-preserving collaborative model learning: the case of word vector training. IEEE Trans. Knowl. Data Eng. 30 (12), pp. 2381–2393. Cited by: §1.
-  (2018) Searchable encryption over feature-rich data. IEEE Transactions on Dependable and Secure Computing 15 (3), pp. 496–510. Cited by: §1.
-  (2014) Ethereum: a secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper 151. Cited by: §1, §3.2, §3.3.
-  (2018) Enabling generic, verifiable, and secure data search in cloud services. IEEE Transactions on Parallel and Distributed Systems 29, pp. 1721–1735. Cited by: §2.