ShieldDB: An Encrypted Document Database with Padding Countermeasures

03/13/2020 ∙ by Viet Vo, et al. ∙ CSIRO, City University of Hong Kong, Monash University

The security of our data stores is underestimated in current practice, which has resulted in many large-scale data breaches. To change the status quo, this paper presents the design of ShieldDB, an encrypted document database. ShieldDB adapts searchable encryption techniques to preserve search functionality over encrypted documents without much impact on scalability. However, merely realising such a theoretical primitive is insufficient against real-world threats, where a knowledgeable adversary can exploit the leakage (i.e., the access pattern to the database) to break the claimed protection on data confidentiality. To address this challenge in practical deployment, ShieldDB is designed with tailored padding countermeasures. Unlike prior works, we target a more realistic adversarial model, where the database is updated continuously and the adversary can monitor it at one or multiple arbitrary time intervals. ShieldDB's padding strategies ensure that the access pattern to the database is obfuscated at all times. Additionally, ShieldDB provides other advanced features, including forward privacy, re-encryption, and flushing, to further improve its security and efficiency. We present a full-fledged implementation of ShieldDB and conduct intensive evaluations on Azure Cloud.


1 Introduction

Data breaches have been happening quite frequently in recent years, affecting millions of individuals. This phenomenon calls for increased control and security for private and sensitive data. To combat such “breach fatigue”, encrypted database systems have recently received wide attention [Poparz11, microsoftEDB, github, PappasVK14, PapadimitriouBCRHSMB16, PoddarBP16, YuanGWWLJ17]. Their objective is to preserve the query functionality of databases over encrypted data; that is, the server can process a client’s encrypted query without decryption. The first generation of encrypted databases [Poparz11, microsoftEDB, github] implements property-preserving encryption (PPE), in which a ciphertext inherits equality and/or order properties of the underlying plaintext. However, inference attacks can compromise these encryption schemes by exploiting the very properties preserved in the ciphertexts [NaveedKW15].

In parallel, dedicated privacy-preserving query schemes have been investigated intensively over the past decade. Among others, searchable symmetric encryption (SSE) [SoWa00, CurtmolaGKO11] is well known for its application to keyword-based search. In general, SSE schemes utilise an encrypted index to enable the server to search over encrypted documents. The server is restricted so that the search operation against the index is triggered only if a query token (keyword ciphertext) is given, and it outputs the matched yet still encrypted documents. This ensures that an adversary with a full image of the encrypted database learns no useful information about the documents. In that sense, SSE outperforms PPE in terms of security. Apart from security, SSE is scalable, because it is realised via basic symmetric primitives like pseudo-random functions and symmetric encryption.

In this paper, we aim to design an encrypted document database system built on SSE. However, deploying SSE in practice is non-trivial. The foremost challenge is how to address recently emerging inference attacks against SSE [IslamKK12, CashGP15, ZhangKP16], which raise doubts about whether SSE achieves an acceptable tradeoff between efficiency and security. As a noteworthy threat, the count attack [CashGP15] demonstrates that an adversary with extra background knowledge of a dataset can analyse the size of the query result set to recover keywords from the query tokens. The above information is known as the access pattern, and can be monitored via the server’s memory access and the communication between the server and client. If SSE is deployed to a database, access patterns can also be derived from database logs [GrubbsRS17]. This further weakens the security claim of SSE, since the adversary does not have to stay online for monitoring.

Using padding (bogus documents) has proven to be a conceptually simple but effective countermeasure to obfuscate the access pattern against the aforementioned attacks [IslamKK12, CashGP15, BostF17]. Unfortunately, existing padding countermeasures only consider a static database, where padding is added only at setup [BostF17, Cash15]. They are not sufficient for real-world applications. In practice, the state of the database changes over time. Specifically, document updates change the access pattern for a given keyword, and new keywords can be introduced at any time. Hence, exploring to what extent adversaries can exploit such changes to compromise data privacy, and how padding countermeasures can be applied in a dynamic environment, is essential to make SSE deployable in practice.

Contributions: To address the above issues, in this paper, we propose and implement an encrypted document database named ShieldDB, in which the data and query security in realistic and dynamic application scenarios is enhanced via effective padding countermeasures. Our contributions can be summarised as follows:

  • ShieldDB is the first encrypted database that supports encrypted keyword search while being equipped with padding countermeasures against inference attacks launched by adversaries with database background knowledge.

  • We define two new attack models, i.e., non-persistent and persistent adversaries, which faithfully reflect different real-world threats against a continuously updated database. Accordingly, we propose padding countermeasures to address these two adversaries, respectively.

  • ShieldDB is designed with a dedicated system architecture to achieve its functionality and security goals. Apart from the client and server modules for encrypted keyword search, a padding service is developed. This service leverages two controllers, i.e., a cache controller and a padding controller, to enable efficient and effective database padding.

  • ShieldDB implements advanced features to further improve security and performance. These features include: 1) forward privacy, which protects newly inserted documents; 2) flushing, which reduces the load of the padding service; and 3) re-encryption, which refreshes the ciphertexts while realising deletion and reducing padding overhead.

  • We present the implementation and optimisation of ShieldDB, and deploy it in Azure Cloud. We build a streaming scenario for evaluation. In particular, we implement an aggressive padding mode (high mode) and a conservative padding mode (low mode), and evaluate them under the padding strategies against non-persistent and persistent adversaries, respectively. We perform a comprehensive set of evaluations on the load of the cache, system throughput, padding overhead, and search time. Our results confirm that the high mode incurs a much larger padding overhead than the low mode, while achieving a lower cache load. In contrast, the low mode achieves higher system throughput (accumulated amount of real data) but requires a significantly higher cache load. The evaluations of flushing and re-encryption demonstrate the reduction of the cache load and padding overhead, respectively.

1.1 Technical Overview

To design effective padding countermeasures for a dynamic database, we identify two new attack models (i.e., non-persistent and persistent adversaries). The non-persistent adversary can monitor a targeted database at one certain (but arbitrary) time interval. Within that interval, the database state remains unchanged. The adversary also has the background knowledge of the database at that state. More advanced than the non-persistent adversary, the persistent (stronger) adversary can monitor the database over multiple arbitrary time intervals, and has the background knowledge of the database at multiple states.

Our first observation for addressing the above adversaries is that bogus and real data need to be inserted in a batched and mixed manner, so that the adversary cannot distinguish the bogus data from the real data. In particular, our system implements a dedicated component, called Padding Service, to perform padding, encryption, and insertion. Incoming documents are indexed as keyword and document id pairs, denoted as entries, and cached by this service. To reduce the padding overhead, keywords with similar frequencies are clustered together, and the above entries are cached at the corresponding clusters. Once a cluster is full, the entries in that cluster are padded so that the access pattern of each keyword therein is identical. After that, all real and bogus entries are encrypted and inserted into the database.

However, the above basic strategy can still fail to defend against the above two adversaries we have identified. Our key observation is that keyword existence is critical information and can be exploited by the adversaries.

For the non-persistent adversary, if a keyword never appears in the database while its entries are padded, the adversary is able to identify the padding for this keyword during her controlled time interval. The reason is that if such a keyword is queried, the server should return an empty set. However, due to the padding, the server would return some results, which are essentially the padding. To handle this issue, Padding Service keeps tracking the state of each keyword. No padding is added for keywords that have never appeared before. When a keyword of a cluster appears for the first time, Padding Service ensures that all keywords of that cluster which have already appeared have the same result length.

For the persistent adversary, as she is able to monitor the database over time and has knowledge of its changes, it is even more difficult to hide the keyword existence information. For example, if a keyword does not exist in the first time interval but appears in the second time interval, then, following the above approach, the server returns an empty result set at the first interval and a non-empty set at the second interval. As a result, this keyword can easily be identified. To address this adversary, we apply a conservative constraint to the first batch of each keyword cluster; that is, all keywords have to appear before insertion. With this treatment, Padding Service can perform padding for all keywords in the cluster regardless of whether a keyword appears in a subsequent time interval or not. The server will not expose unique access patterns at any time interval.

Based on the proposed padding countermeasures, we deploy the SSE scheme proposed in [CashJJ14] (a dictionary-based index) in ShieldDB. Padding Service is designed with two primary modules, i.e., Cache Controller and Padding Controller, which jointly conduct cache management, state tracking, padding, encryption, and batch insertion. To further improve security, efficiency, and functionality, ShieldDB provides three advanced features, i.e., forward privacy, cache flushing, and re-encryption. Forward privacy [Bost16] hides the linkability between previously searched query tokens and newly inserted documents. Cache flushing can empty the “cold” caches of keyword clusters, where padding is rarely triggered. Re-encryption periodically pulls the index entries of different clusters back to Padding Service for re-padding and re-encryption. In that way, redundant bogus entries are removed, and the access pattern is reset. Deletion can also be implemented naturally in this process.

2 Background

In this section, we introduce the background knowledge related to the design of our system.

Symmetric searchable encryption. Consider a client and a server: the client encrypts documents in a way that allows the server to query keywords over the encrypted documents. The functions included in an SSE scheme are setup and search. If the scheme is dynamic, update functions (data addition and deletion) are also supported. Let DB represent a database of documents, where each document is a variable-length set of unique keywords. We use W to denote all keywords occurring in DB, DB(w) to denote the documents that contain a keyword w, and |DB(w)| to denote the number of those documents, i.e., the size of the query result set for w. In SSE, the encrypted database, named EDB, is a collection of encrypted keyword and document id pairs, aka an encrypted searchable index.

In setup, the client encrypts DB using a cryptographic key k, and sends EDB to the server. During search, the client takes k and a query keyword as input, and outputs a query token tk. The server uses tk to query EDB and obtain the pseudo-random identifiers of the matched documents, so as to return the encrypted result documents. In update, the client takes as input k, a document parsed as a set of keyword/id pairs, and an operation op (addition or deletion). For addition, the above pairs are inserted into EDB. For deletion, the server no longer returns the deleted documents in subsequent search queries. As an output, the server returns an updated EDB.
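For orientation, the sketch below is our own simplified rendering of this client/server contract, not ShieldDB's actual interface: the encrypted index is modelled as a token-keyed map, and an HMAC stands in for the PRF that derives query tokens.

```python
import hmac
import hashlib
from typing import Dict, List, Tuple

class Server:
    """Holds EDB, an encrypted index mapping opaque tokens to encrypted document ids."""
    def __init__(self) -> None:
        self.edb: Dict[str, List[str]] = {}

    def search(self, tk: str) -> List[str]:
        # Return the (pseudo-random) identifiers matched by the query token.
        return self.edb.get(tk, [])

    def update_add(self, entries: List[Tuple[str, str]]) -> None:
        # Addition: insert a batch of (token, encrypted id) pairs into EDB.
        for tk, enc_id in entries:
            self.edb.setdefault(tk, []).append(enc_id)

class Client:
    def __init__(self, key: bytes) -> None:
        self.key = key  # symmetric key k generated at setup

    def token(self, keyword: str) -> str:
        # Query token tk: a PRF evaluation over the keyword (HMAC as a stand-in).
        return hmac.new(self.key, keyword.encode(), hashlib.sha256).hexdigest()
```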

The security of SSE can be quantified via a tuple of stateful leakage functions, which define the side information exposed in the setup, search, and update operations, respectively. Detailed definitions are given in Section 6.

Count Attack. Cash et al. [Cash15] propose a practical attack that exploits the leakage in the search operation of SSE. It is assumed that an adversary with full or partial prior knowledge of DB can uncover keywords from query tokens via the access pattern. Specifically, the prior knowledge allows the adversary to learn, before any query, which documents match a given keyword. Based on this, she can construct a keyword co-occurrence matrix indicating the co-existence frequencies of keywords in known documents. As a result, if the result length for a query token is unique and matches the prior knowledge, the adversary directly recovers the underlying keyword. For tokens with the same result length, the co-occurrence matrix can be leveraged to narrow down the candidates. In this work, we extend the threat model of the count attack to the dynamic setting, which will be introduced in Section 3.2.
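As a rough illustration of the result-length part of this attack (the dataset, keyword names, and token labels below are made up, and the co-occurrence refinement is omitted), an adversary holding the plaintext background knowledge can match unique result lengths to keywords:

```python
from collections import defaultdict
from typing import Dict, List, Set

# Adversary's background knowledge: keyword -> documents known to contain it.
known_docs: Dict[str, Set[str]] = {
    "invoice": {"d1", "d2", "d3"},
    "salary":  {"d1", "d4"},
    "merger":  {"d5"},
}

# Result lengths observed at the server for each (opaque) query token.
observed: Dict[str, int] = {"tk_A": 3, "tk_B": 1}

def recover(observed: Dict[str, int], known: Dict[str, Set[str]]) -> Dict[str, str]:
    # Group keywords by their known frequency; a unique length pins down the keyword.
    by_len: Dict[int, List[str]] = defaultdict(list)
    for kw, docs in known.items():
        by_len[len(docs)].append(kw)
    guesses: Dict[str, str] = {}
    for tk, length in observed.items():
        candidates = by_len.get(length, [])
        if len(candidates) == 1:  # unique result length -> direct recovery
            guesses[tk] = candidates[0]
    return guesses

print(recover(observed, known_docs))  # {'tk_A': 'invoice', 'tk_B': 'merger'}
```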

Forward Privacy. Forward privacy in SSE prevents the adversary from exploiting the leakage in update (addition) operations. Given previously collected query tokens, this security notion ensures that these tokens cannot be used to match newly added documents. As our system considers a scenario where documents are continuously inserted, we adapt an efficient forward-private scheme [SongDYXZ17] proposed by Song et al. This scheme follows Bost’s scheme [Bost16] in employing a trapdoor permutation to secure the states associated with newly added entries. Without being given the new states, the server cannot perform search on the new data; once given, those states can be used to recover old states via the trapdoor permutation. We further optimise the adapted scheme in the context of batch insertion to improve efficiency, as introduced in Section 5.3.

3 Overview

3.1 System Architecture

ShieldDB is a document-oriented database, where semi-structured records are modeled and stored as documents, and can be queried via keywords or associated attributes.

Participants and application scenarios: As illustrated in Figure 1, ShieldDB consists of a query client C, a padding service P, and a storage server S. In our targeted scenario, new documents are continuously inserted at S and are required to be encrypted. Meanwhile, C expects S to retain search functionality over the encrypted documents. To enhance security, P applies padding countermeasures during encryption. In this paper, we consider an enterprise that utilises outsourced storage. P is deployed at the enterprise gateway to encrypt and upload the documents created by its employees, while C is deployed for employees to search the encrypted documents at S. Note that the deployment of P is flexible: it can be separated from or co-located with C.

Overview: ShieldDB supports three main operations, i.e., setup, streaming, and search; it also supports deletion and re-encryption, as introduced later. During setup, the App Controller module receives a sample dataset and groups keywords into clusters based on their frequencies. After that, App Controller passes the cluster information to the Cache Controller module to initialise a cache for each keyword cluster. In the meantime, App Controller passes the same information to the Padding Controller module to generate a set of bogus documents (i.e., padding).

Fig. 1: Architecture of ShieldDB

During streaming, P receives a sequence of documents and parses them into a set of keyword and document identifier pairs, i.e., index entries for search. Then Cache Controller stores these pairs in the caches of the corresponding keyword clusters. Based on the targeted attack model, Cache Controller applies certain constraints before flushing a cache. Once the constraints on a cluster are met, Cache Controller notifies Padding Controller for padding. In particular, Padding Controller adds bogus pairs to make the keywords in this cluster have equal frequency. Then it encrypts and inserts all those real and bogus index entries as a data collection, in a batch manner, into EDB. Meanwhile, both real and bogus documents are encrypted and uploaded to EDB.

During search, C receives a query keyword. On the one hand, it retrieves the local results from Cache Controller, since some index entries might not have been sent to EDB yet. On the other hand, C sends a query token generated from this keyword to S to retrieve the rest of the encrypted results. After decryption, C filters out the padding and combines the result set with the local one. For security, C will not generate query tokens against the data collection that is currently being streamed; this constraint restricts S to querying data collections that have already been inserted into EDB. Following the setting of SSE [CurtmolaGKO11, KamaraPR12], search is performed over the encrypted index entries in EDB, and document identifiers are pseudo-random strings. In response to a query, S returns the encrypted documents via the identifiers recovered in the result set after search.

Apart from padding countermeasures, ShieldDB provides several other salient features. First, it realises forward privacy (an advanced notion of SSE) for the streaming operation. Our realisation is customised for efficient batch insertion and prevents S from searching the data collection currently being streamed. Second, ShieldDB integrates the functionality of re-encryption. Within this operation, index entries in a targeted cluster are fetched back to P and the redundant padding is removed. At the same time, deletion can be triggered, where the deleted index entries issued and maintained at P are removed and will not be re-inserted. After that, the real entries, combined with new bogus entries, are re-encrypted and inserted into EDB. Third, Cache Controller can issue a secure flushing operation before the constraints for padding are met. This reduces the overhead of P while preserving the security of padding.

Remark: ShieldDB assigns P for key generation and management, and P issues the key for C to query. In our current implementation, P and C use the same key for index encryption, just as most SSE schemes do. This is practical because the SSE index only stores pseudo-random identifiers of documents, and the documents themselves can be separately encrypted via other encryption algorithms. Note that advanced key management schemes for SSE [SunLSSY16, JareckiJK13] can readily be adapted; yet, this is not our focus.

Fig. 2: Protocols in the strawman approach. The figure lists the Setup, Streaming, and Search procedures. In Setup, the padding service initialises its cache, its keyword-state and counter maps, and the bogus document set, while the server initialises an empty map for EDB. In Streaming, whenever the cache is full, the padding service pads each keyword's matching list up to the maximum size in the cache, encrypts every real and bogus pair, and sends the batch to the server, which inserts the entries into EDB. In Search, the client generates a token for the query keyword and sends it to the server; the server iterates over the matching entries in EDB and returns them, and the client combines the decrypted results with its local cache.

3.2 Attack Models

ShieldDB mainly considers a passive adversary who monitors the server’s memory access and the communication between the server and other participants. Following the assumption of the count attack [CashGP15], the adversary has access to the background knowledge of the dataset and aims to exploit this information with the access pattern in search operations to recover query keywords. In this paper, we extend this attack model to the dynamic (streaming) setting.

Before elaborating on the attack models, we define the streaming model. In our system, streaming performs batch insertion of a collection of encrypted keyword and document identifier pairs. Given a number of continuous streaming operations, encrypted collections are added to a sequence over time. Accordingly, S orders the sequence of data collections by timestamp. We define the gap between any two consecutive timestamps as a time interval, and C is allowed to search at any time interval. Note that at a given time interval, S can only perform search operations against the collections that have been completely inserted into EDB.

In the dynamic setting, we observe two attack models, which we refer to as non-persistent and persistent adversaries, respectively.

  • Non-persistent adversary: This adversary controls S within one single arbitrary time interval. During that interval, she observes the query tokens that C issues to S, and the access patterns returned by S. She knows the accumulated (not separate) knowledge of the document sets inserted from the beginning of the stream up to that interval.

  • Persistent adversary: This adversary controls S across multiple arbitrary time intervals. She persistently observes the query tokens and access patterns at those intervals, and knows the separate knowledge of the document sets inserted in each of them.

For both attack models, S cannot obtain the query tokens against the encrypted data collections streamed in the current time interval. This is enforced by our streaming operation with forward privacy, which will be introduced in Section 5.3.

Real-world implication of the adversaries: We stress that the non-persistent adversary could be any external attacker, e.g., hackers or organised cyber criminals. They might compromise the server within a certain time window. We also assume that this adversary could obtain a snapshot of the database via public channels, e.g., a prior data breach [NaveedKW15]. Because the database changes dynamically, the snapshot might only reflect some historical state of the database. In contrast, the persistent adversary is more powerful and could be a database administrator or an insider of the enterprise. They might have long-term access to the server and could obtain multiple snapshots of the database via internal channels.

Other threats: Apart from the above adversaries, ShieldDB considers another specific rational adversary [ZhangKP16] who can inject documents to compromise query privacy. As mentioned, this threat can be mitigated via forward-private SSE. Note that ShieldDB currently does not address an active adversary who sabotages the search results. This threat can be addressed by verifiable SSE schemes [RaphaelFD16, WangCS18]; they are built on authenticated data structures and cryptographic accumulators, and can naturally be integrated into SSE.

4 Strawman Approach

This section introduces a strawman approach to designing ShieldDB. It serves as a stepping stone to illustrate the data structures and protocols in our system. We then evaluate this approach from the security and performance perspectives to motivate the design intuitions behind our padding countermeasures.

ShieldDB adapts an encrypted map proposed by Cash et al. [CashJJ14] as the underlying data structure. It is compatible with and can directly be deployed on existing key-value stores for a wide range of applications. In setup, App Controller generates private keys for indexing and encryption, and Cache Controller initialises an empty cache set with a fixed capacity. Then Padding Controller initialises an empty set to track the states of keywords, and generates bogus documents for padding. Also, S initialises an empty map as EDB. Given an incoming document, App Controller parses it into keyword/id pairs and caches them. Once the cache is full, Cache Controller pushes all cached items to Padding Controller.

During streaming, Padding Controller introduces bogus pairs to make all keywords in the cache have an equal number of matched documents, i.e., the maximum size of the matching lists among those keywords. Each real or bogus pair is encrypted into a label/value entry derived via a pseudo-random function and cryptographic hash functions keyed by the keyword, together with the state of the keyword, i.e., a per-keyword counter. After padding and encryption, all the encrypted pairs are inserted into EDB in a batch. Note that batch insertion is what enables the padding countermeasures: if documents were separately indexed and inserted, unique access patterns could be created for later searches.

During search, C generates a two-part token from the query keyword. Upon receiving the token, S retrieves the result ids by re-deriving the encryption symmetrically: one part of the token is used to find the matched entries in EDB, while the other part is used for decryption to recover the result ids. Meanwhile, C searches its local cache and combines the local results with the ones from S.

To enable C to differentiate bogus ids from real ids, we split the id space into two disjoint sub-spaces, one for real ids and one for bogus ids, with a fixed bit length for each id. A pseudo-random identifier can then be derived from a pseudo-random permutation applied to the id, which can later be reversed. For ease of presentation, we skip the above procedure and assume it is a system function.
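To make the strawman concrete, here is a hedged sketch, under our own simplifying assumptions (hash choices, key sizes, and the id-space split below are illustrative, not ShieldDB's exact construction), of turning a keyword/id pair into an encrypted map entry and marking bogus identifiers:

```python
import hmac
import hashlib
import os
from typing import Tuple

K_INDEX = os.urandom(32)   # indexing key generated by App Controller (illustrative)
ID_BITS = 32               # illustrative id length; the top bit marks bogus ids

def prf(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()

def encrypt_pair(keyword: str, doc_id: int, state: int) -> Tuple[bytes, bytes]:
    """Encrypt one (keyword, id) pair under the keyword's current counter (state)."""
    kw_key = prf(K_INDEX, keyword)
    label = hashlib.sha256(kw_key + state.to_bytes(4, "big") + b"label").digest()
    pad = hashlib.sha256(kw_key + state.to_bytes(4, "big") + b"value").digest()
    value = bytes(a ^ b for a, b in zip(doc_id.to_bytes(4, "big"), pad[:4]))
    return label, value   # the server stores EDB[label] = value

def bogus_id(seq: int) -> int:
    return (1 << (ID_BITS - 1)) | seq     # bogus ids live in the upper half of the space

def is_bogus(doc_id: int) -> bool:
    return bool(doc_id >> (ID_BITS - 1))  # client-side filter after decryption
```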

Issues: The strawman design maintains only a single cache for batch streaming. Once its capacity is reached, P pushes all pairs from the cache for padding. We note that this approach may introduce a large padding overhead and can even break the effectiveness of padding against the attack models in the dynamic setting.

The keywords in the current batch may be associated with different numbers of matched documents. To avoid unique access patterns, the number of entries for each keyword should be identical after padding. Namely, the size of the matching list of each keyword needs to be padded to the maximum one. However, the sizes of these lists in streaming can vary greatly, thereby incurring a large amount of padding. Regarding security, in the context of streaming, not all keywords in the keyword space appear in every batch. As a result, unique access patterns are very likely to be created if the padding strategy does not consider the change of the database state over time.

5 Design of ShieldDB

In this section, we present the detailed design of ShieldDB. First, we introduce how to manage the keywords and cache during the setup phase. The goal is to facilitate padding and reduce the padding overhead. Second, we introduce padding strategies against two types of adversaries in the streaming context. Third, we implement some advanced features to further improve the security and efficiency.

5.1 Setup

During setup, ShieldDB  invokes Cache Controller to initialise the cache for batch insertion, and Padding Controller to generate bogus documents for padding.

To reduce the overhead, ShieldDB implements cache management in a way that groups keywords with similar frequencies together and performs padding within each individual keyword cluster. This approach is inspired by existing padding countermeasures in the static setting [CashGP15, BostF17]. The idea is intuitive for a static database: the variance between the result lengths of keywords with similar frequencies is small, which minimises the number of bogus entries added to the database. We note that it is also reasonable in the dynamic setting, where keyword frequencies in specific applications can be stable in the long run. If a keyword is popular, it is likely to appear frequently during streaming, and vice versa. Therefore, we assume that there exists a sample dataset whose keyword frequencies are close to the real ones observed during streaming. Such a sample dataset can be provided or collected at the trial stage of the application.

We implement a heuristic algorithm for keyword clustering. Given the keywords of a sample dataset, they are sorted in ascending order of frequency. Initially, Cache Controller partitions them into contiguous groups, where the minimum size of each group is subject to a system parameter. For security, the keyword frequency in each cluster after padding should be the same, i.e., the maximum one in that cluster, and thus Cache Controller computes the padding overhead of a partition as the total number of bogus entries it requires: for each cluster G, the overhead is Σ_{w ∈ G} (f_max(G) − f(w)), where f(w) is the frequency of keyword w and f_max(G) is the maximum frequency in G, and the partition's overhead is the sum over all clusters. The algorithm iteratively evaluates this overhead for every combination of the partition, and selects the clustering whose overhead is the smallest. After that, the controller allocates the capacity of each cluster's cache in proportion to the aggregated keyword frequencies of that cluster, out of the total capacity assigned to the local cache.
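A minimal sketch of this heuristic is shown below; it is purely illustrative (exhaustive search over contiguous partitions, with a made-up minimum cluster size), and the paper's actual algorithm and parameters may differ.

```python
from itertools import combinations
from typing import List, Tuple

def padding_overhead(clusters: List[List[int]]) -> int:
    # Bogus entries needed so every keyword in a cluster reaches the cluster maximum.
    return sum(sum(max(c) - f for f in c) for c in clusters)

def best_partition(freqs: List[int], min_size: int) -> Tuple[List[List[int]], int]:
    """Split the frequency-sorted keywords into contiguous clusters of at least
    `min_size` elements and return the split with the minimum padding overhead."""
    freqs = sorted(freqs)
    n = len(freqs)
    best: Tuple[List[List[int]], int] = ([freqs], padding_overhead([freqs]))
    for k in range(1, n // min_size):                       # number of cut points
        for cuts in combinations(range(min_size, n - min_size + 1), k):
            bounds = list(zip((0,) + cuts, cuts + (n,)))
            if any(b - a < min_size for a, b in bounds):     # enforce minimum size
                continue
            clusters = [freqs[a:b] for a, b in bounds]
            cost = padding_overhead(clusters)
            if cost < best[1]:
                best = (clusters, cost)
    return best

# Example: frequencies of keywords in a sample dataset, minimum cluster size 2.
print(best_partition([1, 2, 3, 10, 11, 12], min_size=2))
# -> ([[1, 2, 3], [10, 11, 12]], 6)
```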

After that, Padding Controller initialises a bogus dataset in which the number of bogus keyword/id pairs generated for each keyword is determined by its frequency, i.e., by the gap between its own frequency and the maximum frequency in its cluster. The rationale follows the same assumption as in cache allocation: if a keyword is less frequent within its cluster, it needs more bogus pairs than keywords with higher frequency to reach the maximum result length after padding, and vice versa. The controller then generates the bogus index pairs. Once the bogus pairs for a certain keyword run out, the controller is invoked again to generate more padding for it in the same way.

Remark: We assume that the distribution of the sample dataset is close to that of the streaming data in the long run. Yet, it is non-trivial to obtain an optimal padding overhead in the dynamic setting due to the variation of streaming documents across time intervals. Nevertheless, if the distribution of the database varies at runtime, the keyword clustering can be re-invoked based on the up-to-date streaming data (e.g., in a sliding window), and the cache can be re-allocated. Besides, our re-encryption operation can further reduce the padding overhead, as introduced later in Section 5.3.

Algorithm 1 (Padding strategies): Given the entries to be streamed, the cache clusters, a map that tracks keyword states, the bogus document set, and the padding mode (high or low), the algorithm first pushes the entries into their clusters. Against the non-persistent adversary, each full cluster is padded via the selected mode, skipping keywords that have not yet occurred. Against the persistent adversary, the first batch of a cluster is padded only once every keyword of the cluster has occurred; subsequent full batches are padded as in the non-persistent case. The output is the set of real and bogus entries to be encrypted and inserted.

Algorithm 2 (Padding modes): Given a cluster to be padded, the keyword state map, the bogus dataset, and the mode, the high mode pads every keyword up to the maximum size of the keyword matching lists in the cluster, whereas the low mode pads up to the minimum size; bogus entries are drawn from the bogus dataset and the keyword states are updated accordingly. The output is the set of real and bogus entries for the cluster.

5.2 Padding Strategies

After setup, documents are continuously collected and parsed into keyword/id pairs, which are cached at their corresponding clusters. Once a cluster is full, the streaming operation is invoked. Padding Controller then applies the padding strategy corresponding to the targeted adversary, and encrypts and inserts all real and bogus pairs into EDB in a batch manner. Next, we present the padding strategies against the non-persistent and persistent adversaries, respectively. The padding function is sketched in Algorithms 1 and 2. Note that Padding Controller also uploads the bogus and real documents in the batch; for simplicity, we do not include this operation in our algorithms and mainly focus on the protection of the access pattern to the encrypted index.

Padding strategy against non-persistent adversary: Recall that this adversary controls S within a certain time interval. From a high-level point of view, an effective padding strategy should ensure that none of the keywords that have occurred in EDB at that interval has a unique result length. There are two challenges to achieving this goal. First, the monitored interval can be arbitrary, so the above guarantee needs to hold at every time interval. Second, not all keywords in the keyword space appear at each time interval, and it is non-trivial to handle this situation while preserving the security of padding.

To address the above challenges, ShieldDB programs Padding Controller to track the states of keywords over the time intervals from the very beginning. Specifically, each keyword state includes two components: a flag that indicates whether the keyword has existed before in the streamed documents, and a counter that records the total number of real and bogus pairs of the keyword in EDB. Based on these states, Padding Controller performs the following actions. If a keyword has not existed yet, the controller will not pad this keyword even if its cluster is full. The reason is that the adversary might also know the keyword existence information: if C queries a keyword which does not exist, S should return an empty set, and otherwise the adversary could identify the token of this keyword from the padded results. Accordingly, padding for a keyword is invoked only once that keyword has appeared for the first time. After that, once all keywords in a cluster have appeared, padding is added so that all keywords in the cluster always have the same result length at any following time interval, regardless of whether a particular keyword appears in that interval or not. The sketch below illustrates this bookkeeping.
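The following sketch is simplified from the description above; the data structures and names are our own. Keywords that have never appeared are skipped, and once a keyword has appeared, every appeared keyword in its cluster is padded to the same accumulated count.

```python
from typing import Dict, List, Tuple

# appeared[w] : whether keyword w has ever occurred in the stream
# count[w]    : total real + bogus entries of w already in (or going to) EDB

def pad_non_persistent(batch: Dict[str, List[str]],
                       cluster: List[str],
                       appeared: Dict[str, bool],
                       count: Dict[str, int]) -> List[Tuple[str, str]]:
    out: List[Tuple[str, str]] = []
    for w, ids in batch.items():
        appeared[w] = True
        count[w] = count.get(w, 0) + len(ids)
        out.extend((w, i) for i in ids)
    # Pad every keyword that has appeared so far up to the cluster-wide maximum.
    seen = [w for w in cluster if appeared.get(w)]
    if not seen:
        return out
    target = max(count.get(w, 0) for w in seen)
    for w in seen:
        need = target - count.get(w, 0)
        out.extend((w, f"bogus-{w}-{count.get(w, 0) + j}") for j in range(need))
        count[w] = target
    return out
```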

Padding strategy against persistent adversary: Recall that this adversary can monitor the database continuously and obtain multiple references of the database across multiple time intervals. Likewise, the padding strategy against the persistent adversary should ensure that no keyword has a unique access pattern in any time interval from the very beginning. However, directly using the strategy against the non-persistent adversary does not address the leakage of keyword existence. Let us demonstrate this issue with an example. Below is a sequence of streaming and search operations across two consecutive time intervals:

t1: streaming({(w1, id1), (w1, id2), (w2, id3), (w2, id4)});

t1: search(w1), search(w2);

t2: streaming({(w1, id5), (w2, id6), (w3, id7), (w3, id8), (w3, id9)});

t2: search(w1), search(w2), search(w3).

Suppose that w1, w2, and w3 are in the same cluster, and that Padding Controller utilises bogus entries to ensure that these keywords have the same search result length after the batch insertion at either t1 or t2. This is effective against the non-persistent adversary, because she can only control S at either t1 or t2. However, the persistent adversary can figure out that w3 is the only new keyword. The reason is that she might know the states of the database at all four points in time above; namely, she knows w3 is the keyword newly inserted at t2 and identifies the query token of w3 at t2.

To address this issue, Padding Controller is programmed to enforce another necessary constraint before invoking padding: all keywords of the cluster have to exist before the first batch is streamed. As a tradeoff, Cache Controller has to hold all the pairs of the cluster, even when the cache is full, if some keywords have yet to appear. In Algorithm 1, the existence of all keywords in the first batch is checked. For subsequent batches of the cluster, the padding constraint follows the same strategy as for the non-persistent adversary. We name this additional constraint the first batch condition; a minimal sketch is given below.
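A minimal way to express the additional first-batch condition (again an illustrative sketch under our own naming, not ShieldDB's exact code) is a gate that holds the very first batch of a cluster until every keyword of that cluster has been observed:

```python
from typing import Dict, List

def first_batch_ready(cluster: List[str],
                      appeared: Dict[str, bool],
                      first_batch_sent: bool) -> bool:
    """For the persistent adversary: the first batch of a cluster may only be
    streamed once every keyword of the cluster has occurred at least once.
    Subsequent batches fall back to the non-persistent strategy."""
    if first_batch_sent:
        return True
    return all(appeared.get(w, False) for w in cluster)
```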

Fig. 3: Protocols in ShieldDB (Setup, Streaming, and Search). In Streaming, an ephemeral key is generated for every batch insertion; the padding service records the result length and the state of each keyword in the current batch, and the result length of the keyword in the previous batch is embedded in the corresponding entry. In Search, the client sends the token together with the latest keyword state, and the server recovers, batch by batch, the current state and result length of the queried keyword to collect the encrypted results, which the client then combines with its local cache.

Padding modes: ShieldDB implements two padding modes, i.e., a high mode and a low mode, described in Algorithm 2. In the high mode, once the constraint on the cache of a cluster is met, the keywords to be padded are brought up to the maximum result length among the keywords in this cluster. Accordingly, the cache can be emptied, since all entries are sent to Padding Controller for streaming. In contrast, in the low mode, the keywords to be padded are brought up to the minimum result length among the keywords in this cluster; therefore, some entries of keywords may remain in the cache. Yet, this mode only introduces the minimum necessary padding for keywords that do not occur in a given time interval. The two modes have their own merits. The high mode consumes a larger amount of padding and more execution time for padding and encryption, but it reduces the load of the cache at P. The low mode introduces relatively less padding overhead but a heavier load at P.
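The two modes differ only in the padding target, as in the sketch below (illustrative; ShieldDB's cache bookkeeping is more involved, and "pad to the smallest non-empty list" is our reading of the low mode): the high mode pads every keyword up to the cluster maximum and empties the cache, whereas the low mode pads only up to the cluster minimum and leaves the surplus real entries cached.

```python
from typing import Dict, List, Tuple

def flush_cluster(cache: Dict[str, List[str]], mode: str) -> List[Tuple[str, str]]:
    """cache maps each keyword of one cluster to its locally cached ids."""
    out: List[Tuple[str, str]] = []
    lengths = [len(ids) for ids in cache.values() if ids]
    if not lengths:
        return out
    target = max(lengths) if mode == "high" else min(lengths)
    for w, ids in cache.items():
        real = ids if mode == "high" else ids[:target]
        out.extend((w, i) for i in real)                                      # real entries leave the cache
        out.extend((w, f"bogus-{w}-{j}") for j in range(target - len(real)))  # bogus fill-up
        cache[w] = [] if mode == "high" else ids[target:]                     # low mode keeps the surplus
    return out
```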

Security guarantees: Our padding countermeasures ensure that no unique access pattern exists for keywords that have occurred in EDB. For the persistent adversary, the padding countermeasure also ensures that keyword occurrence is hidden across multiple time intervals. Note that padding not only protects the result lengths of queries, but also introduces false counts in the keyword co-occurrence matrix, which further increases the effort required by the count attack. Regarding the formal security definition, we follow a notion recently proposed by Bost et al. [BostF17] for SSE schemes with padding countermeasures. This notion captures the background knowledge of the adversary and formalises the security strength of padding: given any sequence of query tokens, it is efficient to find another same-sized sequence of query tokens with identical leakage. We extend this notion so that the above condition holds in the dynamic setting in Section 6.

Remark: Our proposed padding strategies differ from the approach proposed by Bost et al. [BostF17], which merely groups keywords into clusters and pads them to the same result length for a static database. Directly adapting their approach to different batches of incoming documents fails to address persistent or even non-persistent adversaries. The underlying reason is that their approach treats each batch individually, while the states of the database accumulate. Effective padding strategies in the dynamic setting must consider the accumulated states of the database so that the adversaries can be addressed at arbitrary time intervals.

5.3 Other Features

ShieldDB provides several other salient features to enhance its security, efficiency, and functionality.

Forward privacy: First of all, ShieldDB realises the notion of forward privacy [Bost16, SongDYXZ17] to protect newly added documents and mitigate injection attacks [ZhangKP16]. In particular, our system customises an efficient SSE scheme with forward privacy [SongDYXZ17] to our context of batch insertion. This scheme is built on a symmetric-key based trapdoor permutation and is faster than the public-key based solution [Bost16], but the ephemeral key of the permutation needs to be embedded inside the index entry to recover the state of the previous entry. To reduce the computation and storage overhead, we propose to link a master state to the set of entries with the same keyword in a batch; all counters associated with these entries are derived from the master state. The detailed algorithm for encryption and search can be found in Figure 3.
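The snippet below is an illustrative sketch of the "master state" idea only, not the actual trapdoor-permutation construction of [SongDYXZ17]: instead of storing one state per entry, the padding service keeps a single master state per keyword per batch and derives every entry label from it; the server can only address these entries once the client releases the corresponding state at search time. The key name and derivation steps are assumptions for illustration.

```python
import hmac
import hashlib
import os

K_STATE = os.urandom(32)                       # hypothetical state-derivation key held by P

def master_state(keyword: str, batch_no: int) -> bytes:
    # One stored state per keyword per batch, instead of one per entry.
    msg = f"{keyword}|{batch_no}".encode()
    return hmac.new(K_STATE, msg, hashlib.sha256).digest()

def entry_label(st: bytes, i: int) -> bytes:
    # The i-th entry of the keyword in this batch is addressed by a label
    # derived from the master state.
    return hashlib.sha256(st + i.to_bytes(4, "big")).digest()

st = master_state("invoice", batch_no=3)
labels = [entry_label(st, i) for i in range(5)]   # 5 entries share one stored state
```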

Another benefit of our design is that S can be restricted to performing search operations over completed batches only. The batches that are still being transmitted cannot be queried without the latest keyword state from C.

Re-encryption and deletion: ShieldDB also implements a re-encryption operation, which is periodically conducted over a certain keyword cluster. P first fetches from S all the entries of this cluster stored in EDB. After that, P removes all bogus entries and re-performs the padding over this cluster of keywords. All the real and bogus entries are then encrypted under a fresh key and inserted back into EDB. The benefits of re-encryption are two-fold: (1) redundant bogus entries in this cluster are eliminated; and (2) the leakage is reset, which protects the search and access patterns. During re-encryption, ShieldDB can also execute deletion: a list of deleted document ids is maintained at P, and the deleted entries are physically removed from the cluster before padding. A possible shape of this operation is sketched below.
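This sketch reflects our own assumptions about the data structures (in particular, the toy bogus-id convention from the earlier examples); the deployed re-encryption also refreshes keys and keyword states on the server side.

```python
from typing import Dict, List, Set, Tuple

def reencrypt_cluster(fetched: List[Tuple[str, str]],
                      deleted_ids: Set[str]) -> List[Tuple[str, str]]:
    """fetched: decrypted (keyword, id) entries of one cluster pulled back from EDB."""
    # 1. Drop bogus entries and entries whose document ids were deleted at P.
    real: Dict[str, List[str]] = {}
    for w, doc_id in fetched:
        if doc_id.startswith("bogus") or doc_id in deleted_ids:
            continue
        real.setdefault(w, []).append(doc_id)
    # 2. Re-pad the remaining real entries to a common result length.
    target = max((len(ids) for ids in real.values()), default=0)
    out: List[Tuple[str, str]] = []
    for w, ids in real.items():
        out.extend((w, i) for i in ids)
        out.extend((w, f"bogus-{w}-{j}") for j in range(target - len(ids)))
    # 3. The caller re-encrypts `out` under a fresh key and re-inserts it into EDB.
    return out
```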

Cache flushing: In the streaming documents, the keywords of some clusters might not show up frequently. Even if the cache capacity of such clusters is set relatively small, the padding constraint might still not be triggered very often. To reduce the load of the cache at P and improve the streaming throughput, ShieldDB provides an operation called flushing to deal with such “cold” clusters. In particular, Cache Controller monitors all the cluster caches and sets a time limit to trigger flushing. If a cluster is not full after this time limit, all entries in the cluster are sent to Padding Controller. Note that the padding strategies still need to be followed for security, and the high mode of padding is applied to empty the cache. A simple trigger is sketched below.
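Flushing can be as simple as a per-cluster timer; the time limit and clock source below are illustrative configuration choices, not values from the paper.

```python
import time
from typing import Dict, List

FLUSH_LIMIT_SECS = 60.0   # hypothetical time limit for a "cold" cluster

def should_stream(cache: Dict[str, List[str]],
                  last_insert_ts: float,
                  capacity: int) -> bool:
    """Return True if the cluster should be streamed now: either its cache is
    full, or it has been idle past the time limit. A flush still goes through
    the padding strategy, in high mode, so that the cache can be emptied."""
    cached = sum(len(ids) for ids in cache.values())
    return cached >= capacity or (time.time() - last_insert_ts) >= FLUSH_LIMIT_SECS
```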

6 Security of ShieldDB

ShieldDB implements a dynamic searchable encryption scheme consisting of three protocols between a padding service P, a storage server S, and a querying client C. A database DB at a given time interval is defined as a collection of keyword and document id pairs.

Setup is a protocol that takes as input a database DB and outputs a tuple consisting of the secret keys used to encrypt keywords and document ids, a set of cache clusters, a map that maintains keyword states, a bogus dataset to be used for padding, and EDB, the encrypted database at the initial time interval.

Streaming is a protocol between P, whose inputs are the cache cluster to be updated, the keyword states, the bogus dataset, and the set of keyword and document id pairs to be streamed, and S, whose inputs are the encrypted database at the current time interval and the set of encrypted keyword and document identifier pairs for batch insertion. Once P uploads the batch to S and EDB gets updated, the cache cluster is reset and the database moves to the next time interval.

Search is a protocol between C, whose inputs are the keys, a query for the matching documents of a single keyword, and the keyword state, and S, whose input is EDB. Meanwhile, C also queries P to retrieve the cached documents of the query keyword.

The security of ShieldDB can be quantified via a leakage function that defines the information exposed in Setup, Streaming, and Search, respectively. ShieldDB does not reveal any information beyond what can be inferred from these three leakage components.

In Setup, the leakage is the size of EDB, i.e., the number of encrypted keyword and document id pairs.

In Streaming, ShieldDB is forward private, as presented in the Streaming protocol in Figure 3. Hence, the streaming leakage can be written as a stateless function applied to the batch of keyword and id pairs that depends only on the number of pairs to be added to EDB. ShieldDB does not leak any information about the updated keywords; in particular, S cannot learn that the newly inserted documents match a keyword that has been previously queried.

In Search, the leakage consists of the common leakage functions [CurtmolaGKO06]: the access pattern ap and the search pattern sp, described as follows.

The ap reveals the encrypted matching document identifiers associated with search tokens. For instance, if an adversary controls S, she can monitor the list of search queries in time order; the access pattern of a query keyword is then the list of encrypted keyword and document id entries associated with that keyword in EDB.

The sp leaks the repetition of the search tokens sent by C to S, and hence the repetition of the queried keywords behind those tokens.

Next, we detail the leakage during the interaction between C and S over EDB within a given time interval. We call an instantiation of this interaction a history. We note that the states of keywords do not change during these queries. The leakage of a history comprises, for each query, the number of matching documents of the keyword mapped to that query and the access pattern the query induces, together with a symmetric binary matrix capturing the pairwise co-occurrence of the queried keywords, whose element at row i and column j is 1 if the results of the i-th and j-th queries share a common document, and 0 otherwise.

Constrained security: Here, we define constraints to formalise that a history conforms to the information known by the non-persistent adversary at the time she launches an attack. This follows the constraint definition proposed by Bost et al. [BostF17]. In detail, we want to capture that the database and the list of queries are known by the adversary. We use a predicate over histories: a history satisfies the constraint if the predicate evaluates to true.

Definition 1.

A constraint set C over a database DB and a query list is a sequence of algorithms such that the first algorithm takes DB and outputs a boolean value together with an initial state, and each subsequent algorithm takes the next query and the previous state and outputs a boolean value together with an updated state. The constraint is consistent if it remains false once it has evaluated to false.

For a history H, we write C(H) for the evaluation of the constraint set over H. If it is true, we say that H satisfies C. A constraint set C is valid if there exist two different efficiently constructable histories satisfying C.

The validity of the constraint allows us to formalise the fact that the adversary knows DB throughout the queries within the monitored time interval. Hence, the existence of multiple histories satisfying C is essential for acceptable security. We formalise this as follows.

Definition 2.

A constraint set C is acceptable for a given leakage function if, for every efficiently computable history satisfying C, there exists another, different, efficiently computable history that also satisfies C and induces exactly the same leakage.

Consider the keyword spaces of the two databases involved. Definition 2 requires that the two databases have the identical keyword space, and that each keyword in that space has the same frequency (i.e., the number of distinct documents that contain it) in both. The identical keyword set is known by the adversary. If the keyword sets were not the same, querying a keyword that exists in only one of the databases would cause a difference in the access patterns, and hence the leakage of the two histories would not be identical. The frequency condition ensures that the adversary receives the same number of matching documents when she executes the same queries over either database.

6.1 Security against Non-persistent Adversary

In this section, we use the constraint set to define the security game for the non-persistent adversary against a scheme as follows. The adversary selects two different databases and gives them to a challenger. The challenger randomly picks one of the databases and runs the setup protocol over it. Then, the adversary sends queries to the challenger and receives the search results. The scheme is secure if the adversary cannot correctly guess the picked database and the query keywords with an advantage that is non-negligible in the security parameter. This security game is formalised in Definition 3.

Definition 3.

Let (Setup, Search) be the SSE scheme of ShieldDB, and consider a non-persistent adversary, a leakage function, and a set of acceptable constraints C for that leakage. The indistinguishability game Ind proceeds as follows: the adversary submits two candidate databases (histories) satisfying C; the challenger samples a random bit b, runs Setup over the database selected by b, and answers each query of the adversary with the corresponding transcript; the game returns 1 if the adversary’s guess b′ equals b, and 0 otherwise.

We say that the scheme is constrained-adaptively-indistinguishable with respect to the leakage and C if, for every probabilistic polynomial-time adversary, the probability that the game returns 1 exceeds 1/2 by at most a negligible function of the security parameter.

We underline again that the constraint C can be seen as the information the adversary knows about the histories, including the keyword space and the frequencies of the keywords in that space. In addition, we stress that the states of the keywords in that space remain unchanged over the queries. We can then prove the following theorem by analysing the transcripts of the two histories.

Theorem 1.

Let (Setup, Search) be our SSE scheme, and C a set of knowledge constraints. If the scheme is adaptively secure with respect to its leakage function, and C is acceptable for that leakage, then the scheme is constrained-adaptively-indistinguishable with respect to the leakage and C.

Proof.

In Definition 3, the adversary receives a transcript for each query she sends to the challenger. Hence, to prove the indistinguishability of the two candidate histories, we examine the keyword of each query in turn. Recall that the constraint implies the two databases have the identical keyword space, so the existence of the queried keyword is the same in both databases.

If the keyword does not exist, both transcripts are empty and therefore indistinguishable.

If the keyword exists, the frequency condition guarantees that the two queries return the same number of matching documents, so the corresponding transcripts are likewise indistinguishable.

The same analysis of keyword existence applies to the keywords of all subsequent queries in the sequence. Eventually, the two histories are indistinguishable in the adversary’s view. ∎

6.2 Security against Persistent Adversary

In this section, we also use the constraint set C to define the security game for the persistent adversary. Recall that a persistent adversary monitors the communication between the padding service, the query client, and the server over time. Hence, she observes the leakage caused by both streaming and search queries. A query can thus be written with an operation label op, where op is either u (streaming) or q (search); a u query streams a keyword and document id pair. Although our system performs batch insertion, the adversary can still see that every encrypted entry is inserted in order.

The security of a scheme against the persistent adversary is defined as follows. The adversary selects two different databases and gives them to a challenger. The challenger randomly picks one of the databases and runs the setup protocol over it. After that, the adversary sends queries to the challenger to update EDB and to receive search results. The scheme is secure if the adversary cannot correctly guess the picked database and the query keywords with an advantage that is non-negligible in the security parameter. This security game is formalised in Definition 4.

Definition 4.

Let (Setup, Streaming, Search) be the SSE scheme of ShieldDB, and consider a persistent adversary, a leakage function, and a set of acceptable constraints C for that leakage. Let u be a streaming query, q be a search query, and op be either u or q. Let Ind be the following game:

Ind game: the adversary submits two candidate databases (histories) satisfying C; the challenger samples a random bit b and runs Setup over the database selected by b; the adversary then adaptively issues queries op (streaming or search), and the challenger answers each with the corresponding transcript; the game returns 1 if the adversary’s guess b′ equals b, and 0 otherwise,

where each transcript corresponds to one query op issued by the adversary, with the restriction that, for all pairs of queries,