Data Inference from Encrypted Databases: A Multi-dimensional Order-Preserving Matching Approach

01/23/2020 ∙ by Yanjun Pan, et al. ∙ University of Cincinnati, Huaqiao University, The University of Arizona, Stony Brook University

Due to increasing concerns of data privacy, databases are being encrypted before they are stored on an untrusted server. To enable search operations on the encrypted data, searchable encryption techniques have been proposed. Representative schemes use order-preserving encryption (OPE) for supporting efficient Boolean queries on encrypted databases. Yet, recent works showed the possibility of inferring plaintext data from OPE-encrypted databases, merely using the order-preserving constraints, or combined with an auxiliary plaintext dataset with similar frequency distribution. So far, the effectiveness of such attacks is limited to single-dimensional dense data (most values from the domain are encrypted), but it remains challenging to achieve it on high-dimensional datasets (e.g., spatial data) which are often sparse in nature. In this paper, for the first time, we study data inference attacks on multi-dimensional encrypted databases (with 2-D as a special case). We formulate it as a 2-D order-preserving matching problem and explore both unweighted and weighted cases, where the former maximizes the number of points matched using only order information and the latter further considers points with similar frequencies. We prove that the problem is NP-hard, and then propose a greedy algorithm, along with a polynomial-time algorithm with approximation guarantees. Experimental results on synthetic and real-world datasets show that the data recovery rate is significantly enhanced compared with the previous 1-D matching algorithm.




I Introduction

Data outsourcing has become popular in recent years. Small businesses and individual users delegate their data storage to public cloud servers (such as Amazon EC2 or Google Cloud) to save operational costs. Meanwhile, data breaches happen at an increasing rate, compromising users’ privacy. For instance, the Yahoo! data breaches reported in 2016 affected 3 billion user accounts [yahoo]. This is exacerbated by recent scandals of data misuse (such as the Facebook-Cambridge Analytica case [facebook]), which increase the level of distrust among users. To address this issue, end-to-end encryption is commonly adopted to encrypt the data before it is uploaded and stored on an untrusted server. To enable efficient use of encrypted data (such as answering queries), many cryptographic techniques called searchable encryption (SE) [song2000practical, bellare2007deterministic, curtmola2011searchable] have been proposed. The main challenge for SE is to simultaneously provide flexible search functionality, high security assurance, and efficiency. Among existing SE schemes, Order-Preserving Encryption (OPE) [boldyreva2009order, boldyreva2011order, popa2013ideal, kerschbaum2014optimal] has gained wide attention in the literature due to its high efficiency and functionality. In particular, OPE uses symmetric-key cryptography and preserves the numeric order of plaintexts after encryption, which supports most Boolean queries such as range queries. Well-known systems for encrypted database search using OPE include CryptDB [popa2011cryptdb], Google Encrypted Bigquery Client [Bigquery], and Microsoft Always Encrypted Database [Microsoft].

Many early OPE schemes, unfortunately, were shown to leak more information than necessary (i.e., more than the order between plaintexts). Therefore, schemes that satisfy ideal security guarantees (that only the order is leaked) have been proposed [popa2013ideal, kerschbaum2014optimal]. However, recent research [NKW15, GSBNR17] showed that it is possible to infer/recover a significant portion of plaintexts from their OPE ciphertexts, using the ciphertext order relationships together with an auxiliary dataset whose data frequencies are similar to those of the target dataset. For example, Naveed et al. [NKW15] attacked an encrypted medical database in which the users’ age column is encrypted using OPE. Later, the attack was improved by Grubbs et al. [GSBNR17], who added a non-crossing restriction to the matching algorithm.

We note that, to date, all the successful inference attacks against OPE are limited to one-dimensional data [NKW15, GSBNR17]. That is, even though a database may have multiple numeric columns/dimensions, each encrypted by OPE, each column is treated separately when it is matched with plaintext values. This works well for dense data, i.e., where most of the values from the whole data domain have corresponding ciphertexts present in the database, such as age [GSBNR17]. Intuitively, the denser the data is, the more effective the attack is, because more constraints imposed by the ciphertext order reduce the uncertainty of the corresponding plaintext values. However, for multi-dimensional databases, applying such 1-D matching algorithms to each dimension separately can yield results far from optimal, since it neglects that, for each pair of data tuples, the order-preserving constraints on all the dimensions must hold jointly. Ignoring this leads to a much larger search space than the actual one and therefore to more ambiguity in matching. In addition, for higher-dimensional data (such as spatial/location data), the data tuples tend to be increasingly sparsely distributed in the domain, which invalidates the one-dimensional matching approach (unless the ciphertext and known plaintext datasets are highly similar to each other). Therefore, we ask whether it is still feasible to recover OPE-encrypted data tuples for multi-dimensional, sparse databases. This turns out to be a very challenging problem.

In this paper, we study data inference attacks against multi-dimensional encrypted databases by jointly considering all the dimensions and leveraging only the ciphertext tuples’ order and frequency information, with the help of an auxiliary plaintext dataset with similar frequencies (the same assumption is adopted by many previous works). We formulate the order-preserving matching problem first in 2-D and later extend it to 3-D and higher dimensions. In the unweighted case, given an OPE-encrypted database and an auxiliary plaintext dataset, each containing a set of points in 2-D, we maximize the number of points in a matching from the ciphertext to the plaintext, where the order-preserving property must be satisfied in both dimensions simultaneously. Such a matching is called a non-conflicting matching: the x- or y-projection of one edge in the matching cannot contain the corresponding projection of another edge in the matching. In general, we also consider point frequency (the number of records with the same value): points matched with a smaller frequency difference are given higher weights, and we maximize the total weight of the matching.

We show that our problem can also be formulated as an integer linear program (ILP), and prove its NP-hardness by a reduction from the sub-permutation pattern matching problem. Then we propose a greedy algorithm, along with a polynomial-time approximation algorithm with a provable approximation factor. This algorithm exploits the geometric structure of the problem and is based on the idea of finding jointly heaviest monotone sequences (i.e., sequences of points in either increasing or decreasing order on each dimension) inside the auxiliary and target datasets. The main contributions of this paper are summarized as follows:

(1) To the best of our knowledge, we are the first to study data inference attacks against multi-dimensional OPE-encrypted databases by considering all the dimensions jointly. We formulate a 2-D order-preserving matching problem and show its NP-hardness.

(2) We design two 2-D order-preserving matching algorithms: a greedy algorithm and a polynomial-time algorithm with approximation guarantees. We consider both unweighted and weighted cases, with different weight functions. We further explore efficiency enhancements using tree-based data structures. We also discuss extensions to higher dimensions. These algorithms are of independent interest beyond the applications in this paper.

(3) We evaluate the efficiency and data recovery rate of our algorithms over both synthetic and real-world datasets for different application scenarios, including location-based services, census data, and medical data. Our results suggest that when the ciphertext dataset is highly similar to a subset of the plaintext dataset, the greedy min-conflict algorithm performs the best; in general, when the two datasets have arbitrary intersections and are less similar, our monotone matching algorithm performs better. Overall, the recovery rate of our 2-D algorithms is significantly higher than that of single-dimensional matching algorithms when the data is sparse in each dimension.

II Background and Related Work

II-A Order-Preserving Encryption

Order-Preserving Encryption (OPE) [popa2013ideal] is a special type of encryption in which the order of ciphertexts is consistent with the order of plaintexts. For instance, as in previous studies [popa2013ideal, GSBNR17], take any two plaintexts and their OPE ciphertexts: if the first plaintext is smaller than the second, then the ciphertext of the first is smaller than the ciphertext of the second. With this property, comparison and sorting can be performed on encrypted data directly, without access to the plaintext. Some OPEs are probabilistic and reveal only the order of data items [kerschbaum2014optimal], but they increase the ciphertext size or require client-side storage, and they scale poorly on sparse data. The most efficient OPEs are deterministic and thus also reveal the frequency of data items [popa2013ideal]. In this paper, we focus on inference attacks on deterministic OPEs.
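To make the two leakages concrete, the following toy sketch builds a deterministic order-preserving map over a small integer domain by assigning a sorted random sample of ciphertext values to the plaintexts. It is not one of the cited schemes and offers no security; the function name `toy_ope_table`, the domain sizes, and the sample `ages` column are all hypothetical.

```python
import random
from collections import Counter

def toy_ope_table(domain_size, cipher_space, seed=0):
    """Map plaintexts 0..domain_size-1 to a strictly increasing random sequence.

    Toy construction for illustration only (an assumption, not a real OPE scheme).
    """
    rng = random.Random(seed)
    ciphertexts = sorted(rng.sample(range(cipher_space), domain_size))
    return dict(enumerate(ciphertexts))

table = toy_ope_table(domain_size=100, cipher_space=10**6)
ages = [23, 41, 23, 67, 41, 23]            # hypothetical plaintext column
encrypted = [table[a] for a in ages]

# Order leakage: comparing ciphertexts gives the same answer as comparing plaintexts.
order_preserved = all((a < b) == (table[a] < table[b]) for a in ages for b in ages)

# Frequency leakage: a deterministic map repeats ciphertexts exactly as often
# as the underlying plaintexts repeat.
plain_hist = sorted(Counter(ages).values())
cipher_hist = sorted(Counter(encrypted).values())
```

The two histograms coincide, which is exactly the frequency information a deterministic scheme exposes to the server.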

II-B Inference Attacks on OPE via 1-D Matching

While the security of OPEs has been proved formally under Ordered Chosen-Plaintext Attacks [popa2013ideal], several studies propose inference attacks to evaluate the privacy leakage of OPE ciphertexts. For instance, Naveed et al. [NKW15] proposed an inference attack, named the cumulative attack, on 1-D OPE by leveraging frequency leakage only. The authors solve the resulting matching problem by running the Hungarian algorithm. Grubbs et al. [GSBNR17] designed leakage abuse attacks on 1-D OPE ciphertexts. The authors utilize both frequency and order leakage, and formulate the attack as a dynamic programming problem [GSBNR17]. This leakage abuse attack runs faster than the cumulative attack and achieves a higher recovery rate. We briefly describe this leakage abuse attack below.

Given an OPE-encrypted dataset and an unencrypted dataset similar to it, an attacker tries to infer the plaintexts of the encrypted dataset without decrypting the OPE ciphertexts, by leveraging the plaintexts of the unencrypted dataset as well as the order and frequency information of both datasets. Without loss of generality, the attack assumes that both datasets are sorted in increasing order. Consider the Cumulative Distribution Function (CDF) of the OPE ciphertexts of the encrypted dataset and the CDF of the plaintexts of the unencrypted dataset. Now, construct a bipartite graph whose two vertex sets are the ciphertexts and the plaintexts, in which the weight of an edge between a ciphertext vertex and a plaintext vertex is defined from these frequency statistics and a pre-defined parameter, which can be any integer greater than 1.

The attacker finds a max-weight bipartite matching in this graph that is (one-dimensional) order-preserving (i.e., a vertex early in the ciphertext order is mapped to a vertex early in the plaintext order). Intuitively, suppose we plot the ciphertexts and the plaintexts on two parallel lines in their order. If we draw the edges in the matching, these edges cannot cross. That is, if a ciphertext and a plaintext are matched, no later ciphertext can be matched with an earlier plaintext. Therefore, such a matching is also called a non-crossing matching. The max-weight non-crossing matching can be found in polynomial time via dynamic programming. If a ciphertext vertex is matched with a plaintext vertex, the attacker infers that plaintext as the plaintext of the OPE ciphertext.
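The dynamic program behind a max-weight non-crossing matching can be sketched as follows. This is a minimal illustration, not the implementation of [GSBNR17]: the edge weights are passed in as a plain matrix (how they are derived from frequencies is left open here), and the function name `max_weight_noncrossing` is hypothetical. The sketch fills a table with one cell per (ciphertext, plaintext) pair, so its running time is proportional to the product of the two set sizes.

```python
def max_weight_noncrossing(weights):
    """Max-weight non-crossing matching between two sorted sequences.

    weights[i][j] is the (assumed, caller-supplied) weight of matching the
    i-th ciphertext with the j-th plaintext, both in increasing order.
    Returns (total weight, list of matched index pairs).
    """
    n = len(weights)
    m = len(weights[0]) if n else 0
    # best[i][j]: best total weight using the first i ciphertexts and first j plaintexts
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                              # skip ciphertext i
                best[i][j - 1],                              # skip plaintext j
                best[i - 1][j - 1] + weights[i - 1][j - 1],  # match i with j
            )
    # Walk back through the table to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
    pairs.reverse()
    return best[n][m], pairs

total, pairs = max_weight_noncrossing([[3, 1, 0],
                                       [1, 2, 5]])
```

Because a match always advances both indices, the recovered pairs are increasing in both coordinates, which is the non-crossing property.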

II-C Other Attacks on Encrypted Databases

In addition to cumulative attacks and leakage abuse attacks, other attacks have also been proposed against OPE. Durak et al. [DDC16] proposed sort attacks on 2-D data encrypted by OPE. This attack performs a non-crossing matching on each dimension separately, and then improves the recovery results by evaluating inter-column correlation. Bindschaedler et al. [BGCRS18] proposed an inference attack against property-preserving encryption on multi-dimensional data. This attack operates column by column. Specifically, it first recovers the column encrypted with the weakest encryption primitive, and then infers the next column, encrypted by a stronger primitive, by considering correlation. The attack is formulated as a Bayesian inference problem. It also leverages record linkage and machine learning to infer columns that are strongly encrypted. In comparison, our proposed matching algorithms aim to optimally recover data tuples containing two or more dimensions as a whole. We utilize the order and frequencies of the 2-D tuples, instead of the single-dimension order and frequency used in previous works. In addition, we do not need explicit prior knowledge about the data correlations across dimensions within an encrypted dataset.

Finally, reconstruction attacks [KKNO16, LMP18] recover plaintexts from any searchable encryption scheme that supports range queries. Unlike inference attacks, a reconstruction attack does not require a similar dataset as a reference; instead, it recovers data based on access pattern leakage from a large number of range queries. However, reconstruction attacks often assume that range queries are uniformly distributed, with the exception of an approach based on statistical learning theory. These works are orthogonal to this work.

In this paper, we design two 2-D order-preserving matching algorithms that jointly consider the data ordering in both dimensions. We also extend the 1-D matching algorithm in [GSBNR17] to 2-D data for comparison. It turns out that all the algorithms have advantages and limitations, as we describe in the evaluation and conclusion sections.

III Models and Objectives

System Model. In the system model, there are two entities, a client and a server. We assume that a client has a dataset (e.g., a location dataset) and needs to store it on the server. Due to privacy concerns, this client will encrypt the dataset before outsourcing it to the server.

We assume that the client encrypts the data using deterministic OPE, such that the server is able to perform search operations (e.g., range queries) over encrypted data without decryption. We assume that each dimension of the data is encrypted separately with OPE, such that search can be enabled for each dimension. We refer to the client’s plaintext dataset as the target dataset and to its encrypted version as the ciphertext dataset.

Threat Model. We assume that the server is an honest-but-curious attacker, who is interested in revealing the client’s data but does not maliciously add, modify, or remove the client’s data. In addition, we assume that the server possesses an auxiliary dataset (in plaintext) that is similar to the client’s dataset, and that the two datasets share a significant number of data points. The points that appear in both datasets have similar frequency distributions. For example, the client’s dataset can be location data from Uber users, and the auxiliary dataset can be a USGS spatial database (the client’s dataset can be considered to be randomly sampled from the auxiliary dataset). Alternatively, the two datasets can be location check-in datasets from two different social networking apps with partially overlapping locations.

Objectives. The attacker’s goal is to perform inference attacks that recover as much of the plaintext of the encrypted database as possible without decryption, using only the ciphertext dataset and the auxiliary plaintext dataset, together with the ciphertext/plaintext order, either with or without the frequencies of points in both datasets. The attacker aims at recovering the database points exactly. We define the recovery rate as the primary metric to measure the privacy leakage of the inference attack.

Recovery rate: If an attacker infers $n$ points, of which $k$ are correct inferences (the same as their true plaintext points), then the recovery rate is $k/n$. In addition, we consider both the unweighted version of this metric, where each unique point/location is counted once, and the weighted version, where the frequency is considered as well (the number of ‘copies’ of the same point, e.g., the number of customers in a restaurant). The former can be regarded as “point-level” and the latter as “record-level”. Intuitively, to maximize the weighted recovery rate, the points with larger frequencies should be correctly matched with high priority.
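The point-level and record-level variants of the metric can be computed as in the sketch below. It assumes that the rate is the number of correct inferences divided by the number of inferred points, and that the record-level version weights each ciphertext point by its frequency; the function name `recovery_rates` and the dictionary layout are hypothetical.

```python
def recovery_rates(inferred, truth, freq):
    """Return (point-level, record-level) recovery rates.

    inferred: dict mapping each ciphertext point to the guessed plaintext point
    truth:    dict mapping each ciphertext point to its true plaintext point
    freq:     dict mapping each ciphertext point to its number of records
    Weighting by ciphertext frequency is an assumption of this sketch.
    """
    correct = [c for c, guess in inferred.items() if truth.get(c) == guess]
    point_level = len(correct) / len(inferred) if inferred else 0.0
    total_records = sum(freq[c] for c in inferred)
    record_level = (sum(freq[c] for c in correct) / total_records
                    if total_records else 0.0)
    return point_level, record_level

inferred = {"c1": (3, 4), "c2": (5, 6), "c3": (7, 8)}
truth = {"c1": (3, 4), "c2": (9, 9), "c3": (7, 8)}
freq = {"c1": 5, "c2": 1, "c3": 4}
point_level, record_level = recovery_rates(inferred, truth, freq)
```

In this example two of three points are recovered, but they carry nine of the ten records, so the record-level rate is higher than the point-level rate.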

IV 2-D Order-Preserving Matching

We formulate an order-preserving matching problem in two dimensions. Let $P$ and $Q$ be two finite sets of points in the plane. If a point $p \in P$ is matched to a point $q \in Q$, we denote the pair as an edge $(p, q)$. We say that a matching $M$ between $P$ and $Q$ is order preserving if there exist two monotone functions $f$ and $g$ such that if $(p, q) \in M$ (for $p = (p_x, p_y) \in P$ and $q = (q_x, q_y) \in Q$), then $q_x = f(p_x)$ and $q_y = g(p_y)$.

It is convenient to consider an alternative, equivalent way to define order preserving, in terms of “conflicts”. We say that two edges are in x-conflict with each other if the x-projection (interval) of one edge contains the x-projection (interval) of the other edge; the notion of being in y-conflict is defined similarly. We say that a matching is a non-conflicting matching of the two point sets if it does not contain any x-conflicting or y-conflicting pair of edges. From the definitions, it is easy to see that a matching is order preserving if and only if it is a non-conflicting matching.
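A conflict test can be sketched directly from the order-preserving reading of the definition: two edges conflict in a coordinate when the relative order of their two ciphertext endpoints differs from the relative order of their two plaintext endpoints. Treating ties this way (equal on one side but not the other counts as a conflict) is an assumption that fits deterministic OPE; the names `conflict_axes` and `is_non_conflicting` are hypothetical.

```python
from itertools import combinations

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def conflict_axes(edge_a, edge_b):
    """List the coordinates ('x', 'y') in which two edges conflict.

    Each edge is a pair (p, q) of 2-D points: p from the ciphertext set,
    q from the plaintext set.
    """
    (p, q), (p2, q2) = edge_a, edge_b
    axes = []
    for axis, name in enumerate("xy"):
        if sign(p[axis] - p2[axis]) != sign(q[axis] - q2[axis]):
            axes.append(name)
    return axes

def is_non_conflicting(matching):
    """True if no pair of edges in the matching conflicts in any coordinate."""
    return all(not conflict_axes(a, b) for a, b in combinations(matching, 2))

edge_a = ((1, 5), (10, 50))
edge_b = ((2, 3), (20, 60))
axes = conflict_axes(edge_a, edge_b)  # x order agrees, y order is reversed
```

Here the two edges agree on the x order but disagree on the y order, so they cannot both appear in an order-preserving matching.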

Fig. 1: In this matching between the two point sets, one edge is in conflict with a second edge in one coordinate and with a third edge in the other coordinate.

We say that a point dominates , and write , if either (i) , or (ii) . With this notation, two pairs , with but are in conflict.

IV-A Unweighted vs. Weighted Version

In this paper we study the problem of finding a maximum cardinality, or a maximum-weight order preserving matching.

In the unweighted version, we maximize the number of edges in a non-conflicting matching between the two point sets. This formulation does not use information on data frequencies.

To incorporate knowledge of data frequencies from the two datasets, we can define the weight of matching a ciphertext point with a plaintext point and ask for the non-conflicting matching with maximum weight. The goal is to minimize the total difference of the frequencies between each ciphertext point and its matched plaintext point. Note that this may or may not be equivalent to the objective of maximizing the recovery rate. This depends on the similarity of the two datasets: when the frequencies of the same points are close in the two datasets, a max-weight matching will likely maximize the recovery rate.

There are several possible choices of weight function. Assume $f_p$ and $f_q$ are the frequencies of locations $p$ and $q$, respectively. Then the weight $w(p, q)$ of matching $p$ to $q$ could be one of the following weight functions:

  1. $w(p, q) = \min(f_p, f_q)$. The rationale for this weight function is that if we consider $f_p$ and $f_q$ as indicating the normalized number of items at points $p$ and $q$, then $\min(f_p, f_q)$ indicates the maximum number of items that could be matched.

  2. $w(p, q) = c - |f_p - f_q|$, where $c$ is a manually picked constant, usually the maximum of all $f_p$ and $f_q$. This is the cost function used in [GSBNR17].
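The two weight choices can be sketched as small functions. The exact formulas here are an interpretation of the text: the first is taken to be the smaller of the two frequencies, and the second a constant minus the absolute frequency difference. The function names and the sample frequencies are hypothetical.

```python
def weight_min(freq_p, freq_q):
    """Assumed form of the first weight: the smaller of the two frequencies,
    i.e. the most items that could be matched between the two points."""
    return min(freq_p, freq_q)

def weight_gap(freq_p, freq_q, c):
    """Assumed form of the second weight: a constant c minus the absolute
    frequency difference, so closer frequencies give larger weights."""
    return c - abs(freq_p - freq_q)

# Hypothetical normalized frequencies of one ciphertext and two plaintext points.
cipher_freq = 0.40
plain_freqs = {"q1": 0.38, "q2": 0.10}
c = max([cipher_freq, *plain_freqs.values()])  # constant picked as the largest frequency

by_min = {name: weight_min(cipher_freq, f) for name, f in plain_freqs.items()}
by_gap = {name: weight_gap(cipher_freq, f, c) for name, f in plain_freqs.items()}
```

Under both choices the plaintext point with the closer frequency (`q1`) receives the larger weight.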

IV-B Integer Programming Formulation

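Given two sets of points, $P$ and $Q$, we define for each $p \in P$ and $q \in Q$ a variable $x_{pq}$ that takes value $1$ if $p$ is matched to $q$ and $0$ otherwise. Writing $w_{pq}$ for the weight of matching $p$ to $q$ (with $w_{pq} = 1$ in the unweighted case), we can formulate our matching problem as follows:

$$\max \sum_{p \in P} \sum_{q \in Q} w_{pq}\, x_{pq}$$

subject to

$$\sum_{q \in Q} x_{pq} \le 1 \quad \forall p \in P,$$

$$\sum_{p \in P} x_{pq} \le 1 \quad \forall q \in Q,$$

$$x_{pq} + x_{p'q'} \le 1 \quad \forall (p, q), (p', q') \ \text{s.t.}\ (p, q) \text{ is in conflict with } (p', q'),$$

with $x_{pq} \in \{0, 1\}$.

The first two constraints imply that one point can only be matched to one other point. The last inequality is the non-conflicting constraint.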
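For very small instances, the optimum of such a formulation can be checked by exhaustive search over subsets of candidate edges. The sketch below is only a sanity-check tool under stated assumptions: conflicts are tested through the relative order of endpoints in each coordinate, the default weight is 1 (the unweighted case), and the function names are hypothetical. Its running time is exponential, so it is not a practical solver.

```python
from itertools import combinations, product

def sign(v):
    return (v > 0) - (v < 0)

def in_conflict(e1, e2):
    """Edges conflict if the endpoint orders disagree in x or in y (assumed reading)."""
    (p, q), (p2, q2) = e1, e2
    return any(sign(p[d] - p2[d]) != sign(q[d] - q2[d]) for d in (0, 1))

def best_matching_bruteforce(P, Q, weight=lambda p, q: 1):
    """Enumerate edge subsets and keep the heaviest feasible one."""
    edges = list(product(P, Q))
    best_value, best_edges = 0, []
    for r in range(1, min(len(P), len(Q)) + 1):
        for subset in combinations(edges, r):
            # Each point may be used at most once.
            if len({p for p, _ in subset}) < r or len({q for _, q in subset}) < r:
                continue
            # No two chosen edges may be in conflict.
            if any(in_conflict(a, b) for a, b in combinations(subset, 2)):
                continue
            value = sum(weight(p, q) for p, q in subset)
            if value > best_value:
                best_value, best_edges = value, list(subset)
    return best_value, best_edges

P = [(1, 1), (2, 3), (3, 2)]           # (2, 3) and (3, 2) are ordered oppositely in x and y
Q = [(10, 10), (20, 20), (30, 30)]     # increasing in both coordinates
value, edges = best_matching_bruteforce(P, Q)
```

In this example at most two points of `P` can be matched, because the last two points of `P` cannot both be mapped into a set that is increasing in both coordinates.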

IV-C Related Results on Maximum Independent Sets

Our problem can be phrased as a (weighted) maximum independent set (MIS) problem, in the conflict graph, defined below.

Conflict Graph: the graph whose nodes are pairs of potentially matched points, one from each of the two point sets, and whose edges represent the conflict relationship: two nodes are adjacent if the corresponding matched point pairs are in conflict with each other.

Unfortunately, this graph in our settings is enormous, and its node set has cardinality quadratic in the size of the input. Thus, pursuing our problem as a maximum independent set problem is likely impractical. In general, MIS has no polynomial-time constant-factor approximation algorithm (unless P = NP); in fact, MIS, in general, is Poly-APX-complete, meaning it is as hard as any problem that can be approximated to within a polynomial factor [BazganEP05]. However, there are efficient approximation algorithms for restricted classes of graphs. In planar graphs, MIS can be approximated to within any approximation ratio $c < 1$ in polynomial time; MIS also has a polynomial-time approximation scheme in any family of graphs closed under taking minors [grohe2003local]. In bounded-degree graphs, effective approximation algorithms are known with approximation ratios that are constant for a fixed value of the maximum degree; for instance, a greedy algorithm that forms a maximal independent set by, at each step, choosing a minimum-degree vertex in the graph and removing its neighbors, achieves an approximation ratio of $(\Delta+2)/3$ on graphs with maximum degree $\Delta$ [halldorsson1997greed]; hardness of approximation for such instances is also known [berman1999some], and MIS on 3-regular 3-edge-colorable graphs is APX-complete [bazgan2005completeness].

V NP-Hardness

The problem of finding a maximum-cardinality order preserving matching (i.e., the unweighted case) is NP-hard. Therefore the weighted setting is also NP-hard.

We establish this by a reduction from the Pattern Matching Problem for Permutations (PMPP) [Bose1998], which asks the following: Given a permutation $\sigma = (\sigma_1, \ldots, \sigma_n)$ of the sequence $(1, \ldots, n)$ and a permutation $\pi = (\pi_1, \ldots, \pi_k)$ of the sequence $(1, \ldots, k)$, for $k \le n$, determine if there exists a subsequence $\sigma' = (\sigma_{i_1}, \ldots, \sigma_{i_k})$ of $\sigma$ of length $k$ (with $i_1 < i_2 < \cdots < i_k$) such that the elements of $\sigma'$ are ordered according to the permutation $\pi$, i.e., such that $\sigma_{i_r} < \sigma_{i_s}$ if and only if $\pi_r < \pi_s$. We map a PMPP input pair of permutations, $(\sigma, \pi)$, to a pair of point sets in the plane: one is the set of points $\{(i, \sigma_i) : 1 \le i \le n\}$ corresponding to the permutation $\sigma$, and the other is the set of points $\{(j, \pi_j) : 1 \le j \le k\}$ corresponding to the permutation $\pi$. It now follows from the definition of an order preserving matching, and the specification of the PMPP, that there exists an order preserving matching of size $k$ between the two point sets if and only if there is a subsequence $\sigma'$ of $\sigma$ of length $k$ such that the elements of $\sigma'$ are ordered according to the permutation $\pi$. It follows that our (unweighted) order preserving matching problem is NP-hard.
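The mapping used in the reduction can be illustrated on a small instance. The sketch below turns each permutation into the point set of (position, value) pairs, finds a pattern occurrence by brute force, and checks that matching the pattern points to the occurrence points is order preserving. The helper names are hypothetical, and the brute-force search is exponential and meant only for toy inputs.

```python
from itertools import combinations

def perm_to_points(perm):
    """Point set {(i, perm_i)} of a permutation, with 1-based positions."""
    return [(i + 1, v) for i, v in enumerate(perm)]

def find_occurrence(text, pattern):
    """Return the first index tuple whose subsequence of `text` is ordered
    like `pattern`, or None if no such subsequence exists (brute force)."""
    k = len(pattern)
    for idx in combinations(range(len(text)), k):
        sub = [text[i] for i in idx]
        if all((sub[r] < sub[s]) == (pattern[r] < pattern[s])
               for r in range(k) for s in range(k)):
            return idx
    return None

def is_order_preserving(edges):
    """Check that every pair of edges keeps the same order in x and in y."""
    sign = lambda v: (v > 0) - (v < 0)
    return all(
        sign(p[d] - p2[d]) == sign(q[d] - q2[d])
        for (p, q), (p2, q2) in combinations(edges, 2)
        for d in (0, 1)
    )

text, pattern = [3, 1, 4, 2], [2, 1, 3]
occurrence = find_occurrence(text, pattern)
text_pts, pattern_pts = perm_to_points(text), perm_to_points(pattern)
# Match the j-th pattern point to the text point at the j-th occurrence index.
edges = [(pattern_pts[j], text_pts[i]) for j, i in enumerate(occurrence)]
matching_ok = is_order_preserving(edges)
```

An occurrence of the pattern yields an order preserving matching that covers every pattern point, which is the direction of the equivalence this example exercises.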

Theorem V.1.

Given two point sets $P$ and $Q$ and an integer $k$, it is NP-complete to decide if there exists an order preserving matching of cardinality $k$ between $P$ and $Q$.

VI Algorithms

VI-A Greedy Minimum-Conflict Matching

In this heuristic, we create an order preserving matching $M$ in a greedy manner. We start with $M$ empty, and at each iteration we add to $M$ the edge that has the minimum number of conflicting edges among all potential future edges that could be selected. This heuristic is reminiscent of the minimum-degree heuristic of Halldórsson and Radhakrishnan [halldorsson1997greed], who show that similar heuristics provide a $(\Delta+2)/3$-approximation for finding a maximum independent set in graphs having maximum degree $\Delta$; however, in our setting, $\Delta$ can be very large, making this bound uninteresting.

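Formally, define for each candidate edge its conflict count, i.e., the number of remaining candidate edges that are in conflict with it, and greedily select the edge that minimizes this count. A straightforward algorithm computes the conflict count directly for each of the candidate edges, in order to select each edge to be greedily added to the matching. Overall, this straightforward approach runs in polynomial time but is costly on large inputs.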
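The straightforward version of the heuristic can be sketched as follows. This is a minimal illustration rather than the authors' implementation: conflicts are tested through the relative order of endpoints in each coordinate, two edges that share an endpoint are also treated as conflicting so that the result stays a matching, and ties are broken by candidate order. All names are hypothetical.

```python
from itertools import combinations, product

def sign(v):
    return (v > 0) - (v < 0)

def in_conflict(e1, e2):
    """Assumed conflict test: shared endpoint, or endpoint order disagrees in x or y."""
    (p, q), (p2, q2) = e1, e2
    if p == p2 or q == q2:
        return True
    return any(sign(p[d] - p2[d]) != sign(q[d] - q2[d]) for d in (0, 1))

def greedy_min_conflict(P, Q):
    """Repeatedly pick the candidate edge with the fewest conflicts among the
    remaining candidates, then discard every candidate that conflicts with it."""
    candidates = list(product(P, Q))
    matching = []
    while candidates:
        best = min(
            candidates,
            key=lambda e: sum(in_conflict(e, o) for o in candidates if o != e),
        )
        matching.append(best)
        candidates = [e for e in candidates
                      if e != best and not in_conflict(e, best)]
    return matching

P = [(1, 1), (2, 2), (3, 3)]
Q = [(10, 10), (20, 20), (30, 30)]
matching = greedy_min_conflict(P, Q)
```

Each round recomputes every conflict count from scratch, which is what makes this direct version expensive and motivates a faster way of counting conflicts.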

VI-A1 Unweighted Case

Here, to expedite the algorithm to avoid the time (per edge selected), we propose a weighted random sampling approach. We could find in amortized time per pair . This is done in two steps: We first compute for each the number of point above and to the left of . Similarly we define , , and , , . Then the number of matching edges that are in conflict with can be computed by evaluating the products , where is one of the 4 directions As easily observed, the number of conflicts is