I Introduction
Data outsourcing has become popular in recent years. Small businesses and individual users choose to delegate their data storage to public cloud servers (such as Amazon EC2 or Google Cloud) to save operational costs. Meanwhile, data breaches happen at an increasing rate, compromising users’ privacy. For instance, the Yahoo! data breaches reported in 2016 affected 3 billion user accounts [yahoo]. This is exacerbated by recent scandals of data misuse (such as the Facebook-Cambridge Analytica case [facebook]), which increases the level of distrust from users. To address this issue, end-to-end encryption is commonly adopted to encrypt the data before it is uploaded and stored on an untrusted server. To enable efficient utilization of encrypted data (such as answering queries), many cryptographic techniques known as searchable encryption (SE) [song2000practical, bellare2007deterministic, curtmola2011searchable] have been proposed. The main challenge for SE is to simultaneously provide flexible search functionality, high security assurance, and efficiency. Among existing SE schemes, Order-Preserving Encryption (OPE) [boldyreva2009order, boldyreva2011order, popa2013ideal, kerschbaum2014optimal] has gained wide attention in the literature due to its high efficiency and functionality. In particular, OPE uses symmetric-key cryptography and preserves the numeric order of plaintexts after encryption, which supports most Boolean queries, such as range queries. Well-known systems for encrypted database search using OPE include CryptDB [popa2011cryptdb], the Google Encrypted BigQuery Client [Bigquery], and the Microsoft Always Encrypted database [Microsoft].
Many early OPE schemes, unfortunately, were shown to leak more information beyond what is necessary (i.e., the order between plaintexts). Therefore, schemes that satisfy the ideal security guarantee (that only the order is leaked) have been proposed [popa2013ideal, kerschbaum2014optimal]. However, recent research [NKW15, GSBNR17] showed that it is possible to infer/recover a significant portion of plaintexts from their OPE ciphertexts, using only the ciphertext order relationships, together with an auxiliary dataset whose data frequencies are similar to those of the target dataset. For example, Naveed et al. [NKW15] attacked an encrypted medical database where the users’ age column is encrypted using OPE. Later, the attack was improved by Grubbs et al. [GSBNR17], with the additional restriction of a non-crossing property in the matching algorithm.
We note that, to date, all the successful inference attacks against OPE are limited to one-dimensional data [NKW15, GSBNR17]. That is, even though a database may have multiple numeric columns/dimensions, each of which is encrypted by OPE, each of these columns is treated separately when matched with plaintext values. This works well for dense data, i.e., where most of the values in the data domain have corresponding ciphertexts present in the database, such as age [GSBNR17]. Intuitively, the denser the data, the more effective the attack, because more constraints imposed by the ciphertext order reduce the uncertainty of the corresponding plaintext values. However, for multidimensional databases, applying such 1D matching algorithms on each dimension separately can yield results far from optimal, since it neglects that, for each pair of data tuples, the order-preserving constraints on all the dimensions must hold jointly; this leads to a much larger search space than the actual one and therefore more ambiguity in matching. In addition, for higher-dimensional data (such as spatial/location data), data tuples tend to be increasingly sparsely distributed in the domain, which invalidates the one-dimensional matching approach (unless the ciphertext and known plaintext datasets are highly similar to each other). Therefore, we ask: is it still feasible to recover OPE-encrypted data tuples in multidimensional, sparse databases? This turns out to be a very challenging problem.
In this paper, we study data inference attacks against multidimensional encrypted databases by jointly considering all the dimensions and leveraging only the ciphertext tuples’ order and frequency information, with the help of an auxiliary plaintext dataset with similar frequencies (the same assumption adopted by many previous works). We formulate the order-preserving matching problem first in 2D and later extend it to 3D and higher dimensions. In the unweighted case, given an OPE-encrypted database and an auxiliary plaintext dataset, each containing a set of points in 2D, we maximize the number of points in a matching from the ciphertext to the plaintext, where the order-preserving property must be simultaneously satisfied in both dimensions. Such a matching is called a non-conflicting matching, in which the x/y projection of one edge in the matching cannot contain the corresponding projection of another edge in the matching. In general, we also consider point frequency (the number of records with the same value): points matched with a smaller frequency difference are given higher weights, and we maximize the total weight of the matching.
We show that our problem can also be formulated as an integer linear program (ILP), and prove its NP-hardness by a reduction from the permutation pattern matching problem. Then we propose a greedy algorithm, along with a polynomial-time approximation algorithm with a provable approximation factor. The latter algorithm exploits the geometric structure of the problem and is based on the idea of finding jointly heaviest monotone sequences (i.e., sequences of points with either increasing or decreasing order on each dimension) inside the auxiliary and target datasets. The main contributions of this paper are summarized as follows:
(1) To the best of our knowledge, we are the first to study data inference attacks against multidimensional OPE-encrypted databases by considering all the dimensions jointly. We formulate a 2D order-preserving matching problem and show its NP-hardness.
(2) We design two 2D order-preserving matching algorithms: a greedy algorithm and a polynomial-time algorithm with approximation guarantees. We consider both unweighted and weighted cases, with different weight functions. We further explore efficiency enhancements using tree-based data structures. We also discuss extensions to higher dimensions. These algorithms are of independent interest beyond the applications in this paper.
(3) We evaluate the efficiency and data recovery rate of our algorithms over both synthetic and real-world datasets for different application scenarios, including location-based services, census data, and medical data. Our results suggest that when the ciphertext dataset is highly similar to a subset of the plaintext dataset, the greedy min-conflict algorithm performs best; in general, when the two datasets have arbitrary intersections and are less similar, our monotone matching algorithm performs better. Overall, the recovery rate of our 2D algorithms significantly outperforms that of single-dimensional matching algorithms when the data is sparse in each dimension.
II Background and Related Work
II-A Order-Preserving Encryption
Order-Preserving Encryption (OPE) [popa2013ideal] is a special encryption scheme in which the order of ciphertexts is consistent with the order of plaintexts. For instance, assume there are two plaintexts m1 and m2 with OPE ciphertexts c1 and c2, where ci is the encrypted version of mi, following the common notation in previous studies [popa2013ideal, GSBNR17]. If m1 < m2, then c1 < c2. With such a property, comparison and sorting can be performed on encrypted data directly, without the need to access the plaintext. While some OPE schemes are probabilistic and reveal only the order of data items [kerschbaum2014optimal], probabilistic OPE increases the ciphertext size or requires client-side storage, which scales poorly on sparse data. Most efficient OPE schemes are deterministic, and thus also reveal the frequency of data items [popa2013ideal]. In this paper, we focus on inference attacks on deterministic OPE.
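For intuition, a toy deterministic order-preserving encoder can be sketched as a secret, strictly increasing random mapping over a small plaintext domain. This is an illustration only, not a secure construction and not any scheme from the cited works; all names here are hypothetical:

```python
import random

def keygen(domain_size, seed=42, max_gap=100):
    """Derive a secret, strictly increasing table mapping plaintexts
    0..domain_size-1 into a larger ciphertext space: each ciphertext is
    the previous one plus a secret random gap of at least 1."""
    rng = random.Random(seed)  # the seed plays the role of the secret key
    table, total = [], 0
    for _ in range(domain_size):
        total += rng.randint(1, max_gap)
        table.append(total)
    return table

def encrypt(table, m):
    """Deterministic: equal plaintexts always map to equal ciphertexts."""
    return table[m]

table = keygen(256)
assert encrypt(table, 10) < encrypt(table, 42)  # order is preserved
```

Because the mapping is deterministic, repeated plaintext values also leak their frequency, which is exactly the leakage the inference attacks discussed below exploit.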
II-B Inference Attacks on OPE via 1D Matching
While the security of OPE has been proved formally under ordered chosen-plaintext attacks [popa2013ideal], several studies propose inference attacks to evaluate the privacy leakage of OPE ciphertexts. For instance, Naveed et al. [NKW15] proposed an inference attack on 1D OPE, named the cumulative attack, which leverages frequency leakage only; the matching is computed by running the Hungarian algorithm. Grubbs et al. [GSBNR17] designed leakage-abuse attacks on 1D OPE ciphertexts. The authors utilize both frequency and order leakage, and formulate the attack as a dynamic programming problem [GSBNR17]. This leakage-abuse attack runs faster than the cumulative attack and achieves a higher recovery rate. We briefly describe this leakage-abuse attack below.
Given an OPE-encrypted dataset C and an unencrypted dataset P similar to C, an attacker tries to infer the plaintexts of C without decrypting the OPE ciphertexts, by leveraging the plaintexts of P as well as the order and frequency information of C and P. Without loss of generality, the attack assumes that C = {c1, …, cn} and P = {p1, …, pm} are sorted, where ci ≤ ci+1 for any i, and pj ≤ pj+1 for any j. The attacker also assumes n ≤ m. Let FC and FP be the empirical cumulative distribution functions (CDFs) of the OPE ciphertexts of dataset C and the plaintexts of dataset P, respectively. Now, construct a bipartite graph G on the vertex sets C and P, in which the weight of an edge between vertex ci and vertex pj is defined as a decreasing function of |FC(ci) − FP(pj)| (e.g., α^(−|FC(ci) − FP(pj)|)), where α is a predefined parameter and can be any integer greater than 1.
The attacker finds a max-weight bipartite matching in G that is (one-dimensional) order-preserving (i.e., a vertex early in C is mapped to an early vertex in P). Intuitively, suppose we plot the points of C and P on two parallel lines in their order. If we draw the edges in the matching, these edges cannot cross. That is, if ci and pj are matched, any vertex ci' with i' > i cannot be matched with a vertex pj' with j' < j. Therefore, such a matching is also called a non-crossing matching. The max-weight non-crossing matching can be found in O(nm) time via dynamic programming. If vertex ci is matched with vertex pj, the attacker infers pj as the plaintext of OPE ciphertext ci.
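The dynamic program behind this attack can be sketched as follows. This is an illustrative sketch, not the implementation of [GSBNR17]; the empirical-CDF weight with base alpha = 2 is an assumed, illustrative choice:

```python
from bisect import bisect_right

def ecdf(sorted_vals):
    """Empirical CDF of a sorted list of values."""
    n = len(sorted_vals)
    return lambda v: bisect_right(sorted_vals, v) / n

def noncrossing_match(C, P, weight):
    """Max-weight non-crossing matching of sorted C into sorted P via
    O(|C|*|P|) dynamic programming. weight(i, j) scores matching C[i]
    to P[j]; returns (total weight, list of matched index pairs)."""
    n, m = len(C), len(P)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j],       # skip ciphertext C[i-1]
                          M[i][j - 1],       # skip plaintext P[j-1]
                          M[i - 1][j - 1] + weight(i - 1, j - 1))
    pairs, i, j = [], n, m                   # backtrack
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + weight(i - 1, j - 1):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return M[n][m], pairs[::-1]

# toy instance: the ciphertext values stand in for their sort order
C = [3, 7, 9]
P = [1, 3, 5, 7, 9, 11]
Fc, Fp = ecdf(C), ecdf(P)
alpha = 2  # assumed penalty base; any value > 1 works
w = lambda i, j: alpha ** (-abs(Fc(C[i]) - Fp(P[j])))
score, pairs = noncrossing_match(C, P, w)
```

The returned index pairs are strictly increasing in both coordinates, i.e., the matching is non-crossing by construction.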
II-C Other Attacks on Encrypted Databases
In addition to cumulative attacks and leakage-abuse attacks, some other attacks have also been proposed against OPE. Durak et al. [DDC16] proposed sort attacks on 2D data encrypted by OPE. This attack performs a non-crossing matching on each dimension separately, and then improves the recovery results by exploiting inter-column correlation. Bindschaedler et al. [BGCRS18] proposed an inference attack against property-preserving encryption on multidimensional data. This attack operates column by column: it first recovers the column encrypted with the weakest encryption primitive, and then infers the next column, encrypted by a stronger primitive, by considering correlations. The attack is formulated as a Bayesian inference problem. It also leverages record linkage and machine learning to infer columns that are strongly encrypted. In comparison, our proposed matching algorithms aim to optimally recover data tuples containing two or more dimensions as a whole. We utilize the order and frequencies of the 2D tuples, instead of the single-dimension order and frequency used in previous works. In addition, we do not need explicit prior knowledge about the data correlations across dimensions within an encrypted dataset.
Finally, reconstruction attacks [KKNO16, LMP18]
recover plaintexts on any searchable encryption scheme that supports range queries. Different from inference attacks, a reconstruction attack does not require a similar dataset as a reference, but recovers data based on access-pattern leakage from a large number of range queries. However, reconstruction attacks often assume that range queries are uniformly distributed, except [GLMP19], which is based on statistical learning theory. These works are orthogonal to this work.
In this paper, we design two 2D order-preserving matching algorithms that jointly consider the data ordering in both dimensions. We also extend the 1D matching algorithm in [GSBNR17] to 2D data for comparison. It turns out that all the algorithms have advantages and limitations, as we describe in the evaluation and conclusion sections.
III Models and Objectives
System Model. There are two entities in our system model: a client and a server. We assume that the client has a dataset (e.g., a location dataset) and needs to store it on the server. Due to privacy concerns, the client encrypts the dataset before outsourcing it to the server.
We assume that the client encrypts the data using deterministic OPE, so that the server can perform search operations (e.g., range queries) over the encrypted data without decryption. We assume that each dimension of the data is encrypted separately with OPE, so that search can be enabled on each dimension. We denote the client’s dataset by D and its encrypted version by E(D).
Threat Model. We assume that the server is an honest-but-curious attacker, who is interested in revealing the client’s data but does not maliciously add, modify, or remove the client’s data. In addition, we assume that the server possesses a plaintext dataset similar to the client’s dataset, and that the two datasets have a significant number of common data points. Points that appear in both datasets have similar frequency distributions. For example, the client’s dataset can be the location data of Uber users, and the server’s auxiliary dataset can be a USGS spatial database (the former can be considered to be randomly sampled from the latter). Alternatively, the two can be location check-in datasets from two different social networking apps with partially overlapping locations.
Objectives. The attacker’s goal is to perform inference attacks that maximally infer/recover the plaintexts of the encrypted database without decryption, using only the auxiliary plaintext dataset together with the ciphertext/plaintext order, either with or without the frequency of points in both datasets. The attacker aims at recovering the database points exactly. We define the recovery rate as the primary metric to measure the privacy leakage of the inference attack.
Recovery rate: If an attacker infers N points, of which m are correct inferences (i.e., equal to their true plaintext points), then the recovery rate is m/N. In addition, we consider both the unweighted version of this metric, where each unique point/location is counted once, and the weighted version, where the frequency is considered as well (the number of ‘copies’ of the same point, e.g., the number of customers in a restaurant). The former can be regarded as “point-level” and the latter as “record-level” recovery. Intuitively, to maximize the weighted recovery rate, the points with larger frequencies should be correctly matched with higher priority.
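To make the two variants of the metric concrete, here is a minimal sketch; the dict-based layout and all names are hypothetical, chosen only for illustration:

```python
def recovery_rates(inferred, truth, freq):
    """Point-level and record-level recovery rates.
    inferred: ciphertext point -> guessed plaintext point
    truth:    ciphertext point -> true plaintext point
    freq:     ciphertext point -> number of records ('copies')"""
    correct = [c for c, guess in inferred.items() if truth[c] == guess]
    point_level = len(correct) / len(inferred)
    record_level = (sum(freq[c] for c in correct)
                    / sum(freq[c] for c in inferred))
    return point_level, record_level

inferred = {"c1": (1, 2), "c2": (3, 4), "c3": (5, 6)}
truth    = {"c1": (1, 2), "c2": (3, 4), "c3": (9, 9)}
freq     = {"c1": 10, "c2": 1, "c3": 1}
point, record = recovery_rates(inferred, truth, freq)
# c1 and c2 are correct: point-level is 2/3, but record-level is 11/12,
# since the high-frequency point c1 was matched correctly
```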
IV 2D Order-Preserving Matching
We formulate an order-preserving matching problem in two dimensions. Let A and B be two finite sets of points in the plane, with |A| = n and |B| = m. If a ∈ A is matched to b ∈ B, we denote this as an edge (a, b), sometimes also written ab. We say that a matching M between A and B is order preserving if there exist two monotone functions f, g such that if a = (x, y) is matched to b (for (a, b) ∈ M), then b = (f(x), g(y)).
It is convenient to consider an alternative, equivalent way to define order preserving, in terms of “conflicts”. We say that two edges (a1, b1) and (a2, b2) are in x-conflict with each other if the x-projection (interval) of one edge contains the x-projection (interval) of the other edge; the notion of being in y-conflict is defined similarly. We say that a matching is a non-conflicting matching of A and B if it does not contain any x-conflicting or y-conflicting pair of edges. From the definitions, it is easy to see that a matching is order preserving if and only if it is a non-conflicting matching.
We say that a point p dominates a point q, and write p ≻ q, if either (i) p.x > q.x and p.y ≥ q.y, or (ii) p.x ≥ q.x and p.y > q.y. With this notation, two pairs (a1, b1), (a2, b2) with a1 ≻ a2 but b1 ⊁ b2 are in conflict.
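The interval-containment test can be transcribed directly as a sketch. Points are (x, y) tuples and an edge pairs one point from each dataset; how ties at interval endpoints are treated is a detail the definition leaves open, so here containment is non-strict except that identical intervals do not conflict:

```python
def interval(u, v):
    return (min(u, v), max(u, v))

def contains(i1, i2):
    """True if interval i1 contains interval i2 (and they differ)."""
    return i1[0] <= i2[0] and i2[1] <= i1[1] and i1 != i2

def in_conflict(e1, e2):
    """Edges e = (a, b) match point a in one dataset to point b in the
    other. Two edges conflict if the x-projection of one contains the
    x-projection of the other, or likewise for the y-projections."""
    (a1, b1), (a2, b2) = e1, e2
    for k in (0, 1):  # 0: x-axis, 1: y-axis
        i1, i2 = interval(a1[k], b1[k]), interval(a2[k], b2[k])
        if contains(i1, i2) or contains(i2, i1):
            return True
    return False

# a 'nested' pair of edges conflicts; a 'parallel' pair does not
assert in_conflict(((0, 0), (3, 3)), ((1, 1), (2, 2)))
assert not in_conflict(((0, 0), (1, 1)), ((2, 2), (3, 3)))
```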
IV-A Unweighted vs. Weighted Versions
In this paper we study the problem of finding a maximum-cardinality, or a maximum-weight, order-preserving matching.
In the unweighted version, we maximize the number of edges in a non-conflicting matching between the two point sets. This formulation does not use information on data frequencies.
To incorporate knowledge of the data frequencies, we can define a weight for matching a ciphertext point with a plaintext point and ask for the non-conflicting matching with maximum weight. The goal is to minimize the total difference of the frequencies between each ciphertext point and its matched plaintext point. Note that this may or may not be equivalent to the objective of maximizing the recovery rate; this depends on the similarity of the two datasets: when the frequencies of the same points are close in the two datasets, the max-weight matching will likely maximize the recovery rate.
There are several possible choices of weight function. Assume f(b) and f(l) are the frequencies of point b in the ciphertext dataset and point l in the plaintext dataset, respectively. Then the weight of matching b to l could be one of the following:
(1) w(b, l) = min(f(b), f(l)). The rationale for this weight function is that if we consider f(b) and f(l) as indicating the normalized numbers of items at points b and l, then min(f(b), f(l)) indicates the maximum number of items that could be matched.
(2) w(b, l) = α^(−|f(b) − f(l)|), where α is a manually-picked constant, usually set to the maximum of all f(b) and f(l). This is the cost function used in [GSBNR17].
IV-B Integer Programming Formulation
Given two sets of points A and B, we define a binary variable x_ab for each pair a ∈ A, b ∈ B, which takes value 1 if a is matched to b and 0 otherwise. Now, we can formulate our matching problem as follows:
Maximize    Σ_{a ∈ A, b ∈ B} w(a, b) · x_ab
Subject to  Σ_{b ∈ B} x_ab ≤ 1 for all a ∈ A,
            Σ_{a ∈ A} x_ab ≤ 1 for all b ∈ B,
            x_ab + x_a'b' ≤ 1 whenever (a, b) is in conflict with (a', b').
The first two constraints imply that each point can be matched to at most one other point. The last inequality is the non-conflicting constraint.
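For intuition about what the program optimizes, the unweighted objective can be solved exhaustively on a tiny instance. This is a specification aid only, feasible for a handful of points; `order_conflict` here treats shared endpoints or disagreement of x- or y-order as conflicts, assuming points in general position:

```python
from itertools import combinations

def order_conflict(e1, e2):
    """Two candidate edges conflict if they share an endpoint or if the
    order of their endpoints disagrees on either axis."""
    (a1, b1), (a2, b2) = e1, e2
    if a1 == a2 or b1 == b2:
        return True
    return ((a1[0] < a2[0]) != (b1[0] < b2[0])) or \
           ((a1[1] < a2[1]) != (b1[1] < b2[1]))

def brute_force_opm(A, B):
    """Maximize |M| over conflict-free edge sets M by exhaustive search,
    mirroring the ILP with w(a, b) = 1 for every pair."""
    edges = [(a, b) for a in A for b in B]
    for r in range(min(len(A), len(B)), 0, -1):
        for M in combinations(edges, r):
            if all(not order_conflict(e, f)
                   for e, f in combinations(M, 2)):
                return list(M)
    return []
```

Since the general problem is NP-hard (Section V), this exhaustive search only serves to pin down the objective, not to solve realistic instances.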
IV-C Related Results on Maximum Independent Sets
Our problem can be phrased as a (weighted) maximum independent set (MIS) problem in the conflict graph, defined below.
Conflict Graph G: the graph whose nodes are pairs of potentially matched points, one from each dataset, and whose edges represent the conflict relationship: two nodes are adjacent in G if the corresponding matched point pairs are in conflict.
Unfortunately, in our setting this graph is enormous: its node set has cardinality quadratic in the size of the input. Thus, pursuing our problem as a maximum independent set problem is likely impractical. In general, MIS has no polynomial-time constant-factor approximation algorithm (unless P = NP); in fact, MIS is Poly-APX-complete, meaning it is as hard as any problem that can be approximated within a polynomial factor [BazganEP05]. However, there are efficient approximation algorithms for restricted classes of graphs. In planar graphs, MIS can be approximated to within any constant ratio in polynomial time; MIS also has a polynomial-time approximation scheme in any family of graphs closed under taking minors [grohe2003local]. In bounded-degree graphs, effective approximation algorithms are known, with approximation ratios that are constant for a fixed value of the maximum degree; for instance, a greedy algorithm that forms a maximal independent set by, at each step, choosing a minimum-degree vertex in the graph and removing its neighbors, achieves an approximation ratio of (Δ + 2)/3 on graphs with maximum degree Δ [halldorsson1997greed]. Hardness of approximation for such instances is also known [berman1999some], and MIS on 3-regular 3-edge-colorable graphs is APX-complete [bazgan2005completeness].
V NP-Hardness
The problem of finding a maximum-cardinality order-preserving matching (i.e., the unweighted case) is NP-hard; therefore, the weighted setting is also NP-hard.
We establish this by a reduction from the Pattern Matching Problem for Permutations (PMPP) [Bose1998], which asks the following: Given a permutation σ of the sequence (1, …, n) and a permutation π of the sequence (1, …, m), for m ≤ n, determine if there exists a subsequence s of σ of length m such that the elements of s are ordered according to the permutation π, i.e., such that s_i < s_j if and only if π_i < π_j. We map a PMPP input pair of permutations (σ, π) to a pair of point sets (A, B) in the plane: specifically, A = {(i, σ_i) : 1 ≤ i ≤ n} is the set of points corresponding to the permutation σ, and B = {(i, π_i) : 1 ≤ i ≤ m} is the set of points corresponding to the permutation π. It now follows from the definition of an order-preserving matching, and the specification of the PMPP, that there exists an order-preserving matching of size m between A and B if and only if there is a subsequence of σ of length m whose elements are ordered according to the permutation π. It follows that our (unweighted) order-preserving matching problem is NP-hard.
Theorem V.1.
Given two point sets A and B and an integer k, it is NP-complete to decide if there exists an order-preserving matching of cardinality k between A and B.
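The reduction above can be made concrete with a small sketch. The helper names are hypothetical, and the brute-force pattern test is exponential, serving only to illustrate the equivalence:

```python
from itertools import combinations

def perm_to_points(perm):
    """Map a permutation to a planar point set: value perm[i] at x = i."""
    return [(i, v) for i, v in enumerate(perm)]

def contains_pattern(sigma, pi):
    """Brute-force PMPP: does sigma contain a subsequence that is
    order-isomorphic to pi? By the reduction, this holds iff the point
    sets perm_to_points(sigma) and perm_to_points(pi) admit an
    order-preserving matching of cardinality len(pi)."""
    m = len(pi)
    for idxs in combinations(range(len(sigma)), m):
        sub = [sigma[i] for i in idxs]
        if all((sub[i] < sub[j]) == (pi[i] < pi[j])
               for i in range(m) for j in range(i + 1, m)):
            return True
    return False

assert contains_pattern([2, 4, 1, 5, 3], [1, 3, 2])  # e.g. 2, 4, 3
assert not contains_pattern([1, 2, 3], [2, 1])       # no descending pair
```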
VI Algorithms
VI-A Greedy Minimum-Conflict Matching
In this heuristic, we create an order-preserving matching M in a greedy manner. We start with M empty, and at each iteration we add to M the edge that has the minimum number of conflicting edges among all potential future edges that could be selected. This heuristic is reminiscent of the minimum-degree heuristic of Halldórsson and Radhakrishnan [halldorsson1997greed], which shows that similar heuristics provide a (Δ + 2)/3 approximation for finding a maximum independent set in graphs having maximum degree Δ; however, in our setting, Δ can be as large as the number of candidate edges, making this bound uninteresting. Formally, define conflict(e) as the number of candidate edges in conflict with a candidate edge e, and greedily select the edge e that minimizes conflict(e). A straightforward algorithm computes conflict(e) directly for each of the candidate edges e, in order to select each edge to be greedily added to M; this direct computation is expensive.
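A minimal sketch of the greedy heuristic follows, assuming the order-disagreement conflict test; the naive rescanning of all remaining candidates corresponds to the straightforward, unoptimized variant discussed above:

```python
def order_conflict(e1, e2):
    """Conflict test: shared endpoint, or x-/y-order disagreement
    (points in general position assumed)."""
    (a1, b1), (a2, b2) = e1, e2
    if a1 == a2 or b1 == b2:
        return True
    return ((a1[0] < a2[0]) != (b1[0] < b2[0])) or \
           ((a1[1] < a2[1]) != (b1[1] < b2[1]))

def greedy_min_conflict(A, B):
    """Repeatedly add the candidate edge that conflicts with the fewest
    remaining candidates, then drop every candidate it conflicts with."""
    candidates = [(a, b) for a in A for b in B]
    M = []
    while candidates:
        best = min(candidates,
                   key=lambda e: sum(order_conflict(e, f)
                                     for f in candidates if f is not e))
        M.append(best)
        candidates = [f for f in candidates
                      if f is not best and not order_conflict(best, f)]
    return M
```

Each selection rescans all remaining candidates, which is costly; the quadrant-counting idea described below is one way to avoid this per-selection cost.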
VI-A1 Unweighted Case
Here, to expedite the algorithm and avoid the expensive per-edge-selection computation, we propose a weighted random sampling approach, with which the conflict count can be found in amortized time per pair. This is done in two steps. We first compute, for each point, the number of points above and to the left of it; similarly, we compute the counts of points in the other three quadrants (above-right, below-left, and below-right), for the points of both datasets. Then the number of candidate edges that are in conflict with a given candidate edge can be computed by evaluating the products of the corresponding quadrant counts, one for each of the four directions.
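One plausible reading of the quadrant-count idea, in unoptimized form, is sketched below. The counts are computed naively here, ties are ignored by assuming general position, and candidate edges sharing an endpoint with the given edge are excluded from the count:

```python
def quadrant_counts(points):
    """For each point p, count how many other points lie in each of the
    four quadrants around p: UL, UR, BL, BR (general position assumed)."""
    counts = {}
    for p in points:
        c = {"UL": 0, "UR": 0, "BL": 0, "BR": 0}
        for q in points:
            if q == p:
                continue
            c[("U" if q[1] > p[1] else "B") +
              ("R" if q[0] > p[0] else "L")] += 1
        counts[p] = c
    return counts

def conflict_count(a, b, cntA, cntB, nA, nB):
    """Candidate edges (a', b') with a' != a, b' != b are compatible with
    (a, b) exactly when a' and b' fall in the SAME quadrant relative to
    a and b, respectively; every other such candidate conflicts."""
    same = sum(cntA[a][d] * cntB[b][d]
               for d in ("UL", "UR", "BL", "BR"))
    return (nA - 1) * (nB - 1) - same
```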