In today’s world many organizations and individuals constantly gather
information about people, whether directly or indirectly. This leads to enormous
databases storing private information regarding individuals’ personal and
professional life. Commonly, access to these records is limited and safeguarded
using authorization and authentication protocols. Only authorized users may
query the system for data. However, in today’s
global network of organizational connections, there is a growing demand to disseminate
and share this information, motivated by various academic, commercial and
other benefits. This information is becoming a very important resource for many
systems and corporations that may analyze the data in order to enhance and
improve their services and performance. The problem of privacy-preserving data
analysis has a long history spanning multiple disciplines. As electronic data
about individuals becomes increasingly detailed, and as technology enables ever
more powerful collection and curation of these data, the need increases for a
robust, meaningful, and mathematically rigorous definition of privacy, together
with a computationally rich class of algorithms that satisfy this definition. A
comparative analysis and discussion of such algorithms with regards to
statistical databases can be found in . One common practice
for publishing such data without violating privacy is applying regulations,
policies and guiding principles for the use of the data. Such regulations
usually entail data distortion for the sake of anonymizing the data. In recent
years, there has been a growing use of anonymization algorithms based on
differential privacy introduced by Dwork et al. .
Differential privacy is a mathematical definition of how to measure the privacy
risk of an individual when they participate in a database. To construct a data
collection or data querying algorithm which satisfies differential privacy,
one must add some level of noise to the collected or returned data respectively.
While ensuring some level of privacy, these methods still have several issues
with regards to implementation and data usability. Sarwate and Chaudhuri
 discuss the challenges of differential privacy with regards to
continuous data, as well as the trade-off between privacy and utility. In some cases, the data may become
unusable after distortion. Lee and Clifton  discuss the
difficulty of correctly implementing differential privacy with regards to the
choice of ε as the differential privacy factor. Due to these issues and
restrictions, other privacy preserving algorithms are still prevalent in many
databases and statistical data querying systems. In this paper, we address
vulnerabilities of several implementations of such privacy preserving algorithms.
The vulnerability of databases, and hence the potential avenues of attack, depend among other things on the underlying data structure (and query behavior). The information stored in databases also comes in many forms, such as plain text, spatial coordinates, numeric values, and others. Each combination of structure and data format allows for its own specific attack and requires its own unique handling of privacy protection. Another factor when handling privacy in databases is the type of queries allowed (which may be dictated by the previously mentioned structure and data format). For example, datasets with timestamp values may only allow min/max and grouping queries, while those containing sequential numeric values may also allow queries regarding averages, sums, and other mathematical formulas. In Section 2 we analyze the effectiveness of different queries using the -query-set-size limitation over aggregate functions in maintaining individual user privacy in a vehicular network.
Another field where privacy concerns are a growing issue is the field of recommendation systems. Many of these systems use the collaborative filtering technique, in which users are required to reveal their preferences in order to benefit from the recommendations. Su et al.  survey these techniques in depth. Several methods aimed at hiding and anonymizing user data have been proposed and studied in an attempt to reduce the privacy issues of collaborative filtering. These methods include data obfuscation, random perturbation, data suppression and others [2, 7, 8, 9]. Most of these methods rely on experimental results alone to show effectiveness, and some have already been shown to have weaknesses that can be exploited in order to recover the original user data [10, 11]. Parameswaran and Blough  propose a new data obfuscation technique dubbed “Nearest Neighbor Data Substitution” (NeNDS). In Section 3 we detail a privacy attack on NeNDS based on partial prior information, as well as address shortcomings in the NeNDS algorithm and propose avenues of research for its improvement. Finally, we conclude in Section 4.
2 Combining Queries with -Limited Results
The underlying data structure of a database is one of the factors in
determining the querying methods used over the database. The database logic
itself may further restrict queries, in some cases allowing for querying a
specific key and in others only returning aggregate results over a set of
values. The data type stored may also be a factor when discerning which
querying methods may be used. Numeric values can allow for mathematical queries
such as sums, averages and medians. Text fields may allow for string operations
such as “contains”, “starts-with”, or even regular expressions. In the same
manner, these queries may also be prohibited as they may convey information
that is meant to remain private. Other limitations may be placed on queries as
well, such as the query-set-size limitation, blocking query results in cases
where a predefined number () of record look-ups has not been reached (i.e.
the number of users/items taken into consideration by the query is less than
). Venkatadri et al.  recently demonstrated a privacy
attack on Facebook users by utilizing an exploit in Facebook’s targeted
advertising API which similarly restricted query results containing too few
users. Using a combination of multiple queries which returned aggregate results
(or no results due to a low number of users matching the query), the
researchers were able to narrow down personally identifiable information which
was regarded as private by the users. In this section we look at such cases and
attempt to determine whether an attacker can use a combination of allowed
queries in order to extract information which the prohibited queries mean to
block. This may be done using multiple queries of the same type, or a
combination of several query types.
2.1 Dataset and Query Models
We attempt to show privacy attacks on data gathered from vehicular networks. The gathered data is stored in a centralized database which allows a set of queries that are designed to return meaningful information without compromising the privacy of the users. A privacy attack is defined as access to any information gathered from the user that was not made available by standard queries on the database.
2.1.1 Graph Datasets Model
A vehicular network comprises unique units distributed in the real world, which are displayed on a graph as a set of vertices such that each vertex represents one vehicle at a single (discrete) point in time . The timestamps are measured as incremental time steps from the system’s initial measurement, designated . We consider three different graph models:
A linear graph with vehicles distributed along discrete coordinates on the axis between .
A two-dimensional planar graph with vehicles distributed along discrete coordinates on the and axes between .
A three-dimensional cubic graph with vehicles distributed along discrete coordinates on the , and axes between .
For each vehicle at each timestamp, the speed is measured. We denote this with being a discrete value timestamp.
2.1.2 Query Model
The following queries are allowed over the database:
: given a range and a timestamp , return the average speed over all vehicles in the given range at the given time.
: given a range and a timestamp , return the max speed over all vehicles in the given range at the given time.
: given a range and a timestamp , return the min speed over all vehicles in the given range at the given time.
: given a range and a timestamp , return the median speed over all vehicles in the given range at the given time.
The range is defined by a set of boundaries over the relevant graph:
in : A starting coordinate and end coordinate .
in : A rectangle with corners .
in : A box with corners .
In order to protect user privacy, all queries deal with measurements over aggregated data so as not to indicate a single user’s information. As such, the queries only return a result if at least unique values have been recorded for the scope over which the query has been run, where . The value is known to the attacker; however, the number of records which were a part of each query result is not (i.e. the attacker only knows that if a result is returned there are at least records in the requested scope , but not the exact number).
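The query model can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the system studied here: the class name `SizeLimitedDB`, the record layout, and the use of `None` for a refused query are all hypothetical, and the threshold is applied to the number of matching records. The median uses the lower middle value, which matches the convention assumed later for the median query.

```python
from statistics import mean, median_low


class SizeLimitedDB:
    """Hypothetical aggregate-only store with a query-set-size limit k."""

    def __init__(self, records, k):
        # records: list of (position_tuple, timestamp, speed)
        self.records = records
        self.k = k

    def _select(self, bounds, t):
        # bounds: one inclusive (lo, hi) pair per axis of the graph
        return [
            speed
            for pos, ts, speed in self.records
            if ts == t and all(lo <= c <= hi for c, (lo, hi) in zip(pos, bounds))
        ]

    def _answer(self, fn, bounds, t):
        values = self._select(bounds, t)
        # Refuse to answer when fewer than k records fall in the range.
        return fn(values) if len(values) >= self.k else None

    def avg(self, bounds, t):
        return self._answer(mean, bounds, t)

    def max(self, bounds, t):
        return self._answer(max, bounds, t)

    def min(self, bounds, t):
        return self._answer(min, bounds, t)

    def median(self, bounds, t):
        return self._answer(median_low, bounds, t)


db = SizeLimitedDB([((0,), 0, 50), ((1,), 0, 60), ((2,), 0, 70)], k=3)
print(db.avg([(0, 2)], 0))   # answered: three vehicles in range
print(db.avg([(0, 1)], 0))   # refused: only two vehicles in range
```

The attacker sees only the returned value or the refusal, never the number of records behind an answer.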
2.2 Analysis of
In this section we present privacy attack problems over different graphs and queries.
2.2.1 Linear vehicular placement
Model: A linear graph with vehicles.
Attack: find the speed of a single vehicle at a given time .
It is easy to see that a single query will not constitute an attack. The attack can be performed using the following algorithm:
Select a range with .
Run query and denote the result .
Select a new range with .
Run query and denote the result .
Continue querying over ranges, each time incrementing until a result isn’t returned. Mark the last coordinate which returned a result as and the result returned as . Note that there were records in this scope.
The attacker can now backtrack over all results and calculate the speed of each vehicle between and .
Denote this algorithm . We can see that the runtime for this algorithm is the number of query iterations required to find a section with vehicles.
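The backtracking step can be illustrated with a short Python sketch. It is one possible reading of the steps above, under assumptions that are not taken from the source: the range is shrunk from the left one coordinate at a time, each coordinate holds at most one vehicle, and an unchanged average is interpreted as an empty coordinate (a vehicle whose speed equals the current average would go unnoticed). The helper names `make_avg_query` and `linear_attack` are hypothetical.

```python
from fractions import Fraction


def make_avg_query(speeds, k):
    """Stand-in for the restricted average query; speeds maps coordinate -> speed."""
    def avg(lo, hi):
        vals = [s for x, s in speeds.items() if lo <= x <= hi]
        return Fraction(sum(vals), len(vals)) if len(vals) >= k else None
    return avg


def linear_attack(avg, lo, hi, k):
    """Recover individual speeds using only answered/refused average queries."""
    answered = []
    start = lo
    while start <= hi:
        result = avg(start, hi)
        if result is None:
            break                      # first refusal: fewer than k vehicles remain
        answered.append((start, result))
        start += 1
    else:
        return {}                      # never refused, so no count can be pinned down
    if not answered:
        return {}

    # The last answered range holds exactly k vehicles (one vehicle per coordinate).
    count = k
    total = answered[-1][1] * k
    recovered = {}
    pairs = zip(reversed(answered[:-1]), reversed(answered[1:]))
    for (coord, wider), (_, narrower) in pairs:
        if wider != narrower:          # the average changed: a vehicle sits at coord
            new_total = wider * (count + 1)
            recovered[coord] = new_total - total
            count, total = count + 1, new_total
    return recovered


speeds = {0: 50, 1: 60, 3: 70, 4: 80, 6: 90}
print(linear_attack(make_avg_query(speeds, 3), 0, 6, 3))
```

Exact rational arithmetic is used so that equal averages compare equal; with floating-point answers a tolerance would be needed.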
2.2.2 2D vehicular placement
Model: A two-dimensional planar graph with vehicles distributed along discrete coordinates on the and axis between .
Attack(1): find the speed of a single vehicle at a given time .
Attack(2): find the average speed of a set of vehicles, with the size of smaller than , at a given time .
Assumptions: The values of and are known, where . We first select some value on the axis, denote this value , and split into 3 ranges (the section above , the section below , and the section containing only ):
Note that both and contain , and the union of and is the entire graph containing all vehicles (). We define to be and respectively. See partition example in Figure 1. It is important to note that due to symmetry, this partition can also be done around some value on the axis, with the sections built around this value .
We now perform 5 queries on :
, , , , .
If one of the selected queries does not return a response (i.e. it contains fewer than vehicles), we re-select and repeat the process until all 5 queries are answered (such a value should exist due to the size of
and the probable distribution of vehicles).
Using the results we wish to find the average speed of vehicles in section , and the number of vehicles in each section: . The number of vehicles in each section is a function of the section range and a given timestamp: . We denote the number of vehicles in each section as follows:
To do so, we solve the following equation system:
Solving this system gives us the following:
Denote this process . The runtime for this algorithm is
equivalent to running 5 queries on the dataset, plus solving
the above equation system.
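As a hedged illustration of how five averages can determine the section sizes, the sketch below assumes the five queried ranges are the whole graph, the section strictly above the chosen line, the section strictly below it, and each of those two sections joined with the line. Under that assumption the unknowns (the three vehicle counts and the sum of speeds on the line) satisfy a linear system. The function names and the specific form of the system are assumptions made for this example and are not taken from the original equations.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Gauss-Jordan elimination with exact fractions (assumes a unique solution)."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def recover_line(n, s_all, s_above, s_below, s_above_line, s_below_line):
    """Return (count above, count below, count on line, average speed on line).

    Unknowns: a, b, c (vehicle counts) and v (sum of speeds on the line).
      a + b + c                           = n
      a*s_above + v - (a+c)*s_above_line  = 0
      b*s_below + v - (b+c)*s_below_line  = 0
      a*s_above + b*s_below + v           = n*s_all
    """
    rows = [
        [1, 1, 1, 0],
        [s_above - s_above_line, 0, -s_above_line, 1],
        [0, s_below - s_below_line, -s_below_line, 1],
        [s_above, s_below, 0, 1],
    ]
    rhs = [n, 0, 0, n * s_all]
    a, b, c, v = solve_linear(rows, rhs)
    return a, b, c, (v / c if c != 0 else None)


# Above: speeds 10, 20. Below: 30, 40, 50. On the line: 60, 80.
print(recover_line(7, Fraction(290, 7), 15, 40, Fraction(85, 2), 52))
```

In this example the line holds only two vehicles, so a threshold larger than two would refuse a direct query on it, yet its count and average are still recovered from the five answered queries.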
With these values we can now attempt Attack(1) and Attack(2):
If , we have succeeded in Attack(2).
If , we can run on which represents the
boundaries of a linear graph; we can select on any vehicle with
vehicles on either side of it as the target vehicle and perform
Attack(1). If , we cannot complete either attack, so we select a new
value and run again. There exists an edge case of graphs
where for all values of that we can choose as , the number of
vehicles will be equal to , in which case we will be unable to perform
any attack. This scenario is, however, unlikely in the case of vehicular
networks. In addition, since we have the number of vehicles and in
and respectively, if these values are sufficiently
large in relation to , we can look at these ranges as sub-graphs of
and run on them with and .
It is easy to see that we can apply the same method used on the two dimensional graph on the three dimensional graph with some minor modifications as follows. We again select some value on the axis, denote this value , and split into 3 ranges (the section above , the section below , and the section containing only ). In this instance, these sections are represented as cubes in the following manner:
Similarly, we define , to be and respectively. Note that after running
our five queries on the five sections, we achieve the same linear equations as
in the two dimensional case. Solving these equations now leaves us with the
average speed over the plane defined by , and the number of vehicles
in this plane. As in the two dimensional case, if we have succeeded
in Attack(2). If we now have a sub-graph of which constitutes a two
dimensional graph on which we may be able to perform . The
minimum size for this to be possible is .
While our results, given as and , refer to the average speeds of vehicles in their respective graph placements, they are not limited to speed values. The same methods can be used for any numeric value that can be averaged over a set of vehicles in this manner, such as the number of traffic violations a vehicle has accumulated, the number of accidents the vehicle has been involved in, and so on. Any of these, when given as averages over a set of vehicles, may appear innocent and maintain a high level of privacy for an individual in the system. However, as we have shown, an individual’s data can be inferred with minimal effort by employing our methods. Of course, we are also not limited to vehicular networks. Any data set with the same structure of node placement in a graph will yield the same results.
2.3 Analysis of and
In this section we look at possible attacks using the minimum, maximum and median value queries over ranges in the graph as defined previously by and respectively. Similar to the case of , we define that the queries will not return a result if the target range at time contains fewer than individual values. In addition, our analysis of potential attacks rests on the following set of assumptions:
The data set consists of unique values.
The value is known to the attacker.
In case a result is returned, the number of actual values in is not known to the attacker.
If contains an even number of values, returns the lower of the median values.
The attacker is limited only to the and queries, but can perform any number of queries over the data set.
For simplicity, we will treat the data set as in the previous section - a linear graph representing a snapshot in time of recorded speeds of vehicles in a specified area. A query of type ( being or ) at time over a range beginning at and ending at (inclusive) will be denoted .
We note that there are several special cases in which a trivial attack can be performed. We will address these cases before moving on to the general case.
2.3.1 Case 1: Global Min/Max
Since there exists a unique global minimum and a unique global maximum in the graph, it is easy to see that by querying over the entire graph and iteratively decreasing the range until a new minimum/maximum is found, the vehicles with the minimum and maximum speeds can be discovered.
2.3.2 Case 2: Min/Max
Similar to the case of a global min/max, if a vehicle has the local minimum or maximum value with regards to its nearest neighbors, then its speed can be discovered. This is done using the same method as stated for the global min/max. A range consisting of vehicles, with the outer vehicle having a min (max) speed in that group, must be found. Once found, decrease the range until a group of size remains in its bounds. By our definition, the min (max) value now changes, and the attacker knows that the previous value belongs to the vehicle that has been removed from the range. Note that if such a min/max vehicle exists in the graph, the attacker can find it given enough queries.
2.3.3 Case 3:
In this case, since all values are defined to be unique, querying on a range containing exactly vehicles returns values, each belonging to a specific vehicle. An attacker can query over a single coordinate at the left-most side of the graph and increase the range until a result is returned. The first time a result is returned, the minimum group size has been reached, and the attacker has the speed of each of the vehicles. Each speed cannot be attributed to a specific vehicle, but we will denote these values . The attacker now decreases the range’s size from the left until no result is returned; this indicates the range now only contains vehicles. Increasing the range to the right until a result is returned indicates that a new vehicle has been added to the range. Since all values are unique, one of the values will be missing from the results. This belongs to the left-most vehicle from the previous query results. Continuing this method until the entire graph has been scanned will reveal the speeds of each vehicle in the graph.
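The sliding-window procedure can be sketched in Python as follows. The sketch assumes a threshold of three vehicles, so that the minimum, median and maximum of a minimal range together expose every value in it; that threshold, the one-vehicle-per-coordinate layout, and the helper names are assumptions made for the example. Only vehicles that leave the window are attributed here; the values in the final window are known as a set but not assigned to positions.

```python
def make_queries(speeds, k=3):
    """Stand-in for the restricted min/median/max queries; None means refused."""
    def query(kind, lo, hi):
        vals = sorted(s for x, s in speeds.items() if lo <= x <= hi)
        if len(vals) < k:
            return None
        if kind == "min":
            return vals[0]
        if kind == "max":
            return vals[-1]
        return vals[(len(vals) - 1) // 2]   # lower median
    return query


def sliding_attack(query, lo, hi):
    def triple(a, b):
        low = query("min", a, b)
        if low is None:
            return None
        return {low, query("med", a, b), query("max", a, b)}

    recovered = {}
    left = right = lo
    # Grow to the right until the first answer: exactly three vehicles in range.
    while right <= hi and triple(left, right) is None:
        right += 1
    if right > hi:
        return recovered
    current = triple(left, right)
    while True:
        # Shrink from the left while the query is still answered.
        while triple(left + 1, right) is not None:
            left += 1
        leaving = left                  # removing this coordinate causes a refusal
        left += 1
        right += 1
        # Grow to the right until a new vehicle enters and the query is answered.
        while right <= hi and triple(left, right) is None:
            right += 1
        if right > hi:
            break
        new = triple(left, right)
        (speed,) = current - new        # the one value that disappeared
        recovered[leaving] = speed
        current = new
    return recovered


speeds = {0: 50, 2: 60, 3: 70, 5: 80, 6: 90}
print(sliding_attack(make_queries(speeds), 0, 6))
```

Running the same scan from the right-hand side would attribute the values left over in the final window.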
2.3.4 The General Case:
We show that for the general case, there exists a linear placement of vehicles such that at least vehicle will have a speed whose value will remain hidden from an attacker. Note that if a combination of queries can be used to attain the same results as the query , then a privacy attack can be performed in the manner detailed in Section 2.2.1. Hozo et al. 
Let be a set of vehicles positioned along a linear graph at coordinates at time . If , for any value there exists a corresponding assignment of speeds , such that the speed of cannot be determined by any attacker with access to the and queries over the graph.
We prove by induction for and , then extrapolate for and .
Show Correctness for
With vehicles positioned at , set the values of such that . Since the queries will only return results when the range queried contains the range . It is easy to see that:
As such, the value of is never revealed.
Assume Correctness for
Given a set of vehicles positioned at coordinates , assume there exists an assignment of corresponding speeds such that cannot be determined by an attacker with access to any number of queries with a limitation.
We assign such that for the subgraph , for , the value of is never revealed by any query . We note properties regarding of the node , placed at :
There exists only queryable range, , for which any query will take both and into consideration.
Regardless of the value of , the queries and cannot return as a result. (Otherwise, would have been a result of one of the queries over the subgraph )
Due to these properties, we must only ensure that the query
does not return as its result. Denote to be the result of . If then we set so that . Conversely, if then we set so that . We now have an assignment such that the value cannot be discovered by an attacker.
The parameter is defined as the minimum number of vehicles required to be in a range in order for a result to be returned. For any value of , increasing the value of only reduces the number of available queries that will return a result. Since it holds that there exists an assignment such that cannot be discovered for , then setting for the same assignment will not give any new information to the attacker and will remain unknown. It can be seen that this is true for any value such that . ∎

While Lemma 2.1 holds for any value of and , such an assignment, where a specific node is deterministically undiscoverable, is susceptible to prior knowledge attacks. In addition, in most real world cases, the value of is chosen to be an order of magnitude lower than so as to allow for many queries. We show that for these cases, specifically any case where and , the vehicle whose speed is never returned by any query can be chosen as any vehicle by the adversary.
Let be a set of vehicles positioned along a linear graph at coordinates at time . If , for any value there exists a corresponding assignment of speeds , such that there exists a node with speed which cannot be determined by any attacker with access to the and queries over the graph.
We prove by induction for and , then extrapolate for and .
Show Correctness for
With vehicles positioned at , set the values of such that . The value of cannot be determined by an attacker even by running all possible query combinations on the graph. The results of all such possible queries can be seen in Table 1.
Table 1: Query results for each range containing vehicles.
Assume Correctness for
Given a set of vehicles positioned at coordinates , assume there exists an assignment of corresponding speeds such that there exists some value belonging to some vehicle at position , which cannot be determined by an attacker with access to any number of queries under a limitation.
We assign such that for the subgraph , for , there exists some value of which is never revealed by any query . Assume . We note properties regarding of the node , placed at :
Regardless of the value of : . (i.e. cannot be the result of any query in the range )
regardless of the value of : . (i.e. cannot be the result of any query in the range )
Therefore, we must only assign such that it does not cause to be the result of any query. Define to be the result of . Due to the properties of , if then . Conversely, if then . Otherwise at least one of those queries would have returned as a result, which contradicts the induction assumption. Define to be the closest median value to from the previously stated queries.
We set to be some uniformly distributed random value between and . We now look at and note that for any value , the results of and are either the same value or adjacent values, as the speeds in the range differ by exactly value. Since no value is adjacent to , then cannot be the result of any value . There exist no other queries of the type which contain both and , therefore we now have an assignment such that the value cannot be discovered by an attacker.
The above holds for the assumption . It is easy to see that due to symmetry, the case where allows us to shift all values of one vehicle to the right, and assign the random value between and to . This completes correctness for all positions of .
Similar to Lemma 2.1, increasing for a given value of only reduces the amount of information available to the attacker. Therefore, if a value exists for an assignment in a graph with vehicles under the limitation (with ), it will exist for any value of such that .
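Existence claims of this kind can be checked by brute force on small instances. The sketch below enumerates every contiguous range that holds enough vehicles, collects all minimum, lower-median and maximum answers, and reports the speeds that no query ever returns. The function names, the example speeds and the threshold of four are illustrative assumptions, not values taken from the proofs above.

```python
from statistics import median_low


def exposed_values(speeds, k):
    """All values returned by some min/median/max query over a range with >= k vehicles."""
    coords = sorted(speeds)
    exposed = set()
    # Ranges whose endpoints are vehicle coordinates cover every distinct vehicle set.
    for i, lo in enumerate(coords):
        for hi in coords[i:]:
            vals = [s for x, s in speeds.items() if lo <= x <= hi]
            if len(vals) >= k:
                exposed.update((min(vals), median_low(vals), max(vals)))
    return exposed


def hidden_values(speeds, k):
    """Speeds that are never the answer to any permitted query."""
    return set(speeds.values()) - exposed_values(speeds, k)


print(hidden_values({0: 10, 1: 40, 2: 30, 3: 20, 4: 25}, 4))   # one speed stays hidden
print(hidden_values({0: 10, 1: 20, 2: 30, 3: 40, 4: 50}, 4))   # every speed is exposed
```

The two calls differ only in the assignment of speeds, which shows that whether a value stays hidden depends on the assignment and not only on the threshold.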
3 Collaborative Filtering
Collaborative filtering (CF) is a technique commonly used to build personalized recommendations on the Web. In collaborative filtering, algorithms are used to make automatic predictions about a user’s interests by compiling preferences from several users. In order to provide personalized information to a user, the CF system needs to be provided with sufficient information regarding his or her preferences, behavioral characteristics, as well as demographic information of the individual. The accuracy of the recommendations is dependent largely on how much of this information is known to the CF system. However, this information can prove to be extremely dangerous if it falls into the wrong hands. Several methods aimed at hiding and anonymizing user data have been proposed and studied in an attempt to reduce the privacy issues of collaborative filtering. Among these methods is the data obfuscation technique “Nearest Neighbor Data Substitution” (NeNDS) proposed by Parameswaran and Blough in . Using this approach, items in each column of the database are clustered into groups by closeness of their values, and a substitution algorithm is applied to each group. The algorithm gives each item a new location within the group such that each item now corresponds to a new row in the original database. The relative closeness in values of the substituted items allows for the recommendation system to maintain a good degree of approximation when the CF algorithm is applied to obtain recommendations, while the substitution itself offers a level of privacy by hiding the original values associated with each individual user. In this section, we show the possibility of a privacy attack on the substituted database by an attacker with partial knowledge of the original data.
3.1 The NeNDS Algorithm
The Nearest Neighbor Data Substitution (NeNDS) technique
is a lossless data obfuscation technique that preserves the privacy of
individual data elements by substituting them with one of their Euclidean space
neighbors. NeNDS uses a permutation-based approach in which groups of similar
items undergo permutation. The permutation approach hides the original value of
a data item by substituting it with another data item that is similar to it but
not the same. NeNDS treats each column in the database as a separate dataset.
The first step in NeNDS is the creation of similar sets of items called
neighborhoods. The items contained in each neighborhood are selected in a
manner that maintains Euclidean closeness between neighbors using some distance
measuring function suited to the data. Each data set is divided into a
pre-specified number of neighborhoods. The items in each neighborhood are then
permuted in such a way that each item is displaced from its original position,
no two items undergo swapping, and the difference between the values of the
original and the obfuscated items is minimal. The number of neighbors in each
neighborhood is denoted , with where
is the number of items in the dataset (this is due to the fact that does not allow any permutation and is the trivial case of
swapping between 2 items, which is easily reversible).
The substitution process is performed by determining the optimal permutation set subject to the following conditions:
No two elements in the neighborhood undergo swapping.
The elements are displaced from their original position.
Substitution is not performed between duplicate elements.
The permutation mapping is done by creating a tree depicting all possible permutation paths and selecting the path with the minimal maximum distance between any 2 substitutions. For example, we look at the case of the neighborhood . The optimal path for substitution would be with the new neighborhood order being and the maximal difference between any 2 substituted items being and . Once the substitutions in each neighborhood are complete, the column of the original database is replaced with a column containing the new item positions. The detailed algorithm can be found in . Note that this algorithm is deterministic for any given value of , and will yield the same permutations given any original order of the original dataset.
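The three substitution conditions and the min-max objective can be illustrated with a brute-force Python sketch. It enumerates every permutation of a small numeric neighborhood instead of building the permutation tree described above, so it is a stand-in for illustration only and not the NeNDS implementation; the function name and tie-breaking rule (first optimal permutation found) are assumptions.

```python
from itertools import permutations


def nends_like_permutation(neighborhood):
    """Return (obfuscated list, max displacement) for a small numeric neighborhood."""
    n = len(neighborhood)
    best, best_cost = None, None
    for p in permutations(range(n)):
        # p[i] is the index of the item that replaces the item at position i.
        if any(p[i] == i for i in range(n)):
            continue                                  # every item must be displaced
        if any(p[p[i]] == i for i in range(n)):
            continue                                  # no two items may swap
        if any(neighborhood[p[i]] == neighborhood[i] for i in range(n)):
            continue                                  # no substitution between duplicates
        cost = max(abs(neighborhood[p[i]] - neighborhood[i]) for i in range(n))
        if best_cost is None or cost < best_cost:
            best, best_cost = p, cost
    if best is None:
        raise ValueError("no valid substitution exists for this neighborhood")
    return [neighborhood[j] for j in best], best_cost


original = [10, 12, 15, 20]
obfuscated, cost = nends_like_permutation(original)
print(obfuscated, cost)
```

Exhaustive enumeration grows factorially with the neighborhood size, so it is only practical for the small neighborhoods used in examples.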
3.2 Privacy Attack on NeNDS
In this section we will show an attack on a NeNDS permutated database by an attacker with partial knowledge of the original database; specifically, the attacker knows the original position of at least items in each neighborhood. The attack is performed under the following assumptions:
The attacker has complete knowledge of the NeNDS algorithm.
The attacker knows the neighborhood size, used by the algorithm.
The attacker can measure the Euclidean distance between the items in the database.
The attacker has access to the output permutated database (i.e. the new positions of all items).
We will show the attack for a single dataset (column), however since the algorithm is performed independently for each dataset, this can be extended to the entire database. For a given dataset of size , we define the following notations:
Let be the original dataset .
Let be the NeNDS obfuscated dataset .
Let be the original data items in the neighborhood, .
Let be the obfuscated data items in the neighborhood, .
Let be the 2 items in whose original position is unknown to the attacker.
The attack is successful if the attacker can determine the original position in of and for all values of .
3.2.1 The Case of
We look at the simple case of the minimal neighborhood size, . In this case, we have for each value of the neighborhood . The attacker can only know the location of 1 of these items. Assume, without loss of generality, that the attacker knows the position of , and as such the original dataset to be where both and could be the original positions of and . We now look at the output neighborhood after the NeNDS algorithm. Due to the restrictions of the NeNDS algorithm which require each item to be relocated and do not allow swapping between 2 items, the resulting neighborhood can only be one of the following permutations:
Any other permutation would entail leaving an item in its original position. Assume permutation (1). The attacker can determine that the value could not have originally been in position since this is the current position of and the algorithm does not allow swapping between 2 items. Therefore, and . Assume permutation (2). The attacker can determine that the value could not have originally been in position for the same reason, and reaches the same conclusion - the original order for the neighborhood is .
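The elimination argument above can be mirrored in a short Python sketch. Given the obfuscated neighborhood, the known original positions, and the two values whose original positions are unknown, it tries both assignments and keeps those consistent with the two constraints used in the text (every item moves, no two items swap). The function name and the use of letters for values are illustrative assumptions, and values are assumed to be unique.

```python
def consistent_originals(observed, known, unknown_values):
    """observed: values by position after obfuscation.
    known: {position: value} for original positions the attacker knows.
    unknown_values: the two values whose original positions are unknown.
    """
    n = len(observed)
    free = [i for i in range(n) if i not in known]
    u, v = unknown_values
    new_pos = {val: pos for pos, val in enumerate(observed)}
    candidates = []
    for assignment in ((u, v), (v, u)):
        original = dict(known)
        original.update(zip(free, assignment))
        orig_pos = {val: pos for pos, val in original.items()}
        moved = all(orig_pos[val] != new_pos[val] for val in orig_pos)
        no_swap = all(
            not (new_pos[a] == orig_pos[b] and new_pos[b] == orig_pos[a])
            for a in orig_pos
            for b in orig_pos
            if a != b
        )
        if moved and no_swap:
            candidates.append([original[i] for i in range(n)])
    return candidates


# Neighborhood of three items; the attacker knows only that "a" was in position 0.
print(consistent_originals(["c", "a", "b"], {0: "a"}, ("b", "c")))
print(consistent_originals(["b", "c", "a"], {0: "a"}, ("b", "c")))
```

For a neighborhood of three items both possible outputs leave a single consistent original order, which is the conclusion reached above.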
3.2.2 The General Case of any
In this section we will show that the knowledge of original value positions is enough for an attacker to learn the original positions of all values in a neighborhood. We define and for any value to be the original and new location (row) of that value respectively. Taking some neighborhood in , the attacker knows the position for values in . For 2 values, , positions remain unknown. After obfuscation, all new positions are known to the attacker. With this knowledge, since the values in the neighborhood are chosen by their Euclidean closeness, the attacker learns the 2 values and their new positions . There remain 2 possible original positions between which the attacker cannot distinguish (i.e. each one of the values could have been at each one of the possible positions originally).
We now examine the new values in . There are 2 cases: either 1 of the values is or , or both values are from the other values in whose original position is known to the attacker. Note that the case cannot exist since by definition of the algorithm, no 2 items undergo swapping. We now show the attack for both cases, resulting in the discovery of the original positions for .
Assume, without loss of generality, that resides in a position whose original value is unknown, meaning was either or . It is easy to see that since no item remains in the same position after obfuscation. In addition, the remaining unknown position is . The attacker now knows the original position of both previously unknown values.
In this case, both and now contain values whose original positions were known to the attacker. We arbitrarily define those positions to be and and their original values and respectively. The attacker can now use the following method to backtrack the obfuscation path and find the original positions of and