Locality Sensitive Hashing with Extended Differential Privacy

10/19/2020 · by Natasha Fernandes, et al.

Extended differential privacy, a generalization of standard differential privacy (DP) using a general metric rather than the Hamming metric, has been widely studied to provide rigorous privacy guarantees while keeping high utility. However, existing works on extended DP focus on a specific metric such as the Euclidean metric, the l_1 metric, and the Earth Mover's metric, and cannot be applied to other metrics. Consequently, existing extended DP mechanisms are limited to a small number of applications such as location-based services and document processing. In this paper, we propose a mechanism providing extended DP with a wide range of metrics. Our mechanism is based on locality sensitive hashing (LSH) and randomized response, and can be applied to a wide variety of metrics including the angular distance (or cosine) metric, Jaccard metric, Earth Mover's metric, and l_p metric. Moreover, our mechanism works well for personal data in a high-dimensional space. We theoretically analyze the privacy properties of our mechanism, introducing new versions of concentrated and probabilistic extended DP to explain the guarantees provided. Finally, we apply our mechanism to friend matching based on high-dimensional personal data with an angular distance metric in the local model. We show that existing local DP mechanisms such as the RAPPOR do not work in this application. We also show through experiments that our mechanism makes possible friend matching with rigorous privacy guarantees and high utility.


1. Introduction

Extended differential privacy (DP), also called $d$-privacy (Chatzikokolakis et al., 2013), is a privacy definition that provides rigorous privacy guarantees while enabling high utility. Extended DP is a generalization of standard DP (Dwork, 2006; Dwork et al., 2006) in that the adjacency relation is defined in terms of a general metric (rather than the Hamming metric). A well-known application of extended DP is geo-indistinguishability (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015), which is an instance of extended differential privacy in the two-dimensional Euclidean space. Geo-indistinguishability can be used to guarantee that a user's location is indistinguishable from any location within a certain radius (e.g., a radius of 5km) in the local model, in which each user obfuscates her own data and sends it to a data collector. Geo-indistinguishability also results in higher utility for the task of estimating geographic population distributions (Alvim et al., 2018), when compared with the randomized response for multiple alphabets (Kairouz et al., 2016) under the local differential privacy (LDP) model.

Since extended DP is defined using a general metric, it has a wide range of potential applications. However, existing extended DP mechanisms are limited to a small number of applications; e.g., location-based services (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015; Kawamoto and Murakami, 2018), document processing (Fernandes et al., 2019), and linear queries in the centralized model (Kamalaruban et al., 2020). One reason for the limited number of applications is that existing work focuses on a specific metric. For example, the existing work on locations (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015; Kawamoto and Murakami, 2018), documents (Fernandes et al., 2019), and linear queries (Kamalaruban et al., 2020) focus on extended DP with the Euclidean metric, the Earth Mover's metric, and the summation of privacy budgets for attributes, respectively. Therefore, their mechanisms cannot be applied to other metrics such as the angular distance (or cosine) metric, Jaccard metric, and $l_p$ metric other than the Euclidean metric (i.e., $p \neq 2$). In addition, most of the studies on extended DP have focused on personal data in a two-dimensional space (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Kamalaruban et al., 2020; Kawamoto and Murakami, 2018; Oya et al., 2017; Shokri, 2015), and do not consider personal data in a high-dimensional space.

In this paper, we propose a new mechanism providing extended DP with a wide range of metrics. Our mechanism is based on locality sensitive hashing (LSH) (Gionis et al., 1999; Wang et al., 2016) and randomized response. In a nutshell, our mechanism embeds personal data with the original metric into a binary vector with the Hamming metric by using LSH, and then applies Warner's randomized response (Warner, 1965) to each bit of the binary vector. Since LSH can be applied to a wide range of metrics including the angular distance (or cosine) metric (Andoni et al., 2015; Charikar, 2002), Jaccard metric (Broder et al., 2000), Earth Mover's metric (Charikar, 2002), and $l_p$ metric with $p \in (0, 2]$ (Datar et al., 2004), our mechanism can be applied to these metrics as well. Moreover, our mechanism works well for personal data in a high-dimensional space, because LSH is a dimensionality reduction technique designed for high-dimensional data.

We apply our mechanism to friend matching (or friend recommendation) based on personal data (e.g., locations, rating history) (Agarwal and Bharadwaj, 2013; Chen and Zhu, 2015; Cheng et al., 2018; Dong et al., 2011; Guo et al., 2018; Li et al., 2017a, b; Liu and Mittal, 2016; Ma et al., 2018; Narayanan et al., 2011; Samanthula et al., 2015; Yang et al., 2019; Zhu et al., 2013). For example, consider a dataset of users who have visited certain Points of Interest (POIs). For each user, we can create a vector of visit-counts; i.e., each element in the user's vector contains the visit-count for the corresponding POI (or whether she has visited the POI). Users with similar vectors have a high probability of establishing new friendships, as shown in (Yang et al., 2019). Therefore, we could use the POI vector to recommend a new friend. Similarly, we can recommend a new friend based on a vector consisting of ratings for items, because users who have similar interests have similar rating vectors, as commonly known in item recommendation (Aggarwal, 2016). Since the distance between two vectors in such applications is usually given by the angular distance (or equivalently, the cosine distance) (Aggarwal, 2016), we use our mechanism with the angular distance metric.

The dimensionality of personal data in this application can be very large; e.g., the numbers of POIs and items can each be in the hundreds, thousands, or even millions. For such high-dimensional data, even non-private friend matching is challenging due to the curse of dimensionality (Indyk and Motwani, 1998). The problem is much harder when we also have to rigorously protect user privacy. We address both the utility issue and the privacy issue by introducing an extended DP mechanism based on LSH, as explained above.

Note that the privacy analysis of a private version of LSH is also very challenging, because LSH does not preserve the exact information about the original data space; i.e., it approximates the original space via hashing. In fact, many existing works on privacy-preserving LSH (Aghasaryan et al., 2013; Qi et al., 2017; Chen et al., 2019) fail to provide rigorous guarantees about user privacy (or only apply LSH and claim that it protects user privacy because LSH is a kind of non-invertible transformation). We point out, using a toy example, how the lack of rigorous guarantees can lead to a privacy breach. Then we theoretically analyze the privacy properties of our mechanism, and formally prove that it provides concentrated (Dwork and Rothblum, 2016) and probabilistic (Machanavajjhala et al., 2008) versions of extended DP.

Contributions. Our main contributions are as follows:

  • We propose a new mechanism providing extended DP with a wide range of metrics. Our algorithm is based on LSH and the randomized response, and can be applied to metrics including the angular distance (or cosine) metric, Jaccard metric, Earth Mover's metric, and $l_p$ metric. To our knowledge, this work is the first to provide extended DP with such a wide range of metrics.

  • We show that LSH itself does not provide strong privacy guarantees and could result in complete privacy collapse in some situations. We then prove that our extended DP mechanism provides rigorous privacy guarantees: concentrated and probabilistic versions of extended DP.

  • We apply our mechanism with an angular distance metric to friend matching based on rating history in a high-dimensional space. Then we compare our mechanism with the multivariate Laplace mechanism (Fernandes et al., 2019). Note that although the multivariate Laplace mechanism is designed for the Euclidean distance, we can convert the Euclidean distance to the angular distance. We show through experiments that our mechanism provides much higher utility than the multivariate Laplace mechanism (Fernandes et al., 2019). Our experimental results also show that our mechanism makes possible friend matching with rigorous privacy guarantees and high utility.

The rest of this paper is organized as follows. Section 2 introduces notations and recalls background on locality sensitive hashing (LSH), privacy measures, and privacy protection mechanisms. Section 3 illustrates how the privacy of LSH alone can break. Section 4 defines our LSH-based privacy mechanisms (LSHPMs). Section 5 presents two types of privacy provided by the LSHPMs. Section 6 shows an experimental evaluation of the LSHPMs. Section 7 concludes. All proofs can be found in the Appendix.

2. Preliminaries

In this section we introduce notations used in this paper and recall background on locality sensitive hashing (LSH), privacy measures, and privacy protection mechanisms.

2.1. Notations

Let $\mathbb{Z}$, $\mathbb{Z}_{\geq 0}$, $\mathbb{Z}_{>0}$, $\mathbb{R}$, $\mathbb{R}_{\geq 0}$, and $\mathbb{R}_{>0}$ be the sets of integers, non-negative integers, positive integers, real numbers, non-negative real numbers, and positive real numbers, respectively. Let $[0, 1]$ be the set of non-negative real numbers not greater than $1$. We denote by $\mathbb{S}^{n-1}$ the unit sphere in the $n$-dimensional Euclidean space (i.e., the set of all points $x \in \mathbb{R}^n$ satisfying $\|x\|_2 = 1$). Let $e$ be the base of the natural logarithm. Let $d_{euc}$ be the Euclidean distance; i.e., for real vectors $x, y \in \mathbb{R}^n$, $d_{euc}(x, y) = \|x - y\|_2$.

We denote by $\{0, 1\}^\kappa$ the set of all binary data of length $\kappa$. The Hamming distance $d_H$ between two binary data $b, b' \in \{0, 1\}^\kappa$ is defined as the number of positions in which they differ:

$d_H(b, b') = \#\{ i \mid b_i \neq b'_i \}$.

We denote the set of all probability distributions over a set $\mathcal{Y}$ by $\mathbb{D}\mathcal{Y}$. Let $\mathcal{N}(\mu, \sigma^2)$ be the normal distribution with a mean $\mu$ and a variance $\sigma^2$. For two finite sets $\mathcal{X}$ and $\mathcal{Y}$, we denote by $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ a randomized algorithm from $\mathcal{X}$ to $\mathcal{Y}$, and by $A(x)[y]$ (resp. by $A(x)[S]$) the probability that $A$ maps $x$ to $y$ (resp. to some element of a set $S \subseteq \mathcal{Y}$).

2.2. Locality Sensitive Hashing (LSH)

We denote by $\mathcal{X}$ the set of all possible input data. We introduce a (normalized) similarity function $\mathit{sim} : \mathcal{X} \times \mathcal{X} \rightarrow [0, 1]$ such that two inputs $x$ and $x'$ have a larger similarity $\mathit{sim}(x, x')$ when they are closer, and such that $\mathit{sim}(x, x') = 1$ when $x = x'$. Then we define a dissimilarity function $d$ over $\mathcal{X}$ by $d(x, x') = 1 - \mathit{sim}(x, x')$. When $d$ is symmetric and subadditive, then it is a metric. We will later instantiate $\mathit{sim}$ and $d$ with specific metrics corresponding to hashing schemes.

Now we introduce the notion of locality sensitive hashing (LSH) as a family of one-bit hash functions such that the probability of two inputs $x$ and $x'$ having different hash values is proportional to their dissimilarity $d(x, x')$.

Definition 2.1 (Locality sensitive hashing).

A locality sensitive hashing (LSH) scheme w.r.t. a dissimilarity function $d$ is a family $\mathcal{H}$ of functions from $\mathcal{X}$ to $\{0, 1\}$ coupled with a probability distribution $D_{\mathcal{H}}$ such that for any $x, x' \in \mathcal{X}$,

(1)  $\Pr[h(x) \neq h(x')] = d(x, x')$

where a function $h$ is chosen from $\mathcal{H}$ according to the distribution $D_{\mathcal{H}}$.

Note that by the definition of $d$, the probability of collision of two hash values is given by:

$\Pr[h(x) = h(x')] = 1 - d(x, x') = \mathit{sim}(x, x')$.

By using $\kappa$ hash functions $h_1, \ldots, h_\kappa$ independently drawn from $D_{\mathcal{H}}$, an input $x$ can be embedded into a $\kappa$-bit binary code as follows:

$H(x) = (h_1(x), h_2(x), \ldots, h_\kappa(x))$.

The function $H$ is called a $\kappa$-bit LSH function. We denote by $\Lambda$ the randomized algorithm that chooses a $\kappa$-bit LSH function $H$ according to the distribution $D_{\mathcal{H}}^\kappa$ and outputs the hash value $H(x)$ of a given input $x$.

2.3. Examples of LSHs

There are a variety of LSH families corresponding to useful metrics, such as the angular distance (Andoni et al., 2015; Charikar, 2002), Jaccard metric (Broder et al., 2000), Earth Mover's metric (Charikar, 2002), and $l_p$ metric with $p \in (0, 2]$ (Datar et al., 2004). In this section we present an LSH family based on random projection, called random-projection-based hashing, which corresponds to the angular distance.

A random-projection-based hashing is a one-bit hashing associated with a randomly chosen normal vector $v$ that defines a hyperplane through the origin. Formally, we take the input domain as $\mathcal{X} = \mathbb{S}^{n-1}$, and define a random-projection-based hashing as a function $h_v : \mathcal{X} \rightarrow \{0, 1\}$ such that:

$h_v(x) = 0$ if $\langle x, v \rangle \geq 0$, and $h_v(x) = 1$ otherwise,

where $\langle \cdot, \cdot \rangle$ is the inner product and $v \in \mathbb{R}^n$ is a real vector whose every element is independently chosen from the standard normal distribution $\mathcal{N}(0, 1)$.

Then the random-projection-based hashing is an LSH w.r.t. a similarity measure defined below. The angular distance $d_\theta$ and the angular similarity $\mathit{sim}_\theta$ are respectively defined as the functions $d_\theta, \mathit{sim}_\theta : \mathcal{X} \times \mathcal{X} \rightarrow [0, 1]$ such that for any $x, x' \in \mathcal{X}$,

(2)  $d_\theta(x, x') = \frac{1}{\pi} \arccos\left( \frac{\langle x, x' \rangle}{\|x\|_2 \, \|x'\|_2} \right)$
(3)  $\mathit{sim}_\theta(x, x') = 1 - d_\theta(x, x')$

Then two vectors $x$ and $x'$ have different hash values if and only if the random hyperplane separates $x$ and $x'$. Hence $\Pr[h_v(x) \neq h_v(x')] = d_\theta(x, x')$. Therefore $h_v$ is an LSH w.r.t. the angular distance $d_\theta$.

For example, $d_\theta(x, x') = 0$ iff $x$ and $x'$ point in the same direction, while $d_\theta(x, x') = 1$ iff they point in opposite directions. $d_\theta(x, x') = 0.5$ represents that the two vectors $x$ and $x'$ are orthogonal, namely, $\langle x, x' \rangle = 0$.
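To make the construction concrete, the following is a minimal Python sketch of random-projection-based hashing (the function names and parameters are ours, not the paper's): it draws a $\kappa$-bit LSH function as $\kappa$ random normal vectors and checks empirically that the fraction of disagreeing bits approaches the angular distance $d_\theta$, as Eqn (1) requires.

    import numpy as np

    def angular_distance(x, y):
        """Angular distance d_theta(x, y) in [0, 1], as in Eqn (2)."""
        cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

    def sample_lsh(n, kappa, rng):
        """Draw a kappa-bit random-projection LSH function: one normal vector per bit."""
        return rng.standard_normal((kappa, n))

    def hash_bits(V, x):
        """kappa-bit hash H(x): bit i is 1 iff <x, v_i> < 0, else 0."""
        return (V @ x < 0).astype(np.uint8)

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(100), rng.standard_normal(100)
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

    # With many independent one-bit hashes, the fraction of differing bits
    # approaches d_theta(x, y), matching Pr[h(x) != h(y)] = d_theta(x, y).
    V = sample_lsh(100, 10000, rng)
    print(np.mean(hash_bits(V, x) != hash_bits(V, y)), angular_distance(x, y))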

2.4. Approximate Nearest Neighbor Search

Next we recall the nearest neighbor search problem and utility measures for approximate nearest neighbor algorithms.

Definition 2.2 (Nearest neighbor search).

Let $d$ be a metric over $\mathcal{X}$. Given a dataset $S \subseteq \mathcal{X}$, the nearest neighbor search for a data point $x \in \mathcal{X}$ is the problem of finding the point in $S$ closest to $x$ w.r.t. the metric $d$. A $k$-nearest neighbor search for an $x \in \mathcal{X}$ is the problem of finding the $k$ points in $S$ closest to $x$.

A naive and exact approach to nearest neighbor search is to perform a pairwise comparison of $x$ against every data point, requiring a number of distance computations linear in $|S|$ (and similarly for $k$ nearest neighbors). For a very large dataset $S$, however, nearest neighbor search is computationally expensive, and approaches that improve time efficiency typically trade it off against space efficiency (Andoni et al., 2018). An alternative approach proposed by Indyk and Motwani (Indyk and Motwani, 1998) is to employ LSH to efficiently compute approximate nearest neighbors.

To evaluate the utility of approximate nearest neighbors, we use the average utility loss defined as follows.

Definition 2.3 (Utility loss).

Assume that an approximate algorithm produces $k$ approximate nearest neighbors $y_1, \ldots, y_k$ for a data point $x$ in terms of a metric $d$. The average utility loss for $x$ w.r.t. the true $k$ nearest neighbors $y_1^*, \ldots, y_k^*$ is given by:

$\mathit{loss}(x) = \frac{1}{k} \sum_{i=1}^{k} d(x, y_i) - \frac{1}{k} \sum_{i=1}^{k} d(x, y_i^*)$.

That is, we compute the average distance of the returned nearest neighbors from the data point, compared with the average distance of the true nearest neighbors. We prefer this measure to recall and precision measures because the LSH algorithm can return many neighbors with the same distance (i.e., in the same hash bucket), and our approximate algorithm chooses among such neighbors at random (a reasonable choice, given that the usefulness of the output is determined by how similar the neighbors are to the original data point $x$).
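As a small illustration (our own sketch, not the authors' code), the utility loss of Definition 2.3 can be computed as follows, where dist is any distance function such as the angular distance above:

    import numpy as np

    def average_utility_loss(x, approx_nn, true_nn, dist):
        """Average distance of the k returned neighbors from x, minus the
        average distance of the k true nearest neighbors (Definition 2.3)."""
        return (np.mean([dist(x, y) for y in approx_nn])
                - np.mean([dist(x, y) for y in true_nn]))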

2.5. Privacy Measures

In this section we recall notions of privacy measures: differential privacy (DP) (Dwork, 2006), extended DP with a metric (Chatzikokolakis et al., 2013), and concentrated DP (Dwork and Rothblum, 2016).

First, the notion of differential privacy for a randomized algorithm $A$ represents that an observer of the output of $A$ cannot distinguish between given adjacent inputs $x$ and $x'$.

Definition 2.4 (Differential privacy).

A randomized algorithm $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ provides $\varepsilon$-differential privacy (DP) w.r.t. an adjacency relation $\Phi \subseteq \mathcal{X} \times \mathcal{X}$ if for any $(x, x') \in \Phi$ and any $S \subseteq \mathcal{Y}$,

$A(x)[S] \leq e^{\varepsilon} \, A(x')[S]$,

where the probability is taken over the random choices in $A$.

We then review a notion of extended differential privacy (Chatzikokolakis et al., 2013; Kawamoto and Murakami, 2019), which relaxes DP in the sense that when two inputs $x$ and $x'$ are closer, the output distributions are less distinguishable. In this paper, we introduce a generalized definition using a function $f$ over $\mathcal{X}^2$ and an arbitrary function $\delta$ over $\mathcal{X}^2$, rather than a metric, as follows.

Definition 2.5 (Extended differential privacy).

Given two functions $f : \mathcal{X}^2 \rightarrow \mathbb{R}_{\geq 0}$ and $\delta : \mathcal{X}^2 \rightarrow [0, 1]$, a randomized algorithm $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ provides $(f, \delta)$-extended differential privacy (XDP) if for all $x, x' \in \mathcal{X}$ and for any $S \subseteq \mathcal{Y}$,

$A(x)[S] \leq e^{f(x, x')} \, A(x')[S] + \delta(x, x')$,

where the probability is taken over the random choices in $A$.

Hereafter we sometimes abuse notation and write $(f, \delta)$-XDP with a constant $\delta \in [0, 1]$ when $\delta$ is a constant function returning the same real number independently of the inputs $x$ and $x'$.

Next, we review the notion of privacy loss and privacy loss distribution (Dwork and Rothblum, 2016).

Definition 2.6 (Privacy loss random variable).

Let $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ be a randomized algorithm. The privacy loss of an output $y \in \mathcal{Y}$ w.r.t. inputs $x, x' \in \mathcal{X}$ is defined by:

$\mathcal{L}_{x, x'}(y) = \ln \frac{A(x)[y]}{A(x')[y]}$,

where the probability is taken over the random choices in $A$. Note that when $A(x)[y] \neq 0$ and $A(x')[y] = 0$, then $\mathcal{L}_{x, x'}(y) = \infty$. When $A(x)[y] = 0$, then $\mathcal{L}_{x, x'}(y) = -\infty$. Then the privacy loss random variable of $A$ over $x, x'$ is the real-valued random variable representing the privacy loss $\mathcal{L}_{x, x'}(y)$ where $y$ is sampled from $A(x)$.

Given inputs $x, x'$ and a privacy loss value $\ell$, let $\mathcal{Y}_\ell = \{ y \in \mathcal{Y} \mid \mathcal{L}_{x, x'}(y) = \ell \}$. Then the privacy loss distribution $\omega$ of $A$ over $x, x'$ is defined as a probability distribution over $\mathbb{R}$ such that:

$\omega(\ell) = A(x)[\mathcal{Y}_\ell]$.

Then mean-concentrated differential privacy (Dwork and Rothblum, 2016) is defined using the notion of subgaussian random variables as follows.

Definition 2.7 (Subgaussian).

For a $\sigma \in \mathbb{R}_{>0}$, a random variable $X$ over $\mathbb{R}$ is $\sigma$-subgaussian if for all $t \in \mathbb{R}$, $\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2 / 2}$. A random variable $X$ is subgaussian if there exists a $\sigma \in \mathbb{R}_{>0}$ such that $X$ is $\sigma$-subgaussian.

Definition 2.8 (Mean-concentrated DP).

Let $\mu \in \mathbb{R}_{\geq 0}$ and $\sigma \in \mathbb{R}_{>0}$. A randomized algorithm $A$ provides $(\mu, \sigma)$-mean-concentrated differential privacy (CDP) w.r.t. an adjacency relation $\Phi$ if for any $(x, x') \in \Phi$, the privacy loss random variable $L$ of $A$ over $x, x'$ satisfies $\mathbb{E}[L] \leq \mu$, and $L - \mathbb{E}[L]$ is $\sigma$-subgaussian.

Finally, we recall the notion of probabilistic differential privacy (Dwork et al., 2010; Sommer et al., 2019).

Definition 2.9 (Probabilistic DP).

Let $\varepsilon \in \mathbb{R}_{\geq 0}$ and $\delta \in [0, 1]$. A randomized algorithm $A$ provides $(\varepsilon, \delta)$-probabilistic differential privacy (PDP) w.r.t. an adjacency relation $\Phi$ if for any $(x, x') \in \Phi$, the privacy loss random variable $L$ of $A$ over $x, x'$ satisfies $\Pr[L > \varepsilon] \leq \delta$ and $\Pr[L < -\varepsilon] \leq \delta$.

2.6. Privacy Mechanisms

Finally, we recall two popular privacy protection mechanisms: the randomized response (Warner, 1965) and the Laplace mechanism (Dwork et al., 2006).

Definition 2.10 ($\varepsilon$-randomized response).

For an $\varepsilon \in \mathbb{R}_{\geq 0}$, the $\varepsilon$-randomized response (or $\varepsilon$-RR for short) over the binary alphabet $\{0, 1\}$ is the randomized algorithm $Q_{RR}$ that maps a bit $b$ to another bit $b'$ with the following probability:

$Q_{RR}(b)[b'] = \frac{e^{\varepsilon}}{e^{\varepsilon} + 1}$ if $b' = b$, and $Q_{RR}(b)[b'] = \frac{1}{e^{\varepsilon} + 1}$ otherwise.

Then the $\varepsilon$-RR provides $\varepsilon$-DP.
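A one-line sketch of the $\varepsilon$-RR (our own code, assuming the keep/flip probabilities above):

    import math, random

    def randomized_response(bit, eps, rng=random):
        """epsilon-RR: keep the bit w.p. e^eps / (e^eps + 1), flip it otherwise."""
        return bit if rng.random() < math.exp(eps) / (math.exp(eps) + 1.0) else 1 - bit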

Definition 2.11 (Laplace mechanism).

Given an $\varepsilon \in \mathbb{R}_{>0}$, an input domain $\mathcal{X} \subseteq \mathbb{R}^n$, and a metric $d$ over $\mathcal{X}$, the $\varepsilon$-Laplace mechanism over $\mathcal{X}$ is the randomized algorithm that maps an input $x$ to an output $y$ with probability $c \, e^{-\varepsilon \, d(x, y)}$, where $c$ is a normalization constant.

In Section 6, we use a multivariate Laplace mechanism over $\mathbb{R}^n$ with the Euclidean metric $d_{euc}$.

3. Privacy Properties of LSH

Several works in the literature make reference to the privacy-preserving properties of LSH (Qi et al., 2017; Aghasaryan et al., 2013; Chow et al., 2012). The privacy guarantee attributed to LSH mechanisms hinges on the hash function, which 'protects' an individual's private attributes by revealing only their hash bucket. We now apply a formal analysis to LSH and explain why LSH implementations do not provide strong privacy guarantees and could, in some situations, result in complete privacy collapse for the individual.

3.1. Modeling LSH

We present a simple example to show how privacy can break down. Consider the set of secret inputs $\mathcal{X} = \{x_1, x_2, x_3\}$ with $x_1 = (0, 1)$, $x_2 = (1, 0)$, and $x_3 = (1, 1)$. Each element of $\mathcal{X}$ could, for example, correspond to whether or not an individual has rated two movies $m_1$ and $m_2$. Then we model an LSH as a probabilistic channel that maps a secret input to a binary observation.

For brevity we deal with a single random-projection-based hashing as described in Section 2.3. That is, we randomly choose a vector $v$ representing the normal to a hyperplane, and compute the inner product of $v$ with an input vector $x$. The hash function outputs $1$ if the inner product is negative and $0$ otherwise. For example, if $v = (1, -1)$ is chosen, then the hash function $h_v$ is defined as:

$h_v(x_1) = 1$, $h_v(x_2) = 0$, $h_v(x_3) = 0$.

In fact, there are exactly 6 possible (deterministic) hash functions for any choice of the normal vector $v$, corresponding to hyperplanes which separate different subsets of points:

$h_1 : (x_1, x_2, x_3) \mapsto (1, 0, 0)$   $h_2 : (x_1, x_2, x_3) \mapsto (0, 1, 1)$
$h_3 : (x_1, x_2, x_3) \mapsto (0, 1, 0)$   $h_4 : (x_1, x_2, x_3) \mapsto (1, 0, 1)$
$h_5 : (x_1, x_2, x_3) \mapsto (0, 0, 0)$   $h_6 : (x_1, x_2, x_3) \mapsto (1, 1, 1)$

Each of $h_1$, $h_2$, $h_3$, and $h_4$ occurs with probability $1/8$, while $h_5$ and $h_6$ each occur with probability $1/4$. The resulting channel, computed as the probabilistic sum of these deterministic hash functions, turns out to leak no information on the secret input: every input is mapped to each output with probability exactly $1/2$.

This indicates that the LSH mechanism above is perfectly private. However, in practice LSH also requires the release of the choice of the normal vector $v$ (e.g., (Chow et al., 2012)). (In fact, since the channel on its own leaks nothing, there must be further information released in order to learn anything useful from this channel.) In other words, the choice of hash function is leaked. Notice that in our example, the functions $h_1$ to $h_4$ correspond to deterministic mechanisms which leak exactly 1 bit of the secret, while $h_5$ and $h_6$ leak nothing. In other words, with probability $1/2$, 1 bit of the 2-bit secret is leaked. Not only that, but mechanisms $h_1$ and $h_2$ leak whether the secret is $x_1$ exactly, and similarly, mechanisms $h_3$ and $h_4$ leak whether the secret is $x_2$ exactly. Thus, the release of the normal vector destroys the privacy guarantee.
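The collapse above is easy to reproduce numerically. The following sketch (our own; the secrets $x_1, x_2, x_3$ are as reconstructed in this section) estimates the output distribution of the channel with and without conditioning on the hyperplane: marginally every secret yields each output with probability about $1/2$, but once $v$ is fixed the output is a deterministic function of the secret.

    import numpy as np

    rng = np.random.default_rng(1)
    secrets = {"x1": np.array([0.0, 1.0]),
               "x2": np.array([1.0, 0.0]),
               "x3": np.array([1.0, 1.0])}

    # Marginal channel: estimate Pr[h_v(x) = 1] over the random normal vector v.
    V = rng.standard_normal((200_000, 2))
    for name, x in secrets.items():
        print(name, np.mean(V @ x < 0))   # all close to 1/2: nothing is leaked

    # Once v is released, the hash is deterministic: an adversary who sees
    # v and h_v(x) learns the equivalence class of x under h_v.
    v = np.array([1.0, -1.0])
    print({name: int(v @ x < 0) for name, x in secrets.items()})  # x1 -> 1, x2 -> 0, x3 -> 0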

3.2. The Guarantee of LSH

In general, for any number of hash functions and any length of input, an LSH mechanism which releases its choice of hyperplanes also leaks its choice of deterministic mechanism. This means that it leaks the equivalence classes of the secrets. Such mechanisms belong to the '$k$-anonymity' style of privacy mechanisms, which promise privacy by hiding secrets in equivalence classes of size at least $k$. These have been shown to be unsafe due to their failure to compose well (Ganta et al., 2008; Fernandes et al., 2018). This failure leads to the potential for linkage or intersection attacks by an adversary armed with auxiliary information. For this reason, we consider compositionality an essential property for a privacy-preserving system. LSH with hyperplane release does not provide such privacy guarantees.

4. LSH-based Privacy Mechanisms

In this section we introduce a new privacy protection mechanism called an LSH-based privacy mechanism (LSHPM). Roughly speaking, this mechanism is an extension of RAPPOR (Erlingsson et al., 2014) with respect to a locality sensitive hashing (LSH).

4.1. Bitwise Randomized Response

We first define the bitwise randomized response as the privacy mechanism that applies the randomized response to each bit of the input independently.

Definition 4.1 ($(\varepsilon, \kappa)$-bitwise randomized response).

Let $\varepsilon \in \mathbb{R}_{\geq 0}$, $\kappa \in \mathbb{Z}_{>0}$, and $Q_{RR}$ be the $\varepsilon$-randomized response. The $(\varepsilon, \kappa)$-bitwise randomized response (or $(\varepsilon, \kappa)$-BRR for short) is defined as the randomized algorithm $Q_{BRR} : \{0, 1\}^\kappa \rightarrow \mathbb{D}\{0, 1\}^\kappa$ that maps a bitstring $b = (b_1, \ldots, b_\kappa)$ to another $b' = (b'_1, \ldots, b'_\kappa)$ with the following probability:

$Q_{BRR}(b)[b'] = \prod_{i=1}^{\kappa} Q_{RR}(b_i)[b'_i]$.

Then $Q_{BRR}$ provides XDP w.r.t. the Hamming distance $d_H$.

Proposition 1 (XDP of BRR).

Let $\varepsilon \in \mathbb{R}_{\geq 0}$ and $\kappa \in \mathbb{Z}_{>0}$. The $(\varepsilon, \kappa)$-bitwise randomized response $Q_{BRR}$ provides $(\varepsilon \cdot d_H, 0)$-XDP.
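A sketch of the $(\varepsilon, \kappa)$-BRR (ours, following Definition 4.1): each bit is flipped independently with probability $1/(e^\varepsilon + 1)$.

    import numpy as np

    def bitwise_randomized_response(bits, eps, rng):
        """(eps, kappa)-BRR: apply the eps-RR to every bit independently."""
        flips = rng.random(bits.shape) < 1.0 / (np.exp(eps) + 1.0)
        return np.bitwise_xor(bits, flips.astype(bits.dtype))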

4.2. Construction of LSHPMs

Now we introduce an LSH-based privacy mechanism (LSHPM) as a randomized algorithm that (i) randomly chooses a $\kappa$-bit LSH function $H$, (ii) computes the $\kappa$-bit hash code $H(x)$ of a given input $x$, and then (iii) applies the bitwise randomized response $Q_{BRR}$ to $H(x)$.

Formally, we define the mechanism as follows. Recall that $\Lambda$ is the randomized algorithm that randomly chooses a $\kappa$-bit LSH function $H$ according to a distribution and outputs the hash value $H(x)$ of a given input $x$. (See Section 2.2 for details.)

Definition 4.2 (LSHPM).

Let $Q_{BRR}$ be the $(\varepsilon, \kappa)$-bitwise randomized response. The $(\varepsilon, \kappa)$-LSH-based privacy mechanism with a $\kappa$-bit LSH function $H$ is the randomized algorithm defined by $Q_H = Q_{BRR} \circ H$. Given a distribution $\lambda$ over the $\kappa$-bit LSH functions, the $(\varepsilon, \kappa)$-LSH-based privacy mechanism w.r.t. $\lambda$ is the randomized algorithm defined by $Q_\lambda = Q_{BRR} \circ \Lambda$.

It should be noted that there are two kinds of randomness in the LSHPM: (a) the randomness in choosing a (deterministic) LSH function from the family (e.g., the random seed in the random-projection-based hashing; more specifically, in the definition of $\Lambda$, a tuple of seeds for the $\kappa$-bit LSH function is randomly chosen), and (b) the random noise added by the bitwise randomized response $Q_{BRR}$. We can assume that each user of this privacy mechanism selects the input $x$ independently of both kinds of randomness, since they wish to protect their own privacy when publishing their obfuscated data.

In practical settings, the same LSH function $H$ is often employed to produce hash values of different inputs, namely, multiple hash values are dependent on an identical hash seed (e.g., multiple users share the same LSH function so that they can compare their hash values). Furthermore, the adversary may obtain the hash function $H$ itself (or the seed used to produce $H$), and might be able to learn a set of possible inputs that produce the same hash value $H(x)$ without knowing the actual input $x$. Therefore, the hash value may reveal partial information on the input (see Section 3), and the bitwise randomized response is crucial in guaranteeing privacy.
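Putting the two pieces together, a minimal sketch of $Q_H = Q_{BRR} \circ H$ (our illustration, using the random-projection LSH of Section 2.3) looks as follows; note how the shared projection matrix V plays the role of randomness (a), while the per-user bit flips supply randomness (b):

    import numpy as np

    def lshpm(x, V, eps, rng):
        """(eps, kappa)-LSHPM Q_H = Q_BRR o H for a shared kappa-bit LSH
        function H given by the projection matrix V."""
        bits = (V @ x < 0).astype(np.uint8)                        # H(x), randomness (a)
        flips = rng.random(len(bits)) < 1.0 / (np.exp(eps) + 1.0)  # randomness (b)
        return np.bitwise_xor(bits, flips.astype(np.uint8))

    shared_rng = np.random.default_rng(42)
    V = shared_rng.standard_normal((20, 100))    # 20-bit LSH function shared by all users
    user_vector = shared_rng.standard_normal(100)
    noisy_hash = lshpm(user_vector, V, eps=1.0, rng=np.random.default_rng())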

5. Privacy Guarantees for LSHPMs

In this section we show two types of privacy that the LSHPMs guarantee: (i) the exact privacy that we learn after both input vectors and hash seeds are selected, and (ii) the expected privacy that represents the probability distribution of possible levels of exact privacy. To define the latter, we introduce a variety of privacy notions with shared randomness, including new notions of extended/concentrated privacy with a metric.

5.1. Exact Privacy of LSHPMs

The exact degree of privacy provided by LSHPMs depends on the random choice of hash seeds. This means that some user's input $x$ may be protected only weakly when an 'unlucky' seed is chosen to produce the hash value $H(x)$. For example, if many hyperplanes given by such unlucky seeds split a small specific area in the input space $\mathcal{X}$, then the hash value reveals much information on an input chosen from that small area.

Hence we can learn the exact degree of privacy guarantee only after obtaining the hash seeds and the input vectors. Formally, we show the exact privacy provided by LSHPMs as follows.

Proposition 2 (Exact privacy of $Q_H$).

Let $\varepsilon \in \mathbb{R}_{\geq 0}$, $H$ be a $\kappa$-bit LSH function, and $d_{LSH}$ be the pseudometric defined by $d_{LSH}(x, x') = d_H(H(x), H(x'))$ for each $x, x' \in \mathcal{X}$. Then the $(\varepsilon, \kappa)$-LSH-based privacy mechanism $Q_H$ provides $(\varepsilon \cdot d_{LSH}, 0)$-XDP.

The above proposition implies that an unlucky user can suffer a large privacy loss even though the LSHPM uses the randomized response, simply because LSH preserves the distance between inputs only probabilistically and approximately. By Proposition 2, the LSHPM provides $\kappa\varepsilon$-DP in the worst case, i.e., when the Hamming distance between hashed vectors is maximum (namely $\kappa$) due to unlucky hash seeds and/or a too large original distance between the inputs $x, x'$.

Proposition 3 (Worst-case privacy of $Q_H$).

Let $\varepsilon \in \mathbb{R}_{\geq 0}$ and $H$ be a $\kappa$-bit LSH function. The $(\varepsilon, \kappa)$-LSH-based privacy mechanism $Q_H$ provides $\kappa\varepsilon$-DP.

This shows that $Q_H$ achieves weaker privacy for a larger $\kappa$.

5.2. Privacy Notion with a Shared Randomness

Next we introduce expected privacy notions to characterize the range of possible levels of exact privacy. As shown later, the privacy loss follows a probability distribution due to the random choice of hash seeds. For example, the number of hyperplanes splitting a particular small area in the input space can vary depending on the random choice of hash seeds.

To formalize the LSHPM $Q_\lambda$'s expected privacy, we need to deal with the same hash seeds shared among multiple users. Thus we propose new privacy notions for privacy protection mechanisms that share randomness among them.

Hereafter we denote by $\mathcal{S}$ a finite set of shared inputs, and by $A : \mathcal{S} \times \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ a randomized algorithm from $\mathcal{X}$ to $\mathcal{Y}$ with a shared input in $\mathcal{S}$. Given a distribution $\lambda$ over $\mathcal{S}$, we denote by $A_\lambda$ the randomized algorithm that draws a shared input $s$ from $\lambda$ and behaves as $A(s, \cdot)$; i.e., for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, $A_\lambda(x)[y] = \sum_{s \in \mathcal{S}} \lambda(s) \, A(s, x)[y]$. For brevity we sometimes abbreviate $A(s, \cdot)$ as $A_s$.

We first introduce privacy loss variables with shared randomness as follows.

Definition 5.1 (Privacy loss random variable with shared randomness).

Given a randomized algorithm $A : \mathcal{S} \times \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ with a shared input $s \in \mathcal{S}$, the privacy loss of an output $y$ w.r.t. inputs $x, x'$ and $s$ is defined by:

$\mathcal{L}_{s, x, x'}(y) = \ln \frac{A_s(x)[y]}{A_s(x')[y]}$,

where the probability is taken over the random choices in $A_s$. Given a distribution $\lambda$ over $\mathcal{S}$, the privacy loss random variable of $A$ over $x, x'$ w.r.t. $\lambda$ is the real-valued random variable representing the privacy loss $\mathcal{L}_{s, x, x'}(y)$ where $s$ is sampled from $\lambda$ and $y$ is sampled from $A_s(x)$. Here we call $s$ a shared randomness.

Then the privacy loss distribution of $A$ over $x, x'$ w.r.t. $\lambda$ is defined analogously to Definition 2.6. Now we extend mean-concentrated DP with shared randomness as follows.

Definition 5.2 (Mean-concentrated DP with shared randomness).

Let $\mu \in \mathbb{R}_{\geq 0}$, $\sigma \in \mathbb{R}_{>0}$, and $\lambda \in \mathbb{D}\mathcal{S}$. A randomized algorithm $A$ provides $(\mu, \sigma)$-mean-concentrated differential privacy with shared randomness (rCDP) w.r.t. an adjacency relation $\Phi$ if for all $(x, x') \in \Phi$, the privacy loss random variable $L$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\mathbb{E}[L] \leq \mu$, and $L - \mathbb{E}[L]$ is $\sigma$-subgaussian.

5.3. Concentrated/Probabilistic Dp with a Metric

Next we introduce a new privacy notion, called mean-concentrated XDP (abbreviated as CXDP), that relaxes CDP by incorporating a metric as follows.

Definition 5.3 (Mean-concentrated XDP).

Let $\mu \in \mathbb{R}_{\geq 0}$, $\sigma \in \mathbb{R}_{>0}$, $\lambda \in \mathbb{D}\mathcal{S}$, and $d$ be a metric over $\mathcal{X}$. A randomized algorithm $A$ provides $(\mu \cdot d, \sigma)$-mean-concentrated extended differential privacy (CXDP) if for all $x, x' \in \mathcal{X}$, the privacy loss random variable $L$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\mathbb{E}[L] \leq \mu \cdot d(x, x')$, and $L - \mathbb{E}[L]$ is $\sigma$-subgaussian.

To explain the meaning of CXDP, we also introduce probabilistic XDP (abbreviated as PXDP) as a privacy notion that relaxes XDP and PDP as follows.

Definition 5.4 (Probabilistic XDP).

Let $f : \mathcal{X}^2 \rightarrow \mathbb{R}_{\geq 0}$, $\delta : \mathcal{X}^2 \rightarrow [0, 1]$, and $\lambda \in \mathbb{D}\mathcal{S}$. A randomized algorithm $A$ provides $(f, \delta)$-probabilistic extended differential privacy (PXDP) if for all $x, x' \in \mathcal{X}$, the privacy loss random variable $L$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\Pr[L > f(x, x')] \leq \delta(x, x')$.

Again, we sometimes abuse notation and simply write $(f, \delta)$-PXDP with a constant $\delta$ when $\delta(x, x')$ is constant for all $x, x'$.

Now we show that CXDP implies PXDP, and that PXDP implies XDP as follows.

Proposition 4 (CXDP $\Rightarrow$ PXDP).

Let $\mu \in \mathbb{R}_{\geq 0}$, $\sigma \in \mathbb{R}_{>0}$, $\lambda \in \mathbb{D}\mathcal{S}$, $A$ be a randomized algorithm, and $d$ be a metric over $\mathcal{X}$. Let $\delta \in (0, 1]$ and $\varepsilon = \sigma \sqrt{2 \ln \frac{1}{\delta}}$. We define $f$ by $f(x, x') = \mu \cdot d(x, x') + \varepsilon$ for all $x, x' \in \mathcal{X}$. If $A$ provides $(\mu \cdot d, \sigma)$-CXDP, then it provides $(f, \delta)$-PXDP.

Proposition 5 (PXDP $\Rightarrow$ XDP).

Let $\lambda \in \mathbb{D}\mathcal{S}$, $A$ be a randomized algorithm, $f : \mathcal{X}^2 \rightarrow \mathbb{R}_{\geq 0}$, and $\delta : \mathcal{X}^2 \rightarrow [0, 1]$. If $A_\lambda$ provides $(f, \delta)$-PXDP, then it provides $(f, \delta)$-XDP.

5.4. Expected Privacy of LSHPMs

Next we show that the LSHPM $Q_\lambda$ provides CXDP and PXDP. To derive these, we first prove that the Hamming distance between hash values follows a binomial distribution.

Proposition 6 (Distribution of the Hamming distance of hash values).

Let $(\mathcal{H}, D_{\mathcal{H}})$ be an LSH scheme w.r.t. a metric $d$ over $\mathcal{X}$ coupled with a distribution $D_{\mathcal{H}}$. Let $x, x' \in \mathcal{X}$ be any two inputs, and $W$ be the random variable of the Hamming distance between their $\kappa$-bit hash values, i.e., $W = d_H(H(x), H(x'))$ where a $\kappa$-bit LSH function $H$ is drawn from the distribution $D_{\mathcal{H}}^\kappa$. Then $W$ follows the binomial distribution with mean $\kappa \, d(x, x')$ and variance $\kappa \, d(x, x') (1 - d(x, x'))$.
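Proposition 6 is easy to check empirically with the random-projection LSH (a sketch under our own parameter choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, kappa, trials = 50, 30, 20_000
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    d = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y))) / np.pi

    samples = []
    for _ in range(trials):
        V = rng.standard_normal((kappa, n))      # fresh kappa-bit LSH function
        samples.append(int(np.sum((V @ x < 0) != (V @ y < 0))))
    W = np.array(samples)

    print(W.mean(), kappa * d)             # empirical mean vs kappa * d(x, x')
    print(W.var(), kappa * d * (1 - d))    # empirical variance vs kappa * d (1 - d)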

Now we show that the LSHPM $Q_\lambda$ provides CXDP as follows.

Theorem 1 (CXDP of the LSHPMs).

The $(\varepsilon, \kappa)$-LSH-based privacy mechanism $Q_\lambda$ provides $(\varepsilon\kappa \cdot d, \, \varepsilon\sqrt{\kappa})$-CXDP.

This implies that the LSHPM $Q_\lambda$ provides PXDP, hence XDP.

Theorem 2 (PXDP/XDP of the LSHPMs).

Let $\delta \in (0, 1]$ and $\varepsilon' = \varepsilon \sqrt{2 \kappa \ln \frac{1}{\delta}}$. We define $f$ by $f(x, x') = \varepsilon\kappa \cdot d(x, x') + \varepsilon'$. The $(\varepsilon, \kappa)$-LSH-based mechanism $Q_\lambda$ provides $(f, \delta)$-PXDP, hence $(f, \delta)$-XDP.

Finally, we present a tighter bound for PXDP of the proposed mechanism. For $x, x' \in \mathcal{X}$ and $w \in \{0, 1, \ldots, \kappa\}$, let $F_{x, x'}(w) = \Pr[W > w]$ denote the upper tail of the binomial random variable $W$ from Proposition 6.

Proposition 7 (Tighter bound for PXDP/XDP).

For an $x, x' \in \mathcal{X}$ and a threshold $w \in \{0, 1, \ldots, \kappa\}$, we define $f$ and $\delta$ by:

$f(x, x') = \varepsilon w$  and  $\delta(x, x') = F_{x, x'}(w)$.

The $(\varepsilon, \kappa)$-LSH-based mechanism $Q_\lambda$ provides $(f, \delta)$-PXDP, hence $(f, \delta)$-XDP.

Note that when we apply Proposition 7 in our experiments, we numerically compute $f$ from a constant $\delta$ using the convexity of the tail bound $F_{x, x'}$.
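For concreteness, here is one way to carry out such a numerical computation (a sketch of our reading of Proposition 7, assuming, as in Propositions 2 and 6, that the privacy loss of the LSHPM is bounded by $\varepsilon W$ with $W$ binomially distributed): given a constant $\delta$, it returns the smallest $f = \varepsilon w$ whose binomial tail is at most $\delta$.

    from math import comb

    def pxdp_f(eps, kappa, d, delta):
        """Smallest f = eps * w with Pr[W > w] <= delta for W ~ Binomial(kappa, d).
        A sketch of the tail computation behind Proposition 7; assumes the
        privacy loss of the LSHPM is bounded by eps * W (see Proposition 2)."""
        tail = 1.0
        for w in range(kappa + 1):
            tail -= comb(kappa, w) * d**w * (1.0 - d)**(kappa - w)
            if tail <= delta:       # tail now equals Pr[W > w]
                return eps * w
        return eps * kappa

    print(pxdp_f(eps=1.0, kappa=20, d=0.1, delta=0.05))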

6. Experimental Evaluation

In this section we present an experimental evaluation of our LSH-based privacy mechanism (LSHPM) for a problem in privacy-preserving nearest neighbor search. Our goal is to evaluate the effectiveness of the LSHPM against a state-of-the-art privacy mechanism w.r.t. utility, while promising a similar privacy guarantee.

6.1. Problem Statement

Our problem of interest is privacy-preserving collaborative filtering. In this scenario, we are given a dataset of users in which each user is represented as a (real-valued) vector of $n$ attributes, i.e., $x \in \mathbb{R}^n$. The data curator's goal is to provide recommendations to each user based on their nearest neighbors w.r.t. an appropriate distance measure over users. When $n$ is large, LSH can be used to provide recommendations based on approximate nearest neighbors.

When attribute values are sensitive, LSH alone does not provide strong privacy guarantees, as shown in Section 3. In this case, to protect users' sensitive attributes from an untrusted data curator, a differentially private mechanism can be applied to the data prior to sending it to the curator. The noisy data can then be used for recommendations.

We compare two privacy mechanisms:

  1. LSHPM Mechanism: For each user we apply $\kappa$-bit LSH to the attribute vector according to a precomputed (shared) hash function $H$, and subsequently apply the $(\varepsilon, \kappa)$-bitwise randomized response (Definition 4.1) to the computed hash. The noisy hash is used to compute nearest neighbors. We compute the privacy guarantee of this mechanism according to Proposition 7, which gives an $(f, \delta)$-XDP-style guarantee that depends on the angular distance $d_\theta$.

  2. Multivariate Laplace Mechanism: For each user we first add Laplace noise parametrized by $\varepsilon$ (Definition 2.11) to the attribute vector before applying LSH in order to generate noisy hash values. Our implementation of the multivariate Laplace mechanism is $\varepsilon \cdot d_{euc}$-private for the Euclidean distance $d_{euc}$ and follows that described in (Fernandes et al., 2019); namely, we generate additive noise by constructing a unit vector uniformly at random over $\mathbb{S}^{n-1}$, scaled by a random value generated from the gamma distribution with shape $n$ and scale $1/\varepsilon$. Note that this method carries a strong privacy guarantee, namely $(\varepsilon \cdot d_{euc}, 0)$-XDP, since the multivariate Laplace mechanism is $\varepsilon \cdot d_{euc}$-private. (A sketch of the noise generation appears below.)
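A sketch of the noise generation for the multivariate Laplace mechanism (our implementation of the recipe above; the function name is ours):

    import numpy as np

    def multivariate_laplace_noise(n, eps, rng):
        """Additive noise for the eps*d_euc-private multivariate Laplace mechanism:
        a uniformly random unit direction scaled by a Gamma(n, 1/eps) radius."""
        direction = rng.standard_normal(n)
        direction /= np.linalg.norm(direction)     # uniform on the unit sphere
        return rng.gamma(shape=n, scale=1.0 / eps) * direction

    rng = np.random.default_rng(7)
    noisy_attributes = np.zeros(100) + multivariate_laplace_noise(100, eps=2.0, rng=rng)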

6.2. Comparing Privacy and Utility

For our evaluation we compare the utility loss of each mechanism for the same privacy guarantee. To establish equivalent privacy guarantees, the overall value of each mechanism is used as the basis for comparison; for the Laplace mechanism this is $\varepsilon \cdot d_{euc}$, and for LSHPM this is $f$ from Proposition 7. However, as the privacy guarantee of LSHPM depends on the angular distance $d_\theta$ and that of the Laplace mechanism depends on the Euclidean distance $d_{euc}$, we cannot simply compare the guarantees directly; we also need to compare the privacy guarantee w.r.t. the distances between users. Hence we make use of the relationship between the Euclidean and cosine distances, provided that attribute vectors are normalized prior to performing experiments. We convert the Euclidean distance to the angular distance using the following formula, which holds for unit vectors:

$d_{euc}(x, x') = 2 \sin\left( \frac{\pi \, d_\theta(x, x')}{2} \right)$

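A quick numerical check of this conversion for unit vectors (our own sketch):

    import numpy as np

    rng = np.random.default_rng(3)
    x, y = rng.standard_normal(100), rng.standard_normal(100)
    x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)

    d_theta = np.arccos(np.clip(x @ y, -1.0, 1.0)) / np.pi
    print(np.linalg.norm(x - y), 2.0 * np.sin(np.pi * d_theta / 2.0))  # should agree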
Since $f$ depends on $d_\theta$ and $\kappa$, we perform comparisons against various reasonable ranges of these variables to properly evaluate the range of utility loss. The LSHPM privacy guarantee also includes a $\delta$ value, which is fixed to a constant value for all computations.

For utility we use the average distance of the $k$ nearest neighbors from the original input vector, as defined in Definition 2.3. We will present an appropriate distance measure for our particular use case in the following section.

6.3. Use Case and Experimental Setup

Our particular application is friend matching, in which users are recommended similar users based on common items of interest – in this case, similarly rated movies. For our experiments we use the MovieLens 100k dataset (downloaded from https://grouplens.org/datasets/movielens/). This contains 943 users with ratings across 1682 movies, with ratings ranging from 1 to 5. We associate a movie ratings vector to each user, and for nearest neighbor comparison we use the angular distance $d_\theta$ between ratings vectors (from Eqn (2)), closely related to the cosine distance, which has been established as a good measure of similarity in previous work. That is, users who have similar movie ratings will be close in cosine (and angular) distance.

We computed ratings vectors for each user, where the $i$-th element of a user's vector corresponds to movie $i$ in the dataset; unrated movies are assumed to have a rating value of 0. We reduced the size of the ratings vectors by including only movies from the 4 most rated genres, and selecting subsets of those movies to produce 100-dimensional and 500-dimensional ratings vectors for each user. Ratings were normalized by subtracting the total average rating (across all users) from each rated movie. Unrated movies retained their original value of 0.
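For reference, the preprocessing described above can be sketched as follows (our own code, assuming the standard MovieLens 100k u.data layout of tab-separated user id, item id, rating, timestamp; the genre-based column selection is omitted):

    import numpy as np

    n_users, n_movies = 943, 1682
    ratings = np.loadtxt("u.data", dtype=int, usecols=(0, 1, 2))

    X = np.zeros((n_users, n_movies))
    for user, movie, rating in ratings:
        X[user - 1, movie - 1] = rating

    # Subtract the global average rating from rated entries; unrated entries stay 0.
    rated = X > 0
    X[rated] -= X[rated].mean()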

We next computed the true $k$ nearest neighbors (w.r.t. the angular distance $d_\theta$) for each user using standard nearest neighbor search (i.e., pairwise comparisons over all ratings vectors). We refer to these data points as the True Nearest Neighbors; these are used to evaluate the performance of standard LSH and, by extension, the Laplace and LSHPM mechanisms. The distribution of True Nearest Neighbor distances is shown in Figure 1.

Figure 1. Distribution of angular distances to the nearest neighbor for vector lengths 100 (left) and 500 (right). The distance 0.5 represents orthogonal vectors, i.e., users sharing no rated movies in common.

We note that for the smaller vector length, the nearest neighbors lie a small distance from each user (ignoring distances of 0, which represent users who have exactly the same ratings vector; this typically occurs when users only rate a single movie). For the larger vector length, the nearest-neighbor distances are somewhat larger on average. These distances are important for computing the privacy guarantee, which depends on the true distance between users. We therefore use distances between 0.05 and 0.5 for our utility comparison.

6.4. Performance of LSH

We performed a baseline comparison of LSH against the True Nearest Neighbors to establish the utility of vanilla LSH. $\kappa$-bit LSH was implemented (for $\kappa \in \{10, 20, 30, 50\}$) using the random-projection-based hashing described in Section 2.3, since this provides a guarantee w.r.t. the angular distance between vectors. A (uniformly random) $n$-dimensional normal vector can be constructed by selecting each component from a standard normal distribution; this is done for each of the $\kappa$ hashes to generate the LSH hash function $H$. The same function $H$ is used to encode each user's ratings vector into a $\kappa$-bit string.

For each user we then computed their $k$ nearest neighbors using the Hamming distance on bitstrings. Where multiple neighbors shared the same distance (i.e., they were in the same hash bucket) we chose nearest neighbors at random.

The results for the utility loss of LSH against the True Nearest Neighbors are shown in Figure 2. As expected, the utility loss decreases as the number of bits $\kappa$ increases, since each bit can be seen as (probabilistically) encoding some information about the distance; the utility loss is also smaller for the shorter vector length, since the dimensionality reduction from $n$ elements to $\kappa$ bits is smaller for smaller $n$.

Figure 2. Average utility loss for LSH for vector lengths 100 (top) and 500 (bottom). The average utility loss is the average angular distance of the $k$ nearest neighbors computed by LSH compared with the average distance to the true nearest neighbors. Three values of $k$ are shown (blue, orange, and green).

6.5. Performance of Privacy Mechanisms

We now compare the performance of LSHPM and the Laplace mechanism against LSH and the True Nearest Neighbors. We implemented both mechanisms as described in Section 6.1. For both LSHPM and Laplace, the same hash function $H$ was used as for vanilla LSH for the purposes of comparison. We chose values of $\kappa$ ranging from 10 through to 50 for LSHPM. To compute the overall $(f, \delta)$-XDP guarantee as per Proposition 7, we fixed $\delta$ and varied the angular distance $d_\theta$ from 0.05 to 0.5 to obtain corresponding values of $f$ (here $d_\theta$ is the distance measure corresponding to the metric $d$ in Section 5.4). These are shown in Table 1.

          Length of bitstring (κ)
d_θ       κ=10      κ=20      κ=30      κ=50
0.05      0.31111   0.2028    0.15866   0.11713
0.1       0.3766    0.25209   0.19988   0.14979
0.2       0.44181   0.30509   0.24546   0.18683
0.25      0.45747   0.31977   0.25866   0.19796
0.3       0.46544   0.32908   0.26749   0.20573
0.4       0.46266   0.33474   0.27462   0.21313
0.5       0.43792   0.32553   0.26969   0.21123
Table 1. Values given an angular distance $d_\theta$ between inputs and a length $\kappa$ of the bitstring. These are used to compute $f$ in the LSHPM guarantee of Proposition 7.

We chose values of $d_\theta$ corresponding to actual distances observed for true nearest neighbors as described in Section 6.3.

We first plotted the utility loss of LSHPM against vanilla LSH (see Figure 3), since the performance of LSHPM is bounded by that of LSH. We note that the utility of LSHPM improves as the total epsilon increases and approaches the utility of LSH for large total epsilon. This suggests that LSHPM could perform better for larger datasets, for which larger values of epsilon would not incur a significant privacy loss for individuals. We also notice that the performance of LSHPM is better with smaller values of $d_\theta$, corresponding to more similar users; we suggest that LSHPM would therefore be suitable for datasets in which the vector lengths are smaller (i.e., there are fewer items in the recommendation engine).

Figure 3. Average utility loss vs. total epsilon for LSHPM vs. LSH. Note that $d$ in this plot corresponds to the angular distance ($d_\theta$) computed between user ratings vectors.

Finally we compared the utility loss of LSHPM against the multivariate Laplace mechanism (see Figure 4). We observe that LSHPM consistently outperforms the Laplace mechanism, most notably for larger values of the total epsilon, for larger bit lengths $\kappa$, and for larger values of $d_\theta$.

Figure 4. Average utility loss vs. total epsilon for LSHPM vs. Laplace. Note that $d$ in this plot corresponds to the angular distance ($d_\theta$) computed between user ratings vectors.

In summary, our experiments show that LSHPM has better performance than the Laplace mechanism for a comparable privacy guarantee, and that the performance of LSHPM approaches that of vanilla LSH for larger values of the total epsilon and for larger bit lengths $\kappa$. This suggests that LSHPM could be used for very large datasets in which larger values of epsilon are tolerable.

7. Conclusion

In this paper we proposed an LSH-based privacy mechanism that provides extended DP with a wide range of metrics. We first showed that LSH itself does not provide strong privacy guarantees and could result in complete privacy collapse in some situations. We then proved that our LSH-based mechanism provides rigorous privacy guarantees: concentrated/probabilistic versions of extended DP. By experiments with large datasets, we demonstrated that our mechanism provides much higher utility than the multivariate Laplace mechanism, and that it can be applied to friend matching with rigorous privacy guarantees and high utility.

Acknowledgements.
This work was supported by Inria under the project LOGIS, by JSPS KAKENHI Grant Number JP19H04113, and by an Australian Government RTP Scholarship (2017278).

References

  • V. Agarwal and K. K. Bharadwaj (2013) A collaborative filtering framework for friends recommendation in social networks based on interaction intensity and adaptive user similarity. Social Network Analysis and Mining 3, pp. 359–379.
  • C. C. Aggarwal (2016) Recommender systems. Springer.
  • A. Aghasaryan, M. Bouzid, D. Kostadinov, M. Kothari, and A. Nandi (2013) On the use of LSH for privacy preserving personalization. In 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pp. 362–371.
  • M. S. Alvim, K. Chatzikokolakis, C. Palamidessi, and A. Pazii (2018) Metric-based local differential privacy for statistical applications. CoRR abs/1805.01456.
  • A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt (2015) Practical and optimal LSH for angular distance. In Proc. NIPS, pp. 1–9.
  • A. Andoni, P. Indyk, and I. Razenshteyn (2018) Approximate nearest neighbor search in high dimensions. In Proc. ICM, pp. 3287–3318.
  • M. E. Andrés, N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi (2013) Geo-indistinguishability: differential privacy for location-based systems. In Proc. CCS, pp. 901–914.
  • N. E. Bordenabe, K. Chatzikokolakis, and C. Palamidessi (2014) Optimal geo-indistinguishable mechanisms for location privacy. In Proc. CCS, pp. 251–262.
  • A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher (2000) Min-wise independent permutations. Journal of Computer and System Sciences 60, pp. 630–659.
  • M. S. Charikar (2002) Similarity estimation techniques from rounding algorithms. In Proc. STOC, pp. 380–388.
  • K. Chatzikokolakis, M. E. Andrés, N. E. Bordenabe, and C. Palamidessi (2013) Broadening the scope of differential privacy using metrics. In Proc. PETS, pp. 82–102.
  • K. Chatzikokolakis, E. ElSalamouny, and C. Palamidessi (2017) Efficient utility improvement for location privacy. PoPETs 2017 (4), pp. 308–328.
  • L. Chen and P. Zhu (2015) Preserving the privacy of social recommendation with a differentially private approach. In Proc. SmartCity, pp. 780–785.
  • X. Chen, H. Liu, and D. Yang (2019) Improved LSH for privacy-aware and robust recommender system with sparse data in edge environment. EURASIP Journal on Wireless Communications and Networking 171, pp. 1–11.
  • H. Cheng, M. Qian, Q. Li, Y. Zhou, and T. Chen (2018) An efficient privacy-preserving friend recommendation scheme for social network. IEEE Access 6, pp. 56018–56028.
  • R. Chow, M. A. Pathak, and C. Wang (2012) A practical system for privacy-preserving collaborative filtering. In 2012 IEEE 12th International Conference on Data Mining Workshops, pp. 547–554.
  • M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SCG, pp. 253–262.
  • W. Dong, V. Dave, L. Qiu, and Y. Zhang (2011) Secure friend discovery in mobile social networks. In Proc. InfoCom, pp. 1647–1655.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Proc. TCC, pp. 265–284.
  • C. Dwork, G. N. Rothblum, and S. P. Vadhan (2010) Boosting and differential privacy. In Proc. FOCS, pp. 51–60.
  • C. Dwork and G. N. Rothblum (2016) Concentrated differential privacy. CoRR abs/1603.01887.
  • C. Dwork (2006) Differential privacy. In Proc. ICALP, pp. 1–12.
  • Ú. Erlingsson, V. Pihur, and A. Korolova (2014) RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proc. CCS, pp. 1054–1067.
  • N. Fernandes, M. Dras, and A. McIver (2018) Processing text for privacy: an information flow perspective. In International Symposium on Formal Methods, pp. 3–21.
  • N. Fernandes, M. Dras, and A. McIver (2019) Generalised differential privacy for text document processing. In Proc. POST, pp. 123–148.
  • S. R. Ganta, S. P. Kasiviswanathan, and A. Smith (2008) Composition attacks and auxiliary information in data privacy. In Proc. KDD, pp. 265–273.
  • A. Gionis, P. Indyk, and R. Motwani (1999) Similarity search in high dimensions via hashing. In Proc. VLDB, pp. 518–529.
  • T. Guo, K. Dong, L. Wang, M. Yang, and J. Luo (2018) Privacy preserving profile matching for social networks. In Proc. CBD, pp. 263–268.
  • P. Indyk and R. Motwani (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613.
  • P. Kairouz, K. Bonawitz, and D. Ramage (2016) Discrete distribution estimation under local privacy. In Proc. ICML, pp. 2436–2444.
  • P. Kamalaruban, V. Perrier, H. J. Asghar, and M. A. Kaafar (2020) Not all attributes are created equal: $d_{\mathcal{X}}$-private mechanisms for linear queries. PoPETs 2020 (1), pp. 103–125.
  • Y. Kawamoto and T. Murakami (2018) On the anonymization of differentially private location obfuscation. In Proc. ISITA, pp. 159–163.
  • Y. Kawamoto and T. Murakami (2019) Local obfuscation mechanisms for hiding probability distributions. In Proc. ESORICS, pp. 128–148.
  • D. Li, Q. Lv, L. Shang, and N. Gu (2017a) Efficient privacy-preserving content recommendation for online social communities. Neurocomputing 219, pp. 440–454.
  • M. Li, N. Ruan, Q. Qian, H. Zhu, X. Liang, and L. Yu (2017b) SPFM: scalable and privacy-preserving friend matching in mobile clouds. IEEE Internet of Things Journal 4 (2), pp. 583–591.
  • C. Liu and P. Mittal (2016) LinkMirage: enabling privacy-preserving analytics on social relationships. In Proc. NDSS.
  • X. Ma, J. Ma, H. Li, Q. Jiang, and S. Gao (2018) ARMOR: a trust-based privacy-preserving framework for decentralized friend recommendation in online social networks. Future Generation Computer Systems 79, pp. 82–94.
  • A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber (2008) Privacy: theory meets practice on the map. In Proc. ICDE, pp. 277–286.
  • A. Narayanan, N. Thiagarajan, M. Lakhani, M. Hamburg, D. Boneh, et al. (2011) Location privacy via private proximity testing. In Proc. NDSS, Vol. 11.
  • S. Oya, C. Troncoso, and F. Pérez-González (2017) Back to the drawing board: revisiting the design of optimal location privacy-preserving mechanisms. In Proc. CCS, pp. 1959–1972.
  • L. Qi, X. Zhang, W. Dou, and Q. Ni (2017) A distributed locality-sensitive hashing-based approach for cloud service recommendation from multi-source data. IEEE Journal on Selected Areas in Communications 35 (11), pp. 2616–2624.
  • B. K. Samanthula, L. Cen, W. Jiang, and L. Si (2015) Privacy-preserving and efficient friend recommendation in online social networks. Trans. Data Privacy 8 (2), pp. 141–171.
  • R. Shokri (2015) Privacy games: optimal user-centric data obfuscation. PoPETs 2015 (2), pp. 299–315.
  • D. M. Sommer, S. Meiser, and E. Mohammadi (2019) Privacy loss classes: the central limit theorem in differential privacy. PoPETs 2019 (2), pp. 245–269.
  • J. Wang, W. Liu, S. Kumar, and S. Chang (2016) Learning to hash for indexing big data – a survey. Proceedings of the IEEE 104 (1), pp. 34–57.
  • S. L. Warner (1965) Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60 (309), pp. 63–69.
  • D. Yang, B. Qu, J. Yang, and P. Cudre-Mauroux (2019) Revisiting user mobility and social relationships in LBSNs: a hypergraph embedding approach. In Proc. WWW, pp. 2147–2157.
  • H. Zhu, S. Du, M. Li, and Z. Gao (2013) Fairness-aware and privacy-preserving friend matching protocol in mobile social networks. IEEE Transactions on Emerging Topics in Computing 1 (1), pp. 192–200.

Appendix A Details on the Technical Results

We first recall the Chernoff bound, which is used in the proof of Proposition 4.

Lemma 1 (Chernoff bound).

Let $X$ be a real-valued random variable. Then for all $t > 0$ and $a \in \mathbb{R}$,

$\Pr[X \geq a] \leq e^{-ta} \, \mathbb{E}[e^{tX}]$.