1. Introduction
Extended differential privacy (DP), also called $d_{\mathcal{X}}$-privacy (Chatzikokolakis et al., 2013), is a privacy definition that provides rigorous privacy guarantees while enabling high utility. Extended DP is a generalization of standard DP (Dwork, 2006; Dwork et al., 2006) in that the adjacency relation is defined in terms of a general metric (rather than the Hamming metric). A well-known application of extended DP is geo-indistinguishability (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015), which is an instance of extended differential privacy in the two-dimensional Euclidean space. Geo-indistinguishability can be used to guarantee that a user's location is indistinguishable from any location within a certain radius (e.g., a radius of 5km) in the local model, in which each user obfuscates her own data and sends it to a data collector. Geo-indistinguishability also results in higher utility for the task of estimating geographic population distributions (Alvim et al., 2018), when compared with the randomized response for multiple alphabets (Kairouz et al., 2016) under the local differential privacy (LDP) model.

Since extended DP is defined using a general metric, it has a wide range of potential applications. However, existing extended DP mechanisms cover only a small number of applications; e.g., location-based services (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015; Kawamoto and Murakami, 2018), document processing (Fernandes et al., 2019), and linear queries in the centralized model (Kamalaruban et al., 2020). One reason for the limited number of applications is that existing work focuses on a specific metric. For example, the existing work on locations (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Oya et al., 2017; Shokri, 2015; Kawamoto and Murakami, 2018), documents (Fernandes et al., 2019), and linear queries (Kamalaruban et al., 2020) focuses on extended DP with the Euclidean metric, the Earth Mover's metric, and the summation of privacy budgets over attributes, respectively. Therefore, their mechanisms cannot be applied to other metrics such as the angular distance (or cosine) metric, the Jaccard metric, and $\ell_p$ metrics other than the Euclidean metric (i.e., $p \neq 2$). In addition, most of the studies on extended DP have focused on personal data in a two-dimensional space (Alvim et al., 2018; Andrés et al., 2013; Bordenabe et al., 2014; Chatzikokolakis et al., 2017; Kamalaruban et al., 2020; Kawamoto and Murakami, 2018; Oya et al., 2017; Shokri, 2015), and do not consider personal data in a high-dimensional space.
In this paper, we propose a new mechanism providing extended DP with a wide range of metrics. Our mechanism is based on locality sensitive hashing (LSH) (Gionis et al., 1999; Wang et al., 2016) and the randomized response. In a nutshell, our mechanism embeds personal data equipped with the original metric into a binary vector with the Hamming metric by using LSH, and then applies Warner's randomized response (Warner, 1965) to each bit of the binary vector. Since LSH can be applied to a wide range of metrics, including the angular distance (or cosine) metric (Andoni et al., 2015; Charikar, 2002), the Jaccard metric (Broder et al., 2000), the Earth Mover's metric (Charikar, 2002), and the $\ell_p$ metric with $p \in (0, 2]$ (Datar et al., 2004), our mechanism can be applied to these metrics as well. Moreover, our mechanism works well for personal data in a high-dimensional space, because LSH is a dimensionality reduction technique designed for high-dimensional data.
We apply our mechanism to friend matching (or friend recommendation) based on personal data (e.g., locations, rating history) (Agarwal and Bharadwaj, 2013; Chen and Zhu, 2015; Cheng et al., 2018; Dong et al., 2011; Guo et al., 2018; Li et al., 2017a, b; Liu and Mittal, 2016; Ma et al., 2018; Narayanan et al., 2011; Samanthula et al., 2015; Yang et al., 2019; Zhu et al., 2013)
. For example, consider a dataset of users who have visited certain Points of Interest (POIs). For each user, we can create a vector of visit-counts; i.e., each element in the user vector contains the visit-count for the corresponding POI (or whether she has visited the POI). Users with similar vectors have a high probability of establishing new friendships, as shown in (Yang et al., 2019). Therefore, we could use the POI vector to recommend a new friend. Similarly, we can recommend a new friend based on a vector consisting of ratings for items, because users who have similar interests have similar rating vectors, as is well known in item recommendation (Aggarwal, 2016). Since the distance between two vectors in such applications is usually given by the angular distance (or equivalently, the cosine distance) (Aggarwal, 2016), we use our mechanism with the angular distance metric.

The dimensionality of personal data in this application can be very large; e.g., the numbers of POIs and items can both be in the hundreds, thousands, or even millions. For such high-dimensional data, even non-private friend matching is challenging due to the curse of dimensionality (Indyk and Motwani, 1998). The problem is much harder when we also have to rigorously protect user privacy. We address both the utility issue and the privacy issue by introducing an extended DP mechanism based on LSH, as explained above.

Note that the privacy analysis of a private version of LSH is also very challenging, because LSH does not preserve exact information about the original data space; i.e., it approximates the original space via hashing. In fact, many existing works on privacy-preserving LSH (Aghasaryan et al., 2013; Qi et al., 2017; Chen et al., 2019) fail to provide rigorous guarantees of user privacy (or only apply LSH and claim that it protects user privacy because LSH is a kind of non-invertible transformation). We point out, using a toy example, how the lack of rigorous guarantees can lead to a privacy breach. Then we theoretically analyze the privacy property of our mechanism, and formally prove that it provides concentrated (Dwork and Rothblum, 2016) and probabilistic (Machanavajjhala et al., 2008) versions of extended DP.
Contributions. Our main contributions are as follows:

We propose a new mechanism providing extended DP with a wide range of metrics. Our algorithm is based on LSH and the randomized response, and can be applied to metrics including the angular distance (or cosine) metric, the Jaccard metric, the Earth Mover's metric, and the $\ell_p$ metric. To our knowledge, this work is the first to provide extended DP with such a wide range of metrics.

We show that LSH by itself does not provide strong privacy guarantees and can result in a complete privacy collapse in some situations. We then prove that our extended DP mechanism provides rigorous privacy guarantees, namely a concentrated version and a probabilistic version of extended DP.

We apply our mechanism with the angular distance metric to friend matching based on rating history in a high-dimensional space. Then we compare our mechanism with the multivariate Laplace mechanism (Fernandes et al., 2019). Note that although the multivariate Laplace mechanism is designed for the Euclidean distance, we can convert the Euclidean distance to the angular distance. We show through experiments that our mechanism provides much higher utility than the multivariate Laplace mechanism (Fernandes et al., 2019). Our experimental results also show that our mechanism makes friend matching possible with rigorous privacy guarantees and high utility.
The rest of this paper is organized as follows. Section 2 introduces notation and recalls background on locality sensitive hashing (LSH), privacy measures, and privacy protection mechanisms. Section 3 illustrates how the privacy of LSH alone can break down. Section 4 defines our LSH-based privacy mechanisms (LSHPMs). Section 5 presents two types of privacy guarantees provided by the LSHPMs. Section 6 shows an experimental evaluation of the LSHPMs. Section 7 concludes. All proofs can be found in the Appendix.
2. Preliminaries
In this section we introduce notations used in this paper and recall background on locality sensitive hashing (LSH), privacy measures, and privacy protection mechanisms.
2.1. Notations
Let $\mathbb{Z}$, $\mathbb{Z}^{\geq 0}$, $\mathbb{Z}^{>0}$, $\mathbb{R}$, $\mathbb{R}^{\geq 0}$, and $\mathbb{R}^{>0}$ be the sets of integers, non-negative integers, positive integers, real numbers, non-negative real numbers, and positive real numbers, respectively. For $b \in \mathbb{R}^{>0}$, let $[0, b]$ be the set of non-negative real numbers not greater than $b$. We denote by $\mathbb{S}^{n-1}$ the $(n-1)$-dimensional unit sphere (i.e., the set of all points $\boldsymbol{x} \in \mathbb{R}^{n}$ satisfying $\|\boldsymbol{x}\|_2 = 1$). Let $e$ be the base of the natural logarithm. Let $d_{\mathrm{euc}}$ be the Euclidean distance; i.e., for real vectors $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^{n}$, $d_{\mathrm{euc}}(\boldsymbol{x}, \boldsymbol{y}) = \|\boldsymbol{x} - \boldsymbol{y}\|_2$.

We denote by $\{0, 1\}^{\kappa}$ the set of all binary data of length $\kappa$; i.e., bitstrings $\boldsymbol{b} = (b_1, \ldots, b_{\kappa})$. The Hamming distance between two binary data $\boldsymbol{b}, \boldsymbol{c} \in \{0, 1\}^{\kappa}$ is defined by: $d_{\mathrm{H}}(\boldsymbol{b}, \boldsymbol{c}) = \#\{ i \mid b_i \neq c_i \}$.
We denote the set of all probability distributions over a set $\mathcal{Y}$ by $\mathbb{D}\mathcal{Y}$. Let $\mathcal{N}(\mu, \sigma^2)$ be the normal distribution with mean $\mu$ and variance $\sigma^2$. For two finite sets $\mathcal{X}$ and $\mathcal{Y}$, we denote by $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ a randomized algorithm from $\mathcal{X}$ to $\mathcal{Y}$, and by $A(x)[y]$ (resp. by $A(x)[S]$) the probability that $A$ maps $x$ to $y$ (resp. to some element of a set $S \subseteq \mathcal{Y}$).

2.2. Locality Sensitive Hashing (LSH)
We denote by $\mathcal{X}$ the set of all possible input data. We introduce a (normalized) similarity function $\mathit{sim} : \mathcal{X} \times \mathcal{X} \rightarrow [0, 1]$ such that two inputs $x$ and $y$ have a larger similarity $\mathit{sim}(x, y)$ when they are closer, and such that $\mathit{sim}(x, y) = 1$ when $x = y$. Then we define a dissimilarity function $\mathit{dist}$ over $\mathcal{X}$ by $\mathit{dist}(x, y) = 1 - \mathit{sim}(x, y)$. When $\mathit{dist}$ is symmetric and subadditive, it is a metric. We will later instantiate $\mathit{sim}$ and $\mathit{dist}$ with specific metrics corresponding to hashing schemes.

Now we introduce the notion of locality sensitive hashing (LSH) as a family of one-bit hash functions such that the probability of two inputs $x$ and $y$ having different hash values is proportional to their dissimilarity $\mathit{dist}(x, y)$.
Definition 2.1 (Locality sensitive hashing).
A locality sensitive hashing (LSH) scheme w.r.t. a dissimilarity function $\mathit{dist}$ is a family $\mathcal{H}$ of functions from $\mathcal{X}$ to $\{0, 1\}$, coupled with a probability distribution $D$ over $\mathcal{H}$, such that for any $x, y \in \mathcal{X}$,

(1) $\Pr[H(x) \neq H(y)] = \mathit{dist}(x, y),$

where a function $H$ is chosen from $\mathcal{H}$ according to the distribution $D$.
Note that by the definition of $\mathit{dist}$, the probability of collision of two hash values is given by: $\Pr[H(x) = H(y)] = \mathit{sim}(x, y)$.

By using $\kappa$ hash functions $H_1, \ldots, H_{\kappa}$ independently drawn from $D$, an input $x \in \mathcal{X}$ can be embedded into a $\kappa$-bit binary code as follows: $H^{(\kappa)}(x) = (H_1(x), \ldots, H_{\kappa}(x))$.

The function $H^{(\kappa)}$ is called a $\kappa$-bit LSH function. We denote by $A_{\mathrm{LSH}}$ the randomized algorithm that chooses a $\kappa$-bit LSH function $H^{(\kappa)}$ according to the distribution $D$ and outputs the hash value $H^{(\kappa)}(x)$ of a given input $x$.
2.3. Examples of LSHs
There are a variety of LSH families corresponding to useful metrics, such as the angular distance (Andoni et al., 2015; Charikar, 2002), the Jaccard metric (Broder et al., 2000), the Earth Mover's metric (Charikar, 2002), and the $\ell_p$ metric with $p \in (0, 2]$ (Datar et al., 2004). In this section we present an LSH family based on randomized hashing, called random-projection-based hashing, which corresponds to the angular distance.
A random-projection-based hashing is a one-bit hashing associated with a randomly chosen normal vector $\boldsymbol{r}$ that defines a hyperplane through the origin. Formally, we take the input domain to be $\mathbb{S}^{n-1}$, and define a random-projection-based hashing as a function $H_{\boldsymbol{r}} : \mathbb{S}^{n-1} \rightarrow \{0, 1\}$ such that:

$H_{\boldsymbol{r}}(\boldsymbol{x}) = 0$ if $\langle \boldsymbol{r}, \boldsymbol{x} \rangle < 0$, and $H_{\boldsymbol{r}}(\boldsymbol{x}) = 1$ otherwise,

where $\langle \cdot, \cdot \rangle$ is the inner product and $\boldsymbol{r} \in \mathbb{R}^{n}$ is a real vector whose elements are independently chosen from the standard normal distribution $\mathcal{N}(0, 1)$.
Then the random-projection-based hashing is an LSH w.r.t. the similarity measure defined below. The angular distance $d_{\theta}$ and the angular similarity $\mathit{sim}_{\theta}$ are respectively defined as the functions $d_{\theta} : \mathbb{S}^{n-1} \times \mathbb{S}^{n-1} \rightarrow [0, 1]$ and $\mathit{sim}_{\theta} : \mathbb{S}^{n-1} \times \mathbb{S}^{n-1} \rightarrow [0, 1]$ such that for any $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{S}^{n-1}$,

(2) $d_{\theta}(\boldsymbol{x}, \boldsymbol{y}) = \frac{1}{\pi} \arccos\left( \langle \boldsymbol{x}, \boldsymbol{y} \rangle \right)$

(3) $\mathit{sim}_{\theta}(\boldsymbol{x}, \boldsymbol{y}) = 1 - d_{\theta}(\boldsymbol{x}, \boldsymbol{y}).$

Then two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ have different hash values if and only if the random hyperplane separates $\boldsymbol{x}$ and $\boldsymbol{y}$. Hence $\Pr[H_{\boldsymbol{r}}(\boldsymbol{x}) \neq H_{\boldsymbol{r}}(\boldsymbol{y})] = d_{\theta}(\boldsymbol{x}, \boldsymbol{y})$. Therefore $H_{\boldsymbol{r}}$ is an LSH w.r.t. the angular distance $d_{\theta}$.

For example, $d_{\theta}(\boldsymbol{x}, \boldsymbol{y}) = 0$ iff $\boldsymbol{x} = \boldsymbol{y}$, while $d_{\theta}(\boldsymbol{x}, \boldsymbol{y}) = 1$ iff $\boldsymbol{x} = -\boldsymbol{y}$. $d_{\theta}(\boldsymbol{x}, \boldsymbol{y}) = 0.5$ represents that the two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ are orthogonal, namely, $\langle \boldsymbol{x}, \boldsymbol{y} \rangle = 0$.
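The collision behavior of random-projection-based hashing is easy to check empirically. The sketch below (a minimal illustration with names of our own choosing, not the paper's implementation) estimates the probability that two vectors receive different hash bits over many random normals and compares it with their angular distance:

```python
import numpy as np

rng = np.random.default_rng(0)

def angular_distance(x, y):
    # Normalized angular distance in [0, 1]: the angle between x and y divided by pi.
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

# Two unit vectors 45 degrees apart, so the angular distance is 45/180 = 0.25.
x = np.array([1.0, 0.0, 0.0])
y = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

# Draw many random normals r ~ N(0, I); the hash bit of v is 1 iff <r, v> >= 0.
trials = 200_000
normals = rng.standard_normal((trials, 3))
disagree = np.mean((normals @ x >= 0) != (normals @ y >= 0))

print(angular_distance(x, y))  # 0.25
print(disagree)                # empirically close to 0.25
```

The empirical disagreement rate converges to the angular distance, which is exactly property (1) of Definition 2.1.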
2.4. Approximate Nearest Neighbor Search
Next we recall the nearest neighbor search problem and utility measures for approximate algorithms that solve it.
Definition 2.2 (Nearest neighbor search).
Let $d$ be a metric over $\mathcal{X}$. Given a dataset $S \subseteq \mathcal{X}$, the nearest neighbor search for a data point $x \in \mathcal{X}$ is the problem of finding the closest point $y \in S$ to $x$ w.r.t. the metric $d$. A $k$-nearest neighbor search for an $x \in \mathcal{X}$ is the problem of finding the $k$ closest points to $x$.

A naive and exact approach to nearest neighbor search is to compare the query point against all data points, requiring a number of distance computations linear in $|S|$ (and likewise for $k$ nearest neighbors). For a very large dataset $S$, however, nearest neighbor search is computationally expensive, and approaches that improve computational efficiency shift the problem to space inefficiency (Andoni et al., 2018). An alternative approach, proposed by Indyk and Motwani (Indyk and Motwani, 1998), is to employ LSH to efficiently compute approximate nearest neighbor points.
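The Indyk-Motwani idea can be sketched in a few lines. The following is our own simplified single-table variant (practical implementations use multiple tables and multi-probe strategies): points are bucketed by their multi-bit random-projection hash, and only the query's bucket is searched.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def hash_code(x, normals):
    # One bit per hyperplane: which side of each hyperplane does x lie on?
    return tuple(int(b) for b in (normals @ x >= 0))

def build_index(data, normals):
    # Bucket every point by its hash code.
    index = defaultdict(list)
    for i, x in enumerate(data):
        index[hash_code(x, normals)].append(i)
    return index

def approx_nearest(query, data, index, normals):
    # Search only the query's bucket; fall back to a full scan if it is empty.
    candidates = index.get(hash_code(query, normals), range(len(data)))
    return min(candidates, key=lambda i: np.linalg.norm(data[i] - query))

# Toy usage: 200 random unit vectors in 16 dimensions, 8 hyperplanes.
data = rng.standard_normal((200, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)
normals = rng.standard_normal((8, 16))
index = build_index(data, normals)
nearest = approx_nearest(data[7], data, index, normals)  # the point itself
```

Only the candidates in one bucket are compared exhaustively, which is the source of both the speedup and the approximation error.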
To evaluate the utility of approximate nearest neighbors, we use the average utility loss defined as follows.
Definition 2.3 (Utility loss).
Assume that an approximate algorithm produces $k$ approximate nearest neighbors $y_1, \ldots, y_k$ for a data point $x$ in terms of a metric $d$. The average utility loss for $y_1, \ldots, y_k$ w.r.t. the true nearest neighbors $y^{*}_1, \ldots, y^{*}_k$ is given by:

$L \;=\; \frac{1}{k} \sum_{i=1}^{k} d(x, y_i) \;-\; \frac{1}{k} \sum_{i=1}^{k} d(x, y^{*}_i).$

That is, we compare the average distance of the returned neighbors from the data point $x$ with the average distance of the true nearest neighbors. We prefer this measure to recall and precision because the LSH algorithm can return many neighbors with the same distance (i.e., in the same hash bucket), and our approximate algorithm chooses among such neighbors at random (a reasonable choice, given that the usefulness of the output is determined by how similar the neighbors are to the original data point $x$).
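The average utility loss of Definition 2.3 translates directly into code; the sketch below (function name ours) takes the approximate and true neighbor lists and a distance function:

```python
def average_utility_loss(x, approx_neighbors, true_neighbors, dist):
    # Average distance of the returned neighbors minus that of the true neighbors.
    k = len(true_neighbors)
    approx_avg = sum(dist(x, y) for y in approx_neighbors) / k
    true_avg = sum(dist(x, y) for y in true_neighbors) / k
    return approx_avg - true_avg

# Toy example on the real line with absolute distance:
loss = average_utility_loss(0.0, [2.0, 3.0], [1.0, 2.0], lambda a, b: abs(a - b))
print(loss)  # 1.0
```

A loss of 0 means the approximate neighbors are, on average, as close to $x$ as the true nearest neighbors.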
2.5. Privacy Measures
In this section we recall notions of privacy measures: differential privacy (DP) (Dwork, 2006), extended DP with a metric (Chatzikokolakis et al., 2013), and concentrated DP (Dwork and Rothblum, 2016).
First, the notion of differential privacy of a randomized algorithm $A$ represents that an observer of the output cannot distinguish between adjacent inputs $x$ and $x'$ given to $A$.
Definition 2.4 (Differential privacy).
A randomized algorithm $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ provides $\varepsilon$-differential privacy ($\varepsilon$-DP) w.r.t. an adjacency relation $\Phi \subseteq \mathcal{X} \times \mathcal{X}$ if for any $(x, x') \in \Phi$ and any $y \in \mathcal{Y}$,

$A(x)[y] \;\leq\; e^{\varepsilon} A(x')[y],$

where the probability is taken over the random choices in $A$.
We then review the notion of extended differential privacy (Chatzikokolakis et al., 2013; Kawamoto and Murakami, 2019), which relaxes DP in the sense that when two inputs $x$ and $x'$ are closer, their output distributions are less distinguishable. In this paper, we introduce a generalized definition using a function $\varepsilon$ over inputs and an arbitrary function $d$ over $\mathcal{X} \times \mathcal{X}$ rather than a metric, as follows.
Definition 2.5 (Extended differential privacy).
Given two functions $\varepsilon : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{\geq 0}$ and $d : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{\geq 0}$, a randomized algorithm $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ provides $(\varepsilon, d)$-extended differential privacy (XDP) if for all $x, x' \in \mathcal{X}$ and for any $S \subseteq \mathcal{Y}$,

$A(x)[S] \;\leq\; e^{\varepsilon(x, x') \, d(x, x')} A(x')[S],$

where the probability is taken over the random choices in $A$.

Hereafter we sometimes abuse notation and write $\varepsilon$ for $\varepsilon(x, x')$ when $\varepsilon$ is a constant function returning the same real number independently of the inputs $x$ and $x'$.
Next, we review the notion of privacy loss and privacy loss distribution (Dwork and Rothblum, 2016).
Definition 2.6 (Privacy loss random variable).
Let $A : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ be a randomized algorithm. The privacy loss on an output $y$ w.r.t. inputs $x, x'$ is defined by:

$\mathcal{L}(y, x, x') \;=\; \ln \frac{A(x)[y]}{A(x')[y]},$

where the probability is taken over the random choices in $A$. Note that when $A(x)[y] \neq 0$ and $A(x')[y] = 0$, then $\mathcal{L}(y, x, x') = \infty$. When $A(x)[y] = 0$, then $\mathcal{L}(y, x, x') = -\infty$. Then the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ is the real-valued random variable representing the privacy loss $\mathcal{L}(y, x, x')$ where $y$ is sampled from $A(x)$.

Given inputs $x, x'$ and a privacy loss $\ell$, let $\mathcal{Y}_{\ell} = \{ y \in \mathcal{Y} \mid \mathcal{L}(y, x, x') = \ell \}$. Then the privacy loss distribution $\omega$ of $A$ over $x, x'$ is defined as a probability distribution over $\mathbb{R} \cup \{-\infty, \infty\}$ such that: $\omega(\ell) = A(x)[\mathcal{Y}_{\ell}]$.
Then mean-concentrated differential privacy (Dwork and Rothblum, 2016) is defined using the notion of subgaussian random variables as follows.
Definition 2.7 (Subgaussian).
For a $\tau \in \mathbb{R}^{>0}$, a random variable $X$ over $\mathbb{R}$ is $\tau$-subgaussian if for all $\lambda \in \mathbb{R}$, $\mathbb{E}\left[ e^{\lambda X} \right] \leq e^{\frac{\tau^2 \lambda^2}{2}}$. A random variable is subgaussian if there exists a $\tau \in \mathbb{R}^{>0}$ such that it is $\tau$-subgaussian.
Definition 2.8 (Mean-concentrated DP).
Let $\mu, \tau \in \mathbb{R}^{\geq 0}$. A randomized algorithm $A$ provides $(\mu, \tau)$-mean-concentrated differential privacy (CDP) w.r.t. an adjacency relation $\Phi$ if for any $(x, x') \in \Phi$, the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ satisfies $\mathbb{E}[L_{x, x'}] \leq \mu$, and $L_{x, x'} - \mathbb{E}[L_{x, x'}]$ is $\tau$-subgaussian.
Finally, we recall the notion of probabilistic differential privacy (Dwork et al., 2010; Sommer et al., 2019).
Definition 2.9 (Probabilistic DP).
Let $\varepsilon, \delta \in \mathbb{R}^{\geq 0}$. A randomized algorithm $A$ provides $(\varepsilon, \delta)$-probabilistic differential privacy (PDP) w.r.t. an adjacency relation $\Phi$ if for any $(x, x') \in \Phi$, the privacy loss random variables satisfy $\Pr[L_{x, x'} > \varepsilon] \leq \delta$ and $\Pr[L_{x', x} > \varepsilon] \leq \delta$.
2.6. Privacy Mechanisms
Finally, we recall two popular privacy protection mechanisms: the randomized response (Warner, 1965) and the Laplace mechanism (Dwork et al., 2006).
Definition 2.10 (Randomized response).
For an $\varepsilon \in \mathbb{R}^{\geq 0}$, the $\varepsilon$-randomized response (or $\varepsilon$-RR for short) over the binary alphabet $\{0, 1\}$ is the randomized algorithm $Q_{\mathrm{RR}}$ that maps a bit $b$ to a bit $b'$ with the following probability:

$Q_{\mathrm{RR}}(b)[b'] = \frac{e^{\varepsilon}}{e^{\varepsilon} + 1}$ if $b' = b$, and $Q_{\mathrm{RR}}(b)[b'] = \frac{1}{e^{\varepsilon} + 1}$ otherwise.

Then the $\varepsilon$-RR provides $\varepsilon$-DP.
Definition 2.11 (Laplace mechanism).
Given an $\varepsilon \in \mathbb{R}^{>0}$ and a metric $d$ over $\mathcal{X}$, the $\varepsilon$-Laplace mechanism over $\mathcal{X}$ is the randomized algorithm that maps an input $x$ to an output $y$ with probability $c \cdot e^{-\varepsilon \, d(x, y)}$, where $c$ is a normalization constant.

In Section 6, we use a multivariate Laplace mechanism over $\mathbb{R}^{n}$ with the Euclidean metric.
3. Privacy Properties of LSH
Several works in the literature refer to the privacy-preserving properties of LSH (Qi et al., 2017; Aghasaryan et al., 2013; Chow et al., 2012). The privacy guarantee attributed to LSH mechanisms hinges on the hash function, which 'protects' an individual's private attributes by revealing only their hash bucket. We now formally analyze LSH and explain why LSH implementations do not provide strong privacy guarantees, and could, in some situations, result in a complete privacy collapse for the individual.
3.1. Modeling LSH
We present a simple example to show how privacy can break down. Consider the set of secret inputs $\{ (1, 0), (0, 1), (1, 1) \}$ (normalized to unit vectors for hashing). Each element could, for example, correspond to whether or not an individual has rated two movies $m_1$ and $m_2$. Then we model an LSH as a probabilistic channel that maps a secret input to a binary observation.

For brevity we deal with a single random-projection-based hashing as described in Section 2.3. That is, we randomly choose a vector $\boldsymbol{r}$ representing the normal to a hyperplane, and compute the inner product of $\boldsymbol{r}$ with an input vector $\boldsymbol{x}$. The hash function outputs $0$ if the inner product is negative and $1$ otherwise. For example, if $\boldsymbol{r} = (1, -1)$ is chosen, then the hash function maps $(0, 1)$ to $0$, and both $(1, 0)$ and $(1, 1)$ to $1$.

In fact, there are exactly 6 possible (deterministic) hash functions $f_1, \ldots, f_6$ arising from the choice of the normal vector $\boldsymbol{r}$, corresponding to hyperplanes which separate different subsets of the points. Each of $f_1$, $f_2$, $f_3$, and $f_4$ occurs with probability $\frac{1}{8}$, while the two constant functions $f_5$ and $f_6$ each occur with probability $\frac{1}{4}$. The resulting channel, computed as the probabilistic sum of these deterministic hash functions, turns out to leak no information on the secret input (i.e., all outputs have equal probability conditioned on each input).

This indicates that the LSH mechanism above is perfectly private. However, in practice LSH also requires the release of the choice of the normal vector (e.g., (Chow et al., 2012)); in fact, since the channel on its own leaks nothing, further information must be released in order to learn anything useful from this channel. In other words, the choice of hash function is leaked. Notice that in our example, the functions $f_1$ to $f_4$ correspond to deterministic mechanisms which leak exactly 1 bit of the secret, while $f_5$ and $f_6$ leak nothing. In other words, with probability $\frac{1}{2}$, 1 bit of the 2-bit secret is leaked. Not only that, but mechanisms $f_1$ and $f_2$ isolate the secret $(1, 0)$ exactly, and similarly, mechanisms $f_3$ and $f_4$ isolate $(0, 1)$ exactly. Thus, the release of the normal vector destroys the privacy guarantee.
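The toy example above can be checked numerically. The sketch below (with the three rating patterns as assumed inputs) samples many random normals, recording both the marginal output distribution per input and which deterministic function each normal induces:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

# Secret inputs: rated only m1, rated only m2, rated both (normalized).
inputs = [np.array([1.0, 0.0]),
          np.array([0.0, 1.0]),
          np.array([1.0, 1.0]) / np.sqrt(2.0)]

trials = 100_000
functions = Counter()
ones = np.zeros(3)
for _ in range(trials):
    r = rng.standard_normal(2)
    f = tuple(int(np.dot(r, x) >= 0) for x in inputs)  # realized deterministic function
    functions[f] += 1
    ones += f

# Marginally, each input hashes to 1 with probability ~1/2: the channel leaks nothing.
print(ones / trials)   # all entries close to 0.5

# But only 6 deterministic functions ever occur; once the normal vector is
# released, the realized function is known, and 4 of the 6 isolate one input.
print(len(functions))  # 6
```

The two constant functions (all outputs equal) each occur with probability about 1/4, and the four non-constant functions with probability about 1/8 each, matching the analysis in the text.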
3.2. The Guarantee of LSH
In general, for any number of hash functions and any input length, an LSH mechanism which releases its choice of hyperplanes also leaks its choice of deterministic mechanism. This means that it leaks the equivalence classes of the secrets. Such mechanisms belong to the 'anonymity' style of privacy mechanisms, which promise privacy by hiding secrets in equivalence classes of size at least $k$. These have been shown to be unsafe due to their failure to compose well (Ganta et al., 2008; Fernandes et al., 2018). This failure leads to the potential for linkage or intersection attacks by an adversary armed with auxiliary information. For this reason, we consider compositionality an essential property for a privacy-preserving system. LSH with hyperplane release does not provide such privacy guarantees.
4. LSH-based Privacy Mechanisms
In this section we introduce a new privacy protection mechanism called an LSH-based privacy mechanism (LSHPM). Roughly speaking, this mechanism is an extension of RAPPOR (Erlingsson et al., 2014) with respect to a locality sensitive hashing (LSH).
4.1. Bitwise Randomized Response
We first define the bitwise randomized response as the privacy mechanism that applies the randomized response to each bit of the input independently.

Definition 4.1 (Bitwise randomized response).
Let $\kappa \in \mathbb{Z}^{>0}$, $\varepsilon \in \mathbb{R}^{\geq 0}$, and $Q_{\mathrm{RR}}$ be the $\varepsilon$-randomized response. The $\kappa$-bitwise randomized response (or BRR for short) is defined as the randomized algorithm $Q_{\mathrm{BRR}}$ that maps a bitstring $\boldsymbol{b} \in \{0, 1\}^{\kappa}$ to a bitstring $\boldsymbol{c} \in \{0, 1\}^{\kappa}$ with the following probability:

$Q_{\mathrm{BRR}}(\boldsymbol{b})[\boldsymbol{c}] \;=\; \prod_{i=1}^{\kappa} Q_{\mathrm{RR}}(b_i)[c_i].$

Then $Q_{\mathrm{BRR}}$ provides XDP w.r.t. the Hamming distance $d_{\mathrm{H}}$.
Proposition 1 (XDP of BRR).
Let $\kappa \in \mathbb{Z}^{>0}$ and $\varepsilon \in \mathbb{R}^{\geq 0}$. The $\kappa$-bitwise randomized response $Q_{\mathrm{BRR}}$ provides $(\varepsilon, d_{\mathrm{H}})$-XDP.
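A minimal sketch of the bitwise randomized response (function names ours), together with a brute-force check of the XDP bound of Proposition 1: for any inputs $\boldsymbol{b}, \boldsymbol{b}'$ and any output $\boldsymbol{c}$, the probability ratio is at most $e^{\varepsilon \, d_{\mathrm{H}}(\boldsymbol{b}, \boldsymbol{b}')}$.

```python
import math
import random
from itertools import product

def brr_sample(bits, eps, rng=random):
    # Apply the eps-randomized response to each bit independently.
    p_keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return [b if rng.random() < p_keep else 1 - b for b in bits]

def brr_prob(b, c, eps):
    # Exact probability that BRR maps bitstring b to bitstring c.
    p = math.exp(eps) / (math.exp(eps) + 1.0)
    kept = sum(bi == ci for bi, ci in zip(b, c))
    return p ** kept * (1.0 - p) ** (len(b) - kept)

def hamming(b, c):
    return sum(bi != ci for bi, ci in zip(b, c))

# Check the XDP bound over all pairs of 3-bit strings for eps = 1.
eps = 1.0
ok = all(
    brr_prob(b1, c, eps) <= math.exp(eps * hamming(b1, b2)) * brr_prob(b2, c, eps) + 1e-12
    for b1, b2, c in product(product((0, 1), repeat=3), repeat=3)
)
print(ok)  # True
```

The bound is tight when the output agrees with one input exactly on the bits where the two inputs differ.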
4.2. Construction of LSHPMs
Now we introduce an LSH-based privacy mechanism (LSHPM) as a randomized algorithm that (i) randomly chooses a $\kappa$-bit LSH function $H^{(\kappa)}$, (ii) computes the $\kappa$-bit hash value $H^{(\kappa)}(x)$ of a given input $x$, and then (iii) applies the bitwise randomized response $Q_{\mathrm{BRR}}$ to $H^{(\kappa)}(x)$.

Formally, we define the mechanism as follows. Recall that $A_{\mathrm{LSH}}$ is the randomized algorithm that randomly chooses a $\kappa$-bit LSH function according to a distribution $D$ and outputs the hash value $H^{(\kappa)}(x)$ of a given input $x$. (See Section 2.2 for details.)
Definition 4.2 (LSHPM).
Let $Q_{\mathrm{BRR}}$ be the bitwise randomized response. The LSH-based privacy mechanism with a $\kappa$-bit LSH function $H^{(\kappa)}$ is the randomized algorithm defined by $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$. Given a distribution $D$ of the $\kappa$-bit LSH functions, the LSH-based privacy mechanism w.r.t. $D$ is the randomized algorithm defined by $Q_{\mathrm{BRR}} \circ A_{\mathrm{LSH}}$.

It should be noted that there are two kinds of randomness in the LSHPM: (a) the randomness in choosing a (deterministic) LSH function from $D$ (e.g., the random seed in the random-projection-based hashing; more specifically, a tuple of seeds for the $\kappa$-bit LSH function is randomly chosen), and (b) the random noise added by the bitwise randomized response $Q_{\mathrm{BRR}}$. We can assume that each user of this privacy mechanism selects the input $x$ independently of both kinds of randomness, since they wish to protect their own privacy when publishing $x$.

In practical settings, the same LSH function is often employed to produce hash values of different inputs; namely, multiple hash values depend on an identical hash seed (e.g., multiple users share the same LSH function so that they can compare their hash values). Furthermore, the adversary may obtain the hash function itself (or the seed used to produce it), and might be able to learn the set of possible inputs that produce the same hash value without knowing the actual input $x$. Therefore, the hash value may reveal partial information on the input (see Section 3), and the bitwise randomized response is crucial in guaranteeing privacy.
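Putting the pieces together, a minimal sketch of the LSHPM with random-projection hashing (variable names ours; in a real deployment the hash seeds, here the rows of `normals`, would be shared among users):

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def lshpm(x, normals, eps):
    # (i)-(ii): kappa-bit random-projection hash of x; the rows of `normals`
    # play the role of the shared hash seeds.
    bits = (normals @ x >= 0).astype(int)
    # (iii): bitwise randomized response on the hash.
    p_keep = math.exp(eps) / (math.exp(eps) + 1.0)
    flip = rng.random(bits.shape) >= p_keep
    return np.where(flip, 1 - bits, bits)

# Shared across users: the same kappa = 20 hyperplanes in dimension 5.
normals = rng.standard_normal((20, 5))
x = rng.standard_normal(5)
noisy_hash = lshpm(x, normals, eps=1.0)  # a 20-bit noisy hash
```

Only the noisy hash is published; the randomized-response step is what keeps the output private even when `normals` is known to the adversary.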
5. Privacy Guarantees for LSHPMs
In this section we show two types of privacy that the LSHPMs guarantee: (i) the exact privacy that we learn after both input vectors and hash seeds are selected, and (ii) the expected privacy that represents the probability distribution of possible levels of exact privacy. To define the latter, we introduce a variety of privacy notions with shared randomness, including new notions of extended/concentrated privacy with a metric.
5.1. Exact Privacy of LSHPMs
The exact degree of privacy provided by LSHPMs depends on the random choice of hash seeds. This means that some user's input may be protected only weakly when an 'unlucky' seed is chosen to produce the hash value. For example, if many hyperplanes given by such unlucky seeds split a small specific area in the input space, then the hash value reveals much information on an input chosen from that small area.

Hence we can learn the exact degree of the privacy guarantee only after obtaining the hash seeds and the input vectors. Formally, we state the exact privacy provided by LSHPMs as follows.
Proposition 2 (Exact privacy of $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$).
Let $\varepsilon \in \mathbb{R}^{\geq 0}$, $H^{(\kappa)}$ be a $\kappa$-bit LSH function, and $d_{H^{(\kappa)}}$ be the pseudometric defined by $d_{H^{(\kappa)}}(x, x') = d_{\mathrm{H}}(H^{(\kappa)}(x), H^{(\kappa)}(x'))$ for each $x, x' \in \mathcal{X}$. Then the LSH-based privacy mechanism $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$ provides $(\varepsilon, d_{H^{(\kappa)}})$-XDP.

The above proposition implies that an unlucky user may suffer a large privacy loss even though the LSHPM uses the randomized response, simply because LSH preserves the distance between inputs only probabilistically and approximately. By Proposition 2, the LSHPM provides DP in the worst case, i.e., when the Hamming distance between hash values is maximum due to unlucky hash seeds and/or a too large original distance between the inputs.
Proposition 3 (Worst-case privacy of $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$).
Let $\varepsilon \in \mathbb{R}^{\geq 0}$ and $H^{(\kappa)}$ be a $\kappa$-bit LSH function. The LSH-based privacy mechanism $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$ provides $\kappa\varepsilon$-DP.

This shows that $Q_{\mathrm{BRR}} \circ H^{(\kappa)}$ achieves weaker privacy for a larger $\kappa$.
5.2. Privacy Notions with Shared Randomness
Next we introduce expected privacy notions to characterize the range of possible levels of exact privacy. As shown later, the privacy loss follows a probability distribution due to the random choice of hash seeds. For example, the number of hyperplanes splitting a particular small area in the input space can vary depending on the random choice of hash seeds.
To formalize the LSHPM ’s expected privacy, we need to deal with the same hash seeds shared among multiple users. Thus we propose new privacy notions for privacy protection mechanisms that share randomness among them.
Hereafter we denote by $\mathcal{S}$ a finite set of shared inputs, and by $A_s : \mathcal{X} \rightarrow \mathbb{D}\mathcal{Y}$ a randomized algorithm from $\mathcal{X}$ to $\mathcal{Y}$ with a shared input $s \in \mathcal{S}$. Given a distribution $\lambda$ over $\mathcal{S}$, we denote by $A_{\lambda}$ the randomized algorithm that draws a shared input $s$ from $\lambda$ and behaves as $A_s$; i.e., for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, $A_{\lambda}(x)[y] = \sum_{s \in \mathcal{S}} \lambda(s) \, A_s(x)[y]$. For brevity we sometimes abbreviate $A_{\lambda}$ as $A$.
We first introduce privacy loss variables with shared randomness as follows.
Definition 5.1 (Privacy loss random variable with shared randomness).
Given a randomized algorithm $A_s$ with a shared input $s$, the privacy loss on an output $y$ w.r.t. inputs $x, x'$ and $s$ is defined by:

$\mathcal{L}(y, x, x', s) \;=\; \ln \frac{A_s(x)[y]}{A_s(x')[y]},$

where the probability is taken over the random choices in $A_s$. Given a distribution $\lambda$ over $\mathcal{S}$, the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ w.r.t. $\lambda$ is the real-valued random variable representing the privacy loss $\mathcal{L}(y, x, x', s)$ where $s$ is sampled from $\lambda$ and $y$ is sampled from $A_s(x)$. Here we call $s$ a shared randomness.

Then the privacy loss distribution of $A$ over $x, x'$ is defined analogously to Definition 2.6. Now we extend mean-concentrated DP with shared randomness as follows.
Definition 5.2 (Mean-concentrated DP with shared randomness).
Let $\mu, \tau \in \mathbb{R}^{\geq 0}$ and $\lambda$ be a distribution over $\mathcal{S}$. A randomized algorithm $A$ provides $(\mu, \tau)$-mean-concentrated differential privacy with shared randomness (rCDP) if for all $x, x' \in \mathcal{X}$, the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\mathbb{E}[L_{x, x'}] \leq \mu$, and $L_{x, x'} - \mathbb{E}[L_{x, x'}]$ is $\tau$-subgaussian.
5.3. Concentrated/Probabilistic DP with a Metric
Next we introduce a new privacy notion, called mean-concentrated XDP (abbreviated as CXDP), that relaxes CDP by incorporating a metric as follows.

Definition 5.3 (Mean-concentrated XDP).
Let $\mu, \tau \in \mathbb{R}^{\geq 0}$, $\lambda$ be a distribution over $\mathcal{S}$, and $d$ be a metric over $\mathcal{X}$. A randomized algorithm $A$ provides $(\mu, \tau, d)$-mean-concentrated extended differential privacy (CXDP) if for all $x, x' \in \mathcal{X}$, the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\mathbb{E}[L_{x, x'}] \leq \mu \, d(x, x')$, and $L_{x, x'} - \mathbb{E}[L_{x, x'}]$ is $\tau$-subgaussian.
To explain the meaning of CXDP, we also introduce probabilistic XDP (abbreviated as PXDP) as a privacy notion that relaxes XDP and PDP as follows.
Definition 5.4 (Probabilistic XDP).
Let $\varepsilon : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{\geq 0}$, $\delta \in [0, 1]$, and $\lambda$ be a distribution over $\mathcal{S}$. A randomized algorithm $A$ provides $(\varepsilon, \delta)$-probabilistic extended differential privacy (PXDP) if for all $x, x' \in \mathcal{X}$, the privacy loss random variable $L_{x, x'}$ of $A$ over $x, x'$ w.r.t. $\lambda$ satisfies $\Pr[L_{x, x'} > \varepsilon(x, x')] \leq \delta$.

Again, we sometimes abuse notation and simply write $\varepsilon$ when $\varepsilon(x, x')$ is constant for all $x, x'$.
Now we show that CXDP implies PXDP, and that PXDP implies XDP as follows.
Proposition 4 (CXDP implies PXDP).
Let $\mu, \tau \in \mathbb{R}^{\geq 0}$, $\delta \in (0, 1]$, $A$ be a randomized algorithm, and $d$ be a metric over $\mathcal{X}$. We define $\varepsilon : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{\geq 0}$ by $\varepsilon(x, x') = \mu \, d(x, x') + \tau \sqrt{2 \ln \frac{1}{\delta}}$ for all $x, x' \in \mathcal{X}$. If $A$ provides $(\mu, \tau, d)$-CXDP, then it provides $(\varepsilon, \delta)$-PXDP.

Proposition 5 (PXDP implies XDP).
Let $A$ be a randomized algorithm, $\varepsilon : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{\geq 0}$, and $\delta \in [0, 1]$. If $A$ provides $(\varepsilon, \delta)$-PXDP, then it provides $(\varepsilon, \delta)$-XDP.
5.4. Expected Privacy of LSHPMs
Next we show that the LSHPM $Q_{\mathrm{BRR}} \circ A_{\mathrm{LSH}}$ provides CXDP and PXDP. To derive these results, we first prove that the Hamming distance between hash values follows a binomial distribution.

Proposition 6 (Distribution of the Hamming distance of hash values).
Let $(\mathcal{H}, D)$ be an LSH scheme w.r.t. a metric $\mathit{dist}$ over $\mathcal{X}$. Let $x, x' \in \mathcal{X}$ be any two inputs, and $W$ be the random variable of the Hamming distance between their $\kappa$-bit hash values, i.e., $W = d_{\mathrm{H}}(H^{(\kappa)}(x), H^{(\kappa)}(x'))$, where a $\kappa$-bit LSH function $H^{(\kappa)}$ is drawn according to $D$. Then $W$ follows the binomial distribution with mean $\kappa \, \mathit{dist}(x, x')$ and variance $\kappa \, \mathit{dist}(x, x')(1 - \mathit{dist}(x, x'))$.
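Proposition 6 is easy to verify empirically for random-projection hashing. The sketch below repeatedly draws a fresh multi-bit hash function and records the Hamming distance between the hashes of two fixed vectors; the sample mean and variance should match the binomial parameters:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two unit vectors 45 degrees apart: dist = d_theta = 0.25.
x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0]) / np.sqrt(2.0)
kappa, dist = 16, 0.25

trials = 50_000
samples = np.empty(trials)
for t in range(trials):
    normals = rng.standard_normal((kappa, 2))  # a fresh kappa-bit hash function
    samples[t] = np.sum((normals @ x >= 0) != (normals @ y >= 0))

print(samples.mean())  # ~ kappa * dist = 4.0
print(samples.var())   # ~ kappa * dist * (1 - dist) = 3.0
```

Each hyperplane separates the two vectors independently with probability equal to the angular distance, which is exactly the binomial structure used in the proofs.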
Now we show that the LSHPM provides CXDP as follows.
Theorem 1 (CXDP of the LSHPMs).
The LSH-based privacy mechanism $Q_{\mathrm{BRR}} \circ A_{\mathrm{LSH}}$ provides CXDP.

This implies that the LSHPM provides PXDP, hence XDP.

Theorem 2 (PXDP/XDP of the LSHPMs).
Let $\delta \in (0, 1]$ and let $\varepsilon'$ be the function obtained from the CXDP guarantee of Theorem 1 via Proposition 4. The LSH-based mechanism $Q_{\mathrm{BRR}} \circ A_{\mathrm{LSH}}$ provides $(\varepsilon', \delta)$-PXDP, hence $(\varepsilon', \delta)$-XDP.
Finally, we present a tighter bound for the PXDP of the proposed mechanism.

Proposition 7 (Tighter bound for PXDP/XDP).
For an $\varepsilon \in \mathbb{R}^{\geq 0}$, we define $\varepsilon''$ and $\delta''$ from $\varepsilon$, $\kappa$, and the binomial distribution of Proposition 6. The LSH-based mechanism $Q_{\mathrm{BRR}} \circ A_{\mathrm{LSH}}$ provides $(\varepsilon'', \delta'')$-PXDP, hence $(\varepsilon'', \delta'')$-XDP.

Note that when we apply Proposition 7 in our experiments, we numerically compute $\varepsilon''$ from a constant $\delta''$ using the convexity of the bound.
6. Experimental Evaluation
In this section we present an experimental evaluation of our LSH-based privacy mechanism (LSHPM) on a privacy-preserving nearest neighbor search problem. Our goal is to evaluate the effectiveness of the LSHPM against a state-of-the-art privacy mechanism w.r.t. utility, while promising a similar privacy guarantee.
6.1. Problem Statement
Our problem of interest is privacy-preserving collaborative filtering. In this scenario, we are given a dataset of users in which each user is represented as a (real-valued) vector of attributes. The data curator's goal is to provide recommendations to each user based on their nearest neighbors w.r.t. an appropriate distance measure over users. When the dataset is large, LSH can be used to provide recommendations based on approximate nearest neighbors.

When attribute values are sensitive, LSH alone does not provide strong privacy guarantees, as shown in Section 3. In this case, to protect users' sensitive attributes from an untrusted data curator, a differentially private mechanism can be applied to the data prior to sending it to the curator. The noisy data can then be used for recommendations.
We compare two privacy mechanisms:

LSHPM Mechanism: For each user we apply $\kappa$-bit LSH to the attribute vector according to a precomputed (shared) hash function and subsequently apply the bitwise randomized response (Definition 4.1) to the computed hash. The noisy hash is used to compute nearest neighbors. We compute the privacy guarantee of this mechanism according to Proposition 7, which gives an XDP-style guarantee that depends on the angular distance $d_{\theta}$.

Multivariate Laplace Mechanism: For each user we first add Laplace noise parametrized by $\varepsilon$ (Definition 2.11) to the attribute vector before applying LSH in order to generate noisy hash values. Our implementation of the multivariate Laplace mechanism is private for the Euclidean distance and follows that described in (Fernandes et al., 2019); namely, we generate additive noise by constructing a unit vector uniformly at random over the unit sphere, scaled by a random value generated from the gamma distribution with shape $n$ and scale $\frac{1}{\varepsilon}$. Note that this method carries a strong privacy guarantee, namely XDP w.r.t. the Euclidean distance, since the multivariate Laplace mechanism is $\varepsilon$-private for that metric.
6.2. Comparing Privacy and Utility
For our evaluation we compare the utility loss of each mechanism for the same privacy guarantee. To establish equivalent privacy guarantees, the overall value of each mechanism is used as the basis for comparison; for the Laplace mechanism this is and for LSHPM this is from Proposition 7. However, as the privacy guarantee of LSHPM depends on and that of the Laplace depends on , we cannot simply compare the guarantees directly; we also need to compare the privacy guarantee w.r.t. the distances between users. Hence we make use of the relationship between the Euclidean and cosine distances provided that attribute vectors are normalized prior to performing experiments. We convert the Euclidean distance to the angular distance using the following formula:
Since the resulting guarantee depends on both the length of the bit string and the distance between users, we perform comparisons over reasonable ranges of these variables to properly evaluate the range of utility loss. The LSHPM privacy guarantee also includes an additional parameter, which is fixed to a constant value for all computations.
For utility we use the average distance of the nearest neighbors from the original input vector, as defined in Definition 2.3. We will present an appropriate distance measure for our particular use case in the following section.
6.3. Use Case and Experimental Setup
Our particular application is friend matching, in which users are recommended similar users based on common items of interest – in this case, similarly rated movies. For our experiments we use the MovieLens 100k dataset,³ which contains 943 users with ratings across 1682 movies, with ratings ranging from 1 to 5. We associate a movie-ratings vector with each user, and for nearest-neighbor comparison we use the angular distance between ratings vectors (from Eqn (2)), which is closely related to the cosine distance, an established measure of similarity in previous work. That is, users who have similar movie ratings will be close in cosine (and angular) distance.

³Downloaded from https://grouplens.org/datasets/movielens/.

We computed ratings vectors for each user, where the i-th component corresponds to the i-th movie in the dataset; unrated movies are assumed to have a rating value of 0. We reduced the size of the ratings vectors by including only movies from the 4 most-rated genres, and selecting subsets of those movies to produce 100-dimensional and 500-dimensional ratings vectors for each user. Ratings were normalized by subtracting the total average rating (across all users) from each rated movie; unrated movies retained their original value of 0.
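A sketch of this preprocessing, on toy data in place of the real MovieLens ratings (the function name and indices are illustrative):

```python
import numpy as np

def build_ratings_matrix(triples, n_users, n_movies):
    """Build a ratings matrix from (user, movie, rating) triples.

    Unrated entries stay 0; rated entries are centered by the global mean rating.
    """
    R = np.zeros((n_users, n_movies))
    for u, m, r in triples:
        R[u, m] = r
    rated = R != 0                 # mask of rated entries, taken before centering
    R[rated] -= R[rated].mean()    # subtract the total average rating
    return R

# toy example (hypothetical indices and ratings, not the real dataset)
R = build_ratings_matrix([(0, 0, 5), (0, 2, 1), (1, 1, 3)], n_users=2, n_movies=3)
```

Each row of `R` would then be used as that user's ratings vector.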
We next computed the nearest neighbors (w.r.t. the angular distance) for each user, for both vector lengths, using standard nearest-neighbor search (i.e., pairwise comparisons over all ratings vectors). We refer to these data points as the True Nearest Neighbors; these are used to evaluate the performance of standard LSH and, by extension, the Laplace and LSHPM mechanisms. The distribution of True Nearest Neighbor distances for both vector lengths is shown in Figure 1.
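The brute-force search over angular distances might look like the following (illustrative, on random data; function names are ours):

```python
import numpy as np

def angular_distance_matrix(X):
    """Pairwise angular distances (angle / pi) between rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize rows
    cos = np.clip(Xn @ Xn.T, -1.0, 1.0)
    return np.arccos(cos) / np.pi

def true_nearest(X, i):
    """Index of user i's true nearest neighbor by angular distance."""
    d = angular_distance_matrix(X)[i].copy()
    d[i] = np.inf                  # exclude the query itself
    return int(np.argmin(d))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))  # toy data: 50 users, 100 attributes
nn = true_nearest(X, 0)
```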
We note that for the smaller vector length the nearest neighbors lie at smaller distances⁴ than for the larger vector length, for which both the minimum and the average nearest-neighbor distances are larger. These distances are important for computing the privacy guarantee, which depends on the true distance between users. We therefore use distances between 0.05 and 0.5 for our utility comparison.

⁴Ignoring distances of 0, which represent users who have exactly the same ratings vector; this typically occurs when users only rate a single movie.
6.4. Performance of LSH
We performed a baseline comparison of LSH against the True Nearest Neighbors to establish the utility of vanilla LSH. LSH was implemented, for each bit length considered, using the random-projection-based hashing described in Section 2.3, since this provides a guarantee w.r.t. the angular distance between vectors. A (uniformly random) n-dimensional normal vector can be constructed by selecting each component from a standard normal distribution; this is done for each of the output bits to generate the LSH hash function. The same function is used to encode each user's ratings vector into a bit string.
For each user we then computed their nearest neighbors using the Hamming distance on bit strings. Where multiple neighbors shared the same distance (i.e., they were in the same hash bucket), we chose a nearest neighbor at random.
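The hashing and Hamming-distance search, including the random tie-breaking, can be sketched as follows (toy data; `hash_dataset` and `hamming_nearest` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_dataset(X, k):
    """k-bit sign-random-projection hashes for each row of X."""
    planes = rng.standard_normal((k, X.shape[1]))  # shared hash function
    return (X @ planes.T >= 0).astype(np.uint8)

def hamming_nearest(hashes, i):
    """Nearest neighbor of i by Hamming distance, ties broken at random."""
    d = (hashes ^ hashes[i]).sum(axis=1)
    d[i] = d.max() + 1                             # exclude the query itself
    candidates = np.flatnonzero(d == d.min())      # all neighbors at minimal distance
    return int(rng.choice(candidates))

X = rng.standard_normal((50, 100))                 # toy data: 50 users, 100 attributes
H = hash_dataset(X, k=20)
nn = hamming_nearest(H, 0)
```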
The results for utility loss of LSH against the True Nearest Neighbors are shown in Figure 2. As expected, the utility loss decreases as the number of bits increases, since each bit can be seen as (probabilistically) encoding some information about the distance; utility loss is also smaller for the shorter vector length, since the dimensionality reduction from vector elements to hash bits is less severe.
6.5. Performance of Privacy Mechanisms
We now compare the performance of LSHPM and the Laplace mechanism against LSH and the True Nearest Neighbors. We implemented both mechanisms as described in Section 6.1. For both LSHPM and Laplace, the same hash function was used as for vanilla LSH for the purposes of comparison. We chose a range of privacy budgets for LSHPM. To compute the overall XDP guarantee as per Proposition 7, we fixed the remaining parameters and varied the distance from 0.05 to 0.5 to obtain the corresponding privacy values⁵ shown in Table 1.

⁵The distance here is the measure corresponding to the angular distance, as in Section 5.4.
           Length of bitstring
Distance   10        20        30        50
0.05       0.31111   0.2028    0.15866   0.11713
0.1        0.3766    0.25209   0.19988   0.14979
0.2        0.44181   0.30509   0.24546   0.18683
0.25       0.45747   0.31977   0.25866   0.19796
0.3        0.46544   0.32908   0.26749   0.20573
0.4        0.46266   0.33474   0.27462   0.21313
0.5        0.43792   0.32553   0.26969   0.21123
We chose distance values corresponding to the actual distances observed for true nearest neighbors, as described in Section 6.3.
We first plotted the utility loss of LSHPM against vanilla LSH (see Figure 3), since the performance of LSHPM is bounded by that of LSH. We note that the utility of LSHPM improves as the privacy budget increases, approaching the utility of LSH for large budgets. This suggests that LSHPM could perform better for larger datasets, for which larger privacy budgets would not incur a significant privacy loss for individuals. We also notice that the performance of LSHPM is better for smaller distances, corresponding to more similar users; we suggest that LSHPM would therefore be suitable for datasets in which the vector lengths are smaller (i.e., there are fewer items in the recommendation engine).
Finally, we compared the utility loss of LSHPM against the multivariate Laplace mechanism. We observe that LSHPM consistently outperforms the Laplace mechanism, most notably for larger privacy budgets, larger bit lengths, and larger distances between users.
In summary, our experiments show that LSHPM outperforms the Laplace mechanism under a comparable privacy guarantee, and that the performance of LSHPM approaches that of vanilla LSH for larger privacy budgets and larger bit lengths. This suggests that LSHPM could be used for very large datasets in which larger privacy budgets are tolerable.
7. Conclusion
In this paper we proposed an LSHbased privacy mechanism that provides extended DP with a wide range of metrics. We first showed that LSH itself does not provide strong privacy guarantees and could result in complete privacy collapse in some situations. We then proved that our LSHbased mechanism provides rigorous privacy guarantees: concentrated/probabilistic versions of extended DP. By experiments with large datasets, we demonstrated that our mechanism provides much higher utility than the multivariate Laplace mechanism, and that it can be applied to friend matching with rigorous privacy guarantees and high utility.
Acknowledgements.
This work was supported by Inria under the project LOGIS, by JSPS KAKENHI Grant Number JP19H04113, and by an Australian Government RTP Scholarship (2017278).

References
A collaborative filtering framework for friends recommendation in social networks based on interaction intensity and adaptive user similarity. Social Network Analysis and Mining 3, pp. 359–379.
Recommender systems. Springer.
On the use of LSH for privacy-preserving personalization. In Proc. TrustCom, pp. 362–371.
Metric-based local differential privacy for statistical applications. CoRR abs/1805.01456.
Practical and optimal LSH for angular distance. In Proc. NIPS, pp. 1–9.
Approximate nearest neighbor search in high dimensions. In Proc. ICM, pp. 3287–3318.
Geo-indistinguishability: differential privacy for location-based systems. In Proc. CCS, pp. 901–914.
Optimal geo-indistinguishable mechanisms for location privacy. In Proc. CCS, pp. 251–262.
Min-wise independent permutations. Journal of Computer and System Sciences 60, pp. 630–659.
Similarity estimation techniques from rounding algorithms. In Proc. STOC, pp. 380–388.
Broadening the scope of differential privacy using metrics. In Proc. PETS, pp. 82–102.
Efficient utility improvement for location privacy. PoPETs 2017 (4), pp. 308–328.
Preserving the privacy of social recommendation with a differentially private approach. In Proc. SmartCity, pp. 780–785.
Improved LSH for privacy-aware and robust recommender system with sparse data in edge environment. EURASIP Journal on Wireless Communications and Networking 171, pp. 1–11.
An efficient privacy-preserving friend recommendation scheme for social network. IEEE Access 6, pp. 56018–56028.
A practical system for privacy-preserving collaborative filtering. In Proc. ICDM Workshops, pp. 547–554.
Locality-sensitive hashing scheme based on p-stable distributions. In Proc. SCG, pp. 253–262.
Secure friend discovery in mobile social networks. In Proc. INFOCOM, pp. 1647–1655.
Calibrating noise to sensitivity in private data analysis. In Proc. TCC, pp. 265–284.
Boosting and differential privacy. In Proc. FOCS, pp. 51–60.
Concentrated differential privacy. CoRR abs/1603.01887.
Differential privacy. In Proc. ICALP, pp. 1–12.
RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proc. CCS, pp. 1054–1067.
Processing text for privacy: an information flow perspective. In Proc. FM, pp. 3–21.
Generalised differential privacy for text document processing. In Proc. POST, pp. 123–148.
Composition attacks and auxiliary information in data privacy. In Proc. KDD, pp. 265–273.
Similarity search in high dimensions via hashing. In Proc. VLDB, pp. 518–529.
Privacy preserving profile matching for social networks. In Proc. CBD, pp. 263–268.
Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. STOC, pp. 604–613.
Discrete distribution estimation under local privacy. In Proc. ICML, pp. 2436–2444.
Not all attributes are created equal: private mechanisms for linear queries. PoPETs 2020 (1), pp. 103–125.
On the anonymization of differentially private location obfuscation. In Proc. ISITA, pp. 159–163.
Local obfuscation mechanisms for hiding probability distributions. In Proc. ESORICS, pp. 128–148.
Efficient privacy-preserving content recommendation for online social communities. Neurocomputing 219, pp. 440–454.
SPFM: scalable and privacy-preserving friend matching in mobile clouds. IEEE Internet of Things Journal 4 (2), pp. 583–591.
LinkMirage: enabling privacy-preserving analytics on social relationships. In Proc. NDSS.
ARMOR: a trust-based privacy-preserving framework for decentralized friend recommendation in online social networks. Future Generation Computer Systems 79, pp. 82–94.
Privacy: theory meets practice on the map. In Proc. ICDE, pp. 277–286.
Location privacy via private proximity testing. In Proc. NDSS.
Back to the drawing board: revisiting the design of optimal location privacy-preserving mechanisms. In Proc. CCS, pp. 1959–1972.
A distributed locality-sensitive hashing-based approach for cloud service recommendation from multi-source data. IEEE Journal on Selected Areas in Communications 35 (11), pp. 2616–2624.
Privacy-preserving and efficient friend recommendation in online social networks. Trans. Data Privacy 8 (2), pp. 141–171.
Privacy games: optimal user-centric data obfuscation. PoPETs 2015 (2), pp. 299–315.
Privacy loss classes: the central limit theorem in differential privacy. PoPETs 2019 (2), pp. 245–269.
Learning to hash for indexing big data – a survey. Proceedings of the IEEE 104 (1), pp. 34–57.
Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60 (309), pp. 63–69.
Revisiting user mobility and social relationships in LBSNs: a hypergraph embedding approach. In Proc. WWW, pp. 2147–2157.
Fairness-aware and privacy-preserving friend matching protocol in mobile social networks. IEEE Transactions on Emerging Topics in Computing 1 (1), pp. 192–200.
Appendix A Details on the Technical Results
We first recall the Chernoff bound, which is used in the proof of Proposition 4.
Lemma 1 (Chernoff bound).
Let X be a real-valued random variable. Then for all t > 0 and all a,
Pr[X ≥ a] ≤ e^{−ta} E[e^{tX}].
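The bound is a direct consequence of Markov's inequality applied to the moment generating function; for any t > 0:

```latex
\Pr[X \ge a]
  \;=\; \Pr\!\left[e^{tX} \ge e^{ta}\right]
  \;\le\; \frac{\mathbb{E}\!\left[e^{tX}\right]}{e^{ta}}
  \;=\; e^{-ta}\,\mathbb{E}\!\left[e^{tX}\right].
```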