Differential privacy dwork2006calibrating is a formal notion of privacy that allows a trustworthy data curator, in possession of sensitive data from a large number of individuals, to approximately answer a query submitted by an analyst while maintaining individual privacy. Intuitively, differential privacy ensures that the data analyst knows no more about any individual in the dataset after the analysis than she knew before it started. One common mechanism for achieving differential privacy is to add random noise to the answer of a query, with the noise carefully calibrated to the sensitivity of the query and a global privacy budget . Sensitivity, in this case, is defined as the maximum change in the answer to a query over all pairs of neighboring datasets, i.e., datasets differing in one row or, equivalently, having a Hamming distance of one. One limitation of this definition is that it provides the same level of protection for all elements in the data universe .
In specific domains, it might be more natural to measure the distinguishability between two datasets by some generic metric instead of just Hamming distance. For instance, consider location-based systems, where it might be acceptable to disclose coarse-grained information about an individual’s location instead of the exact location. In this case, the geographical distance would be an appropriate measure of distinguishability andres2013geo . There are other scenarios where some attributes of the dataset need more protection than others. As an example, consider a classification problem with instance space where specific features of are more sensitive than others (possibly due to fairness requirements dwork2012fairness ). In this case, might be a reasonable choice for the metric.
In the applications mentioned above, standard differential privacy (with a global privacy budget) is too strong and sacrifices too much utility. To address this limitation, several relaxations of differential privacy have been proposed recently chatzikokolakis2013broadening ; he2014blowfish . In this work, we consider -privacy, an instantiation of the privacy notion introduced in chatzikokolakis2013broadening , for statistical databases. Intuitively, -privacy allows specifying a separate privacy budget for each pair of elements in the data universe , given by the value . In Section 2.2, we formally define -privacy and discuss its properties.
Our primary objective is to develop -private mechanisms that provide a good trade-off (w.r.t. a given utility measure) between privacy and utility. Note that chatzikokolakis2013broadening have only constructed universally optimal mechanisms ghosh2012universally under specific metrics (such as the Manhattan metric) for particular classes of queries such as count, sum, average, and percentage queries. In particular, we want to tailor any existing differentially private mechanism to satisfy -privacy. For this, we propose a utility-measure-dependent pre-processing strategy that applies to any data universe and any choice of the metric .
1.1 Main Contributions
We describe a meta procedure (for any metric) to tailor any existing differentially private mechanism into a -private variant for the case of linear queries. The main challenge is that the privacy budget, i.e., , is specified on the input universe , whereas the noise is added to the query response which belongs to the outcome space. Thus we need to somehow propagate the information contained in the metric to the output space.
The main component of our new mechanisms is a pre-processing optimization (depending on the utility measure of interest) that chooses the model parameters of the mechanism. We provide the explicit formulation of this pre-processing optimization problem for some commonly used utility measures (under any metric). In general, these problems are non-convex and computationally challenging. However, we show that for certain loss functions the optimization problem can be approximately solved using heuristic approaches (e.g., Algorithm 3 for squared loss).
Based on the meta procedure, we describe -private variants of several well-known -differentially private algorithms (both online and offline). In particular, we illustrate the -private variants of the Laplace dwork2006calibrating and Exponential mcsherry2007mechanism mechanisms, as well as the SmallDB blum2013learning and MWEM hardt2012simple mechanisms for generating synthetic datasets.
We demonstrate the effectiveness of -privacy in terms of utility by evaluating the proposed -private Laplace mechanism on both synthetic and real datasets using a set of randomly generated linear queries. In both cases, we define the metric as the Euclidean distance between elements in the data universe. Our results show that the utility of the -private variant of the Laplace mechanism is higher than that of its vanilla counterpart, with some specific queries showing significant improvement. We also extend our techniques to the Blowfish (distance threshold model) privacy notion he2014blowfish and show similar improvements.
2 Background and -Privacy
This section gives the background on differential privacy and associated concepts of linear queries, sensitivity, and utility. We also introduce -privacy and its relation to other privacy notions.
Let for , and . We write if is true and otherwise. Let denote the th coordinate of the vector, and denote the th row of the matrix . We denote the inner product of two vectors by . The -element vector of all ones is denoted . For two vectors , the operation represents element-wise multiplication. For and , the operation represents row-wise scalar multiplication of by the associated entry of . For a vector represents that the vector is element-wise non-negative. Hamming distance is defined as . The -norms are denoted by . For a matrix , define .
2.1 Differential Privacy
Let denote the data universe and its size. A database of rows is modelled as a histogram (with ), where encodes the number of occurrences of the th element of the universe . Two neighboring databases and (from ) that differ in a single row () correspond to two histograms and (from ) satisfying .
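The histogram representation can be illustrated with a minimal sketch. The universe, the toy databases, and the helper names `to_histogram` and `l1_distance` are hypothetical and chosen only for illustration. In this example, replacing one row by a different universe element changes exactly two counts by one each.

```python
from collections import Counter

def to_histogram(database, universe):
    """Count how often each universe element occurs in the database."""
    counts = Counter(database)
    return [counts[element] for element in universe]

def l1_distance(x, y):
    """Sum of absolute coordinate differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical toy universe and two databases that differ in a single row.
universe = ["a", "b", "c", "d"]
database = ["a", "c", "c", "d", "a"]
neighbor = ["a", "c", "b", "d", "a"]  # third row changed from "c" to "b"

x = to_histogram(database, universe)
x_prime = to_histogram(neighbor, universe)

# Both histograms sum to the number of rows; only two entries differ.
print(x, x_prime, l1_distance(x, x_prime))
```

Working with histograms rather than row lists makes every linear query a simple inner product, which is the form used in the rest of this section.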
A mechanism (where is the outcome space, and is the query class) is a randomized algorithm which takes a dataset and a query , and answers with some . Informally, a mechanism satisfies differential privacy if the densities of the output distributions on inputs with are pointwise within an multiplicative factor of each other. Here is a parameter that measures the strength of the privacy guarantee (smaller being a stronger guarantee).
Definition 1 (Differential Privacy, dwork2006calibrating ).
A mechanism is called -differentially private if for all such that , for every , and for every measurable , we have
2.2 -Privacy
Here we consider a relaxed privacy notion, which is a particular case of the definition from chatzikokolakis2013broadening , for statistical databases. Given a metric on the data universe, a mechanism satisfies -privacy if the densities of the output distributions on input histograms with and differ on -th entries are pointwise within an multiplicative factor of each other.
Definition 2 (-Privacy).
Let be the privacy budget (such that , , , and , ) of the data universe . A mechanism is said to be -private iff s.t. , , and (for some ), and we have
When , we recover the standard -differential privacy.
Most of the desirable properties of differential privacy carry over to -privacy as well, with suitable generalization.
Fact 1 (Properties of -Privacy).
The -privacy satisfies the following desirable properties:
Resistant to post-processing: If is -private, and is any (randomized) function, then the composition is also -private.
Composability: Let be a -private algorithm for . If is defined to be , then is -private.
Group privacy: If is a -private mechanism and satisfy (with ), then and we have
where is the set of indices in which and differ.
-privacy can naturally express indistinguishability requirements that cannot be represented by standard notions of distance such as the Hamming distance. However, the metric in the above definition must be carefully defined to achieve different privacy goals. andres2013geo have used the Euclidean metric with the discrete Cartesian plane as the data universe, for location-based systems. Blowfish he2014blowfish privacy (without constraints) considers a natural metric based on a minimum spanning tree with the elements of the data universe as vertices, and with equal edge weights. Here the adversary may better distinguish points that are farther apart in the tree than those that are closer. When some elements of the data universe are more sensitive than others, non-uniform edge weights can capture this priority requirement.
Our main contribution is the construction of utility-measure-dependent -private mechanisms (see Section 3). Note that chatzikokolakis2013broadening have only constructed universally optimal mechanisms under certain metrics (such as the Manhattan metric) for some special classes of queries such as count, sum, average, and percentage queries.
2.3 Linear Queries and Sensitivity
Our focus is on the inherent trade-off between privacy and accuracy when answering a large number of linear queries over histograms. Linear queries include some natural classes of queries such as range queries li2011efficient ; li2012adaptive ; barak2007privacy ; fienberg2010differential , and serve as the basis of a wide range of data analysis and learning algorithms (see blum2005practical for some examples). A linear query is specified by a function mapping a dataset (histogram) to a real value. Formally, given a query vector , a linear query over the histogram can be expressed as . A set of linear queries can be represented by a query matrix with the vector giving the correct answers to the queries.
For -privacy, we consider a generalized notion of global sensitivity (defined in dwork2006calibrating ):
For (with ), the generalized global sensitivity of a query (w.r.t. ) is defined as follows
Also define (the usual global sensitivity). When , we simply write .
Consider a multi-linear query defined as , where . Then the generalized global sensitivity of (for ) is given by . When , i.e., for a single linear query , we have . Thus, the generalized notion is defined separately for each pair of elements in .
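The following sketch illustrates linear queries over a histogram and a per-pair sensitivity. It assumes that two neighboring histograms differ by moving a single record from element `i` to element `j`, in which case the change in the answer vector is the difference between columns `i` and `j` of the query matrix. The function names and the toy matrix are hypothetical, and the exact form of the paper's definition is not reproduced here.

```python
def answer_queries(Q, x):
    """Exact answers Qx for a query matrix Q (list of rows) and histogram x."""
    return [sum(q_i * x_i for q_i, x_i in zip(row, x)) for row in Q]

def pairwise_sensitivity(Q, i, j, p=1):
    """p-norm of the change in Qx when one record moves from element i to j."""
    diffs = [abs(row[i] - row[j]) for row in Q]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def global_sensitivity(Q, p=1):
    """Worst case of the per-pair sensitivity over all pairs of elements."""
    size = len(Q[0])
    return max(pairwise_sensitivity(Q, i, j, p)
               for i in range(size) for j in range(size) if i != j)

# Hypothetical query matrix with two linear queries over a universe of size 4.
Q = [[1, 0, 1, 0],
     [0.5, 0.5, 0, 1]]
x = [2, 0, 2, 1]

answers = answer_queries(Q, x)
print(answers)
print(pairwise_sensitivity(Q, 0, 1), global_sensitivity(Q))
```

In this toy example the pair (0, 1) has a much smaller sensitivity than the worst-case pair, which is the kind of slack a per-pair privacy budget can exploit.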
2.4 Laplace and Exponential Mechanism
The Laplace mechanism is defined as follows:
Definition 4 (Laplace Mechanism, dwork2006calibrating ).
For a query function with -sensitivity , Laplace mechanism will output
where , and is a distribution with probability density function
The Laplace mechanism satisfies -differential privacy, but it satisfies -privacy only with . This results in large noise addition and, in turn, an unnecessary loss of overall utility.
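A minimal sketch of the standard Laplace mechanism follows, using only the Python standard library. The function names, the example answers, and the parameter values are hypothetical. The noise scale is the usual sensitivity divided by the privacy budget.

```python
import random

def sample_laplace(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answers, l1_sensitivity, epsilon, rng):
    """Add independent Laplace noise of scale sensitivity/epsilon to each answer."""
    scale = l1_sensitivity / epsilon
    return [a + sample_laplace(scale, rng) for a in true_answers]

# Hypothetical example: two query answers with l1-sensitivity 2 and budget 1.
rng = random.Random(0)
noisy = laplace_mechanism([4.0, 2.0], l1_sensitivity=2.0, epsilon=1.0, rng=rng)
print(noisy)
```

The single global scale is what makes this mechanism conservative under a non-uniform metric: every pair of elements is protected as if it had the strictest budget.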
Given some arbitrary range , the exponential mechanism is defined with respect to some utility function , which maps database/output pairs to utility scores. The sensitivity notion that we are interested here is given by:
For (with ) and , the generalized utility sensitivity is defined as follows
Also define .
Formally, the exponential mechanism is:
Definition 6 (The Exponential Mechanism, mcsherry2007mechanism ).
The exponential mechanism selects and outputs an element with probability proportional to .
The exponential mechanism also satisfies -differential privacy, but it satisfies -privacy only with .
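The standard exponential mechanism can be sketched as follows, assuming the common parametrization in which an outcome is selected with probability proportional to exp(epsilon * utility / (2 * sensitivity)). The function names and the toy utilities are hypothetical.

```python
import math
import random

def exponential_mechanism_probs(utilities, utility_sensitivity, epsilon):
    """Selection probabilities proportional to exp(eps * u / (2 * sensitivity))."""
    top = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp(epsilon * (u - top) / (2.0 * utility_sensitivity))
               for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, utilities, utility_sensitivity, epsilon, rng):
    """Sample one candidate according to the exponential-mechanism distribution."""
    probs = exponential_mechanism_probs(utilities, utility_sensitivity, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical example with three candidate outputs.
probs = exponential_mechanism_probs([0.0, 2.0, 1.0], utility_sensitivity=1.0, epsilon=1.0)
choice = exponential_mechanism(["r1", "r2", "r3"], [0.0, 2.0, 1.0], 1.0, 1.0,
                               random.Random(0))
print(probs, choice)
```

Subtracting the maximum utility before exponentiating leaves the distribution unchanged and avoids overflow for large scores.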
In the differential privacy literature, the performance of a mechanism is usually measured in terms of its worst-case total expected error, defined as follows:
Definition 7 (Error).
Let and . We define the -error of a mechanism as
Here the expectation is taken over the internal coin tosses of the mechanism itself.
In this paper, we are mainly interested in the worst case expected -error (defined by ) for , and -error (given by ). It is also common to analyze high probability bounds on the accuracy of the privacy mechanisms.
Definition 8 (Accuracy).
Given a mechanism , query , sensitive dataset (histogram) , and parameters and , the mechanism is -accurate for on under the -norm if
where -norm can be any vector norm definition. In our analysis, we consider the -norm and the -norm.
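For intuition about the error and accuracy notions above, the following standard facts about independent Laplace noise may help. The symbols $k$ (number of queries), $b$ (noise scale), $Y$, $t$, and $\beta$ are chosen here for illustration and need not match the notation used elsewhere in the paper.

```latex
\begin{align*}
Y_1,\dots,Y_k &\overset{\text{i.i.d.}}{\sim} \mathrm{Lap}(b),
  \qquad p(y \mid b) = \frac{1}{2b}\exp\!\left(-\frac{|y|}{b}\right),\\
\mathbb{E}\,\|Y\|_1 &= k\,b,
  \qquad \mathbb{E}\,\|Y\|_2^2 = 2\,k\,b^2,\\
\Pr\big[\,|Y_i| \ge t\,b\,\big] &= e^{-t} \quad (t \ge 0)
  \quad\Longrightarrow\quad
  \Pr\big[\,\|Y\|_\infty \le b\,\ln(k/\beta)\,\big] \ge 1-\beta .
\end{align*}
```

The last implication is a union bound over the $k$ coordinates. All three quantities grow with the scale $b$, so a mechanism that can justify a smaller noise scale improves both the expected error and the high-probability accuracy.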
3 -Private Mechanisms for Linear Queries
In this section, we design -private mechanisms by extending some of the well-known -differentially private (noise adding) mechanisms. The main challenge here is that the privacy budget is defined on the input data universe () and the noise is added to the query response which belongs to the outcome space (). The query response contains only the aggregate information (summary statistic) about the elements of the input database , and it does not capture the -metric structure in the input domain . Thus we need to somehow propagate the information from to .
Given a dataset , and a query , our approach (meta procedure) to design a -private (noise adding) mechanism can be described as follows:
Choose the (approximately optimal) model parameters and such that , , and .
Then use an existing -differentially private mechanism with in place of .
The model parameters and are chosen by (approximately) solving the following pre-processing optimization problem (i.e. ):
where is a surrogate function of the utility measure that we are interested in. Note that this pre-processing optimization depends only on the data universe (or ), the query set , and the database size , but not on the dataset . Thus, we do not compromise any privacy during the optimization procedure. Moreover, the pre-processing optimization has to be done only once, in an offline manner (for given , , and ). The number of constraints in the optimization problem (3) can be exponentially large (), but, depending on the structure of the -metric, the constraint count can be significantly reduced.
Next we apply the above described abstract meta procedure in extending some -differential privacy mechanisms under different loss measures such as squared loss and absolute loss. We first show that the resulting mechanisms are in fact -private, and then we formulate the appropriate pre-processing optimization problems (3) for them.
3.1 -Private Laplace Mechanism
For a given query over the histogram , consider the following variant of Laplace mechanism (with the model parameters , and which depend on the utility function of the task):
where . When , we choose and as the model parameters i.e., . Below we show that the above variant of Laplace mechanism satisfies the -privacy under a sensitivity bound condition.
If , , then the mechanism given by (4) satisfies the -privacy.
The sensitivity bound condition of the above proposition for a multi-linear query can be written as: . The next proposition characterizes the performance of the mechanism under different choices of utility measures:
Let be a multi-linear query of the form , and let with .
When , we have
where is defined in (2).
When , we have
Note that .
, with probability at least we have
Based on the upper bounds that we obtained in the previous proposition, we can formulate the pre-processing optimization problem to select the model parameters and of the mechanism as follows:
The objective function of the above optimization problem depends on the utility function that we are interested in. For example, when , we can choose . In summary, the -private Laplace mechanism, under -error function, can be described as follows:
Observe that, when , the choices and satisfy the constraints of the optimization problem (5) under squared loss. In fact, these choices correspond to the standard Laplace mechanism , and thus our framework is able to recover standard -differential privacy mechanisms as well.
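The effect of a per-pair sensitivity bound can be illustrated with a simplified sketch. It is not the pre-processing optimization described above: it assumes the query itself is left unchanged and only the Laplace noise scale is chosen, and it assumes that the privacy requirement bounds the output density ratio for histograms differing in entries `i` and `j` by `exp(d[i][j])`. Under these assumptions, the smallest admissible scale is the largest ratio `|q[i] - q[j]| / d[i][j]`. All names and numbers are hypothetical.

```python
import random

def smallest_feasible_scale(q, d):
    """Smallest Laplace scale b with |q[i] - q[j]| <= b * d[i][j] for all i != j.

    Assumes d[i][j] > 0 whenever i != j.
    """
    size = len(q)
    return max(abs(q[i] - q[j]) / d[i][j]
               for i in range(size) for j in range(size) if i != j)

def metric_laplace(q, x, d, rng):
    """Answer the linear query <q, x> with Laplace noise scaled to the metric d."""
    scale = smallest_feasible_scale(q, d)
    true_answer = sum(q_i * x_i for q_i, x_i in zip(q, x))
    if scale == 0.0:  # constant query: no noise is needed
        return true_answer
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_answer + noise

q = [0.0, 0.1, 1.0]
x = [3, 4, 3]

# Uniform budget 0.5 for every pair: recovers the usual sensitivity / epsilon scale.
uniform = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
# Hypothetical non-uniform metric: elements 0 and 1 are close, element 2 is far away.
metric = [[0.0, 0.5, 2.0], [0.5, 0.0, 2.0], [2.0, 2.0, 0.0]]

uniform_scale = smallest_feasible_scale(q, uniform)
metric_scale = smallest_feasible_scale(q, metric)
noisy_answer = metric_laplace(q, x, metric, random.Random(0))
print(uniform_scale, metric_scale, noisy_answer)
```

In this toy setting the non-uniform metric admits a noise scale four times smaller than the uniform budget, because the pair with the largest query difference also has the largest budget. The full framework additionally optimizes over the model parameters for a given utility measure.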
3.2 -Private Exponential Mechanism
For a given utility function over the histogram , consider the following variant of exponential mechanism (with the model parameters , and which will be chosen later based on the utility function):
The mechanism selects and outputs an element with probability proportional to .
Here we note that for ease of presentation, we do not consider using for each . The following theorem provides a sufficient condition for the above mechanism to satisfy -privacy.
If , then the mechanism satisfies -privacy.
For a given histogram and a given utility measure , let denote the maximum utility score of any element with respect to histogram . Below we generalize Theorem 3.11 from dwork2014algorithmic :
Fixing a database , let denote the set of elements in which attain utility score . Also define . Then for , with probability at least , we have
Since we always have , we get
The exponential mechanism is a natural building block for designing complex -differentially private mechanisms. Next we consider two instantiations of the above variant of exponential mechanism, namely small database mechanism blum2013learning , and multiplicative weights exponential mechanism hardt2012simple . The main trick is that we need to choose appropriate and the utility function .
3.2.1 -Private Small Database Mechanism
Here we consider the problem of answering a large number of real valued linear queries of the form (where , and ) from class via synthetic histogram/database release. For this problem blum2013learning have proposed and studied a simple -differentially private small database mechanism, which is an instantiation of exponential mechanism. They have used a utility function (with ) defined as .
Now we extend the mechanism developed in blum2013learning to obtain a -private version of it using the model parameters and (which are determined later). Algorithm 1 is a modified version of Algorithm 4 from dwork2014algorithmic , where the transformation from to is one-to-one (thus we have ). When answering a query over , we need to output where is the matching element of and is the output of the -private small database mechanism (Algorithm 1). The following proposition provides the -privacy characterization of the small database mechanism.
If , then the small database mechanism is -private.
The following proposition and theorem characterize the performance of the -private small database mechanism.
Proposition 4 (Proposition 4.4, dwork2014algorithmic ).
Let be any class of linear queries. Let be the database output by . Then with probability :
Theorem 3 (Theorem 4.5, dwork2014algorithmic ).
By the appropriate choice of , letting be the database output by , we can ensure that with probability :
From the upper bound of the above theorem, the model parameters and of the small database mechanism can be chosen through the following pre-processing optimization problem:
3.2.2 -Private Multiplicative Weights Exponential Mechanism
Initialization: Let be times the uniform distribution over .
Exponential Mechanism: Sample a query using the mechanism and the score function given by
Laplace Mechanism: Let measurement with .
Multiplicative Weights: Let be times the distribution whose entries satisfy .
As in the case of the small database mechanism, here also we consider the problem of answering a large number of real-valued linear queries in a -private manner. Algorithm 2 is a simple modification of Algorithm 1 from hardt2012simple . The following theorem provides the -privacy characterization of the MWEM mechanism.
If , then the MWEM mechanism is -private.
The following theorem characterizes the performance of the MWEM mechanism.
Theorem 5 (Theorem 2.2, hardt2012simple ).
For any dataset , set of linear queries , , and , with probability at least , MWEM produces such that
By setting , we get
The model parameters and of the MWEM mechanism can be chosen through the optimization problem 8 with .
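For reference, the following is a minimal sketch of the standard epsilon-differentially private MWEM loop of hardt2012simple, not the modified Algorithm 2 with model parameters. It assumes query coefficients in [0, 1], so that the score function has sensitivity one, and it splits the budget evenly between the selection and measurement steps of each round. All names and the toy data are hypothetical.

```python
import math
import random

def dot(q, x):
    return sum(a * b for a, b in zip(q, x))

def sample_laplace(scale, rng):
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def mwem(hist, queries, rounds, epsilon, rng):
    """Return a synthetic histogram that approximately answers the given queries."""
    n, size = sum(hist), len(hist)
    approx = [n / size] * size            # n times the uniform distribution
    running_sum = [0.0] * size
    for _ in range(rounds):
        # Exponential mechanism: prefer queries on which approx is most wrong.
        scores = [abs(dot(q, approx) - dot(q, hist)) for q in queries]
        eps_round = epsilon / (2 * rounds)
        top = max(scores)
        weights = [math.exp(eps_round * (s - top) / 2.0) for s in scores]
        q = rng.choices(queries, weights=weights, k=1)[0]
        # Laplace mechanism: noisy measurement of the selected query.
        measurement = dot(q, hist) + sample_laplace(2.0 * rounds / epsilon, rng)
        # Multiplicative weights: move approx toward the measurement.
        err = measurement - dot(q, approx)
        updated = [a * math.exp(q_x * err / (2.0 * n)) for a, q_x in zip(approx, q)]
        total = sum(updated)
        approx = [n * u / total for u in updated]
        running_sum = [r + a for r, a in zip(running_sum, approx)]
    return [r / rounds for r in running_sum]  # average of the iterates

# Hypothetical toy data: a histogram over 4 elements and 4 counting queries.
hist = [5, 0, 3, 2]
queries = [[1, 0, 0, 0], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
synthetic = mwem(hist, queries, rounds=5, epsilon=1.0, rng=random.Random(1))
print(synthetic)
```

Each iterate is renormalized to total mass n, so the averaged output is a non-negative histogram of the same size as the input database.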
Here we demonstrate the effectiveness of our framework via experiments on both synthetic and real data. We will show that in many situations, we can drastically improve the accuracy of the noisy answer compared to the traditional differentially private mechanisms.
4.1 Single Linear Queries over Synthetic Data
In order to evaluate our mechanism, we first consider randomly generated single linear queries (). To that end, we compare the following two mechanisms: (a) the -differentially private Laplace mechanism (with ): , where , and (b) the -private Laplace mechanism (with the model parameters and ): , where , under the experimental setup given below.
Data and Privacy Matrix: We generate a random dataset (histogram) with records from a data universe of size . We then randomly sample distinct two-dimensional points from a subset , and associate each point with an element () of the data universe. The sampled data universe elements are shown in Figure 0(a). We define the privacy matrix based on the Euclidean distance (metric) on -dimensional space. Specifically, for any , define .
Random Queries: We evaluate the two mechanisms over random single linear queries, where the query coefficients are randomly drawn from a uniform distribution over the real interval .
Performance Measure: We measure the performance of the mechanisms by the root mean squared error (RMSE; between the private response and the actual output) on the above generated data, i.e., we consider the squared loss function . Then the model parameters