Privacy-at-risk
The repository provides basic formulae and sample code to instantiate "Privacy at risk" for the Laplace mechanism.
The calibration of noise for a privacy-preserving mechanism depends on the sensitivity of the query and the prescribed privacy level. A data steward must make the non-trivial choice of a privacy level that balances the requirements of users and the monetary constraints of the business entity. We analyse the roles of the two sources of randomness involved in the design of a privacy-preserving mechanism: the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. This finer analysis enables us to provide stronger privacy guarantees with quantifiable risks. Thus, we propose privacy at risk, a probabilistic calibration of privacy-preserving mechanisms. We provide a composition theorem that leverages privacy at risk. We instantiate the probabilistic calibration for the Laplace mechanism by providing analytical results. We also propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR-compliant business entity. The convexity of the proposed cost model leads to a unique fine-tuning of the privacy level that minimises the compensation budget. We show its effectiveness by illustrating a realistic scenario that avoids overestimation of the compensation budget by using privacy at risk for the Laplace mechanism. We quantitatively show that composition using the cost-optimal privacy at risk provides a stronger privacy guarantee than the classical advanced composition.
Dwork et al. (2014) quantify the privacy level $\epsilon$ in $\epsilon$-differential privacy as an upper bound on the worst-case privacy loss incurred by a privacy-preserving mechanism. Generally, a privacy-preserving mechanism perturbs the results by adding a calibrated amount of random noise to them. The calibration of noise depends on the sensitivity of the query and the specified privacy level. In a real-world setting, a data steward must specify a privacy level that balances the requirements of the users and the monetary constraints of the business entity. Garfinkel et al. (2018) report the issues that the US Census Bureau encountered in deploying differential privacy as its privacy definition. They highlight the lack of analytical methods to choose the privacy level. They also report empirical studies that show the loss in utility due to the application of privacy-preserving mechanisms.
We address the dilemma of a data steward in two ways. Firstly, we propose a probabilistic quantification of privacy levels. This quantification gives a data steward a way to take quantified risks under the desired utility of the data. We refer to the probabilistic quantification as privacy at risk. We also derive a composition theorem that leverages privacy at risk. Secondly, we propose a cost model that links the privacy level to a monetary budget. This cost model helps the data steward to choose the privacy level subject to the estimated budget and vice versa. Convexity of the proposed cost model ensures the existence of a unique privacy at risk that minimises the budget. We show that the composition with an optimal privacy at risk provides stronger privacy guarantees than the traditional advanced composition Dwork et al. (2014). In the end, we illustrate a realistic scenario that exemplifies how the data steward can avoid overestimation of the budget by using the proposed cost model together with privacy at risk.
The probabilistic quantification of privacy levels depends on two sources of randomness: the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. Often, these two sources are coupled with each other. We require analytical forms of both sources of randomness as well as an analytical representation of the query to derive a privacy guarantee. Computing the probabilistic quantification is generally a challenging task. Although we find multiple probabilistic privacy definitions in the literature Machanavajjhala et al. (2008); Hall et al. (2012), an analytical quantification that bridges the randomness and the privacy level of a privacy-preserving mechanism is missing. To the best of our knowledge, we are the first to analytically derive such a probabilistic quantification, namely privacy at risk, for the widely used Laplace mechanism Dwork et al. (2006b). We also derive a composition theorem with privacy at risk. It is a special case of the advanced composition theorem Dwork et al. (2014), which deals with a sequential and adaptive use of privacy-preserving mechanisms. We work on a simpler model of independent evaluations, as used in the basic composition theorem Dwork et al. (2014).
The privacy level proposed by the differential privacy framework is too abstract a quantity to be integrated in a business setting. We propose a cost model that maps the privacy level to a monetary budget. The corresponding cost model for the probabilistic quantification of privacy levels is a convex function of the privacy level. Hence, it leads to a unique probabilistic privacy level that minimises the cost. We illustrate a realistic scenario in a GDPR compliant business entity that needs an estimation of the compensation budget that it needs to pay to stakeholders in the unfortunate event of a personal data breach. The illustration shows that the use of probabilistic privacy levels avoids overestimation of the compensation budget without sacrificing utility.
In this work, we comparatively evaluate the privacy guarantees obtained using privacy at risk for the Laplace mechanism. We quantitatively compare the composition under the optimal privacy at risk, which is estimated using the cost model, with the traditional composition theorems: basic composition and advanced composition Dwork et al. (2014). We observe that it gives stronger privacy guarantees than those of advanced composition without sacrificing the utility of the mechanism.
In conclusion, the benefits of the probabilistic quantification, i.e. privacy at risk, are twofold. It not only quantifies the privacy level for a given privacy-preserving mechanism but also facilitates decision-making in problems that focus on the privacy-utility trade-off and the minimisation of the compensation budget.
We consider a universe of datasets $\mathcal{D}$. We explicitly mention when we consider that the datasets are sampled from a data-generation distribution $\mathcal{G}$ with support $\mathcal{D}$. Two datasets of equal cardinality, $x$ and $y$, are said to be neighbouring datasets if they differ in one data point. A pair of neighbouring datasets is denoted by $x \sim y$. In this work, we focus on a specific class of queries called numeric queries. A numeric query $f$ is a function that maps a dataset into a real-valued vector, i.e. $f: \mathcal{D} \rightarrow \mathbb{R}^k$. For instance, a sum query returns the sum of the values in a dataset.

In order to achieve a privacy guarantee, a privacy-preserving mechanism, or mechanism in short, is a randomised algorithm that adds noise from a given family of distributions to the output of the query. Thus, a privacy-preserving mechanism of a given family, $\mathcal{M}(f, \Theta)$, for the query $f$ and the set of parameters $\Theta$ of the given noise distribution, is a function that maps a dataset into a real vector, i.e. $\mathcal{M}(f, \Theta): \mathcal{D} \rightarrow \mathbb{R}^k$. We denote a privacy-preserving mechanism as $\mathcal{M}$ when the query and the parameters are clear from the context.
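[Differential Privacy Dwork et al. (2014).] A privacy-preserving mechanism $\mathcal{M}$, equipped with a query $f$ and with parameters $\Theta$, is $(\epsilon, \delta)$-differentially private if for all $Z \subseteq \mathrm{Range}(\mathcal{M})$ and $x, y \in \mathcal{D}$ such that $x \sim y$:
$$\mathbb{P}\big(\mathcal{M}(f, \Theta)(x) \in Z\big) \leq e^{\epsilon}\, \mathbb{P}\big(\mathcal{M}(f, \Theta)(y) \in Z\big) + \delta.$$
An $(\epsilon, 0)$-differentially private mechanism is commonly called $\epsilon$-differentially private.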
A privacy-preserving mechanism provides perfect privacy if it yields indistinguishable outputs for all neighbouring input datasets. The privacy level $\epsilon$ quantifies the privacy guarantee provided by $\epsilon$-differential privacy. For a given query, the smaller the value of $\epsilon$, the stronger the privacy guarantee. A randomised algorithm that is $\epsilon$-differentially private is also $\epsilon'$-differentially private for any $\epsilon' > \epsilon$.
In order to satisfy $\epsilon$-differential privacy, the parameters of a privacy-preserving mechanism require a calculated calibration. The amount of noise required to achieve a specified privacy level depends on the query. If the output of the query does not change drastically for two neighbouring datasets, then a small amount of noise suffices to achieve a given privacy level. The measure of such fluctuations is called the sensitivity of the query. The parameters of a privacy-preserving mechanism are calibrated using the sensitivity of the query, which quantifies the smoothness of a numeric query.
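[Sensitivity.] The sensitivity of a query $f: \mathcal{D} \rightarrow \mathbb{R}^k$ is defined as
$$\Delta_f \triangleq \max_{x, y \in \mathcal{D},\; x \sim y} \lVert f(x) - f(y) \rVert_1.$$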
The Laplace mechanism is a privacy-preserving mechanism that adds scaled noise sampled from a calibrated Laplace distribution to the numeric query. [Papoulis and Pillai (2002)] The Laplace distribution with mean zero and scale $b > 0$ is a probability distribution with probability density function
$$\mathrm{Lap}(b) \triangleq \frac{1}{2b} \exp\left(-\frac{|x|}{b}\right),$$
where $x \in \mathbb{R}$. We write $\mathrm{Lap}(b)$ to denote a random variable $X \sim \mathrm{Lap}(b)$.
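[Laplace Mechanism Dwork et al. (2006b).] Given any function $f: \mathcal{D} \rightarrow \mathbb{R}^k$ and any $x \in \mathcal{D}$, the Laplace Mechanism is defined as
$$\mathcal{L}^{\Delta_f}_{\epsilon}(x) \triangleq f(x) + (L_1, \ldots, L_k),$$
where $L_i$ is drawn from $\mathrm{Lap}(\Delta_f / \epsilon)$ and added to the $i$-th component of $f(x)$.
[Dwork et al. (2006b)] The Laplace mechanism, $\mathcal{L}^{\Delta_f}_{\epsilon_0}$, is $\epsilon_0$-differentially private.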
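The definition of the Laplace mechanism above can be illustrated with a minimal sketch. It assumes the standard calibration, in which each component of the query output receives independent Laplace noise of scale sensitivity / privacy level. The function names `sample_laplace` and `laplace_mechanism` are hypothetical and are not taken from the accompanying repository. The sampler relies on the fact that the difference of two independent exponential variables with the same mean is Laplace-distributed.

```python
import random


def sample_laplace(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    Uses the fact that the difference of two i.i.d. exponential
    variables with mean `scale` follows Lap(scale).
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(query_output, sensitivity, epsilon, rng):
    """Perturb a numeric query output (a list of floats).

    Assumption: noise scale is sensitivity / epsilon, the standard
    calibration for an epsilon-differentially private Laplace mechanism.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon
    return [value + sample_laplace(scale, rng) for value in query_output]


if __name__ == "__main__":
    rng = random.Random(42)
    # A count query has sensitivity one; 35 is an illustrative true count.
    print(laplace_mechanism([35.0], sensitivity=1.0, epsilon=0.5, rng=rng))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a production deployment would need a cryptographically sound noise source, which this sketch does not attempt to provide.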
The parameters of a privacy-preserving mechanism are calibrated using the privacy level and the sensitivity of the query. A data steward needs to choose an appropriate privacy level for practical implementation. Lee and Clifton (2011) show that the choice of an actual privacy level by a data steward in regard to her business requirements is a non-trivial task. Recall that the privacy level in the definition of differential privacy corresponds to the worst-case privacy loss. Business users, however, are used to taking and managing risks, provided the risks can be quantified. For instance, Jorion (2000) defines Value at Risk, which risk analysts use to quantify the loss in investments for a given portfolio and an acceptable confidence bound. Motivated by the formulation of Value at Risk, we propose the use of a probabilistic privacy level. It provides us a finer tuning of an -differentially private privacy-preserving mechanism for a specified risk .
[Privacy at Risk.] For a given data generating distribution , a privacy-preserving mechanism , equipped with a query and with parameters , satisfies -differential privacy with a privacy at risk , if for all and sampled from such that :
(1) |
where the outer probability is calculated with respect to the probability space obtained by applying the privacy-preserving mechanism on the data-generation distribution .
If a privacy-preserving mechanism is -differentially private for a given query and parameters , for any privacy level , privacy at risk is . Our interest is to quantify the risk with which -differentially private privacy-preserving mechanism also satisfies a stronger -differential privacy, i.e. .
Unifying Probabilistic and Random Differential Privacy. Interestingly, Equation 1 unifies the notions of probabilistic differential privacy and random differential privacy by accounting for both sources of randomness in a privacy-preserving mechanism. Machanavajjhala et al. (2008) define probabilistic differential privacy, which incorporates the explicit randomness of the noise distribution of the privacy-preserving mechanism, whereas Hall et al. (2012) define random differential privacy, which incorporates the implicit randomness of the data-generation distribution. In probabilistic differential privacy, the outer probability is computed over the sample space of and all datasets are equally probable.
Application of $\epsilon$-differential privacy to many real-world problems suffers from the degradation of the privacy guarantee, i.e. the privacy level, under composition. The basic composition theorem Dwork et al. (2014) dictates that the privacy guarantee degrades linearly in the number of evaluations of the mechanism. The advanced composition theorem Dwork et al. (2014) provides a finer analysis of the privacy loss over multiple evaluations and yields a square-root dependence on the number of evaluations. In this section, we provide the composition theorem for privacy at risk.
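To make the linear versus square-root dependence concrete, the following sketch compares the two classical bounds for pure differentially private mechanisms. It uses the form of advanced composition stated in Dwork et al. (2014, Theorem 3.20), where the composed mechanism is differentially private with an additional failure probability `delta_prime`. The function names are illustrative, and the sketch does not implement the privacy at risk composition theorem of this section.

```python
import math


def basic_composition(epsilon, k):
    """Privacy level after k evaluations under basic composition."""
    return k * epsilon


def advanced_composition(epsilon, k, delta_prime):
    """Privacy level after k evaluations under advanced composition.

    epsilon' = epsilon * sqrt(2 k ln(1 / delta')) + k * epsilon * (e^epsilon - 1)
    """
    if not 0.0 < delta_prime < 1.0:
        raise ValueError("delta_prime must lie strictly between 0 and 1")
    root_term = epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_prime))
    drift_term = k * epsilon * (math.exp(epsilon) - 1.0)
    return root_term + drift_term


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, basic_composition(0.1, k), advanced_composition(0.1, k, 1e-5))
```

For a small number of evaluations the basic bound is the tighter of the two; the advanced bound only pays off once the number of evaluations is large relative to the logarithmic factor.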
[Privacy loss random variable.] For a privacy-preserving mechanism and two neighbouring datasets , the privacy loss random variable takes a value
If a privacy-preserving mechanism satisfies differential privacy, then
For all , the class of -differentially private mechanisms, which satisfy -privacy at risk, are -differential privacy under -fold composition where
where, Let, denote the -fold composition of privacy-preserving mechanisms . Each -differentially private also satisfies -privacy at risk for some and appropriately computed . Consider any two neighbouring datasets . Let,
Using the technique in (Dwork et al., 2014, Theorem 3.20), it suffices to show that .
Consider,
(2) |
where in the last line denotes privacy loss random variable related .
Consider, an -differentially private mechanism and -differentially private mechanism . Let satisfy -privacy at risk for and appropriately computed . Each can be simulated as the mechanism with probability and the mechanism otherwise. Therefore, privacy loss random variable for each mechanism can be written as
where, denotes the privacy loss random variable associated with the mechanism and denotes the privacy loss random variable associated with the mechanism . Using (Dwork et al., 2014, Lemma ), we can bound the mean of every privacy loss random variable as,
We have a collection of independent privacy random variables s such that . Using Hoeffding’s bound Hoeffding (1994) on the sample mean for any ,
Rearranging the inequality by renaming the upper bound on the probability as , we get,
Theorem 3.1 is an analogue, in the privacy at risk setting, of the advanced composition of differential privacy (Dwork et al., 2014, Theorem 3.20) under a constraint of independent evaluations. Note that, if one takes , then we obtain the exact same formula as in (Dwork et al., 2014, Theorem 3.20). It provides a sanity check for the consistency of composition using privacy at risk.
In fact, if we consider both sources of randomness, the expected value of loss function must be computed by using the law of total expectation.
Therefore, the exact computation of privacy guarantees after the composition requires access to the data-generation distribution. We assume a uniform data-generation distribution while proving Theorem 3.1. Finer privacy guarantees that account for the data-generation distribution can be obtained; we leave this as future work.
In this section, we instantiate privacy at risk for the Laplace mechanism in three cases: two cases that each involve one of the two sources of randomness, and a third case that involves their coupled effect. The three cases correspond to three different interpretations of the confidence level, represented by the parameter , corresponding to three interpretations of the support of the outer probability in Definition 1. In order to highlight this nuance, we denote the confidence levels corresponding to the three cases and their sources of randomness as , and , respectively.
In this section, we study the effect of the explicit randomness induced by the noise sampled from the Laplace distribution. We provide a probabilistic quantification for fine-tuning the Laplace mechanism. We fine-tune the privacy level for a specified risk under by assuming that the sensitivity of the query is known a priori.
For a Laplace mechanism calibrated with sensitivity and privacy level , we present the analytical formula relating privacy level and the risk in Theorem 4.1. The proof is available in Appendix A.
The risk with which a Laplace Mechanism , for a numeric query satisfies a privacy level is given by
(3) |
where is a random variable that follows a distribution with the following density function.
where is the Bessel function of second kind.
Figure 1 shows the plot of the privacy level against risk for different values of and for a Laplace mechanism . As the value of increases, the amount of noise added in the output of numeric query increases. Therefore, for a specified privacy level, the privacy at risk level increases with the value of .
The analytical formula representing as a function of is bijective. We need to invert it to obtain the privacy level for a privacy at risk . However the analytical closed form for such an inverse function is not explicit. We use a numerical approach to compute privacy level for a given privacy at risk from the analytical formula of Theorem 4.1.
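One way to carry out such a numerical inversion is bisection, which only needs the risk to be a monotonically increasing function of the privacy level. The sketch below is a generic illustration of that approach. The function `standin_risk` is an assumed monotone stand-in chosen so that the result can be checked by hand; it is not a transcription of the formula of Theorem 4.1, and any increasing risk function could be passed in its place.

```python
import math


def invert_monotone(risk_fn, target, lo, hi, tol=1e-10, max_iter=200):
    """Find eps in [lo, hi] with risk_fn(eps) close to target.

    Assumes risk_fn is continuous and increasing on [lo, hi].
    """
    if not risk_fn(lo) <= target <= risk_fn(hi):
        raise ValueError("target risk is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if risk_fn(mid) < target:
            lo = mid  # the root lies in the upper half
        else:
            hi = mid  # the root lies in the lower half
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def standin_risk(eps, eps0=1.0):
    """ASSUMED stand-in: increases from 0 at eps = 0 to 1 at eps = eps0."""
    return (1.0 - math.exp(-eps)) / (1.0 - math.exp(-eps0))


if __name__ == "__main__":
    print(invert_monotone(standin_risk, target=0.5, lo=0.0, hi=1.0))
```

Bisection is slower than Newton-type methods but needs no derivative and cannot leave the bracketing interval, which makes it a safe default when the risk is only available through numerical integration.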
Result for a Real-valued Query. For the case , the analytical derivation is fairly straightforward. In this case, we obtain an invertible closed-form of a privacy level for a specified risk. It is presented in Equation 4.
(4) |
Remarks on . For , Figure 2 shows the plot of privacy at risk level versus privacy at risk for the Laplace mechanism . As the value of increases, the probability of Laplace mechanism generating higher value of noise reduces. Therefore, for a fixed privacy level, privacy at risk increases with the value of . The same observation is made for .
In this section, we study the effect of the implicit randomness induced by the data-generation distribution to provide a fine-tuning for the Laplace mechanism. We fine-tune the risk for a specified privacy level without assuming that the sensitivity of the query is known.
If one takes into account the randomness induced by the data-generation distribution, not all pairs of neighbouring datasets are equally probable. This leads to the estimation of the sensitivity of a query for a specified data-generation distribution. If we have access to an analytical form of the data-generation distribution and to the query, we can analytically derive the sensitivity distribution for the query. In general, we have access to the datasets, but not to the data-generation distribution that generates them. We therefore estimate the sensitivity statistically by constructing an empirical distribution. We call the sensitivity value obtained for a specified risk from the empirical cumulative distribution of sensitivity the sampled sensitivity (Definition 4.2). However, the value of the sampled sensitivity is simply an estimate of the sensitivity for a specified risk. In order to capture the additional uncertainty introduced by estimating from the empirical sensitivity distribution rather than the true unknown distribution, we compute a lower bound on the accuracy of this estimation. This lower bound yields a probabilistic lower bound on the specified risk. We refer to it as the empirical risk. For a specified absolute risk , we denote by the corresponding empirical risk.
For the Laplace mechanism calibrated with sampled sensitivity and privacy level , we evaluate the empirical risk . We present the result in Theorem 5. The proof is available in Appendix B.
Analytical bound on the empirical risk, , for Laplace mechanism with privacy level and sampled sensitivity for a query is
(5) |
where is the number of samples used for estimation of the sampled sensitivity and is the accuracy parameter. denotes the specified absolute risk.
The error parameter controls the closeness between the empirical cumulative distribution of the sensitivity and the true cumulative distribution of the sensitivity. The lower the value of the error, the closer the empirical cumulative distribution is to the true cumulative distribution. Figure 4 shows the plot of the number of samples as a function of the privacy at risk and the error parameter. Naturally, we require a higher number of samples in order to have a lower error rate. The number of samples reduces as the privacy at risk increases. A lower risk demands precision in the estimated sampled sensitivity, which in turn requires a larger number of samples.
Let denote the data-generation distribution, either known a priori or constructed by subsampling the available data. We adopt the procedure of Rubinstein and Aldà (2017) to sample two neighbouring datasets with data points each. We sample data points from that are common to both of these datasets and later two more data points. From those two points, we allot one data point to each of the two datasets.
Let denote the sensitivity random variable for a given query , where and are two neighbouring datasets sampled from . Using pairs of neighbouring datasets sampled from , we construct the empirical cumulative distribution, , for the sensitivity random variable.
For a given query and for a specified risk , sampled sensitivity, , is defined as the value of sensitivity random variable that is estimated using its empirical cumulative distribution function, , constructed using pairs of neighbouring datasets sampled from the data-generation distribution .

If we knew the analytical form of the data-generation distribution, we could analytically derive the cumulative distribution function of the sensitivity, , and find the sensitivity of the query as . Therefore, in order to have the sampled sensitivity close to the sensitivity of the query, we require the empirical cumulative distribution to be close to the true cumulative distribution of the sensitivity. We use this insight to derive the analytical bound in Theorem 5.
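The sampling procedure and the sampled sensitivity described above can be sketched as follows. The sketch makes several assumptions that are not fixed by the text: the query is real-valued, the data-generation distribution is exposed as a hypothetical callable `draw_point`, and the specified risk is read as the quantile level at which the empirical cumulative distribution is inverted. All names are illustrative.

```python
import math
import random


def sample_neighbours(draw_point, n, rng):
    """Sample two neighbouring datasets with n points each.

    n - 1 points are common to both datasets; each dataset then
    receives one of two further independently drawn points.
    """
    common = [draw_point(rng) for _ in range(n - 1)]
    first, second = draw_point(rng), draw_point(rng)
    return common + [first], common + [second]


def sampled_sensitivity(query, draw_point, n, num_pairs, risk, rng):
    """Empirical quantile of |query(x) - query(y)| over neighbouring pairs.

    Assumption: `risk` in [0, 1] is the quantile level of the
    empirical cumulative distribution of the sensitivity.
    """
    diffs = []
    for _ in range(num_pairs):
        x, y = sample_neighbours(draw_point, n, rng)
        diffs.append(abs(query(x) - query(y)))
    diffs.sort()
    # Smallest observed value whose empirical CDF reaches `risk`.
    index = max(0, math.ceil(risk * num_pairs) - 1)
    return diffs[index]


if __name__ == "__main__":
    rng = random.Random(7)
    uniform_point = lambda r: r.random()  # illustrative data-generation distribution
    print(sampled_sensitivity(sum, uniform_point, n=50, num_pairs=2000, risk=0.9, rng=rng))
```

For a vector-valued query, the absolute difference would be replaced by the L1 norm of the difference of the two query outputs.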
In this section, we study the combined effect of the explicit randomness induced by the noise distribution and the implicit randomness induced by the data-generation distribution. We do not assume knowledge of the sensitivity of the query.
We estimate sensitivity using the empirical cumulative distribution of sensitivity. We construct the empirical distribution over the sensitivities using the sampling technique presented in the earlier case. Since we use the sampled sensitivity (Definition 4.2) to calibrate the Laplace mechanism, we estimate the empirical risk .
For the Laplace mechanism calibrated with sampled sensitivity and privacy level , we present the analytical bound on the empirical risk in Theorem 6, with the proof in Appendix C.
Analytical bound on the empirical risk to achieve a privacy level for Laplace mechanism with sampled sensitivity of a query is
(6) |
where is the number of samples used for estimating the sensitivity, is the accuracy parameter. denotes the specified absolute risk.
The error parameter controls the closeness between the empirical cumulative distribution of the sensitivity and the true cumulative distribution of the sensitivity. Figure 7 shows the dependence of the error parameter on the number of samples. In Figure 5, we observe that, for a fixed number of samples and a fixed privacy level, the privacy at risk decreases with the value of the error parameter. For a fixed number of samples, smaller values of the error parameter reduce the probability of similarity between the empirical cumulative distribution of sensitivity and the true cumulative distribution. Therefore, we observe a reduction in the risk for a fixed privacy level. In Figure 6, we observe that, for a fixed value of the error parameter and a fixed privacy level, the risk increases with the number of samples. For a fixed value of the error parameter, larger sample sizes increase the probability of similarity between the empirical cumulative distribution of sensitivity and the true cumulative distribution. Therefore, we observe an increase in the risk for a fixed privacy level.
The effect of considering both implicit and explicit randomness is evident in the analytical expression for in Equation 7. The proof is available in Appendix C. The privacy at risk is composed of two factors. The second term is a privacy at risk that accounts for the inherent randomness of the data-generation distribution. The first term takes into account the explicit randomness of the Laplace distribution along with a coupling coefficient . We define as the ratio of the true sensitivity of the query to its sampled sensitivity.
(7) |
Many service providers collect users’ data to enhance user experience. In order to avoid misuse of this data, we require a legal framework that not only limits the use of the collected data but also proposes reparative measures in case of a data leak. The General Data Protection Regulation (GDPR, https://eugdpr.org/) is such a legal framework.
Article 82 of the GDPR states that any person who suffers material or non-material damage as a result of a personal data breach has the right to demand compensation from the data processor. Therefore, every GDPR-compliant business entity that either holds or processes personal data needs to secure a certain budget for the worst-case scenario of a personal data breach. In order to reduce the risk of such an unfortunate event, the business entity may use privacy-preserving mechanisms that provide provable privacy guarantees while publishing their results. In order to calculate the compensation budget for a business entity, we devise a cost model that maps the privacy guarantees provided by differential privacy and privacy at risk to monetary costs. The discussions demonstrate the usefulness of probabilistic quantification of differential privacy in a business setting.
Let be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed without any provable privacy guarantees. Let be the compensation budget that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with privacy guarantees in terms of -differential privacy.
Privacy level, , in -differential privacy is the quantifier of indistinguishability of the outputs of a privacy-preserving mechanism when two neighbouring datasets are provided as inputs. When the privacy level is zero, the privacy-preserving mechanism outputs all results with equal probability. The indistinguishability reduces with increase in the privacy level. Thus, privacy level of zero bears the lowest risk of personal data breach and the risk increases with the privacy level. needs to be commensurate to such a risk and, therefore, it needs to satisfy the following constraints.
For all , .
is a monotonically increasing function of .
As , where is the unavoidable cost that business entity might need to pay in case of personal data breach even after the privacy measures are employed.
As , .
There are various functions that satisfy these constraints. In absence of any further constraints, we model as defined in Equation 8.
(8) |
has two parameters, namely and . controls the rate of change in the cost as the privacy level changes and is a privacy level independent bias. For this study, we use a simplified model with and .
Let be the compensation that a business entity has to pay to every stakeholder in case of a personal data breach when the data is processed with an -differentially private privacy-preserving mechanism along with a probabilistic quantification of privacy level. Use of such a quantification allows us to provide a stronger privacy guarantee viz. for a specified privacy at risk at most for . Thus, we calculate using Equation 9.
(9) |
We want to find the privacy level, say , that yields the lowest compensation budget. We do that by minimising Equation 9 with respect to .
is a convex function of .
By Lemma 5.2.1, there exists a unique that minimises the compensation budget for a specified parametrisation, say . Since the risk in Equation 9 is itself a function of privacy level , analytical calculation of is not possible in the most general case. When the output of the query is a real number, we derive the analytic form (Equation 4) to compute the risk under the consideration of explicit randomness. In such a case, is calculated by differentiating Equation 9 with respect to and equating it to zero. It gives us Equation 10 that we solve using any root finding technique such as Newton-Raphson method Press (2007) to compute .
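The minimisation described above can be sketched numerically. Since the expressions of Equations 4, 8 and 9 are not legible in this text, the sketch relies on three explicitly assumed forms: a cost model `budget * exp(-c / eps)` with zero bias, which satisfies the four constraints listed earlier; a risk for a real-valued query of the form `(1 - exp(-eps)) / (1 - exp(-eps0))`; and a privacy at risk cost that is the risk-weighted mix of the costs at the stronger and the original privacy levels. It also swaps Newton-Raphson root finding for a derivative-free golden-section search. All names and functional forms are assumptions for illustration.

```python
import math


def risk_real_valued(eps, eps0):
    """ASSUMED risk of satisfying level eps < eps0 for a real-valued query."""
    return (1.0 - math.exp(-eps)) / (1.0 - math.exp(-eps0))


def cost_dp(eps, budget, c=1.0, cost_min=0.0):
    """ASSUMED cost model: increasing in eps, tends to cost_min as eps -> 0."""
    return cost_min + budget * math.exp(-c / eps)


def cost_par(eps, eps0, budget):
    """ASSUMED privacy at risk cost: risk-weighted mix of the two costs."""
    gamma = risk_real_valued(eps, eps0)
    return gamma * cost_dp(eps, budget) + (1.0 - gamma) * cost_dp(eps0, budget)


def golden_section_minimise(fn, lo, hi, tol=1e-9):
    """Minimise a unimodal function on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if fn(c) < fn(d):
            b = d  # the minimum lies in [a, d]
        else:
            a = c  # the minimum lies in [c, b]
    return 0.5 * (a + b)


def optimal_privacy_level(eps0, budget):
    """Privacy level in (0, eps0) that minimises the assumed cost."""
    return golden_section_minimise(lambda e: cost_par(e, eps0, budget), 1e-6, eps0)


if __name__ == "__main__":
    eps_star = optimal_privacy_level(eps0=0.5, budget=1.0)
    print(eps_star, cost_par(eps_star, 0.5, 1.0), cost_dp(0.5, 1.0))
```

Golden-section search only requires the cost to be unimodal on the search interval, so it remains applicable when the derivative in Equation 10 is inconvenient to evaluate.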
(10) |
For a fixed budget, say , re-arrangement of Equation 9 gives us an upper bound on the privacy level . We use the cost model with and to derive the upper bound. If we have a maximum permissible expected mean absolute error , we use Equation 12 to obtain a lower bound on the privacy at risk level. Equation 11 illustrates the upper and lower bounds that dictate the permissible range of that a data publisher can promise depending on the budget and the permissible error constraints.
(11) |
Thus, the privacy level is constrained by the effectiveness requirement from below and by the monetary budget from above. Hsu et al. (2014) calculate upper and lower bounds on the privacy level in differential privacy. Their cost model differs from our GDPR-inspired modelling because it targets a research study that compensates its participants for their data and releases the results in a differentially private manner.
Suppose that the health centre in a university that complies with GDPR publishes statistics of its staff health checkup, such as obesity statistics, twice a year. In January 2018, the health centre publishes that 34 out of 99 faculty members suffer from obesity. In July 2018, the health centre publishes that 35 out of 100 faculty members suffer from obesity. An intruder, perhaps an analyst working for an insurance company, checks the staff listings in January 2018 and July 2018, which are publicly available on the website of the university. The intruder does not find any change other than the recruitment of John Doe in April 2018. Thus, with high probability, the intruder deduces that John Doe suffers from obesity. In order to avoid such a privacy breach, the health centre decides to publish the results using the Laplace mechanism. In this case, the Laplace mechanism operates on the count query.
In order to control the amount of noise, the health centre needs to appropriately set the privacy level. Suppose that the health centre decides to use the expected mean absolute error, defined in Equation 12, as the measure of effectiveness for the Laplace mechanism.
(12) |
Equation 12 makes use of the fact that the sensitivity of the count query is one. Suppose that the health centre requires the expected mean absolute error of at most two in order to maintain the quality of the published statistics. In this case, the privacy level has to be at least .
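The relation between the error requirement and the privacy level can be checked with a short sketch. It relies on the standard fact that the mean absolute value of zero-mean Laplace noise equals its scale, so that a mechanism calibrated with scale sensitivity / privacy level has an expected mean absolute error of exactly that ratio. The function names are hypothetical, and the Monte Carlo estimate is included only as a sanity check of the closed form.

```python
import random


def expected_mae(sensitivity, epsilon):
    """Expected mean absolute error of Laplace noise with scale sensitivity / epsilon."""
    return sensitivity / epsilon


def min_epsilon_for_mae(sensitivity, max_mae):
    """Smallest privacy level whose expected error does not exceed max_mae."""
    return sensitivity / max_mae


def empirical_mae(sensitivity, epsilon, num_samples, rng):
    """Monte Carlo estimate of the mean absolute Laplace noise."""
    rate = epsilon / sensitivity  # inverse of the noise scale
    total = 0.0
    for _ in range(num_samples):
        # Difference of two i.i.d. exponentials is Laplace-distributed.
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        total += abs(noise)
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(2018)
    eps_min = min_epsilon_for_mae(sensitivity=1.0, max_mae=2.0)
    print(eps_min, expected_mae(1.0, eps_min), empirical_mae(1.0, eps_min, 100000, rng))
```

For the count query of the health centre, with sensitivity one and a tolerated error of two, the closed form gives a lower bound of one half on the privacy level.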
In order to compute the budget, the health centre requires an estimate of . Moriarty et al. (2012) show that the incremental cost of premiums for health insurance with morbid obesity ranges between to . With reference to this research, the health centre takes as an estimate of . For the staff size of and the privacy level , the health centre uses Equation 8 in its simplified setting to compute the total budget of .
Is it possible to reduce this budget without degrading the effectiveness of the Laplace mechanism? We show that it is possible by fine-tuning the Laplace mechanism. Under the consideration of the explicit randomness introduced by the Laplace noise distribution, we show that an -differentially private Laplace mechanism also satisfies -differential privacy with risk , which is computed using the formula in Theorem 4.1. Fine-tuning allows us to get a stronger privacy guarantee, , that requires a smaller budget. In Figure 8, we plot the budget for various privacy levels. We observe that the privacy level , which is the same as the one computed by solving Equation 10, yields the lowest compensation budget of . Thus, by using privacy at risk, the health centre is able to save without sacrificing the quality of the published results.
Convexity of the proposed cost function enables us to estimate the optimal value of the privacy at risk level. We use the optimal privacy value to provide tighter bounds on the composition of Laplace mechanisms. In Figure 12, we compare the privacy guarantees obtained by using the basic composition theorem Dwork et al. (2014), the advanced composition theorem Dwork et al. (2014) and the composition theorem for privacy at risk. We comparatively evaluate them for composition of Laplace mechanisms with privacy levels and . We compute the privacy level after composition by setting to .
We observe that the use of optimal privacy at risk provides significantly stronger privacy guarantees than the conventional composition theorems. The advanced composition theorem is known to provide stronger privacy guarantees for mechanisms with smaller values of ε. As we observe in Figure 11 and Figure 10, the composition using privacy at risk provides strictly stronger privacy guarantees than basic composition in the cases where the advanced composition fails.
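The behaviour of the two conventional theorems can be illustrated with a short sketch. It uses the standard statements from Dwork et al. (2014): basic composition of k mechanisms with privacy level ε yields kε, and advanced composition yields ε·sqrt(2k·ln(1/δ')) + kε(e^ε − 1) with an additional failure probability δ'. The composition theorem for privacy at risk is not reproduced here, and the parameter values below are illustrative assumptions rather than the values used in the figures.

```python
import math

def basic_composition(epsilon, k):
    """Privacy level after composing k epsilon-DP mechanisms (basic theorem)."""
    return k * epsilon

def advanced_composition(epsilon, k, delta_prime):
    """Privacy level after composing k epsilon-DP mechanisms (advanced theorem).

    The composed mechanism is (epsilon', delta_prime)-differentially private.
    """
    return (epsilon * math.sqrt(2.0 * k * math.log(1.0 / delta_prime))
            + k * epsilon * (math.exp(epsilon) - 1.0))

# Illustrative settings: a small privacy level with many compositions,
# and a large privacy level with few compositions.
small = (basic_composition(0.1, 300), advanced_composition(0.1, 300, 1e-5))
large = (basic_composition(1.0, 10), advanced_composition(1.0, 10, 1e-5))
```

In the first setting the advanced bound is smaller than the basic one, while in the second it is larger, which is the regime where advanced composition fails to improve on basic composition.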
Calibration of mechanisms.
Researchers have proposed different privacy-preserving mechanisms to make different queries differentially private. These mechanisms can be broadly classified into two categories. In one category, the mechanisms explicitly add calibrated noise, such as Laplace noise in the work of Dwork et al. (2006c) or Gaussian noise in the work of Dwork et al. (2014), to the outputs of the query. In the other category, Chaudhuri et al. (2011); Zhang et al. (2012); Acs et al. (2012); Hall et al. (2013) propose mechanisms that alter the query function so that the modified function satisfies differential privacy. Privacy-preserving mechanisms in both of these categories perturb the original output of the query and make it difficult for a malicious data analyst to recover the original output of the query. These mechanisms induce randomness using the explicit noise distribution. Calibration of these mechanisms requires knowledge of the sensitivity of the query. Nissim et al. (2007) consider the implicit randomness in the data-generation distribution to compute an estimate of the sensitivity. The authors propose the smooth sensitivity function, which is an envelope over the local sensitivities for all individual datasets. The local sensitivity of a dataset is the maximum change in the value of the query over all of its neighbouring datasets. In general, it is not easy to analytically estimate the smooth sensitivity function for a general query. Rubinstein and Aldà (2017) also study the inherent randomness in the data-generation algorithm. They do not use the local sensitivity. We adopt their approach of sampling the sensitivity from the empirical distribution of the sensitivity. They use order statistics to choose a particular value of the sensitivity. We instead use the risk on the sensitivity distribution, which provides a mediation tool for business entities to assess the actual business risks, to estimate the sensitivity.
Refinements of differential privacy. In order to account for both sources of randomness, refinements of ε-differential privacy have been proposed to bound the probability of occurrence of worst-case scenarios. Machanavajjhala et al. (2008) propose probabilistic differential privacy, which considers upper bounds of the worst-case privacy loss for corresponding confidence levels on the noise distribution. The definition of probabilistic differential privacy incorporates the explicit randomness induced by the noise distribution and bounds the probability over the space of noisy outputs to satisfy the ε-differential privacy definition. Dwork and Rothblum (2016) propose concentrated differential privacy, which considers the expected values of the privacy loss random variables. The definition of concentrated differential privacy incorporates the explicit randomness induced by the noise distribution, but its scope is limited because it considers only the expected value of the privacy loss satisfying the ε-differential privacy definition instead of using confidence levels.
Hall et al. (2013) propose random differential privacy, which considers the privacy loss for corresponding confidence levels on the implicit randomness in the data-generation distribution. The definition of random differential privacy incorporates the implicit randomness induced by the data-generation distribution and bounds the probability over the space of datasets generated from the given distribution to satisfy the ε-differential privacy definition. Dwork et al. (2006a) define approximate differential privacy by adding a constant bias to the privacy guarantee provided by differential privacy. It is not a probabilistic refinement of differential privacy.
Around the same time as our work, Triastcyn and Faltings (2019) independently propose Bayesian differential privacy, which takes into account both sources of randomness. Despite this similarity, our works differ in multiple dimensions. Firstly, they show the reduction of their definition to a variant of Renyi differential privacy that depends on the data-generation distribution. Secondly, they rely on the moment accountant for the composition of the mechanisms. Lastly, they do not provide a finer case-by-case analysis of the sources of randomness, which is what leads to analytical solutions for the privacy guarantee in our work.
Kifer and Machanavajjhala (2012) define the Pufferfish privacy framework, which, together with its variant by Bassily et al. (2013), considers randomness due to the data-generation distribution as well as the noise distribution. Despite the generality of their approach, the framework relies on the domain expert to define a set of secrets that they want to protect.
Composition theorem. The recently proposed technique of the moment accountant Abadi et al. (2016) has become the state of the art for composing mechanisms in the area of privacy-preserving machine learning. Abadi et al. show that the moment accountant provides much stronger privacy guarantees than the conventional composition mechanisms. It works by keeping track of various moments of the privacy loss random variable and uses the bounds on them to provide privacy guarantees. The moment accountant requires access to the data-generation distribution to compute the bounds on the moments. Hence, the privacy guarantees are specific to the dataset.
Cost models. Ghosh and Roth (2015); Chen et al. (2016) propose game-theoretic methods that provide the means to evaluate the monetary cost of differential privacy. Our approach is inspired by the approach in the work of Hsu et al. (2014). They model the cost under a scenario of a research study wherein the participants are reimbursed for their participation. Our cost modelling is driven by the scenario of securing a compensation budget in compliance with GDPR. Our requirement differs from the requirements for the scenario in their work. In our case, there is no monetary incentive for participants to share their data.
In this paper, we provide a means to fine-tune the privacy level of a privacy-preserving mechanism by analysing various sources of randomness. Such fine-tuning leads to a probabilistic quantification of privacy levels with quantified risks, which we call privacy at risk. We also provide a composition theorem that leverages privacy at risk. We analytically calculate privacy at risk for the Laplace mechanism. We propose a cost model that bridges the gap between the privacy level and the compensation budget estimated by a GDPR compliant business entity. Convexity of the cost function ensures the existence of a unique privacy at risk level that minimises the compensation budget. The cost model helps not only in reinforcing the ease of application in a business setting but also in providing stronger privacy guarantees on the composition of mechanisms.
Privacy at risk may be fully analytically computed in cases where the data-generation, or the sensitivity distribution, the noise distribution and the query are analytically known and take convenient forms. We are now looking at such convenient but realistic cases.
We want to convey special thanks to Pierre Senellart at DI, École Normale Supérieure, Paris for his careful reading of our drafts and thoughtful interventions.
Moriarty et al. (2012). The effects of incremental costs of smoking and obesity on health care costs among adults: a 7-year longitudinal study. Journal of Occupational and Environmental Medicine, 54(3):286–291, 2012.
Nissim et al. (2007). Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84. ACM, 2007.
Zhang et al. (2012). Functional mechanism: regression analysis under differential privacy. Proceedings of the VLDB Endowment, 5(11):1364–1375, 2012.

Although a Laplace mechanism induces higher amount of noise on average than a Laplace mechanism for , there is a non-zero probability that induces noise commensurate to . This non-zero probability guides us to calculate the privacy at risk for the privacy at risk level . In order to get an intuition, we illustrate the calculation of the overlap between two Laplace distributions as an estimator of similarity between the two distributions.
[Overlap of Distributions, Papoulis and Pillai (2002)] The overlap, , between two probability distributions with support is defined as
The overlap between two probability distributions, and , such that , is given by
where .
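The overlap computation can be sketched numerically. The sketch assumes the usual definition of overlap as the integral of the pointwise minimum of the two densities, and it derives a closed form for two zero-mean Laplace distributions with scales b1 and b2 from the point mu where the densities cross. These symbols and function names are assumptions introduced for illustration, and the closed form may be arranged differently from the authors' expression.

```python
import math

def laplace_pdf(x, scale):
    """Density of a zero-mean Laplace distribution."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)

def laplace_overlap(b1, b2):
    """Closed-form overlap of Lap(0, b1) and Lap(0, b2)."""
    if b1 == b2:
        return 1.0
    wide, narrow = max(b1, b2), min(b1, b2)
    # The two densities cross at |x| = mu.
    mu = wide * narrow * math.log(wide / narrow) / (wide - narrow)
    # Inside [-mu, mu] the wider density is the minimum, outside the narrower one.
    return 1.0 - (math.exp(-mu / wide) - math.exp(-mu / narrow))

def numeric_overlap(b1, b2, steps=200_000):
    """Midpoint-rule estimate of the integral of min(pdf1, pdf2)."""
    half_width = 50.0 * max(b1, b2)
    dx = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * dx
        total += min(laplace_pdf(x, b1), laplace_pdf(x, b2)) * dx
    return total

closed_form = laplace_overlap(2.0, 1.0)
numeric = numeric_overlap(2.0, 1.0)
```

For scales 2 and 1 the crossing point is 2·ln 2, and the closed form evaluates to 0.75, which the numerical integral reproduces to within the discretisation error.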
Using the result in Lemma A, we note that the overlap between two distributions with and is . Thus, induces noise that is more than times similar to the noise induced by . Therefore, we can loosely say that at least of the times a Laplace Mechanism will provide the same privacy as a Laplace Mechanism .
Although the overlap between Laplace distributions with different scales offers an insight into the relationship between different privacy levels, it does not capture the constraint induced by the sensitivity. For a given query , the amount of noise required to satisfy differential privacy is commensurate to the sensitivity of the query. This calibration puts a constraint on the noise that is required to be induced on a pair of neighbouring datasets. We state this constraint in Lemma A, which we further use to prove that the Laplace Mechanism satisfies -privacy at risk.
For a Laplace Mechanism , the difference in the absolute values of noise induced on a pair of neighbouring datasets is upper bounded by the sensitivity of the query. Suppose that two neighbouring datasets and are given as input to a numeric query . For any output of the Laplace Mechanism ,
We use the triangle inequality in the first step and Definition 2 of sensitivity in the second step.
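The two steps named in this proof can be sketched as the following chain, written for a scalar query. The notation is an assumption: x and y stand for the neighbouring datasets, f for the query, z for the output, and Δ_f for the sensitivity. For a vector-valued query the same argument applies coordinate-wise with the ℓ1 norm.

```latex
\Big|\, \lvert f(x) - z \rvert - \lvert f(y) - z \rvert \,\Big|
\;\le\; \lvert f(x) - f(y) \rvert
\;\le\; \Delta_f
```

The first inequality is the reverse form of the triangle inequality, and the second holds because the sensitivity is the largest change of the query over neighbouring datasets.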
We write to denote a random variable sampled from an exponential distribution with scale . We write to denote a random variable sampled from a gamma distribution with shape and scale . [Papoulis and Pillai (2002)] If a random variable follows Laplace Distribution with mean zero and scale , .
[Papoulis and Pillai (2002)] If are i.i.d. random variables each following the Exponential Distribution with scale , .
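These two distributional facts can be checked by simulation. The sketch below assumes the statements are the standard ones: the absolute value of a zero-mean Laplace variable with scale b is exponential with scale b, and the sum of k such i.i.d. exponentials is gamma distributed with shape k and scale b. It compares sample moments with the moments of those target distributions. The helper names and parameter values are illustrative.

```python
import random
import statistics

def abs_laplace_samples(scale, n, rng):
    """Absolute values of n Laplace(0, scale) draws."""
    rate = 1.0 / scale
    # The difference of two i.i.d. exponentials is a Laplace variable.
    return [abs(rng.expovariate(rate) - rng.expovariate(rate)) for _ in range(n)]

def sum_abs_laplace_samples(scale, k, n, rng):
    """n draws of the sum of k absolute Laplace(0, scale) values."""
    return [sum(abs_laplace_samples(scale, k, rng)) for _ in range(n)]

rng = random.Random(0)
scale, k = 2.0, 3
abs_draws = abs_laplace_samples(scale, 50_000, rng)
sum_draws = sum_abs_laplace_samples(scale, k, 50_000, rng)

# Exponential with scale b: mean b, variance b**2.
abs_mean, abs_var = statistics.fmean(abs_draws), statistics.pvariance(abs_draws)
# Gamma with shape k and scale b: mean k*b, variance k*b**2.
sum_mean, sum_var = statistics.fmean(sum_draws), statistics.pvariance(sum_draws)
```

Matching the first two moments does not prove the distributional identities, but it is a quick sanity check before relying on them in the derivation.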
If and are two i.i.d. Gamma random variables, the probability density function for the random variable is given by
where is the modified Bessel function of the second kind.
Let and be two i.i.d. random variables. The characteristic function of a Gamma random variable is given as
Therefore,
Probability density function for the random variable is given by,
where is the modified Bessel function of the second kind. Let . Therefore,
We use Mathematica Inc. to solve the above integral.
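The route of this proof can be sketched with the standard result for the difference of two i.i.d. Gamma random variables. The notation is an assumption: X_1 and X_2 have shape n and scale b, T = X_1 − X_2, and K denotes the modified Bessel function of the second kind. The final expression may be arranged differently from the authors' statement.

```latex
\phi_{X}(s) = (1 - i b s)^{-n},
\qquad
\phi_{T}(s) = \phi_{X}(s)\,\phi_{X}(-s) = \left(1 + b^{2} s^{2}\right)^{-n},
\\[6pt]
f_{T}(t)
= \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-i s t}}{\left(1 + b^{2} s^{2}\right)^{n}} \, ds
= \frac{1}{\sqrt{\pi}\, \Gamma(n)\, b}
  \left( \frac{\lvert t \rvert}{2 b} \right)^{n - \frac{1}{2}}
  K_{n - \frac{1}{2}}\!\left( \frac{\lvert t \rvert}{b} \right)
```

As a sanity check, setting n = 1 and using K_{1/2}(u) = sqrt(π/(2u))·e^{−u} recovers the Laplace density exp(−|t|/b)/(2b), as expected for the difference of two i.i.d. exponential variables.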
If and are two i.i.d. Gamma random variables and , then follows the distribution with probability density function:
where is the probability density function of defined in Lemma A.
For Laplace Mechanism with query and for any output , ,
where follows the distribution in Lemma A, .
Let and be two datasets such that . Let be some numeric query. Let and denote the probabilities of getting the output for Laplace mechanisms and respectively. For any point and ,
(13)
By Definition 2,
(14)
Application of Lemma A and Lemma A yields,
(15)
Using Equations 14, 15, and Lemma A, A, we get
(16)
since, . Therefore,
(17)
where follows the distribution in Lemma A. We use Mathematica Inc. to analytically compute,
where is the regularised generalised hypergeometric function as defined in Askey and Daalhuis (2010). From Equation A and 17,
This completes the proof of Theorem 4.1.
Laplace Mechanism with is -probabilistically differentially private where
and follows .
Let and be any two neighbouring datasets sampled from the data-generating distribution . Let be the sampled sensitivity for query . Let and denote the probabilities of getting the output for Laplace mechanisms and respectively. For any point and ,
(18)
We used the triangle inequality in the penultimate step.
Using the trick in the work of Rubinstein and Aldà (2017), we define the following events. Let denote the set of pairs of neighbouring datasets sampled from for which the sensitivity random variable is upper bounded by . Let denote the set of sensitivity random variable values for which deviates from the unknown cumulative distribution of , , at most by the accuracy value . These events are defined in Equation 19.
(19)
(20)
(21)
In the last step, we use the definition of the sampled sensitivity to get the value of the first term. The last term is obtained using the DKW inequality, as defined in Massart et al. (1990), where denotes the number of samples used to build the empirical distribution of the sensitivity, .
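The role of the DKW inequality can be illustrated with a short sketch. It uses the standard form with Massart's constant: the probability that an empirical distribution built from n samples deviates from the true cumulative distribution by more than an accuracy value η is at most 2·exp(−2nη²). The function names and the example values are assumptions chosen for illustration.

```python
import math

def dkw_failure_bound(n_samples, accuracy):
    """Upper bound on P(sup |F_n - F| > accuracy) from the DKW inequality."""
    return min(1.0, 2.0 * math.exp(-2.0 * n_samples * accuracy ** 2))

def samples_for_accuracy(accuracy, failure_prob):
    """Smallest n for which the DKW bound is at most failure_prob."""
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * accuracy ** 2))

# Sensitivity samples needed so that the empirical distribution is within
# 0.05 of the true one, except with probability at most 0.01.
n_needed = samples_for_accuracy(accuracy=0.05, failure_prob=0.01)
bound_at_n = dkw_failure_bound(n_needed, accuracy=0.05)
```

The bound depends only on the number of samples and the accuracy value, not on the shape of the sensitivity distribution, which is what makes it usable when that distribution is unknown.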
The proof of Theorem 6 builds upon the ideas from the proofs for the other two cases. In addition to the events defined in Equation 19, we define an additional event , defined in Equation 22, as a set of outputs of the Laplace mechanism that satisfy the constraint of -differential privacy for a specified privacy at risk level .
(22)