Differential Privacy Via a Truncated and Normalized Laplace Mechanism

11/01/2019 · by William Lee Croft, et al.

When querying databases containing sensitive information, the privacy of individuals stored in the database has to be guaranteed. Such guarantees are provided by differentially private mechanisms which add controlled noise to the query responses. However, most such mechanisms do not take into consideration the valid range of the query being posed. Thus, noisy responses that fall outside of this range may be produced. To rectify this and thereby improve the utility of the mechanism, the commonly used Laplace distribution can be truncated to the valid range of the query and then normalized. However, normalization is a data-dependent operation that leaks additional information about the true query response, thereby violating the differential privacy guarantee. Here, we propose a new method which preserves the differential privacy guarantee through a careful determination of an appropriate scaling parameter for the Laplace distribution. We also generalize the privacy guarantee in the context of the Laplace distribution to account for data-dependent normalization factors and study this guarantee for different classes of range constraint configurations. For each class, we derive the optimal scaling parameter (i.e., the minimal value that preserves differential privacy) or provide an approximation thereof. As a consequence of this work, one can use the Laplace distribution to answer queries in a range-adherent and differentially private manner.


1 Introduction

Since its introduction in 2006, differential privacy [8] has become one of the most well-studied disclosure control methods that offer a formal privacy guarantee for the release of query responses for sensitive databases. At a high level, it prevents accurate inferences about the contents of the database by using a randomization mechanism to add noise to the query responses. By carefully controlling the noise, a data custodian can protect the privacy of individuals in the database while allowing for aggregate level analysis to be effectively carried out.

In many applications, queries will have natural and publicly known constraints on their range. For example, the percentage of satisfied (or dissatisfied) customers in a survey must fall in the range of 0 to 100. Inferences about the range of a query can often be drawn with ease. The sum of positive valued attributes cannot be less than zero and the number of records in a database that satisfy a particular predicate cannot be greater than the size of the database. If a valid range is known for a particular attribute, a valid range for most queries posed over that attribute can be calculated. In practice, most differentially private mechanisms add noise without consideration for such constraints. This implies that noisy responses may be generated outside of the valid range of the query.

The generation of out-of-bounds noisy query responses can be detrimental both in terms of downstream compatibility of the responses with other software and in terms of the ability to perform useful analysis on the database. To avoid this, the noisy responses should conform to the true range of the query if it is public information. A boundary-snapping process, which snaps out-of-bounds values to the nearest valid value, is sometimes used in practice; however, this does not take into consideration the impact of the operation on the overall utility of the noisy responses. To achieve a low level of expected distortion in noisy query responses, probability mass should ideally be focused near the true query response as much as possible. Boundary-snapping leads to pooling of probability mass at the boundaries of the valid range, which may not be near the true query response.
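For concreteness, boundary snapping can be sketched in a few lines (a minimal illustration assuming a NumPy environment; the function name and parameters are our own, not a library API):

```python
import numpy as np

def snapped_laplace(true_response, sensitivity, epsilon, lo, hi, rng=None):
    """Standard Laplace mechanism followed by boundary snapping (clamping)."""
    rng = rng or np.random.default_rng()
    noisy = true_response + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    # Out-of-bounds draws are snapped to the nearest boundary, so probability
    # mass pools at lo and hi rather than concentrating near the true response.
    return min(max(noisy, lo), hi)
```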

We provide an alternative to boundary-snapping where we first truncate the probability density function (PDF) used for the generation of noisy query responses to the valid range of the query and then normalize it. Normalization is required to restore the truncated function to a valid PDF. Since normalization involves the multiplication of the function by a factor greater than 1 and most PDFs have higher probability density around the location parameter, this process leads to greater increases in probability density for noisy responses near the true response compared to those farther away. This is beneficial for the utility of the mechanism as it leads to a lower level of expected distortion when noise is drawn from the new distribution.
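A minimal sketch of the sampling step follows, assuming a single valid interval [lo, hi] (helper names are ours). Note that, as Section 3.6 shows, pairing this sampler with the standard scale does not by itself preserve differential privacy; the scale must be chosen as derived later in the paper.

```python
import numpy as np

def laplace_cdf(x, mu, b):
    return 0.5 * np.exp((x - mu) / b) if x <= mu else 1.0 - 0.5 * np.exp(-(x - mu) / b)

def laplace_inv_cdf(u, mu, b):
    return mu + b * np.log(2 * u) if u <= 0.5 else mu - b * np.log(2 * (1 - u))

def truncated_laplace_sample(mu, b, lo, hi, rng=None):
    """Inverse-CDF draw from a Laplace PDF truncated to [lo, hi] and renormalized."""
    rng = rng or np.random.default_rng()
    # Sampling u uniformly between CDF(lo) and CDF(hi) implicitly rescales the
    # remaining mass to 1, which is exactly the truncate-and-normalize operation.
    u = rng.uniform(laplace_cdf(lo, mu, b), laplace_cdf(hi, mu, b))
    return laplace_inv_cdf(u, mu, b)
```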

The method of truncation and normalization has been previously applied [14] in the context of the Laplace distribution, a very widely-used distribution in differential privacy. However, that study did not account for the fact that the normalization factor is a function of the true query response, which is sensitive information. Due to this, normalization in fact leaks further information, and additional modification to the PDF is required to account for this. In this work, we study this leakage of information and address it by proposing new versions of truncated and normalized PDFs from a Laplace distribution which, as we show, preserve the differential privacy guarantee.

1.1 Contributions

Our work focuses on the design of differentially private mechanisms which adhere to query range constraints via the truncation and normalization of the PDF from a Laplace distribution. More specifically, our main contributions are:

  • We demonstrate that the process of truncation and normalization of a Laplace PDF is not sufficient on its own to preserve the differential privacy guarantee. This failure results from the data-dependent nature of the normalization.

  • We show how this can be corrected by carefully calculating the scaling parameter for the Laplace distribution. For this, we generalize the differential privacy guarantee in the context of a Laplace mechanism to incorporate data-dependent normalization factors.

  • We use our generalized privacy guarantee to guide a study of range-adherent mechanisms with respect to different classes of range constraints they are able to adhere to. For each class we show how to derive an optimal scaling parameter or an approximation thereof.

2 Literature Review

Privacy-preserving analysis of sensitive data has a long history as can be witnessed by the many different research directions that have been investigated; for a survey of early such works, see [1]. In recent years, the framework of differential privacy [6] has been gaining a great deal of traction as a preferred method for analysis of sensitive databases. By adding noise drawn from an appropriately configured Laplace distribution to database query responses prior to their release, it has been proven that the ability of attackers to distinguish between potential configurations of database records can be limited [8]. This concept has subsequently spawned diverse studies including the number of queries that can be safely answered, the utility of the noisy responses, and variations on how to add noise; see e.g., [7, 15].

Some studies on preserving utility when adding noise have focused on publicly known constraints. In these, the goal for the generation of noisy responses is to respect the constraints and thereby provide better utility. In the non-interactive setting, where batches of related queries are posed, known relationships between the responses to a batch of queries should be preserved. This was first considered in the context of the queries needed to break a contingency table for a database up into marginals (i.e., projections of record counts over subsets of the database attributes). Linear programming was applied after a transformation to the Fourier domain to ensure consistency such that the noisy marginals do not appear to have been derived from different databases [3]. Subsequent work has improved further upon this concept by applying post-processing to vectors of noisy query responses in order to achieve consistency between the noisy values while minimizing the distance between the original noisy vector and the post-processed version [12, 13].

In the interactive setting of differential privacy, independent queries are answered on an individual basis. There, it may be desirable to ensure that noisy responses are consistent with a (known) valid range of the query. A well-known implementation of this is the truncated geometric mechanism [11], a variant of the geometric mechanism [10] which snaps out-of-bounds values to the nearest valid response. In many cases, this approach can lead to a high density of noisy responses on the boundaries of the valid range of the query. To avoid undesirable properties such as these types of spikes in probability density, the explicit fair mechanism [5] was designed to answer queries in a range-adherent manner while providing properties such as a monotonically decaying PDF. Both the geometric and the explicit fair mechanisms are specifically designed to answer counting queries (i.e., the number of database records satisfying a particular predicate). As such, they only produce integer-valued responses and cannot be applied to more general numeric queries.

The concept of snapping out-of-bounds noisy responses to the nearest valid value can be equally applied to any type of randomization mechanism. This has been studied in the context of the Laplace mechanism which produces real-valued output and can be applied to general numeric queries. Another variant of the Laplace mechanism involves truncation of the PDF used by a Laplace distribution to the valid range of the query followed by normalization [14]. Although the method of normalization might be preferred for a greater inflation of probability mass around the true query response as opposed to the boundaries of the valid range, the study neglected to consider the impact of the information leakage inherent in the process of normalization. We note that other mechanisms using truncated PDFs have been studied as well, however, the truncation was not designed to match the valid range of the query and thus they cannot be applied as a means to achieve adherence to the valid range [9, 2].

3 Preliminaries for Truncation and Normalization

Differential privacy [6] (Definition 1) enforces the guarantee that similar databases must have similar probabilities of producing the same noisy query responses. More specifically, it guarantees that for every pair of adjacent databases (i.e., databases that differ by a single record) and every noisy query response, the respective probabilities of the two databases to produce the given noisy query response must be within a multiplicative factor of $e^{\epsilon}$ of each other, where $\epsilon$ is a privacy parameter specified by the data custodian. The lower the value of $\epsilon$ is set, the stricter the guarantee becomes as the distributions over the noisy query responses are forced to become more similar to each other.

Definition 1.

Let $\mathcal{D}$ be the set of potential database configurations, $f$ be a query function and $M$ be a randomization mechanism. For $M$ to satisfy the $\epsilon$-differential privacy guarantee [6], the following condition must hold for all pairs of adjacent databases $D_1, D_2 \in \mathcal{D}$ and all noisy query responses $r$:

$\Pr[M(D_1) = r] \leq e^{\epsilon} \cdot \Pr[M(D_2) = r]$ (1)

For answering general numeric queries in a differentially private manner, the Laplace distribution is commonly used to implement the randomization mechanism.

Definition 2.

The Laplace distribution for a continuous random variable $x$, configured by a location parameter $\mu$ and a scaling parameter $b$, is given by:

$\mathrm{Lap}(x \mid \mu, b) = \frac{1}{2b} e^{-\frac{|x - \mu|}{b}}$ (2)

To satisfy the differential privacy guarantee, the scaling parameter must be set in terms of the chosen value of $\epsilon$ and the query sensitivity $\Delta$, which is the largest possible difference between the true query responses of a pair of adjacent databases. By setting the scaling parameter to $b = \frac{\Delta}{\epsilon}$ and the location parameter to the true query response $f(D)$, values drawn from the resulting distribution can be used as the noisy query responses. This configuration is referred to as the Laplace mechanism [6].
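As a reference point for what follows, a minimal sketch of this standard, range-oblivious mechanism (names are our own illustration):

```python
import numpy as np

def laplace_mechanism(true_response, sensitivity, epsilon, rng=None):
    """Standard Laplace mechanism: location at the true query response,
    scale b = sensitivity / epsilon; range constraints are ignored."""
    rng = rng or np.random.default_rng()
    return rng.laplace(loc=true_response, scale=sensitivity / epsilon)
```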

When answering queries in a range-adherent manner, the PDF of a Laplace distribution can be truncated to the valid range of the query and then normalized. Since the probability density in the Laplace distribution exponentially decays as distance from the location parameter increases, multiplication by the normalization factor will induce greater increases in probability density for noisy query responses that are closer to the true response than those that are farther away. This is favourable for the utility of the mechanism as it is conducive to a low level of expected distortion in the noisy query responses. However, the operation of normalization is dependent upon the true query response and leaks sensitive information. Verifying the privacy guarantee becomes much more complex with this method and requires further modification of the PDF. We propose an alternative method (to that which is used by the standard Laplace mechanism) for the calculation of the scaling parameter in the Laplace distribution that accounts for the data-dependent normalization in order to preserve the privacy guarantee of differential privacy.

In this section, we provide the preliminaries that are needed to study the impact of truncation and normalization of a Laplace PDF with respect to the privacy guarantee. We first formalize our treatment of range constraints and lay out the calculations of normalization factors for classes of constraint configurations. We then examine the impact of adherence to range constraints and data-dependent normalization on the privacy guarantee. We define a generalized privacy guarantee to use in this setting. With these components laid out, we provide a small example to demonstrate the violation of the privacy guarantee when truncation and normalization are applied without further modification to the PDF.

In Table 1, we provide a quick reference for the definitions of the variables used throughout the paper.

$\mathcal{D}$ — The set of potential database configurations
$M$ — A function representing a randomization mechanism
$f$ — A function representing a query posed on a database
$\Delta$ — Query sensitivity: the maximum possible difference between the query responses of two adjacent databases. This is a non-zero, positive, real value.
$\epsilon$ — Privacy parameter: the user-specified, desired level of privacy. This is a non-zero, positive, real value.
$b$ — Scaling parameter (for cases where all PDFs share the same scaling parameter): a value calculated to determine the scale of noise required to release a database query. This is a non-zero, positive, real value.
$x$ — The continuous random variable used for a PDF
$N_D$ — The normalization factor required for database $D$
$b_D$ — The scaling parameter required for database $D$
$d_D$ — The distance between $f(D)$ and the continuous random variable $x$
$C_L$ — The set of finite constraints to the left of $f(D)$
$C_R$ — The set of finite constraints to the right of $f(D)$
$I_{-\infty}$ — A boolean variable used to indicate the presence of a constraint spanning to negative infinity
$I_{\infty}$ — A boolean variable used to indicate the presence of a constraint spanning to infinity
$t_{-\infty}$ — The distance between $f(D)$ and the right endpoint of a constraint spanning to negative infinity
$t_{\infty}$ — The distance between $f(D)$ and the left endpoint of a constraint spanning to infinity
$r_c$ — The distance between $f(D)$ and the right endpoint of a constraint $c$
$l_c$ — The distance between $f(D)$ and the left endpoint of a constraint $c$
$S_L$ — A summation of the integrals of all constraints to the left of $f(D)$
$S_R$ — A summation of the integrals of all constraints to the right of $f(D)$
$k$ — A real-valued variable used to indicate the number of spans of $\Delta$ between two databases
Table 1: Definitions of variables

3.1 Constraint Configurations

Given constraints on the range of a query represented as a set of ranges within which query responses cannot occur, the goal is to truncate a Laplace PDF to remove these constrained ranges, calculate an appropriate normalization factor to restore the integral over the truncated span(s) to 1, and calculate an appropriate scaling parameter. The required scaling parameter is dependent upon the privacy parameter $\epsilon$ specified by the data custodian and the position of the location parameter (i.e., the true query response) relative to the constraints.

The scaling parameter must be selected such that the differential privacy guarantee is satisfied. The higher the scaling parameter is set, the farther the probability mass becomes spread from the location parameter. Therefore, calculating the lowest possible scaling parameter that satisfies the privacy guarantee is one of the core aspects of our work. We consider the selection of the lowest possible scaling parameter to be the criterion for optimal utility within the class of mechanisms using a truncated and normalized Laplace PDF.

Allowing for truncated spans in a PDF provides a natural way to incorporate range constraints into a randomization mechanism. Given the wide variety of queries that can be posed on a database, many different configurations of constraints may arise in practice. To simplify the work required in designing appropriate mechanisms for the various constraint configurations, we categorize these configurations into classes such that for each class, a corresponding mechanism can be developed. Visual representations of possible constraint configurations from each class are shown in Figure 1.

The first class is characterized by a single constraint that begins at any point and spans either to infinity or to negative infinity (Figure 1-a). An example of this is a query that has the set of natural numbers or positive real numbers as its range.

The second class may have any number of finite spans of constraints (Figure 1-b). These configurations would be typical of queries in which the range is subject to a periodic function. For example, a query may be posed about dates on which individuals visited a particular medical clinic where it is known that the clinic is closed during certain periods of the year.

We also consider a final class of arbitrary constraint configurations to handle any combination of the first two classes (Figure 1-c). Common instantiations of this class are queries that have a single finite span of valid responses. For example, a query may have a single finite range of possible age values.

Figure 1: An example from each class of constraint configurations is shown. The black line represents a real-valued, one-dimensional range and the shaded red blocks represent constraints. Constraints shown on the endpoint of the line span infinitely in that direction. The constraint classes are: a) a single infinite constraint, b) arbitrary finite constraints, and c) arbitrary constraints.

3.2 Normalization Factors

Since constraints represent infeasible spans in the range of the query, the PDF must be truncated to remove these spans. This has the effect of producing a piece-wise function whose integral is no longer equal to 1. In order to restore this to a PDF, it must be scaled by a normalization factor such that the integral over the valid space is equal to 1. The normalization factor is calculated as the reciprocal of the integral of the truncated PDF. This integral is calculated as the sum of the integrals of the removed spans of the PDF subtracted from 1. An example of a truncated and normalized Laplace PDF is shown in Figure 2.

Figure 2: A graph of two Laplace PDFs with location parameters of 0 and scaling parameters of 1. The shaded blocks show an example of different types of constraints (infinite and finite on both the left and right of the location parameter). The red line shows the original Laplace PDF and the blue line shows one that has been truncated and then normalized.

Integrals of a Laplace PDF can be easily calculated using the cumulative distribution function (CDF).

Definition 3.

The cumulative distribution function corresponding to a Laplace distribution of $\mathrm{Lap}(x \mid \mu, b)$ is given by:

$F(x \mid \mu, b) = \begin{cases} \frac{1}{2} e^{\frac{x - \mu}{b}} & \text{if } x \leq \mu \\ 1 - \frac{1}{2} e^{-\frac{x - \mu}{b}} & \text{if } x > \mu \end{cases}$ (3)
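In code, Definition 3 is a two-branch expression (a direct transcription for illustration):

```python
import math

def laplace_cdf(x, mu, b):
    """CDF of Lap(x | mu, b) from Definition 3."""
    if x <= mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)

# The integral of the PDF over any span [lo, hi] is then
# laplace_cdf(hi, mu, b) - laplace_cdf(lo, mu, b).
```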

We define the following notation for the representation of constraints. Let $C_L$ be the set of finite constraints to the left of $f(D)$ for any database $D$. For any constraint $c \in C_L$, let $l_c$ and $r_c$ be the distance between $f(D)$ and the left and right endpoints, respectively, of $c$. Let $I_{-\infty}$ be an indicator variable set to either 0 or 1 to denote the absence or presence of a constraint that spans to negative infinity and let $t_{-\infty}$ be the distance between the right endpoint of such a constraint, if it exists, and $f(D)$. These concepts are all defined analogously for a set $C_R$ of finite constraints to the right of $f(D)$, with $I_{\infty}$ acting as the indicator variable for a constraint spanning to infinity and $t_{\infty}$ acting as the distance between the left endpoint of such a constraint, if it exists, and $f(D)$. Finally, let $b_D$ be a data-dependent scaling parameter to be used when querying a database $D$.

Let $S_L$, defined in Formula (4), be the sum of all integrals of the constraints to the left of $f(D)$. The integral of a finite constraint to the left of $f(D)$ can be calculated as the CDF at the right endpoint of the constraint minus the CDF at the left endpoint (shown as the expression inside the summation of Formula (4)). The integral of a constraint that spans to negative infinity can be calculated as the CDF at the right endpoint of the constraint (shown in the term outside the summation of Formula (4)).

$S_L = \frac{I_{-\infty}}{2} e^{-\frac{t_{-\infty}}{b_D}} + \sum_{c \in C_L} \frac{1}{2}\left(e^{-\frac{r_c}{b_D}} - e^{-\frac{l_c}{b_D}}\right)$ (4)

Analogous calculations for the integrals of finite constraints to the right of $f(D)$ and a constraint spanning to infinity are shown in Formula (5), where $S_R$ is the sum of all integrals of the constraints to the right of $f(D)$.

$S_R = \frac{I_{\infty}}{2} e^{-\frac{t_{\infty}}{b_D}} + \sum_{c \in C_R} \frac{1}{2}\left(e^{-\frac{l_c}{b_D}} - e^{-\frac{r_c}{b_D}}\right)$ (5)

The total integral of the removed space is simply the summation of all constraint integrals (the sum of Formulae (4) and (5)). The integral of the truncated PDF is the sum of the constraint integrals subtracted from 1. The normalization factor $N_D$ is the reciprocal of the integral of the truncated PDF, as shown in Formula (6).

$N_D = \frac{1}{1 - (S_L + S_R)}$ (6)
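The calculation is a direct transcription of Formulae (4)–(6) (a sketch; encoding each finite constraint as a pair of (near, far) endpoint distances from $f(D)$ is our own convention):

```python
import math

def normalization_factor(b, left, right, t_neg_inf=None, t_inf=None):
    """Formulae (4)-(6): `left` and `right` list (near, far) endpoint distances
    from f(D) for the finite constraints on each side; t_neg_inf / t_inf give
    the distance to the finite endpoint of an infinite constraint, if any."""
    S_L = sum(0.5 * (math.exp(-n / b) - math.exp(-f / b)) for n, f in left)
    if t_neg_inf is not None:            # constraint spanning to negative infinity
        S_L += 0.5 * math.exp(-t_neg_inf / b)
    S_R = sum(0.5 * (math.exp(-n / b) - math.exp(-f / b)) for n, f in right)
    if t_inf is not None:                # constraint spanning to infinity
        S_R += 0.5 * math.exp(-t_inf / b)
    return 1.0 / (1.0 - (S_L + S_R))     # Formula (6)

# A single constraint covering (-inf, f(D) - 1] yields a factor of about 1.23:
print(normalization_factor(1.0, [], [], t_neg_inf=1.0))
```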

3.3 Generalized Differential Privacy Guarantee

When using the basic Laplace distribution to implement the randomization mechanism, the privacy guarantee of Definition 1 can be simplified to the form shown in Formula (7) for all pairs of adjacent databases $D_1, D_2$, where $d_D$ is the distance between the continuous random variable $x$ and $f(D)$ for any database $D$.

$\frac{e^{-\frac{d_{D_1}}{b}}}{e^{-\frac{d_{D_2}}{b}}} \leq e^{\epsilon}$ (7)

When using normalization factors induced by constraints and allowing for differing scaling parameters, we obtain a new general form of the privacy guarantee for all pairs of adjacent databases shown in Formula (8).

$\frac{N_{D_1}}{N_{D_2}} \cdot \frac{b_{D_2}}{b_{D_1}} \cdot \frac{e^{-\frac{d_{D_1}}{b_{D_1}}}}{e^{-\frac{d_{D_2}}{b_{D_2}}}} \leq e^{\epsilon}$ (8)

To simplify the analysis, we assume that when two databases $D_1, D_2$ are being compared in a privacy guarantee, it will always be the case that $f(D_1) \geq f(D_2)$. As a result, a form for the symmetric case of the guarantee (i.e., where the roles of $D_1$ and $D_2$ are exchanged) must also be considered, as shown in Formula (9). To assert the differential privacy guarantee, both forms must be satisfied for all pairs of adjacent databases that apply to their respective cases.

$\frac{N_{D_2}}{N_{D_1}} \cdot \frac{b_{D_1}}{b_{D_2}} \cdot \frac{e^{-\frac{d_{D_2}}{b_{D_2}}}}{e^{-\frac{d_{D_1}}{b_{D_1}}}} \leq e^{\epsilon}$ (9)

3.4 Guarantee Generalization for Arbitrary Distances

The differential privacy guarantee is generally interpreted in terms of pairs of adjacent databases. For such databases, the distance between their true query responses is upper bounded by the query sensitivity $\Delta$. Although not typically done, the privacy guarantee can also be interpreted in terms of the actual distance between query responses of arbitrary databases (i.e., pairs that are not necessarily adjacent). This is a useful interpretation for our setting since the distance between the databases can be used as a means to calculate the difference between their normalization factors as well.

For a pair of databases $D_1, D_2$ having PDFs with location parameters in the same span of truncated space, the constraint endpoint distances of $D_2$ can be written in terms of the variables used for $D_1$. This can be done by applying an offset equal to the distance between the two location parameters. We define a distance variable $k$, representing the number of spans of $\Delta$ between the location parameters, which we utilize to create a new representation of the constraint integral calculations. Since we assume that $f(D_1) \geq f(D_2)$, we know that $f(D_2)$ will be closer to all constraints to the left and farther from all constraints to the right compared to $f(D_1)$. We therefore use an offset of $-k\Delta$ for distances to the left and an offset of $+k\Delta$ for distances to the right. The distance-offset versions of the constraint integral calculations, denoted $S_L(k)$ and $S_R(k)$, are shown in Formulae (10) and (11).

$S_L(k) = \frac{I_{-\infty}}{2} e^{-\frac{t_{-\infty} - k\Delta}{b_{D_2}}} + \sum_{c \in C_L} \frac{1}{2}\left(e^{-\frac{r_c - k\Delta}{b_{D_2}}} - e^{-\frac{l_c - k\Delta}{b_{D_2}}}\right)$ (10)
$S_R(k) = \frac{I_{\infty}}{2} e^{-\frac{t_{\infty} + k\Delta}{b_{D_2}}} + \sum_{c \in C_R} \frac{1}{2}\left(e^{-\frac{l_c + k\Delta}{b_{D_2}}} - e^{-\frac{r_c + k\Delta}{b_{D_2}}}\right)$ (11)

To use these integral calculations in the privacy guarantee, we must consider how the required multiplicative bound between the probabilities of the databases producing the same noisy response should reflect this modification. Since $e^{\epsilon}$ is the multiplicative bound for adjacent databases (meaning their query responses are separated by a distance of at most $\Delta$), it follows that databases whose query responses are separated by $k$ spans of $\Delta$ should have a multiplicative bound given by the product of $e^{\epsilon}$ multiplied by itself $k$ times. In other words, the multiplicative bound must become $e^{k\epsilon}$.

In fact, $k$ does not need to be an integer; it can be real-valued. The important requirement to maintain in this generalization is that for any two databases separated by $k\Delta$, the multiplicative difference in their probabilities will be bounded by $e^{k\epsilon}$. To show this, consider two databases separated by $c$ spans of $\Delta$. Since $\frac{k}{c}$ times that distance is equal to $k\Delta$, we must show that the multiplicative bound for $\frac{k}{c}$ of the total distance is $e^{k\epsilon}$. This is easily shown as follows:

$\left(e^{c\epsilon}\right)^{\frac{k}{c}} = e^{k\epsilon}$ (12)

To apply this generalization to the privacy guarantee, we use $S_L$ and $S_R$ defined according to Formulae (4) and (5) as well as $S_L(k)$ and $S_R(k)$ defined according to Formulae (10) and (11) to produce the generalized form of the privacy guarantee shown in Formula (13).

$\frac{1 - (S_L(k) + S_R(k))}{1 - (S_L + S_R)} \cdot \frac{b_{D_2}}{b_{D_1}} \cdot \frac{e^{-\frac{d_{D_1}}{b_{D_1}}}}{e^{-\frac{d_{D_2}}{b_{D_2}}}} \leq e^{k\epsilon}$ (13)

3.5 Continuous Random Variable Worst Case Analysis

In the generalized privacy guarantee of Formula (13), we can see that the variables related to the continuous random variable of the PDFs appear only in the third fraction of the left-hand side of the inequality. We now look for the worst case with respect to the continuous random variable placement in order to eliminate the need for explicit quantification over this variable in the privacy guarantee. Since the left-hand side of the guarantee must always be less than or equal to the right-hand side, the worst case occurs when the fraction is maximized.

Applying exponential identities, the fraction containing the continuous random variable can be re-written as:

$e^{\frac{d_{D_2}}{b_{D_2}} - \frac{d_{D_1}}{b_{D_1}}}$ (14)

This expression is maximized when the exponent is maximized. If we redefine $d_{D_2}$ in terms of $d_{D_1}$ based on the distance between the location parameters, we can write out three different cases which cover the possible placements of the continuous random variable. These cases are shown in Formula (15). Recall that $f(D_1) \geq f(D_2)$ and that the location parameters are separated by a distance of $k\Delta$.

$\frac{d_{D_2}}{b_{D_2}} - \frac{d_{D_1}}{b_{D_1}} = \begin{cases} \frac{d_{D_1} - k\Delta}{b_{D_2}} - \frac{d_{D_1}}{b_{D_1}} & \text{if } x \leq f(D_2) \\ \frac{k\Delta - d_{D_1}}{b_{D_2}} - \frac{d_{D_1}}{b_{D_1}} & \text{if } f(D_2) < x \leq f(D_1) \\ \frac{d_{D_1} + k\Delta}{b_{D_2}} - \frac{d_{D_1}}{b_{D_1}} & \text{if } x > f(D_1) \end{cases}$ (15)

Of these cases, the third will produce the highest value, thus the continuous random variable should be placed somewhere in the span from $f(D_1)$ up to infinity. In order to determine which position within this space maximizes the value of the expression, we must consider the relationship between $b_{D_1}$ and $b_{D_2}$.

If $b_{D_2}$ were always equal to $b_{D_1}$ then the placement of the continuous random variable within the identified span would have no impact on the value of the expression. However, we would ideally like to take advantage of the fact that the farther a location parameter is from a constraint, the less of the PDF integral is lost during truncation. A larger remaining integral means that a lower normalization factor is needed. As we later show, higher scaling parameters are needed to compensate for increased normalization factors. However, higher scaling parameters lead to higher levels of expected noise. Therefore, it is desirable to allow the scaling parameters to decrease as the normalization factors decrease.

We first consider the case of a single infinitely spanning constraint. If the constraint spans to infinity, this implies that in this configuration, $b_{D_1} \geq b_{D_2}$ (the location parameter of $D_1$ lies closer to the constraint and thus requires the larger normalization factor). A greater $d_{D_1}$ therefore means that the value of the expression goes up as $d_{D_1}$ increases. We must therefore work with the highest possible $d_{D_1}$ value. If the guarantee can be satisfied for this value then any other possible $d_{D_1}$ value will cause the value of the left-hand side to decrease, meaning that the guarantee will hold. We therefore place $x$ right on the border of the constraint going to infinity (if one exists). This means that $d_{D_1}$ is equal to $t_{\infty}$. The fraction can be updated to:

$e^{\frac{t_{\infty} + k\Delta}{b_{D_2}} - \frac{t_{\infty}}{b_{D_1}}}$ (16)

While similar analysis could be provided for the case of a single constraint spanning to negative infinity, we do not explicitly consider this case as it can be easily treated as the case of a constraint spanning to infinity after a horizontal reflection has been applied.

In constraint configurations where there are no infinite constraints, the worst case with differing scaling parameters is undefined since the continuous random variable can always be placed farther away. In such a configuration, it becomes necessary to use the same scaling parameter everywhere to avoid this problem. When this occurs, the fraction can be written as:

$e^{\frac{k\Delta}{b}}$ (17)

We must also consider the form for the symmetric case of the privacy guarantee. This time, the worst case occurs when we maximize:

$e^{\frac{d_{D_1}}{b_{D_1}} - \frac{d_{D_2}}{b_{D_2}}}$ (18)

When redefining $d_{D_1}$ in terms of $d_{D_2}$ based on the distance between the location parameters, the three possible cases are as shown in Formula (19).

$\frac{d_{D_1}}{b_{D_1}} - \frac{d_{D_2}}{b_{D_2}} = \begin{cases} \frac{d_{D_2} + k\Delta}{b_{D_1}} - \frac{d_{D_2}}{b_{D_2}} & \text{if } x \leq f(D_2) \\ \frac{k\Delta - d_{D_2}}{b_{D_1}} - \frac{d_{D_2}}{b_{D_2}} & \text{if } f(D_2) < x \leq f(D_1) \\ \frac{d_{D_2} - k\Delta}{b_{D_1}} - \frac{d_{D_2}}{b_{D_2}} & \text{if } x > f(D_1) \end{cases}$ (19)

The first and second cases will be larger than the third. The worst case occurs when the second term of the exponent is minimized. The smallest possible value it can take on is 0, which occurs when $d_{D_2} = 0$. The fraction can therefore be re-written as:

$e^{\frac{k\Delta}{b_{D_1}}}$ (20)

For cases where the same scaling parameter $b$ is used everywhere due to the worst-case analysis from the symmetric form, this would be:

$e^{\frac{k\Delta}{b}}$ (21)

3.6 Motivating Example

We conclude this section with an example to illustrate the need for an alternate calculation of the scaling parameter when the Laplace PDF is truncated and normalized. Consider a case with a single constraint spanning to negative infinity where a fixed scaling parameter $b$ is used regardless of the position of the location parameter. Writing $t_{-\infty}$ for the distance from the location parameter closer to the constraint, the binding (symmetric) form of the privacy guarantee would be written as shown in Formula (22).

$\frac{1 - \frac{1}{2} e^{-\frac{t_{-\infty} + k\Delta}{b}}}{1 - \frac{1}{2} e^{-\frac{t_{-\infty}}{b}}} \cdot e^{\frac{k\Delta}{b}} \leq e^{k\epsilon}$ (22)

The regular calculation of the scaling parameter for a Laplace mechanism is to set $b = \frac{\Delta}{\epsilon}$. When making this substitution, we have the form shown in Formula (23).

$\frac{1 - \frac{1}{2} e^{-\frac{(t_{-\infty} + k\Delta)\epsilon}{\Delta}}}{1 - \frac{1}{2} e^{-\frac{t_{-\infty}\epsilon}{\Delta}}} \cdot e^{k\epsilon} \leq e^{k\epsilon}$ (23)

Now we consider the comparison of a database with a location parameter on the border of the constraint to any other database. To do so, we set $t_{-\infty}$ to 0 and simplify to produce the form shown in Formula (24).

$\left(2 - e^{-k\epsilon}\right) e^{k\epsilon} \leq e^{k\epsilon}$ (24)

We can see that for any $k > 0$ (i.e., any pair of databases that do not share the same true query response), the privacy guarantee is not satisfied, since $2 - e^{-k\epsilon} > 1$. This shows that the regular calculation of the scaling parameter cannot be applied when normalization has occurred. In the remainder of the paper, we show how to calculate appropriate scaling parameters to preserve the differential privacy guarantee. Furthermore, since different location parameters induce different normalization factors, we allow for the scaling parameter to be calculated as a function of the position of the location parameter relative to the constraints in order to achieve lower scaling parameters and thus better utility.
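This failure can be reproduced numerically (a self-contained check; the particular values of the privacy parameter, sensitivity, and constraint position are arbitrary illustrative choices):

```python
import math

eps, delta = 1.0, 1.0            # privacy parameter and query sensitivity
b = delta / eps                  # the standard Laplace mechanism scale
border = 0.0                     # constraint spans (-inf, 0]; feasible space is [0, inf)

def truncated_density(x, mu):
    """Laplace density at x with location mu, truncated to [border, inf) and renormalized."""
    lost = 0.5 * math.exp(-(mu - border) / b)      # integral over the removed span
    return (1.0 / (1.0 - lost)) * math.exp(-abs(x - mu) / b) / (2 * b)

# Adjacent databases: f(D2) sits on the border, f(D1) one sensitivity unit away.
x = border                                          # worst-case output: on the border
ratio = truncated_density(x, border) / truncated_density(x, border + delta)
print(f"ratio = {ratio:.3f}, e^eps = {math.exp(eps):.3f}")   # about 4.44 > 2.72: violated
```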

4 Constraint Configurations

Having defined classes of range constraints and their corresponding normalization factors, we now study these classes in the context of our generalized privacy guarantee. This form of the privacy guarantee captures the information leakage induced by data-dependent normalization, allowing us to investigate how the selection of appropriate scaling parameters can be used to account for the information leakage. Throughout this section, we define an instantiation of the privacy guarantee for each class of constraint configurations according to the corresponding calculation of the normalization factors and the worst-case analysis of the continuous random variable. Through the manipulation of these instantiations, we derive the required scaling parameters. For conciseness within this section, the full proofs are provided in the appendix.

4.1 Single Infinite Constraint

We begin with the class of single infinitely spanning constraints. We show the corresponding configuration of the privacy guarantee as well as the derivations required for the calculation of optimal scaling parameters that satisfy the guarantee. In this class, there is a single span of infeasible space extending either to infinity or to negative infinity. The complement of this space is a single span of feasible space. Without loss of generality, we assume the constraint extends to infinity. The case of a constraint extending to negative infinity can be seen as a horizontal reflection of this, in which all analysis is the same.

Since there is only a single constraint, the normalization factor can be calculated as the reciprocal of the integral of the truncated space. We apply this normalization factor to the privacy guarantee of Formula (8) along with the distance generalization of Section 3.4 and the continuous random variable worst case in the third case of Formula (15). The privacy guarantee can now be written as:

$\frac{1 - \frac{1}{2} e^{-\frac{t_{\infty} + k\Delta}{b_{D_2}}}}{1 - \frac{1}{2} e^{-\frac{t_{\infty}}{b_{D_1}}}} \cdot \frac{b_{D_2}}{b_{D_1}} \cdot e^{\frac{t_{\infty} + k\Delta}{b_{D_2}} - \frac{t_{\infty}}{b_{D_1}}} \leq e^{k\epsilon}$ (25)

The formulae in this section make use of the multi-valued LambertW function [4]. As we are interested only in real-valued output, we restrict our attention to the single-valued functions of the primary and -1 branches. We refer to these functions as $W_0$ and $W_{-1}$ for shorthand and use $W_Z$ for instances where either branch may be applied.

Definition 4.

The LambertW function [4] is given by the inverse function of $w \mapsto w e^{w}$, where $w$ is a complex number. For the branches to be single-valued, the following conditions apply: $W_0(y) \geq -1$ and $W_{-1}(y) \leq -1$.

$W_0(y)\, e^{W_0(y)} = y$ (26)
$W_{-1}(y)\, e^{W_{-1}(y)} = y$ (27)
Lemma 1.

For any PDF with a location parameter at distance $d$ from the constraint, bounds on the possible values of its scaling parameter $b_d$ are determined by the following four inequalities:

(28)
(29)
(30)
(31)
Proof.

By isolating $b_d$ in Formula (25), we obtain an inequality which contains the LambertW function. Since the LambertW function has two branches, this gives two restrictions on the possible values that the $b_d$ values can take on. By setting $d = k\Delta$, we determine that $b_d$ must fall in the intersection of the two spans specified in Formulae (28) and (29) in order for the privacy guarantee to be satisfied for a pair of PDFs with location parameters at distances of 0 and $d$ away from the constraint. Since the LambertW function does not produce real-valued output for any input values less than $-\frac{1}{e}$, we are also able to derive restrictions on the allowable $b_d$ values that produce valid input for the LambertW function. In this way, we additionally determine that the $b_d$ value for a PDF with a location parameter at distance $d$ away from the constraint must fall into one of the two spans specified by Formulae (30) and (31). Note that these conditions are necessary but not sufficient for satisfying the privacy guarantee. ∎

Lemma 2.

Through the selection of an appropriate $b_0$ value when $d = 0$, it is possible to calculate $b_d$ values for any PDF with a location parameter at distance $d$ away from the constraint such that the inequalities of Lemma 1 are satisfied.

Proof.

By calculating $b_0$ as shown in Formula (32) when $d = 0$, the right-hand sides of Formulae (28) and (29) become equal to the right-hand sides of Formulae (30) and (31), respectively. As a result, in order to satisfy all four inequalities of Lemma 1, a PDF with a location parameter at distance $d$ from the constraint can be assigned one of the two possible $b_d$ values (by setting $Z$ to either 0 or -1) calculated using Formula (33). ∎

(32)
(33)
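While the closed forms of Formulae (32) and (33) are not reproduced here, the branch structure they rely on can be illustrated numerically. Under our reconstruction of Formula (25), setting $t_{\infty} = 0$ and treating the guarantee as an equality reduces, via the substitution $u = \frac{d}{b_d}$, to an equation of the form $e^{u} = \alpha u + \frac{1}{2}$, which SciPy's lambertw solves on both branches (the function name and the reduction are ours, not the paper's):

```python
import numpy as np
from scipy.special import lambertw

def candidate_scales(b0, d, delta, eps):
    """Candidate b_d values solving b_d * (2*exp(d/b_d) - 1) = b0 * exp(k*eps)
    with k = d/delta (the equality form of the guarantee for the pair at
    distances 0 and d from the constraint). Returns 0, 1, or 2 candidates."""
    k = d / delta
    alpha = b0 * np.exp(k * eps) / (2 * d)
    arg = -np.exp(-0.5 / alpha) / alpha
    if arg < -1 / np.e:
        return []                        # no real-valued LambertW output
    candidates = []
    for branch in (0, -1):               # principal and -1 branches
        u = -0.5 / alpha - lambertw(arg, branch).real
        if u > 0:
            candidates.append(d / u)     # b_d = d / u
    return candidates

# Example: with b0 = 2 and d = delta = eps = 1, the two branches give
# roughly 3.07 and 0.65, one candidate per branch index.
print(candidate_scales(2.0, 1.0, 1.0, 1.0))
```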

Lemma 3.

The sign of the denominator in the derivative, taken with respect to $d$, of the $b_d$ calculation of Lemma 2 depends on which branch is indicated by the branch index variable $Z$.

Proof.

Through analysis of the signs of all factors in the denominator, we show that the sign will always be negative for the principal branch and will always be positive for the -1 branch. ∎

Lemma 4.

As a function of $d$, the $b_d$ calculation of Lemma 2 is unimodal for either branch index.

Proof.

Since the sign of the denominator of its derivative cannot change when restricted to the same branch index, a sign change could only be induced by the numerator. Through analysis of the factors in the numerator, we show that exactly one sign change occurs for each branch index. Since the full derivative has only one sign change when considering the $Z = 0$ and $Z = -1$ branches separately, each of the functions is unimodal. ∎

Lemma 5.

As a function of $d$, the input to $W_Z$ in the calculation of Lemma 2 is unimodal and is initially decreasing.

Proof.

Through analysis of the input, we can show that it is equal to 0 when $d = 0$ and is initially decreasing. Furthermore, we show that the derivative has exactly one sign change, making the function unimodal. ∎

Lemma 6.

As a function of $d$, the mode of the $b_d$ calculation of Lemma 2 is a minimum for the principal branch and a maximum for the -1 branch.

Proof.

From Lemma 3, we know that the sign of the denominator in the derivative of Formula (33) with respect to $d$ will always be negative for the principal branch and will always be positive for the -1 branch. From Lemma 4, we know that a single sign change occurs in the numerator. Through further analysis of the numerator, we show that the numerator is initially positive until its zero, from which point it is negative. This results in the mode being a minimum for the principal branch and a maximum for the -1 branch. ∎

Theorem 1.

The calculation method of Lemma 2 provides a solution which optimally satisfies the differential privacy guarantee.

Proof.

To show that the worst-case analysis of Section 3.5 holds, we show that the $b_d$ values are monotonically decreasing as $d$ increases. Since the privacy guarantee of Formula (25) is non-symmetric, we also apply the calculations to an alternate formulation which represents the symmetric case and show that this formulation holds. Finally, we must consider the optimality of the $b_d$ values. Since lower $b_d$ values are preferable, we show through further examination of the four inequalities of Lemma 1 that it is not possible to calculate lower $b_d$ values that could satisfy the privacy guarantee. ∎

4.2 Arbitrary Finite Constraints

In this configuration, we have an arbitrary set of finite spans of infeasible space. The complement of these spans is the feasible space, thus it extends to both positive and negative infinity.

As stated in Section 3.5, since we have no constraints that go to infinity, the continuous random variable worst-case analysis requires that we use the same scaling parameter everywhere. The normalization factors that we use here correspond to summations using finite integrals. We start by considering pairs of PDFs with location parameters in the same span of truncated space. Since the scaling parameters are all the same, we can apply the identity $b_{D_1} = b_{D_2} = b$ together with the worst case of Formula (17) to produce the privacy guarantee shown in Formula (34). Note that since there are no infinite constraints present, the indicator variables $I_{-\infty}$ and $I_{\infty}$ inside $S_L$ and $S_R$ are equal to 0.

$\frac{1 - (S_L(k) + S_R(k))}{1 - (S_L + S_R)} \cdot e^{\frac{k\Delta}{b}} \leq e^{k\epsilon}$ (34)

For pairs of PDFs with location parameters in different spans of truncated space, the sets of constraints to their left and right will differ. This would require each normalization factor to be written in terms of its own sets of constraints. As we will see later, there is only one specific case of PDFs with location parameters in different spans of truncated space that we must explicitly consider in order to prove the guarantee for any pair of databases.

Lemma 7.

For each span of truncated space, there exists a value of $b$ for which the privacy guarantee is satisfied for any pair of PDFs with location parameters within that span.

Proof.

When comparing the left and right-hand sides of the privacy guarantee for $k = 0$, both sides are equal to 1. In order for the inequality of the privacy guarantee to be satisfied, it is therefore a necessary condition that the left-hand side is initially increasing at a lower rate than the right-hand side. By taking the derivative of each side with respect to $k$ and enforcing this condition as an inequality, it is possible to derive a lower bound for $b$, shown in Formula (35). By using this lower bound as the chosen $b$ value, we show that the privacy guarantee is satisfied for all pairs of PDFs with location parameters in the same span of truncated space.

$b \geq \frac{\Delta}{\epsilon} \cdot \frac{1 - 2S_L}{1 - (S_L + S_R)}$ (35)

Lemma 8.

Within each span of truncated space, the value determined from Lemma 7 acts as a lower bound for the value of $b$ required to satisfy the privacy guarantee.

Proof.

With an alternate representation of the privacy guarantee which uses a value of $b$ arbitrarily larger than that which is given by the lower bound in Formula (35), we show that the privacy guarantee will still be satisfied. ∎

Lemma 9.

For each finite constraint, there exists a value of $b$ that satisfies the privacy guarantee for the pair of PDFs with location parameters on the endpoints of the constraint.

Proof.

The guarantee form used thus far has used the same sets of finite constraints to the left and right of both location parameters, implying that they must both lie within the same span of truncated space. We must also be able to show that the privacy guarantee holds for location parameters in different spans of truncated space. To show this, we first consider a pair of PDFs with location parameters that lie on opposite endpoints of a finite constraint (with $f(D_1)$, as always, being the point on the right). We provide this instantiation of the privacy guarantee in Formula (37), using Formula (36) to calculate the value $G$ of the integral corresponding to the span covered by the finite constraint. Since both location parameters are adjacent to this span, the value of the integral is the same for both PDFs.

$G = \frac{1}{2}\left(1 - e^{-\frac{k\Delta}{b}}\right)$ (36)
$\frac{1 - (S_L(k) + G + S_R(k))}{1 - (S_L + G + S_R)} \cdot e^{\frac{k\Delta}{b}} \leq e^{k\epsilon}$ (37)

We know from Lemma 7 that a sufficiently high value of $b$ can satisfy the guarantee without the modification made here and from Lemma 8 that raising the value of $b$ beyond the requirement will not cause that form of the privacy guarantee to be violated. We show that increasing $b$ reduces the influence of $G$ in Formula (37) by causing the value of $G$ to asymptotically approach 0. It therefore follows that a sufficiently high value of $b$ will also satisfy this form of the privacy guarantee and that increasing $b$ beyond that value will not violate the guarantee. ∎

Lemma 10.

All lower bounds on $b$ identified in Lemmas 7 and 9 are less than $\frac{2\Delta}{\epsilon}$.

Proof.

The calculations for lower bounds on $b$ can be handled by treating Formulae (35) and (37) as equalities and solving for $b$. We can identify bounds on the possible values of $b$ by studying the bounds on the variables $S_L$ and $S_R$. Since $S_L$ represents the sum of the integrals of the constraints to the left of $f(D)$, it can be as low as 0 (if no constraints are present to the left of $f(D)$) and can approach but not reach 0.5 (since half of the integral of the PDF exists on the left-hand side). The bounds on $S_R$ are characterized in the same way. By studying the ranges of the calculations for $b$ in the context of these bounds, we show that $b$ will fall in the range of $\left(0, \frac{2\Delta}{\epsilon}\right)$. ∎

Theorem 2.

The optimal value of $b$ that satisfies the privacy guarantee for all pairs of databases can be found by taking the maximum out of $3n + 2$ lower bounds, where $n$ is the number of finite constraints.

Proof.

Lemma 7 provides a lower bound on $b$ for pairs of PDFs with location parameters in the same span of truncated space. This applies to a form of the privacy guarantee in which $f(D_1) \geq f(D_2)$ holds. The symmetric case can be handled in the same way after an application of horizontal reflection to the configuration. Lemma 9 provides an additional lower bound on $b$ for PDFs with location parameters on opposite endpoints of a constraint. In Formula (37), if the fraction on the left-hand side is inverted, as it would be in a symmetric form, the left-hand side decreases. Thus, if the form in Formula (37) is satisfied, the symmetric form will be as well. For $n$ constraints, there are $n + 1$ spans of truncated space, giving $n + 1$ lower bounds on $b$ in Lemma 7 from the regular form of the privacy guarantee and an additional $n + 1$ lower bounds for the symmetric form. From Lemma 9, there are $n$ lower bounds, giving a total of $3n + 2$. By selecting the largest of these, the guarantee is satisfied for all pairs of PDFs with location parameters in any same span of truncated space and for all pairs of PDFs with location parameters on opposite endpoints of a constraint. Since each of the lower bounds must be adhered to, it is not possible to select a lower value of $b$ than this.

It remains to be shown that any arbitrary pair of databases is also protected. This follows as a transitive property of multiple applications of the guarantee forms used throughout the lemmas. For any pair of PDFs with location parameters at arbitrary points in truncated space, it is possible to represent this as a sequence of points where each adjacent pair of points in the sequence corresponds to a pair of PDF location parameters in the configuration used in either Lemma 7 or Lemma 9. The multiplicative bound for the arbitrary pair is therefore the product of the bounds of the adjacent pairs in the sequence. Since each of the adjacent pairs satisfies the privacy guarantee, the product will also satisfy the guarantee for the arbitrary pair. ∎

Theorem 3.

The optimal value of $b$ for any configuration of arbitrary finite constraints can be calculated to a precision of $p$ decimal places in $O\left(n \log \frac{2\Delta \cdot 10^{p}}{\epsilon}\right)$ time.

Proof.

As stated in Theorem 2, $b$ must be chosen as the maximum value out of the lower bounds calculated from Formulae (35) and (37). The variables $S_L$ and $S_R$ in these inequalities represent summations of exponential functions where each function contains an instance of $b$ in the denominator of its exponent. We know of no method to isolate $b$ for such configurations. It therefore takes $O(n)$ time to check whether a given value of $b$ is above all lower bounds.

From Lemmas 8 and 9, we know that any value of $b$ larger than the required value will also satisfy the privacy guarantee. This implies that once the inequality is satisfied, increasing $b$ further will never violate the inequality. Lemma 10 indicates that the value of $b$ will always be between 0 and $\frac{2\Delta}{\epsilon}$, meaning that for a precision of $p$ decimal places, there are $\frac{2\Delta}{\epsilon} \cdot 10^{p}$ possible values for $b$. By performing a binary search for the optimal value, only the logarithm of the number of possible values of $b$ must be checked, leading to an overall time complexity of $O\left(n \log \frac{2\Delta \cdot 10^{p}}{\epsilon}\right)$. In most cases, the values of $\Delta$ and $\frac{1}{\epsilon}$ are likely to be small, making $O(np)$ a more practical representation of the time complexity. ∎
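The search can be sketched as follows (a simplified illustration in our own notation: it checks only the per-span bounds of Formula (35); a complete implementation would also evaluate the endpoint bounds of Formula (37) in the same $O(n)$ fashion):

```python
import math

def lemma7_gap(b, delta, eps, left_spans, right_spans):
    """b minus the Formula (35) lower bound for one location-parameter position.
    left_spans / right_spans hold (near, far) endpoint distances from f(D) to
    each finite constraint on that side, as in Formulae (4) and (5)."""
    S_L = sum(0.5 * (math.exp(-n / b) - math.exp(-f / b)) for n, f in left_spans)
    S_R = sum(0.5 * (math.exp(-n / b) - math.exp(-f / b)) for n, f in right_spans)
    return b - (delta / eps) * (1 - 2 * S_L) / (1 - S_L - S_R)

def optimal_uniform_scale(delta, eps, positions, p=6):
    """Binary search over (0, 2*delta/eps] for the smallest b meeting every bound;
    `positions` lists one (left_spans, right_spans) pair per location to check.
    Soundness of the search rests on Lemmas 8-10 (a larger b never breaks a bound)."""
    lo, hi = 1e-12, 2 * delta / eps
    while hi - lo > 10 ** (-p):
        mid = (lo + hi) / 2
        if all(lemma7_gap(mid, delta, eps, L, R) >= 0 for L, R in positions):
            hi = mid
        else:
            lo = mid
    return hi
```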

4.3 Arbitrary Constraints

While the privacy guarantee forms in the configurations covered thus far have been manageable in their complexity, configurations with multiple constraints where at least one spans infinitely lead to much more complex expressions due to an increase in the number of exponential functions which appear in the normalization factors combined with variable scaling parameters. The expressions that must be worked with now take the form of transcendental functions having differing polynomials as coefficients and exponents. Exact calculations necessary for determining optimal scaling parameter values now become very difficult, if not impossible, due to the necessary step of root-finding for these functions. Unlike the case of the single infinite constraint, there is no function such as the LambertW that can be easily substituted in these expressions to allow for solutions to be calculated.

When the number of constraints is small, it may still be possible to apply some form of analysis to produce very good approximate solutions which can approach the optimal values within a desired level of precision. We leave the analysis of such configurations as an open problem. For cases with greater numbers of constraints, it is likely that the analysis becomes unmanageable. For such cases, we thus assign the same scaling parameter to every PDF instead of considering variable scaling parameters. In essence, we sacrifice the better utility obtained by the calculation of optimal variable scaling parameters for the ability to calculate the optimal value of a single scaling parameter shared by all PDFs such that we are able to satisfy the privacy guarantee. The corresponding privacy guarantee is shown in Formula (38). This is a modified version of Formula (13) using the same scaling parameter everywhere and the appropriate continuous random variable worst-case analysis of Section 3.5.

$\frac{1 - (S_L(k) + S_R(k))}{1 - (S_L + S_R)} \cdot e^{\frac{k\Delta}{b}} \leq e^{k\epsilon}$ (38)
Corollary 1.

The optimal uniform value of $b$ for any configuration of arbitrary constraints can be calculated to a precision of $p$ decimal places in $O\left(n \log \frac{2\Delta \cdot 10^{p}}{\epsilon}\right)$ time.

Proof.

The calculation of $b$ and the related proofs required here are almost identical to those of Section 4.2. The only difference is that $S_L$ and $S_R$ may now contain integrals of constraints that span to negative infinity and infinity, respectively. Due to this, it is now possible for $S_L$ or $S_R$ to be equal to 0.5, whereas before they could only approach this value. Note, however, that they cannot be simultaneously equal to 0.5 as this would imply that no feasible space exists. This implies that rather than the upper bound on $b$ being one that approaches $\frac{2\Delta}{\epsilon}$, it is now equal to this value. The algorithm for the approximation of the optimal value of $b$ from Theorem 3 can thus be employed here as well. ∎

Single Infinite Constraint. Calculation: $b_0$ from Formula (32); $b_d$ from Formula (33). Notes: optimal values.
Arbitrary Finite Constraints. Calculation: Section 4.2. Bound(s): $b \in \left(0, \frac{2\Delta}{\epsilon}\right)$. Notes: approximation of optimal values.
Arbitrary Constraints. Calculation: Section 4.2. Bound(s): $b \in \left(0, \frac{2\Delta}{\epsilon}\right]$. Notes: approximation of optimal uniform values.
Table 2: Summary of information on the scaling parameter for each constraint configuration.

4.4 Summary of Results

Through the study of query range constraints, we have presented a framework for the design of mechanisms that allow for the truncation and normalization of the Laplace PDF according to the configuration of constraints. We have provided a detailed analysis of the impact of constraints on the mechanisms in our framework. An important characteristic of this framework is that it maintains the ability to apply an arbitrary discretization to the truncated range of the mechanisms.

We have derived a generalized differential privacy guarantee and have applied our method of design to various classes of constraint configurations. For each mechanism, we have proven its correctness and where applicable, we have also proven optimality with respect to the calculation of the scaling parameters. Information on the calculation of scaling parameters and their bounds is summarized in Table 2 for each of the constraint configuration classes that we have studied.

5 Conclusions

When posing queries on a sensitive database, natural range constraints of the query may be publicly known; however, most randomization mechanisms do not adhere to range constraints when adding noise to query responses. Truncation and normalization of a Laplace PDF offers a means to generate noisy responses within a specified range while improving the utility of the mechanism by inducing an increase in probability density that is greater for noisy responses nearer the true response than for responses farther away. We show that since the normalization of the PDF is a data-dependent operation, it leaks sensitive information and violates the differential privacy guarantee. We propose a method to correct this which involves a careful determination of the scaling parameter used by the Laplace distribution. We introduce a generalization of the differential privacy guarantee which can be applied to the Laplace distribution to incorporate data-dependent normalization factors. Using our generalized guarantee, we study different classes of range constraint configurations and provide derivations of optimal scaling parameters or approximations thereof. Using our proposed calculations, a data custodian can now apply the Laplace distribution to answer general numeric queries in a range-adherent and differentially private manner.

References

  • [1] N. R. Adam and J. C. Worthmann (1989) Security-Control Methods for Statistical Databases: A Comparative Study. ACM Computing Surveys 21 (4), pp. 515–556. External Links: Document Cited by: §2.
  • [2] J. Awan and A. Slavkovic (2018) Differentially Private Uniformly Most Powerful Tests for Binomial Data. Note: https://arxiv.org/abs/1805.09236 Cited by: §2.
  • [3] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar (2007) Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release. In Proceedings of ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 273–282. External Links: Document Cited by: §2.
  • [4] R.M. Corless, G.H. Gonnet, D.E.G. Hare, D.J. Jeffrey, and D.E. Knuth (1996) On the Lambert W Function. Advances in Computational Mathematics 5, pp. 329–359. External Links: Document Cited by: §4.1, Definition 4.
  • [5] G. Cormode, T. Kulkarni, and D. Srivastava (2018) Constrained Private Mechanisms for Count Data. In 34th IEEE International Conference on Data Engineering, pp. 845–856. External Links: Document Cited by: §2.
  • [6] C. Dwork and A. Roth (2014) The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9, pp. 211–407. External Links: Document Cited by: §2, §3, §3, Definition 1.
  • [7] C. Dwork (2008) Differential Privacy: A Survey of Results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, Lecture Notes in Computer Science, Vol. 4978, pp. 1–19. External Links: Document Cited by: §2.
  • [8] C. Dwork (2006) Differential Privacy. In Proceedings of the 33rd International Conference on Automata, Languages and Programming - Volume Part II, pp. 1–12. External Links: Document Cited by: §1, §2.
  • [9] Q. Geng, W. Ding, R. Guo, and S. Kumar (2018) Truncated Laplacian Mechanism for Approximate Differential Privacy. CoRR. External Links: Link Cited by: §2.
  • [10] A. Ghosh, T. Roughgarden, and M. Sundararajan (2012) Universally Utility-Maximizing Privacy Mechanisms. SIAM Journal on Computing 41 (6), pp. 1673–1693. External Links: Document Cited by: §2.
  • [11] M. Gupte and M. Sundararajan (2010) Universally Optimal Privacy Mechanisms for Minimax Agents. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 135–145. External Links: Document Cited by: §2.
  • [12] M. Hay, V. Rastogi, G. Miklau, and D. Suciu (2010) Boosting the Accuracy of Differentially Private Histograms Through Consistency. In Proceedings of the Very Large Data Base Endowment, pp. 1021–1032. External Links: Document Cited by: §2.
  • [13] J. Lee, Y. Wang, and D. Kifer (2015) Maximum Likelihood Postprocessing for Differential Privacy Under Consistency Constraints. In Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 635–644. External Links: Document Cited by: §2.
  • [14] F. Liu (2018) Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Bounding Constraints. Note: https://arxiv.org/abs/1607.08554 Cited by: §1, §2.
  • [15] H. H. Nguyen, J. Kim, and Y. Kim (2013) Differential Privacy in Practice. Journal of Computing Science and Engineering 7, pp. 177–186. External Links: Document Cited by: §2.

Appendix

Lemma 1.

For any PDF with a location parameter at distance $d$ from the constraint, bounds on the possible values of its scaling parameter $b_d$ are determined by the following four inequalities:

(1)
(2)
(3)
(4)
Proof.

Recall from the main text that the privacy guarantee to be satisfied is as shown in Formula (5), which restates Formula (25) of the main text.

$\frac{1 - \frac{1}{2} e^{-\frac{t_{\infty} + k\Delta}{b_{D_2}}}}{1 - \frac{1}{2} e^{-\frac{t_{\infty}}{b_{D_1}}}} \cdot \frac{b_{D_2}}{b_{D_1}} \cdot e^{\frac{t_{\infty} + k\Delta}{b_{D_2}} - \frac{t_{\infty}}{b_{D_1}}} \leq e^{k\epsilon}$ (5)

By isolating $b_{D_2}$ in the privacy guarantee, we obtain Formulae (6) and (7). Given a PDF for database $D_1$ with a scaling parameter $b_{D_1}$, a paired PDF for database $D_2$ satisfies the privacy guarantee if its scaling parameter falls in the intersection of the two spans given by these inequalities.

(6)
(7)

In order for this intersection to be a real-valued range, it is necessary for the input to the LambertW functions in the inequalities to always be greater than or equal to $-\frac{1}{e}$. To ensure that this condition is met for all possible values of $d$, we first take the derivative of the input with respect to $d$, as shown in Formula (8).

(8)

Three possible zeros for the derivative can be calculated as shown in Formulae (9) (where the variable $Z$ can be replaced with 0 or -1) and (10).

(9)
(10)

Since the substitution of the zero from Formula (9) into the input of the LambertW function does not allow for easy steps of simplification when using -1 as the value of $Z$, we take an alternate approach to first show that the value of the input to the LambertW function is the same at both zeros defined in Formula (9) (using 0 or -1 as the value of $Z$). Let the first zero occur at $d_0$ and the second occur at $d_{-1}$. The equality between the values at these zeros is shown in Formulae (11)–(16).

(11)
(12)
(13)
(14)

At this point, we replace $d_0$ and $d_{-1}$ with their values as determined by Formula (9) and then simplify the expression. In the interest of space, we substitute the instances of the LambertW function from Formula (9), using $W_0$ for the 0 branch and $W_{-1}$ for the -1 branch.

(15)
(16)

Since the LambertW functions $W_0$ and $W_{-1}$ both have the same input, the equality of Formula (16) is confirmed to be valid. Now, by taking the zero of Formula (9) using 0 as the value for $Z$, we rewrite the zero in terms of $W_0$ as shown in Formula (17).

(17)

When substituting the expression from Formula (17) into the original LambertW input, the value of the input falls outside of the allowable range of input. Since we have shown that the value at the zero using -1 as the value of $Z$ will be the same, both of the potential modes determined by these zeros can be ignored. The function of the input is therefore unimodal and it remains to determine whether the mode is a minimum or a maximum. By setting $d$ to 0 in the derivative, we obtain the expression shown in Formula (18).

(18)

It is clear that the denominator is always positive as both factors are squared. In the numerator, four factors are present. The first, being the constant 2, and the third, being an exponential function, must always be positive. The signs of the second and fourth factors remain to be determined. We show that the second factor is always negative by proving Formula (19).