1 Introduction
DP (Differential Privacy) [21, 22] is becoming a gold standard for data privacy; it enables big data analysis while protecting users’ privacy against adversaries with arbitrary background knowledge. Depending on the underlying architecture, DP can be categorized into the centralized model and the local model [22]. In the centralized model, a “trusted” database administrator, who can access all users’ personal data, obfuscates the data (e.g., by adding noise or generalization) before providing them to a (possibly malicious) data analyst. Although DP was initially studied mainly in the centralized model, the original personal data in this model can be leaked from the database through illegal access or internal fraud. This issue has become critical in recent years, because the number of data breach incidents is increasing [15].
The local model does not require a “trusted” administrator, and therefore does not suffer from the data leakage issue explained above. In this model, each user obfuscates her personal data by herself, and sends the obfuscated data to a data collector (or data analyst). Based on the obfuscated data, the data collector can estimate some statistics (e.g., histogram, heavy hitters [44]) of the personal data. DP in the local model, which is called LDP (Local Differential Privacy) [19], has recently attracted much attention in the academic field [5, 12, 24, 29, 30, 39, 42, 44, 45, 49, 56], and has also been adopted by industry [16, 48, 23].
However, LDP mechanisms regard all personal data as equally sensitive, and therefore leave a lot of room for increasing data utility. For example, consider questionnaires such as: “Have you ever cheated in an exam?” and “Were you with a prostitute in the last month?” [11]. Obviously, “Yes” is a sensitive response to these questionnaires, whereas “No” is not sensitive. The RR (Randomized Response) method proposed by Mangat [37] utilizes this fact. Specifically, it reports “Yes” or “No” as follows: if the true answer is “Yes”, it always reports “Yes”; otherwise, it reports “Yes” with probability p and “No” with probability 1 − p. Since the reported answer “Yes” may come from both the true answers “Yes” and “No”, the confidentiality of a user reporting “Yes” is not violated. Moreover, since the reported answer “No” always comes from the true answer “No”, the data collector can estimate a distribution of true answers with higher accuracy than with Warner’s RR [51], which simply flips “Yes” and “No” with a fixed probability. However, Mangat’s RR does not provide LDP, since LDP regards both “Yes” and “No” as equally sensitive.

There are also many “nonsensitive” values for other types of data. For example, locations such as hospitals and homes can be sensitive, whereas visited sightseeing places, restaurants, and coffee shops are nonsensitive for many users. Divorced people may want to keep their divorce secret, while others may not care about their marital status. The distinction between sensitive and nonsensitive data can also differ from user to user (e.g., the home address differs from user to user; some people might want to keep even the sightseeing places secret). To explain more about this issue, we briefly review related work on LDP and variants of DP.
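To make Mangat’s scheme concrete, here is a minimal sketch (ours, not from the paper), with a generic flip probability p for true “No” answers; the function name is our own choice:

```python
import random

def mangat_rr(true_answer: str, p: float) -> str:
    """Mangat's randomized response: 'Yes' is sensitive, 'No' is not.

    A true 'Yes' is always reported as 'Yes'; a true 'No' is reported
    as 'Yes' with probability p and as 'No' with probability 1 - p.
    A reported 'No' therefore always reveals a true 'No'.
    """
    if true_answer == "Yes":
        return "Yes"
    return "Yes" if random.random() < p else "No"
```

Because a reported “Yes” can arise from either true answer, a “Yes” report gives the respondent plausible deniability, while “No” reports can be counted exactly.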
Related work. Since Dwork [21] introduced DP, a number of its variants have been studied to provide different types of privacy guarantees; e.g., LDP [19], d_X-privacy [8], Pufferfish privacy [32], dependent DP [36], Bayesian DP [53], mutual-information DP [14], Rényi DP [38], and distribution privacy [31]. In particular, LDP [19] has been widely studied in the literature. For example, Erlingsson et al. [23] proposed the RAPPOR as an obfuscation mechanism providing LDP, and implemented it in the Google Chrome browser. Kairouz et al. [29] showed that under the ℓ1 and ℓ2 losses, the randomized response (generalized to multiple alphabets) and the RAPPOR are order optimal among all LDP mechanisms in the low and high privacy regimes, respectively. Wang et al. [50] generalized the RAPPOR and a random projection-based method [6], and found parameters that minimize the variance of the estimate.
Some studies have also attempted to address the non-uniformity of privacy requirements among records (rows) or among items (columns) in centralized DP: Personalized DP [28], Heterogeneous DP [3], and One-sided DP [17]. However, to our knowledge, obfuscation mechanisms that address the non-uniformity among input values in the “local” DP setting have not been studied. In this paper, we show that data utility can be significantly increased by designing such local mechanisms.
Our contributions. The goal of this paper is to design obfuscation mechanisms in the local model that achieve high data utility while providing DP for sensitive data. To achieve this, we introduce the notion of ULDP (Utility-optimized LDP), which provides a privacy guarantee equivalent to LDP only for sensitive data, and propose obfuscation mechanisms providing ULDP. As a task for the data collector, we consider discrete distribution estimation [2, 24, 27, 29, 39, 45, 23, 56], where personal data take discrete values. Our contributions are as follows:

We first consider the setting in which all users use the same obfuscation mechanism, and propose two ULDP mechanisms: the utility-optimized RR and the utility-optimized RAPPOR. We prove that when there are a lot of nonsensitive data, our mechanisms provide much higher utility than two state-of-the-art LDP mechanisms: the RR (for multiple alphabets) [29, 30] and the RAPPOR [23]. We also prove that when most of the data are nonsensitive, our mechanisms provide almost the same utility as a non-private mechanism that does not obfuscate the personal data, in the low privacy regime where the privacy budget is ε = ln|X_S| for a set X_S of sensitive personal data.

We then consider the setting in which the distinction between sensitive and nonsensitive data can be different from user to user, and propose a PUM (Personalized ULDP Mechanism) with semantic tags. The PUM keeps secret what is sensitive for each user, while enabling the data collector to estimate a distribution using some background knowledge about the distribution conditioned on each tag (e.g., geographic distributions of homes). We also theoretically analyze the data utility of the PUM.

We finally show that our mechanisms are very promising in terms of utility using two large-scale datasets.
The proofs of all statements in the paper are given in the appendices.
Cautions and limitations. Although ULDP is meant to protect sensitive data, there are some cautions and limitations.
First, we assume that each user sends a single datum and that each user’s personal data is independent (see Section 2.1). This is reasonable for a variety of personal data (e.g., locations, age, sex, marital status), where each user’s data is irrelevant to most other users’ data. However, for some types of personal data (e.g., flu status [47]), each user can be highly influenced by others. There might also be a correlation between sensitive data and nonsensitive data when a user sends multiple data (on a related note, nonsensitive attributes may lead to re-identification of a record [40]). A possible solution to these problems would be to combine ULDP with Pufferfish privacy [32, 47], which is used to protect correlated data. We leave this as future work (see Section 7 for discussions on the case of multiple data per user and the correlation issue).
We focus on a scenario in which it is easy for users to decide what is sensitive (e.g., cheating experience, location of home). However, there is also a scenario in which users do not know what is sensitive. For the latter scenario, we cannot use ULDP but can simply apply LDP.
Apart from the sensitive/nonsensitive data issue, there are scenarios that ULDP does not cover. For example, ULDP does not protect users who regard “information disclosure” itself as sensitive (i.e., those who will not disclose any information). We assume that users have consented to information disclosure. To collect as much data as possible, we can provide an incentive for the information disclosure; e.g., a reward or point-of-interest (POI) information near a reported location. We also assume that the data collector obtains consent from users before providing reported data to third parties. Note that these cautions are common to LDP.
There might also be a risk of discrimination; e.g., the data collector might discriminate against all users who provide a yes-answer, and have no qualms about a small number of false positives. False positives decrease as ε increases. We note that LDP also suffers from this attack; the false positive probability is the same for both ULDP and LDP with the same ε.
In summary, ULDP provides a privacy guarantee equivalent to LDP for sensitive data under the assumption of data independence. We consider our work a building block of broader DP approaches and a basis for further development.
2 Preliminaries
2.1 Notations
Let ℝ≥0 be the set of non-negative real numbers. Let n be the number of users, and X (resp. Y) be a finite set of personal (resp. obfuscated) data. We assume continuous data are discretized into bins in advance (e.g., a location map is divided into some regions). We use the superscript “(i)” to represent the i-th user. Let X^(i) (resp. Y^(i)) be a random variable representing the personal (resp. obfuscated) data of the i-th user. The i-th user obfuscates her personal data X^(i) via her obfuscation mechanism Q^(i), which maps x ∈ X to y ∈ Y with probability Q^(i)(y|x), and sends the obfuscated data Y^(i) to a data collector. Here we assume that each user sends a single datum. We discuss the case of multiple data in Section 7.

We divide personal data into two types: sensitive data and nonsensitive data. Let X_S ⊆ X be a set of sensitive data common to all users, and X_N = X \ X_S be the remaining personal data. Examples of such “common” sensitive data are regions including public sensitive locations (e.g., hospitals) and obviously sensitive responses to the questionnaires described in Section 1. (Note that these data might be sensitive for many/most users but not for all in practice; e.g., some people might not care about their cheating experience. However, we can regard these data as sensitive for all users, i.e., be on the safe side, by allowing a small loss of data utility.)

Furthermore, let X_S^(i) ⊆ X_N be a set of sensitive data specific to the i-th user (here we do not include X_S in X_S^(i) because X_S is protected for all users in our mechanisms). X_S^(i) is a set of personal data that is possibly nonsensitive for many users but sensitive for the i-th user. Examples of such “user-specific” sensitive data are regions including private locations such as the user’s home and workplace. (Note that the majority of the working population can be uniquely identified from their home/workplace location pairs [25].)

In Sections 3 and 4, we consider the case where all users divide X into the same set X_S of sensitive data and the same set X_N of nonsensitive data, and use the same obfuscation mechanism Q (i.e., Q^(1) = ⋯ = Q^(n) = Q). In Section 5, we consider a general setting that can deal with the user-specific sensitive data X_S^(i) and user-specific mechanisms Q^(i). We call the former case a common-mechanism scenario and the latter a personalized-mechanism scenario.

We assume that each user’s personal data X^(i) is independently and identically distributed (i.i.d.) with a probability distribution p, which generates x ∈ X with probability p(x). Let X = (X^(1), …, X^(n)) and Y = (Y^(1), …, Y^(n)) be tuples of all personal data and all obfuscated data, respectively. The data collector estimates p from Y by a method described in Section 2.5. We denote by p̂ the estimate of p. We further denote by C the probability simplex; i.e., C = {p | p(x) ≥ 0 for any x ∈ X, Σ_{x∈X} p(x) = 1}.

2.2 Privacy Measures
LDP (Local Differential Privacy) [19] is defined as follows:
Definition 1 (ε-LDP).
Let ε ≥ 0. An obfuscation mechanism Q from X to Y provides ε-LDP if for any x, x' ∈ X and any y ∈ Y,

Q(y|x) ≤ e^ε Q(y|x').  (1)

LDP guarantees that an adversary who has observed y cannot determine, for any pair of x and x', whether y comes from x or from x' with a certain degree of confidence. As the privacy budget ε approaches 0, all of the data in X become almost equally likely. Thus, a user’s privacy is strongly protected when ε is small.
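As a concrete illustration of Definition 1 (our own sketch, not part of the paper’s formalism), the smallest ε satisfying (1) for a mechanism given as a matrix can be computed numerically; the function name and matrix layout below are our choices:

```python
import numpy as np

def ldp_budget(Q: np.ndarray) -> float:
    """Smallest eps such that Q(y|x) <= e^eps * Q(y|x') for all x, x', y.

    Q is a row-stochastic matrix: rows are inputs x, columns are
    outputs y, and Q[x, y] = Q(y|x)."""
    eps = 0.0
    for y in range(Q.shape[1]):
        col = Q[:, y]
        if col.max() == 0:
            continue  # output y is never produced: no constraint
        if col.min() == 0:
            # some input can never produce y while another can: eps is unbounded
            return float("inf")
        eps = max(eps, float(np.log(col.max() / col.min())))
    return eps
```

For instance, a binary mechanism that outputs the truth with probability 3/4 has budget ln 3, while Mangat-style mechanisms (with a zero entry in some column) have an unbounded budget, matching the remark that Mangat’s RR does not provide LDP.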
2.3 Utility Measures
In this paper, we use the ℓ1 loss (i.e., absolute error) and the ℓ2 loss (i.e., squared error) as utility measures. Let l1 (resp. l2) be the ℓ1 (resp. ℓ2) loss function, which maps the estimate p̂ and the true distribution p to the loss; i.e., l1(p̂, p) = Σ_{x∈X} |p̂(x) − p(x)| and l2(p̂, p) = Σ_{x∈X} (p̂(x) − p(x))². It should be noted that X is generated from p and Y is generated from X using Q. Since p̂ is computed from Y, both the ℓ1 and ℓ2 losses depend on Q.

In our theoretical analysis in Sections 4 and 5, we take the expectation of the ℓ1 loss over all possible realizations of Y. In our experiments in Section 6, we replace the expectation of the ℓ1 loss with the sample mean over multiple realizations of Y and divide it by 2 to evaluate the TV (Total Variation). In Appendix E, we also show that the ℓ2 loss provides similar results to the ones in Sections 4 and 6 by evaluating the expectation of the ℓ2 loss and the MSE (Mean Squared Error), respectively.
2.4 Obfuscation Mechanisms
We describe the RR (Randomized Response) [29, 30] and a generalized version of the RAPPOR [50] as follows.

Randomized response. The RR for k-ary alphabets was studied in [29, 30]. Its output range is identical to the input domain; i.e., Y = X. Formally, given ε ≥ 0, the ε-RR is an obfuscation mechanism that maps x to y with the probability:

Q(y|x) = e^ε / (|X| − 1 + e^ε) if y = x, and Q(y|x) = 1 / (|X| − 1 + e^ε) otherwise.  (2)
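The ε-RR keeps the true value with one probability and otherwise reports a uniformly random other value. A minimal sketch (the function name and list-based domain are our choices):

```python
import math
import random

def k_rr(x, domain, eps: float):
    """eps-RR over a finite domain: keep x with probability
    e^eps / (k - 1 + e^eps), where k = |domain|; otherwise report
    a uniformly random value other than x."""
    k = len(domain)
    keep = math.exp(eps) / (k - 1 + math.exp(eps))
    if random.random() < keep:
        return x
    others = [v for v in domain if v != x]
    return random.choice(others)
```

As ε grows, the keep probability approaches 1 (low privacy); as ε approaches 0, every output becomes almost equally likely (high privacy).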
Generalized RAPPOR. The RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) [23] is an obfuscation mechanism implemented in the Google Chrome browser. Wang et al. [50] extended its simplest configuration, called the basic one-time RAPPOR, by generalizing the two probabilities used in the perturbation. Here we call it the generalized RAPPOR and describe its algorithm in detail.

The generalized RAPPOR is an obfuscation mechanism with the input alphabet X = {x_1, …, x_|X|} and the output alphabet Y = {0, 1}^|X|. It first deterministically maps x_i ∈ X to e_i ∈ {0, 1}^|X|, where e_i is the i-th standard basis vector. It then probabilistically flips each bit of e_i to obtain the obfuscated data y = (y_1, …, y_|X|) ∈ {0, 1}^|X|, where y_j is the j-th element of y. Wang et al. [50] compute y from two parameters: p (representing the probability of keeping a 1 unchanged) and q (representing the probability of flipping a 0 into a 1). Specifically, given p and q, the (p, q)-generalized RAPPOR maps x_i to y by flipping each bit independently with the probability:

Pr(y_j = 1 | x_i) = p if j = i, and Pr(y_j = 1 | x_i) = q if j ≠ i (so Pr(y_j = 0 | x_i) is 1 − p and 1 − q in the respective cases). The basic one-time RAPPOR [23] is a special case of the generalized RAPPOR where p + q = 1. The generalized RAPPOR provides ε-LDP, where ε is determined by p and q.
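A sketch of the generalized RAPPOR as described above (one-hot encoding followed by independent bit flips), assuming p is the probability of keeping a 1 and q the probability of flipping a 0 into a 1; the function name is ours:

```python
import random

def generalized_rappor(i: int, k: int, p: float, q: float) -> list[int]:
    """One-hot encode input index i over k values, then flip each bit:
    a 1 stays 1 with probability p; a 0 becomes 1 with probability q.
    The basic one-time RAPPOR corresponds to p + q = 1."""
    bits = [1 if j == i else 0 for j in range(k)]
    out = []
    for b in bits:
        keep_one = p if b == 1 else q  # probability this output bit is 1
        out.append(1 if random.random() < keep_one else 0)
    return out
```

With p = 1 and q = 0 the encoding is reported exactly; smaller p and larger q add more noise per bit.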
2.5 Distribution Estimation Methods
Here we explain the empirical estimation method [2, 27, 29] and the EM reconstruction method [1, 2]. Both of them assume that the data collector knows the obfuscation mechanism Q used to generate Y from X.

Empirical estimation method. The empirical estimation method [2, 27, 29] computes an empirical estimate p̂ of p using an empirical distribution m̂ of the obfuscated data Y. Note that p̂, m̂, and Q can be represented as a |X|-dimensional vector, a |Y|-dimensional vector, and a |X| × |Y| matrix, respectively. They satisfy the following equation:

m̂ = p̂ Q.  (3)

The empirical estimation method computes p̂ by solving (3).

Let m be the true distribution of the obfuscated data; i.e., m = p Q. As the number of users n increases, the empirical distribution m̂ converges to m. Therefore, the empirical estimate p̂ also converges to p. However, when the number of users is small, many elements in p̂ can be negative. To address this issue, the studies in [23, 50] kept only the estimates above a significance threshold determined via Bonferroni correction, and discarded the remaining estimates.
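A sketch of the empirical estimation method: form the empirical distribution of the reports and solve the linear system (3) in the least-squares sense (the function name and integer-coded reports are our choices):

```python
import numpy as np

def empirical_estimate(reports, Q: np.ndarray) -> np.ndarray:
    """Empirical estimate of the input distribution.

    Q[x, y] = probability that input x is reported as y.
    Solves m_hat = p_hat @ Q for p_hat, where m_hat is the empirical
    distribution of the integer-coded reports."""
    n_in, n_out = Q.shape
    m_hat = np.bincount(reports, minlength=n_out) / len(reports)
    # m_hat = p_hat @ Q  <=>  Q.T @ p_hat = m_hat
    p_hat, *_ = np.linalg.lstsq(Q.T, m_hat, rcond=None)
    return p_hat  # entries may be negative for small samples
```

With the identity mechanism (no obfuscation) the estimate is simply the empirical distribution of the reports; for noisy Q, small samples can yield negative entries, which motivates the significance-threshold and EM variants.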
EM reconstruction method.
The EM (Expectation-Maximization) reconstruction method
[1, 2] (also called the iterative Bayesian technique [2]) regards X as a hidden variable and estimates p from Y using the EM algorithm [26] (for details of the algorithm, see [1, 2]). Let p̂ be the estimate of p obtained by the EM reconstruction method. The feature of this algorithm is that p̂ is equal to the maximum likelihood estimate in the probability simplex C (see [1] for the proof). Since this property holds irrespective of the number of users n, the elements in p̂ are always non-negative.

In this paper, our theoretical analysis uses the empirical estimation method for simplicity, while our experiments use the empirical estimation method, the one with the significance threshold, and the EM reconstruction method.
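A sketch of the EM reconstruction, following the standard iterative Bayesian update (posterior of the hidden input given each report, then averaging); function name and fixed iteration count are our choices:

```python
import numpy as np

def em_reconstruct(reports, Q: np.ndarray, iters: int = 200) -> np.ndarray:
    """EM / iterative Bayesian reconstruction of the input distribution.

    E-step: posterior p(x|y) of the hidden input under the current
    estimate; M-step: average the posteriors weighted by report counts.
    The iterate stays in the probability simplex, so it is never negative."""
    n_in, n_out = Q.shape
    counts = np.bincount(reports, minlength=n_out)
    obs = counts > 0
    Qo = Q[:, obs]                        # columns for observed outputs only
    co = counts[obs]
    p = np.full(n_in, 1.0 / n_in)         # uniform initialization
    for _ in range(iters):
        joint = p[:, None] * Qo           # p(x) * Q(y|x)
        post = joint / joint.sum(axis=0)  # E-step: posterior per column
        p = post @ co / co.sum()          # M-step: weighted average
    return p
```

Unlike the raw empirical estimate, this iterate is a proper distribution at every step, which matches the non-negativity property noted above.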
3 Utility-Optimized LDP (ULDP)
In this section, we focus on the common-mechanism scenario (outlined in Section 2.1) and introduce ULDP (Utility-optimized Local Differential Privacy), which provides a privacy guarantee equivalent to LDP only for sensitive data. Section 3.1 provides the definition of ULDP. Section 3.2 shows some theoretical properties of ULDP.
3.1 Definition
Figure 1 shows an overview of ULDP. An obfuscation mechanism providing ULDP, which we call the utility-optimized mechanism, divides the obfuscated data into protected data and invertible data. Let Y_P ⊆ Y be a set of protected data, and Y_I = Y \ Y_P be a set of invertible data.

The feature of the utility-optimized mechanism is that it maps sensitive data X_S only to protected data Y_P. In other words, it restricts the output set, given any input in X_S, to Y_P. It then provides ε-LDP on the protected data; i.e., Q(y|x) ≤ e^ε Q(y|x') for any x, x' ∈ X and any y ∈ Y_P. By this property, a privacy guarantee equivalent to ε-LDP is provided for any sensitive data x ∈ X_S, since the output set corresponding to X_S is restricted to Y_P. In addition, every output in Y_I reveals the corresponding input in X_N (as in Mangat’s randomized response [37]) to optimize the estimation accuracy.
We now formally define ULDP and the utilityoptimized mechanism:
Definition 2 ((X_S, Y_P, ε)-ULDP).
Given X_S ⊆ X, Y_P ⊆ Y, and ε ≥ 0, an obfuscation mechanism Q from X to Y provides (X_S, Y_P, ε)-ULDP if it satisfies the following properties:

1. For any y ∈ Y_I, there exists an x ∈ X_N such that

Q(y|x) > 0 and Q(y|x') = 0 for any x' ≠ x.  (4)

2. For any x, x' ∈ X and any y ∈ Y_P,

Q(y|x) ≤ e^ε Q(y|x').  (5)

We refer to an obfuscation mechanism providing (X_S, Y_P, ε)-ULDP as the utility-optimized mechanism.
Example. For an intuitive understanding of Definition 2, we show that Mangat’s randomized response [37] provides ULDP. As described in Section 1, this mechanism considers binary alphabets (i.e., X = Y = {Yes, No}), and regards the value Yes as sensitive (i.e., X_S = {Yes} and Y_P = {Yes}). If the input value is Yes, it always reports Yes as output. Otherwise, it reports Yes with probability p and No with probability 1 − p. Obviously, this mechanism does not provide ε-LDP for any ε, since the output No is never produced from the input Yes. However, it provides ULDP: the output No uniquely reveals the input No, and the two inputs produce the output Yes with probabilities 1 and p.
ULDP provides a privacy guarantee equivalent to ε-LDP for any sensitive data x ∈ X_S, as explained above. On the other hand, no privacy guarantees are provided for nonsensitive data x ∈ X_N, because every output in Y_I reveals the corresponding input in X_N. However, this does not matter, since nonsensitive data need not be protected. Protecting only the minimum necessary data is the key to achieving locally private distribution estimation with high data utility.
We can apply any LDP mechanism to the sensitive data in X_S to provide ULDP as a whole. In Sections 4.1 and 4.2, we propose the utility-optimized RR (Randomized Response) and the utility-optimized RAPPOR, which apply the RR and the RAPPOR, respectively, to the sensitive data X_S.
It might be better to generalize ULDP so that different privacy budgets can be assigned to different sensitive data. We leave introducing such granularity as future work.
Remark. It should also be noted that the data collector needs to know Q to estimate p from Y (as described in Section 2.5), and that the utility-optimized mechanism itself includes the information on what is sensitive for users (i.e., the data collector learns whether each x ∈ X belongs to X_S or not by checking the values of Q(y|x) for all y ∈ Y). This does not matter in the common-mechanism scenario, since the set X_S of sensitive data is common to all users (e.g., public hospitals). However, in the personalized-mechanism scenario, the utility-optimized mechanism Q^(i), which uses the expanded set of sensitive data X_S ∪ X_S^(i), includes the information on what is sensitive for the i-th user. Therefore, the data collector learns whether each x ∈ X belongs to X_S^(i) or not by checking the values of Q^(i)(y|x) for all y ∈ Y, despite the fact that the i-th user wants to hide her user-specific sensitive data (e.g., home, workplace). We address this issue in Section 5.
3.2 Basic Properties of ULDP
Previous work showed some basic properties of differential privacy (or its variant), such as compositionality [22] and immunity to postprocessing [22]. We briefly explain theoretical properties of ULDP including the ones above.
Sequential composition. ULDP is preserved under adaptive sequential composition when the composed obfuscation mechanism maps sensitive data to pairs of protected data. Specifically, consider two mechanisms Q_0 from X to Y_0 and Q_1 from X to Y_1 such that Q_0 (resp. Q_1) maps the sensitive data X_S to protected data in Y_P0 (resp. Y_P1). Then the sequential composition of Q_0 and Q_1 maps the sensitive data X_S to pairs of protected data ranging over Y_P0 × Y_P1.
Then we obtain the following compositionality.
Proposition 1 (Sequential composition).
Let ε_0, ε_1 ≥ 0. If Q_0 provides (X_S, Y_P0, ε_0)-ULDP and Q_1 provides (X_S, Y_P1, ε_1)-ULDP for each output of Q_0, then the sequential composition of Q_0 and Q_1 provides (X_S, Y_P0 × Y_P1, ε_0 + ε_1)-ULDP.
For example, if we apply an obfuscation mechanism providing (X_S, Y_P, ε)-ULDP t times, then we obtain (X_S, Y_P × ⋯ × Y_P, tε)-ULDP in total (this is derived by repeatedly using Proposition 1).
Post-processing. ULDP is immune to post-processing by a randomized algorithm that preserves data types: protected data or invertible data. Specifically, if a mechanism Q provides ULDP and a randomized algorithm f maps protected data (resp. invertible data) to protected data (resp. invertible data), then the composite function f ∘ Q provides ULDP.
Note that f needs to preserve data types for utility; i.e., to keep all outputs of invertible data invertible (as in Definition 2) after post-processing. The DP guarantee for the protected data is preserved by any post-processing algorithm. See Appendix B.2 for details.
Compatibility with LDP. Assume that data collector A adopts a mechanism providing ULDP and data collector B adopts a mechanism providing ε-LDP. In this case, all protected data held by data collector A can be combined with all obfuscated data held by data collector B (i.e., data integration) to perform data analysis under LDP. See Appendix B.3 for details.
Lower bounds on the ℓ1 and ℓ2 losses. We present lower bounds on the ℓ1 and ℓ2 losses of any ULDP mechanism by using the fact that ULDP provides (5) for any x, x' ∈ X and any y ∈ Y_P. Specifically, Duchi et al. [20] showed that for small ε, the lower bounds on the ℓ1 and ℓ2 losses (minimax rates) of any ε-LDP mechanism over X can be expressed as Θ(|X|/(ε√n)) and Θ(|X|/(nε²)), respectively. By directly applying these bounds to the sensitive data X_S and the protected data Y_P, the lower bounds on the ℓ1 and ℓ2 losses of any ULDP mechanism over the sensitive data can be expressed as Θ(|X_S|/(ε√n)) and Θ(|X_S|/(nε²)), respectively. In Section 4.3, we show that our utility-optimized RAPPOR achieves these lower bounds when ε is close to 0 (i.e., high privacy regime).
4 Utility-Optimized Mechanisms
In this section, we focus on the common-mechanism scenario and propose the utility-optimized RR (Randomized Response) and the utility-optimized RAPPOR (Sections 4.1 and 4.2). We then analyze the data utility of these mechanisms (Section 4.3).
4.1 Utility-Optimized Randomized Response
We propose the utility-optimized RR, which is a generalization of Mangat’s randomized response [37] to k-ary alphabets with multiple sensitive symbols. As with the RR, the output range of the utility-optimized RR is identical to the input domain; i.e., Y = X. In addition, we divide the output set in the same way as the input set; i.e., Y_P = X_S and Y_I = X_N.

Figure 2 shows an example of the utility-optimized RR. The utility-optimized RR applies the ε-RR to the sensitive data X_S. It maps each x ∈ X_N to every y ∈ Y_P with the minimum probability for which (5) is satisfied, and maps x to itself with the remaining probability. Formally, we define the utility-optimized RR (uRR) as follows:

Definition 3 (utility-optimized RR).
Let X_S ⊆ X and ε ≥ 0. Let Y = X, Y_P = X_S, and Y_I = X_N. Let c_1 = e^ε / (|X_S| − 1 + e^ε) and c_2 = 1 / (|X_S| − 1 + e^ε). Then the (X_S, ε)-utility-optimized RR (uRR) is an obfuscation mechanism that maps x to y with the probability Q(y|x) defined as follows:

Q(y|x) = c_1 if x ∈ X_S and y = x; c_2 if x ∈ X_S and y ∈ X_S \ {x}; c_2 if x ∈ X_N and y ∈ X_S; 1 − |X_S| c_2 if x ∈ X_N and y = x; and 0 otherwise.  (6)
Proposition 2.
The (X_S, ε)-uRR provides (X_S, X_S, ε)-ULDP.
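A sketch of the uRR under the natural reading of the construction above: sensitive inputs go through the ε-RR over X_S, and each nonsensitive input escapes to the sensitive set with the minimum probability allowed by the ε-LDP constraint, otherwise reporting itself. The probabilities are our reconstruction, and the function name is ours:

```python
import math
import random

def urr(x, sensitive: list, nonsensitive: list, eps: float):
    """Utility-optimized RR sketch.

    Sensitive x: eps-RR over the sensitive set (output stays protected).
    Nonsensitive x: jump to a uniformly random sensitive value with the
    minimum total probability compatible with the LDP constraint on
    protected outputs; otherwise report x itself (invertible output)."""
    assert x in sensitive or x in nonsensitive
    s = len(sensitive)
    c1 = math.exp(eps) / (s - 1 + math.exp(eps))  # keep prob within X_S
    c2 = 1.0 / (s - 1 + math.exp(eps))            # per-target escape prob
    u = random.random()
    if x in sensitive:
        if u < c1:
            return x
        return random.choice([v for v in sensitive if v != x])
    if u < s * c2:  # total escape mass to the protected set
        return random.choice(sensitive)
    return x
```

Note that a nonsensitive input is reported as itself with probability 1 − |X_S| c_2 = (e^ε − 1)/(|X_S| − 1 + e^ε), which grows toward 1 as ε increases; this is the source of the utility gain over the plain RR.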
4.2 Utility-Optimized RAPPOR
Next, we propose the utility-optimized RAPPOR with the input alphabet X = {x_1, …, x_|X|} and the output alphabet Y = {0, 1}^|X|. Without loss of generality, we assume that x_1, …, x_{|X_S|} are sensitive and x_{|X_S|+1}, …, x_|X| are nonsensitive; i.e., X_S = {x_1, …, x_{|X_S|}} and X_N = {x_{|X_S|+1}, …, x_|X|}.

Figure 3 shows an example of the utility-optimized RAPPOR. The utility-optimized RAPPOR first deterministically maps x_i ∈ X to the i-th standard basis vector e_i. It should be noted that if x_i is sensitive data (i.e., x_i ∈ X_S), then the last |X| − |X_S| elements in e_i are always zero (as shown in the upper-left panel of Figure 3). Based on this fact, the utility-optimized RAPPOR regards obfuscated data y whose last |X| − |X_S| elements are all zero as protected data; i.e.,

Y_P = {y ∈ {0, 1}^|X| : y_{|X_S|+1} = ⋯ = y_|X| = 0}.  (7)

It then applies the generalized RAPPOR to the first |X_S| elements of e_i, and flips the remaining elements with probabilities for which (5) is satisfied (as shown in the lower-left panel of Figure 3). We formally define the utility-optimized RAPPOR (uRAP):
Definition 4 (utility-optimized RAPPOR).
Let X_S ⊆ X, Y = {0, 1}^|X|, and ε ≥ 0. Then the (X_S, ε)-utility-optimized RAPPOR (uRAP) is an obfuscation mechanism that maps x_i to y = (y_1, …, y_|X|) with the probability given by the product over the independently flipped bits:

(8)

where Pr(y_j = 1 | x_i) is written as follows:

if 1 ≤ j ≤ |X_S| (sensitive positions), Pr(y_j = 1 | x_i) follows the generalized RAPPOR:

(9)

if |X_S| < j ≤ |X| (nonsensitive positions), a 0 is never flipped into a 1:

(10)
Proposition 3.
The uRAP provides ULDP, where the set of protected data Y_P is given by (7).
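A sketch of the uRAP consistent with the construction above, under assumptions of ours: sensitive bit positions are flipped RAPPOR-style with a per-bit budget of ε/2, and a nonsensitive position keeps its 1 with probability 1 − e^{−ε/2} while a nonsensitive 0 is never flipped up (so any 1 in a nonsensitive position uniquely identifies the input, making the output invertible). The parameterization and function name are hypothetical, not a verbatim transcription of Definition 4:

```python
import math
import random

def urap(i: int, k: int, s: int, eps: float) -> list[int]:
    """Utility-optimized RAPPOR sketch (hypothetical parameterization).

    Inputs x_1..x_s are sensitive, x_{s+1}..x_k are nonsensitive.
    Sensitive positions get symmetric RAPPOR flipping with per-bit
    budget eps/2; a nonsensitive 1 survives with prob 1 - e^(-eps/2),
    and a nonsensitive 0 always stays 0."""
    p = math.exp(eps / 2) / (math.exp(eps / 2) + 1)  # keep a sensitive 1
    q = 1 / (math.exp(eps / 2) + 1)                  # flip a sensitive 0
    y = []
    for j in range(1, k + 1):
        bit = 1 if j == i else 0
        if j <= s:  # sensitive position: RAPPOR-style flip
            keep_one = p if bit == 1 else q
            y.append(1 if random.random() < keep_one else 0)
        elif bit == 1:  # nonsensitive position holding the 1
            y.append(1 if random.random() < 1 - math.exp(-eps / 2) else 0)
        else:  # nonsensitive 0 stays 0
            y.append(0)
    return y
```

Under these assumptions, every output whose nonsensitive positions are all zero lands in the protected set (7), while an output with a 1 in a nonsensitive position can only come from that one input.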
Although we used the generalized RAPPOR with free parameters in Definition 4, hereinafter we set its parameters in the same way as the original RAPPOR [23] (i.e., p + q = 1). There are two reasons for this. First, this setting achieves “order” optimal data utility among all ULDP mechanisms in the high privacy regime, as shown in Section 4.3. Second, it maps each nonsensitive x_i to y_i = 1 with probability 1 − e^{−ε/2}, which is close to 1 when ε is large (i.e., low privacy regime). Wang et al. [50] found the parameters of the generalized RAPPOR that minimize the variance of the estimate. However, the uRAP with those parameters maps x_i to y_i = 1 with a probability that is smaller for any ε and remains bounded away from 1 even when ε goes to infinity. Thus, our uRAP maps x_i to y_i = 1 with higher probability, and therefore achieves a smaller estimation error over all nonsensitive data. We also consider that the optimal parameters for our uRAP are different from the optimal ones for the generalized RAPPOR. We leave finding the optimal parameters for our uRAP (with respect to the estimation error over all personal data) as future work.
We refer to the uRAP with this parameter setting simply as the uRAP in the following.
4.3 Utility Analysis
We evaluate the ℓ1 loss of the uRR and the uRAP when the empirical estimation method is used for distribution estimation. (We note that we use the empirical estimation method in the same way as [29], and that other mechanisms might have better utility with a different estimation method. However, we emphasize that even with the empirical estimation method, the uRAP achieves the lower bounds on the ℓ1 and ℓ2 losses of any ULDP mechanism in the high privacy regime, and the uRR and uRAP achieve almost the same utility as a non-private mechanism in the low privacy regime when most of the data are nonsensitive.) In particular, we evaluate the ℓ1 loss when ε is close to 0 (i.e., high privacy regime) and when ε = ln|X_S| (i.e., low privacy regime). Note that ULDP provides a natural interpretation of the latter value of ε. Specifically, it follows from (5) that if ε = ln|X_S|, then for any y ∈ Y_P, the likelihood that the input data is a given x ∈ X_S is almost equal to the sum of the likelihoods that the input data is some other value in X_S. This is consistent with the fact that the RR with ε = ln|X| sends true data (i.e., y = x in (2)) with probability about 1/2 and false data (i.e., y ≠ x) with probability about 1/2, and hence provides plausible deniability [29].
uRR in the general case. We begin with the uRR:
Proposition 4 (ℓ1 loss of the uRR).
The expected ℓ1 loss of the uRR mechanism is given by:
(11)
Let u_S (resp. u_N) be the uniform distribution over X_S (resp. X_N); i.e., u_S(x) = 1/|X_S| for any x ∈ X_S and u_S(x) = 0 for any x ∈ X_N, and symmetrically for u_N. In the high privacy regime, the ℓ1 loss is maximized as follows:
Proposition 5.
(12)
For larger ε, the ℓ1 loss is maximized by a mixture distribution of u_S and u_N:
Proposition 6.
Let p* be the mixture distribution over X defined by:
(13)
Then (11) is maximized by p*:
(14)
Next, we instantiate the ℓ1 loss in the high and low privacy regimes based on these propositions.
uRR in the high privacy regime. When ε is close to 0, we have e^ε ≈ 1 + ε. Thus, the right-hand side of (12) in Proposition 5 can be simplified as follows:
(15)
It was shown in [29] that when ε is close to 0, the expected ℓ1 loss of the RR has an upper bound that grows with the size of the whole domain X. The right-hand side of (15), which depends on the number of sensitive values, is much smaller when |X_S| ≪ |X|. Although both of these are “upper bounds” of the expected ℓ1 losses, we show in Section 6 that the total variation of the uRR is also much smaller than that of the RR when |X_S| ≪ |X|.
uRR in the low privacy regime. When ε = ln|X_S| and |X_S| ≪ |X|, the right-hand side of (14) in Proposition 6 can be simplified by substituting e^ε = |X_S|.
It should be noted that the expected ℓ1 loss of the non-private mechanism, which does not obfuscate the personal data at all, has a known upper bound [29]. Thus, when ε = ln|X_S| and |X_S| ≪ |X|, the uRR achieves almost the same data utility as the non-private mechanism, whereas the expected ℓ1 loss of the RR is twice as large as that of the non-private mechanism [29].
uRAP in the general case. We then analyze the uRAP:
Proposition 7 (ℓ1 loss of the uRAP).
The expected ℓ1 loss of the uRAP mechanism is given by:
(16)
For ε in a suitable range, the ℓ1 loss is maximized by the uniform distribution over X:
Proposition 8.
Note that this proposition covers a wide range of ε; in particular, it covers both the high privacy regime (ε close to 0) and the low privacy regime (ε = ln|X_S|). Below we instantiate the ℓ1 loss in the high and low privacy regimes based on this proposition.
uRAP in the high privacy regime. If ε is close to 0, we have