DP (Differential Privacy) [21, 22] is becoming a gold standard for data privacy; it enables big data analysis while protecting users' privacy against adversaries with arbitrary background knowledge. Based on the underlying architecture, DP can be categorized into the centralized model and the local model. In the centralized model, a "trusted" database administrator, who can access all users' personal data, obfuscates the data (e.g., by adding noise or by generalization) before providing them to a (possibly malicious) data analyst. Although DP was initially studied mainly in the centralized model, the original personal data in this model can be leaked from the database through illegal access or internal fraud. This issue has become critical in recent years, as the number of data breach incidents is increasing.
The local model does not require a "trusted" administrator, and therefore does not suffer from the data leakage issue explained above. In this model, each user obfuscates her personal data by herself and sends the obfuscated data to a data collector (or data analyst). Based on the obfuscated data, the data collector can estimate some statistics (e.g., histogram, heavy hitters) of the personal data. DP in the local model, called LDP (Local Differential Privacy), has recently attracted much attention in the academic field [5, 12, 24, 29, 30, 39, 42, 44, 45, 49, 56], and has also been adopted by industry [16, 23, 48].
However, LDP mechanisms regard all personal data as equally sensitive, and thus leave a lot of room for increasing data utility. For example, consider questionnaires such as: "Have you ever cheated in an exam?" and "Were you with a prostitute in the last month?". Obviously, "Yes" is a sensitive response to these questionnaires, whereas "No" is not. The RR (Randomized Response) method proposed by Mangat utilizes this fact. Specifically, it reports "Yes" or "No" as follows: if the true answer is "Yes", always report "Yes"; otherwise, report "Yes" and "No" with probability p and 1 − p, respectively. Since the reported answer "Yes" may come from both the true answers "Yes" and "No", the confidentiality of a user reporting "Yes" is not violated. Moreover, since the reported answer "No" always comes from the true answer "No", the data collector can estimate the distribution of true answers with higher accuracy than with Warner's RR, which simply flips "Yes" and "No" with a fixed probability. However, Mangat's RR does not provide LDP, since LDP regards both "Yes" and "No" as equally sensitive.
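As an illustration, Mangat's scheme and its unbiased frequency estimator can be sketched in a few lines of Python (the parameter name p and the function names are ours, not from the original paper):

```python
import random

def mangat_rr(true_answer, p, rng=random):
    """Mangat's randomized response: a truthful "Yes" is always reported;
    a truthful "No" is reported as "Yes" with probability p."""
    if true_answer == "Yes":
        return "Yes"
    return "Yes" if rng.random() < p else "No"

def estimate_yes_fraction(reports, p):
    """Unbiased estimate of the true fraction pi of "Yes" answers.
    Pr[report Yes] = pi + (1 - pi) * p, so pi = (f - p) / (1 - p),
    where f is the observed fraction of "Yes" reports."""
    f = sum(r == "Yes" for r in reports) / len(reports)
    return (f - p) / (1 - p)
```

Because a reported "No" is never randomized, the estimator inverts only one scalar equation, which is why the estimate is more accurate than inverting Warner's symmetric flipping.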
Many other types of data also contain "non-sensitive" values. For example, locations such as hospitals and home can be sensitive, whereas visited sightseeing places, restaurants, and coffee shops are non-sensitive for many users. Divorced people may want to keep their divorce secret, while others may not care about their marital status. The distinction between sensitive and non-sensitive data can also differ from user to user (e.g., the home address differs from user to user; some people might want to keep even sightseeing places secret). To elaborate on this issue, we briefly review related work on LDP and variants of DP.
Related work. Since Dwork introduced DP, a number of its variants have been studied to provide different types of privacy guarantees: Pufferfish privacy, dependent DP, Bayesian DP, mutual-information DP, Rényi DP, and distribution privacy. In particular, LDP has been widely studied in the literature.
Erlingsson et al. proposed RAPPOR as an obfuscation mechanism providing LDP, and implemented it in the Google Chrome browser. Kairouz et al. showed that under the l1 and l2 losses, the randomized response (generalized to multiple alphabets) and RAPPOR are order optimal among all LDP mechanisms in the low and high privacy regimes, respectively.
Wang et al. generalized RAPPOR and a random projection-based method, and found parameters that minimize the variance of the estimate.
Some studies also attempted to address the non-uniformity of privacy requirements among records (rows) or among items (columns) in centralized DP: Personalized DP, Heterogeneous DP, and One-sided DP. However, obfuscation mechanisms that address the non-uniformity among input values in the "local" model have not been studied, to our knowledge. In this paper, we show that data utility can be significantly increased by designing such local mechanisms.
Our contributions. The goal of this paper is to design obfuscation mechanisms in the local model that achieve high data utility while providing DP for sensitive data. To achieve this, we introduce the notion of ULDP (Utility-optimized LDP), which provides a privacy guarantee equivalent to LDP only for sensitive data, and design obfuscation mechanisms providing ULDP. As a task for the data collector, we consider discrete distribution estimation [2, 24, 27, 29, 39, 45, 23, 56], where personal data take discrete values. Our contributions are as follows:
We first consider the setting in which all users use the same obfuscation mechanism, and propose two ULDP mechanisms: the utility-optimized RR and the utility-optimized RAPPOR. We prove that when there are a lot of non-sensitive data, our mechanisms provide much higher utility than two state-of-the-art LDP mechanisms: the RR (for multiple alphabets) [29, 30] and RAPPOR. We also prove that when most of the data are non-sensitive, our mechanisms even provide almost the same utility as a non-private mechanism, which does not obfuscate the personal data, in the low privacy regime where the privacy budget is ε = ln|X| for a set X of personal data.
We then consider the setting in which the distinction between sensitive and non-sensitive data can be different from user to user, and propose a PUM (Personalized ULDP Mechanism) with semantic tags. The PUM keeps secret what is sensitive for each user, while enabling the data collector to estimate a distribution using some background knowledge about the distribution conditioned on each tag (e.g., geographic distributions of homes). We also theoretically analyze the data utility of the PUM.
We finally show that our mechanisms are very promising in terms of utility using two large-scale datasets.
The proofs of all statements in the paper are given in the appendices.
Cautions and limitations. Although ULDP is meant to protect sensitive data, there are some cautions and limitations.
First, we assume that each user sends a single datum and that each user's personal data is independent (see Section 2.1). This is reasonable for a variety of personal data (e.g., locations, age, sex, marital status), where each user's data is irrelevant to most others'. However, for some types of personal data (e.g., flu status), each user can be highly influenced by others. There might also be a correlation between sensitive data and non-sensitive data when a user sends multiple data (on a related note, non-sensitive attributes may lead to re-identification of a record). A possible solution to these problems would be to combine ULDP with Pufferfish privacy [32, 47], which is used to protect correlated data. We leave this as future work (see Section 7 for discussions on the case of multiple data per user and the correlation issue).
We focus on a scenario in which it is easy for users to decide what is sensitive (e.g., cheating experience, location of home). However, there is also a scenario in which users do not know what is sensitive. For the latter scenario, we cannot use ULDP but can simply apply LDP.
Apart from the sensitive/non-sensitive data issue, there are scenarios that ULDP does not cover. For example, ULDP does not protect users who are sensitive about "information disclosure" itself (i.e., those who will not disclose any information). We assume that users have consented to information disclosure. To collect as much data as possible, we can provide an incentive for information disclosure; e.g., a reward or point-of-interest (POI) information near a reported location. We also assume that the data collector obtains consent from users before providing reported data to third parties. Note that these cautions are common to LDP.
There might also be a risk of discrimination; e.g., the data collector might discriminate against all users that provide a yes-answer, and have no qualms about a small number of false positives. False positives decrease as ε increases. We note that LDP also suffers from this attack; the false positive probability is the same for ULDP and LDP with the same ε.
In summary, ULDP provides a privacy guarantee equivalent to LDP for sensitive data under the assumption of data independence. We consider our work a building block of broader DP approaches and a basis for further development.
2 Preliminaries
2.1 Notations
Let ℝ≥0 be the set of non-negative real numbers. Let n be the number of users, and X (resp. Y) be a finite set of personal (resp. obfuscated) data. We assume that continuous data are discretized into bins in advance (e.g., a location map is divided into some regions). We use the superscript "(i)" to represent the i-th user. Let X^(i) (resp. Y^(i)) be a random variable representing the personal (resp. obfuscated) data of the i-th user. The i-th user obfuscates her personal data X^(i) via her obfuscation mechanism Q^(i), which maps x to y with probability Q^(i)(y|x), and sends the obfuscated data Y^(i) to a data collector. Here we assume that each user sends a single datum. We discuss the case of multiple data in Section 7.
We divide personal data into two types: sensitive data and non-sensitive data. Let X_S ⊆ X be a set of sensitive data common to all users, and X_N = X \ X_S be the remaining personal data. Examples of such "common" sensitive data are the regions including public sensitive locations (e.g., hospitals) and the obviously sensitive responses to the questionnaires described in Section 1. (Note that these data might be sensitive for many or most users but not for all in practice; e.g., some people might not care about their cheating experience. However, we can regard these data as sensitive for all users, i.e., be on the safe side, by allowing a small loss of data utility.)
Furthermore, let X_S^(i) ⊆ X_N be a set of sensitive data specific to the i-th user (we do not include X_S in X_S^(i) because X_S is protected for all users in our mechanisms). X_S^(i) is a set of personal data that is possibly non-sensitive for many users but sensitive for the i-th user. Examples of such "user-specific" sensitive data are the regions including private locations such as home and workplace. (Note that the majority of the working population can be uniquely identified from their home/workplace location pairs.)
In Sections 3 and 4, we consider the case where all users divide X into the same set X_S of sensitive data and the same set X_N of non-sensitive data, and use the same obfuscation mechanism Q (i.e., Q^(i) = Q). In Section 5, we consider a general setting that can deal with the user-specific sensitive data X_S^(i) and user-specific mechanisms Q^(i). We call the former case a common-mechanism scenario and the latter a personalized-mechanism scenario.
We assume that each user's personal data X^(i) is independently and identically distributed (i.i.d.) according to a probability distribution p, which generates x ∈ X with probability p(x). The data collector estimates p from the obfuscated data by a method described in Section 2.5. We denote the estimate of p by p̂. We further denote the probability simplex by C; i.e., C = {p : p(x) ≥ 0 for all x ∈ X, and Σ_{x∈X} p(x) = 1}.
2.2 Privacy Measures
LDP (Local Differential Privacy) is defined as follows:

Definition 1 (ε-LDP). Let ε ≥ 0. An obfuscation mechanism Q from X to Y provides ε-LDP if for any x, x' ∈ X and any y ∈ Y,

  Q(y|x) ≤ e^ε Q(y|x').  (1)

LDP guarantees that an adversary who has observed y ∈ Y cannot determine, for any pair of x and x', whether y comes from x or x' with a certain degree of confidence. As the privacy budget ε approaches 0, all of the data in X become almost equally likely. Thus, a user's privacy is strongly protected when ε is small.
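As a sanity check, the ε-LDP condition can be verified mechanically for any finite mechanism represented as a matrix Q[y][x] = Pr[output y | input x]. The following Python sketch is our own illustration (not from the paper); it also encodes Warner's RR for reference:

```python
import math

def satisfies_ldp(Q, eps, tol=1e-12):
    """Check eps-LDP: Q[y][x] <= e^eps * Q[y][x'] for all outputs y
    and all pairs of inputs x, x'.  Q is a list of rows indexed by y."""
    bound = math.exp(eps)
    for row in Q:
        for px in row:
            for px2 in row:
                if px > bound * px2 + tol:
                    return False
    return True

def warner_rr_matrix(eps):
    """Warner's RR: keep the binary answer with probability
    e^eps / (1 + e^eps); this satisfies eps-LDP."""
    keep = math.exp(eps) / (1 + math.exp(eps))
    return [[keep, 1 - keep], [1 - keep, keep]]
```

Mangat's RR, written as the matrix [[1.0, 0.5], [0.0, 0.5]] (rows: Pr[Yes|x], Pr[No|x]), fails this check for every finite ε, because the "No" row contains a zero probability.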
2.3 Utility Measures
In this paper, we use the l1 loss (i.e., absolute error) and the l2 loss (i.e., squared error) as utility measures. Let l1 (resp. l2) be the l1 (resp. l2) loss function, which maps the estimate p̂ and the true distribution p to the loss; i.e., l1(p, p̂) = Σ_{x∈X} |p̂(x) − p(x)| and l2(p, p̂) = Σ_{x∈X} (p̂(x) − p(x))². It should be noted that the personal data are generated from p, and the obfuscated data are generated from the personal data using Q. Since p̂ is computed from the obfuscated data, both the l1 and l2 losses depend on Q.
In our theoretical analysis in Sections 4 and 5, we take the expectation of the l1 loss over all possible realizations of the obfuscated data. In our experiments in Section 6, we replace the expectation of the l1 loss with the sample mean over multiple realizations, and divide it by 2 to evaluate the TV (Total Variation). In Appendix E, we also show that the l2 loss provides similar results to the ones in Sections 4 and 6 by evaluating the expectation of the l2 loss and the MSE (Mean Squared Error), respectively.
2.4 Obfuscation Mechanisms
RR (Randomized Response). Formally, given ε ≥ 0, the (X, ε)-RR is an obfuscation mechanism that maps x ∈ X to y ∈ Y (= X) with the probability:

  Q(y|x) = e^ε / (|X| − 1 + e^ε) if y = x, and Q(y|x) = 1 / (|X| − 1 + e^ε) otherwise.  (2)

The (X, ε)-RR provides ε-LDP.
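A minimal Python sketch of this mechanism, assuming the standard RR probabilities e^ε/(|X| − 1 + e^ε) for the true value and 1/(|X| − 1 + e^ε) for each other value (function and variable names are ours):

```python
import math
import random

def k_rr(x, domain, eps, rng=random):
    """(X, eps)-RR: report the true value x with probability
    e^eps / (|X| - 1 + e^eps), and each other value in the domain
    with probability 1 / (|X| - 1 + e^eps)."""
    k = len(domain)
    keep = math.exp(eps) / (k - 1 + math.exp(eps))
    if rng.random() < keep:
        return x
    # with the remaining probability, pick one of the other values uniformly
    others = [v for v in domain if v != x]
    return rng.choice(others)
```

For example, with |X| = 4 and ε = ln 3, the true value is kept with probability 3/(3 + 3) = 1/2 and each of the three other values is reported with probability 1/6.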
Generalized RAPPOR. The RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) is an obfuscation mechanism implemented in the Google Chrome browser. Wang et al. extended its simplest configuration, called the basic one-time RAPPOR, by generalizing two probabilities in the perturbation. Here we call it the generalized RAPPOR and describe its algorithm in detail.
The generalized RAPPOR is an obfuscation mechanism with the input alphabet X = {x1, …, x_|X|} and the output alphabet Y = {0, 1}^|X|. It first deterministically maps x_i to the i-th standard basis vector e_i, whose i-th element is 1 and whose other elements are 0. It then probabilistically flips each bit of e_i to obtain obfuscated data y = (y1, …, y_|X|), where y_j is the j-th element of y. Wang et al. compute y from two parameters (representing the probability of keeping a 1 unchanged and the probability of flipping a 0 into a 1). In this paper, we compute y from two parameters θ ∈ (0, 1) and ε ≥ 0.
Specifically, given θ and ε, the (θ, ε)-generalized RAPPOR maps x_i to y with the probability:

  Q(y|x_i) = ∏_{j=1}^{|X|} Pr(y_j | x_i),

where Pr(y_j = 1 | x_i) = θ if j = i, and Pr(y_j = 1 | x_i) = θ / (θ + e^ε (1 − θ)) if j ≠ i. The basic one-time RAPPOR is a special case of the generalized RAPPOR where θ = e^{ε/2} / (e^{ε/2} + 1). The (θ, ε)-generalized RAPPOR provides ε-LDP.
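The bit-flipping procedure can be sketched as follows in Python; the flip probability q for 0-bits is chosen so that the worst-case likelihood ratio over the two differing bits equals e^ε, which recovers the basic one-time RAPPOR at θ = e^{ε/2}/(e^{ε/2} + 1). This is a sketch under our reconstruction of the parameterization, not the authors' code:

```python
import math
import random

def generalized_rappor(i, k, theta, eps, rng=random):
    """(theta, eps)-generalized RAPPOR over an alphabet of size k.
    Encodes input index i as the i-th standard basis vector, keeps a 1
    with probability theta, and flips a 0 to a 1 with probability
    q = theta / (theta + e^eps * (1 - theta))."""
    q = theta / (theta + math.exp(eps) * (1 - theta))
    y = []
    for j in range(k):
        p_one = theta if j == i else q
        y.append(1 if rng.random() < p_one else 0)
    return y
```

With this q, the product of the two per-bit likelihood ratios, (θ/q) · ((1 − q)/(1 − θ)), equals e^ε exactly, and setting θ = 1/2 yields q = 1/(e^ε + 1), i.e., Wang et al.'s variance-minimizing configuration.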
2.5 Distribution Estimation Methods
Here we explain the empirical estimation method [2, 27, 29] and the EM reconstruction method [1, 2]. Both of them assume that the data collector knows the obfuscation mechanism Q used to generate the obfuscated data from the personal data.
Empirical estimation method. The empirical estimation method [2, 27, 29] computes an empirical estimate p̂ of p using an empirical distribution m̂ of the obfuscated data. Note that p̂, m̂, and Q can be represented as a |X|-dimensional vector, a |Y|-dimensional vector, and a |Y| × |X| matrix, respectively. They have the following equation:

  m̂ = Q p̂.  (3)

The empirical estimation method computes p̂ by solving (3).
Let m be the true distribution of obfuscated data; i.e., m = Q p. As the number n of users increases, the empirical distribution m̂ converges to m. Therefore, the empirical estimate p̂ also converges to p. However, when the number of users is small, many elements in p̂ can be negative. To address this issue, the studies in [23, 50] kept only estimates above a significance threshold determined via Bonferroni correction, and discarded the remaining estimates.
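A minimal sketch of the empirical estimation method (our own illustration): represent Q as a |Y| × |X| matrix and solve (3) in the least-squares sense:

```python
import numpy as np

def empirical_estimate(Q, reports):
    """Empirical estimation: solve m_hat = Q p_hat in the least-squares
    sense, where Q[y, x] = Pr[y | x] and m_hat is the empirical
    distribution of the obfuscated reports (indices into range(|Y|))."""
    m_hat = np.bincount(reports, minlength=Q.shape[0]) / len(reports)
    p_hat, *_ = np.linalg.lstsq(Q, m_hat, rcond=None)
    return p_hat  # may contain negative elements when n is small
```

When Q is square and invertible (as for the RR), this reduces to p̂ = Q⁻¹ m̂, whose elements sum to 1 but need not be non-negative.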
EM reconstruction method.
The EM (Expectation-Maximization) reconstruction method [1, 2] (also called the iterative Bayesian technique) regards the personal data as a hidden variable and estimates p from the obfuscated data using the EM algorithm (for details of the algorithm, see [1, 2]). Let p̂ be the estimate of p obtained by the EM reconstruction method. The feature of this algorithm is that p̂ is equal to the maximum likelihood estimate in the probability simplex C (see [1, 2] for the proof). Since this property holds irrespective of the number n of users, the elements in p̂ are always non-negative.
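The EM update can be sketched as follows; this is a standard EM-for-mixtures implementation that we believe matches the method described (variable names are ours):

```python
import numpy as np

def em_reconstruct(Q, reports, n_iter=200):
    """EM reconstruction: treat the true data as hidden and iterate
    p(x) <- p(x) * (1/n) * sum_y c(y) * Q[y, x] / (Q p)(y),
    where c(y) counts the observed reports.  The iterates stay in the
    probability simplex and converge to the constrained MLE."""
    n_out, n_in = Q.shape
    counts = np.bincount(reports, minlength=n_out).astype(float)
    n = counts.sum()
    p = np.full(n_in, 1.0 / n_in)  # start from the uniform distribution
    for _ in range(n_iter):
        m = Q @ p  # current distribution of obfuscated data
        p = p * (Q.T @ (counts / np.maximum(m, 1e-300))) / n
    return p
```

Each iteration keeps p non-negative and summing to 1, which is exactly why this estimator never produces the negative elements that the raw empirical estimate can.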
In this paper, our theoretical analysis uses the empirical estimation method for simplicity, while our experiments use the empirical estimation method, the one with the significance threshold, and the EM reconstruction method.
3 Utility-Optimized LDP (ULDP)
In this section, we focus on the common-mechanism scenario (outlined in Section 2.1) and introduce ULDP (Utility-optimized Local Differential Privacy), which provides a privacy guarantee equivalent to ε-LDP only for sensitive data. Section 3.1 provides the definition of ULDP. Section 3.2 shows some theoretical properties of ULDP.
3.1 Definition
Figure 1 shows an overview of ULDP. An obfuscation mechanism providing ULDP, which we call the utility-optimized mechanism, divides obfuscated data into protected data and invertible data. Let Y_P ⊆ Y be a set of protected data, and Y_I = Y \ Y_P be a set of invertible data.
The feature of the utility-optimized mechanism is that it maps sensitive data in X_S only to protected data in Y_P. In other words, it restricts the output set, given an input x ∈ X_S, to Y_P. Then it provides ε-LDP for Y_P; i.e., Q(y|x) ≤ e^ε Q(y|x') for any y ∈ Y_P and any x, x' ∈ X. By this property, a privacy guarantee equivalent to ε-LDP is provided for any sensitive data x ∈ X_S, since the output set corresponding to X_S is restricted to Y_P. In addition, every output in Y_I reveals the corresponding input in X_N (as in Mangat's randomized response) to optimize the estimation accuracy.
We now formally define ULDP and the utility-optimized mechanism:
Definition 2 ((X_S, Y_P, ε)-ULDP).
Given X_S ⊆ X, Y_P ⊆ Y, and ε ≥ 0, an obfuscation mechanism Q from X to Y provides (X_S, Y_P, ε)-ULDP if it satisfies the following properties:
1. For any y ∈ Y_I, there exists an x ∈ X_N such that

  Q(y|x) > 0 and Q(y|x') = 0 for any x' ≠ x.  (4)

2. For any x, x' ∈ X and any y ∈ Y_P,

  Q(y|x) ≤ e^ε Q(y|x').  (5)

We refer to an obfuscation mechanism providing (X_S, Y_P, ε)-ULDP as the (X_S, Y_P, ε)-utility-optimized mechanism.
Example. For an intuitive understanding of Definition 2, we show that Mangat's randomized response provides ULDP. As described in Section 1, this mechanism considers binary alphabets (i.e., X = Y = {Yes, No}), and regards the value Yes as sensitive (i.e., X_S = Y_P = {Yes}). If the input value is Yes, it always reports Yes as output. Otherwise, it reports Yes and No with probability p and 1 − p, respectively. Obviously, this mechanism does not provide ε-LDP for any ε ≥ 0, since the output No is never generated from the input Yes. However, it provides ({Yes}, {Yes}, ln(1/p))-ULDP.
(X_S, Y_P, ε)-ULDP provides a privacy guarantee equivalent to ε-LDP for any sensitive data x ∈ X_S, as explained above. On the other hand, no privacy guarantees are provided for non-sensitive data, because every output in Y_I reveals the corresponding input in X_N. However, this does not matter, since non-sensitive data need not be protected. Protecting only the minimum necessary data is the key to achieving locally private distribution estimation with high data utility.
We can apply any ε-LDP mechanism to the sensitive data in X_S to provide (X_S, Y_P, ε)-ULDP as a whole. In Sections 4.1 and 4.2, we propose the utility-optimized RR (Randomized Response) and the utility-optimized RAPPOR, which apply the (X_S, ε)-RR and the RAPPOR, respectively, to the sensitive data in X_S.
It might be better to generalize ULDP so that different levels of ε can be assigned to different sensitive data. We leave introducing such granularity as future work.
Remark. It should also be noted that the data collector needs to know Q to estimate p from the obfuscated data (as described in Section 2.5), and that the (X_S, Y_P, ε)-utility-optimized mechanism itself includes the information on what is sensitive for users (i.e., the data collector learns whether each x ∈ X belongs to X_S or not by checking the values of Q(y|x) for all y ∈ Y_I). This does not matter in the common-mechanism scenario, since the set X_S of sensitive data is common to all users (e.g., public hospitals). However, in the personalized-mechanism scenario, the utility-optimized mechanism Q^(i), which expands the set of sensitive data to X_S ∪ X_S^(i), includes the information on what is sensitive for the i-th user. Therefore, the data collector learns whether each x ∈ X belongs to X_S^(i) or not by checking the values of Q^(i)(y|x) for all y ∈ Y_I, despite the fact that the i-th user wants to hide her user-specific sensitive data (e.g., home, workplace). We address this issue in Section 5.
3.2 Basic Properties of ULDP
Previous work showed some basic properties of differential privacy (or its variants), such as compositionality and immunity to post-processing. We briefly explain theoretical properties of ULDP, including the ones above.
Sequential composition. ULDP is preserved under adaptive sequential composition when the composed obfuscation mechanism maps sensitive data to pairs of protected data. Specifically, consider two mechanisms Q1 from X to Y1 and Q2 from X to Y2 such that Q1 (resp. Q2) maps sensitive data to protected data in Y_P1 (resp. Y_P2). Then the sequential composition of Q1 and Q2 maps sensitive data to pairs of protected data ranging over Y_P1 × Y_P2.
Then we obtain the following compositionality.
Proposition 1 (Sequential composition).
Let ε1, ε2 ≥ 0. If Q1 provides (X_S, Y_P1, ε1)-ULDP and Q2 provides (X_S, Y_P2, ε2)-ULDP for each output y1 ∈ Y1 of Q1, then the sequential composition of Q1 and Q2 provides (X_S, Y_P1 × Y_P2, ε1 + ε2)-ULDP.
For example, if we apply an obfuscation mechanism providing (X_S, Y_P, ε)-ULDP t times, then we obtain (X_S, Y_P^t, tε)-ULDP in total (this is derived by repeatedly using Proposition 1).
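This additivity can be checked numerically on a toy example (a sketch of ours, not from the paper): composing Mangat's RR with p = 0.5, which gives a worst-case likelihood ratio of 2 (i.e., ε = ln 2) on the protected output "Yes", with itself yields a worst-case ratio of e^{2 ln 2} = 4 on the protected pair ("Yes", "Yes"):

```python
import math

def compose(Q1, Q2):
    """Independent sequential composition: the composed mechanism reports
    the pair (y1, y2), so Q[(y1, y2)][x] = Q1[y1][x] * Q2[y2][x]."""
    return {(y1, y2): {x: Q1[y1][x] * Q2[y2][x] for x in Q1[y1]}
            for y1 in Q1 for y2 in Q2}

def max_ratio(row):
    """Worst-case likelihood ratio over inputs for a single output."""
    vals = list(row.values())
    return max(vals) / min(vals)

# Mangat's RR with p = 0.5, as a dict: Q[y][x] = Pr[y | x].
mangat = {"Yes": {"Yes": 1.0, "No": 0.5}, "No": {"Yes": 0.0, "No": 0.5}}
```

Only the protected outputs are checked here; the invertible output "No" has zero probability under the input "Yes" by design, which is exactly what ULDP permits and LDP forbids.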
Post-processing. ULDP is immune to post-processing by a randomized algorithm that preserves data types: protected data or invertible data. Specifically, if a mechanism Q provides (X_S, Y_P, ε)-ULDP and a randomized algorithm W maps protected data over Y_P (resp. invertible data) to protected data over Z_P (resp. invertible data), then the composite function W ∘ Q provides (X_S, Z_P, ε)-ULDP.
Note that W needs to preserve data types for utility; i.e., to keep all invertible data invertible (as in Definition 2) after post-processing. The DP guarantee for Y_P is preserved by any post-processing algorithm. See Appendix B.2 for details.
Compatibility with LDP. Assume that data collectors A and B adopt a mechanism providing ULDP and a mechanism providing LDP, respectively. In this case, all protected data held by data collector A can be combined with all obfuscated data held by data collector B (i.e., data integration) to perform data analysis under LDP. See Appendix B.3 for details.
Lower bounds on the l1 and l2 losses. We present lower bounds on the l1 and l2 losses of any ULDP mechanism by using the fact that ULDP provides (5) for any x, x' ∈ X and any y ∈ Y_P. Specifically, Duchi et al. showed that for ε ∈ [0, 1], the lower bounds on the l1 and l2 losses (minimax rates) of any ε-LDP mechanism can be expressed as Θ(√(|X|² / (n ε²))) and Θ(|X| / (n ε²)), respectively. By directly applying these bounds to X_S and Y_P, the lower bounds on the l1 and l2 losses of any (X_S, Y_P, ε)-ULDP mechanism can be expressed as Θ(√(|X_S|² / (n ε²))) and Θ(|X_S| / (n ε²)), respectively. In Section 4.3, we show that our utility-optimized RAPPOR achieves these lower bounds when ε is close to 0 (i.e., high privacy regime).
4 Utility-Optimized Mechanisms
In this section, we focus on the common-mechanism scenario and propose the utility-optimized RR (Randomized Response) and utility-optimized RAPPOR (Sections 4.1 and 4.2). We then analyze the data utility of these mechanisms (Section 4.3).
4.1 Utility-Optimized Randomized Response
We propose the utility-optimized RR, which is a generalization of Mangat's randomized response to |X|-ary alphabets with |X_S| sensitive symbols. As with the RR, the output range of the utility-optimized RR is identical to the input domain; i.e., Y = X. In addition, we divide the output set in the same way as the input set; i.e., Y_P = X_S and Y_I = X_N.
Figure 2 shows an example of the utility-optimized RR. The utility-optimized RR applies the (X_S, ε)-RR to the sensitive data in X_S. It maps each x ∈ X_N to y ∈ Y_P (= X_S) with a probability chosen so that (5) is satisfied, and maps x to itself with the remaining probability. Formally, we define the utility-optimized RR (uRR) as follows:
Definition 3 ((X_S, ε)-utility-optimized RR).
Let ε ≥ 0, Y_P = X_S, and Y_I = X_N. Let c1 = e^ε / (|X_S| + e^ε − 1), c2 = 1 / (|X_S| + e^ε − 1), and c3 = (e^ε − 1) / (|X_S| + e^ε − 1). Then the (X_S, ε)-utility-optimized RR (uRR) is an obfuscation mechanism that maps x ∈ X to y ∈ Y (= X) with the probability Q(y|x) defined as follows:

  Q(y|x) = c1 if x ∈ X_S and y = x,
  Q(y|x) = c2 if y ∈ X_S and y ≠ x,
  Q(y|x) = c3 if x ∈ X_N and y = x,
  Q(y|x) = 0 otherwise.  (6)

The (X_S, ε)-uRR provides (X_S, Y_P, ε)-ULDP.
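A sketch of the uRR as a transition matrix, using the constants c1 = e^ε/(|X_S| + e^ε − 1) (a sensitive value kept), c2 = 1/(|X_S| + e^ε − 1) (mapped to another protected value), and c3 = (e^ε − 1)/(|X_S| + e^ε − 1) (a non-sensitive value kept). We encode X as {0, …, k−1} with the first k_s values sensitive; the code is our illustration:

```python
import math

def urr_matrix(k_s, k, eps):
    """Utility-optimized RR over X = {0, ..., k-1} with sensitive values
    X_S = {0, ..., k_s - 1} (so Y_P = X_S and Y_I = X_N).  A sensitive
    input is randomized over X_S as in the RR; a non-sensitive input is
    mapped into X_S with the smallest RR probability or kept as-is."""
    e = math.exp(eps)
    c1 = e / (k_s + e - 1)        # sensitive value kept
    c2 = 1 / (k_s + e - 1)        # mapped to another protected value
    c3 = (e - 1) / (k_s + e - 1)  # non-sensitive value kept (invertible)
    Q = [[0.0] * k for _ in range(k)]  # Q[y][x] = Pr[y | x]
    for x in range(k):
        for y in range(k):
            if x < k_s:
                Q[y][x] = c1 if y == x else (c2 if y < k_s else 0.0)
            else:
                Q[y][x] = c2 if y < k_s else (c3 if y == x else 0.0)
    return Q
```

The test below checks the three defining properties: each column is a distribution, sensitive inputs only reach the protected outputs with worst-case likelihood ratio e^ε, and each invertible output has exactly one possible input.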
4.2 Utility-Optimized RAPPOR
Next, we propose the utility-optimized RAPPOR with the input alphabet X = {x1, …, x_|X|} and the output alphabet Y = {0, 1}^|X|. Without loss of generality, we assume that x1, …, x_|X_S| are sensitive and x_{|X_S|+1}, …, x_|X| are non-sensitive; i.e., X_S = {x1, …, x_|X_S|} and X_N = {x_{|X_S|+1}, …, x_|X|}.
Figure 3 shows an example of the utility-optimized RAPPOR. The utility-optimized RAPPOR first deterministically maps x_i to the i-th standard basis vector e_i. It should be noted that if x_i is sensitive data (i.e., 1 ≤ i ≤ |X_S|), then the last |X| − |X_S| elements in e_i are always zero (as shown in the upper-left panel of Figure 3). Based on this fact, the utility-optimized RAPPOR regards obfuscated data y whose last |X| − |X_S| elements are all zero as protected data; i.e., Y_P = {(y1, …, y_|X|) ∈ {0, 1}^|X| : y_{|X_S|+1} = ⋯ = y_|X| = 0}.
Then it applies the (θ, ε)-generalized RAPPOR to the bit positions corresponding to X_S, and maps each x ∈ X_N to y (as shown in the lower-left panel of Figure 3) with probabilities chosen so that (5) is satisfied. We formally define the utility-optimized RAPPOR (uRAP):
Definition 4 ((X_S, θ, ε)-utility-optimized RAPPOR).
Let θ ∈ (0, 1), ε ≥ 0, and Y_P be the protected set defined above. Let q = θ / (θ + e^ε (1 − θ)) and r = e^{−ε} max(θ/q, (1 − q)/(1 − θ)). Then the (X_S, θ, ε)-utility-optimized RAPPOR (uRAP) is an obfuscation mechanism that maps x_i to y = (y1, …, y_|X|) ∈ {0, 1}^|X| with the probability given by:

  Q(y|x_i) = ∏_{j=1}^{|X|} Pr(y_j | x_i),  (7)

where Pr(y_j | x_i) is written as follows: for a sensitive position j ∈ {1, …, |X_S|}, Pr(y_j = 1 | x_i) = θ if j = i and Pr(y_j = 1 | x_i) = q otherwise; for a non-sensitive position j ∈ {|X_S|+1, …, |X|}, Pr(y_j = 1 | x_i) = 1 − r if j = i and Pr(y_j = 1 | x_i) = 0 otherwise.
The (X_S, θ, ε)-uRAP provides (X_S, Y_P, ε)-ULDP, where Q is given by (7).
Although we used the generalized RAPPOR with a general parameter θ in Definition 4, hereinafter we set θ = e^{ε/2} / (e^{ε/2} + 1) in the same way as the original RAPPOR. There are two reasons for this. First, this setting achieves "order" optimal data utility among all ULDP mechanisms in the high privacy regime, as shown in Section 4.3. Second, it maps each x ∈ X_N to y = x with probability 1 − e^{−ε/2}, which is close to 1 when ε is large (i.e., low privacy regime). Wang et al. showed that the generalized RAPPOR with parameter θ = 1/2 minimizes the variance of the estimate. However, our uRAP with parameter θ = 1/2 maps x ∈ X_N to y = x with probability (1 − e^{−ε})/2, which is less than 1/2 for any ε and remains below 1/2 even when ε goes to infinity. Thus, our uRAP with θ = e^{ε/2} / (e^{ε/2} + 1) maps x ∈ X_N to y = x with higher probability, and therefore achieves a smaller estimation error over all non-sensitive data. We also consider that the optimal θ for our uRAP is different from the optimal one (θ = 1/2) for the generalized RAPPOR. We leave finding the optimal θ for our uRAP (with respect to the estimation error over all personal data) as future work.
We refer to the (X_S, θ, ε)-uRAP with θ = e^{ε/2} / (e^{ε/2} + 1) in shorthand as the (X_S, ε)-uRAP.
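A sampling sketch of the uRAP in Python, with θ fixed to e^{ε/2}/(e^{ε/2} + 1) (our illustration; under this setting, the non-sensitive bit of the input is kept with probability 1 − e^{−ε/2}, and a 0-bit in a non-sensitive position is never turned on, so any 1 there reveals the input):

```python
import math
import random

def urap(i, k_s, k, eps, rng=random):
    """uRAP sketch over X = {0, ..., k-1} with sensitive values
    {0, ..., k_s - 1} and theta = e^(eps/2) / (e^(eps/2) + 1).
    Sensitive bit positions are flipped as in the basic RAPPOR; the
    non-sensitive bit of the input (if any) is kept with probability
    1 - e^(-eps/2), and non-sensitive 0-bits are never flipped to 1."""
    theta = math.exp(eps / 2) / (math.exp(eps / 2) + 1)
    keep_ns = 1 - math.exp(-eps / 2)
    y = []
    for j in range(k):
        if j < k_s:  # sensitive position: symmetric RAPPOR flip
            p_one = theta if j == i else 1 - theta
        else:        # non-sensitive position: invertible, never 0 -> 1
            p_one = keep_ns if j == i else 0.0
        y.append(1 if rng.random() < p_one else 0)
    return y
```

Note that a sensitive input can only produce outputs whose non-sensitive positions are all zero, i.e., outputs in the protected set Y_P.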
4.3 Utility Analysis
We evaluate the l1 loss of the uRR and uRAP when the empirical estimation method is used for distribution estimation. (We use the empirical estimation method in the same way as prior work, and it might be possible that other mechanisms have better utility with a different estimation method. However, we emphasize that even with the empirical estimation method, the uRAP achieves the lower bounds on the l1 and l2 losses of any ULDP mechanism when ε is close to 0, and the uRR and uRAP achieve almost the same utility as a non-private mechanism when ε = ln|X| and most of the data are non-sensitive.) In particular, we evaluate the l1 loss when ε is close to 0 (i.e., high privacy regime) and ε = ln|X| (i.e., low privacy regime). Note that ULDP provides a natural interpretation of the latter value of ε. Specifically, it follows from (5) that if ε = ln|X|, then for any y ∈ Y_P, the likelihood that the input data is x can be almost equal to the sum of the likelihoods that the input data is some other x' ∈ X. This is consistent with the fact that the (X, ε)-RR with ε = ln|X| sends true data (i.e., y = x in (2)) with probability about 1/2 and false data (i.e., y ≠ x) with probability about 1/2, and hence provides plausible deniability.
uRR in the general case. We begin with the uRR:
Proposition 4 (l1 loss of the uRR).
Let ε ≥ 0 and n be the number of users. Then the expected l1 loss of the (X_S, ε)-uRR mechanism is given by (11).
Let p1 be the uniform distribution over X_S; i.e., p1(x) = 1/|X_S| for any x ∈ X_S and p1(x) = 0 for any x ∈ X_N. Symmetrically, let p2 be the uniform distribution over X_N.
For ε close to 0, the l1 loss is maximized by p1: for any p ∈ C, (11) is maximized by p = p1.
For ε = ln|X|, the l1 loss is maximized by a mixture distribution p3 of p1 and p2: for any p ∈ C, (11) is maximized by p = p3.
Next, we instantiate the l1 loss in the high and low privacy regimes based on these propositions.
It was shown in previous work that the expected l1 loss of the (X, ε)-RR in the low privacy regime is upper-bounded by a quantity much larger than the right-hand side of (15) when |X_S| ≪ |X|. Although both of them are "upper-bounds" of the expected l1 losses, we show in Section 6 that the total variation of the (X_S, ε)-uRR is also much smaller than that of the (X, ε)-RR when |X_S| ≪ |X|.
It should be noted that the expected l1 loss of the non-private mechanism, which does not obfuscate the personal data at all, is at most √(|X|/n). Thus, when ε = ln|X| and |X_S| ≪ |X|, the (X_S, ε)-uRR achieves almost the same data utility as the non-private mechanism, whereas the expected l1 loss of the (X, ε)-RR is about twice as large as that of the non-private mechanism.
uRAP in the general case. We then analyze the uRAP:
Proposition 7 (l1 loss of the uRAP).
Let ε ≥ 0 and n be the number of users. Then the expected l1 loss of the (X_S, ε)-uRAP mechanism is given by (16).
When ε is close to 0, the l1 loss is maximized by the uniform distribution over X: for any p ∈ C, (16) is maximized by the uniform distribution.
Note that this proposition covers a wide range of ε, including both the high privacy regime (ε close to 0) and the low privacy regime (ε = ln|X|). Below we instantiate the l1 loss in the high and low privacy regimes based on this proposition.
uRAP in the high privacy regime. If ε is close to 0, we have