Fine-grained Poisoning Attacks to Local Differential Privacy Protocols for Mean and Variance Estimation

Local differential privacy (LDP) protects individual data contributors against privacy-intrusive data aggregation and analytics. Recent work has shown that LDP protocols for some specific data types are vulnerable to data poisoning attacks, which enable the attacker to alter analytical results by injecting carefully crafted bogus data. In this work, we apply data poisoning attacks to a previously unexplored statistical task, i.e., mean and variance estimation. In contrast to prior work that aims for overall LDP performance degradation or straightforward maximization of attack gain, our attacker can fine-tune the LDP-estimated mean/variance to desired target values and manipulate both simultaneously. To accomplish this goal, we propose two types of data poisoning attacks: the input poisoning attack (IPA) and the output poisoning attack (OPA). The former is independent of LDP, while the latter utilizes the characteristics of LDP and is thus more effective. More intriguingly, we observe a security-privacy consistency, where a small ϵ enhances the security of LDP, contrary to the previous conclusion of a security-privacy trade-off. We further study this consistency and reveal a more holistic view of the threat landscape of LDP in the presence of data poisoning attacks. We comprehensively evaluate the attacks on three real-world datasets and report their effectiveness in achieving the target values. We also explore defense mechanisms and provide insights into secure LDP design.


1. Introduction

Local differential privacy (LDP) (Duchi et al., 2013), a variant of differential privacy (Dwork et al., 2014) for the distributed setting, was developed to protect individual user data against an untrusted data collector regardless of the adversary’s background knowledge. Numerous LDP protocols have been proposed for various statistical tasks such as frequency (Wang et al., 2017, 2019a; Erlingsson et al., 2014; Wang et al., 2020; Warner, 1965), mean/variance (Duchi et al., 2018; Wang et al., 2019b) and distribution estimation (Murakami and Kawamoto, 2019; Li et al., 2020). LDP has also been integrated into many real-world applications as a de facto privacy-preserving data collection tool. For example, Google deployed LDP in the Chrome browser to collect users’ homepages (Erlingsson et al., 2014), and Microsoft implemented LDP in Windows 10 to analyze customers’ application usage statistics (Ding et al., 2017).

Recently, Cao et al. (Cao et al., 2021a) and Cheu et al. (Cheu et al., 2021) independently studied the security of LDP under data poisoning attacks (also called manipulation attacks in (Cheu et al., 2021)). They found that malicious users could send carefully crafted false data to effectively skew the collector’s statistical estimates by leveraging the randomization of LDP. In particular, an untargeted attack is presented in (Cheu et al., 2021) that allows an attacker to compromise a group of legitimate users and inject false data, hence degrading the overall performance of the LDP protocol. On the other hand, the data poisoning attacks in (Cao et al., 2021a; Wu et al., 2022) aim to promote attacker-selected target items, e.g., in a recommender system, by maximizing the associated estimated statistics, such as frequencies and key-value data.

Figure 1. Illustration of our fine-grained data poisoning attacks on LDP-based mean/variance estimation.

In this work, we investigate fine-grained data poisoning attacks against LDP protocols for mean and variance estimation, which have not been explored in the literature. Mean/variance estimation is a crucial component of many data analytics applications. For example, a company conducts a market survey to identify target market segments based on its customers’ income (Scharfenaker et al., 2019), as shown in Figure 1. From the survey, the company estimates the mean and variance of the income so as to make informed decisions on product pricing, related services, etc. To enhance customers’ privacy, LDP can be adopted to obfuscate each customer’s raw income value before it is sent to the company for mean and variance estimation. Meanwhile, a rival company may launch a fine-grained data poisoning attack by injecting erroneous data into the data collection process to bring the final estimates as close to its target values as possible. Consequently, the resulting estimates deviate from reality and lead to deceptive conclusions, e.g., customers in the middle income quintile are mistakenly believed to come from a lower quintile (Fontenot et al., 2018). Note that existing work does not support such fine-tuning of estimates (Cao et al., 2021a; Cheu et al., 2021; Wu et al., 2022).

We present two types of fine-grained data poisoning attacks on the local user side, the input poisoning attack (IPA) and the output poisoning attack (OPA), against two state-of-the-art LDP protocols for mean and variance, i.e., Stochastic Rounding (SR) (Duchi et al., 2018) and the Piecewise Mechanism (PM) (Wang et al., 2019b). Consistent with prior work, we assume that the attacker can control a group of fake users, e.g., by purchasing accounts from dark markets (Cao et al., 2021a). As illustrated in Figure 1, the attacker in IPA injects false input data into the local LDP instances through these fake users, while an OPA attacker modifies the output of the LDP perturbation mechanism on the controlled user side. Because it leverages knowledge of the LDP mechanism, OPA is the more effective of the two. In the end, the remote server receives polluted data that results in skewed mean/variance estimates close to the attacker’s intended values.

To control the LDP estimates at a fine-grained level, the attack relies on two practical observations. First, companies and governments, for commercial or public interest or as required by regulations, periodically collect user information to learn the status quo and then publish the related statistical results (Erlingsson et al., 2014; Microsoft, 2022; Jingdong Big Data Research, 2020; Shrider et al., 2021). Second, historical results about the same entity tend to be close if the data collections are made over a short period of time (Statista, 2021; Shrider et al., 2021; Fontenot et al., 2018). As a result, the attacker can leverage this data transparency and the predictable changes in the statistics to enable fine-grained data manipulation. Specifically, we assume that the attacker can acquire related statistics about genuine users from recent, publicly available statistical reports or by compromising a small number of users (see the threat model in Section 3).

Besides precise control, another challenge for the attacker is to manipulate more than one statistical estimate, i.e., to control the mean and variance at the same time. This is common for applications that rely on multiple measures of the surveyed population. For example, a company may be interested in both the average income (the mean) and the income inequality (the variance) of its customers. This kind of multi-task estimation via a single query is also enabled by LDP (Li et al., 2020). Hence, the attacker must consider the correlation between the different measures. To this end, we formulate the attack as a problem of solving simultaneous equations and coordinate the generation of the poisonous data across the controlled fake users.

We systematically study the proposed attacks. We first analyze the sufficient conditions to launch IPA and OPA, and then discuss the lower bound on the required number of fake users given the target mean and variance. We are particularly interested in the relationship between the various attack parameters and the attack performance, as well as the associated implications. We therefore theoretically study the MSE between the target values and the final estimates. For mean estimation, OPA has a smaller MSE because direct manipulation of the local LDP output bypasses the perturbation noise and gives the attacker a significant advantage in producing an estimate close to the target. For variance estimation, we cannot derive a conclusive theoretical result that favors either attack strategy, because the bias term in the bias-variance decomposition of the MSE is data-dependent. We provide more detail in Sections 4.2 and 5.2.

In prior work (Cheu et al., 2021; Cao et al., 2021a; Wu et al., 2022), a security-privacy trade-off for LDP protocols was revealed: a small ϵ (strong privacy guarantee) leads to a less secure LDP protocol against their attacks. However, in this work we make the opposite observation that weak privacy protection with a large ϵ is vulnerable to our attacks. We call this a security-privacy consistency for LDP protocols. We analyze the two assertions and show that, surprisingly, they are both valid and that, together, they provide a holistic understanding of the threat landscape. The conclusion is disturbing, since it complicates the already elusive reasoning about and selection of the privacy budget in LDP and makes designing a secure LDP protocol more difficult (see Section 6). To mitigate our attacks, we also propose a clustering-based method for fake user detection and discuss the relevant defenses in Section 8. Our main contributions are:


  • We are the first to study the fine-grained data poisoning attack against the state-of-the-art LDP protocols for mean and variance estimation.

  • We propose two types of attacks, the input poisoning attack and the output poisoning attack, to precisely steer the statistical estimates toward the intended values. The former is independent of LDP protocols, while the latter takes advantage of LDP for improved performance in general.

  • We theoretically analyze the sufficient conditions to launch the proposed attacks, study the introduced errors in the attacks, and discuss the factors that impact the attack effectiveness.

  • We discover a fundamental security-privacy consistency associated with our attacks, which is at odds with the prior finding of a security-privacy trade-off. We provide in-depth analysis and discussion to reveal the cause of the difference.

  • We empirically evaluate our attacks on three real-world datasets. The results show that, given the target values, our attacks can effectively manipulate the mean and variance with only small errors. We also propose and evaluate a countermeasure, and provide insights into secure LDP design and other mitigation methods.

2. Background and Related Work

2.1. Local Differential Privacy

In the local setting of differential privacy, there is no trusted third party. In this paper, we consider a set of users and one remote server. Each user possesses a data value, and the server wants to estimate the mean and variance of the values held by all local users. To protect privacy, each user randomly perturbs his/her value x using a randomized algorithm Ψ and sends the perturbed result Ψ(x) to the server.

Definition 1 (ϵ-Local Differential Privacy (ϵ-LDP) (Duchi et al., 2013)).

An algorithm Ψ satisfies ϵ-LDP (ϵ > 0) if and only if, for any pair of inputs x and x′ and any output y, the following inequality holds: Pr[Ψ(x) = y] ≤ e^ϵ · Pr[Ψ(x′) = y].

Intuitively, an attacker cannot deduce with high confidence whether the input is x or x′ given the output of an LDP mechanism. The offered privacy is controlled by ϵ, i.e., a small (large) ϵ results in a strong (weak) privacy guarantee and low (high) data utility. Since each user only reports the privatized result Ψ(x) instead of the original value x, the users’ privacy is protected even if the server is malicious. In our attack, the attacker manipulates a group of fake users in order to change the estimates of the mean/variance on the server (see Section 3 for the detailed threat model).
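As a concrete illustration of the definition, the snippet below checks the inequality for binary randomized response (Warner, 1965), one of the frequency protocols cited in Section 1. It is only an illustrative example, not a mechanism analyzed in this paper, and the function names are ours.

import math

def rr_prob(x, y, eps):
    """Probability that binary randomized response maps input bit x to output bit y."""
    keep = math.exp(eps) / (math.exp(eps) + 1)   # keep the true bit with this probability
    return keep if x == y else 1 - keep

def satisfies_ldp(eps):
    """Check Pr[output = y | x] <= e^eps * Pr[output = y | x'] for all x, x', y."""
    bound = math.exp(eps)
    return all(rr_prob(x, y, eps) <= bound * rr_prob(xp, y, eps) + 1e-12
               for x in (0, 1) for xp in (0, 1) for y in (0, 1))

print(satisfies_ldp(1.0))   # True: randomized response satisfies eps-LDP

The probability ratio between any two inputs is exactly e^ϵ here, so the guarantee holds with equality.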

2.2. Mean and Variance Estimation with LDP

We introduce two widely-used LDP mechanisms for mean and variance estimation, Stochastic Rounding (SR) (Duchi et al., 2018) and Piecewise Mechanism (PM) (Wang et al., 2019b). Note that they were originally developed for mean estimation only and were subsequently adapted to support variance estimation in (Li et al., 2020). In this work, we use the adapted version.

2.2.1. SR mechanism

The SR mechanism first partitions all users into two groups: one group reports the squares of their original values and the other group submits their original values. All values must be transformed into the range [−1, 1] before being used in the LDP perturbation.

Perturbation. SR first converts each value into the range [−1, 1]. Suppose the range of the original input values is given; SR computes one transformation coefficient for the squared values and another for the original values, and uses them to linearly map each value into [−1, 1]. SR then perturbs each transformed value x into one of two outputs: it reports y = (e^ϵ + 1)/(e^ϵ − 1) with probability 1/2 + x(e^ϵ − 1)/(2(e^ϵ + 1)), and y = −(e^ϵ + 1)/(e^ϵ − 1) otherwise.

Aggregation. It has been proven that the SR report is an unbiased estimate of the transformed input, i.e., the expectation of each report equals the corresponding true value. The server re-scales the reports of the two groups back to the original range using the transformation coefficients and estimates their means. The process provides unbiased estimates of the mean of the squared values and of the mean of the original values. The variance is then estimated as the estimated mean of the squared values minus the square of the estimated mean of the original values.
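The Python sketch below illustrates this pipeline, assuming the standard formulation of Duchi et al.'s mechanism as presented in (Wang et al., 2019b) and a uniform random split into the two groups; the rescaling from the original input range into [−1, 1] is omitted, and the function names are ours.

import numpy as np

def sr_perturb(x, eps):
    """Stochastic Rounding for x in [-1, 1]: report one of the two extreme values
    +C or -C, chosen so that the report is an unbiased estimate of x."""
    C = (np.exp(eps) + 1) / (np.exp(eps) - 1)
    p = 0.5 + x * (np.exp(eps) - 1) / (2 * (np.exp(eps) + 1))   # Pr[report = +C]
    return C if np.random.random() < p else -C

def sr_mean_var(values, eps):
    """Server-side estimation: one half of the users reports x^2, the other half x.
    Assumes every value already lies in [-1, 1]."""
    values = np.asarray(values, dtype=float)
    idx = np.random.permutation(len(values))
    g1, g2 = idx[::2], idx[1::2]              # group 1: squared values, group 2: raw values
    mean_sq = np.mean([sr_perturb(values[i] ** 2, eps) for i in g1])   # estimate of E[x^2]
    mean_x = np.mean([sr_perturb(values[i], eps) for i in g2])         # estimate of E[x]
    return mean_x, mean_sq - mean_x ** 2      # Var = E[x^2] - (E[x])^2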

2.2.2. PM mechanism

PM also divides users into two groups, in which users report the squared values and the original values, respectively.

Perturbation. In PM, the input domain is [−1, 1] and the output domain is [−C, C], where C = (e^(ϵ/2) + 1)/(e^(ϵ/2) − 1). Similar to SR, PM first transforms each value into the range [−1, 1] via the same steps as in SR. PM then perturbs each transformed value x by reporting a value drawn from a piecewise-uniform distribution over [−C, C]: with probability e^(ϵ/2)/(e^(ϵ/2) + 1) the report is drawn uniformly from a subinterval of length C − 1 whose position increases with x, and otherwise it is drawn uniformly from the remainder of [−C, C].

Aggregation. It has been proven that the PM report is also an unbiased estimate of the input. The server re-converts the reports back to the original range for each group and then estimates their means, from which it obtains unbiased estimates of the mean of the squared values and of the mean of the original values. The variance is again estimated as the estimated mean of the squared values minus the square of the estimated mean of the original values.
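A minimal sketch of the PM perturbation follows, assuming the piecewise-uniform formulation in (Wang et al., 2019b); the interval endpoints below are that paper's standard choices, the input is assumed to be already rescaled into [−1, 1], and the helper names are ours.

import numpy as np

def pm_perturb(x, eps):
    """Piecewise Mechanism for x in [-1, 1]: report a value in [-C, C] drawn from
    a piecewise-uniform density whose high-probability interval depends on x."""
    C = (np.exp(eps / 2) + 1) / (np.exp(eps / 2) - 1)
    l = (C + 1) / 2 * x - (C - 1) / 2          # left end of the high-probability interval
    r = l + C - 1                              # right end; the interval has length C - 1
    if np.random.random() < np.exp(eps / 2) / (np.exp(eps / 2) + 1):
        return np.random.uniform(l, r)         # report a value near x
    # otherwise report a value from the remaining part of [-C, C]
    left_len, right_len = l + C, C - r
    u = np.random.uniform(0, left_len + right_len)
    return -C + u if u < left_len else r + (u - left_len)

# The server-side aggregation mirrors SR: average the reports of the group
# submitting x^2 and of the group submitting x, then take E[x^2] - (E[x])^2.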

The following lemma shows the error of the SR and PM mechanisms, which is useful for later analysis of the attack error.

Lemma 1 (Error of SR and PM mechanisms (Wang et al., 2019b)).

Assume there are n users with values in [−1, 1], and let the estimation target be the mean of those values; let the SR and PM estimates be the means computed by the server from the corresponding reports. With probability at least 1 − β, the estimation error of either mechanism is O(√(log(1/β)) / (ϵ√n)).

It is also shown in (Wang et al., 2019b) that the PM mechanism has a smaller error than the SR mechanism when ϵ is large.
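To make this comparison concrete, the snippet below (reusing sr_perturb and pm_perturb from the sketches above) empirically estimates the mean-squared error of the two mechanisms' mean estimates; the dataset and trial counts are arbitrary illustrative choices.

import numpy as np

def empirical_mse(perturb, values, eps, trials=100):
    """Average squared error of the LDP mean estimate over repeated runs."""
    true_mean = np.mean(values)
    errs = [(np.mean([perturb(v, eps) for v in values]) - true_mean) ** 2
            for _ in range(trials)]
    return float(np.mean(errs))

# values = np.random.uniform(-1, 1, 5000)
# for eps in (0.5, 1.0, 2.0, 4.0):
#     print(eps, empirical_mse(sr_perturb, values, eps), empirical_mse(pm_perturb, values, eps))

For large ϵ the PM error should drop below the SR error, matching the statement above.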

2.3. Data Poisoning Attack to LDP Protocols

We now discuss related work on data poisoning attacks against LDP protocols. In particular, (Cheu et al., 2021) studied untargeted attacks, focusing on degrading the overall performance of LDP protocols regardless of the underlying statistical task. The core idea of the attack is that an accurate aggregator must be sensitive to small changes in the distribution of the perturbed data; thus, the attacker can send false data to distort the distribution and thereby impair the accuracy. The results showed that the vulnerability exploited by their attack is inherent to LDP, i.e., every non-interactive LDP protocol is susceptible to it.

In contrast, targeted attacks were investigated in (Cao et al., 2021a; Wu et al., 2022). Although they target different types of data, i.e., frequency data and key-value data, both attacks promote attacker-chosen target items and share a similar idea. In particular, both begin by defining an overall attack gain with respect to the relevant statistics of the target items (or keys) given the fake data, using knowledge of the LDP aggregation. They then formulate the attack as an optimization problem whose objective is to maximize the overall attack gain, and whose solution is the fake data that the attacker sends to the data collector.

In this work, we expand on prior work by considering more sophisticated, fine-grained attacks on mean and variance estimation under LDP. Our attacker calibrates the fake values so that the final estimates match the desired target values, and the manipulation of the mean and variance estimates is accomplished within a single invocation of the underlying LDP protocol. This work also provides important new insights into the analysis of attack impact and mitigation design.

3. Threat Model

In this section, we present our threat model, including the attacker’s capabilities and objectives.

3.1. Assumption

Our attacks rely on the following assumptions. First, we assume that the data collector periodically collects user information to derive the intended statistical results; LDP may be adopted for privacy. Such periodic data collection is important, and even mandatory, in practice to keep track of the status quo and make informed decisions about future activities. For various reasons, such as transparency, research and regulatory compliance (Erlingsson et al., 2014; Microsoft, 2022; Jingdong Big Data Research, 2020; Shrider et al., 2021; Fontenot et al., 2018), the results are also made public and are thus accessible to the attacker. Second, if the respective data collections are made over a short period of time, historical results about the same entity tend to be “stable”, i.e., their values are close (Statista, 2021; Shrider et al., 2021; Fontenot et al., 2018). Therefore, the attacker can use the statistics from the most recent public data report to improve the attack accuracy. Specifically, our attacker needs to estimate the number of genuine users n, the sum of the genuine users’ input values, and the sum of their squared input values. Additionally, we assume that the attacker can inject m fake users into the LDP protocol that already contains the n genuine users, for a total of n + m users in the system. This is consistent with prior work showing that an attacker can inject a large number of fake accounts/users into a variety of web services with minimal cost (Cao et al., 2021a; Wu et al., 2022). Next, we discuss the estimation of the required information.


  • Estimating n. Denote the attacker’s estimate of n as n̂. The attacker can deduce n from publicly available and reliable sources, e.g., service providers often disclose the number of users under LDP protection for publicity (Erlingsson et al., 2014; Microsoft, 2022).

  • Estimating the value sums. Let Ŝ1 and Ŝ2 be the attacker’s estimates of the sum of the genuine users’ input values and the sum of their squared values, respectively. We offer two intuitive estimation methods.

    (1) From public historical data. This is the most straightforward way. Given the estimated number of users n̂ and the historical mean and variance, the attacker can derive Ŝ1 = n̂ × (historical mean) and Ŝ2 = n̂ × (historical variance + historical mean²), using the fact that the mean of the squared values equals the variance plus the squared mean.

    (2) Compromising a small number of genuine users. The attacker can compromise a small number of the genuine users and obtain their original input values. This is reasonable in practice for a small number of users and is also a prerequisite for prior work (Cheu et al., 2021). The attacker can then scale the sums of values and of squared values observed on the compromised users by the ratio of n̂ to the number of compromised users to obtain Ŝ1 and Ŝ2 (a sketch of both estimation methods is given after this list).
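Both estimation routes reduce to a few lines of arithmetic. The sketch below (with our own variable names) implements them; the first uses the identity E[x²] = Var(x) + (E[x])², and the second scales the sums observed on the compromised users up to the estimated population size.

def sums_from_history(n_est, hist_mean, hist_var):
    """Estimate the genuine users' sum of values and sum of squared values
    from a published historical mean and variance."""
    s1 = n_est * hist_mean                        # sum of values
    s2 = n_est * (hist_var + hist_mean ** 2)      # sum of squares, via E[x^2] = Var + mean^2
    return s1, s2

def sums_from_compromised(n_est, compromised_values):
    """Estimate the same sums by scaling up the sums over the compromised users."""
    r = len(compromised_values)
    scale = n_est / r
    s1 = scale * sum(compromised_values)
    s2 = scale * sum(v * v for v in compromised_values)
    return s1, s2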

We differentiate the attacker’s ability to interfere with LDP in the proposed IPA and OPA attacks. Those capabilities are aligned with prior work (Cao et al., 2021a; Cheu et al., 2021). We make no assumptions about additional advantages of the attacker. Specifically,


  • Input poisoning attacker: In the input poisoning attack, the attacker only knows the input range of the LDP and can control the fake users to generate falsified values in the input domain of the local LDP instance.

  • Output poisoning attacker: In addition to the knowledge in IPA, an OPA attacker can gain access to the implementation of the LDP and know the related parameters and output domain of the local LDP. Therefore, the attacker can leverage the knowledge of LDP to produce bogus data in the output domain and send it to the remote server in order to manipulate the final estimate.

3.2. Attack Objectives

The attacker’s goal is to drive the mean and variance estimated through LDP as close to the target mean and target variance as possible, and to manipulate both estimates simultaneously. We adopt the adapted versions of the PM and SR mechanisms, which privately estimate the mean and variance within one protocol invocation. Note that our attack objective also implicitly covers maximizing (minimizing) the mean and variance by setting significantly large (small) target values. In what follows, we elaborate on our attacks. Some important notations are summarized in Table 1.

  • The number of genuine users (n)
  • The attacker-estimated number of genuine users (n̂)
  • The number of fake users (m)
  • The group of users reporting the squared values
  • The group of users reporting the original values
  • The attacker-estimated sum of the genuine users’ input values (Ŝ1)
  • The attacker-estimated sum of the genuine users’ squared input values (Ŝ2)
  • The attacker’s target mean
  • The attacker’s target variance
  • The transformation coefficient for the squared values
  • The transformation coefficient for the original values

Table 1. Notations.

4. Input Poisoning Attack

4.1. Attack Details

The main idea of IPA is to craft the input values of the controlled fake users so that the mean and variance estimates are altered to be close to the attacker's desired mean and variance. Note that launching IPA does not rely on the implementation details of the underlying LDP protocol; therefore, the attack applies to both the SR and PM mechanisms. Formally, we denote the original inputs of the n genuine users as x_1, …, x_n and the crafted inputs of the m fake users as y_1, …, y_m. We formulate IPA as finding y_1, …, y_m such that the mean and variance computed over all n + m input values equal the target mean and the target variance.

To solve for the fake values, the attacker needs to know n and the sums of the genuine users' values and squared values, which can be estimated from published information or by compromising a small number of genuine users as described in Section 3. By substituting these quantities with their estimates n̂, Ŝ1 and Ŝ2, a set of desired fake values should satisfy

(1) (Ŝ1 + y_1 + ⋯ + y_m) / (n̂ + m) = target mean
(2) (Ŝ2 + y_1² + ⋯ + y_m²) / (n̂ + m) − (target mean)² = target variance

We then transform Equations (1) and (2) into an optimization problem (3), whose objective penalizes the deviation of the poisoned mean and variance from their targets while constraining every fake value to the input domain, and solve it to find a set of valid fake values. (In this work, we use the Adam optimizer in the PyTorch framework (Opacus, 2019) to solve problem (3), with the learning rate and the number of iterations chosen empirically.) A sketch of this step is given below.

(3)
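The sketch below shows one way to solve problem (3) with Adam in PyTorch, as mentioned above. The squared-residual objective, the tanh reparameterization used to keep the fake values inside the input domain, and the hyperparameter defaults are our illustrative assumptions rather than the paper's exact settings; n_est, s1_est and s2_est correspond to n̂, Ŝ1 and Ŝ2.

import torch

def ipa_fake_inputs(m, n_est, s1_est, s2_est, target_mean, target_var,
                    lo=-1.0, hi=1.0, lr=0.01, iters=2000):
    """Search for m fake inputs y in [lo, hi] so that the poisoned mean and
    variance (Equations (1) and (2)) are close to the attacker's targets."""
    z = torch.zeros(m, requires_grad=True)            # unconstrained parameters
    opt = torch.optim.Adam([z], lr=lr)
    total = n_est + m
    for _ in range(iters):
        y = lo + (hi - lo) * (torch.tanh(z) + 1) / 2  # map into [lo, hi]
        mean_res = (s1_est + y.sum()) / total - target_mean
        var_res = (s2_est + (y ** 2).sum()) / total - target_mean ** 2 - target_var
        loss = mean_res ** 2 + var_res ** 2           # squared residuals of (1) and (2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        y = lo + (hi - lo) * (torch.tanh(z) + 1) / 2
    return y.tolist()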

4.2. Theoretical Analysis

In this subsection, we analyze IPA in terms of the sufficient condition to launch the attack, the number of fake users, and the introduced error. We assume that the data values of both groups have been transformed into [−1, 1]; the analysis results can later be scaled by the transformation coefficients to recover the corresponding representations in the original value range.

4.2.1. Sufficient Condition to Launch IPA

The sufficient condition to launch IPA is that Equations (1) and (2) are solvable, so that the attacker can find a set of valid fake input values for the LDP protocol. Specifically, IPA can be launched if inequalities (4) and (5) below hold.

(4) −m ≤ (n̂ + m) × target mean − Ŝ1 ≤ m
(5) (minimum of y_1² + ⋯ + y_m² under (1)) ≤ (n̂ + m) × (target variance + target mean²) − Ŝ2 ≤ (maximum of y_1² + ⋯ + y_m² under (1))

where the upper and lower limits appearing in (5) are the maximum and minimum of the sum of squared fake values y_1² + ⋯ + y_m² under the constraint of Equation (1).

Here we explain how to obtain the above sufficient condition. Since each fake input value lies in the range [−1, 1] and there are m fake users, Equation (1) is solvable if the sum of fake values it requires falls within [−m, m]. We then need to determine whether Equation (2) is solvable under the constraint of Equation (1): the equation is solvable whenever the range of y_1² + ⋯ + y_m² under this constraint covers the target. To this end, we solve the following optimization problem to find the upper and lower bounds of y_1² + ⋯ + y_m². We first study its maximum, i.e., the minimum of −(y_1² + ⋯ + y_m²).

(6) min −(y_1² + ⋯ + y_m²)   s.t.   y_j ∈ [−1, 1] for all j, and Equation (1) holds

Theorem 1.

Under the constraint of Equation (1), the sum of squared fake values y_1² + ⋯ + y_m² achieves its maximum when all but one of the fake values are set to the endpoints −1 or 1 and the remaining fake value takes the residual value required by the constraint.

Proof.

See Appendix A. ∎

Similarly, we can determine the lower bound of the sum of squared fake values by changing the objective function in (6) from −(y_1² + ⋯ + y_m²) to y_1² + ⋯ + y_m². We omit the detailed steps here but share the result: the minimum is achieved when all fake values are equal, i.e., each is set to the required fake-value sum divided by m. Given the maximum and minimum of the sum of squared fake values, we obtain the sufficient condition in (4) and (5).

4.2.2. Number of fake users m.

The sufficient condition gives the relationship that the target mean and variance should satisfy to launch the attack. Here we further discuss the minimum number of fake users m required to satisfy the sufficient condition given the targets. Note that it is difficult to provide a closed-form expression for the lower bound on m, which depends on the targets and the estimated statistics of the genuine users; these values in turn determine the coefficients of the linear and quadratic terms of m, as well as the directions of inequalities (4) and (5). On the other hand, since the inequalities contain only a linear term and a quadratic term of m, the lower bound on m is easy to solve with the quadratic formula once those quantities are given. We empirically study the minimum number of fake users for given targets in Section 7.2.4.

4.2.3. Error of IPA

Theorem 2 and Theorem 3 present the error of IPA against the SR and PM mechanisms respectively.

Theorem 2 (Error of Input Poisoning Attack on SR).

The errors of the estimated mean and variance after IPA with respect to the target mean and target variance can be bounded by terms that depend on the privacy budget ϵ, the numbers of genuine and fake users, and the attacker's estimation errors of the genuine users' statistics.

Proof.

See Appendix B. ∎

Theorem 3 (Error of Input Poisoning Attack on PM).

The errors of the estimated mean and variance after IPA against PM with respect to the target mean and target variance can likewise be bounded by terms that depend on the privacy budget ϵ, the numbers of genuine and fake users, and the attacker's estimation errors of the genuine users' statistics.

Proof.

See Appendix C. ∎

We find that all errors are data-dependent due to the estimation-error terms. For the attack error on the target mean, when ϵ is small, the error of IPA on the SR mechanism is smaller than that on the PM mechanism; when ϵ is large, the attack against PM performs better because PM introduces less LDP error.

For the target variance, we cannot draw a similar conclusion because the relevant term is data-dependent. If this term is small enough for the SR mechanism, IPA yields better results against SR; likewise for PM.

5. Output Poisoning Attack

5.1. Attack Details

In this section, we propose the output poisoning attack, which crafts the outputs of the local LDP instances so as to set the final estimates to the target mean and variance. Notice that the attacker in OPA has access to the LDP implementation and knows which group each fake user belongs to.

Consider the two groups separately, each containing a number of genuine users with their original inputs and a number of fake users with crafted outputs. Because of the randomness in the LDP output, the objective of OPA is to produce fake output values such that the expected mean and variance estimates equal the attacker-intended target mean and target variance. However, the expectation of the variance estimate is difficult to compute directly because the variance estimator is a nonlinear, data-dependent function of the reports. To address this problem, we slack the attack goal by replacing the expectation of the variance estimate with the variance computed from the expected group-wise means, i.e., the expected mean of the squared values minus the squared target mean. Formally, we intend to achieve the following attack objective in practice:

(7) the expectation of the estimated mean equals the target mean;
(8) the expectation of the estimated mean of the squared values, minus the squared target mean, equals the target variance.

Since the perturbation and aggregation are different for SR and PM, the remainder of this subsection studies how to solve Equations (7) and (8) and generate the fake values accordingly for each mechanism.

5.1.1. OPA against SR

By substituting the genuine users' statistics in Equations (7) and (8) with their estimates n̂, Ŝ1 and Ŝ2, we obtain two equations, (9) and (10), that the fake outputs must satisfy; they involve the transformation coefficients and the lower bound of the input range.

(9)
(10)

In SR, each reported value is one of the mechanism's two possible outputs, a positive extreme value and its negation. Consequently, the attacker can prepare the fake values by determining how many positive and how many negative reports to assign to the fake users. Suppose that in a group there are a fake users reporting the positive output and b fake users reporting the negative output. Per Equations (9) and (10), for the fake users in each group there are two unknowns and two equations; therefore, the attacker can solve for a and b in each group and then randomly assign the corresponding reports to the fake users in the two groups (a sketch of this counting step is given below).
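Below is a sketch of the per-group counting step under our reading of the equations: if a fake users in a group report the positive extreme value +C and b report −C, then a + b must equal the group's number of fake users and (a − b)·C must equal the sum of fake reports that Equations (9) and (10) require for that group. The required sum is taken as an input here, since its derivation is not reproduced above, and C is assumed to be the SR output magnitude (e^ϵ + 1)/(e^ϵ − 1) from Section 2.2.1.

import math

def opa_sr_counts(m_group, required_sum, eps):
    """Decide how many fake users in one group report +C and how many report -C
    so that their reports sum (approximately) to required_sum."""
    C = (math.exp(eps) + 1) / (math.exp(eps) - 1)    # assumed SR output magnitude
    # Solve a + b = m_group and (a - b) * C = required_sum for integer counts.
    a = round((m_group + required_sum / C) / 2)
    a = max(0, min(m_group, a))                      # clamp to a feasible count
    return a, m_group - a                            # (#reports of +C, #reports of -C)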

5.1.2. OPA against PM

In PM, the output value lies in the range [−C, C]. According to Equations (7) and (8), the attacker can calculate the required sums of the fake outputs in each group. An intuitive way to obtain concrete fake values is to divide the required sum of each group by the number of fake users in that group. However, because the fake values generated by this method are all equal, the server can easily detect the fake users: it is statistically unlikely that many genuine users would send exactly the same perturbed value. To address this problem, the attacker first solves the equations using the method described above and then randomly perturbs each value while maintaining the group sum and keeping every value within [−C, C]. Finally, the attacker randomly assigns the values to the fake users in the two groups. A sketch of this generate-and-disperse step is given below.
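The sketch below illustrates the generate-and-disperse step, assuming the required sum of fake reports for a group has already been derived from Equations (9) and (10): it starts from equal values that achieve the sum, then repeatedly shifts value between random pairs so that the sum is preserved and every value stays within [−C, C]. The function name and the number of dispersal rounds are our illustrative choices.

import math
import random

def opa_pm_values(m_group, required_sum, eps, rounds=None):
    """Generate m_group fake PM outputs in [-C, C] summing to required_sum,
    dispersed so that they are not all identical."""
    C = (math.exp(eps / 2) + 1) / (math.exp(eps / 2) - 1)   # assumed PM output bound
    assert abs(required_sum) <= m_group * C, "required sum is not attainable"
    vals = [required_sum / m_group] * m_group    # equal values: correct sum, but easy to detect
    if m_group < 2:
        return vals
    for _ in range(rounds or 10 * m_group):
        i, j = random.sample(range(m_group), 2)
        max_shift = min(C - vals[i], vals[j] + C)  # largest shift keeping both in [-C, C]
        shift = random.uniform(0, max_shift)
        vals[i] += shift                           # pairwise transfer preserves the sum
        vals[j] -= shift
    return vals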

Advantages of OPA by accessing the LDP implementation. By accessing the implementation of the underlying LDP protocol, the attacker can generate and inject poisonous values that are more effective in influencing the server's final estimates. Specifically, the attacker knows how to solve Equations (7) and (8) by leveraging knowledge of the LDP perturbation and aggregation. For example, with access to the related parameters of SR, such as the privacy budget and the transformation coefficients, the attacker can solve Equations (9) and (10) and directly inject fake values into the output domain of the local LDP instance to launch OPA. As a result, OPA generally improves the attack performance, since the attacker effectively circumvents the LDP perturbation for the fake users and thus introduces less noise into the estimation (see the following error analysis).

5.2. Theoretical Analysis

In this subsection, we discuss the sufficient conditions to launch the output poisoning attack, as well as the associated errors and their bounds. We assume that the data values of both groups have been converted into the range [−1, 1].

5.2.1. Sufficient Conditions for OPA

SR mechanism. The sufficient condition to launch OPA is that Equations (9) and (10) are solvable, so that the attacker can produce viable outputs for the local LDP instance and thereby manipulate the estimates on the server. In SR, each output is one of the mechanism's two extreme values; therefore, Equations (9) and (10) are solvable if inequalities (11) and (12) below hold.

(11)
(12)

In practice, the attacker first needs to know whether the conditions are met before launching the attack. However, the per-group user counts are known only after the users are partitioned. To solve this issue, we estimate each group to contain half of the users, since all users are uniformly assigned to the two groups. We then obtain the sufficient conditions by determining the value of m that satisfies (11) and (12).

PM mechanism. The analysis for PM is similar to that for SR. In PM, the output lies in the range [−C, C], where C = (e^(ϵ/2) + 1)/(e^(ϵ/2) − 1). Thus, Equations (9) and (10) are solvable if inequalities (13) and (14) below hold. We again estimate each group to contain half of the users.

(13)
(14)

5.2.2. Number of fake users m.

We discuss the minimum number of fake users m required to satisfy the sufficient condition given the target mean and variance. For reasons similar to IPA, it is difficult to give a closed-form expression for the lower bound on m. However, given the targets and the estimated statistics of the genuine users, we can solve for the lower bound on m such that (11) and (12) (for SR) or (13) and (14) (for PM) hold. Since these inequalities contain only linear terms in m, the lower bound can be derived with simple algebra. We empirically study the minimum number of fake users in Section 7.2.4. The results show that, given the same targets, OPA satisfies the sufficient condition with fewer fake users than IPA.

5.2.3. Error Analysis

Theorem 1 and Theorem 2 present the error of OPA against SR and PM respectively.

Theorem 1 (Error of Output Poisoning Attack against SR).

The errors of the estimated mean and variance after OPA with respect to the target mean and target variance can be bounded by terms that depend on the privacy budget ϵ, the number of users, and the attacker's estimation errors of the genuine users' statistics.