A Shuffling Framework for Local Differential Privacy

06/11/2021 · by Casey Meehan, et al. · University of Wisconsin-Madison · University of California, San Diego

LDP deployments are vulnerable to inference attacks as an adversary can link the noisy responses to their identity and, subsequently, to auxiliary information using the order of the data. An alternative model, shuffle DP, prevents this by shuffling the noisy responses uniformly at random. However, this limits the data learnability – only symmetric functions (agnostic to the input order) can be learned. In this paper, we strike a balance and propose a generalized shuffling framework that interpolates between the two deployment models. We show that systematic shuffling of the noisy responses can thwart specific inference attacks while retaining some meaningful data learnability. To this end, we propose a novel privacy guarantee, d_σ-privacy, that captures the privacy of the order of a data sequence. d_σ-privacy allows tuning the granularity at which the ordinal information is maintained, which formalizes the degree of resistance to inference attacks while trading it off with data learnability. Additionally, we propose a novel shuffling mechanism that can achieve d_σ-privacy and demonstrate the practicality of our mechanism via evaluation on real-world datasets.

1 Introduction

Differential Privacy (DP) and its local variant (LDP) are the most commonly accepted notions of data privacy. LDP has the significant advantage of not requiring a trusted centralized aggregator, and has become a popular model for commercial deployments, such as those of Microsoft [Microsoft], Apple [Apple], and Google [Rappor1, Rappor2, Prochlo]. Its formal guarantee asserts that an adversary cannot infer the value of an individual's private input by observing the noisy output. However, in practice, a vast amount of public auxiliary information, such as address, social media connections, court records, property records [home], income and birth dates [birth], is available for every individual. An adversary with access to such auxiliary information can learn about an individual's private data from several other participants' noisy responses. We illustrate this as follows.
Problem. An analyst runs a medical survey in Alice's community to investigate how the prevalence of a highly contagious disease changes from neighborhood to neighborhood. Community members report a binary value indicating whether they have the disease. Next, consider the following two data reporting strategies.
Strategy 1.

Each data owner passes their data through an appropriate randomizer (that flips the input bit with some probability) on their local device and reports the noisy output to the untrusted data analyst.

Strategy 2. The noisy responses from the local devices of each of the data owners are collected by an intermediary trusted shuffler which dissociates the device IDs (metadata) from the responses and shuffles them uniformly at random before sending them to the analyst. Strategy 1 corresponds to the standard LDP deployment model (for example, Apple and Microsoft's deployments). Here the order of the noisy responses is informative of the identity of the data owners – the noisy response at index 1 corresponds to the first data owner and so on. Thus, each noisy response can be directly linked with its associated device ID and, subsequently, auxiliary information. For instance, an adversary (the analyst and the adversary could be the same; we refer to them separately for ease of understanding) may know the home addresses of the participants and use this to identify the responses of all the individuals from Alice's household. Being highly infectious, all or most of them

(a) Original Data
(b) LDP  (c) Our scheme: (d) Our scheme: (e) Uniform shuffle (f) Attack: LDP (g) Attack: (h) Attack: (i) Attack: unif. shuff.
Figure 1: Demonstration of how our proposed scheme thwarts inference attacks at different granularities. Fig. 0(a) depicts the original sensitive data (such as income bracket) with eight color-coded labels. The position of the points represents public information (such as home address) used to correlate them. There are three levels of granularity: warm vs. cool clusters, blue vs. green and red vs. orange crescents, and light vs. dark within each crescent. Fig. 0(b) depicts LDP. Fig. 0(c) and 0(d) correspond to our scheme, each with (privacy parameter, Def. 3.3). The former uses a smaller distance threshold (, used to delineate the granularity of grouping – see Eq. 3) that mostly shuffles in each crescent. The latter uses a larger distance threshold () that shuffles within each cluster. Figures in the bottom row demonstrate an inference attack (uses Gaussian process correlation) on all four cases. We see that LDP reveals almost the entire dataset (Fig. 0(f)) while uniform shuffling prevents all classification (0(i)). However, the granularity can be controlled with our scheme (Figs. 0(g), 0(h)).

will have the same true value (0 or 1). So, the adversary can reliably infer Alice's value by taking a simple majority vote of her and her household's noisy responses. Note that this does not violate the LDP guarantee, since the inputs are appropriately randomized when observed in isolation. We call such threats inference attacks – recovering an individual's private input using all or a subset of other participants' noisy responses. It is well known that protecting against inference attacks that rely on underlying data correlations is beyond the purview of DP [Pufferfish, DDP, definetti, sok].

Strategy 2 corresponds to the recently introduced shuffle DP model, such as Google’s Prochlo [Prochlo]. Here, the noisy responses are completely anonymized – the adversary cannot identify which LDP responses correspond to Alice and her household. Under such a model, only information that is completely order agnostic (i.e., symmetric functions that can be computed over just the bag of values, such as aggregate statistics) can be extracted. Consequently, the analyst also fails to accomplish their original goal as all the underlying data correlation is destroyed.

Thus, we see that the two models of deployment for LDP present a trade-off between vulnerability to inference attacks and the scope of data learnability. In fact, as demonstrated by Kifer et al. [Kifer], it is impossible to defend against all inference attacks while simultaneously maintaining utility for learning. In the extreme case where the adversary knows that everyone in Alice's community has the same true value (but not which one), no mechanism can prevent revelation of Alice's datapoint short of destroying all utility of the dataset. This raises the question: Can we formally suppress specific inference attacks targeting each data owner while maintaining some meaningful learnability of the private data? Referring back to our example, can we thwart attacks that infer Alice's data using specifically her household's responses and still allow the medical analyst to learn its target trends? Can we offer this to every data owner participating?

In this paper, we strike a balance and propose a generalized shuffle framework for deployment that can interpolate between the two extremes. Our solution is based on the key insight that the order of the data acts as a proxy for the identity of the data owners, as illustrated above. The granularity at which the ordering is maintained formalizes resistance to inference attacks while retaining some meaningful learnability of the private data. Specifically, we guarantee each data owner that their data is shuffled together with a carefully chosen group of other data owners. Revisiting our example, consider uniformly shuffling the responses from Alice's household and her immediate neighbors. Now an adversary cannot use her household's responses to predict her value any better than they could with a random sample of responses from this group. In the same way that LDP prevents reconstruction of her datapoint using specifically her noisy response, this scheme prevents reconstruction of her datapoint using specifically her household's responses. The real challenge is offering such guarantees equally to every data owner. Bob, Alice's neighbor, needs his household's responses shuffled in with his neighbors', as does Luis, a neighbor of Bob who is not Alice's neighbor. In this way, we have data owners with distinct groups that most likely overlap with each other. This disallows the trivial strategy of shuffling the noisy responses of each group uniformly. To this end, we propose shuffling the responses in a systematic manner that tunes the privacy guarantee, trading it off with data learnability. For the above example, our scheme can formally protect each data owner from inference attacks using specifically their household, while still learning how disease prevalence changes across the neighborhoods of Alice's community.

This work offers two key contributions to the machine learning privacy literature:

  • Novel privacy guarantee. We propose a novel privacy definition, d_σ-privacy, that captures the privacy of the order of a data sequence (Sec. 3.2) and formalizes the degree of resistance against inference attacks (Sec. 3.3). Intuitively, d_σ-privacy allows assigning a group, G_i, to each data owner i and protects against inference attacks that utilize the data of any subset of members of G_i. The group assignment is based on public auxiliary information – individuals of a single group are 'similar' w.r.t. the auxiliary information. For instance, the groups can represent individuals in the same age bracket, 'friends' on social media, or individuals living in each other's vicinity (as in the case of Alice in our example). This grouping determines a threshold of learnability – any learning that is order agnostic within a group (disease prevalence in a neighborhood – the data analyst's goal in our example) is allowed, whereas analysis that involves identifying the values of individuals within a group (disease prevalence within specific households – the adversary's goal) is regarded as a privacy threat and protected against. See Fig. 1 for a toy demonstration of how our guarantee allows tuning the granularity at which trends can be learned.

  • Novel shuffling framework. We propose a novel mechanism that shuffles the data systematically and achieves d_σ-privacy. This provides a generalized shuffle framework for deployment that can interpolate between no shuffling (LDP) and uniform random shuffling (shuffle model). Our experimental results (Sec. 4) demonstrate its efficacy against realistic inference attacks.

1.1 Related Work

The shuffle model of DP [Bittau2017, shuffle2, shuffling1] differs from our scheme as follows. These works study the DP amplification benefits of uniformly random shuffling, whereas we study its inferential privacy benefits and generalize it to tunable, non-uniform shuffling (see App. 7.12).

A steady line of work has studied inferential privacy [semantics, Kifer, IP, Dalenius:1977, dwork2010on, sok]. Our work departs from those in that we focus on local inferential privacy and do so via the new angle of shuffling.

Older works such as k-anonymity [kanon], l-diversity [ldiv], Anatomy [anatomy], and others [older1, older2, older3, older4, older5] have studied the privacy risk of non-sensitive auxiliary information, or 'quasi identifiers'. These works focus on the setting of dataset release, whereas we focus on dataset collection; moreover, they do not offer each data owner formal inferential guarantees, whereas this work does.

The De Finetti attack [definetti] shows how shuffling schemes are vulnerable to inference attacks that correlate records together to recover the original permutation of sensitive attributes. A strict instance of our privacy guarantee can thwart such attacks (at the cost of no utility, App. 7.2).

2 Background

Notations. Boldface letters (such as y) denote data sequences (ordered lists); normal-font letters (such as y_i) denote individual values; and calligraphic letters (such as 𝒢) denote sets.

2.1 Local Differential Privacy

The local model consists of a set of data owners and an untrusted data aggregator (analyst); each individual perturbs their data using an LDP algorithm (randomizer) and sends it to the analyst. The LDP guarantee is formally defined as follows.

Definition 2.1.

[Local Differential Privacy, LDP [Warner, Evfimievski:2003:LPB:773153.773174, Kasivi]] A randomized algorithm M: X → Y is ε-locally differentially private (or ε-LDP) if, for any pair of private values x, x′ ∈ X and any subset of outputs W ⊆ Y,

Pr[M(x) ∈ W] ≤ e^ε · Pr[M(x′) ∈ W]    (1)
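To make the guarantee concrete, here is a minimal (illustrative, not from the paper) binary randomized-response randomizer in Python that satisfies ε-LDP; the function name and parameters are ours.

```python
import math
import random

def randomized_response(x: int, eps: float) -> int:
    """Binary randomized response: keep the true bit x in {0, 1} with
    probability e^eps / (e^eps + 1), otherwise flip it. For any pair of
    inputs and any output, the probability ratio is at most e^eps."""
    p_keep = math.exp(eps) / (math.exp(eps) + 1.0)
    return x if random.random() < p_keep else 1 - x

# Each data owner randomizes locally before reporting (Strategy 1).
noisy = [randomized_response(x, eps=1.0) for x in [0, 1, 1, 0, 1]]
```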

In an extension of the local model, known as the shuffle model [shuffling1, shuffle2, blanket], the data owners randomize their inputs as in the local model. Additionally, an intermediate trusted shuffler applies a uniformly random permutation to all the noisy responses before the analyst can view them. The anonymity provided by the shuffler means that less noise is needed than in the local model to achieve the same level of privacy.

2.2 Local Inferential Privacy

Local inferential privacy captures what information a Bayesian adversary [Pufferfish], with some prior, can learn in the LDP setting. Specifically, it measures the largest possible ratio between the adversary's posterior and prior beliefs about an individual's data after observing the mechanism's output.

Definition 2.2.

(Local Inferential Privacy Loss [Pufferfish]) Let x = (x_1, …, x_n) and o denote the input (private) and output sequences (observable to the adversary) in the LDP setting. Additionally, the adversary's auxiliary knowledge is modeled by a prior distribution P on x. The inferential privacy loss for the input sequence x is given by

L(x) = max_{i, x_i, o} log ( Pr[X_i = x_i | O = o] / Pr[X_i = x_i] )    (2)

Bounding L(x) would imply that the adversary's belief about the value of any x_i does not change much even after observing the output sequence o. This means that an informed adversary does not learn much about individual i's private input upon observing the entire released (noisy) dataset.

2.3 Mallows Model

A permutation of the set [n] = {1, …, n} is a bijection σ: [n] → [n]. The set of all permutations of [n] forms the symmetric group S_n. As a shorthand, we use σ(x) to denote applying permutation σ to a data sequence x of length n, so that the value at index i of σ(x) is x_{σ(i)}. Additionally, σ(i) denotes the value at index i in σ and σ^{-1} denotes its inverse. For example, if σ = (2, 3, 1) and x = (x_1, x_2, x_3), then σ(x) = (x_2, x_3, x_1), σ(1) = 2, and σ^{-1} = (3, 1, 2).
The Mallows model is a popular probabilistic model for permutations [MM]. The mode of the distribution is given by the reference permutation π_0 – the probability of a permutation increases as we move 'closer' to π_0, as measured by rank distance metrics such as Kendall's τ distance (Def. 7.1). The dispersion parameter θ controls how fast this increase happens.

Definition 2.3.

For a dispersion parameter θ, a reference permutation π_0 ∈ S_n, and a rank distance measure d, P(θ, π_0, d)(π) = (1/ψ) · e^{-θ · d(π, π_0)} is the Mallows model, where ψ = Σ_{π′ ∈ S_n} e^{-θ · d(π′, π_0)} is a normalization term.
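As an illustration of Def. 2.3 (a sketch using the exponential form written above and Kendall's τ as the rank distance; only feasible for tiny n since it enumerates S_n):

```python
import math
from itertools import combinations, permutations

def kendall_tau(p1: tuple, p2: tuple) -> int:
    """Number of item pairs that the two permutations order differently."""
    return sum(1 for i, j in combinations(range(len(p1)), 2)
               if (p1[i] - p1[j]) * (p2[i] - p2[j]) < 0)

def mallows_pmf(n: int, theta: float, pi0: tuple) -> dict:
    """Brute-force Mallows probabilities P(pi) = exp(-theta * d(pi, pi0)) / psi."""
    weights = {pi: math.exp(-theta * kendall_tau(pi, pi0))
               for pi in permutations(range(n))}
    psi = sum(weights.values())            # normalization term
    return {pi: w / psi for pi, w in weights.items()}

# The reference permutation pi0 is the mode; probability decays with distance from it.
pmf = mallows_pmf(n=3, theta=1.0, pi0=(0, 1, 2))
print(max(pmf, key=pmf.get))               # -> (0, 1, 2)
```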

Figure 2: The trusted shuffler mediates on the noisy response sequence y

3 Data Privacy and Shuffling

In this section, we present d_σ-privacy and a shuffling mechanism capable of achieving the d_σ-privacy guarantee.

3.1 Problem Setting

In our problem setting, we have n data owners, each with a private input x_i (Fig. 2). The data owners first randomize their inputs via an ε-LDP mechanism to generate the noisy responses y = (y_1, …, y_n). We consider an informed adversary with public auxiliary information a_i about each individual. Additionally, just as in the shuffle model, we have a trusted shuffler. It mediates upon the noisy responses and systematically shuffles them based on a (since a is public, it is also accessible to the shuffler) to obtain the final output sequence z (the shuffling corresponds to Alg. 1), which is sent to the untrusted data analyst. Next, we formally discuss the notion of order and its implications.

Definition 3.1.

(Order) The order of a sequence refers to the indices of its set of values and is represented by permutations from S_n.

When the noisy response sequence y is represented by the identity permutation σ_I, the value at index 1 corresponds to y_1 and so on. Standard LDP releases the identity permutation w.p. 1. The output of the shuffler, z, is some permutation of the sequence y, i.e.,

z = σ(y),

where σ is determined via Alg. 1. For example, for σ = (3, 1, 2), the value at index 1 of z now corresponds to y_3, and so on.

3.2 Definition of -privacy

Inferential risk captures the threat of an adversary who infers data owner i's private x_i using all or a subset of the other data owners' released y_j's. Since we cannot prevent all such attacks and maintain utility, our aim is to formally limit which data owners can be leveraged in inferring i's private x_i. To make this precise, each data owner i is assigned a corresponding group, G_i, of data owners. Each G_i consists of all those data owners j who are similar to i w.r.t. the auxiliary information according to some distance measure d_A. Here, we define 'similar' as being under a threshold r:

G_i = { j ∈ [n] : d_A(a_i, a_j) ≤ r },  i ∈ [n]    (3)
Figure 3: An example social media connectivity graph

For example, d_A can be the Euclidean distance if a corresponds to geographical locations, thwarting inference attacks using one's immediate neighbors. If a represents a social media connectivity graph, d_A can measure the path length between two nodes, thwarting inference attacks using specifically one's friends. For the example social media connectivity graph depicted in Fig. 3, with path length as the distance metric and r = 1, each group G_i consists of data owner i together with their immediate friends, and so on.
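A minimal sketch of the group construction in Eq. 3, assuming numeric auxiliary features compared with Euclidean distance (the helper name and threshold value are illustrative):

```python
import math

def build_groups(aux: list, r: float) -> list:
    """Group assignment of Eq. 3 (sketch): data owner j belongs to G_i whenever
    the auxiliary records a_i and a_j are within distance r of each other.
    Euclidean distance over 2-D auxiliary features is assumed here."""
    n = len(aux)
    return [{j for j in range(n) if math.dist(aux[i], aux[j]) <= r}
            for i in range(n)]

# Example: home locations as public auxiliary information, threshold r = 0.5.
aux = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(build_groups(aux, r=0.5))   # -> [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
```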

Intuitively, d_σ-privacy protects against inference attacks that leverage correlations at a finer granularity than the groups. In other words, under d_σ-privacy, one subset of a data owner's group (e.g., their household) is no more useful for targeting them than any other subset of the group (e.g., some combination of neighbors). This leads to the following key insight for the formal privacy definition.

Key Insight. Formally, our privacy goal is to prevent the leakage of ordinal information from within a group. We achieve this by systematically bounding the dependence of the mechanism’s output on the relative ordering (of data values corresponding to the data owners) within each group.
First, we introduce the notion of neighboring permutations.

Definition 3.2.

(Neighboring Permutations) Given a group assignment 𝒢 = {G_1, …, G_n}, two permutations σ, σ′ ∈ S_n are defined to be neighboring w.r.t. a group G ∈ 𝒢 (denoted as σ ≈_G σ′) if σ(i) = σ′(i) for all i ∉ G.

Neighboring permutations differ only on the indices of their corresponding group G. For example, two permutations that agree on every index outside G_1 are neighboring w.r.t. G_1 (Fig. 3), since they differ only within G_1. We denote the set of all pairs of neighboring permutations as

N_𝒢 = { (σ, σ′) ∈ S_n × S_n : σ ≈_G σ′ for some G ∈ 𝒢 }    (4)
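A minimal check of Def. 3.2 for permutations represented as 0-indexed Python lists (illustrative helper, not from the paper):

```python
def are_neighboring(sigma: list, sigma_prime: list, group: set) -> bool:
    """Def. 3.2 (sketch): two permutations are neighboring w.r.t. a group G
    if they agree on every index outside G (0-indexed here)."""
    return all(sigma[i] == sigma_prime[i]
               for i in range(len(sigma)) if i not in group)

# The permutations below differ only on indices {0, 1}, so they are
# neighboring w.r.t. G = {0, 1} but not w.r.t. G = {2, 3}.
print(are_neighboring([1, 0, 2, 3], [0, 1, 2, 3], {0, 1}))  # True
print(are_neighboring([1, 0, 2, 3], [0, 1, 2, 3], {2, 3}))  # False
```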

Now, we formally define -privacy as follows.

Definition 3.3 (d_σ-privacy).

For a given group assignment 𝒢 on a set of n entities and a privacy parameter ε > 0, a randomized mechanism M is (ε, 𝒢)-d_σ private if, for all groups G ∈ 𝒢, all pairs of neighboring permutations σ ≈_G σ′, all input sequences x, and any subset of outputs O, we have

Pr[M(σ(x)) ∈ O] ≤ e^ε · Pr[M(σ′(x)) ∈ O]    (5)

σ(x) and σ′(x) are defined to be neighboring sequences.

d_σ-privacy states that, for any group G, the mechanism is (almost) agnostic of the order of the data within the group. Even after observing the output, an adversary cannot learn much about the relative ordering of the data within any group. Thus, two neighboring sequences are essentially indistinguishable to an adversary.
An important property of d_σ-privacy is that post-processing computations on the output of a d_σ-private algorithm do not degrade privacy. Additionally, when applied multiple times, the privacy guarantee degrades gracefully. Both properties are analogous to those of DP, and the formal theorems are presented in App. 7.3. Interestingly, LDP mechanisms achieve a weak degree of d_σ-privacy.

Lemma 3.1.

An ε-LDP mechanism is (ε′, 𝒢)-d_σ private for any group assignment 𝒢, where ε′ depends on ε and 𝒢 (proof in App. 7.3).

3.3 Privacy Implications

We now turn to -privacy’s semantic guarantees: what can/cannot be learned from the released sequence ? The group assignment delineates a threshold of learnability as follows.

  • Learning allowed.

    d_σ-privacy can answer queries that are order agnostic within groups, such as aggregate statistics of a group. In Alice's case, the analyst can estimate the disease prevalence in her neighborhood.

  • Learning disallowed. Adversaries cannot identify the (noisy) values of individuals within any group. While they may learn the disease prevalence in Alice's neighborhood, they cannot determine the prevalence within her household and use that to target her value.

Consider any Bayesian adversary with a prior P on the joint distribution of the noisy responses y, modeling their beliefs about the correlation between participants in the dataset (such as the correlation between Alice's and her household's disease status). As with early DP works, such as [dwork_early], we consider an informed Bayesian adversary. With DP, informed adversaries know the private input of every data owner but one. With d_σ-privacy, the informed adversary knows the assignment of noisy values outside G_i, y_{-G_i}, and the unordered bag of noisy values in G_i, B_{G_i}. Formally,

Theorem 3.2.

For a given group assignment 𝒢 on a set of n data owners, if a shuffling mechanism is (ε, 𝒢)-d_σ private, then for each data owner i,

| log ( Pr[y_i = a | z, y_{-G_i}, B_{G_i}] / Pr[y_i = a | y_{-G_i}, B_{G_i}] ) | ≤ ε

for a prior distribution P, where B_{G_i} denotes the unordered bag of noisy values in G_i and y_{-G_i} is the noisy sequence for all data owners outside G_i (proof in App. 7.4).

The above privacy loss variable differs slightly from that of Def. 2.2, since the informed adversary already knows y_{-G_i} and the unordered bag B_{G_i}. Equivalently, this bounds the prior-posterior odds gap on the assignment of noisy values within G_i.

We illustrate this with the following example on G_1 (Fig. 3). The adversary knows (i.e., has a prior encoding) that data owner 1 is strongly correlated with their close friends in G_1. Under d_σ-privacy, the adversary cannot distinguish between two neighboring sequences w.r.t. G_1, i.e., two sequences that assign the same bag of values to the members of G_1 but in different orders. Hence, after seeing the shuffled sequence, the adversary only learns the 'bag' of values in G_1 and cannot specifically leverage data owner 1's immediate friends' responses to target them. However, analysts may still answer queries that are order-agnostic in G_1, which could not be achieved with uniform shuffling.

Note. By the post-processing [Dwork] property of LDP, the shuffled sequence retains the ε-LDP guarantee. The granularity of the group assignment, determined by the distance threshold r, and the privacy parameter ε act as control knobs over the privacy spectrum. For instance, w.r.t. Fig. 3, when each group contains only its own data owner the problem reduces to the pure LDP setting, while when a group spans all the data owners we recover uniform random shuffling (the standard shuffle model). All other pairs (r, ε) represent intermediate points in the privacy spectrum achievable via d_σ-privacy.

3.4 Utility of a Shuffling Mechanism

We now introduce a novel metric, (c, p)-preservation, for assessing the utility of any shuffling mechanism. Let S ⊆ [n] correspond to a set of indices in x. The metric is defined as follows.

Definition 3.4.

((c, p)-preservation) A shuffling mechanism M is defined to be (c, p)-preserving w.r.t. a given subset S ⊆ [n] if

Pr_{σ ∼ M}[ |σ(S) ∩ S| ≥ c · |S| ] ≥ p    (6)

where σ(S) = {σ(i) : i ∈ S} and c, p ∈ [0, 1].

For example, consider S = {1, 2, 3, 4}. If M permutes the output according to a σ with σ(S) = {2, 4, 3, 7}, then σ(S) ∩ S = {2, 3, 4}, which preserves 3 out of 4, or 75%, of its original indices. This means that for any data sequence x, at least a c fraction of the data values at the indices in S overlap with those of the shuffled sequence with probability at least p. Letting x_S and z_S denote the multisets of data values corresponding to S in the data sequences x and z respectively, we have |x_S ∩ z_S| ≥ c · |S| with probability at least p.

For example, let S be the set of individuals from Nevada. Then, for a shuffling mechanism that provides (c, p)-preservation to S, with probability at least p, at least a c fraction of the values that are reported to be from Nevada in z are genuinely from Nevada. The rationale behind this metric is that it captures the utility of the learning allowed by d_σ-privacy – if S is equal to some group G, preservation allows overall statistics of G to be captured. Note that this utility metric is agnostic of both the data distribution and the analyst's query. Hence, it is a conservative analysis of utility which serves as a lower bound for learning from z.
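The metric can be estimated empirically for any concrete shuffler. The sketch below assumes the formalization of Eq. 6 given above (preserving at least a c-fraction of a subset's indices with probability p) and estimates p by Monte Carlo:

```python
import random
from typing import Callable, Set

def estimate_preservation(shuffle: Callable[[int], list], n: int,
                          subset: Set[int], c: float, trials: int = 1000) -> float:
    """Monte Carlo estimate of the probability p that a shuffling mechanism keeps
    at least a c-fraction of `subset`'s indices inside `subset` (Eq. 6, under the
    formalization assumed above). `shuffle(n)` must return a permutation of
    range(n), i.e. sigma as a list with sigma[i] = source index of output i."""
    hits = 0
    for _ in range(trials):
        sigma = shuffle(n)
        preserved = sum(1 for i in subset if sigma[i] in subset)
        if preserved >= c * len(subset):
            hits += 1
    return hits / trials

# Baseline: a uniformly random shuffler (the standard shuffle model).
uniform = lambda n: random.sample(range(n), n)
print(estimate_preservation(uniform, n=100, subset=set(range(10)), c=0.5))
```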

Input: LDP sequence y; public aux. info. a; dist. threshold r; priv. param. ε
Output: z – shuffled output sequence
1. Compute the group assignment 𝒢 from a and r (Eq. 3);
2. Construct graph H with
    a) vertices V = [n]
    b) edges E = { (i, j) : i, j ∈ G for some G ∈ 𝒢 };
3. v* ← vertex of H with the highest degree;
4. π_0 ← breadth first search traversal order of H starting from v*;
5. Δ ← sensitivity of the rank distance measure d w.r.t. 𝒢 and π_0;
6. θ ← dispersion parameter computed from ε and Δ;
7. Sample π ∼ P(θ, π_0, d) (Mallows model, Def. 2.3);
8. σ ← π ∘ π_0^{-1} (apply the inverse reference permutation to π);
9. z ← σ(y);
10. Return z;
Algorithm 1: d_σ-private Shuffling Mechanism

3.5 -private Shuffling Mechanism

We now describe our novel shuffling mechanism that can achieve d_σ-privacy. In a nutshell, our mechanism samples a permutation from a suitable Mallows model and shuffles the data sequence accordingly. We can characterize the d_σ-privacy guarantee of our mechanism in the same way as the DP guarantee of classic mechanisms [Dwork] – with variance and sensitivity. Intuitively, a larger dispersion parameter θ (Def. 2.3) reduces randomness over permutations, increasing utility and increasing (worsening) the privacy parameter ε. The maximum value of θ for a given ε guarantee depends on the sensitivity of the rank distance measure d over all neighboring permutations. Formally, we define the sensitivity Δ as the maximum change in distance from the reference permutation π_0 between any pair of neighboring permutations after applying π_0 to them. The privacy parameter ε of the mechanism is then proportional to its sensitivity Δ.

Given 𝒢 and a reference permutation π_0, the sensitivity of a rank distance measure depends on the width ω, which measures how 'spread apart' the members of any group of 𝒢 are in π_0. The sensitivity is an increasing function of the width; this holds, for instance, for Kendall's τ distance. If a reference permutation clusters the members of each group closely together (low width), the groups are more likely to permute within themselves. This has two benefits. First, if a group is likely to shuffle within itself, it will have better (c, p)-preservation (see App. 7.10 for a demonstration). Second, for the same θ (θ is an indicator of utility as it determines the dispersion of the sampled permutation), a lower value of the width gives a lower ε (better privacy).
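For intuition, the sketch below computes one plausible formalization of the width (assumed here, not quoted from the paper): the spread of a group's members' positions in the reference permutation π_0.

```python
def group_width(pi0: list, group: set) -> int:
    """One plausible formalization of the 'width' of a group under a reference
    permutation pi0 (an assumption of this sketch): the spread of the group
    members' positions in pi0, i.e. the difference between the largest and
    smallest position at which a member appears, plus one."""
    positions = [pi0.index(i) for i in group]
    return max(positions) - min(positions) + 1

# Example: pi0 places group members {2, 5} at positions 1 and 2 -> width 2,
# whereas {0, 4} sit at positions 0 and 4 -> width 5.
pi0 = [0, 2, 5, 3, 4, 1]
print(group_width(pi0, {2, 5}), group_width(pi0, {0, 4}))
```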

Unfortunately, minimizing the width is an NP-hard problem (Thm. 7.3 in App. 7.6). We instead estimate the optimal π_0 using the following heuristic approach based on a graph breadth first search.

Algorithm Description. Alg. 1 above proceeds as follows. We first compute the group assignment, 𝒢, based on the public auxiliary information a and the desired distance threshold r, following Eq. 3 (Step 1). Then we construct the reference permutation π_0 with a breadth first search (BFS) graph traversal.

We translate 𝒢 into an undirected graph H, where the vertices are the indices [n] and two indices are connected by an edge if they are both in some group (Step 2). Next, π_0 is computed via a breadth first search traversal (Step 4) – if the t-th node in the traversal is j, then π_0(t) = j. The rationale is that the neighbors of j (members of G_j) are traversed in close succession. Hence, a neighboring node is likely to be traversed at some step near t, which means its position in π_0 is close to that of j (resulting in low width). Additionally, starting from the node with the highest degree (Steps 3-4), which corresponds to the largest group in 𝒢 (a lower bound on the width for any π_0), helps to curtail the maximum width in π_0.

This is followed by the computation of the dispersion parameter, θ, for our Mallows model (Steps 5-6). Next, we sample a permutation π from the Mallows model (Step 7) and apply the inverse reference permutation π_0^{-1} to it to obtain the desired permutation σ for shuffling (Step 8). Recall that π is (most likely) close to π_0, which is unrelated to the original order of the data. Applying π_0^{-1} therefore brings π back to a shuffled version of the original order (the identity permutation σ_I). Note that since Alg. 1 is publicly known, the adversary/analyst knows π_0. Hence, even if this step were absent from our algorithm, the adversary/analyst could perform it anyway. Finally, we permute y according to σ and output the result (Steps 9-10).
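Putting the pieces together, the following Python sketch mirrors the described steps of Alg. 1 (group graph, BFS reference order, Mallows sampling). The bookkeeping conventions – how the dispersion parameter is supplied and how the sampled permutation is composed with π_0 – are simplifying assumptions of this sketch rather than the paper's exact choices.

```python
import math
import random
from collections import deque

def mallows_sample(n: int, theta: float) -> list:
    """Repeated-insertion sampler for a Mallows model over S_n centered at the
    identity permutation under Kendall's tau distance:
    Pr[rho] is proportional to exp(-theta * d_K(rho, identity))."""
    rho = []
    for i in range(n):
        # Inserting item i at slot j (0 <= j <= i) creates exactly (i - j) inversions.
        weights = [math.exp(-theta * (i - j)) for j in range(i + 1)]
        r = random.random() * sum(weights)
        j = 0
        while j < i and r > weights[j]:
            r -= weights[j]
            j += 1
        rho.insert(j, i)
    return rho

def bfs_reference(groups: list) -> list:
    """Steps 2-4: build the group graph (edge iff two indices share a group) and
    return a BFS traversal order pi0, starting from the highest-degree vertex,
    so that members of the same group end up at nearby positions."""
    n = len(groups)
    adj = [set() for _ in range(n)]
    for g in groups:
        for i in g:
            adj[i] |= g - {i}
    order, seen = [], set()
    for start in sorted(range(n), key=lambda v: -len(adj[v])):
        if start in seen:
            continue                       # also covers disconnected components
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v]):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order

def d_sigma_shuffle(y: list, groups: list, theta: float) -> list:
    """Sketch of the shuffling step: exchange noisy responses of data owners
    that are adjacent in the BFS order pi0 (hence mostly within groups), with
    locality controlled by theta (in Alg. 1, Steps 5-6 calibrate theta from
    epsilon and the sensitivity; here it is taken as an input)."""
    n = len(y)
    pi0 = bfs_reference(groups)                       # BFS rank -> data owner
    rank = {owner: t for t, owner in enumerate(pi0)}  # data owner -> BFS rank
    rho = mallows_sample(n, theta)                    # noisy permutation of ranks
    sigma = [pi0[rho[rank[i]]] for i in range(n)]     # map back to owner indices
    return [y[sigma[i]] for i in range(n)]

# Example: two groups {0, 1, 2} and {3, 4}; responses mostly shuffle within groups.
groups = [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}, {3, 4}, {3, 4}]
print(d_sigma_shuffle(["a", "b", "c", "d", "e"], groups, theta=1.0))
```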

Theorem 3.3.

Alg. 1 is (ε, 𝒢)-d_σ private, where ε is determined by the dispersion parameter θ and the sensitivity Δ of the rank distance measure.

The proof is in App. 7.8. Note that Alg. 1 provides the same level of privacy for any two group assignments as long as they have the same sensitivity. This leads to the following theorem, which generalizes the privacy guarantee to any group assignment.

Theorem 3.4.

Alg. 1 satisfies d_σ-privacy for any group assignment 𝒢′, with a privacy parameter determined by the corresponding sensitivity.

The proof is in App. 7.9. A utility theorem for Alg. 1 that formalizes its (c, p)-preservation for the Hamming distance (chosen for ease of numerical computation) is in App. 7.10.

Note. Producing the reference permutation π_0 is completely data (y) independent. It only requires access to the public auxiliary information a. Hence, these steps can be performed in a pre-processing phase and do not contribute to the actual running time. See App. 7.7 for an illustration of Alg. 1.

4 Evaluation

(a) PUDF: Attack (b) Adult: Attack (c) Twitch: Attack (d) Adult: Attack (varying ε) (e) PUDF: Learnability (f) Adult: Learnability (g) Twitch: Learnability (h) Syn: Learnability
Figure 4: Our scheme interpolates between standard LDP (orange line) and uniform shuffling (blue line) in both privacy and data learnability. All plots increase the group size along the x-axis (except (d)). (a)-(c): The fraction of participants vulnerable to an inference attack. (d): Attack success with varying ε for a fixed group size. (e)-(g): The accuracy of a calibration model trained on the shuffled outputs to predict the distribution of LDP outputs at any point, such as the distribution of medical insurance types used specifically in the Houston area (not possible when uniformly shuffling across Texas). (h): Test accuracy of a classifier trained on the shuffled outputs for the synthetic dataset in Fig. 1.

The previous sections describe how our shuffling framework interpolates between standard LDP and uniform random shuffling. We now evaluate this experimentally, asking the following two questions:

Q1. Does the Alg. 1 mechanism protect against realistic inference attacks?
Q2. How well can Alg. 1 tune a model's ability to learn trends within the shuffled data, i.e., tune data learnability?

We evaluate on four datasets. We are not aware of any prior work that provides comparable local inferential privacy. Hence, we baseline our mechanism against the two extremes: standard LDP and uniform random shuffling. For concreteness, we detail our procedure with the PUDF dataset [PUDF] (license), which comprises psychiatric patient records from Texas. Each data owner's sensitive value is their medical payment method, which is reflective of socioeconomic class (such as Medicaid or charity). The public auxiliary information is the hospital's geolocation. Such information is used in practice to understand how payment methods (and payment amounts) vary from town to town for insurances [insurance]. Uniform shuffling across Texas precludes such analyses. Standard LDP risks inference attacks, since patients attending hospitals in the same neighborhood have similar socioeconomic standing and use similar payment methods, allowing an adversary to correlate their noisy responses. To trade these off, we apply Alg. 1 with d_A being the distance (km) between hospitals and Kendall's τ as the rank distance measure for permutations.

Our inference attack predicts data owner i's value by taking a majority vote over the noisy values of the data owners who are within the distance threshold of i and who are most similar to i w.r.t. some additional privileged auxiliary information. For PUDF, this includes the data owners who attended hospitals within the distance threshold (in km) of i's hospital and are most similar in payment amount. Using a randomized response mechanism, we resample the LDP sequence 50 times and apply Alg. 1's chosen permutation to each, producing 50 shuffled sequences. We then mount the majority vote attack on each sequence for each data owner. If the attack on a given data owner is successful across most of these LDP trials, we mark that data owner as vulnerable – although they randomize with LDP, there is a substantial chance that a simple inference attack can recover their true value. We record the fraction of vulnerable data owners and report 1-standard-deviation error bars over 10 trials.
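A simplified sketch of the majority-vote inference attack described above; selecting each target's peer indices from the privileged auxiliary information is left to the caller.

```python
from collections import Counter

def majority_vote_attack(z: list, candidates: dict) -> dict:
    """Sketch of the inference attack: predict each targeted data owner's bit by
    a majority vote over the released (possibly shuffled) responses at the
    indices of their 'similar' peers. `candidates[i]` holds the peer indices
    chosen (e.g. from privileged auxiliary information) to target owner i."""
    predictions = {}
    for i, peers in candidates.items():
        votes = Counter(z[j] for j in peers)
        predictions[i] = votes.most_common(1)[0][0]
    return predictions

# Example: target owner 0 using peers {1, 2, 3} (e.g., their household).
print(majority_vote_attack([1, 1, 0, 1, 0], {0: [1, 2, 3]}))  # {0: 1}
```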

Additionally, we evaluate data learnability – how well the underlying statistics of the dataset are preserved in the shuffled output z. For PUDF, this means training a model on the shuffled z to predict the distribution of payment methods used near, for instance, Houston. For this, we train a calibrated model, Cal, on the shuffled outputs, mapping a location to a distribution over the domain of sensitive attributes. We implement Cal as a gradient boosted decision tree (GBDT) model [gradientboosting] calibrated with Platt scaling [calibration]. For each location, we treat the empirical distribution of values near that location as the ground truth distribution there. Then, for each location, we measure the Total Variation (TV) error between the predicted and ground truth distributions. We report the average TV error of the predicted distributions, normalized by the TV error of naively guessing the uniform distribution at each location. With standard LDP, this task can be performed relatively well, at the risk of inference attacks. With uniformly shuffled data, it is impossible to make geographically localized predictions unless the distribution of payment methods is identical in every Texas locale.
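A sketch of the reported learnability metric under the formalization described above (average total-variation error of the predicted per-location distributions, normalized by the error of the uniform guess):

```python
import numpy as np

def normalized_tv_error(pred: np.ndarray, truth: np.ndarray) -> float:
    """Average total-variation error of predicted per-location distributions,
    normalized by the TV error of always guessing the uniform distribution.
    `pred` and `truth` have shape (num_locations, num_classes); rows sum to 1."""
    tv = 0.5 * np.abs(pred - truth).sum(axis=1)           # TV error per location
    uniform = np.full_like(truth, 1.0 / truth.shape[1])
    tv_uniform = 0.5 * np.abs(uniform - truth).sum(axis=1)
    return float(tv.mean() / tv_uniform.mean())

# Example: two locations, three payment methods.
truth = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
pred = np.array([[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
print(normalized_tv_error(pred, truth))   # < 1 means better than the uniform guess
```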

We additionally perform the above experiments on the following three datasets:

  • Adult [adult]. This dataset is derived from the 1994 Census and has ~48K records. Whether an individual's annual income exceeds $50K is considered private. The public auxiliary information is age, and the privileged auxiliary information is the individual's marital status.

  • Twitch [twitch]. This dataset, gathered from the Twitch social media platform, includes a graph of edges (mutual friendships) along with node features. A user's history of explicit language use is private. The public auxiliary information is a user's mutual friendships, i.e., the user's row of the graph's adjacency matrix. We do not have any privileged auxiliary information here, and select the 25 nearest neighbors randomly.

  • Syn. This is a synthetic dataset which can be classified at three granularities – 8-way, 4-way, and 2-way (Fig. 1(a) shows a scaled-down version of the dataset). The eight color labels are private; the 2D positions are public. For learnability, we measure the accuracy of 8-way, 4-way, and 2-way GBDT models trained on the shuffled outputs and evaluated on an equal-sized test set at each distance threshold.

Experimental Results.
Q1. Our formal guarantee on the inferential privacy loss (Thm. 3.2) is described w.r.t. a 'strong' adversary (with access to the noisy values outside the group and the bag of values within it). Here, we test how well our proposed scheme (Alg. 1) protects against inference attacks on real-world datasets without any such assumptions. Additionally, to make the attack more realistic, the adversary has access to extra privileged auxiliary information which is not used by Alg. 1. Figs. 4(a)-4(c) show that our scheme significantly reduces the attack efficacy. For instance, the fraction of vulnerable data owners is substantially reduced at the attack distance threshold for PUDF. Additionally, our scheme's vulnerability varies from that of LDP (minimum privacy) to that of uniform shuffling (maximum privacy) with increasing distance threshold (equivalently, group size, as in Fig. 4(c)), thereby spanning the entire privacy spectrum. (Our scheme gives lower vulnerability than LDP even at the smallest threshold because the resulting groups are non-singletons; for instance, for PUDF, a group includes all individuals with the same zipcode.) As expected, the attack efficacy decreases with a decreasing privacy parameter ε (Fig. 4(d)).

Q2. Figs. 4(e)-4(g) show that the learnability error varies from that of LDP (maximum learnability) to that of uniform shuffling (minimum learnability) with increasing distance threshold (equivalently, group size), thereby providing tunability. Interestingly, for Adult, our scheme reduces the attack vulnerability while achieving nearly the same learnability error as LDP (Fig. 4(f)). Fig. 4(h) shows that the distance threshold defines the granularity at which the data can be classified. LDP allows 8-way classification while uniform shuffling allows none. The granularity of classification can be tuned by our scheme – successive distance thresholds mark the boundaries for 8-way, 4-way, and 2-way classification, respectively. Experiments evaluating (c, p)-preservation are in App. 7.11.

5 Conclusion

In this paper, we propose a generalized shuffling framework that interpolates between standard LDP and uniform random shuffling. We establish a new privacy definition, d_σ-privacy, which casts new light on the inferential privacy benefits of shuffling.

6 Acknowledgements

KC and CM would like to thank ONR under N00014-20-1-2334 and UC Lab Fees under LFR 18-548554 for research support.

References

7 Appendix

7.1 Background Contd.

Here we define two rank distance measures.

Definition 7.1 (Kendall’s Distance).

For any two permutations σ_1, σ_2 ∈ S_n, the Kendall's τ distance d_K(σ_1, σ_2) counts the number of pairwise disagreements between σ_1 and σ_2, i.e., the number of item pairs that have one relative order in one permutation and a different order in the other. Formally,

d_K(σ_1, σ_2) = |{ (i, j) : i < j, (σ_1(i) − σ_1(j))(σ_2(i) − σ_2(j)) < 0 }|    (7)

For example, if σ_1 = (1, 2, 3) and σ_2 = (3, 1, 2), then d_K(σ_1, σ_2) = 2.

Next, the Hamming distance measure is defined as follows.

Definition 7.2 (Hamming Distance).

For any two permutations σ_1, σ_2 ∈ S_n, the Hamming distance d_H(σ_1, σ_2) counts the number of positions in which the two permutations disagree. Formally,

d_H(σ_1, σ_2) = |{ i ∈ [n] : σ_1(i) ≠ σ_2(i) }|

Repeating the above example, if σ_1 = (1, 2, 3) and σ_2 = (3, 1, 2), then d_H(σ_1, σ_2) = 3.
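For reference, both rank distances can be computed directly for permutations represented as 0-indexed Python lists:

```python
from itertools import combinations

def kendall_tau(s1: list, s2: list) -> int:
    """Kendall's tau distance (Def. 7.1): number of item pairs ordered
    differently by the two permutations."""
    return sum(1 for i, j in combinations(range(len(s1)), 2)
               if (s1[i] - s1[j]) * (s2[i] - s2[j]) < 0)

def hamming(s1: list, s2: list) -> int:
    """Hamming distance (Def. 7.2): number of positions where the two
    permutations disagree."""
    return sum(1 for a, b in zip(s1, s2) if a != b)

# Matches the worked example above (0-indexed representation of (1,2,3), (3,1,2)).
print(kendall_tau([0, 1, 2], [2, 0, 1]), hamming([0, 1, 2], [2, 0, 1]))  # 2 3
```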

7.2 d_σ-privacy and the De Finetti attack

We now show that a strict instance of d_σ-privacy is sufficient for thwarting any de Finetti attack [definetti] on individuals. The de Finetti attack involves a Bayesian adversary who, assuming some degree of correlation between data owners, attempts to recover the true permutation from the shuffled data. As written, the de Finetti attack assumes the sequence of sensitive attributes and side information is exchangeable: any ordering of them is equally likely. By the de Finetti theorem, this implies that they are i.i.d. conditioned on some latent measure. To balance privacy with utility, the sequence of sensitive attributes is non-uniformly randomly shuffled w.r.t. the sequence of side information, producing a shuffled sequence, which the adversary observes. Conditioning on the shuffled sequence, the adversary updates their posterior on the latent measure (i.e., their posterior on a model predicting the sensitive attribute from the side information), and thereby their posterior predictive on the true sensitive values. The definition of privacy in [definetti] holds that the adversary's posterior beliefs must remain close to their prior beliefs under some metric on distributions.

We now translate the de Finetti attack to our setting. First, to align notation with the rest of the paper, we provide privacy to the sequence of noisy values y, since we shuffle those instead of the sensitive values as in [definetti]. We use max divergence (the multiplicative bound on events used in DP) as the metric between the adversary's prior and posterior, which, for compactness, we write as

e^{-ε} ≤ Pr[E | z] / Pr[E] ≤ e^{ε}  for all events E    (8)

We restrict ourselves to shuffling mechanisms, where we only randomize the order of the sensitive values. By learning the unordered values alone, an adversary may have arbitrarily large updates to its posterior (e.g., if all values are identical), breaking the privacy requirement above. With this in mind, we assume the adversary already knows the unordered sequence of values (which they will learn anyway) and has a prior on the permutations allocating values from that sequence to individuals. We then generalize the de Finetti problem to an adversary who has an arbitrary prior on the true permutation σ and observes a randomized permutation σ_out from the shuffling mechanism. We require that the adversary's prior belief that σ assigns element j to index i is close to their posterior belief, for all i and j:

e^{-ε} ≤ Pr[σ ∈ Σ_{i→j} | σ_out] / Pr[σ ∈ Σ_{i→j}] ≤ e^{ε}    (9)

where Σ_{i→j} is the set of permutations assigning element j to index i. Conditioning on any unordered sequence with all unique values, the above condition is necessary to satisfy Eq. (8) for events of the form {y_i = v}, since each such event equals {σ ∈ Σ_{i→j}} for some j. For any unordered sequence with repeated values, it is sufficient, since the probability of such an event is the sum of probabilities of disjoint events of the form {σ ∈ Σ_{i→j}} over various j.

We now show that a strict instance of d_σ-privacy satisfies Eq. (9). Let 𝒢* be any group assignment such that at least one group includes all data owners, i.e., [n] ∈ 𝒢*.

Property 1.

An (ε, 𝒢*)-d_σ-private shuffling mechanism satisfies Eq. (9) for all i, j and all priors on permutations.

Proof.
Lemma 1.

For any prior on permutations, Eq. (9) is equivalent to the condition

(10)

where Σ̄_{i→j} denotes the complement of Σ_{i→j}.

Under the grouping 𝒢*, every permutation σ neighbors every other permutation σ′, i.e., σ ≈_G σ′ for the group G containing all data owners. By the definition of d_σ-privacy, we have that for any observed permutation σ_out output by the mechanism:

Pr[σ_out | σ] ≤ e^{ε} · Pr[σ_out | σ′]  for all σ, σ′ ∈ S_n.
This implies Eq. 10. Thus, (ε, 𝒢*)-d_σ-privacy implies Eq. 10, which implies Eq. 9, thus proving the property. ∎

Using Lemma 1, we may also show that this strict instance of d_σ-privacy is necessary to block all de Finetti attacks:

Property 2.

An (ε, 𝒢*)-d_σ-private shuffling mechanism is necessary to satisfy Eq. (9) for all i, j and all priors on permutations.

Proof.

If our mechanism is not (ε, 𝒢*)-d_σ-private, then for some pair of true (input) permutations σ_1, σ_2 and some released permutation σ_out, we have that

Pr[σ_out | σ_1] > e^{ε} · Pr[σ_out | σ_2].

Under 𝒢*, all permutations neighbor each other, so σ_1 and σ_2 are neighboring. Since σ_1 ≠ σ_2, there exist some i and j such that σ_1 ∈ Σ_{i→j} and σ_2 ∉ Σ_{i→j}: one of the two permutations assigns some element j to some index i and the other does not. Given this, we may construct a bimodal prior on the true permutation that assigns half its probability mass to σ_1 and the rest to σ_2. Therefore, for the released permutation σ_out, the RHS of Eq. 10 is 1, and the LHS is