
Confidence-Ranked Reconstruction of Census Microdata from Published Statistics

11/06/2022
by Travis Dick, et al.

A reconstruction attack on a private dataset D takes as input some publicly accessible information about the dataset and produces a list of candidate elements of D. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of D from aggregate query statistics Q(D)∈ℝ^m, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset D was sampled, demonstrating that they are exploiting information in the aggregate statistics Q(D), and not simply the overall structure of the distribution. In other words, the queries Q(D) are permitting reconstruction of elements of this dataset, not the distribution from which D was drawn. These findings are established both on 2010 U.S. decennial Census data and queries, and on Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.


1 Preliminaries

In this section, we describe our algorithm, our evaluation metrics, and the baselines we use for comparison.

1.1 Reconstruction Attack

A dataset is a multiset of records from a discrete domain $\mathcal{X}$. Each item in the multiset is called a row. We use $D$ to denote a private dataset that is the target of a reconstruction attack. A reconstruction attack takes as input aggregate statistics computed from the dataset $D$ (and, in the case of the attacks we present, a possibly uniformly random seed dataset), and outputs a set of candidate rows, ranked according to the confidence that they appear in $D$. This confidence-ordered set of rows is denoted $\hat{C} = (\hat{c}_1, \hat{c}_2, \ldots)$, where the index $i$ in $\hat{c}_i$ determines the confidence ranking (thus $\hat{c}_1$ corresponds to the row that we are most confident in, $\hat{c}_2$ to the row that we have the next most confidence in, and so on). Our rankings will be obtained from attacks that produce a multiset of rows. Elements appearing in the multiset are then ordered according to their frequency in it. We let $\hat{C}$ denote the resulting ordered set (not multiset) of rows.

To measure the performance of a reconstruction attack, we introduce the following metric that measures its accuracy at different confidence thresholds. For any private target dataset $D$, the top-$k$ match-rate of the confidence set $\hat{C}$ is the fraction of rows ranked from $1$ to $k$ by $\hat{C}$ that actually appear in $D$:

$$\text{MatchRate}(\hat{C}, D, k) \;=\; \frac{1}{k}\,\bigl|\{\, i \le k \;:\; \hat{c}_i \in D \,\}\bigr|. \qquad (1)$$

We can plot $\text{MatchRate}(\hat{C}, D, k)$ as a function of $k$, which traces out a curve; in general, if our confidence set has its intended semantics (that higher-ranked rows are more likely to appear in $D$), then the curve should be monotonically decreasing in $k$. For a given level $k$, a higher match rate corresponds to higher confidence that rows ranked within the top $k$ are correct reconstructions; at a given match rate, higher values of $k$ correspond to the ability to confidently reconstruct more rows.
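To make the metric concrete, here is a minimal sketch in Python (our own illustration, not the authors' released code) of the top-$k$ match rate in Equation (1) and the frequency-based ranking described above; rows are assumed to be hashable tuples.

```python
# Top-k match rate (Eq. (1)) and frequency ranking -- illustrative sketch.
from collections import Counter

def match_rate(ranked_rows, private_rows, k):
    """Fraction of the k highest-ranked candidate rows that occur in D."""
    private = set(private_rows)        # membership is all that matters
    return sum(1 for row in ranked_rows[:k] if row in private) / k

def rank_by_frequency(multiset_of_rows):
    """Order distinct rows by how often they appear in the attack's output multiset."""
    counts = Counter(multiset_of_rows)
    return [row for row, _ in counts.most_common()]
```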

1.2 Reconstruction From Aggregate Statistics

We design a reconstruction attack that starts from a collection of aggregate statistics computed from the private dataset. A statistical query is a function that counts the fraction of rows in a dataset that satisfy a given property. We give a formal definition here:

Definition 1 (Statistical Queries [Kea98])

Given a function $\phi : \mathcal{X} \to \{0, 1\}$, a statistical query (also known as a linear query or counting query) is defined as $q_\phi(D) = \frac{1}{|D|} \sum_{x \in D} \phi(x)$, for any dataset $D$.
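As a sketch, a statistical query can be evaluated directly from this definition; the predicate below is a hypothetical example (the column names SEX and AGE are illustrative, not tied to a particular schema):

```python
# A minimal sketch of Definition 1, assuming rows are dicts mapping column
# names to values: a statistical query averages a 0/1 predicate phi over D.
def statistical_query(phi, dataset):
    """q_phi(D) = (1/|D|) * sum over rows x in D of phi(x)."""
    return sum(phi(row) for row in dataset) / len(dataset)

# Hypothetical predicate: is this row a male child under age 5?
phi = lambda row: int(row["SEX"] == "male" and row["AGE"] < 5)
# statistical_query(phi, D) yields one coordinate of the vector a = Q(D).
```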

We use $Q$ to denote a set of $m$ statistical queries and $a = Q(D) \in \mathbb{R}^m$ to denote the vector of their values on the dataset $D$. The objective of an attack on $D$ is to reconstruct rows of $D$ given $Q$ and $a$.

We propose a new reconstruction attack mechanism, RAP-Rank, that learns rows of the unknown dataset $D$ from the statistics $a = Q(D)$. RAP-Rank leverages the recent optimization heuristic Relaxed Adaptive Projection (RAP) [ABK+21] for synthetic data generation. RAP is a randomized algorithm that takes as input a collection of statistical queries $Q$ and answers $a$ (derived from some dataset $D$), and outputs a dataset $\hat{D}$ by attempting to solve the following optimization objective:

$$\hat{D} \in \arg\min_{D'} \; \bigl\| Q(D') - a \bigr\|_2^2 \qquad (2)$$

using a randomized continuous optimization heuristic. RAP is initialized with a parameter $D_0$, discussed below. Roughly speaking, $D_0$ captures some additional distributional information available to the attacker. In our work, this will either be a uniformly randomly generated dataset of a given schema (corresponding to no additional information) or a dataset of this schema sampled from a prior distribution related to the distribution from which $D$ was drawn; more on this below. The notation $\uplus$ is used to indicate union with multiplicities. For example, if a row appears 2 times in $\hat{D}_1$ and 1 time in $\hat{D}_2$, then it appears 3 times in $\hat{D}_1 \uplus \hat{D}_2$.

Our method, RAP-Rank, described in Algorithm 1, consists of running RAP $T$ times to produce datasets $\hat{D}_1, \ldots, \hat{D}_T$ and outputting the confidence set $\hat{C}$ derived from their union with multiplicities.

Input: A set of queries $Q$ and their evaluations $a = Q(D)$ on some private dataset $D$.
Parameters: number of runs $T$.
for $t = 1, \ldots, T$ do
     Initialize RAP's parameter $D_0$ (either uniformly or to a dataset sampled from a prior distribution).
     Output $\hat{D}_t$ by solving Eq. (2) via stochastic gradient descent.
end for
Let $\hat{D} = \hat{D}_1 \uplus \cdots \uplus \hat{D}_T$.
Output: Confidence set $\hat{C}$, the distinct rows of $\hat{D}$ ordered by decreasing frequency.
Algorithm 1: Overview of RAP-Rank
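In Python-flavored pseudocode, the outer loop of RAP-Rank might look like the following sketch; `run_rap` and `init_sampler` are hypothetical stand-ins (the former for the RAP solver of [ABK+21], which we treat as a black box), not the authors' API:

```python
# A minimal sketch of the RAP-Rank outer loop -- our illustration.
from collections import Counter

def rap_rank(queries, answers, T, init_sampler, run_rap):
    counts = Counter()
    for _ in range(T):
        d0 = init_sampler()                    # uniform or prior-sampled seed dataset
        d_hat = run_rap(queries, answers, d0)  # one RAP run: approximately minimize Eq. (2)
        counts.update(map(tuple, d_hat))       # union with multiplicities
    # Confidence set: distinct rows ordered by frequency across the T runs.
    return [row for row, _ in counts.most_common()]
```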

The RAP algorithm maintains a parameterized distribution over datasets that it can use to produce data samples to form synthetic datasets. The goal of the RAP algorithm is to find a set of parameters corresponding to synthetic data that minimizes the objective in Equation (2). Since the optimization problem in Equation (2) is discrete (which makes it difficult to solve), the RAP algorithm considers a continuous relaxation of the objective in Equation (2) that is differentiable in the internal parameters of RAP, enabling the use of continuous differentiable optimization techniques, which are highly effective in practice.
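The following toy sketch illustrates the relaxation idea on a two-column categorical schema. This is our simplified illustration, not the RAP implementation: the schema, the shapes, and the plain gradient step are all assumptions made for brevity.

```python
# Toy continuous relaxation: each synthetic row is a per-column softmax over
# categories, a conjunction query relaxes to a product over columns of the
# probability mass its allowed values receive, and we descend the squared
# error of Equation (2) by gradient. Illustrative only.
import jax
import jax.numpy as jnp

col_sizes = [2, 3]            # toy schema: 2 values in column 0, 3 in column 1
offsets = [0, 2]              # start of each column's one-hot block

def relaxed_answers(logits, masks):
    # logits: (n_rows, 5); masks: (n_queries, 5), 1s on a query's allowed values.
    probs = jnp.concatenate(
        [jax.nn.softmax(logits[:, o:o + s], axis=1)
         for o, s in zip(offsets, col_sizes)], axis=1)
    per_col = [(masks[:, None, o:o + s] * probs[None, :, o:o + s]).sum(-1)
               for o, s in zip(offsets, col_sizes)]     # each (n_queries, n_rows)
    row_match = jnp.prod(jnp.stack(per_col), axis=0)    # conjunction = product
    return row_match.mean(axis=1)                       # fraction of rows matched

def loss(logits, masks, answers):
    return jnp.sum((relaxed_answers(logits, masks) - answers) ** 2)

grad_step = jax.jit(jax.grad(loss))
# One plain gradient step (RAP itself uses adaptive stochastic optimization):
# logits = logits - 0.1 * grad_step(logits, masks, answers)
```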

RAP is initialized with a parameter $D_0$ that is defined over a domain that is a continuous relaxation of the schema of the dataset $D$, so we can initialize RAP at a dataset in the same schema as $D$. In this work, we initialize RAP either at a uniformly random dataset, or at a dataset drawn from a prior distribution that represents sampling Census data at various geographic resolutions.

Although the performance of RAP-Rank as measured by Match-Rate is an empirical finding, RAP-Rank is a theoretically motivated heuristic. In particular, if, when RAP-Rank is initialized at a sample from a prior distribution on datasets, it samples a dataset from the posterior distribution on datasets given the statistics $a = Q(D)$, then the ranking it constructs would be the correct ranking of points by their (posterior) likelihood of appearing in the true dataset $D$. We briefly elaborate on this theoretical intuition in the next section.

1.3 Some Theoretical Intuition

There is a simple Bayesian argument that provides some intuition for our resampling method for confidently reconstructing rows of the true, private dataset $D$. Let $P$ be some prior distribution over all datasets with the same format or schema as $D$. For instance, $P$ could simply be uniform over all datasets with the same schema as $D$, but any $P$ suffices in the argument that follows. Let us assume that the true $D$ is drawn according to this prior ($D \sim P$), and we are given some queries $Q$ as well as their numerical values on $D$, denoted $a = Q(D)$. Suppose we imagine that when we initialize RAP at a sample drawn from the prior $P$ and run it once, the resulting reconstructed dataset is a sample $\hat{D}$ from the posterior distribution given the computed statistics: $\hat{D} \sim P(\cdot \mid Q(\cdot) = a)$. How could we use the ability to sample such datasets $\hat{D}$ to estimate the probability that particular points are elements of $D$?

More generally, let $X(D, \hat{D})$ be any random variable determined by the draws $D$ and $\hat{D}$; for instance, a natural $X$ for our purposes would take value equal to 1 if both $D$ and $\hat{D}$ contain some particular row $r$, and 0 otherwise.

The attacker is interested in the expectation

$$\mathbb{E}\bigl[\, X(D, \hat{D}) \;\big|\; Q(D) = a \,\bigr], \quad \text{where } \hat{D} \sim P(\cdot \mid Q(\cdot) = a) \text{ independently of } D, \qquad (3)$$

which in the example above is simply the probability that both $D$ and $\hat{D}$ contain the row $r$. The difficulty is that although, given $a$, we have assumed that we can take samples $\hat{D} \sim P(\cdot \mid Q(\cdot) = a)$, we cannot evaluate the predicate $X(D, \hat{D})$ because we do not have access to the true dataset $D$ from which the statistics $a$ were computed.

However, it is not hard to derive that this expectation is identical to:

$$\mathbb{E}_{\hat{D}' \sim P(\cdot \mid Q(\cdot) = a)}\; \mathbb{E}_{\hat{D} \sim P(\cdot \mid Q(\cdot) = a)} \bigl[\, X(\hat{D}', \hat{D}) \,\bigr]. \qquad (4)$$

In other words, rather than computing $\mathbb{E}[X(D, \hat{D})]$ we can instead compute $\mathbb{E}[X(\hat{D}', \hat{D})]$, where $\hat{D}'$ and $\hat{D}$ are both independent samples from the posterior distribution, i.e., under our assumption, two reconstructions that result from running RAP twice with fresh randomness. The reason for this equivalence is that in the two expectations above, the joint distributions of $(D, \hat{D})$ and $(\hat{D}', \hat{D})$ are identical, since $D$ and $\hat{D}'$ are conditionally independent of $\hat{D}$ given $a$, and both are distributed according to the posterior $P(\cdot \mid Q(\cdot) = a)$.

In other words, if we wish to estimate the expectation (3), we can do so by instead estimating the expectation (4), which involves evaluating the predicate $X$ only on datasets drawn from the posterior, rather than on the true dataset $D$ drawn from the prior. Concretely, rows that are more likely to appear in two or more draws from the posterior are also more likely to appear jointly in $D$ drawn from the prior and $\hat{D}$ drawn from the posterior. To the extent that our resampling method successfully approximates repeated draws from the posterior given $a$, it ranks rows in decreasing order of their value of the inner expectation in (4), i.e., their posterior likelihood of being true rows of the original $D$. Thus if we believe that RAP, when initialized at a draw from the prior distribution, simulates a draw from the posterior distribution, the above argument explains why the ranking should correspond to an ordering of data points by their probability of appearing in the true dataset $D$. In general, sampling from a posterior over a space of high-dimensional datasets and queries is a computationally intractable problem [TD19], but this does not rule out effective heuristics on real datasets, and we believe that this Bayesian argument provides at a minimum some insight about why methods such as ours work well in practice.
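As a concrete (and purely illustrative) rendering of the estimator suggested by (4): given $T$ approximate posterior samples, in practice the $T$ outputs of RAP, one can score each row by the fraction of samples that contain it, which is a Monte Carlo estimate of its posterior membership probability.

```python
# A minimal sketch (our illustration) of the estimator suggested by (4):
# score each row by the fraction of posterior samples that contain it.
from collections import Counter

def membership_scores(posterior_samples):
    T = len(posterior_samples)
    counts = Counter()
    for d in posterior_samples:
        counts.update(set(map(tuple, d)))   # count each row at most once per sample
    # Estimated Pr[row in D | Q(D) = a], most confident rows first.
    return sorted(((row, c / T) for row, c in counts.items()),
                  key=lambda kv: -kv[1])
```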

2 Empirical Findings

In this section we describe our primary experimental findings. Additional and more fine-grained results are provided in the appendix, including plots of Match-Rate for all 50 states on the Census data.

2.1 Datasets and Queries

2.1.1 U.S. Decennial Census

Dataset:

We conduct experiments on subsets of synthetic U.S. Census microdata released by the Census Bureau during the development of the 2020 Census Disclosure Avoidance System (DAS). This synthetic microdata was generated to have statistics similar to the real 2010 Census microdata. We use the 2020-05-27 vintage Privacy-protected Microdata File (PPMF) [U.S20]. In our experiments, we treat the PPMF as the ground-truth microdata, even though it is synthetic, since the true microdata has never been released.

The 2020-05-27 vintage PPMF consists of 312,471,327 rows, each representing a (synthetic) response for one individual in the 2010 Decennial Census. The columns correspond to the following attributes: the location of the respondent's home (state, county, census tract, and census block), their housing type (either a housing unit or one of 8 types of group quarters), their sex (male or female), their age (in integer years), their race (one of the 63 racial categories defined by the U.S. Office of Management and Budget Standards; the 63 race categories correspond to any non-empty subset of the following: American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, and Other), and whether or not they have Hispanic or Latino origin.

We evaluate reconstruction attacks on subsets of the PPMF that contain all rows belonging to a given census tract or census block. According to the U.S. Census Bureau, census tracts typically have between 1,200 and 8,000 people, with an optimum size of 4,000, and cover a contiguous area (although their geographic sizes vary widely depending on population density). Each census tract is partitioned into up to 10,000 census blocks, which are typically small regions bounded by features such as roads, streams, and property lines.

In our tract-level experiments, we randomly select one tract from each state. In our block-level experiments, we select for each state the block closest in size to the mean block size as well as the largest block. In addition, we select blocks closest in size to fixed fractions of the largest block size in the state. Thus in total, we evaluate on 50 tracts (one per state) and several blocks per state.

Statistical Queries:

The U.S. Census Bureau publishes a collection of data tables containing statistics computed from the microdata at various levels of geographic granularity. For example, some tables are published at the block level, meaning that they release a copy of that table for every census block in the U.S., while others are published at the census tract or county level. Our experiments attempt to reconstruct the microdata belonging to census tracts and blocks based on statistics contained in the Census tables.

We use the same tables that the Census Bureau used in their internal reconstruction attack on the 2010 Census data [JAS20]. These are the following tables from Census Summary File 1 (Summary File 1 has been renamed to the Demographic and Housing Characteristics File (DHC) for the 2020 U.S. Census; in all cases, we refer to tables and data products by the names used in the 2010 Census):

P1: Total population.
P6: Race (total races tallied).
P7: Hispanic or Latino origin by race (total races tallied).
P9: Hispanic or Latino and not Hispanic or Latino by race.
P11: Hispanic or Latino, and not Hispanic or Latino by race, for the population 18 years and over.
P12: Sex by age for selected age categories (roughly 5-year buckets).
P12 A-I: Sex by age for selected age categories (iterated by race).
PCT12: Sex by single year of age.
PCT12 A-N: Sex by single year of age (iterated by race).

All of the P tables are released at the block level, while the PCT tables are released only at the census tract level.

Each table defines a collection of statistical queries that are evaluated on the Census microdata. For example, cell 3 of table P12 counts the number of male children under the age of 5. Since P12 is a block-level table, cell 3 corresponds to one statistical query per census block. Similarly, each cell in a tract-level table encodes one statistical query per census tract. All of the statistical queries in the above tables can be encoded as follows: given pairs $(c_j, S_j)$, where $c_j$ is a column name and $S_j$ is a subset of that column's domain, together with either a census block or tract identifier, count the number of microdata rows belonging to that tract or block for which $x[c_j] \in S_j$ for all $j$. (The Census tables report row counts, but in our experiments we convert counts to fractions by dividing by the population of the tract or block we are reconstructing.) Thus in logical terms, queries are in Conjunctive Normal Form (CNF), meaning that they consist of a conjunction (logical AND) of clauses, with each clause being a disjunction (logical OR) of allowed values for a column.

For example, cell 3 of table P12 encodes, for each block, a query with $(c_1, S_1) = (\text{SEX}, \{\text{male}\})$ and $(c_2, S_2) = (\text{AGE}, \{0, 1, 2, 3, 4\})$. When we perform tract-level reconstructions, we use queries defined by all of the above tables. For block-level reconstructions, we use only the block-level tables (i.e., excluding tables PCT12 and PCT12 A-N). In order to minimize the total number of queries, we omit several table cells that are either repeated or can be computed as a sum or difference of other table cells.
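A minimal sketch of evaluating one such CNF-style query (our illustration; rows are dicts, and the column names SEX and AGE are illustrative):

```python
# Evaluate a conjunction of (column, allowed-value-set) clauses over the rows
# of one block or tract -- illustrative sketch of the Census query encoding.
def cnf_query_count(rows, clauses):
    """clauses: list of (column, allowed_values) pairs; returns a row count."""
    return sum(
        all(row[col] in allowed for col, allowed in clauses)
        for row in rows
    )

# Cell 3 of table P12 (male children under 5) for one block:
clauses = [("SEX", {"male"}), ("AGE", set(range(5)))]
# count = cnf_query_count(block_rows, clauses)
# fraction = count / len(block_rows)   # we convert counts to fractions
```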

The statistical queries encoded by the Census data tables vary significantly in the number of clauses in a conjunction and in the sizes of the sets (clauses) $S_j$. Two cells correspond to total population (at the block and tract level, with no clauses at all), while the remaining cells involve between one and four clauses (27, 352, 1915, and 1259 cells, respectively). The sizes of the sets $S_j$ range from 1 to 98.

To verify the correctness of our implementation of the statistical queries from the tables above, we compared the output of our implementation to tables released by the IPUMS National Historical Geographic Information System (NHGIS), which computes the census tables from each vintage of the PPMF released by the U.S. Census Bureau. We compared our implementations of queries from tables P1, P6, P7, P9, P11, P12, and P12 A-I on all census blocks in the United States and Puerto Rico and found no discrepancies. Unfortunately, the PCT12 and PCT12 A-N tables were not included in the NHGIS tabulations for the 2020-05-27 vintage PPMF, so we were unable to verify our implementation of these queries (but their structure is very similar to that of the block-level queries).

2.1.2 American Community Survey (ACS)

Dataset

We conduct additional experiments on a suite of datasets derived from US Census data, introduced in [DHM+21]. (The Folktables package comes with an MIT license, and the terms of service of the ACS data can be found here: https://www.census.gov/data/developers/about/terms-of-service.html.) The Folktables package defines datasets for each of the 50 states and various tasks. Each task consists of a subset of columns from the American Community Survey (ACS) corpus (a detailed list of the attributes can be found in the Appendix, Table 2; note that we discretize numerical columns into 10 equal-sized bins). These datasets provide a diverse and extensive collection helpful in experimenting with practical algorithms. We use the five largest states (California, New York, Texas, Florida, and Pennsylvania), which together with the three tasks (employment, coverage, mobility) constitute 15 datasets. Our experiments therefore seek to reconstruct individuals at the state level. Compared to the datasets derived from the Census Bureau's May 2020 Demonstration Data Product (PPMF), based on the 2010 Census, the Folktables ACS datasets contain many more attributes (see Table 1), helping us demonstrate how our reconstruction attack scales to higher-dimensional datasets.

We note that while the datasets distributed by the Folktables package are derived from the ACS microdata, the package was designed for evaluating machine learning algorithms, and there exist many differences from the actual 1-year and 5-year statistical tables released by the Census Bureau each year. As mentioned above, each task only contains a subset of features collected on the ACS questionnaire and released in the 1-year Public Use Microdata Sample (PUMS). Moreover, survey responses are collected at both the household and person-level, but Folktables treats records only at the person-level. Lastly, in the ACS PUMS, each survey response is assigned a sampling weight, which can then be used to calculate weighted statistics (e.g., estimated population sizes and income percentiles) that estimate population-level statistics. Folktables ignores these weights, and so the statistics we calculate and use for experiments are unweighted tabulations. Folktables also ignores the replicate weights on the ACS PUMS that the Census Bureau recommends users implement to generate measures of uncertainty associated with the weighted statistics.

Statistical Queries

For each ACS dataset we compute a set of $k$-way marginal statistics. A marginal query counts the number of people in a dataset whose features match a given value. An example of a 2-way marginal query is: "How many people are female and have income greater than 50K?" The formal definition is as follows:

Definition 2 ($k$-way Marginal Queries)

Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ be a discrete data domain with $d$ features, where $\mathcal{X}_i$ is the domain of the $i$-th feature. A $k$-way marginal query is defined by a set of $k$ features $S \subseteq [d]$, together with a target value $y_i \in \mathcal{X}_i$ for each feature $i \in S$. Given such a pair $(S, y)$, let $\mathcal{X}(S, y) \subseteq \mathcal{X}$ denote the set of points in $\mathcal{X}$ that match $y$ on each feature $i \in S$. Then consider the function $\phi_{S,y}(x) = \mathbb{1}\{x \in \mathcal{X}(S, y)\}$, where $\mathbb{1}$ is the indicator function. The corresponding $k$-way marginal query is the statistical query defined as

$$q_{S,y}(D) = \frac{1}{|D|} \sum_{x \in D} \phi_{S,y}(x)$$

for any dataset $D$.

We explore the efficacy of our reconstruction attack on ACS datasets when all 2-way or all 3-way marginal queries are released; a sketch of enumerating such queries follows.
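The sketch below (our illustration) enumerates and evaluates all $k$-way marginal queries for a discretized dataset, assuming rows are tuples and `domains` lists the possible values of each feature (as in our 10-bin discretization):

```python
# Enumerate and evaluate all k-way marginal queries -- illustrative sketch.
from itertools import combinations, product

def all_k_way_marginals(domains, k):
    """Yield (features, values) pairs defining every k-way marginal query."""
    d = len(domains)
    for S in combinations(range(d), k):
        for y in product(*(domains[i] for i in S)):
            yield S, y

def marginal_query(dataset, S, y):
    """q_{S,y}(D): fraction of rows matching value y on every feature in S."""
    return sum(all(x[i] == v for i, v in zip(S, y)) for x in dataset) / len(dataset)
```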

Task # Attr Dim # 2-way # 3-way
employment 16 108 5154 144910
coverage 18 107 5160 149848
mobility 21 141 9137 362309
Table 1: For each Folktables task, we list the number of attributes, the total dimension of those attributes, and the number of all 2-way and 3-way marginal queries.

2.2 Baselines

In isolation, the Match-Rate of RAP-Rank described in the previous section cannot provide enough information to indicate a privacy breach. If the dataset distribution has very low entropy, and we know the distribution, then we might expect to obtain a high Match-Rate simply by randomly guessing rows that are likely under the data distribution. Therefore, we would like to compare the Match-Rate of our attack to the Match-Rate of baselines of various strengths, corresponding to increasingly precise knowledge of the data distribution.

Given a baseline distribution $P$, we consider the Match-Rate baseline that results from ordering the rows of the domain $\mathcal{X}$ according to their likelihood of appearing in a randomly sampled dataset drawn from $P$. In practice, the domain is often too large to enumerate; an alternative in this case is to sample a large collection of rows from $P$ and rank rows by their likelihood under the resulting empirical distribution, yielding a baseline confidence set $\hat{C}_P$.
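A minimal sketch of this sampling baseline (our illustration; `sample_row` is a hypothetical stand-in for drawing rows from the baseline, e.g., from the tract-, county-, state-, or national-level row pools described below):

```python
# Baseline confidence set: rank distinct rows by empirical frequency among
# a large sample drawn from the baseline distribution -- illustrative sketch.
from collections import Counter

def baseline_confidence_set(sample_row, n_samples=100_000):
    counts = Counter(sample_row() for _ in range(n_samples))
    return [row for row, _ in counts.most_common()]
```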

We compare to different baselines corresponding to a set of increasingly informed prior distributions. First, in order to simulate a prior that is identical to the distribution from which the private dataset is sampled, we randomly partition the real dataset into two halves. We treat one half as the private dataset $D$ on which we compute statistical queries and from which we seek to reconstruct rows, while the other half, $D_{\text{holdout}}$, is used to produce a baseline confidence set. Here, by construction, $D$ and $D_{\text{holdout}}$ are identically distributed, which allows us to compare to the very strong baseline of the "real" sampling distribution for real datasets. Of course, as a synthetic construction originating from the real data, this should generally be viewed as an unrealistically strong benchmark.

We also compare to a natural hierarchy of benchmarks that correspond to fixing a prior based on knowledge of Census data at different levels of granularity. U.S. Census data is organized according to geographic entities that have a hierarchical structure. We consider a natural hierarchy of prior distributions in which a lower level in the hierarchy is more informative than higher levels. For example, for block-level reconstruction, we consider benchmarks defined by sampled rows from the tract, county, and state ($D_{\text{tract}}$, $D_{\text{county}}$, $D_{\text{state}}$) that each block is contained in, as well as the benchmark defined by samples from all rows in the dataset ($D_{\text{US}}$). We note that in block-level reconstruction experiments, the holdout half $D_{\text{holdout}}$ corresponds to a block-level prior, and so we refer to this set of rows as $D_{\text{block}}$ in Section 2.3. Similarly, for tract-level reconstruction experiments, $D_{\text{holdout}}$ is referred to as $D_{\text{tract}}$.

As we describe in more detail in Section 2.3, we run reconstruction of Census tracts both with and without the attribute corresponding to the block each individual resides in. For the setting in which the block attribute is included, the county, state, and national baselines are at an extreme disadvantage, since the majority of individuals in $D_{\text{county}}$, $D_{\text{state}}$, and $D_{\text{US}}$ reside in a tract different from the one containing $D$, and so necessarily have different block values. To compensate for this (otherwise crippling to the baselines) disadvantage, in these cases we strengthen the baselines and instead populate the block attribute according to the distribution of blocks found in $D$. For example, the state-level baseline can then be interpreted as a prior in which the distribution of blocks follows that of $D$ and the distribution of the remaining attributes follows that of $D_{\text{state}}$.

2.3 Results

Our primary visualization of reconstruction rates is as follows. Recall that RAP-Rank and our baselines each output some confidence set $\hat{C}$. Therefore, for both RAP-Rank and our baselines, we plot $\text{MatchRate}(\hat{C}, D, k)$ against $k$, or in other words, the fraction of candidates of rank $k$ or higher that exactly match some row in $D$. Because the many datasets on which we run our reconstruction attack vary considerably in size, and in some of our plots we average our results over many datasets, in the ensuing plots we express rank as a fraction of the number $U$ of unique rows in $D$. In other words, the $x$-axis measures $k / U$. This allows us to average results across different samples of data (e.g., different geographic entities for both Census and ACS experiments) on a common scale for the $x$-axis.
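A small sketch of this visualization (our illustration, assuming matplotlib and a ranked candidate list as described above):

```python
# Plot the top-k match rate against normalized rank k / U, where U is the
# number of unique rows in D, so curves from different datasets share an x-axis.
import matplotlib.pyplot as plt

def plot_match_curve(ranked_rows, private_rows, label):
    private = set(private_rows)
    U = len(private)                    # number of unique rows in D
    hits = 0
    xs, ys = [], []
    for k, row in enumerate(ranked_rows, start=1):
        hits += row in private
        xs.append(k / U)                # normalized rank on the x-axis
        ys.append(hits / k)             # top-k match rate, computed cumulatively
    plt.plot(xs, ys, label=label)
    plt.xlabel("rank fraction k / U")
    plt.ylabel("top-k match rate")
```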

Figure 1: The panel on the left plots the Match-Rate of RAP-Rank and our various baselines on a tract level reconstruction when we use the BLOCK attribute. The panel on the right plots the Match-Rate of RAP-Rank and our various baselines when the BLOCK attribute has been removed. Both panels show the average performance of RAP-Rank and the baselines averaged over a randomly selected tract from each of the 50 US states. In both cases, RAP-Rank is initialized uniformly at random (i.e. we have not initialized at a baseline distribution).
Figure 2: Initializing RAP-Rank at the tract baseline significantly improves its performance, leading it to outperform the tract baseline. Here the BLOCK attribute is included and must be reconstructed to constitute a match.
Figure 3: The panel on the left plots the Match-Rate of RAP-Rank and our various baselines on a block-level reconstruction, when RAP-Rank is initialized to a uniformly random dataset. The panel on the right shows the performance of RAP-Rank when it is initialized to $D_{\text{block}}$, and compares its performance to that baseline.
Figure 4: We select state-task combinations from the Folktables dataset, comparing the top-candidate Match-Rate of RAP-Rank against the $D_{\text{holdout}}$ baseline, which is derived from the holdout split. For each task, we average results over the five largest states (by population) in the United States (i.e., California, New York, Texas, Florida, and Pennsylvania). We initialize RAP-Rank randomly, showing results where $Q$ is the set of all 2-way and all 3-way marginals.

In our first set of experiments, we randomly select a tract from each state; each tract forms a private dataset $D$ from which we compute the Census query-answer pairs. We run RAP-Rank using these queries, starting from a uniformly random initialization (we shortly describe a natural and realistic alternative initialization scheme that improves performance considerably), and plot the match rate as a function of $k$. We similarly plot the match rate of each of our baselines. In the left panel of Figure 1, we plot the reconstruction rates after averaging across the selected tracts from all 50 states. (See Figures 5, 6, 9, and 10 in the appendix for the state-by-state plots that comprise this average.)

As expected, in general the reconstruction rates are reasonably high at higher ranks (lower $x$-axis values) and then fall at lower ranks. The left panel shows that the RAP-Rank reconstruction rates are considerably higher than all but the strongest baseline, resampling at the tract level, which is much higher still. Recall that since this is a tract-level reconstruction, here $D_{\text{tract}}$ is in fact $D_{\text{holdout}}$, i.e., the very strong artificial benchmark constructed from the dataset we are attacking itself. We see that the other baselines, $D_{\text{county}}$, $D_{\text{state}}$, and $D_{\text{US}}$, perform quite poorly. This is partially an artifact of requiring that they reconstruct the BLOCK attribute: since blocks appearing within a tract appear in no other tracts, the non-tract baselines have a poor chance of reconstruction because they are sampling at a coarser geographic level. Recall that we have strengthened these baselines by letting the BLOCK attribute be distributed according to the empirical distribution of blocks in the true dataset $D$, but still, these baselines are at a disadvantage because they have lost the correlation between the BLOCK attribute and all other features. Therefore, in the right panel of Figure 1, we reproduce the same experiment with the BLOCK attribute dropped. This makes the reconstruction task easier and improves the performance of RAP-Rank as well as all of the baselines. The most dramatic increase is in the performance of the $D_{\text{county}}$, $D_{\text{state}}$, and $D_{\text{US}}$ baselines, but we also see that RAP-Rank now performs relatively better compared to the $D_{\text{tract}}$ (holdout) baseline, with reconstruction rates above 0.9 well into the ranking. These results establish that RAP-Rank can perfectly reconstruct rows well beyond what sampling access alone permits, except at the most local level. In other words, RAP-Rank is far from simply "getting lucky": its optimization process is deliberately and effectively exploiting the actual query-answer pairs, not simply benefiting from having data similarly distributed to the private dataset. Nevertheless, the ordering of the baselines and of RAP-Rank is unchanged, i.e., RAP-Rank outperforms all of the baselines except for the artificial $D_{\text{tract}}$ baseline.

We next observe that there is an asymmetry in our experiments that treats RAP-Rank in what could be considered an unfair manner: we assert that there are strong "baseline" distributions related to the data we are trying to reconstruct, and yet we have initialized our attack RAP-Rank at a uniformly random dataset, without giving it the benefit of this knowledge. If indeed these baseline distributions are public knowledge, then an attacker could make use of them as well. Thus our next set of experiments consists of initializing RAP-Rank at the baseline that we are comparing it to, and we see that this causes it to significantly outperform all baselines, including $D_{\text{tract}}$ (which we recall is the strong holdout baseline), even with the BLOCK attribute. In other words, if we view the baseline as a public prior distribution, then giving RAP-Rank access to it leads to the ability to significantly improve over it.

In Figure 2, we show results averaged across randomly chosen tracts for all 50 states, in which we have now initialized RAP-Rank at the tract baseline and compare to that sampling baseline (once again including the BLOCK feature). The results are clear: when we level the playing field by seeding RAP-Rank with knowledge of the tract baseline distribution, it now outperforms the tract baseline. We can interpret the area between the two curves in Figure 2 as a measure of the additional reconstruction risk introduced by running RAP-Rank on the query-answer pairs, beyond the baseline risk of tract sampling.

In Figure 3, we show that RAP-Rank remains an effective reconstruction attack even at the most fine-grained geographic level, which corresponds to Census blocks. The left panel again shows the Match-Rate of RAP-Rank initialized randomly, compared to all the sampling baselines. Here we again see the same qualitative performance: even with random initialization, RAP-Rank outperforms all of the sampling baselines except for $D_{\text{block}}$ (which we recall in this case is the artificially constructed holdout baseline). The right panel shows results when we initialize RAP-Rank at $D_{\text{block}}$. In this case we again see that initializing at the benchmark distribution causes RAP-Rank to significantly outperform the benchmark. This figure again averages over attacks on blocks from all 50 states. (See Figures 11 and 12 in the appendix for the state-by-state plots that comprise this average.)

We conclude by briefly describing a second set of experiments on three datasets from the ACS Folktables package, corresponding to the employment, coverage, and mobility tasks. We consider these alternate datasets both to show the generality of our methods beyond decennial Census data (in particular, the ACS Folktables datasets have much higher dimensionality than the decennial Census data), and in order to perform a controlled comparison of queries of differing power (as opposed to the fixed set of queries provided for the decennial Census data).

In Figure 4, we show the reconstruction rates obtained by letting the query set $Q$ be the set of all 2-way or all 3-way marginal queries on these three ACS datasets, and as in the Census data we compare to the very strong $D_{\text{holdout}}$ baseline. Two remarks are in order. First, despite the low complexity of these queries compared to the Census queries (2-way and 3-way marginals reference only pairs and triples of columns, respectively), both considerably outperform the baseline even when RAP-Rank is initialized randomly, maintaining reconstruction rates well above 0.8 even at the lowest rank. This suggests that not only is aggregation insufficient for privacy, neither is restriction to simple queries. In fact, on this data, and with these simple queries, our reconstruction attack performs even better, outperforming the strongest baseline even without the benefit of being initialized at that baseline.

Second, the lift in performance in moving from 2-way marginals to 3-way marginals is large, demonstrating the reconstructive power of even slightly more complex queries.

3 Limitations and Conclusions

We have shown the power of a new class of reconstruction attacks that can not only produce a candidate reconstructed dataset with a high intersection with the true dataset, but also produce a ranking of rows that empirically corresponds to their likelihood of appearing in the true dataset. We have shown that from statistics that were actually released as part of the 2010 Decennial U.S. Census, it is possible to run our attack, and that its Match-Rate is high, particularly at lower values of $k$, indicating high-confidence reconstruction of a subset of the rows. Moreover, even with random initialization (equivalently, viewing RAP-Rank as having an uninformative prior), RAP-Rank outperforms all but the most stringent (artificial) benchmark that we construct. Finally, we can reliably outperform even the most stringent benchmark if we initialize RAP-Rank at the benchmark distribution, consistent with the premise that if a distribution is publicly known (and so is sensible to consider as a public benchmark), then we should assume that attackers can make use of it as well.

Nevertheless, our attack is not without limitations. First and foremost, our reconstructions of decennial Census data are far from recovering every row in the private data; the primary threat is that we can recover some fraction of the rows with confidence. Moreover, our attack does not produce calibrated confidence scores. That is, we produce a ranking of rows $\hat{C}$, but an attacker without access to the ground truth would be unable to compute the Match-Rate as a function of $k$ as we do in our plots, and so would not know a priori how much confidence to place in each reconstructed row. Nevertheless, a ranking (known to be empirically correlated with Match-Rate) is sufficient for an attacker to prioritize the rows of a reconstruction for some other external validation procedure or attack.

References

Appendix A Appendix

In Table 2, we describe the columns used for each Folktables task found in our ACS experiments.

Task Columns
employment AGEP (age), SCHL (educational attainment)
MAR (marital status), RELP (relationship)
DIS (disability recode), ESP (employment status of parents)
CIT (citizenship status), MIG (mobility status - lived here 1 year ago)
MIL (military status), ANC (ancestry recode)
NATIVITY (nativity), DEAR (hearing difficulty)
DEYE (vision difficulty), DREM (cognitive difficulty)
SEX (sex), RAC1P (recoded detailed race code)
coverage AGEP (age), SCHL (educational attainment)
MAR (marital status), SEX (sex)
DIS (disability recode), ESP (employment status of parents)
CIT (citizenship status), MIG (mobility status - lived here 1 year ago)
MIL (military status), ANC (ancestry recode)
NATIVITY (nativity), DEAR (hearing difficulty)
DEYE (vision difficulty), DREM (cognitive difficulty)
PINCP (Total person’s income), ESR (employment status recode)
FER (gave birth within the past 12 months), RAC1P (recoded detailed race code)
mobility AGEP (age), SCHL (educational attainment)
MAR (marital status), SEX (sex)
DIS (disability recode), ESP (employment status of parents)
CIT (citizenship status), MIL (military status)
ANC (ancestry recode), NATIVITY (nativity)
RELP (relationship), DEAR (hearing difficulty)
DEYE (vision difficulty), DREM (cognitive difficulty)
RAC1P (recoded detailed race code), GCL (grandparents living with grandchildren)
COW (class of worker), ESR (employment status recode)
WKHP (usual hours worked per week past 12 months), JWMNP (Travel time to work)
PINCP (Total person’s income)
Table 2: The Folktables columns we use for each task.

In Section 2.3, we visualized the reconstruction rates on Census and ACS datasets. However, to more easily communicate our findings, we presented results that were averaged across various geographic entities in Figures 1, 2, and 3 for the Census experiments and Figure 4 for the ACS experiments. Here, we present more granular results. In particular, in each subplot of Figures 5, 6, 9, and 10, we present results for a single tract that was randomly chosen from each state, where the latter two figures (9, 10) present tract-level experiments without the BLOCK feature. In addition, we plot results of RAP-Rank initialized to the baseline distribution in Figures 7 and 8. As in Section 2.3, we again average results at the block level in Figures 11 and 12, but now we aggregate our randomly selected blocks at the state level. Finally, in Figure 13, we present results for each of the 15 state-task combinations derived from the ACS Folktables package.

Figure 5: We plot Match-Rate of RAP-Rank and our baselines on a tract-level reconstruction with the BLOCK attribute included. Subplots are labeled and ordered alphabetically by the state name.
Figure 6: We plot Match-Rate of RAP-Rank and our baselines on a tract-level reconstruction with the BLOCK attribute included. Subplots are labeled and ordered alphabetically by the state name.
Figure 7: We plot Match-Rate of RAP-Rank and our tract-level baseline on a tract-level reconstruction with the BLOCK attribute included. RAP-Rank is initialized to the baseline. Subplots are labeled and ordered alphabetically by the state name.
Figure 8: We plot Match-Rate of RAP-Rank and our tract-level baseline on a tract-level reconstruction with the BLOCK attribute included. RAP-Rank is initialized to the baseline. Subplots are labeled and ordered alphabetically by the state name.
Figure 9: We plot Match-Rate of RAP-Rank and our baselines on a tract-level reconstruction with the BLOCK attribute excluded. Subplots are labeled and ordered alphabetically by the state name.
Figure 10: We plot Match-Rate of RAP-Rank and our baselines on a tract-level reconstruction with the BLOCK attribute excluded. Subplots are labeled and ordered alphabetically by the state name.
Figure 11: We plot Match-Rate of RAP-Rank and our baselines on a block-level reconstruction, where in each subplot, we average results across all blocks selected for the corresponding state. Subplots are labeled and ordered alphabetically by the state name.
Figure 12: We plot Match-Rate of RAP-Rank and our baselines on a block-level reconstruction, where in each subplot, we average results across all blocks selected for the corresponding state. Subplots are labeled and ordered alphabetically by the state name.
Figure 13: We plot Match-Rate of RAP-Rank and our baselines for each state-task combination from the ACS Folktables package. We show results of RAP-Rank using both 2-way and 3-way marginal queries.