Private Synthetic Data with Hierarchical Structure

We study the problem of differentially private synthetic data generation for hierarchical datasets in which individual data points are grouped together (e.g., people within households). In particular, to measure the similarity between the synthetic dataset and the underlying private one, we frame our objective under the problem of private query release, generating a synthetic dataset that preserves answers for some collection of queries (i.e., statistics like mean aggregate counts). However, while the application of private synthetic data to the problem of query release has been well studied, such research is restricted to non-hierarchical data domains, raising the initial question – what queries are important when considering data of this form? Moreover, it has not yet been established how one can generate synthetic data at both the group and individual-level while capturing such statistics. In light of these challenges, we first formalize the problem of hierarchical query release, in which the goal is to release a collection of statistics for some hierarchical dataset. Specifically, we provide a general set of statistical queries that captures relationships between attributes at both the group and individual-level. Subsequently, we introduce private synthetic data algorithms for hierarchical query release and evaluate them on hierarchical datasets derived from the American Community Survey and Allegheny Family Screening Tool data. Finally, we look to the American Community Survey, whose inherent hierarchical structure gives rise to another set of domain-specific queries that we run experiments with.


1 Introduction

Differential privacy (DworkMNS06) provides rigorous privacy guarantees that center around limiting the influence of any individual data point. As a result, organizations have increasingly adopted differential privacy to release information while protecting individuals' sensitive data. The 2020 U.S. Decennial Census, for example, serves as one of the most prominent deployments of differential privacy in recent years (Abowd18). In this work, we study private synthetic data generation for the purpose of preserving a large collection of summary statistics. This task, otherwise known as private query release, is one of the most fundamental problems in differential privacy and remains a key objective for many organizations like the U.S. Census Bureau. For example, recent work (hardt2010simple; gaboardi2014dual; vietri2020new; liu2021leveraging; aydore2021differentially; liu2021iterative; mckenna2022aim) has demonstrated the effectiveness of generating private synthetic data to solve query release; and after announcing plans to incorporate differential privacy into the American Community Survey (ACS) (jarmin2025acs), the U.S. Census Bureau declared (albeit informally) that it intended to replace the public ACS release with fully synthetic data (rodriguez2021synacs).

While private synthetic data generation has shown promise with respect to query release, prior work has not studied the setting in which the data exhibits hierarchical structure (i.e., individual data points that are naturally grouped together). In particular, the synthetic data generation process for records at both the group and individual-level has not yet been established in this problem domain. Moreover, although studies based on hierarchical data often look at relationships between individuals across groups (e.g., trends observed between domestic partners), queries in the traditional setting do not describe such statistics. As a result, social scientists have criticized the U.S. Census Bureau’s objective to replace the ACS with synthetic data, arguing that existing methods are unsuitable since they can only capture statistical relationships at the individual-level (ipums_acssynthetic).

Our Contributions. As alluded to by such critiques, private synthetic data generation remains an open problem for hierarchical data within the context of query release. Therefore, the objective of this work is to explore this problem and introduce differentially private algorithms for synthetic data with hierarchical structure. In pursuit of this goal, we make the following contributions:

  1. We initiate the study of hierarchical query release, formulating the problem in which the data domain has two levels—denoted as group and individual—in its data hierarchy (in contrast, previous works only consider data domains containing the latter).

  2. We construct a set of queries that describe statistical relationships across group and individual-level attributes, where our goal is to capture how attributes interact both within and between each level.

  3. While our formulation of hierarchical data domains naturally lends itself to running synthetic data generation algorithms such as MWEM (hardt2010simple), these algorithms are computationally intractable. Therefore, inspired by methods from liu2021iterative, we present a practical approach for hierarchical query release that models the data as separate product distributions over group and individual-level attributes. Using this approach, we introduce two methods, HPD-Fixed and HPD-Gen, which we empirically evaluate on hierarchical datasets that we construct from the American Community Survey (ACS) and Allegheny Family Screening Tool (AFST) data. (To allow researchers to reproduce our evaluation datasets, code for preprocessing ACS data derived from IPUMS USA (ruggles2021ipums) will be made available online.)

  4. We use the ACS to formulate another set of queries meant to capture relationships between people within households (e.g., the likelihood that an individual with a graduate degree is married to someone who also has a graduate degree). We run additional experiments in this setting.

Related Work. Query release remains one of the most fundamental problems in differential privacy (BlumLR08), and over the years, synthetic data generation has become a well-studied approach, both from theoretical (RothR10; hardt2010multiplicative; hardt2010simple; GRU) and practical (gaboardi2014dual; vietri2020new; aydore2021differentially) perspectives. In particular, liu2021iterative unify iterative approaches to synthetic data generation for private query release under their framework Adaptive Measurements and, concurrently with aydore2021differentially, propose methods that take advantage of gradient-based optimization. Moreover, liu2021iterative demonstrate that the performance of their iterative algorithms can be improved significantly when optimizing over marginal queries, where the sensitivity can be bounded for an entire workload of queries. In later work, mckenna2022aim also explore this case, presenting an iterative procedure that achieves strong performance while removing the burden of hyperparameter tuning.

Despite this long line of research applying synthetic data to private query release, prior works—including those that have used the ACS itself for empirical evaluation (liu2021leveraging; liu2021iterative)—are limited to the non-hierarchical setting. In addition, we note that there exist other works that do not tackle this specific problem: (1) those studying private query release using "data-independent" mechanisms (mm; mckenna2018optimizing; mckenna2019graphical; NTZ; ENU20) and (2) those studying synthetic (tabular) data generation for purposes other than query release (rmsprop_DPGAN; yoon2018pategan; xu2019modeling; beaulieu2019privacy; neunhoeffer2020private; rosenblatt2020differentially). However, such works also do not consider hierarchical structure.

2 Formulation of Hierarchical Data Domains

We consider data with some hierarchical structure, in which a dataset can first be partitioned into different groups that can then further be divided into individual rows. For example, one can form a hierarchical dataset from census surveys by grouping individuals into their respective households. We assume that each group contains at most some maximum number of rows and has a set of group-level features. Similarly, we assume that each individual row belongs to some data domain of individual-level features. For example, a census survey may record whether households reside in an urban or rural area (a group-level attribute) and what the sex and race of individuals in each household are (individual-level attributes). Putting these attribute domains together, we then have a hierarchical data universe over all possible groups of bounded size (in the traditional, non-hierarchical query release setting, there instead exists some private dataset with discrete attributes whose product forms the data domain):

(1)

where the empty set represents a missing individual row (i.e., one of the possible individual rows in the group does not exist). In other words, every element in the data universe corresponds to a possible group, including the individual rows it contains. Letting the group-level and individual-level domain sizes be fixed, the overall domain size can be written as a function of the two.

Given this formulation of the data universe, our goal then is to study the problem of answering a collection of queries about some dataset while satisfying differential privacy. [Differential Privacy (DworkMNS06)] A randomized mechanism is differentially private if, for all neighboring datasets (i.e., differing on a single person) and all measurable subsets of outputs, we have:

Typical differentially private query release algorithms operate under the assumption that neighboring datasets differ on a single row, where the total number of rows is given as public knowledge. (This version of differential privacy, which uses this notion of neighboring datasets, is sometimes referred to as bounded differential privacy (Dwork06; DworkMNS06).) In our setting, however, we have a dataset whose individual rows comprise groups of bounded size. Therefore, we similarly assume that the number of individuals, the number of groups, and the maximum group size are given (and fixed). For the release of census survey data, for example, the U.S. Census Bureau may publicly publish both the total number of people in the United States and the total number of households that such people reside in. Moreover, we assume that while neighboring datasets still differ on an individual row, the total number of groups and maximum group size remain the same.
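To make the setup concrete, the following is a minimal Python sketch (hypothetical names, not the paper's code) of one way to represent a hierarchical dataset in which each group carries its own attributes plus up to a fixed number of individual rows, with missing slots padded as empty entries.

```python
from dataclasses import dataclass
from typing import List, Optional

# A minimal sketch of a hierarchical record: group-level attributes plus a
# list of individual-level attribute dictionaries (at most tau of them).

@dataclass
class Group:
    group_attrs: dict            # group-level attributes (e.g., {"METRO": "rural"})
    individuals: List[dict]      # individual-level attribute dicts, len <= tau

def pad_group(g: Group, tau: int) -> List[Optional[dict]]:
    """Pad a group to exactly tau slots so every element of the data
    universe has the same shape (missing rows become None)."""
    assert len(g.individuals) <= tau
    return g.individuals + [None] * (tau - len(g.individuals))

# Example: a rural household with two members, tau = 4.
household = Group({"METRO": "rural"},
                  [{"SEX": "male", "AGE": 34}, {"SEX": "female", "AGE": 31}])
print(pad_group(household, tau=4))
```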

3 Hierarchical Query Release

We now formulate the problem of hierarchical query release, in which we construct a set of queries that our synthetic data algorithms will optimize over. Specifically, to construct a set of statistics describing datasets , we will focus on linear queries, which we define as the following: [linear query] Given a dataset and predicate function , a linear query is defined as

For example, the number of males in some dataset can be represented as a linear query whose predicate function indicates whether a row is male. Similar to previous work, we will also normalize query answers with respect to a fixed constant, which denotes some overall count such as the size of the dataset (e.g., % males = # males / # people). (In some prior works, a linear query is instead defined with the normalizing constant included.)

In the case where is an indicator function, the output of some linear query can be interpreted as the count of elements satisfying a (boolean) condition defined by (e.g., is male). Going forward, we will refer to such predicates as singleton predicate functions, where we restrict to condition over a single attribute in either or . We can then form linear queries over distinct attributes by combining singleton predicates into a -way predicate function, which we define as [-way predicate] Given some set of singleton predicate functions , we define a -way predicate function as the product . In other words, a -way predicate function outputs some logical conjunction over boolean conditions defined for distinct attributes in (e.g., whether a person is both male and 30 years old).
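As an illustration, here is a minimal Python sketch (hypothetical helper names, not the paper's code) of singleton predicates and their conjunction into a multi-attribute predicate, evaluated as an unnormalized and then normalized linear query.

```python
# Singleton and multi-attribute predicates as 0/1 indicator functions over a row.

def singleton(attr, target):
    """Indicator: does the row's attribute equal the target value?"""
    return lambda row: int(row.get(attr) == target)

def conjunction(*predicates):
    """Logical conjunction (product) of singleton predicates over distinct attributes."""
    return lambda row: int(all(p(row) for p in predicates))

is_male = singleton("SEX", "male")
is_30 = singleton("AGE", 30)
male_and_30 = conjunction(is_male, is_30)    # predicate over two attributes

rows = [{"SEX": "male", "AGE": 30}, {"SEX": "female", "AGE": 30}]
answer = sum(male_and_30(r) for r in rows)   # unnormalized linear query
print(answer / len(rows))                    # normalized by dataset size
```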

In the traditional query release setting, it suffices to construct a set of queries counting the number of (individual) rows satisfying linear queries defined by -way predicates. In contrast, given the hierarchical structure of , our goal then is to output counts at both the group and individual-level. Consequently, we have two classes of query types: (1) linear queries at the group-level (i.e., % of households) and (2) linear queries at the individual-level (i.e., % of people).

Group-level queries. Given some dataset, each of its elements corresponds to a group. Therefore, we can write down group-level counting queries using Definition 3, where the predicate is some -way predicate function.

Individual-level queries. When counting individual rows, each group can contribute up to rows to the total count (e.g., a household can contain up to males). Therefore, defining such queries using Definition 3 requires that . However, we can convert hierarchical datasets from a collection of groups to a collection of individuals , where

(2)

Unpacking Equation 2, we have that describes the features for some individual row , the group to which belongs, and the features of the remaining group members. Now, we can also write down these linear queries using -way predicates, where .

Query Type         Attr. Domain   Predicate                                          Example
Group-level        Group          group attribute equals a target value              What proportion of households reside in rural areas?
Group-level        Individual     group contains an individual row with attribute    What proportion of households contain (at least) one male individual?
Individual-level   Group          individual belongs to a group with attribute       What proportion of individuals live in rural households?
Individual-level   Individual     individual attribute equals a target value         What proportion of individuals are male?
Table 1: We can distinguish between singleton predicates by (1) whether counts are made at the group or individual-level and (2) whether the attribute that the predicate conditions on is group-level or individual-level. For each combination of (1) and (2), given some attribute, we describe the corresponding predicate, which checks whether that attribute takes some target value (e.g., Male for the attribute Sex).

We summarize in Table 1 the singleton predicates for both group and individual-level queries that we consider. Given some set of group-level attributes and some set of individual-level attributes, we construct a set of group and individual-level queries by combining singleton predicates for each attribute in the set. For example, given the group-level attribute Urban/rural and individual-level attributes Sex and Race, we have queries of the following form (i.e., linear queries defined over these three attributes; a small evaluation sketch follows the list):

  • () What proportion of households are located in an urban area and contain at least one individual who is female and white?

  • () What proportion of individuals reside in a household located in an urban area and are female and white?
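Below is a minimal sketch (toy data, hypothetical attribute names, not the paper's code) showing how the two example queries above are evaluated on a hierarchical dataset and normalized by the number of households and the number of people respectively.

```python
# Toy hierarchical dataset: each household has group-level attributes and members.
households = [
    {"URBAN": True,  "members": [{"SEX": "female", "RACE": "white"},
                                 {"SEX": "male",   "RACE": "white"}]},
    {"URBAN": False, "members": [{"SEX": "female", "RACE": "black"}]},
]

def matches(person):
    return person["SEX"] == "female" and person["RACE"] == "white"

# Group-level query: proportion of households that are urban and contain
# at least one white female individual (normalized by the number of households).
group_count = sum(1 for h in households
                  if h["URBAN"] and any(matches(p) for p in h["members"]))
print(group_count / len(households))

# Individual-level query: proportion of individuals who live in an urban
# household and are themselves white and female (normalized by the number of people).
people = [(h, p) for h in households for p in h["members"]]
ind_count = sum(1 for h, p in people if h["URBAN"] and matches(p))
print(ind_count / len(people))
```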

Finally, for differentially private algorithms, it is necessary to derive the -sensitivity, which captures the effect of changing an individual in the dataset for each function (or query). [-sensitivity] The -sensitivity of a function is

Because and are comprised of linear queries with and are normalized by constants and respectively, the -sensitivities of queries in and are and .

4 Synthetic data generation for hierarchical data

Having formulated the problem of hierarchical query release, we next introduce a general approach to answering queries privately via synthetic data generation. Given some family of data distributions , our goal then is to find some synthetic dataset that solves the optimization problem

One natural approach is to construct a probability distribution over all elements of the data universe, and while methods using this form of distribution have been shown to perform well (hardt2010multiplicative; hardt2010simple; ullman2015private; gaboardi2014dual; liu2021leveraging; liu2021iterative), they are computationally intractable in most settings given that the size of the data universe grows exponentially with the number of attributes. Inspired by methods introduced in liu2021iterative, we demonstrate that an alternative solution is to represent the distribution as a mixture of product distributions over each group and individual-level attribute. In this way, the computational requirements for methods operating on this representation scale only linearly with the number of mixture components and the number of attributes.

Adaptive Measurements. We consider algorithms under the Adaptive Measurements framework (liu2021iterative), for which we provide a brief overview. Having observed that many synthetic data generation algorithms for private query release share similar iterative procedures, liu2021iterative present a unifying framework in which such algorithms can be broken down into the following steps at each round: (1) privately sample some query with high error, (2) measure the answer to this query using a differentially private mechanism, and (3) optimize some loss function with respect to the measurements from all rounds so far. Moreover, they show that formulating an algorithm under this framework reduces to selecting a distributional family and finding a corresponding loss function.

Input: Private dataset , set of linear queries , distributional family , loss functions , number of iterations
Initialize distribution
for  do
       Sample: Choose using the exponential mechanism.
       Measure: Let where is Gaussian noise (Gaussian mechanism).
       Update: Let and . Update distribution :
end for
Output where is some function over all distributions (such as the average)
Algorithm 1 Overview of Adaptive Measurements

We restate Adaptive Measurements in Algorithm 1 using the exponential and Gaussian mechanisms for the sample and measure steps respectively. We also note that in our algorithms, we will run these mechanisms under the assumption that all queries have -sensitivity equal to . For example, we have a sensitivity of for (since ). Having bounded the sensitivity of the queries, we can now preserve the privacy guarantees of Adaptive Measurements for our hierarchical setting (we provide a proof in Appendix A.1). Theorem 1. For all , there exist parameters such that when run with the exponential and Gaussian mechanisms using and respectively, algorithms under the Adaptive Measurements framework satisfy -differential privacy.
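As a complement to Algorithm 1, the following is a minimal, illustrative Python sketch (hypothetical helper names; the privacy accounting is simplified and is not a substitute for the zCDP analysis in Appendix A.1) of one sample-and-measure round.

```python
import numpy as np

def exponential_mechanism(errors, eps, sensitivity, rng):
    """Sample the index of a high-error query; utility = absolute error."""
    scores = eps * np.asarray(errors) / (2.0 * sensitivity)
    scores -= scores.max()                      # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return rng.choice(len(errors), p=probs)

def gaussian_mechanism(true_answer, sigma, rng):
    """Noisy measurement of the selected query's answer."""
    return true_answer + rng.normal(scale=sigma)

rng = np.random.default_rng(0)
true_answers = np.array([0.42, 0.10, 0.77])     # q(D) for each query (toy values)
synth_answers = np.array([0.40, 0.30, 0.50])    # q(D_synth) under the current model

errors = np.abs(true_answers - synth_answers)
i = exponential_mechanism(errors, eps=1.0, sensitivity=1.0, rng=rng)
noisy = gaussian_mechanism(true_answers[i], sigma=0.05, rng=rng)
# The pair (i, noisy) is appended to the measurement set, and the model is
# then updated by minimizing the loss over all measurements collected so far.
print(i, noisy)
```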

4.1 Hierarchical product distributions (HPD)

Our goal is to derive a probability distribution from which we sample groups (group-level attributes) and the individual rows (individual-level attributes) contained in each group. Let one distribution govern the number of individuals in each group and another govern the group-level attributes; in addition, for each individual slot in a group, let a corresponding distribution govern that row's individual-level attributes. In Algorithm 2, we present our two-stage sampling procedure, which can be summarized as follows: (1) sample the group-level attributes and group size, and (2) sample that many sets of individual-level attributes.

Input: Distributions , , and
Sample group size
Sample group-level attributes
for  do
       Sample individual-level attributes
      
end for
Let for
Output and
Algorithm 2 Sampling procedure for HPD.
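Below is a minimal Python sketch (hypothetical structure and attribute names, not the paper's code) of this two-stage sampling procedure using a single product distribution per level; the actual methods maintain a uniform mixture of such product distributions, as described next.

```python
import numpy as np

rng = np.random.default_rng(0)

size_probs = {1: 0.3, 2: 0.4, 3: 0.3}                       # distribution over group size
group_attr_probs = {"METRO": {"urban": 0.8, "rural": 0.2}}  # group-level attributes
ind_attr_probs = {"SEX": {"male": 0.49, "female": 0.51},    # individual-level attributes
                  "EDUC": {"hs": 0.6, "college": 0.4}}

def sample_categorical(probs):
    keys = list(probs)
    return keys[rng.choice(len(keys), p=list(probs.values()))]

def sample_group():
    k = sample_categorical(size_probs)                                        # stage 1: size
    group = {a: sample_categorical(p) for a, p in group_attr_probs.items()}   # stage 1: group attrs
    members = [{a: sample_categorical(p) for a, p in ind_attr_probs.items()}
               for _ in range(k)]                                             # stage 2: individuals
    return group, members

print(sample_group())
```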

Parameterization. Inspired by liu2021iterative, we choose to parameterize the group and individual-level distributions as a uniform mixture of product distributions over each group and individual-level attribute. Concretely, our parameterization means that for each of the product distributions, we maintain a distribution over each attribute separately. For example, in the ACS dataset, the attribute SEX takes on the values Male and Female, meaning that we maintain a 2-dimensional vector corresponding to the probabilities that an individual is male or female.

Mixture of product distributions. One can interpret the number of mixture components as a tunable parameter that increases the expressiveness of our probability distribution. With a single component, we are limited to one product distribution, meaning that one must assume that attributes are independent across all groups and individual rows in the private dataset. On the opposite end of the spectrum, if the number of components matches the number of groups, then any dataset of that size can be perfectly captured, since each row can be thought of as a product distribution over point masses. In non-hierarchical settings, liu2021iterative show that the number of components can be much smaller than the dataset size while still achieving strong performance.

Linear number of parameters. We note that the number of parameters grows linearly with the number of attributes and . In particular, and require and parameters respectively, where and .

Differentiable query answers. An advantage of this representation is that mean aggregate queries can be calculated using basic operations over the per-attribute probabilities. Therefore, this formulation makes constructing differentiable functions for each predicate in Table 1 convenient. For example, to calculate the likelihood that an individual generated from the model is male, one can simply look up the corresponding probability in the vector maintained for the sex attribute. Similarly, to calculate the probability that a group contains at least one male, it suffices to calculate the probability that a group (given some size) has no males and subtract it from one.
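To make this concrete, here is a minimal sketch (toy numbers; p_male and size_probs are hypothetical stand-ins for entries of the maintained probability vectors) of the two query answers described above. Both are compositions of sums and products of the parameters, so they remain differentiable.

```python
p_male = 0.49
size_probs = {1: 0.3, 2: 0.4, 3: 0.3}   # probability that a group has k individuals

# Individual-level singleton: likelihood that a sampled individual is male.
answer_individual = p_male

# Group-level "contains at least one male": condition on the group size k and
# take one minus the probability that all k sampled individuals are not male.
answer_group = sum(p_k * (1.0 - (1.0 - p_male) ** k) for k, p_k in size_probs.items())

print(answer_individual, answer_group)
```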

4.2 Methods

For notation purposes, we let be some matrix of probability values such that row is the concatenation of the probability vectors for attributes in . We define and similarly for individual-level attributes and the group size . In Figure 1, we provide an example of a row in .

Figure 1: We provide an example of the row in matrix . As shown, the individual-level query with the predicate for "is female" and "is " is simply a product of two values in .

We now describe two methods inspired by the distributional families used in RAPsoftmax and GEM (liu2021iterative). The first maintains a fixed distribution in which the parameters of our model are exactly the probability vectors making up our product distributions. The second employs generative modeling, using a neural network architecture that takes randomly sampled Gaussian noise as input. Note that in both cases, we use the softmax function to ensure that these methods output probabilities while still maintaining differentiability.

Fixed distribution modeling (HPD-Fixed). Our first method defines a "model" whose parameters , , and are simply the matrices , , and respectively.

Generative modeling (HPD-Gen). Alternatively, we propose using a multi-headed neural network to generate rows from Gaussian noise . In this case, the output of each head, which we denote by , , and , corresponds to the , , and respectively.

Loss function. Following the Adaptive Measurements framework (liu2021iterative), at each round, we have a set of selected queries and their noisy measurements. Suppose we have some set of probability matrices, obtained via either the fixed distribution or generative modeling methods outlined above. Abusing notation, we denote the collection of these matrices by a single symbol. Then at each round, we optimize the following differentiable loss function via stochastic gradient descent:

(3)

where for each query, there exists a differentiable function that maps the probability matrices to the corresponding query answer; referring to the query types in Table 1, such a function can be written down both for group-level hierarchical queries and for individual-level hierarchical queries.
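The following is a minimal PyTorch sketch (toy shapes and a hypothetical query function; Adam stands in here for the stochastic gradient descent described above) of this per-round optimization, where the softmax keeps the per-attribute parameters on the probability simplex.

```python
import torch

num_attrs, domain_per_attr, K = 3, 4, 8          # toy sizes (K mixture components)
logits = torch.randn(K, num_attrs, domain_per_attr, requires_grad=True)

def query_answer(probs, attr, value):
    """Toy individual-level singleton query: mean over the K mixture
    components of the probability that the attribute takes the value."""
    return probs[:, attr, value].mean()

measurements = [((0, 1), 0.52), ((2, 3), 0.17)]  # [(query, noisy answer), ...]
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    opt.zero_grad()
    probs = torch.softmax(logits, dim=-1)        # softmax keeps valid probabilities
    loss = sum((query_answer(probs, a, v) - m) ** 2 for (a, v), m in measurements)
    loss.backward()
    opt.step()
```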

5 Experiments

Methods. We evaluate methods that optimize over the two distributional families mentioned in Section 4—namely, modeling a distribution over all elements of the data universe (MWEM) and modeling hierarchical product distributions (HPD-Fixed and HPD-Gen). We provide more details on these methods in Algorithms 4 and 6 in the appendix.

Datasets. We construct datasets from the following:

American Community Survey (ACS) (ruggles2021ipums). We first use the American Community Survey as our main evaluation dataset. The ACS gathers microdata annually and is widely used by various organizations to capture the socioeconomic conditions in the country. Survey responses are collected at the household level, describing the household itself () and the various individuals () residing in each household. We use data ( group-level and individual-level attributes) from 2019 for the state of New York (ACS NY-19) and select households of size or smaller. In total, we have individual rows that comprise groups.

Running MWEM requires a domain size significantly smaller than that of ACS NY-19. Therefore, we create a reduced version ( group-level and individual-level attributes) of the dataset (ACS (red.) NY-19), where we select households of size or smaller. In addition, we combine categories that attributes take on to reduce the domain size further. The domain size of ACS (red.) NY-19 is , which is significantly less than that of ACS NY-19 ().

Allegheny Family Screening Tool Data (AFST) (vaithianathan2017developing). We run experiments using child welfare data of incidents located in Allegheny County, Pennsylvania, USA. Having been acquired from the Allegheny County Office of Children, Youth and Families (CYF), this data is used by call-screeners to make decisions about which referrals (i.e., reports of child abuse or neglect) to investigate via an AI-based tool called the Allegheny Family Screening Tool (AFST). We select a subset of columns ( group-level and individual-level) from the source dataset where . Our groups are the referrals themselves, which are comprised of children involved in such incidents. We will refer to this sample as AFST. In total, AFST has individual rows that comprise groups.

Evaluation Queries. As shown in Section 3, given some set of group and individual-level attributes , we construct group and individual-level queries and . We restrict the number of attributes for each query to and choose all such queries. This query selection scheme results in queries for ACS (red.) NY-19, for ACS NY-19, and for AFST.

Results. We evaluate all methods on each dataset across privacy budgets, plotting the max error over queries in Figure 2 (plots for mean error can be found in Figure 4 of the appendix). The methods perform reasonably well, with HPD-Fixed and MWEM outperforming HPD-Gen in all experiments. We also note that performance on AFST is lower due to the significantly higher sensitivity of the queries in this experiment.

Figure 2: Max errors for group and individual-level hierarchical queries evaluated on ACS/ACS (red.) NY-19 and AFST where and . The x-axis uses a logarithmic scale. Results are averaged over runs, and error bars represent one standard error.

6 Case study: Modeling intra-household relationships in the ACS

Table 1 summarizes a general class of group and individual-level queries that are applicable to any hierarchical dataset belonging to . However, we note that this list is non-exhaustive because of the innumerable ways in which one can ask queries about the composition of each group. Therefore, while we contend that the sets and serve as a good basis for capturing group and individual-level relationships, there could exist additional queries that are important to specific datasets or problem domains. In particular, one interesting set of statistics that may be useful to measure are those that capture relationships between individuals belonging to the same group. For example, one common analysis employed by social scientists using the ACS involves uncovering statistical trends within spousal relationships with respect to attributes like education and ethnicity.

Consequently, we augment for the ACS with an additional individual-level query that takes on the form—"shares a group with another individual row with "—for some individual-level attribute and target . Furthermore, to focus on spousal and parent-child relationships, we restrict the dataset to only heads of the household and their spouses and children. We then let be predicates that identify individuals’ relationships to each other (e.g., spouse/parent/child) plus one or more additional attributes (e.g., sex, age, education, etc.). Taken together, we then have intra-group-relationship predicates such as: "has a spouse/parent/child that is male".

Combined with the predicates found in Table 1, we can construct intra-group-relationship queries . Suppose we have some set of group-level attributes and sets and of individual-level attributes (where ). As in Section 3, we form singleton predicates from and (Table 1). However, in this case, we also add intra-group-relationship predicates formed from attributes in . For example, letting , , and , we have queries of the form:

  • () What proportion of individuals who live in an urban area and have a graduate degree are married to an individual who is employed?

Similar to assumptions made in Section 3 for the more general setting, we assume that neighboring datasets differ on a single individual while maintaining the overall household count. We also assume that households in are always valid. For example, households without a head cannot exist. In Appendix A.3, we describe how the flexibility of HPD allows us to make minor changes tailored to this problem domain. Using this modification, we also define for .

Experiments. To evaluate intra-group-relationship queries , we create a separate sample from the ACS data, which we denote as ACS (rel.) NY-19. In this case, we restrict individual types (attribute RELATE) to the values Head, Spouse, Child, giving us a maximum group size of (up to children and spouse can exist). To add the queries , we again have that where . With the three relation types spouse/parent/child, we have a total of queries. Like before, we also have that .

In Figure 3, we evaluate HPD-Fixed and HPD-Gen on ACS (rel.) using the sets of queries (left column) and (right column). Although the two methods perform comparably, we again observe that HPD-Fixed performs better.

Figure 3: Max errors for queries evaluated on ACS (rel.) NY-19 where and . On the left, we evaluate only on queries in , and on the right, we use all queries . The x-axis uses a logarithmic scale. Results are averaged over runs, and error bars represent one standard error.

7 Conclusion

In summary, having formulated the problem of hierarchical query release, we present an approach for generating private synthetic data with hierarchical structure. We then empirically evaluate our methods using a variety of query types and datasets, demonstrating that hierarchical relationships can be preserved using private synthetic data. Going forward, we hope that our work will inspire future research on hierarchical data with respect to both synthetic data and private query release.

References

Appendix A Hierarchical query release methods

We provide additional information regarding the methods discussed in Section 4. Note that implementation details for MWEM, HPD-Fixed, and HPD-Gen can be found in Appendix B.

A.1 Adaptive Measurements

  Input: Private dataset , set of queries , distributional family , loss function
  Parameters: Privacy parameter , number of iterations , privacy weighting parameter
  Let
  Let be the max -sensitivity over all queries in
  Initialize distribution
  for  to  do
     Sample: Choose using the exponential mechanism with score function
     Measure: Take measurement via the Gaussian mechanism:
     Update: Let and . Update distribution :
  end for
  Output: where is some function over all distributions (such as the average)
Algorithm 3 Adaptive Measurements

For the purposes of making this work more self-contained, we restate details of the Adaptive Measurements framework given by liu2021iterative. Specifically, we provide the full Adaptive Measurements algorithm in Algorithm 3, in which we write down the algorithm using zero concentrated differential privacy (zCDP) (DworkR16; BunS16). [Zero Concentrated Differential Privacy (BunS16)] A randomized mechanism satisfies rho-zero concentrated differential privacy (rho-zCDP) if for all neighboring datasets (i.e., differing on a single person) and for all alpha > 1, the alpha-Rényi divergence between the output distributions of the mechanism on the two datasets is at most rho times alpha.

We note that we can convert from zCDP to -DP using the following: [BunS16] For all , if is -zCDP, then satisfies -differential privacy where .

Using this lemma, we can now prove Theorem 1, which states the privacy guarantee of Adaptive Measurements in terms of -DP. We restate the proof sketch given by liu2021iterative.

Proof sketch of Theorem 1. At each iteration, Adaptive Measurements runs the exponential and Gaussian mechanisms, which satisfy [cesar2020unifying] and -zCDP [BunS16] respectively. Therefore at each iteration, Adaptive Measurements satisfies -zCDP, or -zCDP after iterations [BunS16]. Plugging in , we have that Adaptive Measurements satisfies -zCDP. By Lemma 3, Adaptive Measurements then satisfies -differential privacy where . Therefore, for all , there exists some such that when run with the exponential and Gaussian mechanisms using parameters and respectively, Adaptive Measurements satisfies -differential privacy.

A.2 Explicit distribution over elements in the data universe

We consider algorithms that optimize over the distributional family . In other words, contains all possible probability distributions over , and any histogram can be thought of as a normalized histogram over group types in . In this case, the parameters of methods optimizing over such representations are simply the values of a -dimensional probability vector. Furthermore, we can also sample directly from to generate a synthetic dataset .

Suppose we have some dataset with some group size and histogram representation . Then, we can write any linear query (Definition 3) as a dot product and . We now discuss how to evaluate both group and individual-level queries on a histogram .

Group-level queries (). Given that is a distribution over all possible group types, we have that . Moreover, since we are normalizing over some total number of groups for group-level queries, our query function for some query with predicate is simply

Individual-level queries (). A group can contribute up to individuals when counting the number of individuals satisfying some predicate condition . Moreover, even when is fixed, the number of individual rows contained in varies with . Let be the predicate function that outputs the size of a group . Then the total number of individual rows in is . Therefore for individual-level queries, we have that

Having described how to both parameterize and evaluate any hierarchical query on some dataset , one can directly optimize our objective using algorithms such as MWEM [hardt2010simple], which we describe in Appendix B.3.1, or PEP [liu2021iterative].
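As an illustration of the histogram representation above, here is a minimal Python sketch (toy domain, hypothetical names) that evaluates a group-level and an individual-level query as dot products between a normalized histogram over group types and per-type query vectors, with the individual-level answer normalized by the expected number of individuals per group.

```python
import numpy as np

# Toy universe of group types: (group attribute, tuple of member attributes).
universe = [("urban", ("male",)), ("urban", ("male", "female")),
            ("rural", ("female",)), ("rural", ("male", "male"))]
hist = np.array([0.25, 0.25, 0.25, 0.25])     # normalized histogram over group types

# Group-level query: proportion of groups containing at least one male.
q_grp = np.array([any(m == "male" for m in members) for _, members in universe], float)
print(hist @ q_grp)

# Individual-level query: proportion of individuals who are male. Each group
# type contributes its count of matching rows, normalized by the expected
# number of individuals per group under the histogram.
counts = np.array([sum(m == "male" for m in members) for _, members in universe], float)
sizes = np.array([len(members) for _, members in universe], float)
print((hist @ counts) / (hist @ sizes))
```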

A.3 Modifications to HPD for modeling intra-household relationships in the ACS

Given the emphasis on the individual type in the ACS, we propose the following modification to :

(4)

where , , and correspond to the attributes of individuals with types Head, Spouse, and Child respectively, and , , denote the maximum number of each type of person in the dataset. (In the ACS, every household has exactly one Head and up to one Spouse.)

One immediate advantage of this formulation is that we can now enforce counts for individual types in our synthetic dataset. Previously, for example, synthetic households sampled from the product distribution could contain more spouses or children than is possible given the constraints of the ACS data. Similarly, a synthetic household could be entirely missing a person designated as the head.

In our product distribution representation, we also replace the single group-size distribution with separate distributions over the numbers of heads, spouses, and children, and we maintain separate individual-level distributions for each person type. Note that because every household in the ACS must have a head, the head count is fixed (i.e., the probability of a household having a single head is always one). As in Section 4, we use the corresponding matrix notation for each of these distributions.

Finally, we can write down the query function for intra-group-relationship queries () in terms of the pairs of person types associated with each relationship type. For example, for a query counting the number of people married to an individual satisfying some condition, the relevant set is the pairs of heads and spouses. We include in Table 2 more details about the intra-group-relationship queries that we consider for the ACS dataset. Note that because there can be at most one head and one spouse in this dataset, the corresponding terms in this expression reduce to those for single individuals.

Relation      Sensitivity   Example
married to                  How many people are married to someone who has graduated college?
has child                   How many people have a child who is in high school?
has parent                  How many people have a parent who is older than thirty years old?
Table 2: We describe the different types of household relationships in the ACS (spouse/parent/child) that we consider for these queries. In particular, for each relationship condition, we list its sensitivity and the set of individual type pairings.

Appendix B Experimental Details

We provide additional information related to our empirical evaluation.

B.1 Mean error results

We plot mean error for our experiments presented in Section 5 in Figures 4 and 5. Results with respect to mean error are similar to those with respect to max error. We observe, however, that HPD-Gen outperforms HPD-Fixed on ACS (rel.) when run on only.

Figure 4: Max and mean errors for group and individual-level hierarchical queries evaluated on ACS/ACS (red.) NY-19 and AFST where and . The x-axis uses a logarithmic scale. Results are averaged over runs, and error bars represent one standard error.
Figure 5: Max and mean errors for queries evaluated on ACS (rel.) NY-19 where and . On the left, we evaluate only on queries in , and on the right, we use all queries . The x-axis uses a logarithmic scale. Results are averaged over runs, and error bars represent one standard error.

B.2 Data

In Table 3, we list attributes from the ACS and AFST data that we use for our experiments. We obtained the raw data from the IPUMS USA database [ruggles2021ipums]. The AFST data is confidential, and so access to the raw data currently requires signing a data agreement. We obtained permission from the Allegheny County Office of Children, Youth and Families to publish error plots on experiments run on a sample derived from this data. More information about the attributes used for the tool can be found in vaithianathan2017developing.

Dataset            Domain            Attributes
ACS (red.)         Group-level       COUNT, METRO, OWNERSHP, FARM, FOODSTMP
                   Individual-level  SEX, AGE, EMPST, MARST
ACS / ACS (rel.)   Group-level       COUNT, METRO, OWNERSHP, FARM, COUNTYFIP, FARMPROD, ACREHOUS, ROOMS, BUILTYR2, FOODSTMP, MULTGEN
                   Individual-level  SFRELATE, SEX, MARST, RACE, HISPAN, CITIZEN, EDUC, SCHOOL, EMPSTAT, LOOKING, AGE
AFST               Group-level       VERSION, RELATIONSHIP_TO_REPORT, CALL_SCRN_OUTCOME, SERVICE_DECISION, HH_CITY
                   Individual-level  ACJ_NOW_VICT_SELF, ACJ_EVERIN_VICT_SELF, JPO_NOW_VICT_SELF, JPO_EVERIN_VICT_SELF, AGE_AT_RFRL_VICT_SELF, PLSM_PAST548_COUNT_NULL, SER_PAST548_COUNT_VICT_SELF, REF_PAST548_COUNT_VICT_SELF
Table 3: Data Attributes. Note that all hierarchical datasets have an attribute COUNT that denotes the group size.

B.3 Algorithm implementation details

We provide the exact details of our algorithms in the following sections. Hyperparameters can be found in Table 4. All experiments are run using a desktop computer with an Intel® Core™ i5-4690K processor and NVIDIA GeForce GTX 1080 Ti graphics card. Our implementation is derived from https://github.com/terranceliu/dp-query-release [liu2021iterative].

Dataset Method Parameter Values
ACS (red.) MWEM , , ,
, ,
HPD-Fixed learning rate
, , , , , ,
HPD-Gen learning rate
hidden layers
, , , , , ,
ACS / ACS (rel.) / AFST HPD-Fixed learning rate
, , , ,
HPD-Gen learning rate
hidden layers
, , , ,
Table 4: Hyperparameters. We use for all methods.

B.3.1 MWEM

  Input: Private hierarchical dataset , set of queries
  Parameters: Privacy parameter , number of iterations , privacy weighting parameter , max per-round iterations
  Let be the maximum possible number of individual rows belonging to a group in
  Let be the max -sensitivity over all queries in
  Let
  Let be the normalized histogram representation of
  Initialize to be a uniform distribution over
  for  to  do
     Sample: Select query using the exponential mechanism with score function
     Measure: Take measurement via the Gaussian mechanism
     Update: = MWEM-Update () where and
  end for
  Output:
Algorithm 4 MWEM
  Input: Normalized histogram , queries , noisy measurements , max iterations
  Let be the max error across queries in
  Let be the collection of indices for the top queries with highest error
  Let be the collection of indices such that the error for is greater than
  for  do
     Let be a distribution s.t.
where
  end for
  Output:
Algorithm 5 MWEM-Update

We restate MWEM in Algorithm 4, with a slight change to the multiplicative weights update rule—in our case, we rescale by a factor of (Algorithm 5) when is an individual-level query so that . In addition, we add empirical improvements described in liu2021leveraging, which are presented in Algorithm 5.
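For reference, the following is a minimal Python sketch of the standard multiplicative weights update that MWEM builds on (the textbook form, not the paper's exact rescaled variant): probability mass is shifted toward group types whose current query value undershoots the noisy measurement.

```python
import numpy as np

def mw_update(hist, q_vec, noisy_answer, scale=1.0):
    """hist: normalized histogram over group types; q_vec: per-type query values
    (possibly rescaled for individual-level queries); noisy_answer: measurement."""
    error = noisy_answer - hist @ q_vec
    new_hist = hist * np.exp(scale * q_vec * error / 2.0)
    return new_hist / new_hist.sum()          # renormalize to a distribution

hist = np.full(4, 0.25)
q_vec = np.array([1.0, 1.0, 0.0, 1.0])        # indicator of a group-level predicate
print(mw_update(hist, q_vec, noisy_answer=0.9))
```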

Note that in general, running MWEM is extremely impractical in the hierarchical setting since the overall domain size scales exponentially with the maximum group size (in addition to the group and individual-level domain sizes). For example, in the non-hierarchical setting, the domain size of the attributes used in ACS (red.) NY-19 is only (compared to a domain size of in our case when ).

B.3.2 Hierarchical Product Distributions

  Input: Private hierarchical dataset , queries
  Parameters: Privacy parameter , number of iterations , privacy weighting parameter , batch size , max per-round iterations
  Let be the max -sensitivity over all queries in
  Let
  Initialize as some representation of a hierarchical product distribution that is parameterized by (either fixed table in HPD-Fixed or neural network in HPD-Gen)
  for  to  do
     Sample: Select query using the exponential mechanism with parameter and score function
     Measure: Take (via the Gaussian mechanism) measurement
     Update: