
Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

08/26/2022
by   Lucas Rosenblatt, et al.

Differential privacy mechanisms are increasingly used to enable public release of sensitive datasets, relying on strong theoretical guarantees for privacy coupled with empirical evidence of utility. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, or classification accuracy. In this paper, we propose an alternative evaluation methodology for measuring the utility of differentially private synthetic data in scientific research, a measure we term "epistemic parity." Our methodology consists of reproducing empirical conclusions of peer-reviewed papers that use publicly available datasets, and comparing these conclusions to those based on differentially private versions of the datasets. We instantiate our methodology over a benchmark of recent peer-reviewed papers that analyze public datasets in the ICPSR social science repository. We reproduce visualizations (qualitative results) and statistical measures (quantitative results) from each paper. We then generate differentially private synthetic datasets using state-of-the-art mechanisms and assess whether the conclusions stated in the paper hold. We find that, across reasonable epsilon values, epistemic parity only partially holds for each synthesizer we evaluated. Therefore, we advocate for both improving existing synthesizers and creating new data release mechanisms that offer strong guarantees for epistemic parity while achieving risk-aware, best effort protection from privacy attacks.


1 Introduction

Differential privacy (DP) has been studied intensely for over a decade, and has recently enjoyed uptake in the private and public sector. In situations where the analysis is known, one can design specialized mechanisms with high utility mckenna2021winning; mckenna2022aim. But an active research area is to design DP mechanisms that model the entire data distribution, then sample these noisy models to generate synthetic datasets. The goal is to support a broader variety of downstream, unanticipated applications. However, this goal is fundamentally unachievable due to the Fundamental Law of Information Recovery DBLP:conf/pods/DinurN03: overly accurate estimates of too many statistics are blatantly non-private. Therefore, the utility of DP synthetic data generation techniques (henceforth, synthesizers) is evaluated empirically using proxy tasks on common public datasets. These tasks include descriptive statistics and queries involving one or two variables hill2015evaluating; hay2016principled; takagi2021p3gm; tao2021benchmarking, classification accuracy ding2020differentially; takagi2021p3gm; zhang2017privbayes, and information theoretic measures zhang2017privbayes. These proxy tasks are informed by real tasks, but the implicit claim of generalization is rarely explored.

With empirical studies having limited perceived relevance, it is perhaps unsurprising that DP is not universally embraced in the social sciences. The US Census Bureau adopted DP for disclosure avoidance in the 2020 census, interpreting federal law (the Census Act, 13 U.S.C. § 214, and the Confidential Information Protection and Statistical Efficiency Act of 2002) as a mandate to use advanced methods to protect against computational reconstruction attacks unforeseen when the laws were passed. But there has been resistance among some in the research community, who contend that DP noise destroys accurate demographic distributions ruggles2019differential and can exacerbate underrepresentation of minorities kenny2021use; ganev2021robin. Besides confounding research, there are potential consequences for policy: Block grants are allocated based on minority populations as measured by the census data; underrepresentation can lead to underfunding integral services including Medicaid, Head Start, SNAP, Section 8 Housing vouchers, Pell Grants, and more christ2022differential.

Despite these challenges, DP still offers stronger guarantees of disclosure protection and similar utility to alternative proposals (e.g., swapping christ2022differential). The DP promise, if used correctly, ensures that any inferences conducted on data do not reveal whether a single individual’s information (including, for example, their gender or race) was included in the data for analysis dwork2006calibrating. Thus, DP can both prevent leakage of an individual’s data and allow their sensitive and other attributes to be used during model training jagielski2019differentially.

Challenges.

A practical method of operationalizing DP is by learning a DP model of a dataset, then sampling that model to generate synthetic data dwork2009complexity; hardt2010simple; DBLP:journals/corr/abs-2001-09700; rosenblatt2020differentially; vietri2020new; mckenna2021winning. The goal is to develop general purpose data synthesizers that perform well across a number of metrics, with both high statistical utility (preservation of statistics involving a single feature) as well as high predictive utility (preservation of the joint probability distribution needed to make predictions). However, this goal has proven elusive due to inherent features of DP: if the privacy budget is “spent” uniformly across features, we can preserve statistical utility at the expense of conditional probabilities and marginals needed for accurate prediction. Further, utility loss may be non-uniform across subsets of a dataset, in some cases exacerbating inequity and leading to underrepresentation bagdasaryan2019differential; kenny2021use or to error rate disparities DBLP:journals/corr/abs-2204-12903.

Methodology.

We propose an evaluation strategy for synthesizers that emphasizes epistemic parity: the conclusions one draws on the original dataset are expected to hold on the noise-added dataset. We identify conclusions in the text, extract relevant findings supporting those conclusions, implement the corresponding statistical tests using public data, generate a synthetic dataset using state-of-the-art DP synthesizers, re-apply the implemented tests over the synthetic data, and then determine if the conclusions still hold.

We instantiate our methodology over a benchmark of peer-reviewed sociology papers that are based on public data from the Inter-university Consortium for Political and Social Research (ICPSR) repository. We model quantitative results as an inequality between two numbers, e.g., “Those using marijuana first (vs. alcohol or cigarettes first) were more likely to be Black, American Indian/Alaskan Native, multiracial, or Hispanic than White or Asian” fairman2019marijuana. In addition to quantitative results, we consider qualitative results: a subjective assessment of whether key visualizations in the paper expose the same relationships when recreated over synthesized data. Following errington2021reproducibility, and as is common in the reproducibility literature, our aim was not to reproduce every finding in every paper; rather, it was to identify and reproduce a selection of key findings from each paper.

For generality, interpretability, and simplicity, we consider a conclusion to hold over synthetic data if the two quantities are in the same relative order, and we do not attempt to measure the change in effect size or the statistical significance of the difference between the original and synthetic result.
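As a minimal illustration of this criterion, a quantitative finding of the form “quantity A exceeds quantity B” is counted as holding on synthetic data exactly when the ordering is preserved (the sketch below uses generic statistic names rather than any particular paper’s variables):

```python
def finding_holds(stat_a_orig: float, stat_b_orig: float,
                  stat_a_synth: float, stat_b_synth: float) -> bool:
    """A finding modeled as 'A > B' holds on the synthetic data when the
    relative order of the two statistics matches the original, regardless
    of how much the gap between them shrinks or grows."""
    return (stat_a_orig > stat_b_orig) == (stat_a_synth > stat_b_synth)
```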

Figure 1: Here we isolated finding statements across each of the papers we privatized, according to the taxonomy described in Section 3.1. The scale—0 (red) to 5 (blue)—represents the number of iterations in which each finding (x-axis, discrete) was reproduced using a new run of the private data.

Benchmark and results.

ICPSR is an NSF-funded repository for social science data holding over 100,000 publications associated with 17,312 studies. A study typically involves hundreds of variables and supports dozens of papers. Each paper can be considered to be deriving its own dataset (selected variables and selected rows) from the source data of the study. We apply DP methods to synthesize data for these paper-specific, study-derived datasets.

ICPSR studies are publicly available by policy, which enables us to instantiate the epistemic parity methodology and develop a benchmark. Notably, there is increasing demand from the ICPSR leadership and community to support keeping sensitive data private, while generating DP synthetic subsets to support reproducibility. Our methodology can be used to respond to this demand.

Paper selection. The benchmark we define consists of three datasets and six recent peer-reviewed papers selected for impact, accessibility of the topic to non-experts, recency, and several other criteria. We extracted findings and attempted to reproduce them, following the “same data, different code, different team” approach to reproducibility, encountering challenges commonly reported in that literature, including undocumented data versioning, underspecified or incomplete methodologies, and irreconcilable differences between our reproduction and what the authors report. Each paper received, at a minimum, attention from two researchers with advanced degrees in computer science, statistics, or both, and at least thirty hours of work. We include a complete list of papers that we attempted to reproduce, and the issues we encountered during reproduction, in the supplementary materials.

DP synthesizer selection. We use three state-of-the-art DP synthesizers, namely, MST mckenna2021winning, PrivBayes zhang2017privbayes and PATECTGAN rosenblatt2020differentially, executing each at their recommended settings. We describe these methods in Section 2.

Summary of results. We find that, across reasonable epsilon values, epistemic parity only partially holds for each paper in our benchmark, for each synthesizer we evaluated, and that no single synthesizer dominates the others in terms of performance on our metric. Our results are shown in Figure 1, and will be discussed further in Section 5.

Roadmap and Contributions.

We present related work on reproducibility, DP synthesis and evaluation of DP in Section 2, and then go on to present our contributions.

  • We propose the epistemic parity evaluation methodology, based on reproducing qualitative and quantitative empirical findings in peer-reviewed papers over DP synthetic datasets, in Section 3.

  • We instantiate the epistemic parity methodology for a benchmark of peer-reviewed papers in sociology, creating a reusable benchmark for evaluating synthesizers, as discussed in Section 4.

  • We present experimental results on our benchmark, using three state-of-the-art DP synthesizers, in Section 5.

We conclude with a discussion of the results, motivating a new class of privacy techniques that favor strong epistemic parity guarantees with best-effort privacy protection, in Section 6.

2 Related Work

DP synthesis and evaluation.

In our evaluation, we considered three state-of-the-art private data release methods: MST, PATECTGAN, and PrivBayes. We acknowledge that many other methods exist for generating DP data dwork2009complexity; hardt2010simple; DBLP:journals/corr/abs-2001-09700; vietri2020new. We chose these three methods according to recent work by tao2021benchmarking, which finds that, on tabular data, MST is the highest performing of the marginal-based synthetic data methods, PrivBayes is the highest performing of the Bayes-net-based methods, and PATECTGAN is the highest performing of the GAN-based methods.

MST relies on the Private-PGM graphical model to construct a maximum spanning tree among attributes in the data feature space, where edges are weighted by mutual information mckenna2018optimizing. By measuring 1-, 2-, and some 3-way marginals, MST is able to create a high-fidelity low-dimensional approximation of the joint distribution between all attributes, leading to impressive statistical utility on metrics like mean, standard deviation, and bivariate correlations. PATECTGAN, on the other hand, relies on a conditional generative adversarial network tuned to tabular data, where the discriminator has privacy constraints xu2019modeling. The high up-front cost of initializing and updating the weights in a full discriminator network means that PATECTGAN is limited in low-dimensional settings, but shows particular promise when the data is higher-dimensional. Finally, PrivBayes derives a Bayesian network model and adds noise to its low-order correlations to ensure differential privacy. PrivBayes has demonstrated efficacy in a number of settings, but is computationally inefficient for high-dimensional data.

tao2021benchmarking evaluated several DP synthesizers’ ability to preserve 1- and 2-variable distributions. jayaraman2019evaluating studied privacy-utility tradeoffs for ML tasks and found that commonly used ε values and implementations in practice may be ineffective: either unacceptable privacy leakage or unacceptable utility loss tends to occur. hay2016principled used 1- and 2-dimensional range queries over 27 public datasets to study the influence of dataset scale and shape, which had led to inconsistent results in the previous literature. While the datasets are diverse in relevant properties, the tasks are limited, and the link to the conclusions drawn is unexplored. hill2015evaluating studied the utility of DP on one longitudinal behavioral science dataset involving a sexuality survey, motivated by real-world attacks based on disclosing pregnancy, finding that the theoretical guarantees of DP were generally supported, but that high-dimensional data was a challenge for utility. In this work, we aim to facilitate and standardize these kinds of applied studies.

Reproducibility.

Numerous reproducibility studies have been attempted in various fields, typically reporting remarkably low rates of success, leading to calls for significant changes to policy and incentive structures underlying scientific funding and publishing national2019reproducibility; nosek2018preregistration; munafo2017manifesto. In the social sciences, camerer2018evaluating replicated 21 systematically selected experimental studies published in Nature and Science between 2010 and 2015, finding a significant effect in the same direction as the original study for 13 (62%) studies, with about half the effect size, on average. The Reproducibility Project: Cancer Biology errington2021reproducibility attempted to reproduce 193 experiments from 53 papers, but succeeded in only 50 experiments from 23 papers. They found that only 2% of papers supplied open data, 0% of protocols were completely described, 67% of experiments required modifications to complete, and replication effect sizes were 85% smaller than in the original findings. A survey by baker20161 found that 52% of respondents agreed that reproducibility represents a “crisis” for science.

The terms reproducibility and replicability are used inconsistently across and even within fields barba2018terminologies. We use reproducibility to mean “same data, different code, different team” barba2018terminologies. We use replicability to mean acquiring new data in a new experiment to determine whether the same conclusions hold. Our methodology amounts to first reproducing conclusions found in peer-reviewed papers, and then replicating these same conclusions on DP synthetic data. While our focus is neither to reproduce nor to replicate the original study, our framework supports reproducibility analysis as an intermediate step, and our insights regarding the difficulty (and, often, the impossibility) of reproduction are consistent with prior findings.

3 Evaluating Epistemic Parity

We now present the methodology for measuring the epistemic parity of DP synthetic data in scientific research: the degree to which conclusions hold when re-evaluated under DP. Our methodology is designed to be applied one peer-reviewed paper at a time, collating results from multiple papers into a diverse, representative, and reusable benchmark for use by DP researchers. We assume public access to the data on which the paper’s results were computed; we are not intending to protect the privacy of subjects involved in the study. This methodology is intended to be applied by a reader of the paper, as opposed to the authors, though expertise in the domain is of course an advantage.

Given a paper, we start by identifying natural language claims made by the authors throughout the paper, focusing primarily on the abstract and conclusion. Although expertise in the domain is an advantage in this exercise, we contend that the identification of claims should be possible for non-experts, as the goal of the paper is to communicate findings to a broad audience. For each claim, we identify quantitative (or qualitative) evidence that supports the claim, recording the variables involved and methods used. While in principle this step should always be possible, it can in practice be difficult or impossible baker20161; national2019reproducibility, and may involve guesswork when the computational details are incomplete. We then re-implement the analysis to (attempt to) reproduce the salient findings and conclusions in the paper over the original, public dataset, as discussed in Section 3.1. If the reproduction is successful, we use DP methods to generate synthetic datasets, using different methods and different random seeds, with the same number of records as the original data, and attempt to once again reproduce the findings over these generated datasets. Finally, we contrast the findings based on the original and DP data.

Our main metric is the proportion of synthesized datasets, for each method, for which the finding holds, which we refer to as epistemic parity. Our methodology is implemented in an open-source framework; see the supplementary materials or our repository at https://github.com/DataResponsibly/SynRD.

3.1 Reproducing Experimental Studies

Figure 2: Publication and Finding classes are extended for each paper, privatized and aggregated.

We adapt three concepts of reproducibility—values, findings, and conclusions—from Cohen et al cohen2018three into a practical taxonomy for reproducing a statistical analysis in a peer-reviewed publication, and implement a class structure that allows us to conduct concrete experiments around this taxonomy (summarized in Figure 2).

The atomic element in reproducibility is a finding, defined by Cohen as “a relationship between the values for some reported figure of merit with respect to two or more dependent variables.” For the purposes of our study, a finding consists of a natural language statement (i.e., a claim) reported in a publication, along with evidence provided by one or more quantitative or qualitative sub-statements about the analysis.

Evidence for a finding consists of a comparison between two or more values that can be evaluated as a Boolean condition. A value may be a scalar measurement, an aggregated or computed result (e.g., an odds ratio), or even an implicit threshold expressed in natural language (e.g., “a low rate” or “a strong correlation”). In the latter cases, we instantiate the language as a quantitative threshold, applying conventions from the literature when they exist. For example, a common convention is that Pearson’s correlation is considered “strong” when its value is larger than 0.7.
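The sketch below is a loose illustration of this taxonomy (it is not the exact interface of our released framework): a finding pairs a natural-language claim with a Boolean evidence check, and a “soft” value such as “a strong correlation” is instantiated with the conventional Pearson threshold of 0.7.

```python
from dataclasses import dataclass
from typing import Callable
import pandas as pd

@dataclass
class Finding:
    claim: str                                # natural-language statement from the paper
    evidence: Callable[[pd.DataFrame], bool]  # Boolean condition over a (real or synthetic) dataset

    def holds(self, df: pd.DataFrame) -> bool:
        return self.evidence(df)

def strong_correlation(df: pd.DataFrame, x: str, y: str, threshold: float = 0.7) -> bool:
    """Instantiates 'x is strongly correlated with y' using the common convention
    that an absolute Pearson's r above 0.7 counts as 'strong'."""
    return abs(df[x].corr(df[y])) > threshold

# Illustrative usage (column names are placeholders, not a study's actual variables):
# f = Finding("Ability self-concept is strongly correlated with math achievement",
#             lambda df: strong_correlation(df, "ability_self_concept", "math_score"))
# f.holds(original_df); f.holds(synthetic_df)
```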

A special case of a finding is a qualitative visual finding that often appears in the form of a figure, table, or diagram. A figure encodes a multitude of potential findings; we do not (necessarily) consider each of these sub-findings on their own in our analysis, but rather treat them as a single visual finding: we attempt to reproduce the figure itself, and subjectively evaluate its similarity to the original. Consider Figure 3, where the top sub-figure shows a percentage breakdown of drug use across demographic groups from fairman2019marijuana, and the bottom sub-figure shows the same breakdown replicated by us on DP synthetic data generated by MST at a fixed ε.

Figure 3: A visual finding from Fairman et al fairman2019marijuana, describing a percentage breakdown of drug use across demographic groups (top), and our qualitative reproduction under DP, using MST at a fixed ε (bottom). Agreement is subjectively high, though imperfect.

Finally, following Cohen cohen2018three, a conclusion is defined as “a broad induction that is made based on the results of the reported research.” A conclusion must be explicitly stated in a paper, and comprises one or several findings. We will discuss how reproduced findings support reproduced conclusions in Section 6.

3.2 Generating DP Synthetic Data

Each of the papers that we reproduced using DP synthetic data derived findings from a subset of the full study’s data. For example, HSLS:09 consists of over 7000 columns, but Jeong et al used only a subset of 57 jeong2021. We synthesize the subset of data relevant for the reproduced findings and conclusions, as discussed in Section 3.1.

The DP methods for private data release are executed over a range of ε values representing a small-to-medium privacy regime bowen2020comparative. Each private data mechanism is run five times to produce, at each ε value, five sampled datasets with the same sample size but different random seeds. Each private data release method involves different hyperparameters and varying levels of tunability, but we use author-recommended settings to avoid biasing results towards our own expertise. We then re-compute the results for each sample.
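For concreteness, a minimal generation sketch is shown below. It assumes the open-source smartnoise-synth package (import name snsynth), whose Synthesizer.create interface exposes MST and PATECTGAN; PrivBayes is not bundled there and would need a separate implementation, and argument names may differ across library versions.

```python
import pandas as pd
from snsynth import Synthesizer  # smartnoise-synth; exact interface may vary by version

def dp_synthesize(df: pd.DataFrame, method: str, epsilon: float) -> pd.DataFrame:
    """Fit a DP synthesizer (e.g. 'mst' or 'patectgan') at its default settings and
    sample a synthetic dataset with the same number of records as the original."""
    synth = Synthesizer.create(method, epsilon=epsilon)
    synth.fit(df, preprocessor_eps=0.1 * epsilon)  # small share of the budget reserved for inferring column bounds
    return synth.sample(len(df))
```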

If all findings are reproduced regardless of ε or random seed, we say that the DP mechanism has achieved epistemic parity. We measure the degree of epistemic parity as the proportion of iterations for which a finding holds. The goal is to overlook small variations in exact values in favor of maintaining the relative relationships of the computed statistics, for interpretability and practical utility.
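A sketch of the full protocol for one synthesizer is shown below (an illustration of our own, not a verbatim excerpt of the released framework; the ε grid is a placeholder, and dp_synthesize stands in for whichever generation routine is used). Epistemic parity for a finding is then simply the fraction of the five runs at a given ε in which the finding holds.

```python
import pandas as pd

EPSILONS = [1.0, 5.0, 10.0]  # placeholder grid; the exact epsilon values are not reproduced here
N_SEEDS = 5

def epistemic_parity(real_df: pd.DataFrame,
                     findings: dict,   # finding name -> callable(df) -> bool
                     synthesize) -> pd.DataFrame:  # callable(df, epsilon, seed) -> synthetic DataFrame
    """For each (epsilon, finding) pair, the fraction of seeds for which the finding holds."""
    rows = []
    for eps in EPSILONS:
        samples = [synthesize(real_df, eps, seed) for seed in range(N_SEEDS)]
        for name, holds in findings.items():
            kept = sum(holds(s) for s in samples)
            rows.append({"epsilon": eps, "finding": name, "parity": kept / N_SEEDS})
    return pd.DataFrame(rows)
```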

4 Benchmark Construction

We used a standardized approach for study and paper selection. Each study, which has an associated dataset, was selected for broad impact (at least 100 papers). For each selected study, we limited our publication search to the past five years, to ensure a focus on modern methods of analysis. Within that five-year window, there might still be tens or hundreds of data-related publications. We sought papers that meet all of the following criteria: (1) Are publicly available (so that we could report on their results without violating any permissions); (2) Use a publicly available portion of the study dataset (so that we were not trying to replicate analyses conducted on private data using a public subsample); (3) Are published in peer-reviewed publications, with preference given to high-impact journals; (4) Are cited (for papers that are at least two years old); and (5) Are of a reasonable length (in page count). We describe selected studies and papers below; see Appendix A for details.

4.1 Selected Studies

We selected three prominent, federally funded studies.

HSLS:09: High School Longitudinal Study dalton2016high is a nationally representative, longitudinal study of U.S. 9th graders who were followed through their secondary and postsecondary years. We attempted to reproduce four papers that use HSLS:09, and were able to fully or partially reproduce three papers. As HSLS:09 has a single dataset representing the entire period of the study, we did not encounter versioning issues during reproduction.

AddHealth: National Longitudinal Study of Adolescent and Adult Health addhealth, consists of a nationally representative sample of U.S. adolescents in grades 7 through 12 during the 1994-1995 school year. We attempted to reproduce four papers that use Add Health, and were able to fully or partially reproduce two. The public use data was severely limited in scope (a random subsample of less than 50% of the original), and we were forced to only consider papers that relied solely on the public use data.

NSDUH: National Survey on Drug Use and Health 2004-2014 nsfduh, measures the prevalence and correlates of drug use in the U.S. We attempted to reproduce four papers that use NSDUH, but we were able to reproduce only one, partially and with substantial effort. The main obstacle was that the study is broken down by year across a fifteen-year time frame, and multiple versions of the study have been released for each year. Past years of the study are seemingly updated without a record of what the new version modifies. Thus, we had extreme difficulty determining which data version was used for each year when attempting replication.

4.2 Selected Papers

saw2018cross use HSLS:09 to examine cross-sectional and longitudinal disparities in STEM career aspirations among high school students. Methods include singular and trivariate disparity analysis of quantities of interest, and analysis of disparities among students deemed “persisters” (who persist in their STEM interest from 9th to 11th grade) and “emergers” (who emerge with STEM interest in 11th grade, having no interest in 9th grade).

This paper was reproducible, with effort. The authors provided an overview of data processing methodology, but failed to specify exact columns and clarify preprocessing procedures (e.g., creating “emerger” and “persister” student sets). However, we were able to reproduce each finding and agree with all conclusions.

lee2021ability use HSLS:09 to identify factors that affect the performance of students on the 11th grade math exam. The authors examine “low teacher support” as an adverse factor, and self perceptions of math “ability” and “parental support” as protective factors. Factors are constructed by aggregating across relevant survey responses via a weighted average. Pearson correlation is computed across these variables and demographic information. Linear regression models are trained to predict math performance, with different interactions between variables.

This paper was partially reproducible, with substantial effort. The authors did a reasonable job of detailing their aggregation techniques, and include a helpful table for survey questions in their appendix. They did not detail their regression techniques, although we tried the simplest weighted linear regression and it aligned well. We were unable to reproduce a finding that involves a covariance slope analysis figure from a complex R package.
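As an illustration, the fallback regression is essentially a weighted least-squares fit; the sketch below uses statsmodels, with placeholder column names (math_score, ability, parental_support, low_teacher_support, weight) rather than the study’s actual variable names.

```python
import pandas as pd
import statsmodels.api as sm

def fit_weighted_regression(df: pd.DataFrame):
    """Weighted linear regression of 11th-grade math performance on the three
    constructed factors, using survey weights as observation weights."""
    X = sm.add_constant(df[["ability", "parental_support", "low_teacher_support"]])
    model = sm.WLS(df["math_score"], X, weights=df["weight"]).fit()
    return model.params  # coefficients to compare against those reported in the paper
```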

jeong2021 use HSLS:09 to interrogate potential racial bias in classifiers that predict student performance on a standardized 9th grade math exam. The authors assign each student to one of two racial groups — White/Asian (privileged) or Black/Hispanic/Native American (disadvantaged). They trained random forest, SVC, and logistic regression models to predict whether a student would receive a top-50% or a bottom-50% test score, and measured accuracy, FPR, FNR, and predicted base rate.

We found this paper to be reproducible, with effort. The authors did not specify how the data was preprocessed (e.g., how missing values were imputed) or how it was split into training and test. We were ultimately able to reproduce the results sufficiently to agree with the conclusions, but were unable to reproduce the values in the findings exactly.
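A sketch of the per-group error-rate computation involved in this reproduction is shown below (group encodings and the 0/1 label convention are illustrative, not the exact variables of jeong2021).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def group_error_rates(y_true: np.ndarray, y_pred: np.ndarray, groups: np.ndarray) -> dict:
    """False positive and false negative rates per demographic group,
    for a binary top-50% / bottom-50% prediction task."""
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
        rates[g] = {"FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}
    return rates
```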

fruiht2018naturally use Add Health to investigate the role of naturally occurring mentors in the educational outcomes of first-generation college students. The authors fit a statistical mediation model by preacher2008asymptotic (PROCESS Model, variation 1) to test direct and interaction effects of parental educational attainment and mentorship on students’ educational attainment.

This paper was partially reproducible, with substantial effort. Beyond the Add Health study, the authors reported on findings made through manual qualitative coding of free-text responses regarding the nature of support provided by the mentor, but did not make the coding scheme publicly available, and so we were unable to reproduce findings reliant on the scheme. For other findings, lack of detail about pre-processing, such as how individuals with multiple races were categorized into racial groups, prevented a perfect replication of many of the values reported in the paper, although we were able to reproduce the observed trends and agree with the authors’ conclusions.

iverson2021high analyze the effect of having played high school football on depressive and suicidal tendencies in men later in life using the first wave (1994-95) and the most recent wave (2016-2018) of Add Health. The authors conduct a bivariate analysis across two groups of men (those who did or did not play football in high school), and report on simple percent comparisons, statistical significance and odds ratios.

This paper was easy to reproduce. The authors provided precise guidance on which questions (columns) they used from the data and, because their techniques were fundamentally simple, we were able to exactly match nearly all values reported in the paper, reproducing all findings.
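Because the analysis reduces to percent comparisons and odds ratios over 2x2 tables, it translates almost directly into code; the sketch below is illustrative, with placeholder column names (played_football, depression) coded as 0/1.

```python
import pandas as pd
from scipy.stats import fisher_exact

def bivariate_comparison(df: pd.DataFrame, exposure: str = "played_football",
                         outcome: str = "depression"):
    """Percent with the outcome in each exposure group, plus an odds ratio and
    Fisher exact p-value from the 2x2 contingency table."""
    table = pd.crosstab(df[exposure], df[outcome])     # rows: exposure 0/1, columns: outcome 0/1
    pct = df.groupby(exposure)[outcome].mean() * 100   # percent with the outcome per group
    odds_ratio, p_value = fisher_exact(table.values)
    return pct, odds_ratio, p_value
```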

fairman2019marijuana use NSDUH to investigate predictors and potential consequences of initiating the use of marijuana before other types of substance (e.g., cigarettes and alcohol) for U.S. youth. The primary methods of analysis were counts and percentage comparisons by group, and computing adjusted relative risk ratio (aRRR) and adjusted odds ratio (aOR).

This paper was partially reproducible, with substantial effort. The main obstacle to reproducibility is the inappropriate versioning of the data by NSDUH, as discussed in Section 4.1. Additionally, the paper does not adequately describe how aRRR and aOR were calculated, making them difficult to replicate. Ultimately, while we were unable to reproduce exact reported values, we did reproduce general trends and agree with most conclusions drawn in the paper.

5 Results

Our benchmark consists of six papers, each evaluated on three methods at four values of ε, for a total of 12 mechanism configurations per paper, each repeated with 5 random seeds.

Quantitative findings.

The results appear in Figure 1. We show a grid of three methods (at a fixed ε) for up to 16 individual findings per paper, for a total of 72 distinct quantitative findings. The color indicates the number of random seeds for which a finding held, from 0 (red) to 5 (blue).

Both MST and PrivBayes had papers for which they were clearly the strongest performing:  fairman2019marijuana for MST; lee2021ability and saw2018cross for PrivBayes. PATECTGAN was not the strongest performing for any paper, which supports the literature describing current limitations of GAN-based approaches for DP synthesis tao2021benchmarking. PrivBayes was the slowest method to run, and its inability to run quickly on high-dimensional data (a well-understood limitation of the method) prevented us from obtaining results for jeong2021, with 57 variables, many of them non-binary.

Findings that only one synthesizer was able to reproduce consistently provide insight into the comparative strengths of the methods. For example, Finding #7 of lee2021ability (text: “As expected, perceived low teacher support was linked to lower achievement when adolescents were low on both protective factors, namely low ability self-concepts and low parental support (B = -2.23, p = .003)”) was fully reproduced by PrivBayes for each iteration, and never reproduced by the other synthesizers. The only finding in fairman2019marijuana that MST failed to reproduce, Finding #6 (text: “From 2004 to 2014, the proportion who had initiated marijuana before other substances increased from 4.4% to 8.0%”), was reproduced by PrivBayes in 4 out of 5 iterations. MST focuses on preserving broad summary statistics associated with low-order marginals, while PrivBayes preserves conditional probability chains, which may explain some of these distinctions.

No synthesizer succeeded across all papers, and, remarkably, some findings were never reproduced by any of the three synthesizers. For example, Findings #3 and #4 in  jeong2021, which centered on aggregated comparisons between demographic groups on ML-model predictive performance, were not reproduced successfully by any synthesizer. (Recall that PrivBayes was unable to run on this paper due to high dimensionality of the data.)

That some findings are easier to reproduce than others is unsurprising. Though each synthesizer relies on a fundamentally different approach to replicating the joint distribution across all of the data, they each struggle with high-dimensional data. For general purpose synthetic data, MST and PrivBayes prioritize lower-dimensional relationships among variables, and thus it is unsurprising that simple mean-comparison findings, and even some bivariate correlations, are easily preserved by these methods. However, findings like those in jeong2021, where predictive performance over higher-dimensional data is reported, are difficult for the DP synthesizers to replicate. Overall, however, we were surprised at the high number of findings across all our papers (even those that we were unable to replicate) relying only on 1- or 2-dimensional comparisons: the low dimensionality suggests that earlier empirical studies (including tao2021benchmarking and hay2016principled) may be suitable as proxy tasks. Targeted improvements to the synthesizers may allow us to simultaneously support high utility for individual findings and their composition into broad conclusions.

Synthesizer performance across ε values.

Figure 4 compares synthesizer performance across reasonable ε values: the y-axis shows aggregated epistemic parity as the percentage of reproduced findings, over all five iterations of each synthesizer. We see that synthesizer performance improves somewhat for higher ε values, as expected. Furthermore, at higher ε values, we see near-perfect epistemic parity for both MST and PrivBayes for two of the papers. We also observe that PATECTGAN shows weaker performance for all but one paper, where no method performs well. The trends are consistent with the observations in Figure 1, and support the ε value used there as a reasonable privacy budget.

Figure 4: Epistemic parity achieved by MST, PrivBayes, and PATECTGAN for each paper in our benchmark, as a function of the privacy parameter ε. Epistemic parity, on the y-axis, ranges between 0 and 1, and represents the fraction of reproduced findings over 5 executions.

Visual findings.

Figure 3 highlights a visual finding from fairman2019marijuana, discussed in Section 3.1. The results are subjectively similar, though a number of relationships may change at the individual level. Just as authors include visual findings in their papers, we argue that a DP evaluation should include qualitative results as part of an argument for epistemic parity.

6 Conclusions and Discussion

In this paper, we proposed epistemic parity as a methodology for measuring the utility of DP synthetic data in support of scientific research. We assembled a benchmark of six peer-reviewed papers that analyze one of three studies in the ICPSR social science repository. We then experimentally evaluated the epistemic parity achieved by three state-of-the-art DP synthesizers over the papers in our benchmark. Overall, we found epistemic parity to be a compelling method for evaluating DP synthesizers. Further, we found that, of the three DP synthesizers we evaluated, two performed best for some of the papers, but no single synthesizer outperformed all others on all papers. Finally, some findings were never reproduced by any of the synthesizers.

Despite facing well-known reproducibility challenges during benchmark construction, we are confident that our results will lead to generalizable insights. We are continuing to expand the benchmark with additional studies and papers. We are also incorporating additional synthesizers, to develop stronger insight into the performance trade-offs between classes of DP synthesizer methods in this setting.

Rebalancing Utility and Privacy.

The reality is that DP has yet to see broad use across domains, both in the sciences and in support of decision-making. On the other hand, techniques that offer no formal guarantees of privacy, including anonymization and aggregation, are widely used and even encoded in law in many jurisdictions. Moreover, many settings have conditions that can be met to allow the release of data to authorized partners. In these cases, the party is trusted, and noise-added data plays no role.

A conservative way to look at DP is as a “first pass” debugging tool: strong privacy guarantees, unreliable for scientific study, yet useful for developing software and methodology. Once the process is developed, one can apply for direct access to the original data with a clearer intent and more detailed project plan — the software for experiments is already written. In our work, we aim to quantify the utility of DP synthetic data beyond the limited use case of debugging, to directly support scientific experimentation.

We also advocate for a counterbalance to DP: Where DP provides strong privacy guarantees and best-effort utility, we recommend the study of strong epistemic parity guarantees with best-effort privacy. After all, this approach is broadly used in practice currently, but in an unprincipled way. What would a formal epistemic parity guarantee look like?


Appendix A Benchmark Construction Details

We selected studies (datasets) based on their impact as measured by the number of associated publications (minimum of 100), the accessibility of the topic, and their relevance to privacy research via guidance from ICPSR leadership. For each selected study, we sought papers that (1) Are publicly available (so that we could report on their results without violating any permissions); (2) Use a publicly available portion of the study dataset (so that we were not trying to replicate analyses conducted on private data using a public subsample); (3) Are published in peer-reviewed publications, with preference given to high-impact journals; (4) Are cited (for papers that are at least two years old); and (5) Are of a reasonable length (in page count). We describe the selected studies in Section 4.1 and the selected papers in Section 4.2.

a.1 Selected Studies

HSLS:09

The High School Longitudinal Study dalton2016high is a nationally representative, longitudinal study of U.S. 9th graders followed through their secondary and postsecondary years, with an emphasis on understanding students’ trajectories from the beginning of high school into postsecondary education, the workforce, and beyond. We attempted to reproduce four papers that use HSLS:09, and were able to fully or partially reproduce three. As HSLS:09 has a single dataset representing the entire period of the study, we did not encounter versioning issues during reproduction.

AddHealth

The National Longitudinal Study of Adolescent and Adult Health addhealth is a longitudinal study of a nationally representative sample of U.S. adolescents in grades 7 through 12 during the 1994-1995 school year. Add Health combines longitudinal survey data on respondents’ social, economic, psychological, and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships. We attempted to reproduce four papers that use Add Health, and were able to fully or partially reproduce two. AddHealth did not suffer from data versioning issues, although the public use data was severely limited in scope (a random subsample of less than 50% of the original study participants), and so we were forced to only consider papers that relied solely on the public use data.

NSDUH

The National Survey on Drug Use and Health 2004-2014 nsfduh measures the prevalence and correlates of drug use in the U.S., providing quarterly and annual estimates. Information is provided on the use of illicit drugs, alcohol, and tobacco among members of U.S. households aged 12 and older. Questions include usage of a variety of illegal drug categories, and non-medical use of prescription drugs. We attempted to reproduce four papers that use NSDUH, but we were able to reproduce only one paper, partially and with substantial effort. The main obstacle to reproducibility was that the study is broken down by year across a time frame of more than fifteen years, and that multiple versions of the study have been released for each year. Furthermore, past versions of the study are seemingly updated without a record of what the new versions modify. Thus, we had extreme difficulty identifying the exact combination of data versions (which version for each year, sometimes across 10 or more years) that the authors conducted their research on.

a.2 Selected Papers

saw2018cross

use HSLS:09 to examine cross-sectional and longitudinal disparities in Science, Technology, Engineering and Math (STEM) career aspirations among high school students. They specifically focus on intersectional interactions between gender, race/ethnicity and socioeconomic status, and assess the progression of “aspiration” (i.e., stated interest in pursuing a STEM-related career) from 9th to 11th grade of high school.

The main methods include singular and trivariate (across race, socioeconomic status, and gender) disparity analysis of quantities of interest. The authors additionally focus on disparities among students deemed “persisters” (who persist in their STEM interest from 9th to 11th grade) and “emergers” (who emerge with STEM interest in 11th grade, having had no interest in 9th grade). They also trained a basic linear regression model to assess how predictive demographic traits are of interest, but we do not report directly on these results; see the table of coefficients in the supplementary materials.

The authors conclude that girls from all ethnic/racial and socioeconomic backgrounds, and lower income Black or Hispanic boys, had substantially lower rates of interest, persistence and development of STEM aspirations. They support their conclusion with a set of findings that mainly catalogue examples of statistically significant gaps in their trivariate demographic analysis of disparities.

This paper was reproducible, with effort. The authors provided an overview of data processing methodology, but failed to specify exact columns of interest and clarify steps of some preprocessing procedures (e.g., creating “emerger” and “persister” sets of students). However, with effort, we were able to reproduce each of their findings and agree with all of their conclusions.

The paper was published in 2018 in Educational Researcher, a top education research journal. It has been cited 69 times as of August 2022, according to Google Scholar.

lee2021ability

catalogue factors from HSLS:09 that affect the performance of students on an 11th grade math exam. The authors examine “low teacher support” as an adverse factor for math performance, and self perceptions of math “ability” and “parental support” as protective factors. They control for demographic variables and for historic math performance (i.e., performance on a similar exam in 9th grade) to isolate the effects.

In order to understand the protective and adverse factors concretely, they construct each of them by aggregating (via a simple weighted average) across relevant survey responses from the study (between 4 and 10 for each factor). This produces unidimensional “low teacher support,” “ability self-concept,” and “parental support” variables. The authors compute Pearson correlation across the newly constructed variables and demographic information to get a coarse correlation between features. They then train five weighted linear regression models predicting math performance with different interactions between the three constructed variables to tease out further relationships.

The authors replicate a known finding that low teacher support leads to poor math performance. With this in hand, they demonstrate that “ability self-concept” and “parental support” can sometimes protect against low teacher support, depending on the level of each variable and other demographic factors. Motivation, they conclude, is highly tied to these three important influences in an adolescent’s life, and high “ability self-concept” and “parental support” can make an adolescent significantly more resilient to poor teaching.

This paper was partially reproducible, with substantial effort. The authors did a reasonable job of detailing their aggregation techniques, but they failed to list exact variable names (although they did include a helpful table in their appendix). They also did not detail their regression technique, although it turned out to align well with the simplest weighted linear regression that we tried. One of the findings they report, and which we were unable to reproduce, involves a slope analysis figure that computes covariances using a complex R package.

The paper was published in 2021 in the Journal of Adolescence, a reputable adolescent psychology research journal, with 2 citations as of August 2022, per Google Scholar.

jeong2021

use HSLS:09 to interrogate potential racial bias in classifiers that predict student performance on a standardized 9th grade math exam. The authors point out that such predictions can inform a student’s placement in an advanced math course. Thus, having disparate error rates—systematically under-estimating the potential of students from disadvantaged racial groups, while at the same time giving the benefit of the doubt to students from privileged groups—may exacerbate disparities in access to educational opportunities.

To carry out their analysis, the authors assign each student to one of two racial groups — White/Asian (privileged) or Black/Hispanic/Native American (disadvantaged). They trained random forest, SVC and logistic regression models to predict whether a student would receive a top-50% or a bottom-50% score on the test, and measured accuracy, false positive rate (FPR), false negative rate (FNR), and predicted base rate (PBR), both overall and for each racial group.

The main findings were that FPR was almost twice as high for the privileged students, while the FNR was twice as high for the disadvantaged students. Interpreting these findings, the authors concluded that privileged students were given the benefit of the doubt, while the disadvantaged were systematically under-estimated by the classifiers.

We found this paper to be reproducible, with effort. The authors did not specify how the data was preprocessed (e.g., how missing values were imputed) or how it was split into training and test. We were ultimately able to reproduce the results sufficiently to agree with the conclusions, but were unable to reproduce the values in the findings exactly.

This paper was published in 2021 in the NeurIPS Workshop on Math AI for Education (MATHAI4ED). It has not been cited as of August 2022, according to Google Scholar.

fruiht2018naturally

use Add Health to investigate the role that naturally occurring mentors play in the educational outcomes of first-generation college students.

The authors fit a statistical mediation model introduced by preacher2008asymptotic (PROCESS Model, variation 1) to test both direct and interaction effects of parental educational attainment and mentorship on the educational attainment of the student. They conclude that having at least one parent who graduated from college, or having a mentor in adolescence or emerging adulthood, was positively associated with higher educational attainment. The authors also found that African Americans experienced significantly lower educational attainment than other participants.

This paper was partially reproducible, with substantial effort. In addition to using the Add Health study, the authors also report on findings made through manual qualitative coding of free-text responses regarding the nature of support provided by the mentor, but do not make the coding scheme publicly available. We were thus unable to reproduce the findings that rely on this coding scheme. For other findings, lack of detail about pre-processing, such as how individuals with multiple races were categorized into racial groups, prevented a perfect replication of many of the values reported in the paper, although we were able to reproduce the observed trends and agree with the authors’ conclusions.

The paper was published in 2018 in the American Journal of Community Psychology, a reputable quarterly journal that covers community health, community processes, and social reform. It has been cited 42 times as of August 2022 according to Google Scholar.

iverson2021high

respond to what they deem “growing public concern” around the long term effects of playing football in high school on health later in life. The authors conduct an analysis across male participants from the first wave (1994-95) to the most recent (2016-2018) wave of Add Health. They analyze the effect of having played high school football on depressive and suicidal tendencies later in life, controlling for demographic and risk factors.

The authors are mainly concerned with proportions of history of depression and recent suicidal ideation, and conduct a bivariate analysis across two groups of men (those who did or did not play football in high school). Their primary means of analysis are simple percent comparisons with high statistical significance and meaningful odds ratios.

Through these comparisons, the authors did not find a direct effect of playing football in high school on depressive tendencies later in life. However, they did find that those who already had depressive tendencies in adolescence were more likely to maintain those tendencies as they got older.

This paper was easy to reproduce. The authors provided precise guidance on which questions (columns) they used from the data and, because their techniques were fundamentally simple, we were able to exactly match nearly all values reported in the paper, reproducing all findings.

The paper was published in 2022 in Frontiers in Neurology, a top journal in the field. It has not been cited as of August 2022, according to Google Scholar.

fairman2019marijuana

use NSDUH to investigate predictors and potential consequences of initiating the use of marijuana before other types of substances (e.g., cigarettes and alcohol) for U.S. youth. The authors found that using marijuana first was predictive of future heavy use of this and other substances. They also analyzed associations between using marijuana first and demographic group membership and found that those using marijuana first were more likely to be male (vs. female), older (vs. younger), and Black, American Indian/Alaskan Native, multiracial, or Hispanic (vs. White or Asian).

The primary methods of analysis were counts and percentage comparisons by group, and computing adjusted relative risk ratio (aRRR) and adjusted odds ratio (aOR).

This paper was partially reproducible, with substantial effort. The main obstacle to reproducibility is the inappropriate versioning of the data by NSDUH, as discussed in Section 4.1. We spent substantial effort running the analysis on numerous published versions, but were unable to pinpoint which specific version of NSDUH was used in this paper. Additionally, the paper does not list how aRRR and aOR were calculated. Ultimately, while we were unable to reproduce exact reported values, we did reproduce general trends and agree with most conclusions drawn in the paper.

The paper was published in 2019 in Prevention Science, the official journal of the Society for Prevention Research, and has 39 citations as of August 2022, per Google Scholar.