Robin Hood and Matthew Effects – Differential Privacy Has Disparate Impact on Synthetic Data

09/23/2021
by Georgi Ganev, et al.

Generative models trained with Differential Privacy (DP) are increasingly used to produce and share synthetic data in a privacy-friendly manner. In this paper, we analyze the impact of DP on these models with respect to underrepresented classes and subgroups of data. We do so from two angles: 1) the size of classes and subgroups in the synthetic data, and 2) classification accuracy on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our experiments, conducted using three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN), show that DP can shift the size distributions in the generated synthetic data in opposite directions. More precisely, it affects the gap between the majority and minority classes and subgroups, either reducing it (a "Robin Hood" effect) or increasing it (a "Matthew" effect). However, both of these size shifts lead to similar disparate impacts on a classifier's accuracy, disproportionately affecting the underrepresented subparts of the data. As a result, we call for caution when analyzing or training a model on synthetic data; otherwise, one risks treating different subpopulations unevenly, which might also lead to unreliable conclusions.
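The size angle described above can be illustrated with a small sketch. It is not the paper's evaluation code: the function names (`class_shares`, `majority_minority_gap`, `size_effect`), the tolerance `tol`, and the toy label lists are all hypothetical and chosen only for illustration. The sketch assumes the majority and minority classes are fixed from the real data and compares the gap between their shares before and after synthesis.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of records that fall in each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def majority_minority_gap(labels, majority, minority):
    """Signed gap: share of the majority class minus share of the minority class.

    Classes missing from `labels` are treated as having a share of 0.
    """
    shares = class_shares(labels)
    return shares.get(majority, 0.0) - shares.get(minority, 0.0)


def size_effect(real_labels, synth_labels, tol=0.01):
    """Label the size shift between real and synthetic data.

    The majority and minority classes are taken from the real data
    (an assumption of this sketch). `tol` is an arbitrary threshold
    below which the gap is considered unchanged.
    """
    ranked = Counter(real_labels).most_common()
    majority, minority = ranked[0][0], ranked[-1][0]
    real_gap = majority_minority_gap(real_labels, majority, minority)
    synth_gap = majority_minority_gap(synth_labels, majority, minority)
    if synth_gap < real_gap - tol:
        return "robin_hood"  # gap between majority and minority shrinks
    if synth_gap > real_gap + tol:
        return "matthew"     # gap between majority and minority grows
    return "no_shift"


# Toy labels, invented for illustration only.
real = ["a"] * 90 + ["b"] * 10
evened_out = ["a"] * 70 + ["b"] * 30
widened = ["a"] * 98 + ["b"] * 2

print(size_effect(real, evened_out))
print(size_effect(real, widened))
```

The same comparison could be applied to subgroups by passing a subgroup attribute instead of the class label. Measuring the second angle, per-class accuracy, would additionally need a classifier trained on the synthetic data and evaluated on real records.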


research
05/28/2019

Differential Privacy Has Disparate Impact on Model Accuracy

Differential privacy (DP) is a popular mechanism for training machine le...
research
11/26/2021

DP-SGD vs PATE: Which Has Less Disparate Impact on GANs?

Generative Adversarial Networks (GANs) are among the most popular approa...
research
06/03/2020

One Step to Efficient Synthetic Data

We propose a general method of producing synthetic data, which is widely...
research
05/11/2023

Energy cost and machine learning accuracy impact of k-anonymisation and synthetic data techniques

To address increasing societal concerns regarding privacy and climate, t...
research
07/07/2023

Programmable Synthetic Tabular Data Generation

Large amounts of tabular data remain underutilized due to privacy, data ...
research
08/24/2023

The Impact of De-Identification on Single-Year-of-Age Counts in the U.S. Census

In 2020, the U.S. Census Bureau transitioned from data swapping to diffe...
research
05/18/2023

Understanding how Differentially Private Generative Models Spend their Privacy Budget

Generative models trained with Differential Privacy (DP) are increasingl...
