In Defense of Synthetic Data

05/03/2019
by Luke Rodriguez, et al.

Synthetic datasets have long been thought of as second-rate, to be used only when "real" data collected directly from the real world is unavailable. But this perspective assumes that raw data is clean, unbiased, and trustworthy, which it rarely is. Moreover, the benefits of synthetic data for privacy and for bias correction are becoming increasingly important in any domain that works with people. Curated synthetic datasets (synthetic data derived from minimal perturbations of real data) enable early-stage product development and collaboration, protect privacy, afford reproducibility, increase dataset diversity in research, and protect disadvantaged groups from problematic inferences drawn from original data that reflects systematic discrimination. In this paper we argue that properly generated synthetic data, rather than representing a departure from the true state of the world, is a step towards responsible and equitable research and development of machine learning systems.


