A Framework for Auditable Synthetic Data Generation

11/21/2022
by Florimond Houssiau, et al.
Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller control over which statistical patterns are captured, leading to concerns over privacy protection. While synthetic records are not linked to a particular real-world individual, they can indirectly reveal information about users, which may be unacceptable for data owners. There is thus a need to empirically verify the privacy of synthetic data, a particularly challenging task for high-dimensional data. In this paper, we present a general framework for synthetic data generation that gives data controllers full control over which statistical properties the synthetic data ought to preserve, what exact information loss is acceptable, and how to quantify it. The benefits of the approach are that (1) one can generate synthetic data that yields high utility for a given task, while (2) empirically validating that only statistics considered safe by the data curator are used to generate the data. We thus show the potential for synthetic data to be an effective means of releasing confidential data safely, while retaining useful information for analysts.

research
03/01/2023

What Is Synthetic Data? The Good, The Bad, and The Ugly

Sharing data can often enable compelling applications and analytics. How...
research
02/20/2020

Cluster Aware Mobility Encounter Dataset Enlargement

The recent emerging fields in data processing and manipulation has facil...
research
03/15/2018

Strategies to facilitate access to detailed geocoding information using synthetic data

In this paper we investigate if generating synthetic data can be a viabl...
research
10/28/2021

Generating synthetic transactional profiles

Financial institutions use clients' payment transactions in numerous ban...
research
04/04/2023

30 Years of Synthetic Data

The idea to generate synthetic data as a tool for broadening access to s...
research
12/04/2018

Learning Vine Copula Models For Synthetic Data Generation

A vine copula model is a flexible high-dimensional dependence model whic...
