MC-GEN:Multi-level Clustering for Private Synthetic Data Generation

05/28/2022
by   Mingchen Li, et al.
0

Nowadays, machine learning is one of the most common technology to turn raw data into useful information in scientific and industrial processes. The performance of the machine learning model often depends on the size of dataset. Companies and research institutes usually share or exchange their data to avoid data scarcity. However, sharing original datasets that contain private information can cause privacy leakage. Utilizing synthetic datasets which have similar characteristics as a substitute is one of the solutions to avoid the privacy issue. Differential privacy provides a strong privacy guarantee to protect the individual data records which contain sensitive information. We propose MC-GEN, a privacy-preserving synthetic data generation method under differential privacy guarantee for multiple classification tasks. MC-GEN builds differentially private generative models on the multi-level clustered data to generate synthetic datasets. Our method also reduced the noise introduced from differential privacy to improve the utility. In experimental evaluation, we evaluated the parameter effect of MC-GEN and compared MC-GEN with three existing methods. Our results showed that MC-GEN can achieve significant effectiveness under certain privacy guarantees on multiple classification tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/30/2021

Private sampling: a noiseless approach for generating differentially private synthetic data

In a world where artificial intelligence and data science become omnipre...
research
12/10/2019

Privacy-preserving data sharing via probabilistic modelling

Differential privacy allows quantifying privacy loss from computations o...
research
04/20/2022

Private measures, random walks, and synthetic data

Differential privacy is a mathematical concept that provides an informat...
research
08/26/2017

Plausible Deniability for Privacy-Preserving Data Synthesis

Releasing full data records is one of the most challenging problems in d...
research
08/23/2018

Privacy-Preserving Synthetic Datasets Over Weakly Constrained Domains

Techniques to deliver privacy-preserving synthetic datasets take a sensi...
research
05/24/2023

Private and Collaborative Kaplan-Meier Estimators

Kaplan-Meier estimators capture the survival behavior of a cohort. They ...
research
06/05/2020

Generation of Differentially Private Heterogeneous Electronic Health Records

Electronic Health Records (EHRs) are commonly used by the machine learni...

Please sign up or login with your details

Forgot password? Click here to reset