Approximate, Adapt, Anonymize (3A): a Framework for Privacy Preserving Training Data Release for Machine Learning

07/04/2023
by Tamas Madl, et al.

The availability of large amounts of informative data is crucial for successful machine learning. However, in domains with sensitive information, releasing high-utility data that protects the privacy of individuals has proven challenging. Despite progress in differential privacy and generative modeling for privacy-preserving data release, few approaches optimize for machine learning utility: most take into account only statistical metrics on the data itself and fail to explicitly preserve the loss metrics of machine learning models that are subsequently trained on the generated data. In this paper, we introduce a data release framework, 3A (Approximate, Adapt, Anonymize), to maximize data utility for machine learning while preserving differential privacy. We also describe a specific implementation of this framework that leverages mixture models to approximate, kernel-inducing points to adapt, and Gaussian differential privacy to anonymize a dataset, ensuring that the resulting data is both privacy-preserving and of high utility. We present experimental evidence showing minimal discrepancy between the performance metrics of models trained on real versus privatized datasets, when evaluated on held-out real data. We also compare our results with several privacy-preserving synthetic data generation models (such as differentially private generative adversarial networks) and report significant increases in classification performance metrics over state-of-the-art models. These favorable comparisons show that the presented framework is a promising direction of research, increasing the utility of low-risk synthetic data release for machine learning.
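As a concrete illustration of the "anonymize" step, the sketch below shows the standard Gaussian mechanism that underlies Gaussian differential privacy (GDP): calibrated Gaussian noise is added to a statistic of bounded sensitivity. This is a minimal sketch, not the paper's implementation; the function name and its parameters (`sensitivity`, `sigma_multiplier`) are illustrative assumptions.

```python
import numpy as np

def gaussian_mechanism(stat, sensitivity, sigma_multiplier, rng=None):
    """Release stat + N(0, (sensitivity * sigma_multiplier)^2) noise.

    Under Gaussian differential privacy, adding noise with standard
    deviation sigma = sensitivity * sigma_multiplier to a statistic of
    L2-sensitivity `sensitivity` satisfies mu-GDP with
    mu = 1 / sigma_multiplier: larger sigma_multiplier means stronger
    privacy and noisier output.
    """
    rng = np.random.default_rng() if rng is None else rng
    stat = np.asarray(stat, dtype=float)
    sigma = sensitivity * sigma_multiplier
    return stat + rng.normal(0.0, sigma, size=stat.shape)
```

For example, releasing a clipped mean of n records (each clipped to L2 norm at most C) has sensitivity C / n, so the noise scale shrinks as the dataset grows while the privacy guarantee stays fixed.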

