Achilles' Heels: Vulnerable Record Identification in Synthetic Data Publishing

06/17/2023
by Matthieu Meeus, et al.

Synthetic data is seen as the most promising solution for sharing individual-level data while preserving privacy. Shadow modeling-based membership inference attacks (MIAs) have become the standard approach for evaluating the privacy risk of synthetic data. While very effective, they require creating a large number of datasets and training many models to evaluate the risk posed by a single record. The privacy risk of a dataset is therefore currently evaluated by running MIAs on a handful of records selected using ad hoc methods. We propose what is, to the best of our knowledge, the first principled vulnerable record identification technique for synthetic data publishing, leveraging the distance to a record's closest neighbors. We show that our method strongly outperforms previous ad hoc methods across datasets and generators. We also present evidence that our method is robust to the choice of MIA and to the specific choice of parameters. Finally, we show that it accurately identifies vulnerable records when synthetic data generators are made differentially private. The choice of vulnerable records is as important as more accurate MIAs when evaluating the privacy of synthetic data releases, including from a legal perspective. We propose a simple yet highly effective method for making this choice. We hope our method will enable practitioners to better estimate the risk posed by synthetic data publishing and researchers to fairly compare ever-improving MIAs on synthetic data.
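To make the idea of "distance to a record's closest neighbors" concrete, here is a minimal sketch of how such a ranking could be computed. It is an illustration only, not the authors' implementation: the abstract does not specify the distance metric, the number of neighbors, or how features are encoded, so the Euclidean metric, the default `k=5`, the purely numeric records, and the function names `mean_knn_distance` and `most_isolated_records` are all assumptions made for this example.

```python
import math


def mean_knn_distance(records, k=5):
    """Return, for each record, the mean distance to its k closest neighbors.

    Assumptions (not from the paper): records are equal-length numeric
    tuples and distance is Euclidean.
    """
    n = len(records)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of records")
    scores = []
    for i, rec in enumerate(records):
        # Distances from this record to every other record, smallest first.
        dists = sorted(
            math.dist(rec, other)
            for j, other in enumerate(records)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


def most_isolated_records(records, k=5, n_records=10):
    """Return indices of the records farthest from their neighbors.

    A larger mean neighbor distance is treated here as a proxy for
    higher vulnerability; these are the candidates to target with an MIA.
    """
    scores = mean_knn_distance(records, k)
    ranked = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_records]


if __name__ == "__main__":
    # Four tightly clustered points and one clear outlier (index 4).
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(most_isolated_records(data, k=2, n_records=1))
```

The brute-force pairwise loop is quadratic in the number of records, which is acceptable for a sketch; a real dataset with categorical attributes would also need an encoding step before any distance is meaningful.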


