Private Synthetic Data for Multitask Learning and Marginal Queries

09/15/2022
by Giuseppe Vietri, et al.

We provide a differentially private algorithm for producing synthetic data that is simultaneously useful for multiple tasks: marginal queries and multitask machine learning (ML). A key innovation in our algorithm is the ability to handle numerical features directly, in contrast to a number of related prior approaches that require numerical features to first be converted into high-cardinality categorical features via a binning strategy. Finer binning granularity is required for better accuracy, but it negatively impacts scalability. Eliminating the need for binning allows us to produce synthetic data that preserves large numbers of statistical queries, such as marginals on numerical features and class-conditional linear threshold queries. Preserving the latter means that the fraction of points of each class label that fall in a given half-space is roughly the same in both the real and synthetic data. This is the property needed to train a linear classifier in a multitask setting. Our algorithm also allows us to produce high-quality synthetic data for mixed marginal queries, which combine both categorical and numerical features. Our method consistently runs 2-5x faster than the best comparable techniques, and provides significant accuracy improvements in both marginal queries and linear prediction tasks for mixed-type datasets.
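To make the class-conditional linear threshold query concrete, the following is a minimal sketch of how such a query could be evaluated on a dataset and compared between real and synthetic data. It is an illustration only, not the algorithm from the paper: it contains no privacy mechanism, the function names `threshold_query` and `max_query_error` are hypothetical, the toy rows are made up for the example, and normalizing by the full dataset size is an assumption.

```python
def threshold_query(rows, label, theta, tau):
    """Fraction of all rows with class `label` whose features x satisfy <theta, x> > tau.

    `rows` is a list of (features, label) pairs. Dividing by the full dataset
    size is an assumption made for this illustration.
    """
    hits = sum(
        1
        for x, y in rows
        if y == label and sum(t * v for t, v in zip(theta, x)) > tau
    )
    return hits / len(rows)


def max_query_error(real, synth, queries):
    """Largest absolute gap between real and synthetic answers over a set of queries."""
    return max(
        abs(threshold_query(real, *q) - threshold_query(synth, *q))
        for q in queries
    )


# Toy data (hypothetical): two numerical features and a binary class label.
real = [((0.2, 0.9), 1), ((0.8, 0.1), 0), ((0.5, 0.5), 1), ((0.9, 0.7), 0)]
synth = [((0.3, 0.8), 1), ((0.7, 0.1), 0), ((0.6, 0.6), 1), ((0.1, 0.1), 0)]

# Each query is (label, theta, tau): a class label plus a half-space.
queries = [
    (1, (1.0, 1.0), 1.05),
    (0, (1.0, -1.0), 0.5),
]

print(max_query_error(real, synth, queries))
```

A small maximum error over many such queries is the sense in which the synthetic data "preserves" them; the intuition is that a linear classifier's error depends on the data only through statistics of this kind.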


research
12/16/2021

Benchmarking Differentially Private Synthetic Data Generation Algorithms

This work presents a systematic benchmark of differentially private synt...
research
01/29/2022

AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data

We propose AIM, a novel algorithm for differentially private synthetic d...
research
06/03/2022

Utility and Disclosure Risk for Differentially Private Synthetic Categorical Data

This paper introduces two methods of creating differentially private (DP...
research
03/24/2021

Active Multitask Learning with Committees

The cost of annotating training data has traditionally been a bottleneck...
research
06/13/2022

Private Synthetic Data with Hierarchical Structure

We study the problem of differentially private synthetic data generation...
research
12/18/2013

Perturbed Gibbs Samplers for Synthetic Data Release

We propose a categorical data synthesizer with a quantifiable disclosure...
research
10/06/2022

Conditional Feature Importance for Mixed Data

Despite the popularity of feature importance measures in interpretable m...
