A Bayesian Framework for Generation of Fully Synthetic Mixed Datasets

02/16/2021
by Joseph Feldman, et al.

Much of the micro data used in epidemiological studies contains sensitive measurements on real individuals. As a result, such micro data cannot be published because of privacy concerns, which makes published statistical analyses based on them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic, high-dimensional micro datasets of mixed categorical, binary, count, and continuous variables. The process centers on a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.
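To make the idea of posterior predictive sampling concrete, the following is a minimal toy sketch and not the model proposed in the paper. It assumes two independent variables (one binary, one count) with simple conjugate priors, whereas the paper describes a joint model across all variable types. The data, the priors, and the helper name `synthesize` are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "confidential" data: one binary and one count variable.
n = 500
binary_obs = rng.binomial(1, 0.3, size=n)
count_obs = rng.poisson(4.0, size=n)


def synthesize(binary_obs, count_obs, rng):
    """Draw one fully synthetic dataset by posterior predictive sampling.

    Toy assumption: the two variables are modeled independently, with a
    Beta(1, 1) prior on the Bernoulli probability and a Gamma(1, rate=1)
    prior on the Poisson rate.
    """
    n = len(binary_obs)
    # Step 1: draw parameters from their (conjugate) posteriors.
    p = rng.beta(1 + binary_obs.sum(), 1 + n - binary_obs.sum())
    lam = rng.gamma(shape=1 + count_obs.sum(), scale=1.0 / (1 + n))
    # Step 2: draw entirely new records from the model at those parameters.
    return rng.binomial(1, p, size=n), rng.poisson(lam, size=n)


synthetic_binary, synthetic_count = synthesize(binary_obs, count_obs, rng)
```

No synthetic record corresponds to a real individual, yet summary statistics of the synthetic variables should resemble those of the observed ones. Because this sketch treats the variables as independent, it cannot preserve relationships between them, which is the gap a joint model is meant to close.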

research
02/16/2021

CTAB-GAN: Effective Table Data Synthesizing

While data sharing is crucial for knowledge development, privacy concern...
research
12/14/2021

Linear Discriminant Analysis with High-dimensional Mixed Variables

Datasets containing both categorical and continuous variables are freque...
research
10/26/2022

Nonparametric Copula Models for Mixed Data with Informative Missingness

Modern datasets commonly feature both substantial missingness and variab...
research
12/06/2019

Differentially Private Mixed-Type Data Generation For Unsupervised Learning

In this work we introduce the DP-auto-GAN framework for synthetic data g...
research
02/14/2021

Think Global and Act Local: Bayesian Optimisation over High-Dimensional Categorical and Mixed Search Spaces

High-dimensional black-box optimisation remains an important yet notorio...
research
08/13/2018

A Nonparametric Bayesian Method for Clustering of High-Dimensional Mixed Dataset

Motivation: Advances in next-generation sequencing (NGS) methods have en...
