GenSyn: A Multi-stage Framework for Generating Synthetic Microdata using Macro Data Sources

12/08/2022
by Angeela Acharya, et al.

Individual-level data (microdata) that characterizes a population is essential for studying many real-world problems. However, acquiring such data is not straightforward due to cost and privacy constraints, and access is often limited to aggregated data (macro data) sources. In this study, we examine synthetic data generation as a tool to extrapolate difficult-to-obtain high-resolution data by combining information from multiple easier-to-obtain lower-resolution data sources. In particular, we introduce a framework that uses univariate and multivariate frequency tables from a given target geographical location, together with frequency tables from other auxiliary locations, to generate synthetic microdata for individuals in the target location. Our method combines the estimation of a dependency graph and conditional probabilities from the target location with a Gaussian copula that leverages the available information from the auxiliary locations. We perform extensive testing on two real-world datasets and demonstrate that our approach outperforms prior approaches in preserving the overall dependency structure of the data while also satisfying the constraints defined on the different variables.
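To make the Gaussian copula idea concrete, the following is a minimal sketch and not the authors' implementation. It covers only the copula step for two categorical variables and omits the dependency graph and conditional probability estimation mentioned above. The function names, the frequency tables, and the correlation value `rho` are all hypothetical, chosen for illustration; in a setting like the one described, the marginals would plausibly come from the target location's frequency tables and the correlation from auxiliary locations.

```python
import bisect
import math
import random
from itertools import accumulate
from statistics import NormalDist


def make_inverse_cdf(freq_table):
    """Build an inverse CDF for a categorical variable from a frequency table.

    freq_table maps category -> count (a univariate macro-level table).
    The returned function maps u in [0, 1] to a category.
    """
    categories = list(freq_table)
    total = sum(freq_table.values())
    cumulative = list(accumulate(freq_table[c] / total for c in categories))

    def inverse_cdf(u):
        idx = bisect.bisect_right(cumulative, u)
        # Clamp to guard against floating-point round-off at the upper edge.
        return categories[min(idx, len(categories) - 1)]

    return inverse_cdf


def sample_microdata(freq_a, freq_b, rho, n, seed=0):
    """Draw n synthetic (a, b) records with a bivariate Gaussian copula.

    rho is the assumed latent correlation between the two variables
    (hypothetical here; it would have to be estimated from some other source).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    inv_a = make_inverse_cdf(freq_a)
    inv_b = make_inverse_cdf(freq_b)
    rows = []
    for _ in range(n):
        # Correlated standard normals via a 2x2 Cholesky factor.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then through each marginal's inverse CDF.
        rows.append((inv_a(std_normal.cdf(z1)), inv_b(std_normal.cdf(z2))))
    return rows


if __name__ == "__main__":
    # Illustrative toy tables, not data from the paper.
    age_freq = {"18-34": 300, "35-64": 500, "65+": 200}
    income_freq = {"low": 400, "mid": 400, "high": 200}
    for record in sample_microdata(age_freq, income_freq, rho=0.6, n=5):
        print(record)
```

Because each variable is pushed through its own inverse CDF, the sampled marginals approximately match the supplied frequency tables, while `rho` controls how strongly the two variables move together. Using only the standard library keeps the sketch self-contained; a real pipeline would likely handle more than two variables with a full correlation matrix.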


research
04/16/2019

SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula

Synthetic population generation is the process of combining multiple soc...
research
05/31/2021

Adaptive Multi-Source Causal Inference

Data scarcity is a tremendous challenge in causal effect estimation. In ...
research
07/04/2023

Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data

Synthetic data is emerging as the most promising solution to share indiv...
research
04/21/2021

Calibrated Optimal Decision Making with Multiple Data Sources and Limited Outcome

We consider the optimal decision-making problem in a primary sample of i...
research
02/07/2020

Multi-source Deep Gaussian Process Kernel Learning

For many problems, relevant data are plentiful but explicit knowledge is...
research
05/31/2023

Reinforced Borrowing Framework: Leveraging Auxiliary Data for Individualized Inference

Increasingly during the past decade, researchers have sought to leverage...
research
05/22/2020

OBDA for the Web: Creating Virtual RDF Graphs On Top of Web Data Sources

Due to Variety, Web data come in many different structures and formats, ...
