SynC: A Unified Framework for Generating Synthetic Population with Gaussian Copula

04/16/2019
by   Colin Wan, et al.

Synthetic population generation is the process of combining multiple socioeconomic and demographic datasets from various sources and at different granularities, and downscaling them to the individual level. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In this study, we propose a multi-stage framework called SynC (Synthetic Population via Gaussian Copula) to fill the gap. SynC first removes potential outliers in the data and then fits the filtered data with a Gaussian copula model to correctly capture the dependencies and marginal distributions of the sampled survey data. Finally, SynC leverages neural networks to merge the datasets into one and then scales them accordingly to match the marginal constraints. We make four key contributions in this work: 1) propose a novel framework for generating individual-level data from aggregated data sources by combining state-of-the-art machine learning and statistical techniques, 2) design a metric for validating the accuracy of generated data when the ground truth is hard to obtain, 3) demonstrate the framework's effectiveness with the Canada National Census data and present two real-world use cases where datasets of this nature can be leveraged by businesses, and 4) release an easy-to-use framework implementation for reproducibility.
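To make the Gaussian copula step concrete, the following is a minimal sketch of how individuals could be sampled so that each variable keeps its own marginal distribution while the pair stays correlated. It is not the released SynC implementation. The function name `sample_gaussian_copula`, the two marginals (a normal age and an exponential income), and the correlation value are all hypothetical choices made for illustration, and the sketch covers only two variables using the Python standard library.

```python
import math
import random
from statistics import NormalDist


def sample_gaussian_copula(n, rho, inverse_cdfs, seed=0):
    """Draw n pairs from a bivariate Gaussian copula with correlation rho.

    inverse_cdfs is a pair of functions mapping a uniform value in (0, 1)
    to a value of the corresponding marginal distribution.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()  # standard normal, used for the CDF transform
    eps = 1e-12                # keeps uniforms strictly inside (0, 1)
    samples = []
    for _ in range(n):
        # 1) Correlated standard normals (2x2 Cholesky factor written out).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # 2) Map each normal to a uniform with the standard normal CDF.
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        # 3) Map each uniform through the inverse CDF of its marginal.
        samples.append((inverse_cdfs[0](u1), inverse_cdfs[1](u2)))
    return samples


# Hypothetical marginals: age ~ Normal(40, 12), income ~ Exponential(mean 50).
AGE = NormalDist(mu=40.0, sigma=12.0)
MEAN_INCOME = 50.0  # in thousands; an assumed value, not from the paper


def income_inv_cdf(u):
    """Inverse CDF of an exponential distribution with mean MEAN_INCOME."""
    return -MEAN_INCOME * math.log(1.0 - u)


people = sample_gaussian_copula(
    20000, rho=0.7, inverse_cdfs=(AGE.inv_cdf, income_inv_cdf), seed=42
)
```

Each tuple in `people` is one synthetic individual. In a setting like the one the abstract describes, the marginals and the correlation structure would be estimated from the aggregated data rather than fixed by hand as they are here.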


