An experimental study on Synthetic Tabular Data Evaluation

11/19/2022
by Javier Marin, et al.

In this paper, we compare several methodologies for measuring how similar synthetic data is to the tabular data samples it was generated from. We focus on the case where the synthetic data has many more samples than the real data, which poses a particular challenge: validating the reliability of a synthetic dataset that is much larger than the original. We evaluate the global metrics most commonly used in the literature and introduce a novel approach based on analysis of the data's topological signature. Topological data analysis has several advantages for this challenge. It studies qualitative geometric information, focusing on geometric properties rather than on the quantitative values of a distance function. This is especially useful for high-dimensional synthetic data whose sample size has been significantly increased, which is comparable to introducing new data points into the data space within the limits set by the original data. In such large synthetic data spaces, points are much more concentrated than in the original space, so their analysis becomes much more sensitive both to the metric used and to noise. Qualitative geometric information relies instead on the concept of "closeness" between points. Finally, we suggest an approach based on the data's eigenvectors for evaluating the level of noise in synthetic data. This approach can also be used to assess the similarity of original and synthetic data.
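
The abstract does not spell out the eigenvector-based procedure, so the following is only a minimal sketch of one way such a comparison might look, not the authors' method. It assumes that "data eigenvectors" refers to the eigen-decomposition of the sample covariance matrix; the function names (`covariance_eigen`, `spectrum_gap`, `leading_alignment`) and the toy Gaussian data are hypothetical and chosen purely for illustration.

```python
import numpy as np


def covariance_eigen(data):
    """Eigenvalues (descending) and matching eigenvectors of the sample covariance."""
    cov = np.cov(data, rowvar=False)          # np.cov centers the columns itself
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def spectrum_gap(real, synthetic):
    """L1 distance between normalized eigenvalue spectra (assumed noise proxy)."""
    vals_real, _ = covariance_eigen(real)
    vals_syn, _ = covariance_eigen(synthetic)
    return float(np.abs(vals_real / vals_real.sum() - vals_syn / vals_syn.sum()).sum())


def leading_alignment(real, synthetic):
    """Absolute cosine between the leading eigenvectors of the two samples."""
    _, vecs_real = covariance_eigen(real)
    _, vecs_syn = covariance_eigen(synthetic)
    return float(abs(vecs_real[:, 0] @ vecs_syn[:, 0]))


# Toy data (hypothetical): a small "real" sample and two much larger synthetic ones.
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
real = rng.normal(size=(200, 3)) @ mixing
synthetic_clean = rng.normal(size=(5000, 3)) @ mixing
synthetic_noisy = synthetic_clean + rng.normal(scale=5.0, size=synthetic_clean.shape)

gap_clean = spectrum_gap(real, synthetic_clean)
gap_noisy = spectrum_gap(real, synthetic_noisy)
align_clean = leading_alignment(real, synthetic_clean)
```

In this sketch, isotropic noise flattens the eigenvalue spectrum, so `gap_noisy` should come out well above `gap_clean`, while `align_clean` should stay close to 1 when the synthetic sample preserves the dominant direction of the real data. Neither quantity depends on the two samples having the same number of rows, which matches the setting where the synthetic data is much larger than the original.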

09/12/2022

Rule-adhering synthetic data – the lingua franca of learning

AI-generated synthetic data allows to distill the general patterns of ex...
11/02/2022

Web-based Elicitation of Human Perception on mixup Data

Synthetic data is proliferating on the web and powering many advances in...
09/16/2018

Testing SensoGraph, a geometric approach for fast sensory evaluation

This paper introduces SensoGraph, a novel approach for fast sensory eval...
12/26/2018

Group evolution patterns in running races

We address the problem of tracking and detecting interactions between th...
01/15/2022

Sample Summary with Generative Encoding

With increasing sample sizes, all algorithms require longer run times th...
03/15/2018

Strategies to facilitate access to detailed geocoding information using synthetic data

In this paper we investigate if generating synthetic data can be a viabl...
12/13/2022

In-Season Crop Progress in Unsurveyed Regions using Networks Trained on Synthetic Data

Many commodity crops have growth stages during which they are particular...