
An experimental study on Synthetic Tabular Data Evaluation

by Javier Marin, et al.

In this paper, we present findings on several methodologies for measuring how similar synthetic data is to the tabular data samples it was generated from. We focus on the case where the synthetic data has many more samples than the real data. This case is especially difficult because the reliability of the synthetic data must be validated against a much smaller original sample. We evaluated the global metrics most commonly used in the literature and introduced a novel approach based on analysis of the data's topological signature. Topological data analysis has several advantages in addressing this challenge. The study of qualitative geometric information focuses on geometric properties while neglecting quantitative distance function values. This is especially useful with high-dimensional synthetic data whose sample size has been significantly increased. Enlarging the sample in this way is comparable to introducing new data points into the data space within the limits set by the original data. As a result, points in a large synthetic data space are much more concentrated than in the original space, and their analysis becomes much more sensitive to both the metrics used and noise. Qualitative geometric information relies instead on the concept of "closeness" between points. Finally, we suggest an approach based on the data's eigenvectors for evaluating the level of noise in synthetic data. This approach can also be used to assess the similarity of original and synthetic data.
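The abstract does not spell out the eigenvector procedure, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that similarity can be probed by comparing the leading eigenvector of the covariance matrix of the real data with that of the synthetic data. The function names (`covariance_matrix`, `leading_eigenvector`, `direction_alignment`) and the use of power iteration are illustrative assumptions.

```python
import math
import random


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b]
    return [[value / (n - 1) for value in row] for row in cov]


def leading_eigenvector(matrix, iterations=500, seed=0):
    """Approximate the unit eigenvector of the largest eigenvalue.

    Uses power iteration from a seeded random start (an assumption made
    here for simplicity; a library eigen-solver would also work).
    """
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def direction_alignment(real_rows, synthetic_rows):
    """Absolute cosine between the leading eigenvectors of both datasets.

    Values near 1 mean the dominant direction of variance is preserved;
    lower values suggest the synthetic data has drifted or is noisy.
    """
    u = leading_eigenvector(covariance_matrix(real_rows))
    v = leading_eigenvector(covariance_matrix(synthetic_rows))
    return abs(sum(a * b for a, b in zip(u, v)))
```

Because the score depends only on the direction of greatest variance, it does not change merely because the synthetic sample is larger than the real one. A fuller comparison would look at several eigenvectors and at the eigenvalue spectrum, which this sketch leaves out.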




Rule-adhering synthetic data – the lingua franca of learning

AI-generated synthetic data allows to distill the general patterns of ex...

Web-based Elicitation of Human Perception on mixup Data

Synthetic data is proliferating on the web and powering many advances in...

Testing SensoGraph, a geometric approach for fast sensory evaluation

This paper introduces SensoGraph, a novel approach for fast sensory eval...

Group evolution patterns in running races

We address the problem of tracking and detecting interactions between th...

Sample Summary with Generative Encoding

With increasing sample sizes, all algorithms require longer run times th...

Strategies to facilitate access to detailed geocoding information using synthetic data

In this paper we investigate if generating synthetic data can be a viabl...

In-Season Crop Progress in Unsurveyed Regions using Networks Trained on Synthetic Data

Many commodity crops have growth stages during which they are particular...