An experimental study on Synthetic Tabular Data Evaluation
In this paper, we present an experimental comparison of methodologies for measuring the similarity between real tabular data and the synthetic data generated from it. We focus on the case where the synthetic data has many more samples than the real data, which makes validating the reliability of the synthetic data particularly difficult. We evaluate the global metrics most commonly used in the literature and introduce a novel approach based on analysing the data's topological signature. Topological data analysis has several advantages in this setting: it studies qualitative geometric information, focusing on geometric properties rather than on the quantitative values of a distance function. This is especially useful for high-dimensional synthetic data whose sample size has been significantly increased. Enlarging the sample is comparable to introducing new data points into the data space within the limits set by the original data, so in large synthetic datasets the points are much more concentrated than in the original, and their analysis becomes much more sensitive to both the choice of metric and noise. Qualitative geometric information relies instead on the concept of "closeness" between points. Finally, we suggest an approach based on the data's eigenvectors for evaluating the level of noise in synthetic data; this approach can also be used to assess the similarity of original and synthetic data.
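The abstract does not say how the eigenvector-based comparison is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that the eigenvectors in question are the principal axes of each dataset's covariance matrix, and that similarity can be summarised as the absolute cosine between matched axes, weighted by the variance each axis explains in the real data. The function names `principal_axes` and `eigen_similarity`, the toy data, and the scoring rule are all hypothetical.

```python
import numpy as np


def principal_axes(data):
    """Return covariance eigenvalues (descending) and their matching eigenvectors."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


def eigen_similarity(real, synthetic):
    """Hypothetical score in [0, 1]: variance-weighted alignment of principal axes.

    Assumption: axes are matched by rank, and sign is ignored because an
    eigenvector and its negation describe the same direction.
    """
    real_vals, real_vecs = principal_axes(real)
    _, syn_vecs = principal_axes(synthetic)
    cosines = np.abs(np.sum(real_vecs * syn_vecs, axis=0))
    weights = real_vals / real_vals.sum()
    return float(np.sum(weights * cosines))


# Toy data (an assumption for illustration): a small "real" sample and a much
# larger synthetic sample drawn from the same linear mixing of Gaussian factors.
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.3]])
real = rng.normal(size=(200, 3)) @ mixing
synthetic_good = rng.normal(size=(5000, 3)) @ mixing

# A degraded synthetic set: heavy noise injected into the third column only,
# which changes the dominant direction of variance.
synthetic_noisy = synthetic_good.copy()
synthetic_noisy[:, 2] += rng.normal(scale=10.0, size=5000)

score_good = eigen_similarity(real, synthetic_good)
score_noisy = eigen_similarity(real, synthetic_noisy)
print(f"faithful synthetic: {score_good:.3f}")
print(f"noisy synthetic:    {score_noisy:.3f}")
```

In this sketch the score does not depend directly on sample size, which is why it can compare a 200-row real sample with a 5000-row synthetic one. It does have a blind spot: noise added equally to every column leaves the covariance eigenvectors unchanged in expectation, so a fuller noise check would also need to compare the eigenvalue spectra.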