On synthetic data with predetermined subject partitioning and cluster profiling, and pre-specified categorical variable marginal dependence structure

09/04/2017
by Michail Papathomas, et al.

A standard approach for assessing the performance of partition or mixture models is to create synthetic data sets with a pre-specified clustering structure, and then assess how well the model recovers this structure. Typically, subjects are assigned to different clusters, and variable observations are simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we consider observations from nominal, ordinal and interval categorical variables. Theoretical and empirical results are used to explore the dependence structure between the variables in relation to the clustering structure for the subjects. A novel approach is proposed that allows one to control the marginal association or correlation structure of the variables, and to specify exact correlation values. Practical examples are shown, and additional theoretical results are derived for interval data, commonly observed in cohort studies, including observations that emulate Single Nucleotide Polymorphisms. We compare a synthetic data set to a real one to demonstrate similarities and differences.
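The common simulation format described above can be illustrated with a minimal sketch. It is not the method proposed in the manuscript, and it does not set exact correlation values. It only shows that variables drawn independently within each cluster can still be marginally correlated once the cluster labels are ignored. The function names, the `profiles` layout, and all probabilities are hypothetical choices made for this illustration.

```python
import random

def simulate_clustered_categorical(n_subjects, cluster_weights, profiles, seed=0):
    """Assign each subject to a cluster, then draw every categorical
    variable independently from that cluster's probability profile.

    profiles[c][j] lists the category probabilities of variable j in
    cluster c (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    n_clusters = len(cluster_weights)
    labels, data = [], []
    for _ in range(n_subjects):
        # Draw the cluster label, then one category per variable.
        c = rng.choices(range(n_clusters), weights=cluster_weights)[0]
        row = [rng.choices(range(len(p)), weights=p)[0] for p in profiles[c]]
        labels.append(c)
        data.append(row)
    return labels, data

def pearson(x, y):
    """Sample Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Two clusters and two three-level variables coded 0/1/2, an
# SNP-like coding assumed here purely for illustration.
profiles = [
    [[0.7, 0.2, 0.1], [0.7, 0.2, 0.1]],  # cluster 0 favours category 0
    [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],  # cluster 1 favours category 2
]
labels, data = simulate_clustered_categorical(2000, [0.5, 0.5], profiles, seed=1)
x = [row[0] for row in data]
y = [row[1] for row in data]

# Positive marginal correlation, induced only by the shared cluster label.
marginal_corr = pearson(x, y)
print(round(marginal_corr, 3))
```

Under these assumed profiles, the two variables are independent given the cluster, so any marginal correlation comes entirely from the difference between the cluster profiles. Making the profiles more similar shrinks that correlation towards zero.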

research
03/30/2022

Generation and Simulation of Synthetic Datasets with Copulas

This paper proposes a new method to generate synthetic data sets based o...
research
10/21/2015

Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

We propose a novel method for multiple clustering that assumes a co-clus...
research
11/28/2018

A new correlation coefficient between categorical, ordinal and interval variables with Pearson characteristics

A prescription is presented for a new and practical correlation coeffici...
research
07/03/2019

A Bayesian Semiparametric Gaussian Copula Approach to a Multivariate Normality Test

In this paper, a Bayesian semiparametric copula approach is used to mode...
research
10/17/2017

The Bayesian Sorting Hat: A Decision-Theoretic Approach to Size-Constrained Clustering

Size-constrained clustering (SCC) refers to the dual problem of using ob...
research
11/15/2019

Variance partitioning in multilevel models for count data

A first step when fitting multilevel models to continuous responses is t...
