On synthetic data with predetermined subject partitioning and cluster profiling, and partially specified categorical variable marginal correlation structure

09/04/2017
by Michail Papathomas, et al.

A standard approach for assessing the performance of partition or mixture models is to create synthetic data sets with a pre-specified clustering structure, and evaluate how well the model reveals this structure. A common format is that subjects are assigned to different clusters, with variable observations simulated so that subjects within the same cluster have similar profiles, allowing for some variability. In this manuscript, we focus on observations from categorical variables. First, theoretical results are derived to explore the dependence structure between the variables, in relation to the clustering structure for the subjects. Then, a novel approach is proposed that allows partial control over the marginal correlation structure of the variables. Practical examples are shown and additional theoretical results are derived. To illustrate our methods, we focus on simulating observations that emulate Single Nucleotide Polymorphisms. We compare a synthetic dataset to a real one, to demonstrate the extent to which the correlation structure for the variables is controlled.
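
The standard setup described in the abstract can be sketched in a few lines of Python. This is only an illustration of the generic approach (subjects assigned to predetermined clusters, categorical observations drawn from cluster-specific profiles), not the authors' method for controlling the marginal correlation structure. The function name `simulate_clustered_categorical`, the cluster sizes, and the profile probabilities are all hypothetical choices made for this example; the three categories loosely mimic SNP genotypes coded 0, 1, 2.

```python
import numpy as np

def simulate_clustered_categorical(cluster_sizes, profiles, seed=0):
    """Draw categorical data for subjects with predetermined cluster labels.

    profiles has shape (n_clusters, n_variables, n_categories); each
    profiles[c, j] is a probability vector over the categories of variable j
    for subjects in cluster c. (Hypothetical helper, for illustration only.)
    """
    rng = np.random.default_rng(seed)
    profiles = np.asarray(profiles, dtype=float)
    n_clusters, n_vars, n_cats = profiles.shape
    # Predetermined partition: the first cluster_sizes[0] subjects are in cluster 0, etc.
    labels = np.repeat(np.arange(n_clusters), cluster_sizes)
    data = np.empty((labels.size, n_vars), dtype=int)
    for i, c in enumerate(labels):
        for j in range(n_vars):
            # Within a cluster, variables are drawn independently of each other.
            data[i, j] = rng.choice(n_cats, p=profiles[c, j])
    return labels, data

# Assumed example: two clusters, four SNP-like variables with categories 0, 1, 2.
low = [0.70, 0.25, 0.05]   # profile favouring category 0
high = [0.10, 0.30, 0.60]  # profile favouring category 2
profiles = [
    [low, low, high, high],   # cluster 0
    [high, high, low, low],   # cluster 1
]
labels, data = simulate_clustered_categorical([120, 80], profiles, seed=1)

# Empirical marginal correlation between the variables, pooled over clusters.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

Even though the variables are drawn independently within each cluster, pooling the clusters induces marginal correlation between variables whose profiles differ across clusters. That dependence is the quantity the abstract says the paper studies and seeks to partially control.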


research
03/30/2022

Generation and Simulation of Synthetic Datasets with Copulas

This paper proposes a new method to generate synthetic data sets based o...
research
10/21/2015

Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

We propose a novel method for multiple clustering that assumes a co-clus...
research
10/17/2017

The Bayesian Sorting Hat: A Decision-Theoretic Approach to Size-Constrained Clustering

Size-constrained clustering (SCC) refers to the dual problem of using ob...
research
03/08/2023

Rank Intraclass Correlation for Clustered Data

Clustered data are common in biomedical research. Observations in the sa...
research
12/12/2022

Tandem clustering with invariant coordinate selection

For high-dimensional data or data with noise variables, tandem clusterin...
research
03/24/2023

repliclust: Synthetic Data for Cluster Analysis

We present repliclust (from repli-cate and clust-er), a Python package f...
