Generating Multidimensional Clusters With Support Lines

01/24/2023
by Nuno Fachada, et al.

Synthetic data is essential for assessing clustering techniques: it complements and extends real data and allows for more complete coverage of a given problem's space. Synthetic data generators, in turn, can create vast amounts of data, a crucial capability when real-world data is at a premium, while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, 100% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal produces rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.
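To make the idea of a cluster "supported by a line segment" concrete, the following is a minimal sketch using only the Python standard library. It is not Clugen's API or algorithm: the function name `line_supported_cluster`, its parameters, and the choice of a uniform distribution along the line with Gaussian lateral offsets are all assumptions made for illustration.

```python
import math
import random


def line_supported_cluster(center, direction, length, num_points, lateral_std, rng):
    """Sample points scattered around a line segment in n dimensions.

    Hypothetical helper for illustration only; not part of Clugen.
    """
    # Normalize the direction so that `length` is measured in data units.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]

    points = []
    for _ in range(num_points):
        # Position along the segment, measured from its center. Uniform is
        # an assumption here; any other distribution could be swapped in.
        t = rng.uniform(-length / 2, length / 2)

        # Random offset with its component parallel to the line removed,
        # so the offset is orthogonal to the supporting segment.
        noise = [rng.gauss(0.0, lateral_std) for _ in unit]
        parallel = sum(n * u for n, u in zip(noise, unit))

        point = [
            c + t * u + (n - parallel * u)
            for c, u, n in zip(center, unit, noise)
        ]
        points.append(point)
    return points


rng = random.Random(42)  # fixed seed for reproducibility
points = line_supported_cluster(
    center=[0.0, 0.0, 0.0],
    direction=[1.0, 1.0, 0.0],
    length=10.0,
    num_points=200,
    lateral_std=0.5,
    rng=rng,
)
```

Calling the function several times with different centers, directions and lengths would give a multi-cluster data set. The number of dimensions follows from the length of `center` and `direction`.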
