Cube Sampled K-Prototype Clustering for Featured Data

08/23/2021
by   Seemandhar Jain, et al.

Clustering large amounts of data is becoming increasingly important. Because of the large data sizes, clustering algorithms often take too much time. Sampling the data before clustering is commonly used to reduce this time. In this work, we propose using a probabilistic sampling technique called cube sampling together with K-Prototype clustering. Cube sampling is used because of its accurate sample selection. K-Prototype is the most frequently used clustering algorithm when the data is both numerical and categorical, which is very common today. The novelty of this work is in obtaining the crucial inclusion probabilities for cube sampling using Principal Component Analysis (PCA). Experiments on multiple datasets from the UCI repository demonstrate that the cube sampled K-Prototype algorithm gives the best clustering accuracy among other popular clustering algorithms sampled in the same way (K-Means, Hierarchical Clustering (HC), and Spectral Clustering (SC)). When compared with unsampled K-Prototype, K-Means, HC, and SC, it still has the best accuracy, with the added advantage of reduced computational complexity (due to the reduced data size).
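To illustrate why K-Prototype suits data that is both numerical and categorical, the following is a minimal sketch of the mixed dissimilarity commonly used by K-Prototype: squared Euclidean distance on the numerical attributes plus a weighted count of mismatches on the categorical attributes. The function names, the `gamma` weight, and the toy records are assumptions made for illustration; they are not taken from the paper, and the sketch covers neither cube sampling nor the PCA-based inclusion probabilities.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Dissimilarity between one record and one prototype.

    Numerical part: squared Euclidean distance.
    Categorical part: number of mismatching attributes, weighted by gamma.
    (gamma is a hypothetical default; in practice it is tuned per dataset.)
    """
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def assign_to_prototypes(X_num, X_cat, protos_num, protos_cat, gamma=1.0):
    """Assign each record to the index of its closest prototype."""
    labels = []
    for x_num, x_cat in zip(X_num, X_cat):
        costs = [
            mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
            for p_num, p_cat in zip(protos_num, protos_cat)
        ]
        labels.append(costs.index(min(costs)))
    return labels


# Toy records (illustrative only): two numerical and one categorical attribute.
X_num = [[0.0, 0.0], [10.0, 10.0], [1.0, 0.0]]
X_cat = [["red"], ["blue"], ["red"]]
protos_num = [[0.0, 0.0], [10.0, 10.0]]
protos_cat = [["red"], ["blue"]]

labels = assign_to_prototypes(X_num, X_cat, protos_num, protos_cat, gamma=0.5)
```

A full K-Prototype run alternates this assignment step with a prototype update (mean of the numerical attributes, mode of the categorical ones) until the assignments stop changing. In the sampled setting described above, these steps would run on the selected sample rather than on the full dataset.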


