Clustering categorical data via ensembling dissimilarity matrices

06/26/2015
by Saeid Amiri, et al.

We present a technique for clustering categorical data by generating many dissimilarity matrices and averaging over them. We begin by demonstrating our technique on low-dimensional categorical data and comparing it to several previously proposed techniques. We then give conditions under which our method should yield good results in general. Our method extends to high-dimensional categorical data vectors of equal length by ensembling over many choices of explanatory variables; in this context we compare our method with two other methods. Finally, we extend our method to high-dimensional categorical data vectors of unequal length by using alignment techniques to equalize the lengths. We give examples showing that our method continues to provide good results; in particular, for genome sequences it yields better clusterings than those suggested by phylogenetic trees.
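The core idea of averaging many dissimilarity matrices can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes Hamming dissimilarity, random subsets of the variables as ensemble members, and average-linkage merging, none of which is specified in the abstract. The function names, parameters, and toy data are all hypothetical.

```python
import random

def hamming(a, b, cols):
    """Fraction of the selected columns on which rows a and b disagree."""
    return sum(a[c] != b[c] for c in cols) / len(cols)

def ensemble_dissimilarity(rows, n_members=50, subset_size=2, seed=0):
    """Average Hamming dissimilarity matrices built on random column subsets.

    Assumption: each ensemble member uses a random subset of variables.
    """
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    total = [[0.0] * n for _ in range(n)]
    for _ in range(n_members):
        cols = rng.sample(range(p), subset_size)
        for i in range(n):
            for j in range(i + 1, n):
                d = hamming(rows[i], rows[j], cols)
                total[i][j] += d
                total[j][i] += d
    return [[v / n_members for v in row] for row in total]

def average_linkage(dist, k):
    """Merge the two closest clusters (mean pairwise distance) until k remain."""
    clusters = [[i] for i in range(len(dist))]

    def gap(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda xy: gap(clusters[xy[0]], clusters[xy[1]]))
        # b > a, so popping b leaves index a pointing at the same cluster.
        clusters[a] += clusters.pop(b)
    return [sorted(c) for c in clusters]

# Toy categorical data (made up for illustration): two groups of three rows.
rows = [
    ("a", "x", "p", "m"),
    ("a", "x", "p", "n"),
    ("a", "x", "q", "m"),
    ("b", "y", "r", "s"),
    ("b", "y", "r", "t"),
    ("b", "z", "r", "s"),
]
D = ensemble_dissimilarity(rows)
labels = average_linkage(D, k=2)
print(labels)
```

Rows from different groups in this toy data disagree on every variable, so their averaged dissimilarity is exactly 1 regardless of which subsets are drawn, while rows within a group stay closer together. Real data would be far less clean, and the number of ensemble members and the subset size would need tuning.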

research
11/13/2021

Efficient Binary Embedding of Categorical Data using BinSketch

In this work, we present a dimensionality reduction algorithm, aka. sket...
research
08/26/2019

Sufficient Representations for Categorical Variables

Many learning algorithms require categorical data to be transformed into...
research
09/27/2020

A grammar of graphics framework for generalized parallel coordinate plots

Parallel coordinate plots (PCP) are a useful tool in exploratory data an...
research
02/14/2021

Think Global and Act Local: Bayesian Optimisation over High-Dimensional Categorical and Mixed Search Spaces

High-dimensional black-box optimisation remains an important yet notorio...
research
11/13/2019

Generating Stereotypes Automatically For Complex Categorical Features

In the context of stereotypes creation for recommender systems, we found...
research
08/24/2017

GALILEO: A Generalized Low-Entropy Mixture Model

We present a new method of generating mixture models for data with categ...
research
02/28/2020

Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties

We propose a method for estimation in high-dimensional linear models wit...
