Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm

09/13/2022
by Marek Gagolewski, et al.

The running time of a hierarchical clustering algorithm is most often dominated by the number of evaluations of a pairwise dissimilarity measure. For larger data sets, this constraint puts all the classical linkage criteria except single linkage at a disadvantage. However, single linkage clustering is known to be very sensitive to outliers and to produce highly skewed dendrograms; it therefore usually does not reflect the true underlying data structure unless the clusters are well separated. To overcome these limitations, we propose a new hierarchical clustering linkage criterion called Genie. Namely, our algorithm links two clusters in such a way that a chosen economic inequity measure (e.g., the Gini or Bonferroni index) of the cluster sizes does not increase drastically above a given threshold. The presented benchmarks indicate the high practical usefulness of the introduced method: it most often outperforms Ward or average linkage in terms of clustering quality while retaining the speed of single linkage. The Genie algorithm is easily parallelizable and thus may be run on multiple threads to speed up its execution even further. Its memory overhead is small: there is no need to precompute the complete distance matrix in order to obtain a desired clustering. It can be applied to arbitrary spaces equipped with a dissimilarity measure, e.g., real vectors, DNA or protein sequences, images, rankings, informetric data, etc. A reference implementation of the algorithm is included in the open-source 'genie' package for R. See also https://genieclust.gagolewski.com for a new implementation (genieclust), available for both R and Python.
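
The linkage rule described in the abstract can be illustrated with a small, deliberately naive Python sketch. It is not the reference implementation: the function names (`gini`, `single_link`, `genie_sketch`), the toy data, and the default threshold of 0.3 are assumptions made for illustration, and the rule used here (when the Gini index of the cluster sizes exceeds the threshold, only merges involving a smallest cluster are allowed) is one reading of the criterion. The sketch recomputes all pairwise distances at every step, so it has none of the speed or memory properties claimed above.

```python
from itertools import combinations
import math


def gini(sizes):
    """Normalised Gini index of positive cluster sizes (0 means all equal)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    total_diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return total_diff / ((n - 1) * sum(sizes))


def single_link(a, b, points):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def genie_sketch(points, n_clusters, gini_threshold=0.3):
    """Naive agglomerative loop; returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        pairs = list(combinations(range(len(clusters)), 2))
        if gini(sizes) > gini_threshold:
            # Sizes are too unequal: only allow merges that involve
            # a cluster of the current minimal size.
            smallest = min(sizes)
            pairs = [(a, b) for a, b in pairs
                     if smallest in (sizes[a], sizes[b])]
        # Among the allowed pairs, merge the closest one (single linkage).
        a, b = min(pairs, key=lambda p: single_link(clusters[p[0]],
                                                    clusters[p[1]], points))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    return clusters


# Hypothetical toy data: two compact groups of four points and one outlier.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 0), (5, 1), (6, 0), (6, 1),
          (2, 10)]

print(genie_sketch(POINTS, 2, gini_threshold=0.3))  # inequity-constrained
print(genie_sketch(POINTS, 2, gini_threshold=1.0))  # never triggers: single linkage
```

On this toy data, the unconstrained run isolates the outlier as its own cluster and merges the two groups, whereas the constrained run keeps the two groups apart and attaches the outlier to the nearer one. Because the Gini index of positive sizes stays below 1, a threshold of 1.0 in this sketch reduces to plain single linkage.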

research
06/01/2016

Short Communication on QUIST: A Quick Clustering Algorithm

In this short communication we introduce the quick clustering algorithm ...
research
10/01/2022

A new nonparametric interpoint distance-based measure for assessment of clustering

A new interpoint distance-based measure is proposed to identify the opti...
research
07/05/2019

Hybridized Threshold Clustering for Massive Data

As the size n of datasets become massive, many commonly-used clustering ...
research
03/18/2022

Statistical analysis of a hierarchical clustering algorithm with outliers

It is well known that the classical single linkage algorithm usually fai...
research
04/27/2020

Hierarchical clustering of bipartite data sets based on the statistical significance of coincidences

When a set 'entities' are related by the 'features' they share they are ...
research
02/24/2023

Bayesian contiguity constrained clustering, spanning trees and dendrograms

Clustering is a well-known and studied problem, one of its variants, cal...
research
07/12/2017

ClustGeo: an R package for hierarchical clustering with spatial constraints

In this paper, we propose a Ward-like hierarchical clustering algorithm ...
