Metrics for quantifying isotropy in high dimensional unsupervised clustering tasks in a materials context

05/25/2023
by Samantha Durdy, et al.

Clustering is a common task in machine learning, but clusters of unlabelled data can be hard to quantify. The application of clustering algorithms in chemistry is often dependent on material representation. Ascertaining the effects of different representations, clustering algorithms, or data transformations on the resulting clusters is difficult due to the dimensionality of these data. We present a thorough analysis of measures for isotropy of a cluster, including a novel implementation based on an existing derivation. Using fractional anisotropy, a common method in medical imaging, for comparison, we then expand these measures to examine the average isotropy of a set of clusters. A use case for such measures is demonstrated by quantifying the effects of kernel approximation functions on different representations of the Inorganic Crystal Structure Database. Broader applicability of these methods is demonstrated by analysing learnt embeddings of the MNIST dataset. Random clusters are explored to examine the differences between the isotropy measures presented, and to see how each method scales with dimensionality. Python implementations of these measures are provided for use by the community.
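To make the fractional anisotropy comparison concrete, here is a minimal Python sketch. It is not the authors' published implementation. It assumes fractional anisotropy is computed from the eigenvalues of a cluster's covariance matrix, and that the usual three-dimensional formula is generalised to n dimensions with a factor of sqrt(n / (n - 1)). The function names `fractional_anisotropy` and `mean_fractional_anisotropy` are hypothetical, and the unweighted mean over clusters is also an assumption.

```python
import numpy as np


def fractional_anisotropy(points):
    """Fractional anisotropy of one cluster (assumed n-dimensional generalisation).

    Returns a value near 0 for an isotropic cluster and near 1 for a cluster
    whose variance lies along a single direction.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim != 2 or points.shape[1] < 2 or points.shape[0] < 2:
        raise ValueError("expected an array of shape (n_points >= 2, n_dims >= 2)")
    n_dims = points.shape[1]

    # Eigenvalues of the covariance matrix give the variance along each principal axis.
    cov = np.cov(points, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative values from round-off

    denom = np.sum(eigvals ** 2)
    if denom == 0.0:
        return 0.0  # all points coincide; treat as isotropic by convention (assumption)

    num = np.sum((eigvals - eigvals.mean()) ** 2)
    return float(np.sqrt(n_dims / (n_dims - 1) * num / denom))


def mean_fractional_anisotropy(clusters):
    """Unweighted mean of the fractional anisotropy over a set of clusters (assumption)."""
    return float(np.mean([fractional_anisotropy(c) for c in clusters]))


rng = np.random.default_rng(0)
isotropic = rng.standard_normal((5000, 3))             # roughly spherical cluster
line = np.outer(np.linspace(0.0, 1.0, 200), [1.0, 2.0, 3.0])  # points on a straight line

fa_isotropic = fractional_anisotropy(isotropic)
fa_line = fractional_anisotropy(line)
fa_mean = mean_fractional_anisotropy([isotropic, line])
```

In this sketch the roughly spherical Gaussian cluster scores close to 0 and the collinear cluster scores close to 1, so a lower mean value over a set of clusters indicates more isotropic clusters on average.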

research
12/27/2016

Clustering with Confidence: Finding Clusters with Statistical Guarantees

Clustering is a widely used unsupervised learning method for finding str...
research
06/29/2018

Grapevine: A Wine Prediction Algorithm Using Multi-dimensional Clustering Methods

We present a method for a wine recommendation system that employs multid...
research
10/13/2018

Measuring Swampiness: Quantifying Chaos in Large Heterogeneous Data Repositories

As scientific data repositories and filesystems grow in size and complex...
research
06/23/2022

Quantifying Distances Between Clusters with Elliptical or Non-Elliptical Shapes

Finite mixture models that allow for a broad range of potentially non-el...
research
08/24/2023

Powerful Significance Testing for Unbalanced Clusters

Clustering methods are popular for revealing structure in data, particul...
research
01/13/2022

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

The failure of the Euclidean norm to reliably distinguish between nearby...
research
12/23/2019

Quantifying the Effects of the 2008 Recession using the Zillow Dataset

This report explores the use of Zillow's housing metrics dataset to inve...
