Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

10/22/2019
by Andrew Lensen, et al.

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally pre-defined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this paper, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
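
The idea of a dataset-specific similarity function feeding a graph-based clustering step can be illustrated with a small Python sketch. Everything named here is an assumption made for illustration: `evolved_similarity` is a hand-written stand-in for the kind of expression a genetic programming tree might encode (it uses only features 0 and 2 and combines them with `max`, `*`, and `+`), and `cluster_by_similarity` is one simple graph-based scheme (link each instance to its most similar neighbour, then take connected components). Neither is taken from the paper.

```python
def evolved_similarity(a, b):
    """Hypothetical stand-in for an evolved tree (not from the paper).

    Feature selection: only features 0 and 2 are used.
    Feature construction: their differences are combined with max, * and +.
    Higher return values mean "more similar".
    """
    d0 = abs(a[0] - b[0])
    d2 = abs(a[2] - b[2])
    return -(max(d0, d2) + 0.5 * d0 * d2)


def cluster_by_similarity(points, similarity):
    """Assumed graph-based clustering: connect each point to its single most
    similar neighbour, then label points by connected component."""
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        best = max(
            (j for j in range(n) if j != i),
            key=lambda j: similarity(points[i], points[j]),
        )
        parent[find(i)] = find(best)  # add edge i -- best to the graph

    return [find(i) for i in range(n)]


# Toy data: feature 1 is noise; features 0 and 2 separate two groups.
points = [
    (0.0, 9.0, 0.0), (0.1, -5.0, 0.2), (0.2, 3.0, 0.1),
    (5.0, 9.1, 5.0), (5.1, -5.1, 5.2), (5.2, 3.1, 4.9),
]
labels = cluster_by_similarity(points, evolved_similarity)
print(labels)
```

On this toy data the first three points share one label and the last three share another, because the similarity function ignores the noisy middle feature. In the approach described above, the function body would be found by evolutionary search rather than written by hand.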

research
02/02/2018

Generating Redundant Features with Unsupervised Multi-Tree Genetic Programming

Recently, feature selection has become an increasingly important area of...
research
08/23/2021

Genetic Programming for Manifold Learning: Preserving Local Topology

Manifold learning methods are an invaluable tool in today's world of inc...
research
12/17/2019

Balancing the Tradeoff Between Clustering Value and Interpretability

Graph clustering groups entities – the vertices of a graph – based on th...
research
11/29/2012

Overlapping clustering based on kernel similarity metric

Producing overlapping schemes is a major issue in clustering. Recent pro...
research
08/26/2022

Comparing Apples to Oranges: Learning Similarity Functions for Data Produced by Different Distributions

Similarity functions measure how comparable pairs of elements are, and p...
research
04/27/2022

Evolving Generalizable Multigrid-Based Helmholtz Preconditioners with Grammar-Guided Genetic Programming

Solving the indefinite Helmholtz equation is not only crucial for the un...
research
12/24/2019

Self-adaption grey DBSCAN clustering

Clustering analysis, a classical issue in data mining, is widely used in...
