Hierarchical Clustering with Prior Knowledge

06/09/2018
by   Xiaofei Ma, et al.

Hierarchical clustering is a class of algorithms that seek to build a hierarchy of clusters. It has been the dominant approach to constructing embedded classification schemes because it outputs dendrograms, which capture the hierarchical relationships among members at all levels of granularity simultaneously. Being greedy in the algorithmic sense, hierarchical clustering partitions the data at every step based solely on a similarity or dissimilarity measure. The clustering results often depend not only on the distribution of the underlying data, but also on the choice of dissimilarity measure and clustering algorithm. In this paper, we propose a method to incorporate prior domain knowledge about entity relationships into hierarchical clustering. Specifically, we use a distance function in ultrametric space to encode the external ontological information. We show that popular linkage-based algorithms can faithfully recover the encoded structure. Similar to some regularized machine learning techniques, we add this distance as a penalty term to the original pairwise distance to regulate the final structure of the dendrogram. As a case study, we apply this method to real data to build a customer-behavior-based product taxonomy for an Amazon service, leveraging information from a larger Amazon-wide browse structure. The method is useful when one wants to leverage relational information from external sources, or when the data used to generate the distance matrix are noisy and sparse. Our work falls into the category of semi-supervised or constrained clustering.
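
The penalty idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the ultrametric is built or how the penalty is weighted, so the depth-based tree distance in `prior_ultrametric`, the weight `lam`, the helper name `penalized_linkage`, and the four-item toy data are all assumptions made for illustration. The sketch uses SciPy's `linkage` with average linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def prior_ultrametric(paths):
    """Tree distance between leaves (assumed form, not from the paper).

    Each item is described by its category path from the root. The distance
    between two distinct items is (path length + 1) minus the number of
    leading categories they share, i.e. the height of their lowest common
    ancestor. Distances of this kind satisfy the ultrametric inequality.
    """
    n = len(paths)
    depth = len(paths[0]) + 1  # +1 for the leaf level itself
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = 0
            for a, b in zip(paths[i], paths[j]):
                if a != b:
                    break
                shared += 1
            dist[i, j] = dist[j, i] = depth - shared
    return dist


def penalized_linkage(data_dist, prior_dist, lam=1.0, method="average"):
    """Cluster on data distance plus lam times the prior distance."""
    combined = data_dist + lam * prior_dist
    return linkage(squareform(combined), method=method)


# Hypothetical toy taxonomy: items 0 and 1 are phones, 2 is a laptop, 3 is a book.
paths = [
    ("electronics", "phones"),
    ("electronics", "phones"),
    ("electronics", "laptops"),
    ("books", "fiction"),
]
prior = prior_ultrametric(paths)

# Hypothetical noisy behavioral distances: item 0 looks closest to the book.
data_dist = np.array([
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.9],
    [0.1, 0.8, 0.9, 0.0],
])

Z_plain = penalized_linkage(data_dist, prior, lam=0.0)      # data only
Z_penalized = penalized_linkage(data_dist, prior, lam=1.0)  # data + prior

# Cut the penalized dendrogram into two flat clusters.
labels = fcluster(Z_penalized, t=2, criterion="maxclust")
print("first merge without prior:", Z_plain[0, :2])
print("first merge with prior:   ", Z_penalized[0, :2])
print("two-cluster labels:       ", labels)
```

In this toy setting the unpenalized run merges the phone with the book first, because that is what the noisy data suggest, while the penalized run merges the two phones first and keeps the book apart. Increasing `lam` pulls the dendrogram toward the prior taxonomy; setting it to zero recovers ordinary linkage clustering on the data alone.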

research
08/21/2020

ConiVAT: Cluster Tendency Assessment and Clustering with Partial Background Knowledge

The VAT method is a visual technique for determining the potential clust...
research
05/24/2023

Hierarchical clustering with dot products recovers hidden tree structure

In this paper we offer a new perspective on the well established agglome...
research
02/25/2023

Semi-supervised Clustering with Two Types of Background Knowledge: Fusing Pairwise Constraints and Monotonicity Constraints

This study addresses the problem of performing clustering in the presenc...
research
11/29/2017

HSC: A Novel Method for Clustering Hierarchies of Networked Data

Hierarchical clustering is one of the most powerful solutions to the pro...
research
04/18/2023

On clustering levels of a hierarchical categorical risk factor

Handling nominal covariates with a large number of categories is challen...
research
04/30/2021

Flattening Multiparameter Hierarchical Clustering Functors

We bring together topological data analysis, applied category theory, an...
research
09/13/2016

A Greedy Algorithm to Cluster Specialists

Several recent deep neural networks experiments leverage the generalist-...
