Hierarchical Clustering

What is Hierarchical Clustering?

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. As in clustering generally, the goal is to group objects so that objects in the same cluster are more similar to each other than to those in other clusters. What makes hierarchical clustering distinctive is that it builds up or breaks down clusters step by step, a process that can be represented in a dendrogram: a tree-like diagram that records the sequence of merges or splits.

Types of Hierarchical Clustering

There are two main types of hierarchical clustering: Agglomerative and Divisive.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering, also known as bottom-up clustering, starts with each object in its own cluster. At each step, it merges the closest pair of clusters, until all objects have been merged into one cluster. This is the more common of the two approaches and is used to analyze many kinds of data.
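
To make the procedure concrete, here is a minimal sketch using SciPy's scipy.cluster.hierarchy module; the toy data and the choice of Ward linkage are assumptions made for illustration, not requirements:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Toy data: six 2-D points forming two loose groups (illustrative only).
    X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
                  [5.0, 5.0], [5.2, 4.8], [4.8, 5.3]])

    # Agglomerative clustering: every point starts as its own cluster and
    # the closest pair of clusters is merged at each step. Z records the
    # full merge sequence.
    Z = linkage(X, method="ward")

    # The dendrogram draws the recorded merges as a tree.
    dendrogram(Z)
    plt.show()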

Divisive Hierarchical Clustering

Divisive hierarchical clustering, also known as top-down clustering, starts with all objects in a single cluster. At each step, it splits the least cohesive cluster into two, until each object ends up in its own cluster. This method is less common than agglomerative clustering, partly because finding a good split of a cluster is expensive, but it can be preferable when the broad, top-level structure of the data is what matters most.
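
SciPy does not ship a divisive algorithm, so the following is only a simplified sketch of the top-down idea, implemented here as repeated 2-means bisection with scikit-learn; the function name divisive_clustering and the sum-of-squares cohesion criterion are assumptions for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    def divisive_clustering(X, n_clusters):
        # Start with all points in one cluster (stored as index arrays).
        clusters = [np.arange(len(X))]
        while len(clusters) < n_clusters:
            # Pick the least cohesive splittable cluster: the one with the
            # largest within-cluster sum of squared deviations.
            def scatter(idx):
                pts = X[idx]
                return ((pts - pts.mean(axis=0)) ** 2).sum()
            splittable = [i for i, idx in enumerate(clusters) if len(idx) > 1]
            worst = max(splittable, key=lambda i: scatter(clusters[i]))
            idx = clusters.pop(worst)
            # Split the chosen cluster into two sub-clusters with 2-means.
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
            clusters.append(idx[labels == 0])
            clusters.append(idx[labels == 1])
        return clusters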

Measuring Similarity or Distance

The core of hierarchical clustering is the measure of similarity (or distance) between data points or clusters. Commonly used distance measures include Euclidean distance, Manhattan distance, and cosine distance (one minus cosine similarity). For clusters, there are several linkage criteria that determine the distance between sets of observations (a comparison sketch follows this list):

  • Single linkage: the minimum distance between any member of one cluster and any member of the other.
  • Complete linkage: the maximum distance between any member of one cluster and any member of the other.
  • Average linkage: the average of all pairwise distances between members of the two clusters.
  • Centroid linkage: the distance between the centroids of the clusters.
  • Ward’s method: the increase in total within-cluster sum of squares that would result from merging the two clusters.
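
Here is a minimal sketch of both ideas, assuming SciPy and toy data chosen for illustration: pdist compares point-level metrics, and linkage shows how the choice of criterion changes the recorded merge distances on the same data.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])

    # Point-level distances under different metrics.
    print(pdist(X[:3], metric="euclidean"))
    print(pdist(X[:3], metric="cityblock"))  # Manhattan distance
    print(pdist(X[:3], metric="cosine"))     # 1 - cosine similarity

    # Cluster-level criteria: column 2 of each row of Z holds the
    # distance at which that merge occurred.
    for method in ["single", "complete", "average", "centroid", "ward"]:
        Z = linkage(X, method=method)
        print(method, "final merge distance:", Z[-1, 2])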

Advantages and Disadvantages

Hierarchical clustering has several advantages:

  • It does not require a pre-specified number of clusters; a flat clustering can be extracted from the hierarchy afterwards (see the sketch after this list).
  • It produces a dendrogram, which helps with understanding both the data and the clustering result.
  • It can reveal fine-grained relationships between data objects and clusters.
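
As a minimal sketch of the first point, again assuming SciPy and illustrative toy data: the tree is built once, and flat clusterings are extracted afterwards with fcluster, either at a distance threshold or for a chosen cluster count.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.9]])
    Z = linkage(X, method="ward")

    # No cluster count was fixed in advance; the same tree can be cut
    # later in different ways.
    print(fcluster(Z, t=2.0, criterion="distance"))  # cut at height 2.0
    print(fcluster(Z, t=2, criterion="maxclust"))    # force two flat clusters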

However, it also has some disadvantages:

  • It is computationally intensive: naive agglomerative algorithms take O(n³) time and O(n²) memory, which makes them impractical for very large datasets.
  • It is sensitive to noise and outliers in the data.
  • Merges (and splits) are greedy: once two clusters are combined, the decision cannot be undone.

Applications of Hierarchical Clustering

Hierarchical clustering is widely used in various fields such as:

  • Biology, for constructing phylogenetic trees.
  • Text mining and information retrieval, for document clustering.
  • Market research, for understanding consumer preferences and segmenting markets.
  • Healthcare, for patient stratification based on medical features.

Conclusion

Hierarchical clustering is a versatile tool that can be applied to a wide array of problems involving the grouping of objects. Its ability to build a full hierarchy of clusters without requiring the number of clusters in advance makes it a valuable technique in exploratory data analysis. Despite its computational cost and sensitivity to outliers, it remains a popular choice because of the interpretability of the dendrogram and the detailed insight it provides into the structure of the data.
