
Contrastive Multi-view Hyperbolic Hierarchical Clustering

by   Fangfei Lin, et al.

Hierarchical clustering recursively partitions data at an increasingly finer granularity. In real-world applications, multi-view data have become increasingly important. This raises a less investigated problem, i.e., multi-view hierarchical clustering, to better understand the hierarchical structure of multi-view data. To this end, we propose a novel neural network-based model, namely Contrastive Multi-view Hyperbolic Hierarchical Clustering (CMHHC). It consists of three components, i.e., multi-view alignment learning, aligned feature similarity learning, and continuous hyperbolic hierarchical clustering. First, we align sample-level representations across multiple views in a contrastive way to capture the view-invariance information. Next, we utilize both the manifold and Euclidean similarities to improve the metric property. Then, we embed the representations into a hyperbolic space and optimize the hyperbolic embeddings via a continuous relaxation of hierarchical clustering loss. Finally, a binary clustering tree is decoded from optimized hyperbolic embeddings. Experimental results on five real-world datasets demonstrate the effectiveness of the proposed method and its components.


1 Introduction

Clustering is one of the fundamental problems in data analysis, which aims to categorize unlabeled data points into clusters. Existing clustering methods can be divided into partitional clustering and hierarchical clustering (HC) [Jain et al.1999]. The difference is that partitional clustering produces only one partition, while hierarchical clustering produces a nested series of partitions. Compared with partitional clustering, hierarchical clustering reveals more information about fine-grained similarity relationships and structures of data.

Hierarchical clustering has gained intensive attention. There are usually two types of hierarchical methods, i.e., the agglomerative methods and continuous ones. Agglomerative methods include Single-Linkage, Complete-Linkage, Average-Linkage, Ward-Linkage, etc.; continuous methods include Ultrametric Fitting (UFit) [Chierchia and Perret2020] and Hyperbolic Hierarchical Clustering (HypHC) [Chami et al.2020], etc. Existing HC methods are usually limited to data from a single source. However, with the advances of data acquisition in real-world applications, data of an underlying signal is often collected from heterogeneous sources or feature subsets [Li et al.2018]. For example, images can be described by the local binary pattern (LBP) descriptor and the scale-invariant feature transform (SIFT) descriptor; websites can be represented by text, pictures, and other structured metadata. Data from different views may contain complementary and consensus information. Hence it would be greatly beneficial to utilize multi-view data to fertilize hierarchical clustering.

Compared with partitional clustering and single-view hierarchical clustering, multi-view hierarchical clustering is less investigated. It requires a finer-grained understanding of both the consistency and differences among multiple views. To this end, we propose a novel Contrastive Multi-view Hyperbolic Hierarchical Clustering (CMHHC) model, as depicted in Figure 1. The proposed model consists of three components, i.e., multi-view alignment learning, aligned feature similarity learning, and continuous hyperbolic hierarchical clustering. First, to encourage consistency among multiple views and to suppress view-specific noise, we align the representations by contrastive learning, whose intuition is that the same instance from different views should be mapped closer while different instances should be mapped separately. Next, to improve the metric property among data instances, we exploit more reasonable similarity of aligned features measured both on manifolds and in the Euclidean space, upon which we mine hard positives and hard negatives for unsupervised metric learning. Besides, we also assign autoencoders to each view to regularize the model and prevent model collapse. Then, we embed the representations with good similarities into the hyperbolic space and optimize the hyperbolic embeddings via the continuous relaxation of Dasgupta’s discrete objective for HC [Chami et al.2020]. Finally, the tree decoding algorithm helps decode the binary clustering tree from optimized hyperbolic embedding with low distortion.

To the best of our knowledge, the only relevant work on hierarchical clustering for multi-view data is the multi-view hierarchical clustering (MHC) model proposed by Zheng et al. [Zheng et al.2020]. MHC clusters multi-view data by alternating the cosine distance integration and the nearest neighbor agglomeration. Compared with MHC and naïve methods like multi-view concatenation followed by single-view hierarchical agglomerative clustering, our method enjoys several advantages. Firstly, compared with simply concatenating multiple views in naïve methods or the averaging operation in MHC, CMHHC incorporates a contrastive multi-view alignment process, which can better utilize the complementary and consensus information among different views to learn meaningful view-invariant features of instances. Secondly, compared with the shallow HC framework, deep representation learning and similarity learning are applied in our model to match complex real-world multi-view datasets and obtain more discriminative representations. Thirdly, compared with heuristic agglomerative clustering, our method is oriented to a gradient-based clustering process by optimizing multi-view representation learning and hyperbolic hierarchical clustering loss functions. These loss functions provide explicit evidence for measuring the quality of tree structure, which is crucial to achieving better HC performance.

The contributions of this work are summarized as follows:

  • To our knowledge, we propose the first deep neural network model for multi-view hierarchical clustering, which can capture the aligned and discriminative representations across multiple views and perform hierarchical clustering at diverse levels of granularity.

  • To learn representative and discriminative multi-view embeddings, we exploit a contrastive representation learning module (to align representations across multiple views) and an aligned feature similarity learning module (to consider the manifold similarity). These embeddings are helpful to the downstream task of similarity-based clustering.

  • We validate our framework with five multi-view datasets and demonstrate that CMHHC outperforms existing HC and MHC algorithms in terms of the Dendrogram Purity (DP) measurement.

Figure 1: Overview of CMHHC. Different autoencoders are assigned to different views. Contrastive learning layers align sample-level representations across multiple views. Similarity learning layers learn a better metric property over the aligned features. Then, HypHC layers optimize the tree-like embeddings in hyperbolic space. Finally, the optimal hyperbolic embeddings are decoded into discrete clustering trees.

2 Related Work

We review related work from three perspectives: hierarchical clustering, multi-view clustering, and hyperbolic geometry.

2.1 Hierarchical Clustering

Hierarchical clustering provides a recursive way of partitioning a dataset into successively finer clusters. Classical heuristics, like Single Linkage, Complete Linkage, Average Linkage, and Ward Linkage, are often the methods of choice for small datasets but may not scale well to large datasets as the running time scales cubically with the sample size [Kobren et al.2017]. Another problem with these heuristics is the lack of good objective functions, so there is no solid theoretical basis to support HC algorithms. To overcome the challenge, Dasgupta [Dasgupta2016] defined a proper discrete hierarchical clustering loss on all possible hierarchies. Recently, gradient-based HC has gained increasing research attention, like feature-based gHHC [Monath et al.2019], and similarity-based UFit [Chierchia and Perret2020] and HypHC [Chami et al.2020].

This paper deals with multi-view hierarchical clustering. The most relevant work to this paper is MHC [Zheng et al.2020], which performs the cosine distance integration and the nearest neighbor agglomeration alternately. However, this shallow HC framework ignores extracting meaningful and consistent information from multiple views, probably resulting in degenerated clustering performance.

2.2 Multi-view Clustering

Existing multi-view clustering (MVC) methods mainly focus on partitional clustering. Traditional multi-view clustering includes four types, i.e., multi-view subspace clustering [Li et al.2019a], multi-view spectral clustering [Kang et al.2020], multi-view matrix factorization-based clustering [Cai et al.2013], and canonical correlation analysis (CCA)-based clustering [Chaudhuri et al.2009]. However, many MVC methods do not meet the requirements for complex nonlinear situations. Thus, deep MVC approaches have attracted attention recently [Trosten et al.2021, Xu et al.2021, Xu et al.2022]. For example, Andrew et al. [Andrew et al.2013] and Wang et al. [Wang et al.2015] proposed deep versions of CCA, termed deep CCA and deep canonically correlated autoencoders respectively. Also, the End-to-end Adversarial-attention network for Multi-modal Clustering (EAMC) [Zhou and Shen2020] leveraged adversarial learning and an attention mechanism to achieve the separation and compactness of cluster structure.

Despite many works towards partitional clustering, there is minimal research towards multi-view hierarchical clustering, which we address in this paper.

2.3 Hyperbolic Geometry

Hyperbolic geometry is a non-Euclidean geometry with a constant negative curvature, which drops the parallel line postulate of the postulates of Euclidean geometry [Sala et al.2018]. Since the surface area lying in hyperbolic space grows exponentially with its radius, hyperbolic space can be seen as a continuous version of trees whose number of leaf nodes also increases exponentially with the depth [Monath et al.2019]. Hyperbolic geometry has been adopted to fertilize research related to tree structures recently [Chami et al.2020, Nickel and Kiela2017, Yan et al.2021].

3 Method

Given a set of $N$ data points $\{x_i^{1}, x_i^{2}, \dots, x_i^{V}\}_{i=1}^{N}$ including $K$ clusters and $V$ different views, where $x_i^{v} \in \mathbb{R}^{D_v}$ denotes the $i$-th $D_v$-dimensional instance from the $v$-th view, we aim to produce a nested series of partitions, where the more similar multi-view samples are grouped earlier and have lower lowest common ancestors (LCAs). To this end, we establish a multi-view hierarchical clustering framework named CMHHC. We first introduce the network architecture, and then we define the loss functions and introduce the optimization process.

3.1 Network Architecture

The proposed network architecture consists of a multi-view alignment learning module, an aligned feature similarity learning module, and a hyperbolic hierarchical clustering module, which is shown in Figure 1. We introduce the details of each component as follows.

3.1.1 Multi-view Alignment Learning

Data from different sources tend to contain complementary and consensus information, so how to extract comprehensive representations and suppress the view-specific noise is the major task in multi-view representation learning. Therefore, we design the multi-view alignment learning module to map the data of each view into a low-dimensional aligned space. First of all, we assign a deep autoencoder [Hinton and Salakhutdinov2006] to each view. The reconstruction from input to output not only helps each view keep the view-specific information to prevent model collapse, but also fertilizes subsequent similarity computing through representation learning. Specifically, considering the $v$-th view with instances $x_i^{v}$, the corresponding encoder is represented as $z_i^{v} = E^{v}(x_i^{v}; \theta_e^{v})$, and the decoder is represented as $\hat{x}_i^{v} = D^{v}(z_i^{v}; \theta_d^{v})$, where $\theta_e^{v}$ and $\theta_d^{v}$ denote the autoencoder network parameters, and $z_i^{v} \in \mathbb{R}^{D_z}$ and $\hat{x}_i^{v}$ denote the learned $D_z$-dimensional latent features and the reconstructed output, respectively.

For the still-detached multiple views, inspired by recent works with contrastive learning [Xu et al.2022], we propose to achieve the consistency of high-level semantic features across views in a contrastive way, i.e., corresponding instances of different views should be mapped closer, while different instances should be mapped more separately. To be specific, we define a cross-view positive pair as two views describing the same object, and a negative pair as two different objects from the same view or two arbitrary views. We encode the latent features to aligned features denoted as $h_i^{v} = f_{con}(z_i^{v}; \theta_c)$, where $\theta_c$ denotes the parameters of the contrastive learning encoder $f_{con}$, and $h_i^{v} \in \mathbb{R}^{D_h}$. Then, we compute the cosine similarity [Chen et al.2020] of two representations $h_i^{v_1}$ and $h_j^{v_2}$:

$$ d(h_i^{v_1}, h_j^{v_2}) = \frac{\langle h_i^{v_1}, h_j^{v_2} \rangle}{\lVert h_i^{v_1} \rVert \, \lVert h_j^{v_2} \rVert}. \tag{1} $$

To achieve consistency of all views, we expect the similarity scores of positive pairs $(h_i^{v_1}, h_i^{v_2})$ to be larger, and those of negative pairs $(h_i^{v_1}, h_j^{v_1})$ and $(h_i^{v_1}, h_j^{v_2})$ with $i \neq j$ to be smaller.
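The positive and negative pair definitions above can be illustrated with a small sketch. The toy feature vectors and the names `view_a` and `view_b` are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical aligned features: view_a[i] and view_b[i] describe the same object i.
view_a = [[1.0, 0.1], [0.0, 1.0]]
view_b = [[0.9, 0.2], [0.1, 0.8]]

# Positive pair: the same object seen from two views.
positive = cosine_similarity(view_a[0], view_b[0])
# Negative pairs: different objects, either across views or within one view.
negative_cross_view = cosine_similarity(view_a[0], view_b[1])
negative_same_view = cosine_similarity(view_a[0], view_a[1])
```

In a well-aligned feature space, `positive` should exceed both negative scores, which is what the contrastive objective encourages.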

3.1.2 Aligned Feature Similarity Learning

After the multi-view alignment learning, comprehensive representations of multi-view data are obtained, and view-specific noise is suppressed. However, the aligned hidden representations do not explicitly guarantee an appropriate similarity measurement that preserves desired distance structure between pairs of multi-view instances, which is crucial to similarity-based hierarchical clustering. Therefore, we devise an aligned feature similarity learning module to optimize the metric property of similarity used in hierarchy learning.

Intuitively, samples may be mapped to some manifold embedded in a high-dimensional Euclidean space, so it would be beneficial to utilize the distance on manifolds to improve the similarity measurement [Iscen et al.2018]. This module performs unsupervised metric learning by mining positive and negative pairs with both the manifold and Euclidean similarities. Similar to Iscen et al. [Iscen et al.2018], we measure the Euclidean similarity of any sample pair via the mapping function $w^{e}(h_i, h_j)$ and refer to $\mathrm{NN}^{e}_{k}(h_i)$ as the Euclidean $k$ nearest neighbors ($k$NN) set of $h_i$, where $h_i$ is the concatenated aligned representation of the $i$-th sample over all views.

In terms of the manifold similarity, the Euclidean affinity graph needs to be calculated in preparation. The elements of the affinity matrix $A = (a_{ij})$ are weighted as $a_{ij} = w^{e}(h_i, h_j)$ when $h_i$ and $h_j$ are both Euclidean $k$NN nodes of each other, or else $a_{ij} = 0$. Following the spirit of the advanced random walk model [Iscen et al.2017], we can get the convergence solution $r_i$ efficiently, where an element $r_i(j)$ denotes the “best” walker from the $i$-th node to the $j$-th node. Therefore, the manifold similarity function can be defined as $w^{m}(h_i, h_j) = r_i(j)$. Similarly, we denote $\mathrm{NN}^{m}_{k}(h_i)$ as the manifold $k$NN set of $h_i$.
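To make the graph construction concrete, the following is a minimal sketch of a mutual nearest-neighbor affinity graph followed by a random walk with restart. It is an assumption-laden illustration: the function names, the restart weight `alpha`, the symmetric normalization and the fixed-point iteration are generic choices for diffusion on graphs, not necessarily the solver used by the authors or by Iscen et al.

```python
import math

def mutual_knn_affinity(sim, k):
    """Keep sim[i][j] only when i and j are among each other's k nearest neighbors."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(ranked[:k]))
    return [[sim[i][j] if (j in neighbors[i] and i in neighbors[j]) else 0.0
             for j in range(n)] for i in range(n)]

def manifold_similarity(affinity, source, alpha=0.9, iters=200):
    """Random walk with restart from `source` on the symmetrically normalized graph."""
    n = len(affinity)
    degree = [sum(row) for row in affinity]
    norm = [[affinity[i][j] / math.sqrt(degree[i] * degree[j])
             if degree[i] > 0 and degree[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]
    restart = [1.0 if i == source else 0.0 for i in range(n)]
    r = restart[:]
    for _ in range(iters):
        # Fixed-point iteration r <- alpha * S r + (1 - alpha) * y.
        r = [alpha * sum(norm[i][j] * r[j] for j in range(n)) + (1 - alpha) * restart[i]
             for i in range(n)]
    return r

# Four hypothetical points lying along a curve: 0 - 1 - 2 - 3.
sim = [[1.0, 0.9, 0.5, 0.1],
       [0.9, 1.0, 0.9, 0.5],
       [0.5, 0.9, 1.0, 0.9],
       [0.1, 0.5, 0.9, 1.0]]
affinity = mutual_knn_affinity(sim, k=2)
r0 = manifold_similarity(affinity, source=0)
```

In this toy graph, points 0 and 2 share no direct edge, yet the walk started at point 0 still assigns point 2 a positive score through point 1. This is the sense in which manifold similarity can differ from Euclidean similarity.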

To this end, we try to consider every data point in the dataset as an anchor in turn so that the hard mining strategy remains feasible under the premise of acceptable computability. Given an anchor $h_i$ from $H$, we select the $k_{pos}$ nearest feature vectors on manifolds, which are not that close in the Euclidean space, as good positives. By the Euclidean and manifold nearest-neighbor sets $\mathrm{NN}^{e}_{k_{pos}}(h_i)$ and $\mathrm{NN}^{m}_{k_{pos}}(h_i)$, the hard positive set is descendingly sorted by the manifold similarity as:

$$ S_{pos}(h_i) = \mathrm{NN}^{m}_{k_{pos}}(h_i) \setminus \mathrm{NN}^{e}_{k_{pos}}(h_i), \tag{2} $$

where $k_{pos}$ decides how hard the selected positives are, which is a completely detached value from $k$. However, to keep gained pseudo-label information with little noise, good negatives are expected to be not only relatively far from the anchor on manifolds but also in the Euclidean space. So the hard negative set is in the descending order according to the Euclidean similarity, denoted as:

$$ S_{neg}(h_i) = H \setminus \big( \mathrm{NN}^{e}_{k_{neg}}(h_i) \cup \mathrm{NN}^{m}_{k_{neg}}(h_i) \big), \tag{3} $$

where $H$ is the set of all feature vectors, and $k_{neg}$ is the value of the nearest neighbors for negatives, separated from $k_{pos}$. Intuitively, a larger value of $k_{pos}$ leads to harder positives for considering those with relatively lower confidence, and tolerating the intra-cluster variability. Similarly, a smaller value of $k_{neg}$ means harder negatives for distinguishing the anchors from more easily-confused negatives.
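The mining rule described above can be sketched in a few lines. This is an illustrative reading of the text, assuming that positives are manifold neighbors that are not Euclidean neighbors and that negatives lie outside both neighborhoods; the function names and the toy similarity rows are hypothetical.

```python
def top_k(similarity_row, anchor, k):
    """Indices of the k items most similar to the anchor, excluding the anchor."""
    candidates = [j for j in range(len(similarity_row)) if j != anchor]
    candidates.sort(key=lambda j: similarity_row[j], reverse=True)
    return candidates[:k]

def mine_hard_sets(euclid_row, manifold_row, anchor, k_pos, k_neg):
    """Return (hard positives, hard negatives) for one anchor."""
    # Positives: close on the manifold but not among the Euclidean neighbors.
    euclid_near = set(top_k(euclid_row, anchor, k_pos))
    positives = [j for j in top_k(manifold_row, anchor, k_pos)
                 if j not in euclid_near]  # already sorted by manifold similarity
    # Negatives: outside both the Euclidean and the manifold neighborhoods.
    too_close = (set(top_k(euclid_row, anchor, k_neg))
                 | set(top_k(manifold_row, anchor, k_neg)))
    negatives = [j for j in range(len(euclid_row))
                 if j != anchor and j not in too_close]
    negatives.sort(key=lambda j: euclid_row[j], reverse=True)
    return positives, negatives

# Hypothetical similarities of every point to anchor 0.
euclid_row = [1.0, 0.9, 0.2, 0.5, 0.1]
manifold_row = [1.0, 0.8, 0.7, 0.3, 0.05]
positives, negatives = mine_hard_sets(euclid_row, manifold_row,
                                      anchor=0, k_pos=2, k_neg=2)
```

Here point 2 is a hard positive because it is near the anchor on the manifold but not in Euclidean space, while point 4 is the only candidate far from the anchor under both measures.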

After obtaining hard tuples as the supervision of similarity learning for multi-view input, we are able to get clustering-friendly embeddings on top of the aligned feature space via an encoder $e_i = f_{sim}(h_i; \theta_s)$, where $e_i$ is the discriminative embedding of $D_e$ dimensions and $\theta_s$ is the parameter of the encoder $f_{sim}$.

3.1.3 Hyperbolic Hierarchical Clustering

To reach the goal of hierarchical clustering, we adopt the continuous optimizing process of Hyperbolic Hierarchical Clustering [Chami et al.2020] as the guidance of this module, which is stacked on the learned common-view embeddings. By means of an embedder $f_{hc}$ parameterized by $\theta_{hc}$, we embed the learned similarity graph from common-view space to a specific hyperbolic space, i.e., the Poincaré model $\mathbb{B}^{n} = \{ z \in \mathbb{R}^{n} : \lVert z \rVert_2 < 1 \}$ whose curvature is constant $-1$. The distance of the geodesic between any two points $z_i, z_j \in \mathbb{B}^{n}$ in hyperbolic space is $d_{\mathbb{B}}(z_i, z_j) = \cosh^{-1}\!\left( 1 + 2 \frac{\lVert z_i - z_j \rVert_2^2}{(1 - \lVert z_i \rVert_2^2)(1 - \lVert z_j \rVert_2^2)} \right)$. With this prior knowledge, we can find the optimal hyperbolic embeddings $Z^{*}$ pushed to the boundary of the ball via a relaxed form of improved Dasgupta’s cost function. Moreover, the hyperbolic LCA of two hyperbolic embeddings $z_i$ and $z_j$ is the point on their geodesic nearest to the origin $o$ of the ball. It can be represented as $z_i \vee z_j := \arg\min_{z \in z_i \rightsquigarrow z_j} d_{\mathbb{B}}(o, z)$. The LCA of two hyperbolic embeddings is an analogy to that of two leaf nodes $t_i$ and $t_j$ in a discrete tree, where the tree-like LCA is the node closest to the root node on the two points’ shortest path [Chami et al.2020]. Based on this, the best hyperbolic embeddings can be decoded back to the original tree via getting merged iteratively along the direction of the ball radius from boundary to origin.
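As a small illustration of the geometry, the sketch below evaluates the standard geodesic distance of the Poincaré ball with curvature $-1$. The function name is hypothetical, and the code is a plain-Python illustration rather than the Riemannian optimization code used for training.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_x = sum(a * a for a in x)
    sq_norm_y = sum(b * b for b in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_x) * (1.0 - sq_norm_y)))

# The same Euclidean gap costs much more hyperbolic distance near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
```

Distances blow up near the boundary of the ball, which is why leaf-like embeddings pushed outward can be well separated while their geodesics still pass close to the origin, mimicking paths through a tree root.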

3.2 Loss Functions and Optimization Process

This section introduces the loss functions for CMHHC, including multi-view representation learning loss and hierarchical clustering loss, and discusses the optimization process.

3.2.1 Multi-view Representation Learning Loss

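In our model, we jointly align multi-view representations and learn the reliable similarities in common-view space by integrating the autoencoder reconstruction loss $\mathcal{L}_{r}$, the multi-view contrastive loss $\mathcal{L}_{c}$ and the positive weighted triplet metric loss $\mathcal{L}_{m}$, so the objective for multi-view representation learning loss is defined as:

$$ \mathcal{L}_{mv} = \mathcal{L}_{r} + \mathcal{L}_{c} + \mathcal{L}_{m}. \tag{4} $$

First of all, $\mathcal{L}_{r}^{v}$ is the $v$-th view loss function of reconstructing $x_i^{v}$, so the complete reconstruction loss for all views is:

$$ \mathcal{L}_{r} = \sum_{v=1}^{V} \mathcal{L}_{r}^{v}. \tag{5} $$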
As for the second term, assuming that alignment between every two views ensures alignment among all views, we introduce total multi-view mode of contrastive loss as follows:


where the contrastive loss function between reference view  and contrast view  is:



is the temperature hyperparameter for multi-view contrastive loss.
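For intuition, a temperature-scaled contrastive loss between a reference view and a contrast view can be sketched as follows. This is an InfoNCE-style assumption, not a transcription of the paper's Eq. (7): the function names, the default temperature `tau=0.5` and the exact set of negatives in the denominator are illustrative choices that follow the positive and negative pair definitions of Section 3.1.1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cross_view_contrastive_loss(ref, con, tau=0.5):
    """InfoNCE-style loss between a reference view and a contrast view.

    ref[i] and con[i] are features of the same object (positive pair);
    every other object, in either view, acts as a negative.
    """
    n = len(ref)
    total = 0.0
    for i in range(n):
        pos = math.exp(cosine(ref[i], con[i]) / tau)
        neg = sum(math.exp(cosine(ref[i], con[j]) / tau) for j in range(n) if j != i)
        neg += sum(math.exp(cosine(ref[i], ref[j]) / tau) for j in range(n) if j != i)
        total += -math.log(pos / (pos + neg))
    return total / n

# Toy check: aligned views should incur a lower loss than swapped views.
ref = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = cross_view_contrastive_loss(ref, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = cross_view_contrastive_loss(ref, [[0.0, 1.0], [1.0, 0.0]])
```

A total multi-view loss in the spirit of the text would sum such a term over pairs of views; a smaller temperature sharpens the penalty on negatives that are close to the anchor.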

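The third term is for similarity learning. Each hard tuple includes an anchor $h_a$, a hard positive sample $h_{pos}$ and a hard negative sample $h_{neg}$. Both the single $h_{pos}$ and the single $h_{neg}$ are randomly selected from the corresponding sets. Then we calculate the embeddings of the tuple, $e_a$, $e_{pos}$ and $e_{neg}$, so that we can measure the similarities of the embeddings via the weighted triplet loss:

$$ \mathcal{L}_{m} = w^{m}(h_a, h_{pos}) \left[ m + \lVert e_a - e_{pos} \rVert_2^2 - \lVert e_a - e_{neg} \rVert_2^2 \right]_{+}, \tag{8} $$

where $m$ is the margin, $[\cdot]_{+} = \max(\cdot, 0)$, and the manifold similarity $w^{m}(h_a, h_{pos})$ between the anchor and the positive represents the degree of contribution of every tuple. Weighting the standard triplet loss by the similarity between the anchor and the positive on manifolds relieves the pressure from the tuples with too hard positives.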

It is worth mentioning that the multi-view representation learning loss in Eq. (4) is optimized by mini-batch training so that the method can scale up to large multi-view datasets.

3.2.2 Hierarchical Clustering Loss

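A “good” hierarchical tree means that more similar data instances should be merged earlier. Dasgupta [Dasgupta2016] first proposed an explicit hierarchical clustering loss function:

$$ C_{\mathrm{Dasgupta}}(T; w) = \sum_{ij} w_{ij} \, \lvert \mathrm{leaves}(T[i \vee j]) \rvert, \tag{9} $$

where $w_{ij}$ is the Euclidean similarity between $i$ and $j$, $T[i \vee j]$ is the subtree rooted at the LCA of the $i$ and $j$ nodes, and $\mathrm{leaves}(T[i \vee j])$ denotes the set of descendant leaves of internal node $i \vee j$.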
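A short sketch shows how this cost rewards merging similar points early. It assumes a binary tree encoded as nested tuples with integer leaves and sums over unordered pairs; both the encoding and the toy similarity matrix are illustrative choices.

```python
def leaves(tree):
    """List the leaf indices under a node of a nested-tuple binary tree."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def dasgupta_cost(tree, w):
    """Sum over unordered pairs of w[i][j] times the leaf count under their LCA."""
    if isinstance(tree, int):
        return 0.0
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    # Pairs split by this node have this node as their LCA.
    size = len(left_leaves) + len(right_leaves)
    cross = sum(w[i][j] for i in left_leaves for j in right_leaves)
    return size * cross + dasgupta_cost(left, w) + dasgupta_cost(right, w)

# Points 0,1 are similar to each other, as are points 2,3.
w = [[0.0, 1.0, 0.1, 0.1],
     [1.0, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 1.0],
     [0.1, 0.1, 1.0, 0.0]]
good_cost = dasgupta_cost(((0, 1), (2, 3)), w)
bad_cost = dasgupta_cost(((0, 2), (1, 3)), w)
```

The tree that merges the similar pairs first places their LCAs low in the hierarchy, so highly weighted pairs are multiplied by small subtree sizes and the total cost is lower.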

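Here, we adopt the differentiable relaxation of constrained Dasgupta’s objective [Chami et al.2020]. Our pairwise similarity graph is learned from the multi-view representation learning process so that the unified similarity measurement narrows the gap between representation learning and clustering. The hyperbolic hierarchical clustering loss for our model is defined as:

$$ C_{\mathrm{HypHC}}(Z; w, \tau_h) = \sum_{ijk} \big( w_{ij} + w_{ik} + w_{jk} - w_{\mathrm{HypHC}, ijk}(Z; w, \tau_h) \big) + 2 \sum_{ij} w_{ij}, \tag{10} $$

where $Z = \{z_i\}$ denotes the hyperbolic embeddings and $w_{ij}$ the learned pairwise similarities. We compute the hierarchy of any embedding triplet through the similarities of all embedding pairs among three samples:

$$ w_{\mathrm{HypHC}, ijk}(Z; w, \tau_h) = (w_{ij}, w_{ik}, w_{jk}) \cdot \sigma_{\tau_h}\big( d_o(z_i \vee z_j), d_o(z_i \vee z_k), d_o(z_j \vee z_k) \big)^{\top}, \tag{11} $$

where $d_o(z_i \vee z_j)$ is the hyperbolic distance from the origin to the hyperbolic LCA of $z_i$ and $z_j$, and $\sigma_{\tau_h}(\cdot)$ is the softmax function $\sigma_{\tau_h}(d)_i = e^{d_i / \tau_h} / \sum_j e^{d_j / \tau_h}$ scaled by the temperature parameter $\tau_h$ for the hyperbolic hierarchical clustering loss. Eq. (10) theoretically asks for similarities among all tuples of the dataset, which takes a high time complexity of $O(N^3)$. However, we could sample $N^2$-order triplets by obtaining all possible node pairs and then choosing the third node randomly from the rest [Chami et al.2020]. With mini-batch training and sampling tricks, HC can scale to large datasets with acceptable computation complexity.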
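The sampling trick mentioned above can be sketched as follows. The function name and the rejection-sampling loop are illustrative assumptions; the HypHC reference implementation may organize sampling differently.

```python
import itertools
import random

def sample_triplets(n, seed=0):
    """Pair every two nodes with one random third node: O(n^2) triplets."""
    if n < 3:
        raise ValueError("need at least three nodes to form a triplet")
    rng = random.Random(seed)
    triplets = []
    for i, j in itertools.combinations(range(n), 2):
        k = rng.randrange(n)
        while k == i or k == j:  # redraw until the third node is distinct
            k = rng.randrange(n)
        triplets.append((i, j, k))
    return triplets

triplets = sample_triplets(6)
```

Every node pair still appears in at least one triplet, so each pairwise similarity contributes to the loss, while the number of evaluated triplets grows quadratically rather than cubically with the number of nodes.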

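Then the optimal hyperbolic embeddings, which minimize the hyperbolic hierarchical clustering loss of Eq. (10) for the learned similarities $w$ and temperature $\tau_h$, are denoted as:

$$ Z^{*} = \arg\min_{Z} C_{\mathrm{HypHC}}(Z; w, \tau_h). \tag{12} $$

Finally, the nature of negative curvature and bent tree-like geodesics in hyperbolic space allows us to decode the binary tree $T$ in the original space by grouping two similar embeddings whose hyperbolic LCA is the farthest to the origin:

$$ T = \mathrm{dec}(Z^{*}), \tag{13} $$

where $\mathrm{dec}(\cdot)$ is the decoding function [Chami et al.2020].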

3.2.3 Optimization Process

The optimization process is summarized in the Appendix. The entire training process of our framework has two steps: (A) Optimizing the multi-view representation learning loss with Eq. (4), and (B) Optimizing the hyperbolic hierarchical clustering loss with Eq. (10).

4 Experiments

4.1 Experimental Setup

4.1.1 Datasets

We conduct our experiments on the following five real-world multi-view datasets.

  • MNIST-USPS [Peng et al.2019] is a two-view dataset with 5000 handwritten digit (0-9) images. The MNIST view is $28 \times 28$ in size, and the USPS view is $16 \times 16$ in size.

  • BDGP [Li et al.2019b] contains 2500 images of Drosophila embryos divided into five categories with two extracted features. One view is 1750-dim visual features, and the other view is 79-dim textual features.

  • Caltech101-7 [Dueck and Frey2007] is established with 5 diverse feature descriptors, including 40-dim wavelet moments (WM), 254-dim CENTRIST, 1,984-dim HOG, 512-dim GIST, and 928-dim LBP features, with 1400 RGB images sampled from 7 categories.

  • COIL-20 contains object images of 20 categories. Following Trosten et al. [Trosten et al.2021], we establish a variant of COIL-20, where 480 grayscale images of pixel size are depicted from 3 different random angles.

  • Multi-Fashion [Xu et al.2022] is a three-view dataset with 10,000 images of different fashionable designs, where different views of each sample are different products from the same category.

4.1.2 Baseline Methods

We demonstrate the effects of our CMHHC by comparing with the following three kinds of hierarchical clustering methods. Note that for single-view methods, we concatenate all views into a single view to provide complete information [Peng et al.2019].

Firstly, we compare CMHHC with conventional linkage-based discrete single-view hierarchical agglomerative clustering (HAC) methods, including Single-linkage, Complete-linkage, Average-linkage, and Ward-linkage algorithms. Secondly, we compare CMHHC with the most representative similarity-based continuous single-view hierarchical clustering methods, i.e., UFit [Chierchia and Perret2020], and HypHC [Chami et al.2020]. For UFit, we adopt the best loss function “Closest+Size”. As for HypHC, we directly use the continuous Dasgupta’s objective. We utilized the corresponding open-source versions for both of the methods and followed the default parameters in the codes provided by the authors. Lastly, to the best of our knowledge, MHC [Zheng et al.2020] is the only proposed multi-view hierarchical clustering method. We implemented it with Python as there was no open-source implementation available.

4.1.3 Hierarchical Clustering Metrics

Unlike partitional clustering, binary clustering trees, i.e., the results of hierarchical clustering, can provide diverse-granularity cluster information when the trees are truncated at different altitudes. Hence, general clustering metrics, such as clustering accuracy (ACC) and Normalized Mutual Information (NMI), are not able to display the characteristics of clustering hierarchies comprehensively. To this end, following previous literature [Kobren et al.2017, Monath et al.2019], we validate hierarchical clustering performance via the Dendrogram Purity (DP) measurement. To sum up, the DP measurement is equivalent to the average purity over the leaf nodes of the LCA of all available data point pairs in the same ground-truth clusters. A clustering tree with a higher DP value contains purer subtrees and keeps a more consistent structure with ground-truth flat partitions. A detailed explanation of DP is in the appendix.
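The DP measurement can be computed directly on small trees, as in the brute-force sketch below. It assumes a nested-tuple binary tree with integer leaves and a list of ground-truth labels; the encoding and function name are illustrative, and practical implementations usually sample pairs rather than enumerate them.

```python
def dendrogram_purity(tree, labels):
    """Average, over same-class leaf pairs, of the class purity under their LCA."""
    total, count = 0.0, 0

    def visit(node):
        nonlocal total, count
        if isinstance(node, int):
            return [node]
        left, right = visit(node[0]), visit(node[1])
        members = left + right
        # Pairs split by this node have this node as their LCA.
        for i in left:
            for j in right:
                if labels[i] == labels[j]:
                    same = sum(1 for m in members if labels[m] == labels[i])
                    total += same / len(members)
                    count += 1
        return members

    visit(tree)
    return total / count if count else 0.0

labels = [0, 0, 1, 1]
pure_tree_dp = dendrogram_purity(((0, 1), (2, 3)), labels)
mixed_tree_dp = dendrogram_purity(((0, 2), (1, 3)), labels)
```

A tree that groups each class under its own subtree reaches a DP of 1.0, while a tree that only joins same-class points at the root scores the class proportion at the root, here 0.5.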

4.1.4 Implementation Details

Our entire CMHHC model is implemented with PyTorch. We first pretrain autoencoders for 200 epochs and contrastive learning encoder for 10, 50, 50, 50 and 100 epochs on BDGP, MNIST-USPS, Caltech101-7, COIL-20, and Multi-Fashion respectively, and then finetune the whole multi-view representation learning process for 50 epochs, and finally train the hyperbolic hierarchical clustering loss for 50 epochs. The batch size is set to 256 for representation learning and 512 for hierarchical clustering, using the Adam and hyperbolic-matched Riemannian optimizer [Kochurov et al.2020] respectively. The learning rate is set to  for Adam, and a search over  for Riemannian Adam of different datasets. We empirically set  for all datasets, while  for BDGP, MNIST-USPS, and Multi-Fashion and for Caltech101-7 and COIL-20. We run the model 5 times and report the results with the lowest value of . In addition, we create an adjacency graph with 50 Euclidean nearest neighbors to compute manifold similarities. We make the general rule that the  value equals and the  value equals , making the selected tuples hard and reliable. More detailed parameter setting can be found in the Appendix.

4.2 Experimental Results

4.2.1 Performance Comparison with Baselines

Experimental DP results are reported in Table 1. The results illustrate that our unified CMHHC model outperforms the compared baseline methods. As it shows, our model gains a significant absolute growth in DP of 2.39% on BDGP, 14.11% on MNIST-USPS, 21.30% on Caltech101-7, 4.08% on COIL-20 and 23.92% on Multi-Fashion over the second-best method. The underlying reason is that CMHHC captures much more meaningful multi-view aligned embeddings instead of concatenating all views roughly without making full use of the complementary and consensus information. Our deep model greatly exceeds the level of the only multi-view hierarchical clustering work MHC, especially on Caltech101-7, COIL-20, and large-scale Multi-Fashion. This result can be attributed to the alignment and discrimination of the multi-view similarity graph learning for hyperbolic hierarchical clustering. Additionally, the performance gap between our model and deep continuous UFit and HypHC reflects the limitations of fixing input graphs without an effective similarity learning process.

Method | MNIST-USPS | BDGP | Caltech101-7 | COIL-20 | Multi-Fashion
HAC (Single-linkage) | 29.81% | 61.88% | 23.67% | 72.56% | 27.89%
HAC (Complete-linkage) | 54.36% | 56.57% | 30.19% | 69.95% | 48.72%
HAC (Average-linkage) | 69.67% | 45.91% | 30.90% | 73.14% | 65.70%
HAC (Ward-linkage) | 80.38% | 58.61% | 35.69% | 80.81% | 72.33%
UFit | 21.67% | 69.20% | 19.00% | 55.41% | 25.94%
HypHC | 32.99% | 31.21% | 22.46% | 28.50% | 25.65%
MHC | 78.27% | 89.14% | 45.22% | 66.50% | 54.81%
CMHHC (Ours) | 94.49% | 91.53% | 66.52% | 84.89% | 96.25%
Table 1: Dendrogram Purity (DP) results of baselines and CMHHC. We concatenate the features of multi-view data for single-view baselines, including HAC, UFit, and HypHC.

4.2.2 Ablation Study

We conduct an ablation study to evaluate the effects of the components in the multi-view representation learning module. To be specific, we refer to the CMHHC without autoencoders for reconstruction of multiple views as CMHHC (w/o AE), without the contrastive learning submodel as CMHHC (w/o Con), and without similarity learning as CMHHC (w/o Sim). We train CMHHC (w/o AE), CMHHC (w/o Con) and CMHHC (w/o Sim) after removing the corresponding network layers. Table 2 shows the experimental results on 5 datasets. The results show that the proposed components contribute to the final hierarchical clustering performance in almost all cases.

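Ablation | MNIST-USPS | BDGP | Caltech101-7 | COIL-20 | Multi-Fashion
CMHHC | 94.49% | 91.53% | 66.52% | 84.89% | 96.25%
CMHHC (w/o AE, without autoencoders) | 92.92% | 86.19% | 68.50% | 27.16% | 89.06%
CMHHC (w/o Con, without contrastive learning) | 43.10% | 24.51% | 33.02% | 14.73% | 43.57%
CMHHC (w/o Sim, without similarity learning) | 89.78% | 90.10% | 19.74% | 53.40% | 44.65%
Table 2: Ablation study results.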

4.2.3 Case Study

Figure 2: Visualization of a truncated subtree from the decoded tree on the MNIST-USPS dataset. The right part represents the sampled subtree structure of LCA #, where MNIST images are framed in blue, and USPS images are framed in orange. We observe that more similar pairs will have lower LCAs. For example, images belonging to the same category (like # and # of digit ) are grouped together first, i.e., have the lowest LCA, while less similar images (like # of digit and # of digit ) are merged at the highest LCA in the subtree.

We qualitatively evaluate a truncated subtree structure learned via our method for the hierarchical tree. We plot the sampled MNIST-USPS subtrees of the final clustering tree in Figure 2. As shown, the similarity between two nodes is getting more substantial from the root to the leaves, indicating that the hierarchical tree can reveal fine-grained similarity relationships and ground-truth flat partitions for the multi-view data.

5 Conclusion

This paper proposed a novel multi-view hierarchical clustering framework based on deep neural networks. Employing multiple autoencoders, contrastive multi-view alignment learning, and unsupervised similarity learning, we capture the invariance information across views and learn the meaningful metric property for similarity-based continuous hierarchical clustering. Our method aims at providing a clustering tree with high interpretability oriented towards multi-view data, highlighting the importance of representations’ alignment and discrimination, and indicating the potential of gradient-based hyperbolic hierarchical clustering. Extensive experiments illustrate CMHHC is capable of clustering multi-view data at diverse levels of granularity.


This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204), and a key program of fundamental research from Shenzhen Science and Technology Innovation Commission (No. JCYJ20200109113403826).


  • [Andrew et al.2013] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.
  • [Cai et al.2013] Xiao Cai, Feiping Nie, and Heng Huang. Multi-view k-means clustering on big data. In IJCAI, pages 2598–2604, 2013.
  • [Chami et al.2020] Ines Chami, Albert Gu, Vaggos Chatziafratis, and Christopher Ré. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. In NeurIPS, pages 15065–15076, 2020.
  • [Chaudhuri et al.2009] Kamalika Chaudhuri, Sham M Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–136, 2009.
  • [Chen et al.2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
  • [Chierchia and Perret2020] Giovanni Chierchia and Benjamin Perret. Ultrametric fitting by gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124004, 2020.
  • [Dasgupta2016] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In STOC, pages 118–127, 2016.
  • [Dueck and Frey2007] Delbert Dueck and Brendan J Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, pages 1–8, 2007.
  • [Hinton and Salakhutdinov2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [Iscen et al.2017] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondrej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations. In CVPR, pages 2077–2086, 2017.
  • [Iscen et al.2018] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Mining on manifolds: Metric learning without labels. In CVPR, pages 7642–7651, 2018.
  • [Jain et al.1999] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
  • [Kang et al.2020] Zhao Kang, Guoxin Shi, Shudong Huang, Wenyu Chen, Xiaorong Pu, Joey Tianyi Zhou, and Zenglin Xu. Multi-graph fusion for multi-view spectral clustering. Knowledge-Based Systems, 189:105102, 2020.
  • [Kobren et al.2017] Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. A hierarchical algorithm for extreme clustering. In SIGKDD, pages 255–264, 2017.
  • [Kochurov et al.2020] Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in PyTorch. arXiv preprint arXiv:2005.02819, 2020.
  • [Li et al.2018] Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation learning. TKDE, 31(10):1863–1883, 2018.
  • [Li et al.2019a] Ruihuang Li, Changqing Zhang, Huazhu Fu, Xi Peng, Tianyi Zhou, and Qinghua Hu. Reciprocal multi-layer subspace learning for multi-view clustering. In ICCV, pages 8172–8180, 2019.
  • [Li et al.2019b] Zhaoyang Li, Qianqian Wang, Zhiqiang Tao, Quanxue Gao, and Zhaohua Yang. Deep adversarial multi-view clustering network. In IJCAI, pages 2952–2958, 2019.
  • [Monath et al.2019] Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In SIGKDD, pages 714–722, 2019.
  • [Nickel and Kiela2017] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, pages 6338–6347, 2017.
  • [Peng et al.2019] Xi Peng, Zhenyu Huang, Jiancheng Lv, Hongyuan Zhu, and Joey Tianyi Zhou. Comic: Multi-view clustering without parameter selection. In ICML, pages 5092–5101, 2019.
  • [Sala et al.2018] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. Representation tradeoffs for hyperbolic embeddings. In ICML, pages 4460–4469, 2018.
  • [Trosten et al.2021] Daniel J Trosten, Sigurd Lokse, Robert Jenssen, and Michael Kampffmeyer. Reconsidering representation alignment for multi-view clustering. In CVPR, pages 1255–1265, 2021.
  • [Wang et al.2015] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
  • [Xu et al.2021] Jie Xu, Yazhou Ren, Huayi Tang, Xiaorong Pu, Xiaofeng Zhu, Ming Zeng, and Lifang He. Multi-VAE: Learning disentangled view-common and view-peculiar visual representations for multi-view clustering. In ICCV, pages 9234–9243, 2021.
  • [Xu et al.2022] Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. Multi-level feature learning for contrastive multi-view clustering. In CVPR, 2022.
  • [Yan et al.2021] Jiexi Yan, Lei Luo, Cheng Deng, and Heng Huang. Unsupervised hyperbolic metric learning. In CVPR, pages 12465–12474, 2021.
  • [Zheng et al.2020] Qinghai Zheng, Jihua Zhu, and Shuangxun Ma. Multi-view hierarchical clustering. arXiv preprint arXiv:2010.07573, 2020.
  • [Zhou and Shen2020] Runwu Zhou and Yi-Dong Shen. End-to-end adversarial-attention network for multi-modal clustering. In CVPR, pages 14619–14628, 2020.

Appendix A Algorithm Pseudocode for CMHHC

Algorithm 1 presents the step-by-step procedure of the proposed CMHHC.

Input: Multi-view dataset ;

Hyperparameters , , , and ;
Output: Final binary clustering tree.

1:  Initialization: Initialize the network parameters , , and .
2:  Pretrain: Update by Eq. (5).
3:      Update by Eq. (6).
4:  Perform the hard mining strategy.
5:  Step(A): To learn .
6:      Update  by Eq. (4).
7:  Step(B): To learn .
8:      Update by Eq. (10).
9:  Decode the binary clustering tree from the optimized hyperbolic embeddings by Eq. (13).
Algorithm 1 Pseudocode to optimize our CMHHC

Appendix B Summarization of Notations

Notation Definition
The set of training examples.
The data samples in the -th view.
The latent features of -th view.
The reconstructed samples in the -th view.
The aligned representations of -th view.
The concatenate aligned representations among all views.
The discriminative embeddings among all views.
The hyperbolic embeddings in Poincaré Model.
The optimal hyperbolic embeddings.
Binary HC decoding tree.
The -th data sample in the -th view.
The -th latent feature in the -th view.
The -th reconstructed sample in the -th view.
The -th aligned representation in the -th view.
The -th concatenate aligned representations.
Hard positive representation for the -th anchor representation.
Hard negative representation for the -th anchor representation.
The -th discriminative embedding from multiple views.
Hard positive embedding for the -th anchor embedding.
Hard negative embedding for the -th anchor embedding.
The -th hyperbolic embeddings.
The number of data instances (leaf nodes).
The number of clusters.
The number of views.
The number of clusters.
The dimensionality of the -th view.
The dimensionality of latent space.
The dimensionality of aligned representation space.
The dimensionality of discriminative embedding space.
Cosine distance of the -th in the -th view and the -th in the -th view in common-view space.
Affinity matrix of concatenate aligned representations.
Adjacency of the -th and -th representations.
The solution to random walk model of -th sample.
The best walker from the -th node to the -th node.
Table 3: Notation used in CMHHC

We summarize the notations used in the paper in Table 3.

Appendix C Experimental Details

We introduce more experimental details in this section. All the experiments are conducted on a Linux Server with TITAN Xp (10G) GPU and Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

C.1 Datasets

We conduct our experiments on the following multi-view datasets. MNIST-USPS [Peng et al.2019] consists of 5000 handwritten digit (0-9) images in 10 categories, with 500 images per category. The MNIST view is sampled randomly from the MNIST dataset, and the USPS view is sampled randomly from the USPS dataset. For convenience, we zero-pad the USPS view so that its dimensions match those of the MNIST view. BDGP [Li et al.2019b] contains 2500 images of Drosophila embryos from 5 categories, each with 500 samples. One view consists of 1750-dimensional visual features and the other of 79-dimensional textual features. Caltech101-7 [Dueck and Frey2007] includes 5 visual feature descriptors, i.e., the 40-dim wavelet moments (WM) feature, 254-dim CENTRIST feature, 1,984-dim HOG feature, 512-dim GIST feature, and 928-dim LBP feature. Its 1400 RGB images are sampled from 7 categories, with 200 images per category. For COIL-20, following [Trosten et al.2021], we establish a variant with 480 object images divided into 20 classes of 24 images each. Every object is captured in 3 different poses, one per view. As a large-scale multi-view dataset, we use Multi-Fashion [Xu et al.2022], which contains 10 types of clothes, e.g., pullover, shirt, and coat, to verify the scalability of the proposed CMHHC. In Multi-Fashion, the different views of each instance are different fashion designs of the same category, and there are 10000 grey images in each view.
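The zero-padding step for the USPS view can be sketched as follows. This is only an illustration: the image sizes (16x16 for USPS, 28x28 for MNIST) are the standard sizes of those datasets and are assumed here, the helper name `zero_pad_to` is hypothetical, and centring the image inside the padded frame is an assumption, since the text does not say where the zeros are placed.

```python
import numpy as np

def zero_pad_to(images: np.ndarray, target_hw: tuple) -> np.ndarray:
    """Zero-pad a batch of images (N, H, W) to (N, target_H, target_W).

    The original image is centred in the padded frame (an assumption).
    """
    n, h, w = images.shape
    th, tw = target_hw
    if h > th or w > tw:
        raise ValueError("target size must be at least the input size")
    top = (th - h) // 2
    left = (tw - w) // 2
    pad = ((0, 0), (top, th - h - top), (left, tw - w - left))
    return np.pad(images, pad, mode="constant", constant_values=0)

# Assumed sizes: 16x16 for the USPS view, 28x28 for the MNIST view.
usps = np.ones((4, 16, 16), dtype=np.float32)
usps_padded = zero_pad_to(usps, (28, 28))
# Flatten so that both views have the same input dimensionality.
usps_flat = usps_padded.reshape(len(usps_padded), -1)
```

After padding, both views can be fed to autoencoders with identical input dimensionality.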

For all five datasets, we use all samples and run our model on all views. The statistics of the experimental datasets are summarized in Table 4.

Dataset MNIST-USPS BDGP Caltech101-7 COIL-20 Multi-Fashion
# Samples 5000 2500 1400 480 10000
# Categories 10 5 7 20 10
# views 2 2 5 3 3
Table 4: The statistics of the tested datasets.

C.2 Implementation Details

C.2.1 CMHHC

Our entire model is implemented in PyTorch. To represent the hierarchical structure efficiently, we use the SciPy, networkx, and ETE (Environment for Tree Exploration) Python toolkits. To speed up the convergence of the whole model, we first pretrain the autoencoders for 200 epochs on all datasets and the contrastive learning module for 10, 50, 50, 50 and 100 epochs on BDGP, MNIST-USPS, Caltech101-7, COIL-20 and Multi-Fashion, respectively. Next, we fine-tune the whole multi-view representation learning process for 50 epochs. Finally, HypHC is trained for 50 epochs. The batch size is set to and for multi-view representation learning and hierarchical clustering, using the Adam and hyperbolic-matched Riemannian optimizers [Kochurov et al.2020] respectively. The learning rate is set to  for Adam with  batch size, and  on BDGP, MNIST-USPS and Multi-Fashion,  on Caltech101-7 and COIL-20 for Riemannian Adam with  batch size. In addition,  is set to  for all datasets, while  is set to  for BDGP, MNIST-USPS and Multi-Fashion, and for Caltech101-7 and COIL-20.

In our CMHHC model, the network architecture consists of a multi-view alignment learning module, a common-view similarity learning module, and a hyperbolic hierarchical clustering module. (1) In the multi-view alignment learning module, all autoencoders share the same architecture of fully connected layers. Each encoder  is with dimensions of and each decoder  is with dimensions of . The fully-connected contrastive learning layers  are with dimensions. (2) In the common-view similarity learning module, on the concatenated aligned representations from all views, we adopt fully-connected similarity learning layers  with architecture. (3) HypHC layers are implemented by  with dimensions of to optimize hyperbolic embeddings [Chami et al.2020]. According to the general rule of the hard mining strategy,  value is set to , and value equals and the  value equals for all datasets. More specifically, the   and  values for the 5 datasets are listed in Table 5.

To balance time complexity against hierarchical clustering quality, the number of sampled triplets for HC input is empirically set to 900000, 3000000, 90000, 90000 and 9000000 for the 5 datasets of different scales, i.e., BDGP, MNIST-USPS, Caltech101-7, COIL-20 and Multi-Fashion, respectively. Datasets with more instances require more sampled triplets to reach the expected clustering quality [Chami et al.2020].
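As a rough illustration of what sampling a fixed budget of triplets involves, the sketch below draws index triplets with three distinct entries uniformly at random. The function name `sample_triplets` and the uniform sampling scheme are assumptions made for illustration; the actual sampler used by HypHC or CMHHC may differ.

```python
import numpy as np

def sample_triplets(n_points: int, n_triplets: int, seed: int = 0) -> np.ndarray:
    """Return an (n_triplets, 3) array of index triplets with distinct entries."""
    if n_points < 3:
        raise ValueError("need at least 3 points to form a triplet")
    rng = np.random.default_rng(seed)
    i = rng.integers(0, n_points, size=n_triplets)
    # Draw j uniformly from the n_points - 1 indices different from i.
    j = rng.integers(0, n_points - 1, size=n_triplets)
    j = j + (j >= i)
    # Draw k uniformly from the n_points - 2 indices different from i and j.
    k = rng.integers(0, n_points - 2, size=n_triplets)
    lo, hi = np.minimum(i, j), np.maximum(i, j)
    k = k + (k >= lo)
    k = k + (k >= hi)
    return np.stack([i, j, k], axis=1)

# Example: COIL-20 has 480 instances and uses 90000 triplets in the text.
triplets = sample_triplets(n_points=480, n_triplets=90000)
```

The budget grows with the dataset size because the number of possible triplets grows cubically with the number of leaves.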

Dataset MNIST-USPS BDGP Caltech101-7 COIL-20 Multi-Fashion
250 250 100 12 500
500 500 200 24 1000
Table 5: The  and  values of five datasets.

C.2.2 Baseline Methods

To compare conventional linkage-based discrete single-view HAC methods and continuous single-view HC methods fairly, we concatenate all views into a single view without losing information, and then apply the above methods. To be specific, we implemented the HAC methods, i.e., the Single-linkage, Complete-linkage, Average-linkage, and Ward-linkage algorithms, using the SciPy Python library. We directly use the open-source implementations of UFit [Chierchia and Perret2020] and HypHC [Chami et al.2020]. UFit proposed a series of continuous objectives for ultrametric fitting from different aspects. We follow the hyperparameter  in UFit, and adopt the best cost function “Closest+Size” instead of the other proposed cost functions, i.e., “Closest+Triplet” and “Dasgupta”. The reasons for this choice of UFit cost function are twofold. First, “Closest+Triplet” assumes that the ground-truth labels of some data points are known in order to establish triplets for metric learning, which conflicts with the original intention of unsupervised clustering. Second, according to the results reported by [Chierchia and Perret2020], the performance of “Dasgupta” is slightly worse than that of “Closest+Size” in terms of accuracy (ACC). For HypHC, we adjust the corresponding hyperparameters for the 5 datasets of different scales. Here, for the sake of fairness, the number of sampled triplets for each dataset is kept consistent with that in our CMHHC model. Besides, we set for BDGP, MNIST-USPS and Multi-Fashion and for Caltech101-7 and COIL-20.
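The linkage-based baselines on concatenated views can be sketched with SciPy as follows. The helper name `hac_on_concatenated_views`, the random toy data, and the Euclidean metric are assumptions for illustration; the text only states that the views are concatenated and that SciPy is used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def hac_on_concatenated_views(views, method="average"):
    """Concatenate per-view feature matrices column-wise and run HAC.

    views  : list of arrays, each of shape (n_samples, d_v)
    method : one of "single", "complete", "average", "ward"
    Returns the SciPy linkage matrix of shape (n_samples - 1, 4).
    """
    X = np.concatenate(views, axis=1)
    # Ward linkage in SciPy requires the Euclidean metric.
    return linkage(X, method=method, metric="euclidean")

# Toy data standing in for two views of the same 12 samples.
rng = np.random.default_rng(0)
toy_views = [rng.normal(size=(12, 5)), rng.normal(size=(12, 3))]
trees = {m: hac_on_concatenated_views(toy_views, m)
         for m in ("single", "complete", "average", "ward")}
```

Each linkage matrix encodes a binary tree over the samples, so it can be scored with the same hierarchical metric as the trees decoded by the continuous methods.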

For MHC [Zheng et al.2020], we implemented the method in Python, as no open-source implementation was available. MHC assumes that multiple views can be reconstructed from one fundamental latent representation. However, the detailed construction of this latent representation is left implicit. Hence, we empirically apply non-negative matrix factorization to the concatenated views to obtain it. In addition, the way MHC builds the adjacency graph for NNA is ambiguous, so we relax the formulation of the adjacency graph. We connect a pair of points into one cluster in three cases: the first point is the nearest neighbor of the second, the second point is the nearest neighbor of the first, or the two points have the same neighbor.
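One way to read the relaxed adjacency rule is as connected components of the nearest-neighbor graph, sketched below. This is an interpretation and not the MHC authors' code: the function name `nearest_neighbor_merge` is hypothetical, Euclidean distance is assumed, and "the same neighbor" is taken to mean the same nearest neighbor.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def nearest_neighbor_merge(X: np.ndarray) -> np.ndarray:
    """Merge points into clusters along nearest-neighbor links.

    An undirected edge joins every point to its nearest neighbor, which
    covers the first two cases. Two points that share a nearest neighbor
    are both linked to it, so they fall into the same connected component,
    which covers the third case.
    """
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    nn = dist.argmin(axis=1)
    graph = coo_matrix((np.ones(n), (np.arange(n), nn)), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    return labels

points = np.array([[0.0], [0.1], [5.0], [5.2], [5.3]])
merged_labels = nearest_neighbor_merge(points)
```

Applying the merge repeatedly to cluster representatives would give one agglomeration level per round, although how representatives are formed is outside the scope of this sketch.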

C.3 Hierarchical Clustering Metrics

Following [Kobren et al.2017, Monath et al.2019], we validate hierarchical clustering performance via the Dendrogram Purity (DP), which is a more holistic measurement of the hierarchical tree quality. Given a final clustering tree $T$ of a dataset $X$, and the corresponding ground-truth clustering partition $\{C_k^*\}_{k=1}^{K}$ into $K$ clusters, the DP of $T$ is defined as:

$$\mathrm{DP}(T) = \frac{1}{|P^*|} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k^*} \mathrm{pur}\big(\mathrm{lvs}(\mathrm{lca}(x_i, x_j)),\, C_k^*\big),$$

where $P^*$ is the set of pairs of data points belonging to the same ground-truth cluster, $\mathrm{lvs}(n)$ means the set of descendant leaves of the LCA internal node $n$, and $\mathrm{pur}(S_1, S_2) = |S_1 \cap S_2| / |S_1|$ is the proportion of the data points in set $S_1$ that also belong to set $S_2$. Intuitively, a high DP score means that the tree contains nodes that closely match the clusters of the ground-truth flat partition.
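Dendrogram purity can be computed in one bottom-up pass over a binary tree, as the sketch below illustrates. It follows the usual definition from [Kobren et al.2017]: for every pair of same-class leaves, take the purity of the leaf set under their least common ancestor with respect to that class, then average over all such pairs. The tree is assumed to be given as a SciPy-style linkage matrix, and the function name `dendrogram_purity` is a choice made here for illustration.

```python
import numpy as np

def dendrogram_purity(Z: np.ndarray, labels) -> float:
    """Dendrogram purity of a binary tree given as a SciPy linkage matrix.

    Z[t, 0] and Z[t, 1] are the children merged at step t; the new node
    has id n + t, where n is the number of leaves.
    """
    labels = np.asarray(labels).ravel()
    n = labels.shape[0]
    classes, y = np.unique(labels, return_inverse=True)
    # counts[v, c] = number of leaves of class c below node v
    counts = np.zeros((2 * n - 1, classes.shape[0]))
    counts[np.arange(n), y] = 1.0
    total = 0.0
    for t, (a, b) in enumerate(Z[:, :2].astype(int)):
        left, right = counts[a], counts[b]
        merged = left + right
        counts[n + t] = merged
        # left[c] * right[c] same-class pairs have this node as their LCA;
        # each contributes the node's purity with respect to class c.
        total += float(np.sum(left * right * merged / merged.sum()))
    class_sizes = counts[2 * n - 2]
    n_pairs = float(np.sum(class_sizes * (class_sizes - 1) / 2))
    if n_pairs == 0:
        raise ValueError("no pair of leaves shares a ground-truth class")
    return total / n_pairs

# Hand-built tree over 4 leaves: (0, 1) -> 4, (2, 3) -> 5, (4, 5) -> 6.
Z_toy = np.array([[0, 1, 1.0, 2], [2, 3, 1.0, 2], [4, 5, 2.0, 4]])
dp_aligned = dendrogram_purity(Z_toy, [0, 0, 1, 1])
dp_mixed = dendrogram_purity(Z_toy, [0, 1, 0, 1])
```

In the toy tree, labels that agree with the two bottom merges give the maximal score, while labels that cut across them force every same-class pair up to the root, where only half of the leaves share the class.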

C.4 Hard Mining Strategy Analysis

Dataset MNIST-USPS BDGP Caltech101-7 COIL-20 Multi-Fashion
CMHHC 94.49% 91.53% 66.52% 84.89% 96.25%
CMHHC (stricter sampling variant) 76.24% 83.96% 54.44% 82.05% 93.86%
Table 6: Hard mining strategy analysis, i.e., DP results of CMHHC and its variant with the stricter triplet sampling strategy.
Figure 3: DP results of our CMHHC on five datasets with respect to different parameter settings.

To demonstrate the effectiveness of the proposed hard mining strategy for unsupervised metric learning, we compare CMHHC with a variant that replaces the proposed strategy with a much stricter triplet sampling strategy for weighted triplet loss training. In this variant, the hard negative set for an anchor sample consists of the nearest neighbors that are close to the anchor in the Euclidean space but lie on a different manifold [Iscen et al.2018]. Therefore, we denote the hard negative set for this variant as:


The stricter triplet sampling strategy makes the resulting variant of CMHHC focus on overly hard negatives, which tend to be too close to the anchor sample in the Euclidean space. Therefore, the mined pairwise similarity information is likely to contain unexpected noise, which misleads the hierarchical clustering. Besides, the variant is limited to the local structure around  negatives, which ignores the fact that the whole clustering process takes the similarity of all instances into consideration. In contrast, our hard mining strategy, which considers samples that are far from the anchor not only on manifolds but also in the Euclidean space, is better suited to our downstream clustering task.
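The stricter negative set described above (Euclidean neighbors of the anchor that are not among its manifold neighbors) can be sketched as a row-wise set difference. Several assumptions are made here: both neighbor lists use the same size `k`, the manifold similarity matrix is taken as given (for example from a random-walk model), and the function names are hypothetical. This illustrates the idea from [Iscen et al.2018] and is not the exact set used in the paper.

```python
import numpy as np

def topk_indices(scores: np.ndarray, k: int, largest: bool) -> np.ndarray:
    """Row-wise indices of the k best entries, never selecting the diagonal."""
    s = scores.astype(float).copy()
    np.fill_diagonal(s, -np.inf if largest else np.inf)
    order = np.argsort(-s if largest else s, axis=1)
    return order[:, :k]

def strict_hard_negatives(euclid_dist: np.ndarray,
                          manifold_sim: np.ndarray,
                          k: int) -> list:
    """For each anchor, Euclidean k-NN that are not manifold k-NN."""
    e_nn = topk_indices(euclid_dist, k, largest=False)   # small distance
    m_nn = topk_indices(manifold_sim, k, largest=True)   # large similarity
    return [np.setdiff1d(e_nn[i], m_nn[i]) for i in range(len(e_nn))]

# Four points on a line (Euclidean distances) and an assumed manifold
# similarity matrix; both are toy values for illustration only.
D = np.array([[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]], float)
S = np.array([[0.0, 0.9, 0.1, 0.8],
              [0.9, 0.0, 0.7, 0.2],
              [0.1, 0.7, 0.0, 0.6],
              [0.8, 0.2, 0.6, 0.0]])
negatives = strict_hard_negatives(D, S, k=2)
```

Because every selected negative lies inside the Euclidean neighborhood of the anchor, the candidate pool is small and local, which is the limitation the paragraph above points out.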

Table 6 reports the DP results of CMHHC and of the variant with the stricter sampling strategy. Our hard mining strategy improves DP on all datasets, with gains ranging from 2.39 percentage points on Multi-Fashion to 18.25 percentage points on MNIST-USPS. In other words, the relaxed strategy learns more meaningful triplets, which help generate a clustering-friendly embedding space.

C.5 Parameter Sensitivity Analysis

The hyperparameters of CMHHC include the temperature parameters , for contrastive learning and hyperbolic hierarchical clustering, as well as the number of nearest neighbors in the Euclidean space , and the values of hard positives and hard negatives and for similarity learning. Naturally, we set  for all datasets, while  for BDGP, MNIST-USPS, and Multi-Fashion and for Caltech101-7 and COIL-20, which are empirically efficient. In terms of the similarity learning parameters , and , we made the general rule that the  value equals , value equals and the  value equals . Therefore, we evaluate the effectiveness of the general rule on all five datasets. With the other two parameters  and  fixed, we vary , and values in the range of , and for all datasets. We also run the model 5 times, and the DP results with the lowest value of  under each parameter setting are shown in Figure 3. Our CMHHC is insensitive to value setting. Since the parameter is utilized to define the manifold similarity matrix, different values result in different metric properties on manifolds. Hence, the performance of CMHHC is likely to fluctuate within a reasonable range, and when the DP results on different datasets tend to be stable at a better level. Besides, setting and values via our general rule makes it easier to achieve fairly good hierarchical clustering performance. Actually, the parameters and control the diversity and the difficulty of positives and negatives. More specifically, making the equals and the equals , offers sufficient hard positives and negatives, and guarantees that these tuples capture pseudo-label information with little noise.