1 Introduction
Clustering is one of the fundamental problems in data analysis, which aims to categorize unlabeled data points into clusters. Existing clustering methods can be divided into partitional clustering and hierarchical clustering (HC) [Jain et al., 1999]. The difference is that partitional clustering produces only one partition, while hierarchical clustering produces a nested series of partitions. Compared with partitional clustering, hierarchical clustering reveals more information about fine-grained similarity relationships and structures of data.
Hierarchical clustering has gained intensive attention. There are usually two types of hierarchical methods, i.e., agglomerative methods and continuous ones. Agglomerative methods include Single-Linkage, Complete-Linkage, Average-Linkage, Ward-Linkage, etc.; continuous methods include Ultrametric Fitting (UFit) [Chierchia and Perret, 2020] and Hyperbolic Hierarchical Clustering (HypHC) [Chami et al., 2020], etc. Existing HC methods are usually limited to data from a single source. However, with advances in data acquisition in real-world applications, data of an underlying signal are often collected from heterogeneous sources or feature subsets [Li et al., 2018]. For example, images can be described by the local binary pattern (LBP) descriptor and the scale-invariant feature transform (SIFT) descriptor; websites can be represented by text, pictures, and other structured metadata. Data from different views may contain complementary and consensus information. Hence it would be greatly beneficial to utilize multi-view data to fertilize hierarchical clustering.
Compared with partitional clustering and single-view hierarchical clustering, multi-view hierarchical clustering is less investigated. It requires a finer-grained understanding of both the consistency and the differences among multiple views. To this end, we propose a novel Contrastive Multi-view Hyperbolic Hierarchical Clustering (CMHHC) model, as depicted in Figure 1. The proposed model consists of three components, i.e., multi-view alignment learning, aligned feature similarity learning, and continuous hyperbolic hierarchical clustering. First, to encourage consistency among multiple views and to suppress view-specific noise, we align the representations via contrastive learning, whose intuition is that the same instance observed from different views should be mapped close together, while different instances should be mapped apart. Next, to improve the metric properties among data instances, we exploit a more reliable similarity measure over the aligned features, computed both on manifolds and in the Euclidean space, upon which we mine hard positives and hard negatives for unsupervised metric learning. Besides, we attach an autoencoder to each view to regularize the model and prevent model collapse. Then, we embed the representations with good similarities into the hyperbolic space and optimize the hyperbolic embeddings via the continuous relaxation of Dasgupta's discrete objective for HC
[Chami et al., 2020]. Finally, a tree decoding algorithm decodes the binary clustering tree from the optimized hyperbolic embeddings with low distortion. To the best of our knowledge, the only relevant work on hierarchical clustering for multi-view data is the multi-view hierarchical clustering (MHC) model proposed by Zheng et al. [2020]. MHC clusters multi-view data by alternating cosine distance integration and nearest neighbor agglomeration. Compared with MHC and naïve methods like multi-view concatenation followed by single-view hierarchical agglomerative clustering, our method enjoys several advantages. Firstly, compared with simply concatenating multiple views in naïve methods or the averaging operation in MHC, CMHHC incorporates a contrastive multi-view alignment process, which can better utilize the complementary and consensus information among different views to learn meaningful view-invariant features of instances. Secondly, compared with the shallow HC framework, deep representation learning and similarity learning are applied in our model to match complex real-world multi-view datasets and obtain more discriminative representations. Thirdly, compared with heuristic agglomerative clustering, our method adopts a gradient-based clustering process that optimizes multi-view representation learning and hyperbolic hierarchical clustering loss functions. These loss functions provide explicit evidence for measuring the quality of the tree structure, which is crucial to achieving better HC performance.
The contributions of this work are summarized as follows:

To our knowledge, we propose the first deep neural network model for multi-view hierarchical clustering, which can capture aligned and discriminative representations across multiple views and perform hierarchical clustering at diverse levels of granularity.

To learn representative and discriminative multi-view embeddings, we exploit a contrastive representation learning module (to align representations across multiple views) and an aligned feature similarity learning module (to consider the manifold similarity). These embeddings are helpful to the downstream task of similarity-based clustering.

We validate our framework on five multi-view datasets and demonstrate that CMHHC outperforms existing HC and MHC algorithms in terms of the Dendrogram Purity (DP) measurement.
2 Related Work
We review related work from three perspectives: hierarchical clustering, multi-view clustering, and hyperbolic geometry.
2.1 Hierarchical Clustering
Hierarchical clustering provides a recursive way to partition a dataset into successively finer clusters. Classical heuristics, like Single Linkage, Complete Linkage, Average Linkage, and Ward Linkage, are often the methods of choice for small datasets but may not scale well to large ones, as the running time grows cubically with the sample size [Kobren et al., 2017]. Another problem with these heuristics is the lack of good objective functions, so there is no solid theoretical basis to support HC algorithms. To overcome this challenge, Dasgupta [2016] defined a proper discrete hierarchical clustering loss over all possible hierarchies. Recently, gradient-based HC has gained increasing research attention, including feature-based gHHC [Monath et al., 2019] and similarity-based UFit [Chierchia and Perret, 2020] and HypHC [Chami et al., 2020].
This paper deals with multi-view hierarchical clustering. The work most relevant to this paper is MHC [Zheng et al., 2020], which performs cosine distance integration and nearest neighbor agglomeration alternately. However, this shallow HC framework does not extract meaningful and consistent information from multiple views, probably resulting in degraded clustering performance.
2.2 Multi-view Clustering
Existing multi-view clustering (MVC) methods mainly focus on partitional clustering. Traditional multi-view clustering includes four types, i.e., multi-view subspace clustering [Li et al., 2019a], multi-view spectral clustering [Kang et al., 2020], multi-view matrix factorization-based clustering [Cai et al., 2013], and canonical correlation analysis (CCA) based clustering [Chaudhuri et al., 2009]. However, many MVC methods do not meet the requirements of complex nonlinear situations. Thus, deep MVC approaches have attracted much attention recently [Trosten et al., 2021; Xu et al., 2021; Xu et al., 2022]. For example, Andrew et al. [2013] and Wang et al. [2015] proposed deep versions of CCA, termed deep CCA and deep canonically correlated autoencoders, respectively. Also, the End-to-end Adversarial-attention network for Multi-modal Clustering (EAMC) [Zhou and Shen, 2020] leveraged adversarial learning and an attention mechanism to achieve separation and compactness of the cluster structure. Despite many works on partitional clustering, there is minimal research on multi-view hierarchical clustering, which we address in this paper.
2.3 Hyperbolic Geometry
Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature, which drops the parallel postulate of Euclidean geometry [Sala et al., 2018]. Since the surface area of a ball in hyperbolic space grows exponentially with its radius, hyperbolic space can be seen as a continuous version of trees, whose number of leaf nodes also increases exponentially with depth [Monath et al., 2019]. Hyperbolic geometry has recently been adopted to fertilize research related to tree structures [Chami et al., 2020; Nickel and Kiela, 2017; Yan et al., 2021].
3 Method
Given a set of data points $X = \{x_i^v \in \mathbb{R}^{D_v} \mid 1 \le i \le N,\ 1 \le v \le V\}$ covering $K$ clusters and $V$ different views, where $x_i^v$ denotes the $i$-th $D_v$-dimensional instance from the $v$-th view, we aim to produce a nested series of partitions, in which more similar multi-view samples are grouped earlier and have lower lowest common ancestors (LCAs). To this end, we establish a multi-view hierarchical clustering framework named CMHHC. We first introduce the network architecture, and then we define the loss functions and describe the optimization process.
3.1 Network Architecture
The proposed network architecture consists of a multi-view alignment learning module, an aligned feature similarity learning module, and a hyperbolic hierarchical clustering module, as shown in Figure 1. We introduce the details of each component as follows.
3.1.1 Multi-view Alignment Learning
Data from different sources tend to contain complementary and consensus information, so extracting comprehensive representations while suppressing view-specific noise is the major task in multi-view representation learning. Therefore, we design the multi-view alignment learning module to map the data of each view into a low-dimensional aligned space. First of all, we assign a deep autoencoder [Hinton and Salakhutdinov, 2006] to each view. The reconstruction from input to output not only helps each view keep view-specific information to prevent model collapse, but also fertilizes subsequent similarity computing through representation learning. Specifically, considering the $v$-th view $X^v$, the corresponding encoder is represented as $H^v = f_v(X^v; \theta_v)$, and the decoder as $\hat{X}^v = g_v(H^v; \phi_v)$, where $\theta_v$ and $\phi_v$ denote the autoencoder network parameters, and $H^v$ and $\hat{X}^v$ denote the learned $D_h$-dimensional latent features and the reconstructed output, respectively.
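As a rough illustration of this module (not the paper's actual code), the sketch below shows one per-view autoencoder together with the reconstruction objective of Eq. (5); the class name, layer widths, and hidden sizes are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

class ViewAutoencoder(nn.Module):
    """Per-view autoencoder: encoder f_v maps x^v to latent h^v, decoder g_v
    reconstructs x_hat^v. Layer widths here are illustrative only."""
    def __init__(self, input_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x: torch.Tensor):
        h = self.encoder(x)          # H^v: latent features
        return h, self.decoder(h)    # (H^v, X_hat^v)

def reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # L_rec^v = sum_i ||x_i^v - x_hat_i^v||_2^2, averaged over a mini-batch
    return ((x - x_hat) ** 2).sum(dim=1).mean()
```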
For the still-detached multiple views, inspired by recent works on contrastive learning [Xu et al., 2022], we propose to achieve consistency of high-level semantic features across views in a contrastive way, i.e., corresponding instances of different views should be mapped closer, while different instances should be mapped more separately. To be specific, we define a cross-view positive pair as two views describing the same object, and a negative pair as two different objects from the same view or from two arbitrary views. We encode the latent features to aligned features $Z^v = e(H^v; \eta)$, where $\eta$ denotes the parameters of the contrastive learning encoder $e$, and $z_i^v \in \mathbb{R}^{D_z}$. Then, we compute the cosine similarity [Chen et al., 2020] of two representations $z_i^{v_1}$ and $z_j^{v_2}$ ($v_1 \ne v_2$ or $i \ne j$):

$$ s(z_i^{v_1}, z_j^{v_2}) = \frac{\langle z_i^{v_1}, z_j^{v_2} \rangle}{\| z_i^{v_1} \| \, \| z_j^{v_2} \|}. \quad (1) $$

To achieve consistency across all views, we expect the similarity scores of positive pairs $(z_i^{v_1}, z_i^{v_2})$ to be larger, and those of negative pairs $(z_i^{v_1}, z_j^{v_2})$ and $(z_i^{v_1}, z_j^{v_1})$ with $i \ne j$ to be smaller.
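A minimal sketch of this cross-view contrastive objective (a standard NT-Xent-style formulation, not necessarily the authors' exact implementation): rows i of `z1` and `z2` are the two views of the same instance and form the positive pair, while all other in-batch combinations serve as negatives.

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(z1: torch.Tensor,
                                z2: torch.Tensor,
                                tau_c: float = 0.5) -> torch.Tensor:
    """z1, z2: (n, d) aligned features of the same n instances in two views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)               # (2n, d)
    sim = z @ z.t() / tau_c                      # cosine similarities (Eq. 1)
    sim.fill_diagonal_(float('-inf'))            # a sample is not its own pair
    # the positive for row i is row i + n (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```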
3.1.2 Aligned Feature Similarity Learning
After the multi-view alignment learning, comprehensive representations of the multi-view data are obtained, and view-specific noise is suppressed. However, the aligned hidden representations do not explicitly guarantee an appropriate similarity measurement that preserves the desired distance structure between pairs of multi-view instances, which is crucial to similarity-based hierarchical clustering. Therefore, we devise an aligned feature similarity learning module to optimize the metric properties of the similarity used in hierarchy learning.
Intuitively, samples may be mapped to some manifold embedded in a high-dimensional Euclidean space, so it would be beneficial to utilize the distance on manifolds to improve the similarity measurement [Iscen et al., 2018]. This module performs unsupervised metric learning by mining positive and negative pairs with both the manifold and Euclidean similarities. Similar to Iscen et al. [2018], we measure the Euclidean similarity of any sample pair via a mapping function $s^E(\tilde{z}_i, \tilde{z}_j)$ and refer to $\mathrm{NN}_k^E(\tilde{z}_i)$ as the Euclidean $k$-nearest-neighbor (NN) set of $\tilde{z}_i$, where $\tilde{z}_i$ is the concatenated aligned representation.
In terms of the manifold similarity, a Euclidean affinity graph needs to be calculated in preparation. The elements of the affinity matrix $A \in \mathbb{R}^{N \times N}$ are weighted as $a_{ij} = s^E(\tilde{z}_i, \tilde{z}_j)$ when $\tilde{z}_i$ and $\tilde{z}_j$ are both Euclidean $k$-NN nodes of each other, and $a_{ij} = 0$ otherwise. Following the spirit of the advanced random walk model [Iscen et al., 2017], we can obtain the convergence solution efficiently, in which an element $s^M(\tilde{z}_i, \tilde{z}_j)$ denotes the "best" walk from the $i$-th node to the $j$-th node. Therefore, $s^M(\cdot, \cdot)$ serves as the manifold similarity function. Similarly, we denote $\mathrm{NN}_k^M(\tilde{z}_i)$ as the manifold $k$-NN set of $\tilde{z}_i$. We consider every data point in the dataset as an anchor in turn so that the hard mining strategy stays feasible under the premise of acceptable computability. Given an anchor $\tilde{z}_i$, we select the $k_p$ nearest feature vectors on manifolds that are not that close in the Euclidean space as good positives. With $\mathrm{NN}_{k_p}^M(\tilde{z}_i)$ and $\mathrm{NN}_{k_p}^E(\tilde{z}_i)$, the hard positive set, sorted in descending order of manifold similarity, is:

$$ P(\tilde{z}_i) = \{\, \tilde{z}_j \mid \tilde{z}_j \in \mathrm{NN}_{k_p}^M(\tilde{z}_i),\ \tilde{z}_j \notin \mathrm{NN}_{k_p}^E(\tilde{z}_i) \,\}, \quad (2) $$

where $k_p$ decides how hard the selected positives are and is set independently of $k$. However, to keep the gained pseudo-label information low-noise, good negatives are expected to be relatively far from the anchor not only on manifolds but also in the Euclidean space. So the hard negative set, sorted in descending order of Euclidean similarity, is:

$$ N(\tilde{z}_i) = \{\, \tilde{z}_j \mid \tilde{z}_j \in X \setminus \big( \mathrm{NN}_{k_n}^M(\tilde{z}_i) \cup \mathrm{NN}_{k_n}^E(\tilde{z}_i) \big) \,\}, \quad (3) $$

where $X$ is the set of all feature vectors, and $k_n$ is the number of nearest neighbors for negatives, separate from $k_p$. Intuitively, a larger value of $k_p$ leads to harder positives, as it considers those with relatively lower confidence and tolerates intra-cluster variability. Similarly, a smaller value of $k_n$ means harder negatives, distinguishing the anchors from more easily-confused negatives.
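The following sketch mirrors the selection rules of Eqs. (2) and (3) under our reconstruction of the notation; `s_euc` and `s_man` stand for precomputed Euclidean and manifold similarity matrices, and the loops are kept naive for readability.

```python
import torch

def mine_hard_tuples(s_euc: torch.Tensor, s_man: torch.Tensor,
                     k_pos: int, k_neg: int):
    """s_euc / s_man: (N, N) similarity matrices; k_pos and k_neg play the
    roles of k_p and k_n above."""
    n = s_euc.size(0)
    euc_rank = s_euc.argsort(dim=1, descending=True)   # column 0 is the anchor
    man_rank = s_man.argsort(dim=1, descending=True)
    pos_sets, neg_sets = [], []
    for i in range(n):
        euc_nn = set(euc_rank[i, 1:k_pos + 1].tolist())
        # Eq. (2): close on the manifold but not close in Euclidean space,
        # kept in descending manifold-similarity order
        pos_sets.append([j for j in man_rank[i, 1:k_pos + 1].tolist()
                         if j not in euc_nn])
        # Eq. (3): outside both k_neg-neighborhoods, kept in descending
        # Euclidean-similarity order so the hardest negatives come first
        excluded = set(man_rank[i, :k_neg + 1].tolist()) \
                 | set(euc_rank[i, :k_neg + 1].tolist()) | {i}
        neg_sets.append([j for j in euc_rank[i].tolist() if j not in excluded])
    return pos_sets, neg_sets
```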
After obtaining hard tuples as the supervision for similarity learning over the multi-view input, we are able to obtain clustering-friendly embeddings on top of the aligned feature space via an encoder $m$: $H = m(\tilde{Z}; \mu)$, where $H$ is the set of $D_e$-dimensional discriminative embeddings and $\mu$ is the parameter of the encoder $m$.
3.1.3 Hyperbolic Hierarchical Clustering
To reach the goal of hierarchical clustering, we adopt the continuous optimization process of Hyperbolic Hierarchical Clustering [Chami et al., 2020] as the guidance of this module, which is stacked on the learned common-view embeddings. By means of an embedder parameterized by $\omega$, we embed the learned similarity graph from the common-view space into a specific hyperbolic space, i.e., the Poincaré ball model with constant curvature $-1$. The geodesic distance between any two points $z_i, z_j$ in this hyperbolic space is

$$ d(z_i, z_j) = \cosh^{-1}\!\left( 1 + 2\, \frac{\| z_i - z_j \|^2}{(1 - \| z_i \|^2)(1 - \| z_j \|^2)} \right). $$

With this prior knowledge, we can find the optimal hyperbolic embeddings, pushed toward the boundary of the ball, via a relaxed form of the improved Dasgupta's cost function. Moreover, the hyperbolic LCA of two hyperbolic embeddings $z_i$ and $z_j$ is the point on their geodesic nearest to the origin of the ball, represented as $z_i \vee z_j$. The LCA of two hyperbolic embeddings is an analogy to that of two leaf nodes $i$ and $j$ in a discrete tree, where the tree LCA is the node closest to the root on the two leaves' shortest path [Chami et al., 2020]. Based on this, the best hyperbolic embeddings can be decoded back into a tree by merging embeddings iteratively along the direction of the ball radius from boundary to origin.
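For concreteness, here is a small numerical sketch of the two hyperbolic quantities used above: the Poincaré distance and the depth $d_o(z_i \vee z_j)$ of the hyperbolic LCA. HypHC uses a closed form for the latter; the grid search below is only an easy-to-read approximation.

```python
import torch

def mobius_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mobius addition on the Poincare ball (curvature -1)."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def poincare_dist(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """d(x, y) = arcosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))."""
    num = 2 * ((x - y) ** 2).sum(-1)
    den = (1 - (x * x).sum(-1)) * (1 - (y * y).sum(-1))
    return torch.acosh(1 + num / den)

def lca_depth(x: torch.Tensor, y: torch.Tensor, steps: int = 256) -> torch.Tensor:
    """Approximate d_o(x v y): the distance from the origin to the point of
    the x-y geodesic closest to the origin, via a grid search along
    gamma(t) = x (+) (t (*) ((-x) (+) y))."""
    v = mobius_add(-x, y)                        # tangent chord from x to y
    vn = v.norm().clamp_min(1e-15)
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    gamma = mobius_add(x, torch.tanh(t * torch.atanh(vn)) * v / vn)
    # d(0, p) = 2 artanh(||p||) on the Poincare ball
    return (2 * torch.atanh(gamma.norm(dim=-1))).min()
```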
3.2 Loss Functions and Optimization Process
This section introduces the loss functions for CMHHC, including the multi-view representation learning loss and the hierarchical clustering loss, and discusses the optimization process.
3.2.1 Multi-view Representation Learning Loss
In our model, we jointly align multi-view representations and learn reliable similarities in the common-view space by integrating the autoencoder reconstruction loss $\mathcal{L}_{rec}$, the multi-view contrastive loss $\mathcal{L}_{con}$, and the positive-weighted triplet metric loss $\mathcal{L}_{tri}$, so the objective for multi-view representation learning is defined as:

$$ \mathcal{L}_{rl} = \mathcal{L}_{rec} + \mathcal{L}_{con} + \mathcal{L}_{tri}. \quad (4) $$
First of all, $\mathcal{L}_{rec}^v$ is the loss of reconstructing the $v$-th view $X^v$, so the complete reconstruction loss over all views is:

$$ \mathcal{L}_{rec} = \sum_{v=1}^{V} \mathcal{L}_{rec}^v = \sum_{v=1}^{V} \sum_{i=1}^{N} \left\| x_i^v - g_v\big( f_v(x_i^v; \theta_v); \phi_v \big) \right\|_2^2. \quad (5) $$
As for the second term, assuming that alignment between every two views ensures alignment among all views, we introduce the total multi-view contrastive loss as:

$$ \mathcal{L}_{con} = \frac{1}{2} \sum_{v_1=1}^{V} \sum_{v_2 \ne v_1} \ell_{con}^{(v_1, v_2)}, \quad (6) $$

where the contrastive loss between a reference view $v_1$ and a contrast view $v_2$ is:

$$ \ell_{con}^{(v_1, v_2)} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\big( s(z_i^{v_1}, z_i^{v_2}) / \tau_c \big)}{\sum_{j=1}^{N} \Big[ \exp\big( s(z_i^{v_1}, z_j^{v_2}) / \tau_c \big) + \mathbb{1}_{[j \ne i]} \exp\big( s(z_i^{v_1}, z_j^{v_1}) / \tau_c \big) \Big]}, \quad (7) $$

where $\tau_c$ is the temperature hyperparameter for the multi-view contrastive loss.
The third term is for similarity learning. A hard tuple includes an anchor $\tilde{z}_i$, a hard positive sample $\tilde{z}_i^+$, and a hard negative sample $\tilde{z}_i^-$; the single positive and the single negative are randomly selected from the corresponding sets $P(\tilde{z}_i)$ and $N(\tilde{z}_i)$. Then we calculate the embeddings of the tuple, $h_i = m(\tilde{z}_i)$, $h_i^+ = m(\tilde{z}_i^+)$, and $h_i^- = m(\tilde{z}_i^-)$, so that we can measure the similarities of the embeddings via the weighted triplet loss:

$$ \mathcal{L}_{tri} = \sum_{i=1}^{N} s^M(\tilde{z}_i, \tilde{z}_i^+) \Big[ \| h_i - h_i^+ \|_2^2 - \| h_i - h_i^- \|_2^2 + \alpha \Big]_+, \quad (8) $$

where the weight $s^M(\tilde{z}_i, \tilde{z}_i^+)$ represents the degree of contribution of each tuple and $\alpha$ is the margin. Weighting the standard triplet loss by the manifold similarity between the anchor and the positive relieves the pressure from tuples with too-hard positives.
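A short sketch of Eq. (8) in PyTorch (again under our reconstructed notation): `w` carries the anchor-positive manifold similarities $s^M(\tilde{z}_i, \tilde{z}_i^+)$, and the margin value is an illustrative placeholder.

```python
import torch
import torch.nn.functional as F

def weighted_triplet_loss(h_a: torch.Tensor, h_p: torch.Tensor,
                          h_n: torch.Tensor, w: torch.Tensor,
                          margin: float = 0.1) -> torch.Tensor:
    """h_a, h_p, h_n: (B, D_e) anchor / hard-positive / hard-negative
    embeddings; w: (B,) per-tuple weights from the manifold similarity."""
    d_pos = (h_a - h_p).pow(2).sum(dim=1)   # squared distance to the positive
    d_neg = (h_a - h_n).pow(2).sum(dim=1)   # squared distance to the negative
    return (w * F.relu(d_pos - d_neg + margin)).sum()
```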
It is worth mentioning that $\mathcal{L}_{rl}$ in Eq. (4) is optimized by mini-batch training so that the method can scale to large multi-view datasets.
3.2.2 Hierarchical Clustering Loss
A "good" hierarchical tree means that more similar data instances should be merged earlier. Dasgupta [2016] first proposed an explicit hierarchical clustering loss function:

$$ \mathcal{L}_{Das}(T; w) = \sum_{i,j} w_{ij} \, \big| \mathrm{leaves}(T[i \vee j]) \big|, \quad (9) $$

where $w_{ij}$ is the Euclidean similarity between $x_i$ and $x_j$, $T[i \vee j]$ is the subtree rooted at the LCA of the $i$-th and $j$-th leaf nodes, and $\mathrm{leaves}(\cdot)$ denotes the set of descendant leaves of an internal node.
Here, we adopt the differentiable relaxation of the constrained Dasgupta's objective [Chami et al., 2020]. Our pairwise similarity graph is learned from the multi-view representation learning process, so the unified similarity measurement narrows the gap between representation learning and clustering. The hyperbolic hierarchical clustering loss for our model is defined as:

$$ \mathcal{L}_{HypHC}(Z; w, \tau) = \sum_{(i,j,k)} \big( w_{ij} + w_{ik} + w_{jk} - w_{HypHC}(z_i, z_j, z_k; w, \tau) \big) + \sum_{i,j} w_{ij}. \quad (10) $$

We compute the hierarchy of any embedding triplet through the similarities of all embedding pairs among the three samples:

$$ w_{HypHC}(z_i, z_j, z_k; w, \tau) = (w_{ij}, w_{ik}, w_{jk}) \cdot \sigma_\tau\big( d_o(z_i \vee z_j), d_o(z_i \vee z_k), d_o(z_j \vee z_k) \big)^{\top}, \quad (11) $$

where $d_o(\cdot)$ denotes the hyperbolic distance from the origin to an LCA, and $\sigma_\tau$ is the softmax function scaled by the temperature parameter $\tau$ for the hyperbolic hierarchical clustering loss. Eq. (10) theoretically asks for similarities among all triplets of the dataset, which takes a high time complexity of $O(N^3)$. However, we can sample $O(N^2)$ triplets by obtaining all possible node pairs and then choosing the third node randomly from the rest [Chami et al., 2020]. With mini-batch training and sampling tricks, HC can scale to large datasets with acceptable computational complexity.
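The sampling trick can be sketched as follows: enumerate all node pairs and complete each with a uniformly drawn third node, yielding $O(N^2)$ triplets instead of $O(N^3)$ (a simplified version of the scheme in [Chami et al., 2020]).

```python
import torch

def sample_triplets(n: int) -> torch.Tensor:
    """Return an (M, 3) tensor of triplets: all pairs (i, j) with i < j,
    each completed by a random third node k distinct from i and j."""
    i, j = torch.triu_indices(n, n, offset=1)   # all pairs i < j
    k = torch.randint(0, n, (i.numel(),))
    clash = (k == i) | (k == j)
    while clash.any():                          # redraw colliding thirds
        k[clash] = torch.randint(0, n, (int(clash.sum()),))
        clash = (k == i) | (k == j)
    return torch.stack([i, j, k], dim=1)
```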
Then the optimal hyperbolic embeddings are denoted as:

$$ Z^* = \arg\min_{Z} \mathcal{L}_{HypHC}(Z; w, \tau). \quad (12) $$
Finally, the negative curvature and tree-like geodesics of hyperbolic space allow us to decode the binary tree in the original space by iteratively grouping the two similar embeddings whose hyperbolic LCA is farthest from the origin:

$$ T^* = \mathrm{dec}(Z^*), \quad (13) $$

where $\mathrm{dec}(\cdot)$ is the decoding function [Chami et al., 2020].
3.2.3 Optimization Process
The overall optimization procedure of CMHHC is summarized in Algorithm 1 in the appendix.
4 Experiments
4.1 Experimental Setup
4.1.1 Datasets
We conduct our experiments on the following five real-world multi-view datasets.

MNIST-USPS [Peng et al., 2019] is a two-view dataset with 5,000 handwritten digit (0-9) images. The MNIST view is $28 \times 28$ in size, and the USPS view is $16 \times 16$.

BDGP [Li et al., 2019b] contains 2,500 images of Drosophila embryos divided into five categories, with two extracted features. One view is 1,750-dim visual features, and the other is 79-dim textual features.

Caltech101-7 [Dueck and Frey, 2007] is established with 5 diverse feature descriptors, including 40-dim wavelet moments (WM), 254-dim CENTRIST, 1,984-dim HOG, 512-dim GIST, and 928-dim LBP features, over 1,400 RGB images sampled from 7 categories.

COIL-20 contains object images of 20 categories. Following Trosten et al. [2021], we establish a variant of COIL-20 in which 480 grayscale images are captured from 3 different random angles.

Multi-Fashion [Xu et al., 2022] is a three-view dataset with 10,000 images of different fashionable designs, where the different views of each sample are different products from the same category.
4.1.2 Baseline Methods
We demonstrate the effects of our CMHHC by comparing it with the following three kinds of hierarchical clustering methods. Note that for single-view methods, we concatenate all views into a single view to provide complete information [Peng et al., 2019].
Firstly, we compare CMHHC with conventional linkage-based discrete single-view hierarchical agglomerative clustering (HAC) methods, including the Single-linkage, Complete-linkage, Average-linkage, and Ward-linkage algorithms. Secondly, we compare CMHHC with the most representative similarity-based continuous single-view hierarchical clustering methods, i.e., UFit [Chierchia and Perret, 2020] and HypHC [Chami et al., 2020]. For UFit, we adopt the best loss function "Closest+Size". For HypHC, we directly use the continuous Dasgupta's objective. We utilized the corresponding open-source versions of both methods and followed the default parameters in the code provided by the authors. Lastly, to the best of our knowledge, MHC [Zheng et al., 2020] is the only published multi-view hierarchical clustering method. We implemented it with Python as there was no open-source implementation available.
4.1.3 Hierarchical Clustering Metrics
Unlike partitional clustering, binary clustering trees, i.e., the results of hierarchical clustering, can provide cluster information at diverse granularities when the trees are truncated at different altitudes. Hence, general clustering metrics, such as clustering accuracy (ACC) and Normalized Mutual Information (NMI), are not able to characterize clustering hierarchies comprehensively. To this end, following previous literature [Kobren et al., 2017; Monath et al., 2019], we validate hierarchical clustering performance via the Dendrogram Purity (DP) measurement. In brief, DP is the average purity of the descendant leaves of the LCA over all data point pairs that belong to the same ground-truth cluster. A clustering tree with a higher DP value contains purer subtrees and keeps a more consistent structure with the ground-truth flat partition. A detailed explanation of DP is in the appendix.
4.1.4 Implementation Details
Our entire CMHHC model is implemented with PyTorch. We first pretrain the autoencoders for 200 epochs and the contrastive learning encoder for 10, 50, 50, 50, and 100 epochs on BDGP, MNIST-USPS, Caltech101-7, COIL-20, and Multi-Fashion, respectively; we then fine-tune the whole multi-view representation learning process for 50 epochs, and finally train with the hyperbolic hierarchical clustering loss for 50 epochs. The batch size is set to 256 for representation learning and 512 for hierarchical clustering, using the Adam optimizer and the hyperbolic-matched Riemannian optimizer [Kochurov et al., 2020], respectively. The learning rate for Adam is fixed, while that for Riemannian Adam is searched per dataset. We empirically set the temperature $\tau_c$ to the same value for all datasets, while $\tau$ takes one value for BDGP, MNIST-USPS, and Multi-Fashion and another for Caltech101-7 and COIL-20. We run the model 5 times and report the results with the lowest value of $\mathcal{L}_{HypHC}$. In addition, we create an adjacency graph with 50 Euclidean nearest neighbors to compute manifold similarities. We follow the general rule that $k_p = N/(2K)$ and $k_n = N/K$, making the selected tuples hard and reliable. More detailed parameter settings can be found in the appendix.
4.2 Experimental Results
4.2.1 Performance Comparison with Baselines
Experimental DP results are reported in Table 1. The results illustrate that our unified CMHHC model outperforms the competing baseline methods. As the table shows, our model gains a significant growth in DP over the second-best method: 2.39% on BDGP, 14.11% on MNIST-USPS, 21.30% on Caltech101-7, 4.08% on COIL-20, and 23.92% on Multi-Fashion. The underlying reason is that CMHHC captures much more meaningful multi-view aligned embeddings instead of roughly concatenating all views without making full use of the complementary and consensus information. Our deep model greatly exceeds the only previous multi-view hierarchical clustering work, MHC, especially on Caltech101-7, COIL-20, and the large-scale Multi-Fashion. This result can be attributed to the alignment and discrimination of the multi-view similarity graph learned for hyperbolic hierarchical clustering. Additionally, the performance gap between our model and the continuous UFit and HypHC reflects the limitations of fixing input graphs without an effective similarity learning process.
Table 1: Dendrogram Purity (%) results on five multi-view datasets.

Method  MNIST-USPS  BDGP  Caltech101-7  COIL-20  Multi-Fashion
HAC (Single-linkage)  29.81%  61.88%  23.67%  72.56%  27.89%
HAC (Complete-linkage)  54.36%  56.57%  30.19%  69.95%  48.72%
HAC (Average-linkage)  69.67%  45.91%  30.90%  73.14%  65.70%
HAC (Ward-linkage)  80.38%  58.61%  35.69%  80.81%  72.33%
UFit  21.67%  69.20%  19.00%  55.41%  25.94%
HypHC  32.99%  31.21%  22.46%  28.50%  25.65%
MHC  78.27%  89.14%  45.22%  66.50%  54.81%
CMHHC (Ours)  94.49%  91.53%  66.52%  84.89%  96.25%
4.2.2 Ablation Study
We conduct an ablation study to evaluate the effects of the components in the multi-view representation learning module. To be specific, we refer to CMHHC without the autoencoder reconstruction of multiple views as CMHHC-w/o-AE, without the contrastive learning sub-module as CMHHC-w/o-CL, and without the similarity learning as CMHHC-w/o-SL. We train each variant after removing the corresponding network layers. Table 2 shows the experimental results on the 5 datasets. The results show that the proposed components contribute to the final hierarchical clustering performance in almost all cases.
Table 2: Ablation results (DP, %) on the five datasets.

Ablation  MNIST-USPS  BDGP  Caltech101-7  COIL-20  Multi-Fashion
CMHHC  94.49%  91.53%  66.52%  84.89%  96.25%
CMHHC-w/o-AE  92.92%  86.19%  68.50%  27.16%  89.06%
CMHHC-w/o-CL  43.10%  24.51%  33.02%  14.73%  43.57%
CMHHC-w/o-SL  89.78%  90.10%  19.74%  53.40%  44.65%
4.2.3 Case Study
We qualitatively evaluate a truncated subtree structure learned by our method. We plot sampled MNIST-USPS subtrees of the final clustering tree in Figure 2. As shown, the similarity between two nodes grows stronger from the root to the leaves, indicating that the hierarchical tree can reveal fine-grained similarity relationships and ground-truth flat partitions for the multi-view data.
5 Conclusion
This paper proposed a novel multi-view hierarchical clustering framework based on deep neural networks. Employing multiple autoencoders, contrastive multi-view alignment learning, and unsupervised similarity learning, we capture the invariant information across views and learn meaningful metric properties for similarity-based continuous hierarchical clustering. Our method aims to provide a highly interpretable clustering tree for multi-view data, highlighting the importance of representation alignment and discrimination, and indicating the potential of gradient-based hyperbolic hierarchical clustering. Extensive experiments illustrate that CMHHC is capable of clustering multi-view data at diverse levels of granularity.
Acknowledgments
This work was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204), and a key program of fundamental research from Shenzhen Science and Technology Innovation Commission (No. JCYJ20200109113403826).
References
[Andrew et al., 2013] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247–1255, 2013.
[Cai et al., 2013] Xiao Cai, Feiping Nie, and Heng Huang. Multi-view k-means clustering on big data. In IJCAI, pages 2598–2604, 2013.
[Chami et al., 2020] Ines Chami, Albert Gu, Vaggos Chatziafratis, and Christopher Ré. From trees to continuous embeddings and back: Hyperbolic hierarchical clustering. In NeurIPS, pages 15065–15076, 2020.
[Chaudhuri et al., 2009] Kamalika Chaudhuri, Sham M Kakade, Karen Livescu, and Karthik Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, pages 129–136, 2009.
[Chen et al., 2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
[Chierchia and Perret, 2020] Giovanni Chierchia and Benjamin Perret. Ultrametric fitting by gradient descent. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124004, 2020.
[Dasgupta, 2016] Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In STOC, pages 118–127, 2016.
[Dueck and Frey, 2007] Delbert Dueck and Brendan J Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, pages 1–8, 2007.
[Hinton and Salakhutdinov, 2006] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[Iscen et al., 2017] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondrej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact CNN representations. In CVPR, pages 2077–2086, 2017.
[Iscen et al., 2018] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Mining on manifolds: Metric learning without labels. In CVPR, pages 7642–7651, 2018.
[Jain et al., 1999] Anil K Jain, M Narasimha Murty, and Patrick J Flynn. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264–323, 1999.
[Kang et al., 2020] Zhao Kang, Guoxin Shi, Shudong Huang, Wenyu Chen, Xiaorong Pu, Joey Tianyi Zhou, and Zenglin Xu. Multi-graph fusion for multi-view spectral clustering. Knowledge-Based Systems, 189:105102, 2020.
[Kobren et al., 2017] Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. A hierarchical algorithm for extreme clustering. In SIGKDD, pages 255–264, 2017.
[Kochurov et al., 2020] Max Kochurov, Rasul Karimov, and Serge Kozlukov. Geoopt: Riemannian optimization in PyTorch. arXiv preprint arXiv:2005.02819, 2020.
[Li et al., 2018] Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation learning. TKDE, 31(10):1863–1883, 2018.
[Li et al., 2019a] Ruihuang Li, Changqing Zhang, Huazhu Fu, Xi Peng, Tianyi Zhou, and Qinghua Hu. Reciprocal multi-layer subspace learning for multi-view clustering. In ICCV, pages 8172–8180, 2019.
[Li et al., 2019b] Zhaoyang Li, Qianqian Wang, Zhiqiang Tao, Quanxue Gao, and Zhaohua Yang. Deep adversarial multi-view clustering network. In IJCAI, pages 2952–2958, 2019.
[Monath et al., 2019] Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In SIGKDD, pages 714–722, 2019.
[Nickel and Kiela, 2017] Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In NeurIPS, pages 6338–6347, 2017.
[Peng et al., 2019] Xi Peng, Zhenyu Huang, Jiancheng Lv, Hongyuan Zhu, and Joey Tianyi Zhou. COMIC: Multi-view clustering without parameter selection. In ICML, pages 5092–5101, 2019.
[Sala et al., 2018] Frederic Sala, Chris De Sa, Albert Gu, and Christopher Ré. Representation tradeoffs for hyperbolic embeddings. In ICML, pages 4460–4469, 2018.
[Trosten et al., 2021] Daniel J Trosten, Sigurd Lokse, Robert Jenssen, and Michael Kampffmeyer. Reconsidering representation alignment for multi-view clustering. In CVPR, pages 1255–1265, 2021.
[Wang et al., 2015] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In ICML, pages 1083–1092, 2015.
[Xu et al., 2021] Jie Xu, Yazhou Ren, Huayi Tang, Xiaorong Pu, Xiaofeng Zhu, Ming Zeng, and Lifang He. Multi-VAE: Learning disentangled view-common and view-peculiar visual representations for multi-view clustering. In ICCV, pages 9234–9243, 2021.
[Xu et al., 2022] Jie Xu, Huayi Tang, Yazhou Ren, Liang Peng, Xiaofeng Zhu, and Lifang He. Multi-level feature learning for contrastive multi-view clustering. In CVPR, 2022.
[Yan et al., 2021] Jiexi Yan, Lei Luo, Cheng Deng, and Heng Huang. Unsupervised hyperbolic metric learning. In CVPR, pages 12465–12474, 2021.
[Zheng et al., 2020] Qinghai Zheng, Jihua Zhu, and Shuangxun Ma. Multi-view hierarchical clustering. arXiv preprint arXiv:2010.07573, 2020.
[Zhou and Shen, 2020] Runwu Zhou and Yi-Dong Shen. End-to-end adversarial-attention network for multi-modal clustering. In CVPR, pages 14619–14628, 2020.
Appendix A Algorithm Pseudocode for CMHHC
Algorithm 1 presents the step-by-step procedure of the proposed CMHHC.
Appendix B Summary of Notations
Table 3: Summary of notations.

Notation  Definition
$X$  The set of training examples.
$X^v$  The data samples in the $v$-th view.
$H^v$  The latent features of the $v$-th view.
$\hat{X}^v$  The reconstructed samples in the $v$-th view.
$Z^v$  The aligned representations of the $v$-th view.
$\tilde{Z}$  The concatenated aligned representations among all views.
$H$  The discriminative embeddings among all views.
$Z$  The hyperbolic embeddings in the Poincaré model.
$Z^*$  The optimal hyperbolic embeddings.
$T$  The binary HC decoding tree.
$x_i^v$  The $i$-th data sample in the $v$-th view.
$h_i^v$  The $i$-th latent feature in the $v$-th view.
$\hat{x}_i^v$  The $i$-th reconstructed sample in the $v$-th view.
$z_i^v$  The $i$-th aligned representation in the $v$-th view.
$\tilde{z}_i$  The $i$-th concatenated aligned representation.
$\tilde{z}_i^+$  The hard positive representation for the $i$-th anchor representation.
$\tilde{z}_i^-$  The hard negative representation for the $i$-th anchor representation.
$h_i$  The $i$-th discriminative embedding from multiple views.
$h_i^+$  The hard positive embedding for the $i$-th anchor embedding.
$h_i^-$  The hard negative embedding for the $i$-th anchor embedding.
$z_i$  The $i$-th hyperbolic embedding.
$N$  The number of data instances (leaf nodes).
$K$  The number of clusters.
$V$  The number of views.
$D_v$  The dimensionality of the $v$-th view.
$D_h$  The dimensionality of the latent space.
$D_z$  The dimensionality of the aligned representation space.
$D_e$  The dimensionality of the discriminative embedding space.
$s(z_i^{v_1}, z_j^{v_2})$  The cosine similarity of the $i$-th sample in the $v_1$-th view and the $j$-th sample in the $v_2$-th view in the common-view space.
$A$  The affinity matrix of the concatenated aligned representations.
$a_{ij}$  The adjacency weight of the $i$-th and $j$-th representations.
$s^M(\tilde{z}_i, \cdot)$  The solution to the random walk model for the $i$-th sample.
$s^M(\tilde{z}_i, \tilde{z}_j)$  The "best" walk from the $i$-th node to the $j$-th node.
We summarize the notations used in the paper in Table 3.
Appendix C Experimental Details
We introduce more experimental details in this section. All experiments are conducted on a Linux server with a TITAN Xp (10G) GPU and an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
C.1 Datasets
We conduct our experiments on the following multi-view datasets. MNIST-USPS [Peng et al., 2019] consists of 5,000 handwritten digit (0-9) images from 10 categories. The MNIST view is $28 \times 28$ in size, sampled randomly from the MNIST dataset, and the USPS view is $16 \times 16$, sampled randomly from the USPS dataset. Each category has 500 images. Specifically, for the convenience of experiments, we adopt zero-padding to make the dimensions of the USPS view $28 \times 28$, the same as the MNIST view. BDGP [Li et al., 2019b] contains 2,500 images of Drosophila embryos from 5 different categories, each of which contains 500 samples. One view consists of 1,750-dimensional visual features and the other of 79-dimensional textual features. Caltech101-7 [Dueck and Frey, 2007] includes 5 visual feature descriptors, i.e., 40-dim wavelet moments (WM), 254-dim CENTRIST, 1,984-dim HOG, 512-dim GIST, and 928-dim LBP features. These 1,400 RGB images are sampled from 7 categories, each containing 200 images. As for COIL-20, following Trosten et al. [2021], we establish a variant of COIL-20 with 480 object images divided into 20 classes, each with 24 images. Every object is captured in 3 different poses, one pose per view. In terms of large-scale multi-view datasets, we apply Multi-Fashion [Xu et al., 2022], which contains 10 types of clothes, e.g., pullover, shirt, and coat, to verify the scalability of the proposed CMHHC. In Multi-Fashion, the different views of each instance are different fashionable designs of the same category, and there are 10,000 grey images in each view. For all five datasets, we utilize all samples and run our model on all views. The statistics of the experimental datasets are summarized in Table 4.
Table 4: Statistics of the experimental datasets.

Dataset  MNIST-USPS  BDGP  Caltech101-7  COIL-20  Multi-Fashion
# Samples  5000  2500  1400  480  10000
# Categories  10  5  7  20  10
# Views  2  2  5  3  3
C.2 Implementation Details
C.2.1 CMHHC
Our entire model is implemented on the PyTorch platform. To represent the hierarchical structure efficiently, we use the SciPy, networkx, and ETE (Environment for Tree Exploration) Python toolkits. To speed up the convergence of the whole model, we first empirically pretrain the autoencoders for 200 epochs on all datasets and the contrastive learning module for 10, 50, 50, 50, and 100 epochs on BDGP, MNIST-USPS, Caltech101-7, COIL-20, and Multi-Fashion, respectively. Next, we fine-tune the whole multi-view representation learning process for 50 epochs. Finally, the number of epochs for HypHC training is set to 50. The batch size is set to 256 for multi-view representation learning and 512 for hierarchical clustering, using the Adam and hyperbolic-matched Riemannian optimizers [Kochurov et al., 2020], respectively. The learning rate for Adam is fixed, while that for Riemannian Adam is set separately for BDGP, MNIST-USPS, and Multi-Fashion on the one hand and Caltech101-7 and COIL-20 on the other. In addition, the temperature $\tau_c$ is set to the same value for all datasets, while $\tau$ takes one value for BDGP, MNIST-USPS, and Multi-Fashion and another for Caltech101-7 and COIL-20.
In our CMHHC model, the network architecture consists of a multi-view alignment learning module, a common-view similarity learning module, and a hyperbolic hierarchical clustering module. (1) In the multi-view alignment learning module, the autoencoders are designed with the same architecture of fully connected layers, and the contrastive learning layers are also fully connected. (2) In the common-view similarity learning module, we adopt fully connected similarity learning layers on the concatenated aligned representations from all views. (3) The HypHC layer optimizes the hyperbolic embeddings following [Chami et al., 2020]. According to the general rule of the hard mining strategy, the Euclidean neighborhood size $k$ is set to 50, while $k_p = N/(2K)$ and $k_n = N/K$ for all datasets. More specifically, the $k_p$ and $k_n$ values for the 5 datasets are listed in Table 5.
To achieve a trade-off between time complexity and hierarchical clustering quality, the number of sampled triplets for the HC input is empirically set to 900,000, 3,000,000, 90,000, 90,000, and 9,000,000 for the 5 datasets of different scales, i.e., BDGP, MNIST-USPS, Caltech101-7, COIL-20, and Multi-Fashion, respectively. The number of triplets sampled from datasets with more instances should be larger for the expected clustering results [Chami et al., 2020].
Table 5: The $k_p$ and $k_n$ values for each dataset.

Dataset  MNIST-USPS  BDGP  Caltech101-7  COIL-20  Multi-Fashion
$k_p$  250  250  100  12  500
$k_n$  500  500  200  24  1000
C.2.2 Baseline Methods
To compare fairly with the conventional linkage-based discrete single-view HAC methods and the continuous single-view HC methods, we concatenate all views into a single view without losing information and then apply the above methods. To be specific, we implemented the HAC variants, i.e., the Single-linkage, Complete-linkage, Average-linkage, and Ward-linkage algorithms, with the corresponding SciPy Python library. We directly use the open-source implementations of UFit [Chierchia and Perret, 2020] and HypHC [Chami et al., 2020]. UFit proposed a series of continuous objectives for ultrametric fitting from different aspects. We follow the hyperparameter settings of UFit and adopt the best cost function "Closest+Size" instead of the other proposed cost functions, i.e., "Closest+Triplet" and "Dasgupta". The reasons for this choice are twofold. First, "Closest+Triplet" assumes the ground-truth labels of some data points are known in order to establish triplets for metric learning, which conflicts with the original intention of unsupervised clustering. Second, according to the results reported by [Chierchia and Perret, 2020], the performance of "Dasgupta" is slightly worse than that of "Closest+Size" in terms of accuracy (ACC). For HypHC, we adjust the corresponding hyperparameters for the 5 datasets of different scales. Here, for the sake of fairness, the number of sampled triplets for each dataset is kept consistent with that in our CMHHC model. Besides, we set $\tau$ to one value for BDGP, MNIST-USPS, and Multi-Fashion and another for Caltech101-7 and COIL-20.
In terms of MHC [Zheng et al., 2020], we implemented it with Python as there was no open-source implementation available. MHC assumes that the multiple views can be reconstructed from one fundamental latent representation. However, a detailed explanation of this latent representation is not given. Hence, we empirically adopt non-negative matrix factorization on the unified concatenated views to obtain it. In addition, the way MHC builds the adjacency graph for the nearest neighbor agglomeration is ambiguous, so we relax the formulation of the adjacency graph. We consider connecting three kinds of pairs into one cluster, i.e., point $i$ is the nearest neighbor of point $j$, point $j$ is the nearest neighbor of point $i$, or the two points share the same nearest neighbor.
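This relaxed adjacency rule can be written down directly; the sketch below reflects our own reading of MHC's nearest-neighbor agglomeration (not Zheng et al.'s code) and connects two points whenever either is the other's nearest neighbor or both share one.

```python
import numpy as np

def build_adjacency(nn_idx: np.ndarray) -> np.ndarray:
    """nn_idx[i] is the index of point i's nearest neighbor. Returns a
    boolean (N, N) adjacency matrix under the three relaxed rules above."""
    n = len(nn_idx)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if nn_idx[i] == j or nn_idx[j] == i or nn_idx[i] == nn_idx[j]:
                adj[i, j] = adj[j, i] = True
    return adj
```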
C.3 Hierarchical Clustering Metrics
Following [Kobren et al., 2017; Monath et al., 2019], we validate hierarchical clustering performance via Dendrogram Purity (DP), a holistic measurement of hierarchical tree quality. Given a final clustering tree $T$ of a dataset $X$ and the corresponding ground-truth partition of $X$ into $K$ clusters $C_1^*, \ldots, C_K^*$, the DP of $T$ is defined as:

$$ \mathrm{DP}(T) = \frac{1}{|\mathcal{P}^*|} \sum_{k=1}^{K} \sum_{x_i, x_j \in C_k^*,\, i \ne j} \mathrm{pur}\big( \mathrm{leaves}(\mathrm{LCA}(x_i, x_j)),\ C_k^* \big), \quad (14) $$

where $\mathcal{P}^*$ is the set of data point pairs belonging to the same ground-truth cluster, $\mathrm{leaves}(\mathrm{LCA}(x_i, x_j))$ is the set of descendant leaves of the LCA internal node of $x_i$ and $x_j$, and $\mathrm{pur}(S_1, S_2) = |S_1 \cap S_2| / |S_1|$ is the proportion of the data points in set $S_1$ that also belong to set $S_2$. Intuitively, high DP scores indicate that the subtrees of the learned hierarchy align well with the clusters of the ground-truth flat partition.
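As a reference, Eq. (14) can be computed directly from a tree given as parent pointers; the quadratic sketch below is written for clarity rather than efficiency, and the data-structure choice is ours.

```python
from itertools import combinations

def dendrogram_purity(parent: dict, leaf_label: dict) -> float:
    """parent maps every node to its parent (root -> None); leaf_label maps
    each leaf to its ground-truth class. Implements Eq. (14) naively."""
    # collect the descendant-leaf set of every node on each leaf's root path
    leaves_of = {}
    for leaf in leaf_label:
        node = leaf
        while node is not None:
            leaves_of.setdefault(node, set()).add(leaf)
            node = parent[node]

    def ancestors(v):
        chain = []
        while v is not None:
            chain.append(v)
            v = parent[v]
        return chain

    total, count = 0.0, 0
    for x, y in combinations(leaf_label, 2):
        if leaf_label[x] != leaf_label[y]:
            continue                      # only same-class pairs contribute
        anc_x = set(ancestors(x))
        lca = next(a for a in ancestors(y) if a in anc_x)
        sub = leaves_of[lca]              # descendant leaves of the LCA
        total += sum(leaf_label[v] == leaf_label[x] for v in sub) / len(sub)
        count += 1
    return total / count
```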
C.4 Hard Mining Strategy Analysis
Table 6: DP results of CMHHC and the stricter variant CMHHC-strict.

Dataset  MNIST-USPS  BDGP  Caltech101-7  COIL-20  Multi-Fashion
CMHHC  94.49%  91.53%  66.52%  84.89%  96.25%
CMHHC-strict  76.24%  83.96%  54.44%  82.05%  93.86%
To demonstrate the effectiveness of the proposed hard mining strategy for unsupervised metric learning, we compare CMHHC with a variant, denoted CMHHC-strict, in which the proposed strategy is replaced with a much stricter triplet sampling strategy for the weighted triplet loss training. In this setting, the hard negative set for an anchor sample consists of nearest neighbors that are close to the anchor in the Euclidean space but lie on a different manifold [Iscen et al., 2018]. Therefore, we denote the hard negative set for CMHHC-strict as:

$$ N^{strict}(\tilde{z}_i) = \{\, \tilde{z}_j \mid \tilde{z}_j \in \mathrm{NN}_{k_n}^E(\tilde{z}_i),\ \tilde{z}_j \notin \mathrm{NN}_{k_n}^M(\tilde{z}_i) \,\}. \quad (15) $$
This stricter triplet sampling strategy makes CMHHC-strict focus on overly hard negatives, which tend to be too close to the anchor sample in the Euclidean space. The mined pairwise similarity information is therefore likely to contain unexpected noise, which misleads the hierarchical clustering. Besides, CMHHC-strict is limited to the local structure around the negatives, which ignores the fact that the whole clustering process takes the similarities over all instances into consideration. In contrast, our hard mining strategy, which selects negatives that are far from the anchor both on manifolds and in the Euclidean space, is more applicable to our downstream clustering task.
Table 6 shows the DP results of CMHHC and CMHHC-strict from a quantitative perspective. Clearly, our hard mining strategy improves the DP measurement on all datasets by a large margin. In other words, the relaxed strategy is capable of learning more meaningful triplets, which actually help generate a clustering-friendly embedding space.
C.5 Parameter Sensitivity Analysis
The hyperparameters of CMHHC include the temperature parameters $\tau_c$ and $\tau$ for contrastive learning and hyperbolic hierarchical clustering, as well as the number of Euclidean nearest neighbors $k$ and the hard positive and hard negative values $k_p$ and $k_n$ for similarity learning. Naturally, we set $\tau_c$ to the same value for all datasets, while $\tau$ takes one value for BDGP, MNIST-USPS, and Multi-Fashion and another for Caltech101-7 and COIL-20, which is empirically efficient. In terms of the similarity learning parameters $k$, $k_p$, and $k_n$, we follow the general rule that $k = 50$, $k_p = N/(2K)$, and $k_n = N/K$. We then evaluate the effectiveness of this general rule on all five datasets. With the other two parameters fixed in each case, we vary the $k$, $k_p$, and $k_n$ values over ranges around these defaults for all datasets. We again run the model 5 times, and the DP results with the lowest value of $\mathcal{L}_{HypHC}$ under each parameter setting are shown in Figure 3. Our CMHHC is insensitive to these settings. Since the parameter $k$ is utilized to define the manifold similarity matrix, different $k$ values result in different metric properties on manifolds. Hence, the performance of CMHHC is likely to fluctuate within a reasonable range, and around the default $k$ the DP results on different datasets tend to be stable at a better level. Besides, setting the $k_p$ and $k_n$ values via our general rule makes it easier to achieve fairly good hierarchical clustering performance. Actually, the parameters $k_p$ and $k_n$ control the diversity and the difficulty of positives and negatives. More specifically, making $k_p$ equal $N/(2K)$ and $k_n$ equal $N/K$ offers sufficient hard positives and negatives, and guarantees that these tuples capture pseudo-label information with little noise.