Deep Multi-view Semi-supervised Clustering with Sample Pairwise Constraints

by Rui Chen, et al.

Multi-view clustering has attracted much attention thanks to its capacity for integrating multi-source information. Although numerous advanced methods have been proposed in past decades, most of them overlook the significance of weakly-supervised information and fail to preserve the feature properties of multiple views, resulting in unsatisfactory clustering performance. To address these issues, we propose a novel Deep Multi-view Semi-supervised Clustering (DMSC) method, which jointly optimizes three kinds of losses during network finetuning: a multi-view clustering loss, a semi-supervised pairwise constraint loss and a multiple autoencoders reconstruction loss. Specifically, a KL divergence based multi-view clustering loss is imposed on the common representation of multi-view data to perform heterogeneous feature optimization, multi-view weighting and clustering prediction simultaneously. Then, we propose to integrate pairwise constraints into the process of multi-view clustering by enforcing the learned multi-view representation of must-link samples (cannot-link samples) to be similar (dissimilar), so that the formed clustering architecture is more credible. Moreover, unlike existing rivals that only preserve the encoders for each heterogeneous branch during network finetuning, we further propose to tune the intact autoencoder framework that contains both encoders and decoders. In this way, serious corruption of the view-specific and view-shared feature spaces can be alleviated, making the whole training procedure more stable. Through comprehensive experiments on eight popular image datasets, we demonstrate that our proposed approach performs better than the state-of-the-art multi-view and single-view competitors.


1 Introduction

Clustering, a crucial but challenging topic in both data mining and machine learning communities, aims to partition the data into different groups such that samples in the same group are more similar to each other than to those from other groups. Over the past few decades, various approaches have been explored, such as prototype-based clustering KM ; DEC , graph-based clustering SC ; SDCN , model-based clustering GMM ; DGG , density-based clustering DBSCAN ; DDC , etc. With the prevalence of deep learning technology, many studies have integrated the powerful nonlinear embedding capability of deep neural networks (DNNs) into clustering and achieved impressive clustering performance. Xie et al. DEC make use of DNNs to mine cluster-oriented features from raw data, realizing a substantial improvement over conventional clustering techniques. Bo et al. SDCN combine autoencoder representation with graph embedding and propose a structural deep clustering network (SDCN) that outperforms the other baseline methods. Yang et al. DGG develop a variational deep Gaussian mixture model (GMM) GMM to facilitate clustering. Ren et al. DDC present a deep density-based clustering (DDC) approach, which is able to adaptively estimate the number of clusters with arbitrary shapes. Despite the great success of deep clustering methods, they are applicable only to single-view clustering scenarios.

In real-world applications, data are usually described by various heterogeneous views or modalities, which are mainly collected from multiple sensors or feature extractors. For instance, in computer vision, images can be represented by different hand-crafted visual features such as Gabor Gabor , LBP LBP , SIFT SIFT , HOG HOG ; in information retrieval, web pages can be represented by page text or links to them; in intelligent security, one person can be identified by face, fingerprint, iris, signature; in medical image analysis, a subject may be associated with different types of medical images (e.g., X-ray, CT, MRI). Obviously, single-view based methods are no longer suitable for such multi-view data, and how to cluster this kind of data is still a long-standing challenge on account of the inefficient incorporation of multiple views. Consequently, numerous multi-view clustering applications have been developed to jointly deal with several types of features or descriptors.

Canonical correlation analysis (CCA) CCA seeks two projections to map two views onto a low-dimensional common subspace, in which the linear correlation between the two views is maximized. Kernel canonical correlation analysis (KCCA) KCCA resolves more complicated correlations by incorporating the kernel trick into CCA. Multi-view subspace clustering (MvSC) methods MVSC ; LMSC ; CoMSC ; CTRL ; JSTC ; T-MEK-SPL aim to utilize multi-view data to reveal the potential clustering architecture, and most of them devise a multi-view regularizer to describe the inter-view relationships between different formats of features. In recent years, a variety of DNN-based multi-view learning algorithms have emerged one after another. Deep canonical correlation analysis (DCCA) DCCA and deep canonically correlated autoencoders (DCCAE) DCCAE draw on the nonlinear mapping ability of DNNs to improve the representation capacity of CCA. Deep generalized canonical correlation analysis (DGCCA) DGCCA combines the effectiveness of deep representation learning with the generalization of integrating information from more than two independent views. Deep embedded multi-view clustering (DEMVC) DEMVC learns the consistent and complementary information from multiple views with a collaborative training mechanism to heighten clustering effectiveness. Autoencoder in autoencoder networks (AE2) AE2 jointly learns a view-specific feature for each view and encodes these features into a complete latent representation with a deep nested autoencoder framework. Cognitive deep incomplete multi-view clustering network (CDIMC) CDIMC incorporates DNN pretraining, graph embedding and self-paced learning to enhance the robustness to marginal samples while maintaining the local structure of data, and achieves superior performance.

Despite these excellent achievements, the current deep multi-view clustering methods still present two obvious drawbacks. Firstly, most previous approaches fail to take advantage of semi-supervised prior knowledge to guide multi-view clustering. It is known that pairwise constraints are easy to obtain in practice and have been frequently utilized in many semi-supervised learning scenarios semi-KM 1 ; semi-KM 2 ; semi-SC . Therefore, ignoring this kind of precious weakly-supervised information will restrict the model performance, and the constructed clustering structure is likely to be unreasonable and imperfect as well. Secondly, most existing studies discard the decoding networks during the finetuning process and thereby overlook the preservation of feature properties. Such an operation may cause serious corruption of both the view-specific and view-shared feature spaces, thus hindering the clustering performance.

To address the aforementioned shortcomings, we propose a novel Deep Multi-view Semi-supervised Clustering (DMSC) method in this paper. Our method comprises two stages: 1) parameter initialization, 2) network finetuning. In the initialization stage, we pretrain multiple deep autoencoder branches by minimizing their reconstruction losses end-to-end to extract a high-level compact feature for each view. In the finetuning stage, we consider three loss terms, i.e., multi-view clustering loss, semi-supervised pairwise constraint loss and multiple autoencoders reconstruction loss. Specifically, for the multi-view clustering loss, we adopt the KL divergence based soft assignment distribution strategy proposed by the pioneering work DMJCS to perform heterogeneous feature optimization, multi-view weighting and clustering prediction simultaneously. Then, in order to exploit the weakly-supervised pairwise constraint information that plays a key role in shaping a reasonable latent clustering structure, we introduce a constraint matrix and enforce the learned multi-view common representation to be similar for must-link samples and dissimilar for cannot-link samples. For the multiple autoencoders reconstruction loss, we tune the intact autoencoder framework for each heterogeneous branch, such that view-specific attributes are protected against unexpected destruction of the corresponding feature domain. In this way, our learned joint representation can be more robust than that of rivals who only retain the encoder part during finetuning. To sum up, the main contributions of this work are highlighted as follows:

  • We innovatively propose a deep multi-view semi-supervised clustering approach termed DMSC, which can utilize the user-given pairwise constraints as weak supervision to lead cluster-oriented representation learning for joint multi-view clustering.

  • During networks finetuning, we introduce the feature structure preservation mechanism into our model, which is conducive to ensuring both distinctiveness of the local specific view and completeness of the global shared view.

  • The proposed DMSC efficiently extracts the complementary information hidden in different views and learns cluster-friendly discriminative embeddings to improve model performance.

  • Comprehensive comparison experiments on eight widely used benchmark image datasets demonstrate that our DMSC possesses superior clustering performance against the state-of-the-art multi-view and single-view competitors. The elaborate experimental analysis confirms the effectiveness and generalization of the proposed approach.

The remainder of this paper is organized as follows. In Section 2, we briefly review the related work. Section 3 describes the details of the developed DMSC algorithm. Extensive experimental results are reported and analyzed in Section 4. Finally, Section 5 concludes this paper.

2 Related Work

This section reviews previous research closely related to this paper. We first briefly review a few antecedent works on deep clustering. Then, related studies of multi-view clustering are reviewed. Finally, we introduce the semi-supervised clustering paradigm.

2.1 Deep Clustering

Existing deep clustering approaches can be generally partitioned into two categories. One category covers methods that treat representation learning and clustering separately, i.e., project the original data into a low-dimensional feature space first, and then perform traditional clustering algorithms KM ; SC ; GMM ; DBSCAN to group feature points. Unfortunately, this kind of independent form may restrict the clustering performance because it overlooks the underlying relationships between representation learning and clustering. Another category refers to methods that apply a joint optimization criterion, performing representation learning and clustering simultaneously and showing considerable superiority over their separated counterparts. Recently, several attempts have been made to integrate representation learning and clustering into a unified framework. Inspired by t-SNE t-SNE , Xie et al. DEC propose a deep embedded clustering (DEC) model that utilizes a stacked autoencoder (SAE) to extract a high-level representation of the input data, then iteratively optimizes a KL divergence based clustering objective with the help of an auxiliary target distribution. Guo et al. IDEC further propose integrating the SAE's reconstruction loss into the DEC objective to avoid corruption of the embedded space, bringing about appreciable improvement. Yang et al. DCN combine SAE-based cluster-oriented dimensionality reduction and K-means KM clustering to jointly enhance the performance of both, which requires an alternating optimization strategy to discretely update cluster centers, cluster pseudo labels and network parameters. Drawing on the experience of hard-weighted self-paced learning, Guo et al. ASPC and Chen et al. DCSPC prioritize high-confidence samples during the clustering network training to buffer the negative impact of outliers and steady the whole training process. Ren et al. SDEC overcome the inability of DEC to guide clustering with prior information. Li et al. DBC present a discriminatively boosted clustering framework with the help of a convolutional feature extractor and a soft assignment model. Fard et al. DKM propose a joint clustering approach that reconsiders the K-means loss as the limit of a differentiable function, which enables a truly joint solution.

2.2 Multi-view Clustering

Multi-view clustering review ; SAMVC ; DCMSC ; MCIM aims to utilize the available multi-view features to learn common representation and perform clustering to obtain data partitions. With regard to shallow methods, Cai et al. RMKMC propose a robust multi-view K-means clustering (RMKMC) algorithm by introducing a shared indicator matrix across different views. Xu et al. MSPL develop an improved version of RMKMC to learn the multi-view model by simultaneously considering the complexities of both samples and views, relieving the local minima problem. Zhang et al. BMVC decompose each view into two low-rank matrices with some specific constraints and conduct a conventional clustering approach to group objects. As one of the most significant learning paradigms, canonical correlation analysis (CCA) CCA projects two views to a compact collective feature domain where the two views’ linear correlation is maximal.

With the development of deep learning, a variety of deep multi-view clustering methods have been proposed recently. Andrew et al. DCCA search for linearly correlated representations by learning nonlinear transformations of two views with deep canonical correlation analysis (DCCA). As an improvement of DCCA, Wang et al. DCCAE add autoencoder-based terms to improve the model performance. To resolve the bottleneck of the above two techniques, which can only be applied to two views, Benton et al. DGCCA further propose to learn a compact representation from data covering more than two views. More recently, Xie et al. DMJCS introduce two deep multi-view joint clustering models, in which multiple latent embeddings, a weighted multi-view learning mechanism and clustering prediction can be learned simultaneously. Xu et al. DEMVC adopt a collaborative training strategy and alternately share the auxiliary distribution to achieve consistent multi-view clustering assignment. Zhang et al. AE2 carefully design a nested autoencoder to incorporate information from heterogeneous sources into a complete representation, which flexibly balances the consistency and complementarity among multiple views. Wen et al. CDIMC combine a view-specific deep feature extractor with a graph embedding strategy to capture robust features and the local structure of each view.

2.3 Semi-supervised Clustering

Semi-supervised learning is a learning paradigm between unsupervised learning and supervised learning that can jointly use both labeled and unlabeled patterns. It usually appears in machine learning tasks such as regression, classification and clustering. In semi-supervised clustering, pairwise constraints are frequently utilized as a priori knowledge to guide the training procedure, since they are easy to obtain in practice and flexible for scenarios where the number of clusters is inaccessible. The pairwise constraints can be represented as “must-link” (ML) and “cannot-link” (CL), which record the pairwise relationship between two examples in a given dataset. Over the past few years, semi-supervised clustering with pairwise constraints has become an active area of research. For instance, semi-KM 1 ; semi-KM 2 improve classical K-means by integrating pairwise constraints. Based on the idea of modifying the similarity matrix, Kamvar et al. semi-SC incorporate constraints into spectral clustering (SC) SC such that both ML and CL can be well satisfied. Chang et al. DAC propose to recast the clustering task as a binary pairwise-classification problem, showing excellent clustering results on six image datasets. Shi et al. ConPaC utilize pairwise constraints to achieve enhanced performance in the face clustering scenario. Wang et al. SSFPC devise soft pairwise constraints to cooperate with fuzzy clustering.

In the multi-view learning field, there are also various semi-supervised applications based on pairwise constraints. Tang et al. CTRL develop a semi-supervised MvSC approach to foster representation learning with the help of a novel regularizer. Nie et al. MLAN simultaneously perform multi-view clustering and local structure discovery in a semi-supervised fashion to learn the local manifold structure of data, achieving satisfactory clustering performance. Qin et al. SSSL-M obtain a desirable shared affinity matrix for semi-supervised subspace learning by jointly learning the multiple affinity matrices, the encoding mappings, the latent representation and the block-diagonal structure-induced shared affinity matrix. Bai et al. SC-MPI incorporate multi-view constraints to mitigate the influence of inexact constraints from a certain specific view and thereby improve clustering effectiveness. Due to space limitations, we refer interested readers to for readers 1 ; for readers 2 for a comprehensive understanding.

Figure 1: The overall framework of the proposed DMSC approach.

3 Deep Multi-view Semi-supervised Clustering with Sample Pairwise Constraints

This section elaborates the proposed Deep Multi-view Semi-supervised Clustering (DMSC). Suppose one multi-view dataset with views provided, we use to represent the sample set of the -th view, where is the feature dimension and denotes the number of unlabeled instances. Given a little prior knowledge of pairwise constraints, we construct a sparse symmetric matrix () with its diagonal elements all zero to describe the connection relationship between pairwise patterns. If pairwise examples share the same label, a ML constraint is built, i.e., (), and () otherwise, generating a CL constraint. Provided that the number of clusters is predefined according to the ground-truth, our goal is to cluster these multi-view patterns into groups using prior information , and we also wish that points with the same label are near to each other, while points from different categories are far away from each other. The overall framework of our DMSC is portrayed in Figure 1.

3.1 Parameters Initialization

Similar to some previous studies DMJCS ; DEMVC ; IDEC ; DCN ; ASPC , the proposed model also needs pretraining for a better clustering initialization. In our proposal, we utilize heterogeneous autoencoders as different deep branches to efficiently extract the view-specific feature for every independent view. Specifically, in the -th view, each sample is first transformed to a -dimensional feature space by the encoder network :


and then is reconstructed by the decoder network using the corresponding -dimensional latent embedding :


where . Obviously, in an unsupervised mode, it is easy to obtain the initial high-level compact representation for view by minimizing the following loss function:


Therefore, the total reconstruction loss of all views can be computed by


After pretraining the multiple deep branches, a common practice is to directly concatenate the embedded features as and run K-means to obtain the initialized cluster centers with .
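The displayed equations of this subsection did not survive in this copy. As a reading aid, the block below gives one plausible form of Eqs. (1)-(4), reconstructed only from the surrounding description (encoder, decoder, per-view reconstruction loss, total reconstruction loss). All symbols are assumptions rather than the authors' notation: $x_i^v$ is the $i$-th sample of view $v$, $f_{\theta^v}$ and $g_{\omega^v}$ are the encoder and decoder of that view, $z_i^v$ is the latent embedding, $n$ is the number of samples and $V$ the number of views. The squared Euclidean error is also an assumption; the text only says "reconstruction loss".

```latex
\begin{align}
z_i^v &= f_{\theta^v}\!\left(x_i^v\right), \\
\hat{x}_i^v &= g_{\omega^v}\!\left(z_i^v\right), \\
L_r^v &= \sum_{i=1}^{n} \left\lVert x_i^v - \hat{x}_i^v \right\rVert_2^2, \\
L_r &= \sum_{v=1}^{V} L_r^v .
\end{align}
```

Under this reading, Eq. (3) is minimized separately for each branch during pretraining, and Eq. (4) is the term that is later kept in the finetuning objective.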

3.2 Networks Finetuning with Pairwise Constraints

In the finetuning stage, the anterior study DMJCS introduces a novel multi-view soft assignment distribution to implement the multi-view fusion, which is defined as


where indicates the importance weight that measures the importance of the cluster center for consistent clustering. As narrated in DMJCS , this multi-view soft assignment distribution (denoted as ) attains the multi-view fusion via implicitly exerting the multi-view constraint on the view-specific soft assignment (denoted as ), which is more advantageous than single-view one in DEC . Note that there are two constraints for , i.e.,




It is not hard to notice that directly optimizing the objective function with respect to is laborious. Therefore, the constrained weight can be logically represented in terms of the unconstrained weight in a softmax like form as


In this way, can definitely meet the above two limitations (6)(7) and can be expediently learned by stochastic gradient descent (SGD) as well. For simplicity, the view importance matrix is constructed to collect unconstrained weights with and being the number of clusters and the number of views respectively. To optimize the multi-view soft assignment distribution , the auxiliary target distribution is further derived as


The auxiliary target distribution can guide the clustering by enhancing the discrimination of the soft assignment distribution . As a result, with the help of and , the KL divergence based clustering loss is defined as


Building on an effective learning paradigm, multi-view learning methods modeled on DEC DEC , namely DMJCS ; DEMVC , take samples with high confidence as supervisory signals so that they become more densely distributed in each cluster, which is their main innovation and contribution. However, they fail to take advantage of user-specified pairwise constraints to boost clustering performance. In order to tackle this issue, drawing lessons from SDEC , we propose to integrate pairwise constraints into the objective (10) to bring about more robust joint multi-view representation learning and latent clustering. As mentioned earlier, the constraint matrix is used for storing ML and CL constraints. When an ML constraint is established, a pair of data points share the same cluster, while a CL constraint means that the two patterns belong to different clusters. We also expect this kind of prior information to help the model place the two instances in their correct clusters. To achieve this aim, a -norm based semi-supervised loss employed to measure the connection status between sample and sample is defined as follows:


where and are the two concatenated feature points. is a scalar variable that always satisfies the following settings:


By introducing this valuable weak supervision (), the model exerts a strong pulling force on the data, so that patterns sharing the same ground-truth label are drawn as close together as possible, while those from different categories are pushed far away from each other. As a result, the formed clustering structure is more reasonable: the elements within a cluster are tightly grouped and the distances between clusters are large enough.
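The pulling and pushing effect described above can be illustrated with a small NumPy sketch. It assumes the SDEC-style form of the loss, i.e. the constraint-weighted sum of squared Euclidean distances between concatenated embeddings, with constraint values +1 (must-link), -1 (cannot-link) and 0 (no constraint). The function name, the `1/n` normalization and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pairwise_constraint_loss(z, a):
    """Constraint-weighted squared distances (assumed form of the loss).

    z: (n, d) array of concatenated multi-view embeddings.
    a: (n, n) symmetric matrix with +1 (must-link), -1 (cannot-link), 0 (none).
    """
    # Squared Euclidean distance between every pair of rows of z.
    diff = z[:, None, :] - z[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Must-link pairs add their distance, cannot-link pairs subtract it,
    # so minimizing pulls the former together and pushes the latter apart.
    return np.sum(a * sq_dist) / z.shape[0]


# Toy example with three samples in a 2-D embedding space.
z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
a = np.zeros((3, 3))
a[0, 1] = a[1, 0] = 1.0    # samples 0 and 1 must link
a[0, 2] = a[2, 0] = -1.0   # samples 0 and 2 cannot link
loss = pairwise_constraint_loss(z, a)
```

Moving sample 1 closer to sample 0, or sample 2 further from sample 0, lowers this value, which is the behaviour the paragraph describes.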

Furthermore, to guard against the corruption of the common feature space and to protect view-specific feature properties simultaneously, inspired by IDEC ; DCN , we further propose retaining the view-specific decoders and taking their reconstruction losses into account during network finetuning. With the reconstruction part considered, a more robust shared representation can be learned, which yields a stable training process and a cluster-friendly embedding space. In summary, the objective of our enhanced model, called Deep Multi-view Semi-supervised Clustering (DMSC), can be formulated as


where indicates the total reconstruction loss of multiple deep autoencoders as Eq. (4). refers to the KL divergence between multi-view soft assignment distribution and auxiliary target distribution . represents the aforementioned semi-supervised pairwise constraint loss. and are two balance factors attached on and respectively to trade off the three terms of losses.

Optimizing Eq. (13) brings two benefits: 1) the costs of violated constraints can be minimized to generate a more reasonable cluster-oriented architecture; 2) both the local structure of each view-specific feature and the common global attributes of the multiple view features can be well preserved, which leads to better clustering results. These two advantages enable our model to jointly learn a shared high-quality representation and produce an accurate clustering assignment in a semi-supervised manner based on the user-given prior knowledge.
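The clustering part of this subsection can be sketched in NumPy as follows. The sketch assumes the usual DEC-style Student's t kernel, combined across views with per-cluster view weights obtained from a softmax over unconstrained weights, followed by the DEC-style target distribution and the KL divergence. All function names, array shapes, the default degree of freedom `alpha=1.0` and the placement of the two balance factors in `total_loss` are assumptions made for illustration; the original equations are not available in this copy.

```python
import numpy as np


def multiview_soft_assignment(z_views, mu_views, w, alpha=1.0):
    """z_views: list of V arrays (n, d_v); mu_views: list of V arrays (K, d_v);
    w: (K, V) unconstrained view-importance weights."""
    # Softmax over views: weights are non-negative and sum to 1 per cluster.
    e = np.exp(w - w.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)                      # (K, V)
    q = 0.0
    for v, (z, mu) in enumerate(zip(z_views, mu_views)):
        sq = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, K)
        kernel = (1.0 + sq / alpha) ** (-(alpha + 1.0) / 2.0)
        q = q + pi[:, v] * kernel                              # weight per cluster
    # Normalize over clusters so each row is a distribution.
    return q / q.sum(axis=1, keepdims=True), pi


def target_distribution(q):
    # Sharpen q and normalize by soft cluster frequency (DEC-style).
    weight = q ** 2 / q.sum(axis=0)
    return weight / weight.sum(axis=1, keepdims=True)


def kl_clustering_loss(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))


def total_loss(l_r, l_c, l_s, gamma, lam):
    # Assumed placement of the two balance factors on the clustering
    # and constraint terms; the source does not show which term is unweighted.
    return l_r + gamma * l_c + lam * l_s


rng = np.random.default_rng(0)
n, K = 6, 3
z_views = [rng.normal(size=(n, 4)), rng.normal(size=(n, 5))]
mu_views = [rng.normal(size=(K, 4)), rng.normal(size=(K, 5))]
w = np.zeros((K, 2))                     # equal view weights after softmax
q, pi = multiview_soft_assignment(z_views, mu_views, w)
p = target_distribution(q)
l_c = kl_clustering_loss(p, q)
```

Because the weights pass through a softmax, the unconstrained matrix `w` can be updated by plain gradient steps without violating the two constraints on the importance weights.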

3.3 Optimization

In this subsection, we focus on the optimization in the finetuning stage, where mini-batch stochastic gradient descent (SGD) and backpropagation (BP) are used to optimize the loss function (13). Specifically, four types of variables need to be updated: network parameters , , cluster center , unconstrained importance weight and target distribution . Note that the constrained importance weight is initialized as and the initial network parameters , are obtained by pretraining the heterogeneous network branches (i.e., by minimizing Eq. (3) for each view).

3.3.1 Update , , ,

With target distribution fixed, the gradients of with respect to feature point , cluster center , and unconstrained importance weight for the -th view are respectively computed as


where can be described as the distance between and . Let


since set in DEC , thus the gradient derivations of and are


Similarly, it is easy to prove that the gradients of with respect to can be expressed as follows:


It is evidently clear that the gradients and () can be passed down to the corresponding deep network to further compute and during backpropagation (BP). As a result, given a mini-batch with samples and learning rate , the network parameters , are updated by


The cluster center and the unconstrained importance weight are updated by


3.3.2 Update

Although the target distribution serves as a ground-truth soft label to facilitate clustering, it also depends on the predicted soft assignment . Hence, to avoid numerical instability, we should not update at each iteration using only a mini-batch of samples. In practice, should be updated considering all embedded feature points every iterations. The update interval is determined jointly by both sample size and mini-batch size . After is updated, the pseudo label for sample is obtained by


3.3.3 Stopping Criterion

If the change in predicted pseudo labels between two consecutive update intervals is not greater than a threshold , we will terminate the training procedure. Formally, the stopping criterion can be written as


where and are indicators for whether the -th example is clustered to the -th group at the -th and -th iteration, respectively. We empirically set in our subsequent experiments.
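The target refresh, pseudo-label assignment and stopping rule of Sections 3.3.2 and 3.3.3 can be sketched as below. The sketch assumes that pseudo labels are the arg-max of the soft assignment, that the target is refreshed whenever the iteration index is a multiple of the update interval, and that the stopping test compares the fraction of changed labels with the threshold. The function names and the threshold values in the example are hypothetical; the value used in the paper is not recoverable from this copy.

```python
import numpy as np


def refresh_due(iteration, update_interval):
    # Assumed schedule: refresh the target every `update_interval` iterations.
    return iteration % update_interval == 0


def pseudo_labels(q):
    # Each sample takes the cluster with the largest soft assignment.
    return np.argmax(q, axis=1)


def label_change_ratio(y_new, y_old):
    # Fraction of samples whose pseudo label changed between two refreshes.
    return float(np.mean(y_new != y_old))


def should_stop(y_new, y_old, delta):
    # "Not greater than a threshold" is read as <= delta.
    return y_old is not None and label_change_ratio(y_new, y_old) <= delta


q_prev = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
q_curr = np.array([[0.8, 0.2], [0.1, 0.9], [0.6, 0.4], [0.7, 0.3]])
y_prev = pseudo_labels(q_prev)
y_curr = pseudo_labels(q_curr)
ratio = label_change_ratio(y_curr, y_prev)
```

In this toy case one of four samples switches cluster, so training would continue for any threshold below one quarter.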

Input: Dataset ; Number of clusters ; Maximum iterations ; Update interval ; Stopping threshold ; Degree of freedom ; Proportion of prior knowledge ; Parameters and .
Output: Clustering assignment .
// Initialization
Initialize by (12), (30);
Initialize , by minimizing (3);
Initialize , by performing K-means on ;
// Finetuning
for  do
    Select a mini-batch with samples and set the learning rate as ;
    if  then
        Update by (5), (9);
        Update by (25);
    end if
    if Stopping criterion (26) is met then
        Terminate training.
    end if
    Update , by (21), (22);
    Update by (23);
    Update by (24);
end for
Algorithm 1 Deep Multi-view Semi-supervised Clustering with Sample Pairwise Constraints

The entire optimization process is summarized in Algorithm 1. By iteratively updating the above variables, the proposed DMSC can, in theory, converge to a local optimum.

4 Experiment

In this section, we carry out comprehensive experiments to investigate the performance of our DMSC. All experiments are implemented on a standard Linux server with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90 GHz, 376 GB RAM, and two NVIDIA A40 GPUs (48 GB memory each).

4.1 Datasets

  • USPS consists of grayscale handwritten digit images with a size of pixels from categories.

  • COIL20 includes gray object images from categories, which are shot from different angles. The resized version of is adopted in our experiments.

  • MEDICAL is a simple medical dataset in dimension. There are medical images belonging to classes, i.e., abdomen CT, breast MRI, chest X-ray, chest CT, hand X-ray, head CT.

  • FASHION is a collection of fashion product images from classes, with image size and one image channel.

  • STL10 contains color images with the size of pixels from object categories.

  • COIL100 incorporates image samples of object categories, with image size and three image channels.

  • CALTECH101 contains irregular object images from classes, and is widely utilized in the field of multi-view learning.

  • CIFAR10 comprises RGB images of object classes, whose image size is standardized as .

The properties and examples are summarized in Table 1 and Figure 2. Since these datasets have been split into a training set and a testing set, both subsets are jointly utilized for clustering analysis. Besides, in our experiments, the aforementioned datasets are rescaled to for each entity before being fed into model training.

Dataset Instance Category Size Channel
Table 1: The properties of datasets.
Figure 2: The examples of datasets. Panels: (a) USPS, (b) COIL20, (c) MEDICAL, (d) FASHION, (e) STL10, (f) COIL100, (g) CALTECH101, (h) CIFAR10.

4.2 Evaluation Metrics

We adopt three standard metrics, i.e., clustering accuracy (ACC) ACC , normalized mutual information (NMI) NMI and adjusted rand index (ARI) ARI , to evaluate the performance of different clustering methods. Their definitions can be formulated as follows:


where is the sample size. and denote the ground-truth label and the clustering assignment generated by the model for the -th pattern, respectively. is the permutation function, which embraces all possible one-to-one projections from clusters to labels. The best projection can be efficiently computed by the Hungarian Hungarian algorithm algorithm.


where and represent the mutual information between and and the entropic cost, respectively.


where is the expectation of the rand index (RI) RI .

Note that ACC and NMI range within , while the range of ARI is , and a higher score indicates a better clustering performance. Generally, the aforesaid metrics are extensively considered in various clustering literature CTRL ; JSTC ; T-MEK-SPL ; DCSPC ; DMNEC . Each one offers pros and cons, but using them together is sufficient to test the effectiveness of the clustering algorithms.
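The three metrics can be computed as in the following sketch, which is an illustration rather than the evaluation code used in the paper. ACC uses SciPy's `linear_sum_assignment` (a Hungarian-style solver) to find the best one-to-one cluster-to-label mapping. NMI is normalized here by the arithmetic mean of the two entropies, which is an assumption because the original formula is not available in this copy. ARI follows the standard contingency-table definition. Function names and the toy labels are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency(y_true, y_pred):
    """Count matrix with rows = predicted clusters, columns = true classes."""
    _, t = np.unique(np.asarray(y_true).ravel(), return_inverse=True)
    _, c = np.unique(np.asarray(y_pred).ravel(), return_inverse=True)
    table = np.zeros((c.max() + 1, t.max() + 1), dtype=np.int64)
    np.add.at(table, (c, t), 1)
    return table


def clustering_accuracy(y_true, y_pred):
    table = contingency(y_true, y_pred)
    # Best one-to-one mapping from clusters to labels.
    rows, cols = linear_sum_assignment(table, maximize=True)
    return table[rows, cols].sum() / table.sum()


def nmi(y_true, y_pred):
    pxy = contingency(y_true, y_pred).astype(float)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
    entropy = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    denom = 0.5 * (entropy(px) + entropy(py))   # assumed normalization
    return mi / denom if denom > 0 else 1.0


def ari(y_true, y_pred):
    table = contingency(y_true, y_pred)
    comb2 = lambda x: x * (x - 1) / 2.0
    index = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(table.sum())
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:                   # degenerate partitions
        return 1.0
    return (index - expected) / (max_index - expected)


y_true = [0, 0, 1, 1, 2, 2]
y_perm = [1, 1, 0, 0, 2, 2]    # same partition, different cluster names
y_coarse = [0, 0, 0, 1, 1, 1]  # only two clusters found
```

All three scores are invariant to how clusters are named, which is why `y_perm` scores the same as the ground truth itself.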

4.3 Compared Methods

Several clustering methods are chosen to comprehensively compare with the proposed DMSC, which can be roughly grouped as: 1) single-view methods, including autoencoder (AE), deep embedded clustering (DEC) DEC , improved deep embedded clustering (IDEC) IDEC , deep clustering network (DCN) DCN , adaptive self-paced clustering (ASPC) ASPC , semi-supervised deep embedded clustering (SDEC) SDEC ; 2) multi-view methods, containing robust multi-view K-means clustering (RMKMC) RMKMC , multi-view self-paced clustering (MSPL) MSPL , deep canonical correlation analysis (DCCA) DCCA , deep canonically correlated autoencoders (DCCAE) DCCAE , deep generalized canonical correlation analysis (DGCCA) DGCCA , deep multi-view joint clustering with soft assignment distribution (DMJCS) DMJCS , deep embedded multi-view clustering (DEMVC) DEMVC .

4.4 Experimental Setups

In this subsection, we will introduce the experimental setups in detail, including pretraining setup, prior knowledge utilization and finetuning setup.

4.4.1 Pretraining Setup

For one-channel image datasets, we use a stacked autoencoder (SAE) and a convolutional autoencoder (CAE) as two different deep network branches to extract low-dimensional multi-view features. Specifically, the raw image vectors and pixels are fed into the SAE and the CAE, respectively. For three-channel image datasets, two SAEs with different structures and data sources serve as the two branches; their inputs are the features extracted by DenseNet121 and InceptionV3 models pretrained on ILSVRC2012 (ImageNet Large Scale Visual Recognition Competition in 2012). During pretraining, the Adam optimizer [59] is used with a fixed initial learning rate, number of epochs and batch size to train the multi-view branches in an end-to-end fashion. Moreover, all internal layers of each branch are activated by the ReLU [60] nonlinearity, and the Xavier [61] method is employed as the layer kernel initializer.
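As a rough illustration of the initializer and activation named above, the following sketch draws a weight matrix with the uniform variant of Xavier (Glorot) initialization and applies one ReLU-activated dense layer. The layer sizes, function names and the choice of the uniform variant are assumptions made for this example; the actual branch architectures are not reproduced here.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out)) as in Glorot and Bengio."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def relu(x):
    return max(0.0, x)


def dense_relu(inputs, weights, biases):
    """One fully connected layer followed by ReLU."""
    return [
        relu(sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j])
        for j in range(len(biases))
    ]


rng = random.Random(0)
# Hypothetical layer sizes, chosen only to keep the example small.
weights = xavier_uniform(fan_in=8, fan_out=4, rng=rng)
hidden = dense_relu([0.5] * 8, weights, [0.0] * 4)
print(hidden)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant from layer to layer at the start of training.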

Dataset    Branch        Encoder Input
1-channel  View 1 (SAE)  Raw image vectors
           View 2 (CAE)  Raw image pixels
3-channel  View 1 (SAE)  DenseNet121 feature
           View 2 (SAE)  InceptionV3 feature
Table 2: The experimental configurations.

4.4.2 Prior Knowledge Utilization

The pairwise constraint matrix is randomly constructed on the basis of the ground-truth labels of each dataset. We pick pairs of data samples from the dataset at random and apply the following rule: if the two samples share the same label, a connected (must-link) constraint is generated; otherwise, a disconnected (cannot-link) constraint is established.

Note that the constraint matrix is symmetric and sparse, so only a limited number of its entries needs to be specified. Accordingly, a scale factor controls how many pairwise constraints are supplied to the learning model. It is fixed in advance in our experiments, and its sensitivity is analyzed and discussed later in Section 4.7.
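The construction described above can be sketched as follows. The encoding (+1 for a connected pair, -1 for a disconnected pair, 0 for an unconstrained pair), the function names and the way the number of pairs is passed in are assumptions of this sketch rather than details taken from the paper.

```python
import random


def sample_pairwise_constraints(labels, num_pairs, seed=0):
    """Randomly pick distinct sample pairs and label each as
    +1 (same ground-truth label) or -1 (different labels)."""
    n = len(labels)
    if num_pairs > n * (n - 1) // 2:
        raise ValueError("more pairs requested than distinct pairs available")
    rng = random.Random(seed)
    constraints = {}
    while len(constraints) < num_pairs:
        i, j = rng.sample(range(n), 2)  # two distinct sample indices
        key = (min(i, j), max(i, j))    # store each unordered pair once
        constraints[key] = 1 if labels[i] == labels[j] else -1
    return constraints


def to_symmetric_matrix(constraints, n):
    """Expand the sampled pairs into a symmetric n x n constraint matrix."""
    matrix = [[0] * n for _ in range(n)]
    for (i, j), value in constraints.items():
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix


labels = [0, 0, 1, 1, 2]
pairs = sample_pairwise_constraints(labels, num_pairs=4, seed=0)
matrix = to_symmetric_matrix(pairs, len(labels))
print(pairs)
```

Storing only the unordered pairs and expanding them on demand reflects the symmetry and sparsity noted in the text: each sampled pair fills two mirrored entries of the matrix.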

4.4.3 Finetuning Setup

Unlike some preceding studies [2], [24], [43], [44], [45] that retain only the encoding block during model finetuning, we preserve the end-to-end structure of each branch (i.e., we keep both the encoder and the decoder) to protect the feature properties of the multi-view data. The entire clustering network is trained for a fixed number of epochs with the Adam optimizer [59] at its default learning rate. The batch size, the importance coefficients of the clustering loss and the constraint loss, the threshold in the stopping criterion, and the degree of freedom of the Student's t-distribution are all fixed in advance. The number of clusters is given as a priori knowledge according to the ground truth, i.e., it equals the ground-truth number of clusters.
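To make the role of the Student's t-distribution and the KL divergence based clustering loss more concrete, here is a minimal sketch in the style of DEC [2]. It assumes the DEC formulation of the soft assignment and the sharpened target distribution; the exact form used by DMSC, including its multi-view weighting, is not reproduced, and the default degree of freedom of 1 is DEC's choice, used here only as a placeholder.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_assign(embeddings, centers, alpha=1.0):
    """Student's t kernel between each embedding and each cluster center,
    normalized per sample (alpha is the degree of freedom)."""
    q = []
    for z in embeddings:
        w = [(1.0 + sq_dist(z, c) / alpha) ** (-(alpha + 1.0) / 2.0)
             for c in centers]
        total = sum(w)
        q.append([v / total for v in w])
    return q


def target_distribution(q):
    """Sharpen q: square each entry, divide by the soft cluster frequency,
    then renormalize per sample."""
    freq = [sum(row[j] for row in q) for j in range(len(q[0]))]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(len(row))]
        total = sum(w)
        p.append([v / total for v in w])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Toy embeddings and centers, chosen only for illustration.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]]
centers = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(embeddings, centers)
p = target_distribution(q)
print(kl_divergence(p, q))
```

Minimizing this divergence pulls each sample toward the cluster it is already most confident about, which is why the clustering loss is weighted against the reconstruction loss during finetuning.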

The above experimental configurations are summarized in Table 2. Besides, for single-view methods, we take the raw images and the concatenated ImageNet features as the network input on grayscale and color image datasets, respectively. For DCCA [21], DCCAE [22] and DGCCA [23], we concatenate the multiple latent features obtained from their model training and directly perform K-means. For RMKMC [51] and MSPL [52], the low-dimensional embeddings learned by our pretrained multi-view branches are used as their multiple inputs. For DEMVC [25] and DMJCS [24], we follow the corresponding recommended model configurations. For a reasonable estimate, we perform multiple random restarts for all experiments and report the average results. All experiments are implemented with Python 3.7 and TensorFlow.

4.5 Experimental Comparison

Table 3 and Table 4 list the clustering results of the compared methods, where a blank entry indicates that the experimental results or codes are unavailable from the corresponding paper, and boldface refers to the best clustering result. As illustrated, DMSC achieves the highest scores on all metrics and all datasets among the multi-view methods (Type MvC), demonstrating its superiority over state-of-the-art deep multi-view clustering algorithms. In particular, the advantages of DMSC over DMJCS [24] verify that: 1) the feature space protection (FSP) mechanism helps preserve the properties of both the view-specific embedding and the view-shared representation; 2) the user-offered semi-supervised signals help form a more reliable clustering structure.

Moreover, we compare the proposed DMSC with some advanced single-view methods. The quantitative results are listed under Type SvC. The single-view rivals cannot mine complementary information because they process only one view, which leads to poorer performance. In contrast, DMSC handles multi-view information flexibly, so that the view-specific features and the complementary information concealed in different views are learned simultaneously as a robust global representation, yielding a satisfactory clustering result. Additionally, as a joint learning based clustering algorithm, DMSC outperforms the corresponding representation-based approaches (i.e., AE-View1, AE-View2, AE-View1,2) in all cases for all metrics. This demonstrates that combining feature learning with pattern partitioning provides a more appropriate representation for clustering analysis and supports the joint optimization criterion.

SvC AE-View1 0.7198 0.7036 0.6017 0.5643 0.7206 0.5101 0.6513 0.7157 0.5603 0.5862 0.5899 0.4522
AE-View2 0.7388 0.7309 0.6506 0.6729 0.7676 0.5986 0.7208 0.8180 0.7015 0.6250 0.6476 0.4925
AE-View1,2 0.7425 0.7413 0.6613 0.6764 0.7751 0.6037 0.7227 0.8206 0.7030 0.6268 0.6496 0.4961
DEC [2] 0.7532 0.7544 0.6852 0.5756 0.7650 0.5555 0.6633 0.7300 0.5842 0.5923 0.6076 0.4632
IDEC [41] 0.7680 0.7794 0.7080 0.5990 0.7702 0.5745 0.6836 0.7753 0.6358 0.5977 0.6348 0.4740
DCN [42] 0.7367 0.7353 0.6354 0.5982 0.7463 0.5430 0.6670 0.7350 0.5723 0.5947 0.6329 0.4678
ASPC [43] 0.7578 0.7673 0.6753 0.5935 0.7644 0.5535 0.6960 0.7692 0.6155 0.6036 0.6385 0.4806
SDEC [44] 0.7630 0.7705 0.6995 0.5915 0.7717 0.5650 0.6748 0.7466 0.6055 0.6028 0.6243 0.4754
MvC RMKMC [51] 0.7441 0.7278 0.6667 0.5799 0.7487 0.5275 0.5912 0.6169 0.4636
MSPL [52] 0.7414 0.7174 0.6370 0.5992 0.7623 0.5608 0.5607 0.6068 0.4457
DCCA [21] 0.4042 0.3895 0.2480 0.5512 0.7013 0.4600 0.4105 0.4028 0.2342
DCCAE [22] 0.3793 0.3895 0.2135 0.5551 0.7058 0.4667 0.4109 0.3836 0.2303
DGCCA [23] 0.5473 0.5079 0.4011 0.5337 0.6762 0.4370 0.4765 0.4827 0.3105
DMJCS [24] 0.7727 0.7941 0.7207 0.6986 0.8001 0.6384 0.7341 0.7837 0.6737 0.6370 0.6628 0.5143
DEMVC [25] 0.7803 0.8051 0.7245 0.7033 0.8049 0.6453 0.7387 0.8199 0.7006 0.6357 0.6605 0.5006
DMSC (ours) 0.7866 0.8163 0.7380 0.7126 0.8180 0.6644 0.7451 0.8287 0.7116 0.6401 0.6686 0.5183

Note that both SDEC and DMSC are with the semi-supervised learning paradigm.

Table 3: The experimental comparison on grayscale image datasets.
Type Method STL10 COIL100 CALTECH101 CIFAR10 (ACC, NMI and ARI per dataset)
SvC AE-View1 0.7521 0.7218 0.6367 0.7546 0.9278 0.7473 0.5088 0.7555 0.4259 0.5300 0.4384 0.3335
AE-View2 0.8716 0.8401 0.8023 0.7079 0.9152 0.7018 0.5877 0.8019 0.4636 0.6586 0.5697 0.4696
AE-View1,2 0.9098 0.8706 0.8478 0.7676 0.9393 0.7706 0.6096 0.8278 0.4924 0.6658 0.5883 0.4965
DEC [2] 0.9574 0.9106 0.9091 0.7794 0.9459 0.7779 0.6282 0.8364 0.5261 0.6744 0.5930 0.5137
IDEC [41] 0.9605 0.9150 0.9155 0.7921 0.9481 0.7955 0.6373 0.8393 0.5398 0.6866 0.6072 0.5298
DCN [42] 0.9318 0.8965 0.8781 0.7771 0.9399 0.7726 0.6626 0.8418 0.6022 0.6828 0.6326 0.5308
ASPC [43] 0.9381 0.9061 0.8908 0.7854 0.9497 0.7869 0.6729 0.8495 0.6087 0.6692 0.6162 0.5153
SDEC [44] 0.9585 0.9120 0.9115 0.7942 0.9523 0.8009 0.6433 0.8450 0.5471 0.6953 0.6141 0.5353
MvC RMKMC [51] 0.8344 0.8273 0.7635 0.5714 0.4688 0.3679
MSPL [52] 0.7414 0.7174 0.6370 0.7156 0.5948 0.5174
DCCA [21] 0.8411 0.7477 0.6917 0.4242 0.3385 0.2181
DCCAE [22] 0.8235 0.7273 0.6632 0.3960 0.3226 0.2034
DGCCA [23] 0.8960 0.8218 0.7970 0.4703 0.3577 0.2634
DMJCS [24] 0.9374 0.9063 0.8989 0.7841 0.9532 0.7991 0.6998 0.8578 0.7054 0.7184 0.6188 0.5527
DEMVC [25] 0.9582 0.9121 0.9132 0.7563 0.9382 0.7626 0.6719 0.8419 0.6991 0.6998 0.6351 0.5457
DMSC (ours) 0.9679 0.9268 0.9305 0.8077 0.9569 0.8159 0.7161 0.8593 0.7230 0.7337 0.6442 0.5712

Note that both SDEC and DMSC are with the semi-supervised learning paradigm.

Table 4: The experimental comparison on RGB image datasets.
benchmark 0.7727 0.7941 0.7207 0.6986 0.8001 0.6384 0.7341 0.7837 0.6737 0.6370 0.6628 0.5143
0.7825 0.8087 0.7326 0.7049 0.8054 0.6458 0.7363 0.7921 0.6790 0.6386 0.6647 0.5165
0.7780 0.8142 0.7353 0.7052 0.8176 0.6574 0.7358 0.8201 0.7033 0.6349 0.6663 0.5172
DMSC (ours) 0.7866 0.8163 0.7380 0.7126 0.8180 0.6644 0.7451 0.8287 0.7116 0.6401 0.6686 0.5183
Table 5: The performance of DMSC with different configurations on grayscale image datasets.
benchmark 0.9374 0.9063 0.8989 0.7841 0.9532 0.7991 0.6998 0.8578 0.7054 0.7184 0.6188 0.5527
0.9579 0.9115 0.9102 0.7962 0.9556 0.8097 0.7029 0.8588 0.7136 0.7288 0.6276 0.5587
0.9368 0.9123 0.8996 0.7892 0.9510 0.8010 0.7054 0.8562 0.7070 0.7243 0.6263 0.5582
DMSC (ours) 0.9679 0.9268 0.9305 0.8077 0.9569 0.8159 0.7161 0.8593 0.7230 0.7337 0.6442 0.5712
Table 6: The performance of DMSC with different configurations on RGB image datasets.

4.6 Ablation Study

As discussed above, the main contributions of the proposed DMSC are the use of prior pairwise constraint information and the introduction of the (view-specific/common) feature properties protection paradigm to jointly carry out weighted multi-view representation learning and coherent clustering assignment. This subsection therefore explores the importance of the semi-supervised (SEMI) module and the feature space protection (FSP) mechanism. Table 5 and Table 6 report the ablation results, in which each row corresponds to a configuration with a specific part of DMSC enabled or disabled. When either part is individually added to the benchmark model DMJCS [24], performance improves in almost all cases. Furthermore, when both SEMI and FSP are considered simultaneously, DMSC achieves the best clustering performance on the eight image datasets for all metrics. This observation supports integrating the semi-supervised learning paradigm and the feature space preservation mechanism into a deep multi-view clustering model: the prior knowledge of pairwise constraints guides the clustering process toward a more robust cluster-oriented shared representation, while the protection of feature properties keeps that representation from being distorted.

Figure 3: Clustering performance vs. the clustering loss coefficient on (a) USPS and (b) STL10.
Figure 4: Clustering performance vs. the constraint loss coefficient on (a) USPS and (b) STL10.
Figure 5: Clustering performance vs. the prior knowledge proportion on (a) USPS and (b) STL10.
Figure 6: Clustering performance vs. the number of clusters.

4.7 Parameter Analysis

In this subsection, we discuss how four hyper-parameters, i.e., the clustering loss coefficient, the constraint loss coefficient, the prior knowledge proportion and the number of clusters, affect the performance of the proposed DMSC.

We first probe the sensitivity of the clustering loss coefficient, which is attached to the clustering term to protect feature properties. As the comparative experiments in Section 4.5 show, DMSC works well with its default setting. Figure 3 shows how the model performs with different values. When the coefficient is zero, the clustering constraint loses efficacy, leading to poor performance. As the coefficient gradually increases, the KL divergence based clustering constraint takes effect again and clustering performance improves. Moreover, as the coefficient increases further, the metrics fluctuate only mildly, which means that the model yields satisfactory performance over a suitable range and is insensitive to the specific value of this coefficient.

Next, the sensitivity to the constraint loss coefficient is presented in Figure 4. The three metrics rise sharply as the coefficient changes from zero to non-zero, and then remain stable as it increases further. This observation suggests that even a small amount of pairwise constraint prior knowledge is exploited by the model to improve its representation learning and clustering capability.

After that, we analyze the prior knowledge proportion, which determines how many pairwise constraints are supplied for model training. As seen from Figure 5, as the proportion increases, the performance of DMSC generally improves at the beginning and then stabilizes over a wide range, which suggests that incorporating such a pairwise constraint based semi-supervised learning rule into a deep multi-view clustering model yields satisfactory performance by capturing prior information.

With regard to the number of clusters, we have assumed in the preceding experiments in Section 4.5 and Section 4.6 that it is predefined on each dataset based on the ground-truth labels. Nevertheless, this is a strong assumption: in many real-world clustering applications, the number of clusters is unknown. Hence, we run our model on the STL10 dataset with different numbers of clusters to search for the optimal value. As presented in Figure 6, the model achieves the highest scores with ten clusters, i.e., it tends to group these objects into ten clusters, which is in accordance with the ground-truth labels.

Stage           Method      ACC     NMI     ARI
Initialization  View1       0.7112  0.6926  0.5940
                View2       0.7317  0.7169  0.6348
                View3       0.7252  0.7114  0.6268
                View1,2,3   0.7492  0.7458  0.6678
Finetuning      View1,2     0.7866  0.8163  0.7380
                View1,3     0.7782  0.8131  0.7362
                View2,3     0.7804  0.8119  0.7317
                View1,2,3   0.7873  0.8263  0.7452
Table 7: The robustness study for the number of views on the USPS dataset.
(a) View1 (b) View2 (c) View3 (d) View1,2,3
Figure 7: t-SNE visualization of the cluster initialization.

4.8 Robustness Study

Note that in the previous experiments in Sections 4.5, 4.6 and 4.7, we assumed that the number of views on each dataset is two. However, in many real-world applications, the number of views is usually larger. Consequently, we run our model on the USPS dataset with three deep branches to study its robustness with regard to the number of views, again measured by ACC/NMI/ARI. Specifically, the SAE and CAE defined in Table 2 serve as the first two views, and a variational autoencoder (VAE) is used as the third view. Its encoder is built from convolutional layers, each specified by its number of filters, kernel size and stride length, and fully connected layers, each specified by its number of neurons; the decoding network mirrors the encoder. The results of the robustness experiment are presented in Table 7 and Figure 7. In general, simply concatenating the three heterogeneous features does yield better initialization performance than any single view: compare View1,2,3 (multi-view concatenated feature) with View1/2/3 (single-view feature) in the Initialization stage of Table 7 and the t-SNE [62] visualization in Figure 7. Meanwhile, after finetuning proceeds to convergence, the three-view model achieves better clustering performance than the two-view ones (see the Finetuning stage of Table 7), which implies that the three types of feature embeddings obtained by SAE, CAE and VAE complement each other well under our DMSC framework. In short, DMSC generalizes relatively well with respect to the number of views.

(a) USPS (b) COIL20 (c) STL10 (d) CIFAR10
Figure 8: Clustering performance vs. iterations.
(a) Original (b) Initial (c) Iterative (d) Final
Figure 9: t-SNE visualization of the clusters during networks finetuning.

4.9 Convergence Analysis

To study the convergence of the proposed DMSC, we first record the three evaluation metrics over iterations on the USPS, COIL20, STL10 and CIFAR10 datasets. As shown in Figure 8, each metric exhibits a distinct upward trend in the first few iterations, and all metrics eventually stabilize. Moreover, we use t-SNE [62] to visualize the learned common representation at different stages of the training process on a subset of the USPS dataset (see Figure 9). The feature points mapped from raw pixels overlap heavily, which reflects the difficulty of the clustering task, as shown in Figure 9-(a). After parameter initialization, shown in Figure 9-(b), the merged features embedded by the multiple deep branches are more separated than the original features and a preliminary clustering structure has formed, because the learned initial representations are high-level and cluster-oriented. As finetuning proceeds until the model converges, the feature points become stable and well separated, as displayed in Figure 9-(c)(d). Overall, Figure 8 and Figure 9 illustrate that DMSC converges in practice.
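One common way to turn "all metrics eventually reach stability" into an automatic stopping rule is to monitor how many samples change cluster between consecutive updates, as done in DEC [2]. The sketch below assumes that kind of criterion; the section does not spell out the exact rule behind the threshold mentioned in Section 4.4.3, and the tolerance value and function names here are hypothetical.

```python
def changed_fraction(previous, current):
    """Fraction of samples whose cluster assignment differs between updates."""
    return sum(a != b for a, b in zip(previous, current)) / len(current)


def first_stable_step(assignment_history, tol):
    """Return the index of the first update whose change fraction is below
    tol, or None if the assignments never stabilize."""
    previous = None
    for step, current in enumerate(assignment_history):
        if previous is not None and changed_fraction(previous, current) < tol:
            return step
        previous = current
    return None


# Hypothetical assignment snapshots recorded after successive updates.
history = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],  # one of four samples moved: 25% changed
    [0, 1, 1, 1],  # nothing moved: 0% changed
]
print(first_stable_step(history, tol=0.1))
```

A label-free criterion of this kind matters in practice because ACC, NMI and ARI require ground-truth labels, which are available for benchmarking but not at deployment time.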

5 Conclusion

In this paper, we propose a novel deep multi-view semi-supervised clustering approach DMSC, which can boost the performance of multi-view clustering effectively by deriving the weakly-supervised information contained in sample pairwise constraints and protecting the feature properties of multi-view data. Using pairwise constraints prior knowledge during model training is beneficial for shaping a reliable clustering structure. The feature properties protection mechanism effectively prevents view-specific and view-shared feature from being distorted in clustering optimization. In comparison with existing state-of-the-art single-view and multi-view clustering competitors, the proposed method achieves the best performance on eight benchmark image datasets. Future work may cover conducting more trials on large-scale image, text, audio datasets with multiple views to ensure the model generalization and further exploring more advanced multi-view weighting technique for robust common representation learning to enhance multi-view clustering performance.


The authors are thankful for the financial support by the National Key Research and Development Program of China (2020AAA0109500), the Key-Area Research and Development Program of Guangdong Province (2019B010153002), the National Natural Science Foundation of China (62106266, 61961160707, 61976212, U1936206, 62006139).


  • (1) J. MacQueen, “Some methods for classification and analysis of multivariate observations”, in Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, 1967.
  • (2) J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis”, in International Conference on Machine Learning, pp. 478-487, 2016.
  • (3) J. Shi and J. Malik, “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
  • (4) D. Bo, X. Wang, C. Shi, M. Zhu, E. Lu, and P. Cui, “Structural deep clustering network”, in International World Wide Web Conferences, pp. 1400-1410, 2020.
  • (5) C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
  • (6) L. Yang, N. Cheung, J. Li, and J. Fang, “Deep clustering by Gaussian mixture variational autoencoders with graph embedding”, in International Conference on Computer Vision, pp. 6439-6448, 2019.
  • (7) M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, in International Conference on Knowledge Discovery and Data Mining, pp. 226-231, 1996.
  • (8) Y. Ren, N. Wang, M. Li, and Z. Xu, “Deep density-based image clustering”, Knowledge-Based Systems, vol. 197, 2020.
  • (9) M. Lades, J. C. Vorbruggen, J. Buhmann, J. Lange, C. V. D. Malsburg, R. P. Wurtz, and W. Konen, “Distortion invariant object recognition in the dynamic link architecture”, IEEE Transactions on Computers, vol. 42, no. 3, pp. 300-311, 1993.
  • (10) T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution grayscale and rotation invariant texture classification with local binary patterns”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, 2002.
  • (11) D. G. Lowe, “Distinctive image features from scale-invariant keypoints”, International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
  • (12) N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection”, in IEEE Conference on Computer Vision and Pattern Recognition, pp. 886-893, 2005.
  • (13) T. W. Anderson, “An introduction to multivariate statistical analysis”, Technical Report, 1962.
  • (14) S. Akaho, “A kernel method for canonical correlation analysis”, arXiv preprint arXiv:cs/0609071, 2006.
  • (15) H. Gao, F. Nie, X. Li, and H. Huang, “Multi-view subspace clustering”, International Conference on Computer Vision, pp. 4238-4246, 2015.
  • (16) C. Zhang, H. Fu, Q. Hu, X. Cao, Y. Xie, D. Tao, and D. Xu, “Generalized latent multi-view subspace clustering”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 1, pp. 86-99, 2020.
  • (17) J. Liu, X. Liu, Y. Yang, X. Guo, M. Kloft, and L. He, “Multiview subspace clustering via co-training robust data representation”, IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • (18) Y. Tang, Y. Xie, C. Zhang, and W. Zhang, “Constrained tensor representation learning for multi-view semi-supervised subspace clustering”, IEEE Transactions on Multimedia, 2021.
  • (19) Y. Tang, Y. Xie, C. Zhang, Z. Zhang, and W. Zhang, “One-step multiview subspace segmentation via joint skinny tensor learning and latent clustering”, IEEE Transactions on Cybernetics, 2021.
  • (20) Y. Tang, Y. Xie, X. Yang, J. Niu, and W. Zhang, “Tensor multi-elastic kernel self-paced learning for time series clustering”, IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 3, pp. 1223-1237, 2021.
  • (21) G. Andrew, R. Arora, J. Bilmes, and K. Livescu, “Deep canonical correlation analysis”, in International Conference on Machine Learning, pp. 1247-1255, 2013.
  • (22) W. Wang, R. Arora, K. Livescu, and J. Bilmes, “On deep multi-view representation learning”, in International Conference on Machine Learning, pp. 1083-1092, 2015.
  • (23) A. Benton, H. Khayrallah, B. Gujral, D. Reisinger, S. Zhang, and R. Arora, “Deep generalized canonical correlation analysis”, arXiv preprint arXiv:1702.02519, 2017.
  • (24) Y. Xie, B. Lin, Y. Qu, C. Li, W. Zhang, L. Ma, Y. Wen, and D. Tao, “Joint deep multi-view learning for image clustering”, IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 11, pp. 3594-3606, 2021.
  • (25) J. Xu, Y. Ren, G. Li, L. Pan, C. Zhu, and Z. Xu, “Deep embedded multi-view clustering with collaborative training”, Information Sciences, vol. 573, pp. 279-290, 2021.
  • (26) C. Zhang, Y. Liu, and H. Fu, “AE2-Nets: Autoencoder in autoencoder networks”, in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2572-2580, 2019.
  • (27) J. Wen, Z. Zhang, Y. Xu, B. Zhang, L. Fei, and G. Xie, “CDIMC-net: Cognitive deep incomplete multi-view clustering network”, in International Joint Conference on Artificial Intelligence, pp. 3230-3236, 2020.
  • (28) S. Basu, A. Banerjee, and R. J. Mooney, “Active semi-supervision for pairwise constrained clustering”, in International Conference on Data Mining, pp. 333-344, 2004.
  • (29) P. Bradley, K. Bennett, and A. Demiriz, “Constrained k-means clustering”, Microsoft Research, 2000.
  • (30) S. D. Kamvar, D. Klein, and C. D. Manning, “Spectral learning”, in International Joint Conference on Artificial Intelligence, pp. 561-566, 2003.
  • (31) J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering”, in International Conference on Computer Vision, pp. 5880-5888, 2017.
  • (32) Y. Shi, C. Otto, and A. K. Jain, “Face clustering: Representation and pairwise constraints”, IEEE Transactions on Information Forensics and Security, vol. 13, no. 7, pp. 1626-1640, 2018.
  • (33) Z. Wang, S. Wang, L. Bai, W. Wang, and Y. Shao, “Semi-supervised fuzzy clustering with fuzzy pairwise constraints”, IEEE Transactions on Fuzzy Systems, 2021.
  • (34) F. Nie, G. Cai, J. Li, and X. Li, “Auto-weighted multi-view learning for image clustering and semi-supervised classification”, IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1501-1511, 2018.
  • (35) Y. Qin, H. Wu, X. Zhang, and G. Feng, “Semi-supervised structured subspace learning for multi-view clustering”, IEEE Transactions on Image Processing, vol. 31, pp. 1-14, 2022.
  • (36) L. Bai, J. Liang, and F. Cao, “Semi-supervised clustering with constraints of different types from multiple information sources”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 9, pp. 3247-3258, 2021.
  • (37) S. Basu, I. Davidson, and K. Wagstaff, “Constrained clustering: Advances in algorithms theory and applications”, Imaging and Machine Vision Europe, 2008.
  • (38) E. Bair, “Semi-supervised clustering methods”, Wiley Interdisciplinary Reviews Computational Statistics, vol. 5, no. 5, pp. 349-361, 2013.
  • (39) X. Yan, S. Hu, Y. Mao, Y. Ye, and H. Yu, “Deep multi-view learning methods: A review”, Neurocomputing, vol. 448, pp. 106-129, 2021.
  • (40) Y. Ren, S. Huang, P. Zhao, M. Han, and Z. Xu, “Self-paced and auto-weighted multi-view clustering”, Neurocomputing, vol. 383, pp. 248-256, 2020.
  • (41) X. Guo, L. Gao, X. Liu, and J. Yin, “Improved deep embedded clustering with local structure preservation”, in International Joint Conference on Artificial Intelligence, pp. 1753-1759, 2017.
  • (42) B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards kmeans-friendly spaces: Simultaneous deep learning and clustering”, in International Conference on Machine Learning, pp. 3861-3870, 2017.
  • (43) X. Guo, X. Liu, E. Zhu, X. Zhu, M. Li, X. Xu, and J. Yin, “Adaptive self-paced deep clustering with data augmentation”, IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 9, pp. 1680-1693, 2020.
  • (44) Y. Ren, K. Hu, X. Dai, L. Pan, S. C. H. Hoi, and Z. Xu, “Semi-supervised deep embedded clustering”, Neurocomputing, vol. 325, pp. 121-130, 2019.
  • (45) F. Li, H. Qiao, and B. Zhang, “Discriminatively boosted image clustering with fully convolutional auto-encoders”, Pattern Recognition, vol. 83, pp. 161-173, 2018.
  • (46) M. M. Fard, T. Thonet, and E. Gaussier, “Deep k-means: Jointly clustering with k-means and learning representations”, Pattern Recognition Letters, vol. 138, pp. 185-192, 2020.
  • (47) R. Chen, Y. Tang, L. Tian, C. Zhang, and W. Zhang, “Deep convolutional self-paced clustering”, Applied Intelligence, 2021.
  • (48) R. Chen, Y. Tang, C. Zhang, W. Zhang, and Z. Hao, “Deep multi-network embedded clustering”, Pattern Recognition and Artificial Intelligence, vol. 34, no. 1, pp. 14-24, 2021.
  • (49) Z. Li, C. Tang, J. Chen, C. Wan, W. Yan, and X. Liu, “Diversity and consistency learning guided spectral embedding for multi-view clustering”, Neurocomputing, vol. 370, pp. 128-139, 2019.
  • (50) D. Wu, Z. Hu, F. Nie, R. Wang, H. Yang, and X. Li, “Multi-view clustering with interactive mechanism”, Neurocomputing, vol. 449, pp. 378-388, 2021.
  • (51) X. Cai, F. Nie, and H. Huang, “Multi-view k-means clustering on big data”, in International Joint Conference on Artificial Intelligence, pp. 2598-2604, 2013.
  • (52) C. Xu, D. Tao, and C. Xu, “Multi-view self-paced learning for clustering”, in International Joint Conference on Artificial Intelligence, pp. 3974-3980, 2015.
  • (53) Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao, “Binary multi-view clustering”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 7, pp. 1774-1782, 2019.
  • (54) T. Li and C. Ding, “The relationships among various nonnegative matrix factorization methods for clustering”, in International Conference on Data Mining, pp. 362-371, 2006.
  • (55) A. Strehl and J. Ghosh, “Cluster ensembles: A knowledge reuse framework for combining multiple partitions”, Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.
  • (56) L. Hubert and P. Arabie, “Comparing partitions”, Journal of Classification, vol. 2, no. 1, pp. 193-218, 1985.
  • (57) H. W. Kuhn, “The hungarian method for the assignment problem”, Naval Research Logistics Quarterly, vol. 2, no. 1, pp. 83-97, 1955.
  • (58) W. M. Rand, “Objective criteria for the evaluation of clustering methods”, Journal of the American Statistical Association, vol. 66, no. 336, pp. 846-850, 1971.
  • (59) D. Kingma and J. Ba, “Adam: A method for stochastic optimization”, arXiv preprint arXiv:1412.6980, 2014.
  • (60) X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks”, Journal of Machine Learning Research, vol. 15, pp. 315-323, 2011.
  • (61) X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, Journal of Machine Learning Research, vol. 9, pp. 249-256, 2010.
  • (62) L. V. D. Maaten and G. Hinton, “Visualizing data using t-sne”, Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579-2605, 2008.