Spectral clustering [Ng et al.2001], which partitions data objects via their local graph/manifold structure by relying on the eigenvalue-eigenvector decomposition of the graph Laplacian, is a fundamental clustering problem. Unlike K-Means clustering [Wu et al.2008], the data objects within the same group are characterized not only by high data similarity but also by similar local graph/manifold structure. With the rapid development of information technology, data are increasingly available with multi-view feature representations (e.g., images can be described by a color histogram view or a texture view), which naturally paves the way to multi-view spectral clustering. As extensively claimed by multi-view research [Xu et al.2015, Wang et al.2017b, Deng et al.2015, Wu et al.2013, Wang et al.2015b, Wang et al.2013, Wang et al.2015a, Wang et al.2014], the information encoded by multi-view features describes different properties; thus leveraging the multi-view information can outperform single-view counterparts. One critical issue for successful multi-view incorporation, implied by existing work [Gui et al.2014, Kumar and Daume2011, Wu and Wang2017, Wang et al.2015c, Wang et al.2017a], lies in how to achieve the multi-view consensus/agreement.
Following this principle, many multi-view clustering methods [Gao et al.2013, Gao et al.2015] claim that similar data objects should be within the same group across all views. Based on that, the consensus multi-view local manifold structure has been further explored with great effort [Xia et al.2014, Wang et al.2016, Kumar et al.2011, Kumar and Daume2011] for multi-view spectral clustering. Among all these methods, Low-Rank Representation (LRR) [Liu et al.2010] coupled with a sparse decomposition based model has emerged as an elegant solution and attracted great attention, due to its strength in exploring the intrinsic low-dimensional manifold structure encoded by the data correlations embedded in the high-dimensional space, while exhibiting strong robustness to feature noise corruptions through sparse noise modeling.
1.1 Motivation: LRR Revisited for Multi-View Spectral Clustering
Specifically, the typical LRR model for multi-view spectral clustering stems from the formulation below:
$$\min_{Z, E_i} \|Z\|_* + \lambda \sum_{i \in V} \|E_i\|_1 \quad \text{s.t.} \quad X_i = X_i Z + E_i,\ i \in V,\ Z \geq 0, \tag{1}$$
where $X_i \in \mathbb{R}^{d_i \times n}$ is the data representation for the $i$-th view with $d_i$ as its feature dimension, $n$ as the number of data objects identical for each view, $\lambda$ is the balance parameter, and $V$ is the view set. $Z \in \mathbb{R}^{n \times n}$ is the self-expressive low-rank similarity representation shared by all views, constrained with $X_i = X_i Z + E_i$ based on $X_i$, which can also be substituted by other specific dictionaries; $E_i$ is modeled to address the noise corruption of the view-specific feature representation. $Z \geq 0$ ensures the nonnegativity of all its entries. Based on such optimized low-rank $Z$, the spectral clustering is finally conducted. One significant limitation of Eq.(1) pointed out by [Wang et al.2016] is that only one common $Z$ is learned to preserve the flexible local manifold structures for all views, and hence it fails to achieve the ideal spectral clustering result.
To this end, a separate low-rank $Z_i$ is learned for each view to preserve the view-specific local manifold structure, while their divergence is minimized via an iterative-views-agreement strategy for multi-view consensus, followed by a final spectral clustering stage.
Despite its encouraging performance, the following notable limitations of the LRR model have been overlooked: (1) The low-rank data similarity may not well encode the flexible latent cluster structures over the primal view-specific feature space; this is worse still for the non-ideal local graph construction over such a representation for spectral clustering; (2) The low-rank data similarities coming from multiple views may not be of the same magnitude, so that the divergence minimization may not achieve the ideal multi-view clustering consensus.
Our new perspective. The above facts motivate us to revisit the low-rank representation $Z_i$ that helps reconstruct $X_i$ below for the $i$-th view:
$$\min_{Z_i \in \mathcal{LR}} \|X_i - X_i Z_i\|_F^2, \tag{2}$$
where $\mathcal{LR}$ denotes the set of $Z_i$ with low-rankness, e.g., a cluster number $k$ far less than $n$. Instead of narrowing the low-rank $Z_i$ to a self-expressive data similarity as in the conventional viewpoint, it is essentially seen as a special case of a generalized low-rank projection, which maps the feature representation $X_i$ to a low-dimensional space to reconstruct $X_i$ with minimum error. As discussed, the self-expressive similarity projections used by LRR models still suffer from the aforementioned non-trivial limitations.
Here we ask a question: Is there a superior low-rank projection to minimize Eq.(2) while addressing the limitations of the existing LRR models? Our answer to this question is positive. Specifically, we propose to consider $Z_i$ as a latent clustered orthogonal projection, via $Z_i = U_i U_i^T$, where
- Clustered orthogonal projection: $U_i \in \mathbb{R}^{n \times k}$, where each column indicates one cluster to characterize the data objects belonging to it. Compared with LRR over the original feature space, the latent factor $U_i$ can better preserve the flexible latent cluster structure.
- Feature reconstruction with cluster basis: Instead of a low-rank data similarity, $Z_i$ essentially serves as a mapping to reconstruct the view-specific features via the columns of $U_i$ to encode the latent cluster structures.
- Rethinking $X_i Z_i$: We revisit the intuition of $X_i Z_i$ via $Z_i = U_i U_i^T$ throughout two stages; recall that $X_i Z_i = (X_i U_i) U_i^T$, where
  - $X_i U_i$ is performed to obtain the new projection value for all features over the orthogonal columns of $U_i$;
  - $(X_i U_i) U_i^T$ is subsequently the projected representation for all features spanned by the clustered orthogonal column basis of $U_i$.
- Same magnitude for multi-view consensus: All $U_i$ enjoy the same magnitude due to their orthonormal columns. Hence, the feasible divergence minimization will facilitate the multi-view consensus.
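The two-stage reading of the projection and the equal-magnitude argument can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's implementation: `random_orthonormal` and `two_stage_projection` are hypothetical helper names, the bases come from a QR factorization of random matrices rather than from clustering, and the view sizes and scales are arbitrary.

```python
import numpy as np

def random_orthonormal(n, k, rng):
    """Return an n x k matrix with orthonormal columns (stand-in for a learned basis)."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return Q

def two_stage_projection(X, U):
    """Stage 1: project the d x n features onto the k columns of U.
    Stage 2: map the d x k coefficients back onto the span of those columns."""
    coeffs = X @ U          # d x k projection values
    X_proj = coeffs @ U.T   # d x n representation spanned by the columns of U
    return coeffs, X_proj

rng = np.random.default_rng(0)
n, k = 8, 3
# Two views with very different feature scales (assumed sizes and scales).
views = {"a": rng.normal(size=(5, n)), "b": 100.0 * rng.normal(size=(7, n))}
bases = {name: random_orthonormal(n, k, rng) for name in views}

for name, X in views.items():
    coeffs, X_proj = two_stage_projection(X, bases[name])
    # Any basis with orthonormal columns has squared Frobenius norm k,
    # regardless of the scale of the view it was paired with.
    print(name, coeffs.shape, X_proj.shape, round(float(np.sum(bases[name] ** 2)), 6))
```

Because the squared Frobenius norm of every such basis equals $k$, a divergence term between bases of different views compares quantities of the same scale, which is not guaranteed for unconstrained similarity matrices.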
Before shedding light on our technique, we review the typical related work for multi-view spectral clustering.
1.2 Prior Arts
The prior arts can be classified as per the strategy at which the multi-view fusion takes place for spectral clustering.
The most straightforward method is Early fusion [Huang et al.2010], which concatenates the multi-view feature vectors with equal or varied weights into a unified one, followed by spectral clustering over such unified space. However, such a method ignores the statistical properties belonging to each individual view. Late fusion [Greene and Cunningham2009] may address this limitation to some extent by aggregating the spectral clustering results from each individual view, which follows the assumption that all views are independent of each other. Such an assumption is not effective for multi-view spectral clustering, which assumes the views to be dependent so that the multi-view consensus information can be exploited for promising performance.
Canonical Correlation Analysis (CCA) is applied to multi-view spectral clustering [Chaudhuri et al.2009] by learning a common low-dimensional representation for all views, upon which spectral clustering is performed. One salient drawback lies in its failure to preserve the flexible local manifold structures of different views via such a common subspace. The co-training based model [Kumar and Daume2011] learns the Laplacian eigenmap for each view over its data representation projected through the Laplacian eigenmaps from other views; this process is repeated until convergence, and the final similarities are then aggregated for spectral clustering. A similar method [Kumar et al.2011] is also proposed to coordinate the consensus of multi-view Laplacian eigenmaps for spectral clustering. Despite their effectiveness, they require the feature representations to be noise free, which unfortunately cannot be met in practice. The Low-Rank Representation and sparse decomposition models [Wang et al.2016, Xia et al.2014] tackle this problem well, while exhibiting robustness to feature noise corruptions. However, they still suffer from the aforementioned limitations. To this end, we make the following orthogonal contributions to the typical LRR model for multi-view spectral clustering.
1.3 Our Contributions
We revisit the classical Low-Rank Representation (LRR) for multi-view spectral clustering from a fundamentally novel viewpoint, seeing it as essentially a latent clustered orthogonal projection based representation with optimized graph structure, which better encodes the flexible latent cluster structures than LRR over primal data objects.
We convert the problem of learning LRR into that of simultaneously learning the clustered orthogonal representation and its optimized local graph structure for each view, rather than directly relying on the local graph construction over the original data objects.
The learned multi-view latent clustered representations and local graph structures enjoy the same magnitude, so as to facilitate a feasible divergence minimization to achieve superior multi-view consensus for spectral clustering.
Extensive experiments over multi-view datasets validate the superiority of our method.
2 Learning Clustered Orthogonal Projection with Optimized Graph Structure
In this section, we formally discuss our technique. Some notations that are used throughout the paper are shown below.
For a matrix $M$, the trace of $M$ is denoted as $\mathrm{Tr}(M)$; $\|M\|_F$ (or $\|\cdot\|_2$ for vector space) denotes the Frobenius norm; $\|M\|_1$ is the $\ell_1$ norm, $M^T$ denotes the transpose of $M$, and its nuclear norm is $\|M\|_*$ (sum of all singular values); $M_{i,:}$ and $M_{:,j}$ denote the $i$-th row and $j$-th column of $M$. $M \geq 0$ means all entries of $M$ are nonnegative.
$I$ is the identity matrix with adaptive size. $\mathbf{1}$ indicates the vector of adaptive length with all entries equal to 1. $|\cdot|$ indicates the cardinality of a set.
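For readers who want to check these quantities numerically, the snippet below is a small illustration using NumPy. It assumes the $\ell_1$ norm is meant entrywise (sum of absolute values), which is an assumption on our side; note that `np.linalg.norm(M, 1)` would instead return the induced maximum column sum.

```python
import numpy as np

M = np.array([[3.0, 0.0],
              [0.0, -4.0]])

trace = np.trace(M)                          # sum of diagonal entries
frobenius = np.linalg.norm(M, "fro")         # sqrt of the sum of squared entries
l1_entrywise = np.abs(M).sum()               # entrywise l1 norm (assumed meaning)
nuclear = np.linalg.norm(M, "nuc")           # sum of singular values
singular_values = np.linalg.svd(M, compute_uv=False)

print(trace, frobenius, l1_entrywise, nuclear, singular_values.sum())
```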
2.2 Problem Formulation
As previously defined in section 1.1, is the data representation for the view. is low-rank data similarity representation for . The Eq.(2) is equivalent to computing such that , where has orthonormal columns with its column representing the relevance each data object belongs to the cluster, and indicates latent cluster number. We then arrive at the following
As discussed in section 1.1, reveals the new projection representation for all features spanned by the orthogonal basis of to reconstruct .
Optimizing Eq.(3) w.r.t. is equivalent to computing an SVD of to constitute the orthogonal columns of using the principal eigenvectors. Inspired by this, we exploit the latent cluster structures of to form non-overlapping clusters with each characterized by one orthogonal column basis of . Thanks to [Recht et al.2010] on low-rank matrix factorization, it yields the following
where and are latent factors from . Based on that, we approximate via the clustered orthogonal projection factorization , and convert the problem of minimizing to that of learning clustered projection representation below
Unlike the data similarity over raw data objects, via the low-rank matrix factorization can achieve the flexible latent cluster structures. Another crucial issue left to be addressed lies in its local manifold/graph structure modeling over , which is crucial for spectral clustering. One may directly refer to the local graph construction over . However, as previously stated, it cannot effectively encode the local graph structure over .
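Two standard facts invoked earlier in this subsection can be checked numerically. The sketch below is illustrative only and makes assumptions that are not confirmed by the text: it takes the reconstruction objective to be $\|X - X U U^T\|_F^2$ with orthonormal $U$, and it uses the variational form of the nuclear norm from [Recht et al.2010], $\|Z\|_* = \min_{Z = PQ^T} \frac{1}{2}(\|P\|_F^2 + \|Q\|_F^2)$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 6, 10, 3
X = rng.normal(size=(d, n))

def reconstruction_error(X, U):
    """Squared Frobenius error of reconstructing X from its projection onto span(U)."""
    return np.linalg.norm(X - X @ U @ U.T, "fro") ** 2

# (i) The top-k right singular vectors of X minimise the error among orthonormal U.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
U_svd = Vt[:k].T                                   # n x k, orthonormal columns
err_svd = reconstruction_error(X, U_svd)
U_rand, _ = np.linalg.qr(rng.normal(size=(n, k)))  # an arbitrary orthonormal competitor
err_rand = reconstruction_error(X, U_rand)

# (ii) Nuclear norm as a minimum over factorisations Z = P Q^T.
Z = rng.normal(size=(n, k)) @ rng.normal(size=(k, n))   # a rank-k matrix
A, sz, Bt = np.linalg.svd(Z, full_matrices=False)
P = A * np.sqrt(sz)                                # balanced factors attain the minimum
Q = Bt.T * np.sqrt(sz)
balanced = 0.5 * (np.sum(P ** 2) + np.sum(Q ** 2))
unbalanced = 0.5 * (np.sum((2 * P) ** 2) + np.sum((Q / 2) ** 2))  # same product, larger value

print(err_svd, err_rand, balanced, np.linalg.norm(Z, "nuc"), unbalanced)
```

The residual of the SVD-based basis equals the sum of the discarded squared singular values, and rescaling the factors while keeping their product fixed only increases the factorization objective.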
Towards this end, we propose to learn an optimized local graph structure over by solving the following
where is the Laplacian matrix, is the diagonal matrix with its diagonal entry equal to the sum of the row of . The ideal reveals the probability of the and data points being within the same cluster according to the cluster projection representation . We impose the constraint that , and to meet the probability nature of . Following [Nie et al.2014], we impose the regularization to avoid the case where only the nearest neighbor of each data point is assigned 1 with all others 0.
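The role of the Laplacian in graph learning can be made concrete with the standard identity $\mathrm{Tr}(U^T L U) = \frac{1}{2}\sum_{j,l} W_{jl}\,\|u_j - u_l\|_2^2$ for a symmetric graph $W$, where $u_j$ is the $j$-th row of $U$. The sketch below checks it on random data; it assumes the graph term takes this usual trace form, which cannot be confirmed from the text here, and `graph_laplacian` is a hypothetical helper name.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalised Laplacian L = D - W, with D the diagonal matrix of row sums of W."""
    D = np.diag(W.sum(axis=1))
    return D - W

rng = np.random.default_rng(2)
n, k = 6, 2
A = rng.random((n, n))
W = (A + A.T) / 2.0           # symmetric nonnegative affinities (random, for illustration)
np.fill_diagonal(W, 0.0)      # no self-loops
U = rng.normal(size=(n, k))   # rows play the role of per-object representations

L = graph_laplacian(W)
trace_form = np.trace(U.T @ L @ U)
diffs = U[:, None, :] - U[None, :, :]                      # n x n x k pairwise differences
pairwise_form = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=2))

print(trace_form, pairwise_form)
```

Minimizing this quantity over $W$ therefore pushes large weights onto pairs of objects whose representations are close, which is why a regularizer is needed to avoid the degenerate one-neighbor solution.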
With all the above collected, we finally formulate the problem below
where is omitted due to constraint ; all the share the same cluster number for multi-view clustering consensus. , and are non-negative weights related to learning the clustered orthogonal representation, its local graph structure and multi-view consensus modeling, and will be studied in Section 4. The constraint ensures the orthonormal columns of .
We introduce two auxiliary variables and . As will be shown later, the intuition of introducing lies in minimizing w.r.t. , where it is similar to dictionary learning, while popping up as the corresponding sparse representation learning; moreover, it also enjoys the optimization of the isolated after merging the other into .
Solving Eq.(7) amounts to a unified process of simultaneously learning and for the view. As will be shown later, learning either of them will promote the other. Since Eq.(7) is not jointly convex in , and , we alternately optimize each of them with the others fixed. Following [Lin et al.2011], we deploy the Augmented Lagrange Multiplier (ALM) method together with the Alternating Direction Minimization (ADM) strategy, which is widely known as an effective and efficient solver. As the optimization process for the above variables within each view is similar, we only present the optimization process for the view; the same process holds for the other views. The augmented Lagrangian function can be written below
where , , and are Lagrange multipliers. indicates element-wise multiplication. is a penalty parameter.
Solving : We calculate the partial derivative of Eq.(8) w.r.t. , to be , while fixing others to be constant. After rearranging the terms, it has
Efficient Row updating strategy of . As shown in Eq.(9), the bottleneck of updating lies in the high computational complexity of caused by the matrix inverse operation against the . To resolve it, we propose to update each row of . Without loss of generality, we set the derivative w.r.t. to be . It then yields the following
Orthonormalize : After obtaining the whole by updating all rows for each iteration, a clustering algorithm, e.g., fast k-means, is performed, which yields the cluster indicator for each data point/each row, leading to orthogonal columns; we then normalize each entry of via the rule: if is assigned to the cluster , and it is 0 otherwise. According to the processing above, it successfully achieves the orthonormal columns of ().
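The orthonormalization step can be sketched as follows. This is a minimal illustration rather than the authors' code: `kmeans_labels` is a plain Lloyd iteration with farthest-point initialization standing in for "fast k-means", `orthonormal_indicator` is a hypothetical name, and the $1/\sqrt{n_c}$ scaling (with $n_c$ the size of cluster $c$) is an assumed normalization that makes the columns orthonormal; the exact rule is not recoverable from the text above.

```python
import numpy as np

def kmeans_labels(R, k, n_iter=20):
    """Plain Lloyd iterations on the rows of R with farthest-point initialisation."""
    centers = [R[0]]
    for _ in range(1, k):
        d = np.min([np.sum((R - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(R[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((R[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):            # keep the old centre if a cluster empties
                centers[c] = R[labels == c].mean(axis=0)
    return labels

def orthonormal_indicator(labels, k):
    """n x k indicator with entry 1/sqrt(n_c) for the assigned cluster c, 0 otherwise."""
    U = np.zeros((labels.size, k))
    U[np.arange(labels.size), labels] = 1.0
    sizes = U.sum(axis=0)
    sizes[sizes == 0] = 1.0                    # guard against division by zero
    return U / np.sqrt(sizes)

rng = np.random.default_rng(3)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
R = np.vstack([c + 0.1 * rng.normal(size=(4, 2)) for c in blob_centers])  # toy rows
labels = kmeans_labels(R, k=3)
U = orthonormal_indicator(labels, k=3)
print(labels)
print(np.round(U.T @ U, 6))
```

Columns are orthogonal because each row has a single nonzero entry, and unit-norm because of the scaling, provided no cluster is empty.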
As per the row-update strategy for in Eq.(10), we remark the following:
We dramatically reduce the computational complexity from by Eq.(9) to , due to .
Another note goes to the process of multi-view consensus of via the row update. Specifically, during each iteration, the is updated via the influence from other views, while serving as a constraint to guide the updating, among all of which the divergence is decreased towards a consensus, which is based on the same magnitude among with orthonormal columns.
Solving : We get the partial derivative of Eq.(8) w.r.t. , then yields the following closed form:
The major computational burden lies in , resulting in , which is identical to that for the row-updating of , hence efficient.
Solving : Optimizing Eq.(8) w.r.t. is equivalent to solving the following
According to [Recht et al.2008], the following closed form can be obtained
where , if is positive, it is 0 otherwise.
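The closed form referred to above is not recoverable from this copy. If it is the usual shrinkage (soft-thresholding) operator, which is the standard closed-form minimizer of $\tau\|E\|_1 + \frac{1}{2}\|E - A\|_F^2$, it can be sketched as below; `soft_threshold`, `A` and `tau` are hypothetical names, and which variable of Eq.(8) it would apply to is an assumption.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise shrinkage: sign(a) * (|a| - tau) if |a| - tau is positive, 0 otherwise."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def l1_prox_objective(E, A, tau):
    """Objective whose minimiser is soft_threshold(A, tau)."""
    return tau * np.abs(E).sum() + 0.5 * np.sum((E - A) ** 2)

A = np.array([[2.0, -0.3],
              [-1.5, 0.5]])
E = soft_threshold(A, tau=0.5)
print(E)
```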
Solving : Optimizing Eq.(8) w.r.t. is equivalent to the following
Based on that, we enjoy the following closed form
Solving : The problem of optimizing can be converted to the following
As the similarity vector for each sample is independent, we only study the sample.
We convert Eq.(18) to the following
where is a vector, with its entry , leading to the following closed form:
where turns the negative entries in to 0 while with positive entries remained. denotes the number of data points that have nonzero weight connected to the sample. We empirically set for all views. Once the is obtained, we may update that to be a balanced undirected graph as .
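A row-wise closed form of this kind appears in the adaptive-neighbors method of [Nie et al.2014], where each row of the graph is a probability vector supported on the $k$ nearest neighbors. The sketch below implements that known closed form as an illustration; whether it matches Eq.(20) term by term is an assumption. The function name, the choice $k=3$, the tie fallback and the symmetrization $(W + W^T)/2$ are also assumptions.

```python
import numpy as np

def adaptive_neighbor_graph(U, k):
    """Row-stochastic graph with at most k nonzeros per row, built from the rows of U.
    Requires at least k + 2 rows so that a (k+1)-th neighbour exists."""
    n = U.shape[0]
    D = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2)   # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        d = D[i].copy()
        d[i] = np.inf                           # exclude the sample itself
        idx = np.argsort(d)[:k + 1]             # k nearest neighbours plus the (k+1)-th
        dk1 = d[idx[k]]
        denom = k * dk1 - d[idx[:k]].sum()
        if denom > 1e-12:
            W[i, idx[:k]] = (dk1 - d[idx[:k]]) / denom
        else:                                   # all k+1 distances tie: fall back to uniform
            W[i, idx[:k]] = 1.0 / k
    return W

rng = np.random.default_rng(4)
U = rng.normal(size=(7, 3))                     # toy representations, one row per sample
W = adaptive_neighbor_graph(U, k=3)
W_sym = (W + W.T) / 2.0                         # assumed symmetrisation to an undirected graph
print(np.round(W.sum(axis=1), 6))
```

Each row sums to one and is nonnegative, so it can be read as neighbor-assignment probabilities; the symmetrized graph is what a subsequent spectral step would typically consume.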
Consensus : As is solely determined by according to Eq.(20), the consensus on in Remark 3 naturally leads to the consensus over .
Multiplier updating: The Lagrange multipliers , and are automatically updated as
Besides, is tuned via the adaptive updating rule according to [Lin et al.2011].
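As a generic illustration of this step, an ALM dual update adds the scaled constraint residual to the multiplier and then enlarges the penalty up to a cap. The sketch below is a simplification under stated assumptions: `alm_dual_step`, `rho` and `mu_max` are hypothetical names and values, the residual is a placeholder for whichever equality constraint a multiplier is attached to, and the penalty is always increased, whereas the adaptive rule of [Lin et al.2011] increases it only when the iterates change slowly.

```python
import numpy as np

def alm_dual_step(Y, residual, mu, rho=1.1, mu_max=1e6):
    """One generic dual ascent step: Y <- Y + mu * residual, mu <- min(rho * mu, mu_max)."""
    Y_new = Y + mu * residual
    mu_new = min(rho * mu, mu_max)
    return Y_new, mu_new

Y = np.zeros((2, 2))
residual = np.array([[0.5, -0.5],
                     [0.0, 1.0]])      # placeholder constraint violation
mu = 2.0
Y, mu = alm_dual_step(Y, residual, mu)
print(Y, mu)
```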
Algorithm convergence: It is worth noting that the ADM strategy converges to a stationary point, yet it is not guaranteed to be the global optimum. Upon that, we define the convergence when with or the maximum iteration number is reached, which is set to be 25 for our method. The optimization process is conducted for each variable alternately within each view, and the entire process is terminated once the convergence rule is met for all views.
Multi-view clustering output: After the above updating rule has converged, we get the final multi-view clustered representation ; and multi-view optimized local graph structure . The normalized graph cut is applied to generate the clusters as the multi-view spectral clustering output.
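The final step can be sketched with the normalized spectral embedding of [Ng et al.2001], which relaxes the normalized cut. The code below is an illustration under assumptions: it takes a single symmetric affinity matrix `W` as input (how the view-specific graphs are combined is not specified here), `spectral_embedding` is a hypothetical name, and a clustering of the embedded rows (e.g., k-means) would follow.

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the k eigenvectors of the normalised Laplacian with the smallest
    eigenvalues, each row rescaled to unit length."""
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(W.shape[0]) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)          # eigenvalues returned in ascending order
    Y = eigvecs[:, :k]
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return Y / np.maximum(norms, 1e-12)

# Toy graph with two disconnected, fully connected groups of three nodes each.
block = np.ones((3, 3)) - np.eye(3)
W = np.zeros((6, 6))
W[:3, :3] = block
W[3:, 3:] = block
Y = spectral_embedding(W, k=2)
print(np.round(Y, 4))
```

On this toy graph the embedded rows coincide within each group and are orthogonal across groups, so any reasonable clustering of the rows recovers the two groups.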
We summarize the whole updating process in Algorithm 1.
4 Experimental Validation
UCI handwritten Digit set (http://archive.ics.uci.edu/ml/datasets/Multiple+Features): It consists of features of hand-written digits (0-9). The dataset is described by 6 features and contains 2000 samples with 200 in each category. Analogous to [Lin et al.2011], we choose 76 Fourier coefficients (FC) of the character shapes and the 216 profile correlations (PC) as two views.
|Method|UCI digits|AwA|NUS|
|---|---|---|---|
|LRRGL|17.39|25.78|34.21|
|Ours|1.15|1.18|1.21|

Table 1: Multi-view consensus ratio metric for our method and LRRGL over three data sets. A smaller value means more similar magnitude.
Animal with Attribute (AwA) (http://attributes.kyb.tuebingen.mpg.de): It consists of 50 kinds of animals described by 6 features (views): Color histogram (CQ, 2688-dim), local self-similarity (LSS, 2000-dim), pyramid HOG (PHOG, 252-dim), SIFT (2000-dim), Color SIFT (RGSIFT, 2000-dim), and SURF (2000-dim). We randomly sample 80 images for each category and get 4000 images in total.
NUS-WIDE-Object (NUS) [Chua et al.2009]: The data set consists of 30000 images from 31 categories. We construct 5 views: 65-dimensional color histogram (CH), 226-dimensional color moments (CM), 145-dimensional color correlation (CORR), 74-dimensional edge estimation (EDH), and 129-dimensional wavelet texture (WT).
The following typical multi-view baselines are compared for spectral clustering, covering Early fusion, Late fusion, CCA, Co-training strategy and LRR models as reviewed in Section 1.2. All the parameters are tuned to their best performance.
MFMSC: Concatenating multi-view features to perform spectral clustering.
Multi-view affinity aggregation for multi-view spectral clustering (MAASC) [Huang et al.2012].
Canonical Correlation Analysis (CCA) based multi-view spectral clustering (CCAMSC) [Chaudhuri et al.2009], which learns a common subspace for multi-view data and then performs spectral clustering.
Low-Rank Representation with Multi-Graph Learning (LRRGL) [Wang et al.2016].
|ACC (%)||UCI digits||AwA||NUS|
|NMI (%)||UCI digits||AwA||NUS|
Two metrics are used: clustering accuracy (ACC) and normalized mutual information (NMI). Please refer to [Cai and Chen2015, Chen et al.2011] for their detailed descriptions. To demonstrate the robustness superiority over non-LRR methods, following [Wang et al.2016], we corrupt the features of each view with sparse noise, i.e., 20% of the entries are perturbed with uniform noise over [-5,5], for RLRR, LRRGL and our method, with in Eq.(7) for our method. All experiments are repeated 10 times, and the average clustering results are shown in Tables 2 and 3, where our method outperforms the others, and is especially better than RLRR and LRRGL, due to its strengths of
encoding more flexible latent cluster structures, along with the more ideal optimized local graph structure based on such latent clustered representation.
The superior multi-view consensus in terms of both latent clustered representation and optimized local graph structure for all views.
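The sparse corruption protocol used in these experiments (20% of the entries perturbed with uniform noise over [-5,5]) can be reproduced along the following lines. This is a sketch under assumptions: the noise is taken to be additive, the corrupted positions are sampled without replacement, and `corrupt_entries` is a hypothetical helper name.

```python
import numpy as np

def corrupt_entries(X, fraction=0.2, low=-5.0, high=5.0, rng=None):
    """Add uniform noise in [low, high] to a random `fraction` of the entries of X."""
    rng = np.random.default_rng() if rng is None else rng
    X_noisy = X.astype(float).copy()
    n_corrupt = int(round(fraction * X.size))
    idx = rng.choice(X.size, size=n_corrupt, replace=False)   # flat positions to corrupt
    X_noisy.flat[idx] += rng.uniform(low, high, size=n_corrupt)
    return X_noisy, idx

X = np.zeros((10, 20))                       # toy feature matrix for one view
X_noisy, idx = corrupt_entries(X, rng=np.random.default_rng(5))
print(idx.size, float(np.abs(X_noisy).max()))
```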
To examine the first finding, we illustrate the visualized consensus multi-view affinity matrix over the NUS data set for our method and LRRGL in Fig. 1, which validates the advantages of our clustered orthogonal representation over the low-rank similarity yielded by LRRGL.
Parameter Study: We further study the parameters (clustered orthogonal representations and optimized local graph structure) and (multi-view consensus term) in Eq.(7), and against the clustering accuracy over the AwA and NUS data sets; we vary one parameter while fixing the others, and the results are illustrated in Fig. 2, where increasing either of them improves the clustering accuracy until the optimal pair-wise values are met, followed by a slight performance decrease. To balance Figs. 2(a) and (b), we finalize and in Eq.(7).
In this paper, we revisit the classical Low-Rank Representation (LRR) for multi-view spectral clustering, by viewing LRR as essentially a latent clustered orthogonal projection equipped with its optimized local graph structure. Following this, we propose to simultaneously learn the clustered orthogonal projection and the optimized local graph structure for each view, while enjoying the same magnitude over both of them for all views, leading to a superior multi-view spectral clustering consensus. Extensive experiments validate its strength.
- [Cai and Chen2015] Deng Cai and Xinlei Chen. Large scale spectral clustering via landmark-based sparse representation. IEEE Trans. Cybernetics, 45(8):1669–1680, 2015.
- [Chaudhuri et al.2009] K. Chaudhuri, S. Kakade, K. Livescu, and K. Sridharan. Multi-view clustering via canonical correlation analysis. In ICML, 2009.
- [Chen et al.2011] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang. Parallel spectral clustering in distributed systems. IEEE Trans. Pattern Anal. Mach. Intell., 33(3):568–586, 2011.
- [Chua et al.2009] Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. Nus-wide: A real-world web image database from national university of singapore. In ACM CIVR, 2009.
- [Deng et al.2015] Cheng Deng, Zongting Lv, Wei Liu, Junzhou Huang, Dacheng Tao, and Xinbo Gao. Multi-view matrix decomposition: a new scheme for exploring discriminative information. In IJCAI, 2015.
- [Gao et al.2013] Jing Gao, Jiawei Han, Jialu Liu, and Chi Wang. Multi-view clustering via joint nonnegative matrix factorization. In SDM, pages 252–260, 2013.
- [Gao et al.2015] Hongchang Gao, Feiping Nie, Xuelong Li, and Heng Huang. Multi-view subspace clustering. In ICCV, pages 4238–4246, 2015.
- [Greene and Cunningham2009] D. Greene and P. Cunningham. A matrix factorization approach for integrating multiple data views. In ECMLPKDD, 2009.
- [Gui et al.2014] Jie Gui, Dacheng Tao, Zhenan Sun, Yong Luo, Xinge You, and Yuan Yan Tang. Group sparse multiview patch alignment framework with view consistency for image classification. IEEE Transactions on Image Processing, 23(7):3126–3137, 2014.
- [Huang et al.2010] Yuchi Huang, Qingshan Liu, Shaoting Zhang, and Dimitris N. Metaxas. Image retrieval via probabilistic hypergraph ranking. In CVPR, 2010.
- [Huang et al.2012] Hsin-Chien Huang, Yung-Yu Chuang, and Chu-Song Chen. Affinity aggregation for spectral clustering. In CVPR, 2012.
- [Kumar and Daume2011] Abhishek Kumar and Hal Daume. A co-training approach for multi-view spectral clustering. In ICML, 2011.
- [Kumar et al.2011] Abhishek Kumar, Piyush Rai, and Hal Daume. Co-regularized multi-view spectral clustering. In NIPS, 2011.
- [Lin et al.2011] Zhouchen Lin, Risheng Liu, and Zhixun Su. Linearized alternating direction method with adaptive penalty for low-rank representation. In NIPS, 2011.
- [Liu et al.2010] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.
- [Ng et al.2001] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
- [Nie et al.2014] Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In KDD, pages 977–986, 2014.
- [Recht et al.2008] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Journal on Optimization, 20(4):1956–1982, 2008.
- [Recht et al.2010] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
- [Wang et al.2013] Yang Wang, Xuemin Lin, and Qing Zhang. Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In ACM CIKM, pages 805–810, 2013.
- [Wang et al.2014] Yang Wang, Xuemin Lin, Qing Zhang, and Lin Wu. Shifting hypergraphs by probabilistic voting. In PAKDD, pages 234–246, 2014.
- [Wang et al.2015a] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. Effective multi-query expansions: Robust landmark retrieval. In ACM Multimedia, pages 79–88, 2015.
- [Wang et al.2015b] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, and Qing Zhang. Lbmch: Learning bridging mapping for cross-modal hashing. In ACM SIGIR, 2015.
- [Wang et al.2015c] Yang Wang, Xuemin Lin, Lin Wu, Wenjie Zhang, Qing Zhang, and Xiaodi Huang. Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing, 24(11):3939–3949, 2015.
- [Wang et al.2016] Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, Meng Fang, and Shirui Pan. Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In IJCAI, 2016.
- [Wang et al.2017a] Yang Wang, Xuemin Lin, Lin Wu, and Wenjie Zhang. Effective multi-query expansions: Collaborative deep networks for robust landmark retrieval. IEEE Trans. Image Processing, 26(3):1393–1404, 2017.
- [Wang et al.2017b] Yang Wang, Wenjie Zhang, Lin Wu, Xuemin Lin, and Xiang Zhao. Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Trans. Neural Networks and Learning Systems, 28(1):57–70, 2017.
- [Wu and Wang2017] Lin Wu and Yang Wang. Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image Vision Comput, 57:58–66, 2017.
- [Wu et al.2008] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, and Dan Steinberg. Top 10 algorithms in data mining. Knowledge and Information Systems, 14:1–37, 2008.
- [Wu et al.2013] Lin Wu, Yang Wang, and John Shepherd. Efficient image and tag co-ranking: a bregman divergence optimization method. In ACM Multimedia, 2013.
- [Xia et al.2014] Rongkai Xia, Yan Pan, Lei Du, and Jian Yin. Robust multi-view spectral clustering via low-rank and sparse decomposition. In AAAI, 2014.
- [Xu et al.2015] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view intact space learning. IEEE Trans. Pattern Anal. Mach. Intell., 2015.