Clustering is one of the fundamental techniques in machine learning. It has been widely applied in various research fields, such as gene expression analysis [DBLP:journals/tkde/JiangTZ04], face analysis [DBLP:journals/pami/ElhamifarV13], image annotation [DBLP:journals/pami/WangZLM08], and recommendation [clusterrecommend1, clusterrecommend2]. In the past decades, many clustering approaches have been developed, such as k-means [DBLP:journals/J.B.MacQueen], spectral clustering [DBLP:conf/nips/NgJW01, DBLP:journals/tcyb/CaiC15, yang2017discrete, hu2017robust, yang2015multitask], spectral embedded clustering [DBLP:journals/tnn/NieZTXZ11], and normalized cut [DBLP:conf/cvpr/ShiM97].
K-means identifies cluster centroids that minimize the within-cluster data distances. Due to its simplicity and efficiency, it has been extensively applied as one of the most basic clustering methods. Nevertheless, k-means suffers from the curse of dimensionality, and its performance highly depends on the initialization of the cluster centroids. As a promising alternative, spectral clustering and its extensions learn a low-dimensional embedding of the data samples by modelling their affinity correlations with a graph [CUI2018, 8171745, 8425080, LU2019217, 8352779, li2018transfer, li2018heterogeneous, li2017low]. These graph based clustering methods [DBLP:journals/tcyb/LiCD17, shen2018tpami, shen2017tmm, TIP2016binary, DBLP:journals/tip/XuSYSL17] generally work in three separate steps: similarity graph construction, relaxation of the cluster labels, and label discretization with k-means. Their performance is largely determined by the quality of the pre-constructed similarity graph, where the similarity relations of samples are simply calculated with a fixed distance measurement that cannot fully capture the inherent local structure of the data. This unstructured graph may lead to sub-optimal clustering results. Besides, they rely on k-means to generate the final discrete cluster labels, which may make the clustering solution as unstable as that of k-means.
Recently, clustering with adaptive neighbors (CAN) [DBLP:conf/kdd/NieWH14] was proposed to automatically learn a structured similarity graph by taking the clustering performance into account. Projective clustering with adaptive neighbors (PCAN) [DBLP:conf/kdd/NieWH14] further improves the performance by simultaneously performing subspace discovery, similarity graph learning and clustering. With structured graph learning, CAN and PCAN further enhance the performance of graph based clustering. However, they simply drop the discrete constraint on the cluster labels and solve for an approximate continuous solution. This strategy may lead to significant information loss and thus reduce the quality of the constructed graph structure. Moreover, to generate the final discrete cluster labels, graph cut must still be performed on the learned similarity graph. To obtain the cluster labels of out-of-sample data, the whole algorithm has to be run again. This requirement brings considerable computation cost in real practice.
In this paper, we propose an effective discrete optimal graph clustering (DOGC) method. We develop a unified learning framework, where the optimal graph structure is adaptively constructed, the discrete cluster labels are directly learned, and the out-of-sample extension is well supported. In DOGC, a structured graph is adaptively learned from the original data under the guidance of a reasonable rank constraint for pursuing the optimal clustering structure. Besides, to avoid the information loss suffered by most graph based clustering methods, a rotation matrix is learned in DOGC to rotate the intermediate continuous labels and directly obtain the discrete ones. Based on the discrete cluster labels, we further integrate a robust prediction module into DOGC to compensate for the unreliability of the cluster labels and learn a prediction function for out-of-sample data clustering. To solve the formulated discrete clustering problem, an alternating optimization strategy with guaranteed convergence is developed to iteratively calculate the clustering results. The key advantages of our methods are highlighted as follows:
TABLE I: Main notations used in this paper.
|Notation||Description|
|$d$||The feature dimension of data|
|$c$||The number of clusters|
|$k$||The nearest neighbor number of data points|
|$I$||An identity matrix|
|A||The fixed similarity matrix calculated directly by the Gaussian kernel function|
|S||The learned similarity matrix corresponding to X|
|W||The projection matrix for dimension reduction|
|P||The mapping matrix from original data to discrete cluster labels|
|F||The continuous cluster labels|
|Y||The discrete cluster labels|
|Q||The rotation matrix|
|$D_S$||Degree matrix of similarity matrix S|
|$L_S$||Laplacian matrix of similarity matrix S|
Rather than exploiting a fixed similarity matrix, the similarity graph is adaptively learned from the raw data by considering the clustering performance. Under a reasonable rank constraint, the dynamically constructed graph is forced to be well structured and theoretically optimal for clustering.
Our model learns a proper rotation matrix to directly generate the discrete cluster labels, avoiding the information loss caused by relaxation in many existing graph based clustering methods.
With the learned discrete cluster labels, our model accommodates out-of-sample data well by designing a robust prediction module. The discrete cluster labels of database samples can be obtained directly, and at the same time the clustering of new data is well supported.
The rest of this paper is organized as follows. Section II revisits several representative graph based clustering methods. Section III describes the details of the proposed methods. Section IV introduces the experimental settings. The experimental results are presented in Section V. Section VI concludes the paper.
II Graph Clustering Revisited
II-A Notations
For the data matrix $X \in \mathbb{R}^{d \times n}$, the $(i,j)$-th entry of X and the $i$-th sample of X are denoted by $x_{ij}$ and $x_i$ respectively. The trace of X is denoted by $\mathrm{Tr}(X)$. The Frobenius norm of matrix X is denoted by $\|X\|_F$. The similarity matrix corresponding to X is denoted by S, whose $(i,j)$-th entry is $s_{ij}$. An identity matrix of size $n$ is represented by $I_n$, and $\mathbf{1}$ denotes a column vector with all elements as 1. The main notations used in this paper are summarized in Table I.
II-B Spectral Clustering
Spectral clustering [DBLP:conf/nips/NgJW01] requires the Laplacian matrix $L_S$ as an input. It is computed as $L_S = D_S - S$, where $D_S$ is a diagonal matrix whose $i$-th diagonal element is $\sum_{j} s_{ij}$. Supposing there are $c$ clusters in the dataset X, spectral clustering solves the following problem:

$\min_{Y} \mathrm{Tr}(Y^T L_S Y), \quad \mathrm{s.t.}\ Y \in \mathrm{Ind},$
where $Y \in \{0,1\}^{n \times c}$ is the clustering indicator matrix and $Y \in \mathrm{Ind}$ means that the clustering label vector of each sample contains only one element 1 while the others are 0. As the discrete constraint is imposed on Y, Eq.(1) becomes an NP-hard problem. To tackle it, most existing methods first relax Y to a continuous clustering indicator matrix $F \in \mathbb{R}^{n \times c}$, and then calculate the relaxed solution as

$\min_{F} \mathrm{Tr}(F^T L_S F), \quad \mathrm{s.t.}\ F^T F = I,$

where the orthogonal constraint is adopted to avoid trivial solutions. The optimal solution of F is comprised of the eigenvectors of $L_S$ corresponding to the $c$ smallest eigenvalues. Once F is obtained, k-means is applied to it to generate the final clustering result. For presentation convenience, we denote F and Y as continuous labels and discrete labels respectively.
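As a concrete illustration of this three-step pipeline (graph construction, spectral relaxation, k-means discretization), the following NumPy sketch clusters samples stored as rows of X. The Gaussian bandwidth `sigma` and the tiny built-in k-means (with farthest-point initialization) are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def _kmeans(F, c, iters=100):
    # Deterministic farthest-point initialization followed by Lloyd updates
    centers = [F[0]]
    for _ in range(c - 1):
        d = np.min([((F - ctr) ** 2).sum(1) for ctr in centers], axis=0)
        centers.append(F[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((F[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([F[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(c)])
    return labels

def spectral_clustering(X, c, sigma=1.0):
    # Gaussian-kernel similarity matrix A between all pairs of samples
    sq = (X ** 2).sum(1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0)
    A = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0)
    # Unnormalized Laplacian L = D - A with degree matrix D
    L = np.diag(A.sum(1)) - A
    # Continuous labels F: eigenvectors of the c smallest eigenvalues
    F = np.linalg.eigh(L)[1][:, :c]
    # Discretize F with k-means to obtain the final cluster labels
    return _kmeans(F, c)
```

For well-separated data, the rows of F concentrate around one point per connected component, so the final k-means step recovers the partition easily.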
TABLE II: Summary of the main differences between the proposed methods and existing clustering methods.
|Methods||Projective Subspace Learning||Information Loss||Optimal Graph||Discrete Optimization||Out-of-Sample Extension|
II-C Clustering and Projective Clustering with Adaptive Neighbors
Clustering and projective clustering with adaptive neighbors learn a structured graph for clustering. Given a data matrix X, all the data points are connected to the $i$-th sample $x_i$ as neighbors with probabilities $s_{ij}$. A smaller distance is assigned a larger probability, and vice versa. To avoid the case in which only the nearest data point is the neighbor of $x_i$ with probability 1 and all the other data points are excluded from the neighbor set of $x_i$, a natural solution is to determine the probabilities by solving

$\min_{s_i^T \mathbf{1} = 1,\ s_i \ge 0} \sum_{j=1}^{n} \left( \|x_i - x_j\|_2^2 s_{ij} + \gamma s_{ij}^2 \right).$
The second term is a regularization term, and $\gamma$ is the regularization parameter.
In the clustering task that partitions the data into $c$ clusters, an ideal neighbor assignment is one in which the number of connected components of the graph is exactly the number of clusters $c$. In most cases, however, all the data points are connected as just one connected component. In order to achieve an ideal neighbor assignment, the probabilities are constrained such that the neighbor assignment becomes an adaptive process and the number of connected components is exactly $c$. The formula to calculate S then becomes:

$\min_{S, F} \sum_{i,j} \left( \|x_i - x_j\|_2^2 s_{ij} + \gamma s_{ij}^2 \right) + 2\lambda \mathrm{Tr}(F^T L_S F), \quad \mathrm{s.t.}\ s_i^T \mathbf{1} = 1,\ s_i \ge 0,\ F \in \mathbb{R}^{n \times c},\ F^T F = I,$
where, when $\lambda$ is large enough, the sum of the $c$ smallest eigenvalues of $L_S$, $\sum_{i=1}^{c} \sigma_i(L_S) = \min_{F^T F = I} \mathrm{Tr}(F^T L_S F)$, is forced to be zero. Thus, the constraint $\mathrm{rank}(L_S) = n - c$ can be satisfied [alavi1991graph].
II-D The Constrained Laplacian Rank for Graph Based Clustering
The constrained Laplacian rank for graph based clustering (CLR) [DBLP:conf/aaai/NieWJH16] follows the same idea as clustering and projective clustering with adaptive neighbors. Differently, it learns a new similarity matrix S based on the given affinity matrix A such that S is more suitable for the clustering task. In CLR, the corresponding Laplacian matrix is also constrained by $\mathrm{rank}(L_S) = n - c$. Under this constraint, all data points can be directly partitioned into exactly $c$ clusters [alavi1991graph]. Specifically, CLR solves the following optimization problem:

$\min_{S} \|S - A\|_F^2, \quad \mathrm{s.t.}\ s_i^T \mathbf{1} = 1,\ s_{ij} \ge 0,\ \mathrm{rank}(L_S) = n - c.$
II-E Key Differences between Our Methods and Existing Works
Our work advocates discrete optimization of the cluster labels, where the optimal graph structure is adaptively constructed, the discrete cluster labels are directly learned, and the out-of-sample extension is well supported. Existing clustering methods, such as k-means (KM), normalized cut (N-cut) [DBLP:conf/cvpr/ShiM97], ratio cut (R-cut) [DBLP:journals/tcad/HagenK92], CLR, spectral embedded clustering (SEC) [DBLP:journals/tnn/NieZTXZ11], CAN and PCAN, suffer from different problems. Our methods aim to tackle them in a unified learning framework. The main differences between the proposed methods and existing clustering methods are summarized in Table II.
III The Proposed Methodology
In this section, we present the details of the proposed methods and introduce an alternating optimization algorithm for solving the formulated problems.
III-A Overall Formulation
Most existing graph based clustering methods separate graph construction and clustering into two independent processes. The unguided graph construction may lead to sub-optimal clustering results. CAN and PCAN alleviate this problem; however, they still suffer from information loss and the lack of out-of-sample extension.
In this paper, we propose a unified discrete optimal graph clustering (DOGC) framework to address these problems. DOGC exploits the correlation between the similarity graph and the discrete cluster labels when performing the clustering. It learns a similarity graph with optimal structure for clustering and directly obtains the discrete cluster labels. Under this circumstance, our model can not only take advantage of optimal graph learning, but also obtain discrete clustering results. To achieve the above aims, we derive the overall formulation of DOGC as
where $\alpha$ and $\beta$ are penalty parameters, and Q is a rotation matrix that rotates the continuous labels to the discrete labels. The parameter $\lambda$ can be determined during the iterations: in each iteration, we initialize $\lambda$ with a moderate value, then adaptively increase it if the number of connected components of S is smaller than $c$ and decrease it if the number is greater than $c$.
In Eq.(6), we learn an optimal structured graph and discrete cluster labels simultaneously from the raw data. The first term learns the structured graph. To pursue optimal clustering performance, S should theoretically have exactly $c$ connected components if there are $c$ clusters. Equivalently, to ensure the quality of the learned graph, the Laplacian matrix $L_S$ should have $c$ zero eigenvalues and the sum of its $c$ smallest eigenvalues, $\sum_{i=1}^{c} \sigma_i(L_S)$, should be zero. According to the Ky Fan theorem [Fan1949On], $\sum_{i=1}^{c} \sigma_i(L_S) = \min_{F^T F = I} \mathrm{Tr}(F^T L_S F)$. Hence, the second term guarantees that the learned S is optimal for subsequent clustering. The third term finds a proper rotation matrix Q that makes FQ close to the discrete cluster labels Y. Ideally, if data points $x_i$ and $x_j$ belong to different clusters, the corresponding rows of FQ should differ, and vice versa. That is, the $i$-th and $j$-th rows of FQ coincide if and only if data points $x_i$ and $x_j$ are in the same cluster, or equivalently $y_i = y_j$.
The raw features may be high-dimensional and may contain adverse noise that is detrimental to similarity graph learning. To enhance the robustness of the model, we further extend Eq.(6) as
W is a projection matrix. It maps the high-dimensional data into a proper subspace to remove noise and accelerate similarity graph learning.
III-B Optimization Algorithm for Solving Problem (7)
In this subsection, we adopt alternating optimization to solve problem (7) iteratively. In particular, we optimize the objective function with respect to one variable while fixing the remaining ones. The key steps are as follows.
Update S: For updating S, the problem is reduced to
Since $\mathrm{Tr}(F^T L_S F) = \frac{1}{2} \sum_{i,j} \|f^i - f^j\|_2^2 s_{ij}$, where $f^i$ is the $i$-th row of F, problem (8) can be rewritten as
In problem (9), each row vector $s_i$ can be solved separately as follows:
where $d_{ij} = \|W^T x_i - W^T x_j\|_2^2 + \lambda \|f^i - f^j\|_2^2$ and $d_i$ denotes the vector whose $j$-th element is $d_{ij}$.
The optimal solution can be obtained by solving the convex quadratic programming problem $\min_{s_i} \|s_i + \frac{1}{2\gamma} d_i\|_2^2$, s.t. $s_i^T \mathbf{1} = 1$ and $s_i \ge 0$.
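This quadratic program is exactly a Euclidean projection onto the probability simplex, which admits an efficient closed-form threshold solution. The sketch below is illustrative: `v` stands for the point $-d_i/(2\gamma)$ assembled from the distance vector, an assumption following the derivation above.

```python
import numpy as np

def project_simplex(v):
    """Solve min_s ||s - v||^2  s.t.  s >= 0, sum(s) = 1."""
    u = np.sort(v)[::-1]                          # sort descending
    css = np.cumsum(u) - 1                        # cumulative sums minus target
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1)                  # optimal shift
    return np.maximum(v - theta, 0)               # clip below zero
```

The thresholding automatically produces sparse rows of S, consistent with the neighbor-sparsity discussion later in this section.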
Update F: For updating F, it is equivalent to solving
The above problem can be efficiently solved by the algorithm proposed in [DBLP:journals/mp/WenY13].
Update W: For updating W, the problem becomes
which can be rewritten as
We can solve W using the Lagrangian multiplier method. The Lagrangian function of problem (13) is
where $\Lambda$ denotes the Lagrangian multipliers. Taking the derivative w.r.t. W and setting it to zero, we have
Denoting the coefficient matrix in Eq.(15) by V, the solution of W is formed by the eigenvectors corresponding to the smallest eigenvalues of V. In optimization, we first fix W in V. Then we update W by solving the eigen-problem, and assign the newly obtained W back to V [DBLP:conf/ijcai/WangLNH15]. We iteratively update it until the KKT condition [Lemar2006S] in Eq.(15) is satisfied.
Update Q: For updating Q, we have
It is the orthogonal Procrustes problem [DBLP:journals/chinaf/Nie0L17], which admits a closed-form solution.
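The closed-form Procrustes solution can be sketched as follows: with the SVD $F^T Y = U \Sigma V^T$, the minimizer of $\|FQ - Y\|_F^2$ over orthogonal Q is $Q = U V^T$ (a standard fact; the NumPy snippet is illustrative).

```python
import numpy as np

def update_Q(F, Y):
    # Orthogonal Procrustes: Q = U V^T from the SVD of F^T Y
    U, _, Vt = np.linalg.svd(F.T @ Y)
    return U @ Vt
```

When Y is exactly a rotation of F, this recovers the rotation; otherwise it returns the closest orthogonal alignment in the Frobenius sense.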
III-C Determining the Value of $\gamma$ [DBLP:conf/kdd/NieWH14]
In practice, the regularization parameter $\gamma$ is difficult to tune since its value can range from zero to infinity. In this subsection, we present an effective method to determine $\gamma$ in problem (8). For each $i$, the objective function in problem (9) is equal to the one in problem (10). The Lagrangian function of problem (10) is
where $\eta$ and $\rho_i \ge 0$ are the Lagrangian multipliers.
According to the KKT condition, it can be verified that the optimal solution should be

$s_{ij} = \left( -\frac{d_{ij}}{2\gamma_i} + \eta \right)_+.$
In practice, we can achieve better performance if we focus on the locality of the data. Therefore, it is preferred to learn a sparse $s_i$, i.e., only the $k$ nearest neighbors of $x_i$ have a chance to connect to $x_i$. Another benefit of learning a sparse similarity matrix S is that the computational burden of the subsequent processing can be alleviated significantly.
Without loss of generality, suppose $d_{i1}, d_{i2}, \ldots, d_{in}$ are ordered from small to large. If the optimal $s_i$ has only $k$ nonzero elements, then according to Eq.(21), we know $s_{ik} > 0$ and $s_{i,k+1} = 0$. Therefore, we have

$-\frac{d_{ik}}{2\gamma_i} + \eta > 0 \quad \text{and} \quad -\frac{d_{i,k+1}}{2\gamma_i} + \eta \le 0.$
According to Eq.(21) and the constraint $s_i^T \mathbf{1} = 1$, we have

$\sum_{j=1}^{k} \left( -\frac{d_{ij}}{2\gamma_i} + \eta \right) = 1 \ \Rightarrow \ \eta = \frac{1}{k} + \frac{1}{2k\gamma_i} \sum_{j=1}^{k} d_{ij}.$
Therefore, in order to obtain an optimal solution to problem (10) that has exactly $k$ nonzero values, we can set $\gamma_i$ to be

$\gamma_i = \frac{k}{2} d_{i,k+1} - \frac{1}{2} \sum_{j=1}^{k} d_{ij}.$
The overall $\gamma$ can be set to the mean of $\gamma_1, \gamma_2, \ldots, \gamma_n$. That is, we set

$\gamma = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{k}{2} d_{i,k+1} - \frac{1}{2} \sum_{j=1}^{k} d_{ij} \right).$
The number of neighbors $k$ is much easier to tune than the regularization parameter $\gamma$, since $k$ is an integer and has an explicit meaning.
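The derivation above can be sketched directly: for each sample, $\gamma_i$ depends only on the $k+1$ smallest distances to the other samples, and the global $\gamma$ is their mean. In this illustrative snippet, `dist` is assumed to hold the pairwise squared distances $d_{ij}$.

```python
import numpy as np

def gamma_from_k(dist, k):
    """dist: (n, n) matrix of squared distances d_ij; returns the mean gamma."""
    gammas = []
    for i in range(dist.shape[0]):
        d = np.sort(np.delete(dist[i], i))      # drop d_ii, sort ascending
        # gamma_i = (k/2) * d_{i,k+1} - (1/2) * sum of the k smallest distances
        gammas.append(0.5 * (k * d[k] - d[:k].sum()))
    return float(np.mean(gammas))
```

With this choice, each row of the learned S has exactly $k$ nonzero entries, so the neighbor number replaces $\gamma$ as the user-facing parameter.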
III-D Out-of-Sample Extension
Recall that most existing graph based clustering methods can hardly generalize to out-of-sample data, which widely exist in real practice. In this paper, with the learned discrete labels and mapping matrix, we can easily extend DOGC to solve the out-of-sample problem. Specifically, we design an adaptive robust module with the $\ell_{2,1}$ loss [DBLP:journals/tmm/YangZGZC14] and integrate it into the above discrete optimal graph clustering model to learn a prediction function for unseen data. In our extended model (DOGC-OS), the discrete labels are simultaneously contributed by the original data through the mapping matrix P and by the continuous labels F through the rotation matrix Q. Specifically, DOGC-OS is formulated as follows:
where the last term is the prediction function learning module. It is calculated as
P is the projection matrix and the loss function is the $\ell_{2,1}$ loss, which is capable of alleviating sample noise:

$\|M\|_{2,1} = \sum_{i=1}^{n} \|m^i\|_2,$

where $m^i$ is the $i$-th row of matrix M. The above loss not only suppresses adverse noise but also enhances the flexibility for adapting to different noise levels.
III-E Optimization Algorithm for Solving Problem (27)
Due to the existence of the $\ell_{2,1}$ loss, directly optimizing the model is difficult. Hence, we transform it into an equivalent problem as follows:
where D is a diagonal matrix with its $i$-th diagonal element computed as $d_{ii} = \frac{1}{2\|r^i\|_2}$, R is the loss residual, and $r^i$ is the $i$-th row of R.
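The reweighting that underlies this equivalent problem can be sketched as follows: the $\ell_{2,1}$ norm of the residual is handled through the diagonal matrix D with entries $1/(2\|r^i\|_2)$. The `eps` guard against zero rows is an implementation assumption, not part of the formulation.

```python
import numpy as np

def l21_norm(R):
    # Sum of the l2 norms of the rows of R
    return float(np.sqrt((R ** 2).sum(axis=1)).sum())

def reweight_matrix(R, eps=1e-12):
    # Diagonal D with d_ii = 1 / (2 * ||r^i||_2); eps avoids division by zero
    row_norms = np.sqrt((R ** 2).sum(axis=1))
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))
```

A useful sanity check is the identity $\mathrm{Tr}(R^T D R) = \frac{1}{2}\|R\|_{2,1}$ for nonzero rows, which is what makes the reweighted quadratic surrogate tight at the current iterate.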
The steps for updating S, F, Q and W are similar to those of DOGC, except for the updating of P and Y.
Update P: For updating P, we arrive at
With the other variables fixed, we arrive at the optimization rule for updating P as
Update Y: For updating Y, we arrive at
Given that $\mathrm{Tr}(Y^T Y) = n$ and $\mathrm{Tr}(Y Y^T) = n$ hold for any $Y \in \mathrm{Ind}$, we can rewrite the above sub-problem as below
where B collects the terms linear in Y. The above problem can be easily solved as

$y_{ij} = \begin{cases} 1, & j = \arg\max_{k} b_{ik}, \\ 0, & \text{otherwise}, \end{cases}$

where $b_{ik}$ is the $(i,k)$-th entry of B.
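The discrete-label update decouples across rows: maximizing a trace of the form $\mathrm{Tr}(Y^T B)$ over indicator matrices puts each row's single 1 at the largest entry of the corresponding row of B. Here B is a stand-in for the combined linear term (an assumption; the paper's formulation defines it from X, P, F and Q).

```python
import numpy as np

def update_Y(B):
    # Row-wise argmax: y_ij = 1 iff j maximizes b_ij, else 0
    Y = np.zeros_like(B)
    Y[np.arange(B.shape[0]), np.argmax(B, axis=1)] = 1.0
    return Y
```

Because each row is solved independently in closed form, this step costs only a row-wise maximum search.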
III-F Relations with Existing Clustering Methods
In this subsection, we discuss the relations of our method DOGC with the main graph based clustering methods.
Connection to Spectral Clustering [DBLP:conf/aaai/HuangNH13a]. In our model, $\beta$ controls the transformation from continuous cluster labels to discrete labels, and $\lambda$ is adaptively updated with the number of connected components in the dynamic graph S. When W is an identity matrix, the projective subspace learning with W becomes an identity transformation. When S is fixed, it is no longer a dynamic structure and remains unchanged. When $\beta = 0$, the third term in Eq.(7) vanishes. Under these circumstances, Eq.(7) is equivalent to $\min_{F^T F = I} \mathrm{Tr}(F^T L_S F)$. Thus our model degenerates to spectral clustering.
Connection to Optimal Graph Clustering [DBLP:conf/kdd/NieWH14]. In DOGC, when W is an identity matrix and $\beta = 0$, the effects of W and $\beta$ are the same as above. Differently, when S is dynamically constructed, Eq.(7) reduces to the adaptive graph learning problem of CAN, where S contains a specific number of connected components that is adjusted by the value of $\lambda$. Under these circumstances, our model degenerates to optimal graph clustering.
III-G Complexity Analysis
As for DOGC, with our optimization strategy, updating S amounts to solving $n$ small quadratic programming problems, one per row of S. Solving Q involves the SVD of the $c \times c$ matrix $F^T Y$. F is updated with the orthogonality-preserving solver of [DBLP:journals/mp/WenY13]. To update W, two layers of iterations are performed to achieve convergence; since the number of inner iterations is generally a constant, updating W costs a constant number of eigen-decompositions. Optimizing Y only requires a row-wise maximum search. In DOGC-OS, we additionally update D and P in each iteration. The per-iteration cost of the proposed methods is thus dominated by the operations on the $n \times n$ similarity graph, which is comparable to many existing graph-based clustering methods.
III-H Convergence Analysis
In this subsection, we prove that the proposed iterative optimization in Algorithm 1 converges. Before that, we introduce three lemmas.
Lemma 1. For any positive real numbers $a$ and $b$, we have the following inequality [DBLP:journals/tip/NieCLL18]:

$\sqrt{a} - \frac{a}{2\sqrt{b}} \le \sqrt{b} - \frac{b}{2\sqrt{b}}.$
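This inequality follows from $(\sqrt{a} - \sqrt{b})^2 \ge 0$ after dividing by $2\sqrt{b}$, and can be checked numerically; the small sketch below is purely illustrative.

```python
import numpy as np

def lemma1_gap(a, b):
    # Right-hand side minus left-hand side of the Lemma 1 inequality:
    # gap = (sqrt(a) - sqrt(b))^2 / (2 * sqrt(b)) >= 0 for a, b > 0
    return (np.sqrt(b) - b / (2 * np.sqrt(b))) - (np.sqrt(a) - a / (2 * np.sqrt(b)))
```

Equality holds exactly when $a = b$, which is what makes the surrogate in the reweighted problem tight at the current iterate.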
Lemma 2. Let $r^i$ be the $i$-th row of the residual R in the previous iteration, and $\tilde{r}^i$ be the $i$-th row of the residual $\tilde{R}$ in the current iteration. It has been shown in [DBLP:conf/aaai/KangPCX18] that the following inequality holds:

$\|\tilde{r}^i\|_2 - \frac{\|\tilde{r}^i\|_2^2}{2\|r^i\|_2} \le \|r^i\|_2 - \frac{\|r^i\|_2^2}{2\|r^i\|_2}.$
Lemma 3. Given R and $\tilde{R}$ as in Lemma 2, with D computed from R, we have the following conclusion:

$\sum_{i=1}^{n} \|\tilde{r}^i\|_2 - \mathrm{Tr}(\tilde{R}^T D \tilde{R}) \le \sum_{i=1}^{n} \|r^i\|_2 - \mathrm{Tr}(R^T D R).$
Proof: By summing up the inequalities of Lemma 2 over all $i$, we can easily reach the conclusion of Lemma 3. ∎
Theorem 1. In DOGC-OS, updating P and Y will decrease the objective value of problem (27) until convergence.
Proof: Let $\tilde{P}$ and $\tilde{Y}$ be the optimized solutions of the alternating problem (27), and denote
It is easy to know that:
According to Lemma 1, we have
We also denote
Then, we have