Clustering is an important unsupervised learning task that aims to group a set of data objects into clusters such that objects in the same cluster are more similar to each other than to those in different clusters. For complex datasets, Spectral Clustering and its many variants [37, 26, 13] are particularly popular due to their ability to discover highly non-convex clusters. Such algorithms use the spectrum of the similarity matrix of the data to perform dimensionality reduction before grouping objects in a low-dimensional space [20, 1, 35, 4, 5, 22, 24]. Typically, the algorithms in the Spectral Clustering family consist of multiple separate stages as follows:
Construct a pairwise similarity matrix, e.g., according to the $k$-nearest-neighbors graph of the data;
Compute the corresponding Laplacian matrix and normalize it;
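As a rough illustration of such a multi-stage pipeline, the following minimal Python sketch chains the stages in the style of normalized spectral clustering: a $k$-nearest-neighbors similarity matrix, a normalized Laplacian, an eigenvector embedding, and K-Means on the embedded points. The function names, the 0/1 edge weights, the farthest-first K-Means initialization and the parameter defaults are assumptions made for this illustration, not details taken from the text.

```python
import numpy as np


def knn_similarity(X, k=5):
    """Stage 1: symmetric 0/1 k-nearest-neighbors graph (rows of X are samples)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((len(X), len(X)))
    W[np.repeat(np.arange(len(X)), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                    # symmetrize the graph


def normalized_laplacian(W):
    """Stage 2: L = I - D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]


def spectral_embedding(L, n_clusters):
    """Stage 3: eigenvectors of the smallest eigenvalues, rows scaled to unit length."""
    _, vecs = np.linalg.eigh(L)                  # eigenvalues come in ascending order
    U = vecs[:, :n_clusters]
    return U / np.linalg.norm(U, axis=1, keepdims=True)


def kmeans(Z, n_clusters, n_iter=100):
    """Stage 4: plain Lloyd iterations with farthest-first initialization."""
    centers = [Z[0]]
    for _ in range(1, n_clusters):
        dist = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[dist.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_clustering(X, n_clusters, k=5):
    """Run the four stages one after another; each stage is unaware of the others."""
    W = knn_similarity(X, k)
    Z = spectral_embedding(normalized_laplacian(W), n_clusters)
    return kmeans(Z, n_clusters)
```

Each stage here is solved in isolation, so an unfortunate choice in an early stage (for example, the neighborhood size `k`) cannot be corrected by the later ones.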
The above multi-stage approach may lead to sub-optimal clustering results because of possible mismatches between the stages. Moreover, there is still much room for improvement in the optimization methods. For example, normalizing the similarity matrix into a doubly stochastic matrix (a square matrix $A$ that satisfies $A \geq 0$, $A\mathbf{1} = \mathbf{1}$, and $A^{\top}\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is a column vector with all elements equal to $1$ and $\geq$ denotes element-wise non-negativity) has been found to be beneficial [35], and imposing some global priors (such as the Laplacian rank) could help to reveal the underlying clustering structure of the dataset [22, 36].
In light of the above analysis, we extend Spectral Clustering and propose an end-to-end single-stage learning framework for clustering named “Regularized Non-negative Spectral Embedding”, or RNSE for short. It does not rely on a predefined similarity matrix but learns the similarity matrix in a data-driven, self-adaptive manner. Furthermore, it introduces two global priors, i.e., the doubly stochastic constraint (for normalizing the similarity matrix) and the non-negative low-rank constraint (for capturing the intrinsic clustering structure), to facilitate the optimization. The effectiveness of RNSE for clustering has been confirmed by extensive experiments on both synthetic and real-world datasets.
II Related Work
Our proposed clustering technique RNSE is closely connected to several research areas.
First of all, RNSE has its roots in Spectral Clustering [20] and its many variants, including Laplacian Eigenmap (LE) [1], Locality Preserving Projections (LPP) [10], and Spectral Regression (SR) [4]. Those techniques start from a predefined pairwise similarity matrix and perform clustering (and other tasks) via spectral decomposition. They typically consist of several separate stages. In contrast, our proposed RNSE technique carries out the whole process from the given data straight to the clustering result in just one stage, with the similarity matrix learned automatically.
Next, the formulation of RNSE imposes non-negativity constraints, so it is related to the family of Non-negative Matrix Factorization (NMF) [14] techniques, including Non-negative Matrix Tri-factorization (NM3F) [18] and Graph regularized NMF (GNMF) [3]. Those techniques decompose the data matrix into two or more low-rank non-negative matrices from which the clustering structure of the data can be read out. However, our proposed RNSE technique differs from them in that it contains two optimization sub-problems with non-negativity constraints, which are combined in a unified optimization framework. Thus RNSE obtains the clustering results directly after solving the optimization problem, whereas those NMF-based methods need some post-processing, such as the K-Means algorithm [17], to get the final clustering results.
Last but not least, RNSE utilizes structural regularization in its learning algorithm. Generally speaking, it is useful to incorporate appropriate priors into the learning process, as the priors can help to find the intrinsic structure of the data. Locality Preserving Projections (LPP) [10] tries to maintain the $k$-nearest-neighbors graph while performing linear dimensionality reduction of the data. Similarly, Graph Regularized Non-negative Matrix Factorization (GNMF) [3] adds a regularizer term based on the $k$-nearest-neighbors graph to the vanilla NMF algorithm. Both techniques use the local (manifold) structure of the data for regularization. There also exist techniques with global structural regularization. For example, Doubly Stochastic Normalization (DSN) [35] enforces the doubly stochastic condition on the similarity matrix before carrying out Spectral Clustering. Besides, Structured Doubly Stochastic Matrix (SDS) [30], Clustering with Adaptive Neighbors (CAN) [21] and Constrained Laplacian Rank (CLR) [22] borrow the idea of a graph with a prescribed number of connected components (cf. Theorem 1) from spectral graph theory to form the regularization for their learning algorithms. Inspired by the above methods, our proposed RNSE technique utilizes two global structures (i.e., the doubly stochastic matrix and non-negativity constrained connected components) as regularizers for clustering.
III The Proposed Approach
Given a data matrix , where is the dimension of a sample, is the total number of samples, and denotes the -th sample (). Let be the similarity matrix, whose entry corresponds to the similarity between and . Besides, let be a feature mapping from onto a reproducing kernel Hilbert space and there exists . Classic clustering methods usually pre-compute the similarity matrix from the Euclidean distance between pairs of samples. Instead, we formulate it as a data-driven learning problem:
where is constrained to be a doubly stochastic matrix for better clustering [35, 29]. Note that Eq. (1) uses the distance between the kernel mappings rather than between the original vectors, because measures the Euclidean space and kernel mappings in the Hilbert space match that characteristic. Based on the similarity matrix , the classic spectral embedding methods [1, 20, 35, 22, 24] can be formulated as follows:
where is the spectral embedding of the data points and is the dimension of the embedding vectors. Thereafter, K-Means is commonly adopted as a post-processing step to cluster the data samples. In contrast, we further impose non-negativity on , set to (the number of clusters), and finally arrive at:
Notably, each column of will be a one-hot vector; in other words, can be treated as an indicator matrix for clustering. Besides, from the following Theorem 1 and Theorem 2, we can conclude that the optimization (3) actually captures the intrinsic structure, i.e., -connected clusters.
Theorem 1 (-connected clusters [6]).
“The multiplicity of the eigenvalue $0$ of the Laplacian matrix is equal to the number of connected clusters/components in the graph associated with .”, which implies:
where , and are the eigenvalues of in an ascending order.
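Theorem 1 can be checked numerically on a toy graph. The sketch below is only an illustration: it assumes the unnormalized Laplacian $L = D - W$ of a symmetric non-negative similarity matrix $W$ (these symbols and the function name are chosen here, not taken from the text) and counts the eigenvalues that are numerically zero.

```python
import numpy as np


def count_zero_eigenvalues(W, tol=1e-8):
    """Multiplicity of the eigenvalue 0 of L = D - W for a symmetric W >= 0."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(L)          # real eigenvalues, ascending order
    return int(np.sum(np.abs(eigvals) < tol))


# Two disjoint cliques (3 nodes and 2 nodes): two connected components.
W = np.zeros((5, 5))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
two_components = count_zero_eigenvalues(W)

# A single extra edge between the cliques merges them into one component.
W[2, 3] = W[3, 2] = 1.0
one_component = count_zero_eigenvalues(W)
```

In this toy case `two_components` is 2 and `one_component` is 1, matching the number of connected components before and after the bridging edge is added.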
Theorem 2 (Ky Fan’s Theorem [8]).
Given a matrix , the following optimization problem:
is equivalent to .
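For intuition, Ky Fan's theorem in its commonly stated form says that, for a symmetric matrix $L$, the minimum of $\mathrm{Tr}(F^{\top} L F)$ over matrices $F$ with $k$ orthonormal columns equals the sum of the $k$ smallest eigenvalues of $L$ and is attained by the corresponding eigenvectors. The following sketch checks that form numerically; the symbols $L$, $F$, $k$ and the function names are assumptions of this illustration and may not match the notation of the statement above.

```python
import numpy as np


def ky_fan_minimizer(L, k):
    """Return (sum of the k smallest eigenvalues of L, the matching eigenvectors)."""
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues, orthonormal columns
    return eigvals[:k].sum(), eigvecs[:, :k]


def trace_objective(L, F):
    """Value of Tr(F^T L F)."""
    return float(np.trace(F.T @ L @ F))


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
L = A + A.T                                  # any symmetric matrix works here
best_value, F_star = ky_fan_minimizer(L, k=2)

# Any other F with orthonormal columns should not beat the eigenvector solution.
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
random_value = trace_objective(L, Q)
```

Here `random_value` is never smaller than `best_value` (up to rounding), while `trace_objective(L, F_star)` reproduces `best_value`.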
Generally, multi-stage learning, as in Spectral Clustering, tends to yield sub-optimal solutions for clustering. Therefore, we combine (1) and (3) into a joint learning approach:
Here, is the set of doubly stochastic matrices, is the set of non-negative low-rank matrices, and and are two positive hyper-parameters. The philosophy of the optimization (6) is end-to-end single-stage learning for clustering based on non-negative spectral embedding. Therefore, we call our method “Regularized Non-negative Spectral Embedding (RNSE)” for Clustering.
IV Optimization Methods
Regarding the objective function , there are two coupled variables to be learned, which makes it a non-convex optimization problem. Thus, we adopt the classic strategy of alternating iterations [14, 15, 3] to address it, i.e., updating while keeping fixed and vice versa, until a local minimum is reached. The learning process is summarized in Algorithm 1. Below, we detail how the two subproblems are solved.
IV-A Optimizing the similarity matrix while keeping the indicator matrix fixed
When the clustering indicator matrix is fixed, the subproblem for optimizing the similarity matrix can be written as:
Set , then we can transform (10) into:
which is further equivalent to the simplified formalization:
with . Essentially, this is an optimization problem to find a doubly stochastic matrix nearest to the given matrix , which could be converted into the “metric projection optimizations” and solved with alternating projection methods.
Definition 1 (Metric Projection).
Given a set and a point , the metric projection (if exists) of onto is a point such that:
Additionally, if for any , there exists such a unique , then the metric projection onto is rewritten as the following operator:
Theorem 3 (Projection Theorem).
Let be a closed convex set. For any , there exists a unique such that for all , which is formally denoted as:
See Ref. [7]. ∎
Theorem 4 (Dykstra’s Method).
Let , , , be closed convex sets and . If , then given iterated by:
with initial values , there holds:
Theorem 5.
Let and , then is a closed convex set.
This can be easily verified. ∎
Now, let us return to subproblem (12). First, Theorem 3 and Theorem 5 together tell us that there must be a unique global optimum for the optimization problem (12). Then, inspired by Theorem 4, we can turn (12) into:
Given any point , the global optimal solution to is
See Ref. . ∎
Given any point , the global optimal solution to is
where is an element-wise non-negative operation.
This can be easily verified. ∎
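The alternating projection idea of this subsection can be sketched in Python as follows. This is a minimal illustration of Dykstra's method (Theorem 4) for finding the doubly stochastic matrix nearest, in Frobenius norm, to a given matrix. It assumes a split into three simple closed convex sets (unit row sums, unit column sums, non-negativity), which may differ from the decomposition used above, and all function names, tolerances and iteration limits are hypothetical.

```python
import numpy as np


def project_row_sums(X):
    """Euclidean projection onto {X : every row sums to one}."""
    return X + (1.0 - X.sum(axis=1, keepdims=True)) / X.shape[1]


def project_col_sums(X):
    """Euclidean projection onto {X : every column sums to one}."""
    return X + (1.0 - X.sum(axis=0, keepdims=True)) / X.shape[0]


def project_nonnegative(X):
    """Euclidean projection onto {X : X >= 0}, i.e. element-wise max(x, 0)."""
    return np.maximum(X, 0.0)


def nearest_doubly_stochastic(K, max_iter=10000, tol=1e-10):
    """Dykstra's cyclic projections onto the intersection of the three sets."""
    projections = (project_row_sums, project_col_sums, project_nonnegative)
    X = np.array(K, dtype=float)
    increments = [np.zeros_like(X) for _ in projections]
    for _ in range(max_iter):
        X_old = X
        for i, project in enumerate(projections):
            Y = X + increments[i]            # add back this set's correction term
            X = project(Y)
            increments[i] = Y - X            # remember what the projection removed
        violation = max(np.abs(X.sum(axis=0) - 1.0).max(),
                        np.abs(X.sum(axis=1) - 1.0).max())
        if violation < tol and np.abs(X - X_old).max() < tol:
            break
    return X
```

The correction terms are what distinguish Dykstra's method from plain alternating projections: without them the iteration would still reach a doubly stochastic matrix, but not necessarily the one closest to the input.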
IV-B Optimizing the indicator matrix while keeping the similarity matrix fixed
It is straightforward to see that, when the similarity matrix is fixed, the optimization problem (6) reduces to:
Since is constrained to be both orthogonal and non-negative in (22), this problem appears quite challenging. However, because the similarity matrix is doubly stochastic, we can establish the following Theorem 8.
Theorem 8 (-Transformer).
Given is a doubly stochastic matrix, the subproblem (22) could be converted to the following optimization problem:
Since , then there is , which is followed by:
For the optimization problem (23), the augmented Lagrangian function is:
where denotes the Lagrange multiplier for the constraint . The non-negativity constraints can be left out of the augmented Lagrangian function when the multiplicative update philosophy [15, 9, 31] is adopted, because the non-negativity of the variable is naturally maintained during the iterative updates. More specifically, the derivative of w.r.t. can be split into two parts, i.e.,
where and both represent element-wise non-negative matrices. Then the iterative formula for updating could be written as:
where and correspond to element-wise multiplication and element-wise division, respectively. Clearly, if the factors in Equation (27) are all non-negative, then the result is also non-negative. To obtain the detailed formula for Equation (27), we have to derive .
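To illustrate the multiplicative update philosophy in isolation, the sketch below applies it to plain NMF with the classic Frobenius-norm rules of [15], not to the RNSE subproblem itself. For the factor $H$, the gradient of $\|X - WH\|_F^2$ splits into a positive part $W^{\top}WH$ and a negative part $W^{\top}X$, and the update multiplies $H$ element-wise by their ratio. The function name, the random initialization and the small constant `eps` are assumptions of this example.

```python
import numpy as np


def nmf_multiplicative(X, rank, n_iter=200, eps=1e-12, seed=0):
    """Plain NMF, X ~ W @ H, with multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    losses = []
    for _ in range(n_iter):
        # variable <- variable * (negative gradient part / positive gradient part)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        losses.append(float(np.linalg.norm(X - W @ H) ** 2))
    return W, H, losses
```

Because every factor in each update is non-negative, `W` and `H` stay non-negative without any explicit projection, which is the property the paragraph above relies on.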
Taking the partial derivatives of w.r.t. and respectively, and setting them to zero by the Karush-Kuhn-Tucker conditions, we arrive at:
Then the specific formula for updating is expressed in the following:
or in an element-wise version:
V Experiments
This section empirically evaluates the effectiveness of RNSE for clustering on both synthetic and real-world datasets.
V-A Experiments on Synthetic Datasets
The first synthetic dataset we constructed is a matrix with four block matrices diagonally arranged (Fig. 1). The data in each block denotes the affinity of any two points within one cluster while the data outside all blocks denotes noise. The affinity data within each block is randomly generated in the range from to , while the noise is randomly generated in the range from to , which is set as , and respectively during the experiments.
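A block-structured affinity matrix of this kind could be generated along the following lines. This is a hypothetical sketch only: the value ranges (within-block affinities uniform in $[0, 1]$, noise uniform in $[0, \texttt{noise\_level}]$), the block sizes, the symmetrization step and the function name are placeholders chosen for illustration and are not the settings used in the experiments.

```python
import numpy as np


def block_affinity(block_sizes, noise_level, seed=0):
    """Noisy affinity matrix with diagonally arranged blocks (hypothetical ranges)."""
    rng = np.random.default_rng(seed)
    n = sum(block_sizes)
    # Background noise everywhere: uniform in [0, noise_level) (placeholder range).
    A = rng.uniform(0.0, noise_level, size=(n, n))
    start = 0
    for size in block_sizes:
        stop = start + size
        # Within-cluster affinities: uniform in [0, 1) (placeholder range).
        A[start:stop, start:stop] = rng.uniform(0.0, 1.0, size=(size, size))
        start = stop
    return (A + A.T) / 2.0                   # symmetrize (an assumption of this sketch)


# Four clusters of ten points each, at three illustrative noise levels.
graphs = {level: block_affinity([10, 10, 10, 10], level) for level in (0.3, 0.5, 0.7)}
```

Raising `noise_level` toward the within-block range makes the diagonal blocks progressively harder to see, which is the effect the noise settings are meant to probe.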
Fig. 1 exhibits the original graphs and their corresponding clustering results under different noise settings (e.g. ). Overall, RNSE performs well on the clustering task. Specifically, RNSE successfully learns a structured doubly stochastic matrix with explicit block structures, which divides the data samples into four clear clusters. As the noise increases, the block structure in the original graph blurs, but RNSE is still able to detect the intrinsic structure of the data, which indicates the robustness of the RNSE method for potential practical applications.
The second synthetic dataset is a randomly generated two-moon dataset. There are two clusters, each being a set of samples distributed in a moon shape (Fig. 2). Here, we tested K-Means, NCuts [27, 20] and RNSE on this dataset. Note that in this figure the two clusters are colored red and blue, respectively, and the green lines denote the affinity between pairs of points. On the whole, Fig. 2 illustrates the effectiveness of RNSE.
More specifically, the following observations can be made. First, K-Means assigns some points to the wrong cluster (Fig. 2(b)). This is understandable, since K-Means mainly suits ball-like data distributions and is thus not a good fit for manifold-structured (e.g. two-moon) data. Second, as shown in Fig. 2(c), NCuts divides the data points well into two separate clusters, but the green lines (affinities) between samples cross between the clusters, which indicates that classic spectral clustering methods tend to mis-recognize neighbors. Third, RNSE (Fig. 2(d)), extended from the spectral family, distinguishes both the classes and the neighbors. This suggests that RNSE is potentially more capable than classic spectral methods (e.g. NCuts) of handling complex datasets.
V-B Experiments on Real-world Datasets
Datasets. Eight real-world datasets are selected for the clustering experiments. More specifically, “diabetes”, “arcene”, “yeast_uni”, “waveform21” and “gisette” are five publicly available collections from the website https://archive.ics.uci.edu/ml/datasets.php; the “PCMAC” dataset is available from the website http://featureselection.asu.edu/datasets.php; and the “mnist” and “alpha-digit” datasets are collected from Sam Roweis’ page (https://cs.nyu.edu/home/index.html). Note that we select only the top-100 samples of each digit (“0” to “9”) in the “mnist” dataset for our experiments. The detailed statistics are summarized in Table I.
Competing Methods. To demonstrate the effectiveness of RNSE, we compare it with several popular clustering algorithms, i.e., (1) Canonical K-Means (K-Means) [17] (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans), (2) Principal Component Analysis (PCA) [32] (http://www.cad.zju.edu.cn/home/dengcai/Data/DimensionReduction.html), (3) K-Means clustering in the Low-Rank subspaces (LRR) [16] (https://sites.google.com/site/guangcanliu/), (4) Non-negative Matrix Factorization (NMF) [14, 33] (http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF), (5) Normalized Cut (NCuts) [27, 20] (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering), (6) Structured Doubly Stochastic Matrix (SDS) [30], (7) Clustering with Adaptive Neighbors (CAN) [21] and (8) Constrained Laplacian Rank Algorithm for Graph-based Clustering (CLR) [22]. SDS, CAN and CLR are available at http://www.escience.cn/people/fpnie/index.html.
Note: “—” denotes that the mixed-sign matrices of those datasets are not suitable for the multiplicative update (MU) algorithms employed by NMF.
Among these algorithms, NCuts, SDS, CAN, CLR and RNSE are approaches that consider graph-based structures. PCA and LRR seek low-rank principal components for data compression. NMF is a model with non-negativity constraints and can thus learn additive parts-based components. Note that K-Means directly splits the original high-dimensional data points into clusters, and our RNSE method also clusters the datasets directly, owing to its end-to-end single-stage learning of the indicator matrix. The remaining methods (i.e., all except K-Means, CAN, CLR and RNSE) mainly conduct two-stage learning for data clustering, i.e., low-dimensional embedding followed by K-Means clustering.
Evaluation Metrics. Accuracy (ACC) and Purity are widely accepted and are therefore employed here to assess the clustering performance. Both are computed against the ground-truth class labels provided with the datasets; the higher the score, the better the clustering result. For more details, please refer to Refs. [12] and [23].
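For readers unfamiliar with these two metrics, the sketch below follows their commonly used definitions, which are assumed here rather than taken from the cited references. Purity credits each predicted cluster with its most frequent true class, while ACC is the best agreement achievable under a one-to-one mapping between predicted clusters and true classes. The brute-force search over mappings is only practical for a handful of clusters (an assignment solver would normally replace it), and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def purity(y_true, y_pred):
    """Fraction of samples that belong to the majority class of their cluster."""
    per_cluster = {}
    for t, p in zip(y_true, y_pred):
        per_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in per_cluster.values()) / len(y_true)


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over one-to-one mappings from clusters to classes (brute force)."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with dummy classes so that every cluster can receive a distinct target.
    padded = classes + [None] * max(0, len(clusters) - len(classes))
    pair_counts = Counter(zip(y_pred, y_true))
    best_hits = 0
    for assigned in permutations(padded, len(clusters)):
        hits = sum(pair_counts[(c, k)] for c, k in zip(clusters, assigned))
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)


y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 0, 0, 0, 0]          # cluster ids are arbitrary, one point misplaced
scores = (clustering_accuracy(y_true, y_pred), purity(y_true, y_pred))
```

In this small example both scores equal 5/6: cluster ids are arbitrary, so only the single misplaced point counts against the result.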
Experimental Settings. All the baseline methods adopt the best parameter configurations suggested in their corresponding papers. We use the widely used self-tuning Gaussian method [19] to construct the affinity matrix with kernel function (the value of is self-tuned, for not only RNSE but also SDS, CLR, and CAN as suggested in their corresponding papers).
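A self-tuning Gaussian affinity of this kind is commonly computed with a local scale per sample: the bandwidth of sample $i$ is its distance to its $k$-th nearest neighbor, and the affinity between samples $i$ and $j$ is $\exp(-d_{ij}^2 / (\sigma_i \sigma_j))$. The sketch below follows that common form; the default `k = 7`, the zeroed diagonal and the function name are assumptions of this illustration, not settings confirmed by the text.

```python
import numpy as np


def self_tuning_affinity(X, k=7):
    """Gaussian affinity with a per-sample local scale (rows of X are samples)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    d = np.sqrt(d2)
    # sigma_i: distance from sample i to its k-th nearest neighbor
    # (index 0 of each sorted row is the sample itself, at distance 0).
    sigma = np.sort(d, axis=1)[:, k]
    A = np.exp(-d2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)                 # no self-affinity
    return A
```

The sketch needs more than `k` samples and assumes no sample has `k` exact duplicates, since a zero local scale would make the ratio undefined.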
As to our RNSE approach, we empirically obtain competitive values of and through grid searches over and , respectively. Generally speaking, the best clustering performance on different datasets corresponds to slightly different parameter settings. However, settings around and achieve competitive experimental results on all datasets.
In addition, the maximum iterations for Algorithm 2 and Algorithm 3 are set to and respectively, and the convergence precision is configured as for these two sub-algorithms. Based on such settings, the number of outer cycles for Algorithm 1 is set to for convergence.
Clustering Results. In this part, we collect the average clustering results in terms of ACC and Purity for all the algorithms on these datasets and show them in Table II and Table III. Note that for all datasets, and in RNSE are both set to ; we repeat the experiments for times and average the metric values as the final results. Broadly speaking, different clustering approaches perform differently on various datasets. From Table II, we can see that LRR achieves high ACC values on the “PCMAC” dataset but performs quite poorly on the “yeast_uni” and “alpha_digit” datasets. The same phenomenon appears in Table III w.r.t. Purity. It is probably due to the complex structures of “yeast_uni” (the cellular localization sites of proteins) and “alpha_digit” (containing both letters and numbers), which do not match the low-rank assumption well. NCuts, a spectral method, performs quite well across the datasets, which is reasonable because spectral algorithms can capture local structures by keeping the neighborhood similarities and therefore preserve the nonlinear manifolds in complex datasets. However, it is usually inferior to the best performers, probably owing to its predefined similarities and multi-stage learning. The other competitors (taking SDS as an example) display similar patterns, i.e., yielding high values on some datasets (“yeast_uni” or “waveform-21”) while showing poor performance on others (“alpha_digit” or “PCMAC”). Nevertheless, as can be seen from the experimental results in Table II and Table III, our RNSE method consistently achieves the best or at least comparable performance on all the datasets regarding ACC and Purity. This supports the view that, by designing an end-to-end single-stage learning paradigm with structured constraints, RNSE can better capture the hidden complex structures and thus learn a well-performing indicator matrix for clustering.
Convergence & Complexity Analysis. The updating rules in Algorithm 1 for minimizing the objective function in the optimization problem (6) are essentially iterative. Here, we investigate their convergence and speed via experiments. Fig. 3 plots the loss curves of RNSE on all the selected datasets. In each sub-graph, the $y$-axis denotes the normalized objective function value (each iteration’s value divided by the first iteration’s value) and the $x$-axis is the iteration number. The curves show that the RNSE algorithm converges quickly, usually within 20 iterations.
In addition, the computational complexities of the two subproblems of RNSE (Algorithm 2 and Algorithm 3) are and , respectively (usually ). Therefore, the overall complexity of RNSE is quadratic in the number of samples, which is lower than that of many existing competitors (i.e., ). However, for larger datasets, say with millions of samples, RNSE can be expected to lose efficiency, which poses a major challenge for future work.
To sum up, the main contributions of this paper are twofold. First, we demonstrate the advantage of performing spectral-style clustering in an end-to-end single-stage fashion where the similarity matrix is not fixed in advance but learned adaptively from the data. Second, we show that the difficult optimization problem of our proposed RNSE technique can be decomposed into two subproblems (i.e., metric projection and orthogonal symmetric non-negative matrix factorization), which are then solved by successive alternating projection and strategic multiplicative update, respectively.
[1] (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. NeurIPS, pp. 585–591.
[2] (1986) A method for finding projections onto the intersection of convex sets in Hilbert spaces. Lecture Notes in Statistics, pp. 28–47.
[3] (2011) Graph regularized nonnegative matrix factorization for data representation. TPAMI, pp. 1548–1560.
[4] (2007) Spectral regression: a unified approach for sparse subspace learning. ICDM, pp. 73–82.
[5] (2011) Large scale spectral clustering with landmark-based representation. AAAI, pp. 313–318.
[6] (1997) Spectral graph theory. CBMS Regional Conference Series in Mathematics, No. 92, American Mathematical Society.
[7] (1969) Optimization by vector space methods. John Wiley and Sons, Inc., New York.
[8] On a theorem of Weyl concerning eigenvalues of linear transformations. National Academy of Sciences of the United States of America, pp. 652–655.
[9] (2015) Multi-view concept learning for data representation. TKDE 27 (11), pp. 3016–3028.
[10] (2004) Locality preserving projections. NeurIPS, pp. 153–160.
[11] Symmetric nonnegative matrix factorization: algorithms and applications to probabilistic clustering. IEEE Trans. Neural Networks 22 (12), pp. 2117–2131.
[12] (2013) Robust manifold nonnegative matrix factorization. TKDD, pp. 1–21.
[13] (2017) Deep subspace clustering networks. NeurIPS, pp. 24–33.
[14] (1999) Learning the parts of objects by non-negative matrix factorization. Nature, pp. 788–791.
[15] (2001) Algorithms for non-negative matrix factorization. NeurIPS, pp. 556–562.
[16] (2013) Robust recovery of subspace structures by low-rank representation. TPAMI, pp. 171–184.
[17] (1982) Least squares quantization in PCM. IEEE Trans. Information Theory, pp. 129–136.
[18] (2010) Orthogonal nonnegative matrix tri-factorization for semi-supervised document co-clustering. PAKDD, pp. 189–200.
[19] (2004) Self-tuning spectral clustering. NeurIPS, pp. 1601–1608.
[20] On spectral clustering: analysis and an algorithm. NeurIPS, pp. 849–856.
[21] (2014) Clustering and projected clustering with adaptive neighbors. KDD, pp. 977–986.
[22] (2016) The constrained Laplacian rank algorithm for graph-based clustering. AAAI, pp. 1969–1976.
[23] (2012) Initialization independent clustering with actively self-training method. IEEE Trans. Systems, Man, and Cybernetics, pp. 17–27.
[24] (2017) Unsupervised large graph embedding. AAAI, pp. 2422–2428.
[25] (2011) Alternating projection methods. Fundamentals of Algorithms, Society for Industrial and Applied Mathematics.
[26] (2018) SpectralNet: spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
[27] (2000) Normalized cuts and image segmentation. TPAMI, pp. 888–905.
[28] (2012) The method of alternating projections. Doctoral dissertation, University of Newcastle, Australia.
[29] (2010) Learning a bi-stochastic data similarity matrix. ICDM, pp. 551–560.
[30] (2016) Structured doubly stochastic matrix for graph based clustering. KDD, pp. 1245–1254.
[31] (2013) Nonnegative matrix factorization: a comprehensive review. TKDE 25 (6), pp. 1336–1353.
[32] (1987) Principal component analysis. Chemometrics and Intelligent Laboratory Systems, pp. 37–52.
[33] (2003) Document clustering based on non-negative matrix factorization. SIGIR, pp. 267–273.
[34] (2008) Orthogonal nonnegative matrix factorization: multiplicative updates on Stiefel manifolds. IDEAL, pp. 140–147.
[35] (2007) Doubly stochastic normalization for spectral clustering. NeurIPS, pp. 1569–1576.
[36] (2017) Adaptive manifold regularized matrix factorization for data clustering. IJCAI, pp. 3399–3405.
[37] (2019) Robust unsupervised flexible auto-weighted local-coordinate concept factorization for image clustering. ICASSP, pp. 2092–2096.