Subspace clustering is regarded as an important technique in the data mining community and for various computer vision applications [5, 6, 7, 8]. Traditional subspace clustering methods approximate a set of high-dimensional data samples by a union of lower-dimensional linear subspaces, where each subspace usually contains a subset of the samples.
In recent years, spectral clustering based methods have achieved state-of-the-art performance, following a two-step framework. First, by optimizing a self-representation problem [4, 10], a similarity matrix (and thus a similarity graph) is constructed to depict the relationships (or connections) among samples. Second, spectral clustering is employed to calculate the final assignment based on an eigen-decomposition of the affinity graph. Note that in practice, both the number of subspaces and their dimensionalities are always unknown [4, 12]. Hence, the goals of subspace clustering include finding the appropriate number of clusters and grouping the data points into them [13, 14].
Nevertheless, it is challenging to estimate the number of clusters in a unified optimization framework, since the definition of clusters is subjective, especially in the high-dimensional ambient space. Moreover, samples that belong to different clusters but lie near the intersection of two subspaces may be even closer to each other than samples from the same cluster. This may lead to a wrong estimate with several redundant clusters, namely the over-segmentation problem. Therefore, most spectral-based subspace clustering algorithms depend on a manually given and fixed number of clusters, which does not generalize across applications.
Most clustering schemes group similar patterns into the same cluster by jointly minimizing the inter-cluster similarity and the intra-cluster dissimilarity. Considering the complexity of the high-dimensional ambient data space, an effective way of estimating the number of clusters is to first map the raw samples into an intrinsic correlation space, namely the similarity matrix, followed by an iterative optimization according to the local and global similarity relationships derived from the projection. Elhamifar et al. propose that the permuted similarity matrix can be block-diagonal, where the number of blocks is identical to the number of clusters. Moreover, Peng et al. verify the intra-subspace projection dominance (IPD) of such a similarity matrix, which can be applied to self-representation optimizations with various kinds of regularizations. The IPD theory states that, in a noise-free system, for two arbitrary samples from the same subspace and one from another, the similarity generated between the former two samples is always larger than the similarity between either of them and the latter.
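As a small illustration of the block-diagonal observation, consider the idealized noise-free case in which cross-subspace similarities are exactly zero and every block is connected. The number of blocks then equals the number of connected components of the similarity graph, regardless of how the samples are ordered. The sketch below counts these components in plain Python; the function name `count_blocks` and the tolerance `tol` are illustrative assumptions and are not part of the proposed method.

```python
from collections import deque

def count_blocks(sim, tol=1e-12):
    """Count connected components of the graph induced by |sim[i][j]| > tol."""
    n = len(sim)
    seen = [False] * n
    blocks = 0
    for start in range(n):
        if seen[start]:
            continue
        blocks += 1
        seen[start] = True
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in range(n):
                # Treat the matrix as undirected: an edge exists if either entry is non-zero.
                linked = abs(sim[i][j]) > tol or abs(sim[j][i]) > tol
                if linked and not seen[j]:
                    seen[j] = True
                    queue.append(j)
    return blocks

# Samples 0 and 2 form one group, samples 1, 3 and 4 another (rows are not permuted).
sim = [
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0],
    [0.8, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.4, 0.0],
]
num_blocks = count_blocks(sim)
```

With noisy data the off-block entries are rarely exactly zero, so this count alone is not a reliable estimator, which is the difficulty the remainder of the paper addresses.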
Accordingly, considering an affinity graph derived from the similarity matrix [17, 18], where the vertices denote data samples and the edge weights denote similarities, an automatic sub-graph segmentation can be greedily conducted via the following two steps inspired by density-based algorithms: 1) constructing a proper number of initialized cluster centers by minimizing the weighted sum of all inter-cluster connections and maximizing the intra-cluster ones; 2) merging each remaining sample into an existing cluster by maximizing the weighted connections between the sample and the cluster.
Yet, these greedy iterative schemes also face difficulties. Since the ambient space can be very dense, two points that are close in terms of pairwise distance may not belong to the same subspace, especially for samples near the intersection of two subspaces. Consequently, hypergraphs, in which each edge can connect more than two samples [21, 22], have been proposed to address this problem of traditional pairwise graphs. In this paper, we further introduce a novel data structure termed the triplet relationship to explore the local geometric structure in the projected space with hyper-correlations. Each triplet consists of three points and their correlations, and is considered as a meta-element for clustering. We require that all correlations are large enough, which indicates that the three points are strongly connected according to the IPD property.
In contrast to evaluating similarities using pairwise distances, the proposed triplet relationship demonstrates favorable performance for two reasons. On the one hand, it is more robust when partitioning the samples near the intersection of two subspaces, since the mutual relevance among multiple samples provides complementary information when calculating the local segmentation. On the other hand, the triplet evokes a hyper-similarity by efficiently counting the frequency of intra-triplet samples, which enables a greedy way of calculating the assignments.
Based on the newly defined triplet, we further propose a unified framework termed autoSC to jointly estimate the number of clusters and group the samples by exploring the local density derived from triplet relationships. Specifically, we first calculate the self-representation for each sample via an off-the-shelf optimization scheme, followed by extracting the triplet relationships for all samples. Then, we greedily initialize a proper number of clusters by optimizing a new model selection reward, which is achieved by maximizing the inter-cluster dissimilarity among triplets. Finally, we merge each of the remaining samples into an existing cluster to maximize the intra-cluster similarity by optimizing a new fusion reward. We also fuse groups to avoid over-segmentation.
The main contributions of this paper are summarized as follows:
First, we define a hyper-dimensional triplet relationship which ensures a high relevance and density among three samples to reflect their local similarity. We also validate the effectiveness of triplets and distinguish them from the standard pairwise relation.
Second, we design a unified framework, i.e., autoSC, based on the intrinsic geometrical structures depicted by our triplet relationships. The proposed autoSC can be used for simultaneously estimating the number of clusters and performing subspace clustering in a greedy way.
Extensive experiments on benchmark datasets indicate that our autoSC outperforms the state-of-the-art methods in both effectiveness and efficiency.
This paper is an extended version of our earlier conference paper, whose contributions we enrich in the following five aspects: (1) We add a detailed analysis of the proposed algorithm to distinguish it from comparative methods; for example, we add analysis and experimental validation of the computational complexity. (2) We provide a visualized illustration of the proposed autoSC for clearer presentation. (3) We propose a relaxation termed the neighboring based autoSC (autoSC-N), which directly calculates the neighborhood relationship from the raw data space and is more efficient than autoSC. (4) We conduct experiments evaluating the influence of the parameter (number of preserved neighbors for each sample). (5) We experimentally evaluate our method on a real-world application, i.e., motion segmentation, which also demonstrates the benefits of the proposed method.
II Related Work
Automatically approximating samples in high-dimensional ambient space by a union of low-dimensional linear subspaces is considered to be a crucial task in computer vision [9, 23, 24, 25, 26]. In this section, we review the related contributions in the following three aspects, i.e., self-representation calculation, estimating the number of clusters and hyper-graph clustering.
II-A Calculating Self-Representation
To separate a collection of data samples which are drawn from a high-dimensional space according to the latent low-dimensional structure, traditional self-expressiveness based subspace clustering methods calculate a linear representation for each sample using the remaining samples as a basis set or a dictionary [27, 28]. Subspace clustering assumes that the set of data samples is drawn from a union of multiple subspaces, which can best fit the ambient space. Numerous real applications satisfy this assumption with varying degrees of exactness, e.g., face recognition and motion segmentation.
By solving an optimization problem with a self-representation loss and regularizations, subspace clustering [30, 31] calculates a similarity matrix where each entry indicates the relevance between two samples. Different regularization schemes with various norms of the similarity matrix, e.g., $\ell_1$, $\ell_2$, elastic net or nuclear norm, can explore different intrinsic properties of the neighborhood space. There are mainly three types of regularization terms: sparse-oriented, densely-connected and mixed norms.
Algorithms based on sparse-type norms [10, 35], e.g., the $\ell_0$ and $\ell_1$ norms, eliminate most of the non-zero values in the similarity matrix to ensure that there are no connections between samples from different clusters. Elhamifar and Vidal propose the sparse representation based on $\ell_1$ norm optimization. The obtained similarity matrix recovers a sparse subspace representation but may not satisfy the graph connectivity if the dimension of the subspace is greater than three. In addition, the $\ell_0$ based subspace clustering methods aim to compute a sparse and subspace-preserving representation for each data sample. Yang et al. present a sparse clustering method with a regularizer based on the $\ell_0$ norm by using the proximal gradient descent method. Numerous alternative methods have been proposed for $\ell_0$ minimization while avoiding non-convex problems, e.g., orthogonal matching pursuit and nearest subspace neighbor. The scalable sparse subspace clustering by orthogonal matching pursuit (SSC-OMP) method compares elements in each column of the dot product matrix to determine which positions of the similarity matrix should be non-zero. However, this general pairwise relationship does not reflect the sample correlation well, especially for data pairs in the intersection of two subspaces.
In contrast, dense connection based methods, such as smooth representation with the $\ell_2$ norm and low rank representation with nuclear norm based methods [39, 40], propose to preserve many non-zero values in the similarity matrix to ensure the connectivity among intra-cluster samples [41, 42, 43]. For these densely connected frameworks [44, 45], the similarity matrix is interpreted as a projected representation of raw samples. Each column of the matrix is considered as the self-representation of a sample, and should be dense for mapping invariance (also termed the grouping effect [46, 32]). Low-rank clustering methods [47, 48] solve a nuclear norm based optimization problem with the aim of generating a block diagonal solution with dense connections. However, the nuclear norm does not enforce subset selection well when noise exists, and the self-representation is too dense to be an efficient feature.
Neither a sparse nor a dense similarity matrix reveals a comprehensive correlation structure among samples due to their conflicting nature [49, 50, 51]. Consequently, to achieve a trade-off between sparsity and the grouping effect, numerous mixed norms, e.g., trace Lasso and elastic net, have been integrated into the optimization function. Nevertheless, the structure of the data correlations depends on the data matrix, and the mixed norm is not effective for structure selection. Therefore, these methods do not perform consistently well on different applications.
Recently, many frameworks that incorporate various constraints into the optimization function have been proposed to detect different intrinsic properties of the subspace [34, 53, 54]. For instance, to handle sequential samples, Guo et al. explore the neighboring relationship by incorporating a new penalty, i.e., a lower triangular matrix with $-1$ on the diagonal and $1$ on the second diagonal, to force consecutive columns in the similarity matrix to be closer. In this paper, based on the intrinsic neighboring relevance and geometrical structures depicted in the similarity matrix, we calculate triplet relationships to form a hyper-correlation constraint of the clustering system. We validate the robustness of the proposed triplet relationship on top of different similarity matrices with various intrinsic properties.
II-B Estimating the Number of Clusters
Most real applications in computer vision require estimating the number of clusters according to the latent distribution of data samples. To solve this problem, three main techniques exist: singular value based Laplacian matrix decomposition, density-based greedy assignment and hyper-graph based segmentation.
A heuristic estimator inspired by the block-diagonal structure of the similarity matrix has been proposed. Specifically, the number of clusters is estimated by counting the singular values of a normalized Laplacian matrix that are smaller than a given cut-off threshold. These singular value based methods [59, 9] depend on a large gap between singular values, which limits them to applications in which the subspaces are sparsely distributed in the ambient space. Meanwhile, the matrix decomposition process is time-consuming when extended to large scale problems. Recently, Li et al. propose SCAMS [60, 29], which estimates the number of clusters by minimizing the rank of a binary relationship matrix encoding the pairwise relevance among all data samples. Simultaneously, they incorporate a penalty term on the clustering cost by minimizing the Frobenius inner product of the similarity matrix and the binary relationship matrix.
Density based methods greedily discover both the optimal number of clusters and the assignments of data to the clusters according to the local and global densities, which are calculated from the pairwise distances in the ambient space. Rodriguez et al. automatically cluster samples based on the assumption that each cluster center is characterized by a higher density in the weight space than all its neighbors, while different centers should be far enough apart to avoid redundancy. Specifically, for each sample, its Euclidean-based local density and the distance to any points with higher densities are iteratively calculated and updated. In each iteration, the algorithm finds a trade-off between the density of cluster centers and the inter-cluster distance to update the assignments. Wang et al. employ the Bayesian nonparametric method based on a Dirichlet process, and propose DP-space, which exploits a trade-off between data fitness and model complexity. DP-space is more tolerant of noise and outliers than the alternative algebraic and geometric solutions. Recently, correlation clustering (CC) first constructs an undirected graph with positive and negative edge weights, followed by minimizing the sum of cut weights during the segmenting process. Subsequently, the clustering assignments can be optimized with a greedy scheme. Nevertheless, most of these density based algorithms are limited to pairwise correlations when evaluating the similarity of data samples, which is not robust for densely distributed subspaces.
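To make the density-peak idea attributed to Rodriguez et al. above more concrete, the following sketch computes, for every sample, a local density (the number of samples within a cut-off distance) and the distance to the nearest sample of higher density. Samples for which both quantities are large are candidate cluster centers. This is a simplified illustration under our own assumptions: the function name `density_peaks`, the hard cut-off kernel and the toy points are not taken from the cited work.

```python
import math

def density_peaks(points, cutoff):
    """Return (rho, delta): local densities and distances to denser points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho[i]: number of other points closer than the cut-off distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance by convention.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Two tight groups and one isolated point (toy data, assumed for illustration).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
rho, delta = density_peaks(points, cutoff=0.5)
```

Note that both quantities rely purely on pairwise distances, which is the limitation discussed in the paragraph above.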
II-C Hyper-graph Clustering
To tackle the limitations of the pairwise relation based methods, the hyper-graph relation [63, 64, 65] is proposed and the related literature follows two different directions. Some methods transform the hyper-correlation into a simpler pairwise graph [21, 66], followed by a standard graph clustering method, e.g., normalized cut, to calculate the assignments. Other methods [39, 13] explore a generalized way of extending the pairwise graph to the hyper-graph or hyper-dimensional tensor analysis. For instance, Li et al. propose a tensor affinity variant of SCAMS, i.e., SCAMSTA, which exploits the higher order mathematical structures by providing multiple groups of nodes in the binary matrix derived from an outer product operation on multiple indicator vectors. However, estimating the number of clusters from the rank of the affinity matrix only works well for the ideal case, and can hardly be extended to complex applications since noise can have a significant impact on the rank of the affinity matrix.
In this paper, we estimate the number of clusters by initializing the cluster centers with maximum inter-cluster dissimilarities and maximum local densities. We calculate the initialization according to the local correlations reflected by the proposed triplet relationships, each of which depicts a hyper-dimensional similarity among three samples and easily evaluated relevances to other triplets. Both the theoretical analysis of triplets and the experimental results demonstrate the effectiveness of the proposed method.
The main notations in the manuscript and the corresponding descriptions are shown in Table I. Given a set of data samples lying in subspaces where denotes the dimensionality of each sample, spectral based subspace clustering usually takes a two-step approach to calculate the clustering assignment. First, it learns a self-representation for each sample to disentangle the subspace structure. The algorithm then employs spectral clustering  on the learned similarity graph derived from for final assignments. Note that in practice, both the number of subspaces and their dimensions are always unknown [4, 12]. Hence, the goals of subspace clustering include finding the appropriate and assigning data points into clusters [16, 13].
In this paper, inspired by the block-diagonal structure of the similarity matrix, we propose to simultaneously estimate the number of clusters and assign samples into each cluster in a greedy manner. We design a novel meta-sample that we call a triplet relationship, followed by optimizing both a model selection reward and a fusion reward for clustering.
III-B Learning the Self-Representation
To explore the neighboring relationship in the ambient space, typical subspace clustering methods first optimize a linear representation of each data sample using the remaining dataset as a dictionary. Specifically, spectral based subspace clustering calculates a similarity matrix by solving a self-representation optimization problem as follows:
where denotes the reconstruction loss, is the trade-off parameter and denotes the regularization term where different ’s lead to various norms [4, 10], e.g., [9, 16], [58, 42], , or many kinds of mixed norms like trace Lasso  or elastic net .
The in (1) can be explained as a new representation of , where each sample is mapped to . Furthermore, is a pairwise distance matrix where each entry reflects the similarity between two samples and . Nevertheless, the pairwise distance reflects poor discriminative capacity on partitioning samples near the intersection of two subspaces. To handle this problem, in this paper, we explore a higher-dimensional similarity called Triplet relationship, which is based on a greedy combination of pairwise distances reflected by .
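For orientation, self-representation objectives of the kind described above are commonly written in the generic form below. This is a sketch under assumed notation: the symbols $X$ (data matrix with one sample per column), $C$ (similarity matrix), $\mathcal{L}$ (reconstruction loss), $\lambda$ (trade-off parameter) and $\Omega$ (regularization term) are chosen here for illustration, and the exact form of (1) may differ.

```latex
\min_{C} \; \mathcal{L}(X - XC) + \lambda \, \Omega(C)
```

Under this reading, choosing $\Omega(C) = \|C\|_1$ favors sparse solutions, $\Omega(C) = \|C\|_F^2$ favors dense ones, and the nuclear norm $\Omega(C) = \|C\|_*$ favors low-rank ones; sparse formulations usually add the constraint $\mathrm{diag}(C) = 0$ to exclude the trivial solution in which each sample represents itself.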
III-C Discovering Triplet Relationships
TABLE I: Main notations and their descriptions.
|data sample, a set of samples|
|samples in without|
|a subspace in the ambient space|
|triplet, a set of triplets|
|dimensionality of subspace, dimensionality of sample|
|number of triplets, number of samples|
|real number of clusters, initialized number of cluster centers, estimated number of clusters|
|similarity matrix derived from subspace representation method|
|binary similarity matrix preserving the top values in each row of and modifying them to|
|-th entry of the similarity matrix|
|-th column of|
|set of nearest neighbors of|
|cluster center which contains part of samples in a subspace|
|set of triplets which are already/not assigned into clusters in the -th iteration|
|set of samples which are already/not assigned into clusters in the -th iteration, preserving the frequency|
|model selection reward of|
|fusion reward that being fused into|
|connection score of toward|
|local density of against|
|one of the result groups, set of the result groups|
|NC||deviation rate between the estimated and|
|error rate of the triplets|
Given the similarity matrix where each entry reflects the pairwise relationship between and , we propose to find the neighboring structure in a greedy way. For each data sample , subspace clustering algorithms calculate a projected adjacent space based on the self-expressive property, i.e., each data sample can be reconstructed by a linear combination of other points in the dataset [16, 68]. Therefore, is represented as
where includes samples except for , which is considered as a self-expression dictionary for representation. In addition, records the coefficients of this linear combination. With the regularization from various well-designed norms on , the optimized result of (2) is capable of preserving only linear combinations of samples in while eliminating others. Inspired by [3, 37], for each sample , we first collect its nearest neighbors, i.e., those with the top coefficients in . The nearest neighbors are defined as follows.
Definition 1 (Nearest Neighbors) Let denote the nearest neighbors for data point . Let:
where denotes the set of indices for the nearest neighbors, and denotes the coefficient between and .
According to Definition 1, we obtain for which contains the samples with the largest coefficients in . The number of preserved neighbors, i.e., the parameter , reflects the intrinsic dimension of the low-dimensional subspaces [69, 70], which we empirically evaluated in the experiment section. Based on the nearest neighbors, we define the triplet relationship to explore the local hyper-correlation among samples.
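The neighbor collection of Definition 1 can be sketched as a top-$k$ selection on one column of the coefficient matrix. The code below is a minimal illustration under stated assumptions: the column of the matrix is taken as the representation of the corresponding sample, coefficients are ranked by absolute value, and the names `top_k_neighbors`, `coeffs` and `k` are hypothetical.

```python
def top_k_neighbors(coeffs, i, k):
    """Indices of the k samples with the largest |coefficient| for sample i."""
    # Column i of coeffs is assumed to hold the self-representation of sample i.
    column = [(abs(coeffs[j][i]), j) for j in range(len(coeffs)) if j != i]
    # Sort by decreasing magnitude; ties are broken by the smaller index.
    column.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in column[:k]]

# Toy coefficient matrix (assumed values, for illustration only).
coeffs = [
    [0.0, 0.6, 0.1, 0.0],
    [0.7, 0.0, 0.2, 0.1],
    [0.3, 0.4, 0.0, 0.8],
    [0.0, 0.0, 0.7, 0.0],
]
neighbors_of_0 = top_k_neighbors(coeffs, i=0, k=2)
```

Excluding the sample itself mirrors the fact that the dictionary in (2) contains all samples except the one being represented.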
Definition 2 (Triplet Relationship) A triplet includes three samples, i.e., , and their relationships, if and only if and satisfy:
where denotes the indicator function which equals if and otherwise.
Based on Definition 2, we obtain triplets where we always have , i.e., each sample is included in multiple triplet relationships. For clarity of presentation, we define a triplet matrix for data samples , where each row of records the indices of the samples in a triplet .
Compared with the traditional pairwise relationship evoked from , the triplet incorporates complementary information through the constraint in (4), which makes it more robust in partitioning samples near the intersection of two subspaces. Each triplet depicts a local geometrical structure, which enables a better estimate of the density of each sample. Furthermore, the samples shared by multiple triplets reflect a global hyper-similarity among each other, which can be measured efficiently. Therefore, based on the triplet relationship, we can jointly estimate the number of subspaces and calculate the clustering assignment in a greedy manner.
III-D Modeling Clustering Rewards
Given , we iteratively group data samples into clusters, i.e., , where denotes the estimated number of subspaces. According to the greedy strategy, in the -th iteration, the triplet is divided into two subsets, i.e., “in-cluster” triplets which are already assigned into clusters, and “out-of-cluster” triplets which are still to be assigned in the subsequent iterations. For clearer presentation, we reshape both matrices and to vectors and . In each iteration, we propose to optimize two new rewards, i.e., the model selection reward and the fusion reward, to simultaneously estimate the number of clusters and merge samples into their respective clusters.
Definition 3 (Model Selection Reward) Given and in the -th iteration, the model selection reward for each initialized cluster in is defined as:
where is a counting function on the frequency that for all , denotes the trade-off parameter.
By maximizing the model selection reward , we generate the initialized cluster which has the following two advantages, where is the estimated number of clusters (see Fig. 1 for visualization). Firstly, the local density of sample is high, i.e., has a large amount of correlated samples in , which enables many to be merged in the next iteration. Secondly, each has little correlation with samples in , which eliminates the overlap of any inter-clusters. Consequently, we can simultaneously estimate and initialize the clusters by optimizing the model selection reward .
Definition 4 (Fusion Reward) Given the initialized clusters , the fusion reward is defined as the probability that is assigned into :
where denotes the nearest neighbors of and denotes the set of nearest neighbors of samples in , denotes the trade-off parameters.
In the optimization procedure, we calculate fusion rewards for each , which represent the probabilities that is assigned into clusters , respectively. We then merge into the cluster with the largest fusion reward, and move from to .
III-E Automatic Subspace Clustering Algorithm
The first triplet for initializing a new cluster is chosen to have maximal local density. The local density is defined as follows.
Definition 5 (Local Density) The local density of the triplet with respect to is defined as follows:
where denotes the sample in the current triplet and is the set of their indices, denotes the scale of .
Also, to measure the hyper-similarity between samples and determine the optimal triplet to merge into the initialized clusters, we define the connection score as follows.
Definition 6 (Connection Score) The connection score between samples and is defined as:
where is equal to when and otherwise, is the number of all triplets in .
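One plausible reading of the connection score, based only on the description above (an indicator that is summed over triplets and normalized by the number of triplets), is the fraction of triplets that contain both samples. The sketch below implements this reading; the function name `connection_score` and the exact normalization are assumptions and may differ from the definition given in the paper.

```python
def connection_score(triplets, a, b):
    """Fraction of triplets that contain both sample a and sample b."""
    if not triplets:
        return 0.0
    # The indicator is 1 when both samples appear in the same triplet.
    shared = sum(1 for t in triplets if a in t and b in t)
    return shared / len(triplets)

# Toy triplets given as index tuples (assumed data).
triplets = [(0, 1, 2), (0, 1, 3), (1, 2, 3), (4, 5, 6)]
score_01 = connection_score(triplets, 0, 1)
score_04 = connection_score(triplets, 0, 4)
```

Because the score only counts co-occurrences, it can be accumulated in a single pass over the triplet list.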
We greedily optimize the proposed model selection reward and fusion reward in autoSC to simultaneously estimate the number of clusters and generate the segmentation among samples:
where denotes the set of the result groups, is the estimated number of clusters and denotes the universal ordinal set of samples.
We present the proposed autoSC in Fig. 1 and Algorithm 1. Specifically, the optimization includes three steps: 1) generating the triplet relationships from the similarity matrix; 2) estimating the number of clusters and initializing the clusters; 3) assigning the samples to the proper clusters.
Calculating Triplets: The similarity matrix reflects the correlations among samples , where larger values demonstrate stronger belief for the correlation between samples. For instance, indicates a larger probability for and being in the same cluster over and . Accordingly, we explore the intrinsic local correlations among samples by the proposed triplets derived from .
Many subspace representations guarantee the mapping invariance via a dense similarity matrix . However, the generation of triplets relies only on the strongest connections to avoid the wrong assignment. Therefore, for each column of , i.e., , we preserve only the top values which are then modified to for a new binary similarity matrix .
Then, we extract each triplet from by the following function:
where denotes the -th value of . Note that each sample can appear in many triplets. Therefore, we consider each as a meta-element in the clustering, which improves the robustness due to the complementarity constraints.
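The two operations described in this step, binarizing the similarity matrix and extracting triplets from it, can be sketched as follows. The sketch keeps the top $k$ entries of each column and sets them to one, and then assumes that three samples form a triplet when every pair of them is linked in both directions of the binary matrix. This symmetric condition is our own simplifying assumption standing in for the extraction function above, and the names `binarize_top_k` and `extract_triplets` are hypothetical.

```python
from itertools import combinations

def binarize_top_k(sim, k):
    """Keep the k largest |entries| of every column (diagonal excluded) as 1."""
    n = len(sim)
    binary = [[0] * n for _ in range(n)]
    for col in range(n):
        ranked = sorted((r for r in range(n) if r != col),
                        key=lambda r: (-abs(sim[r][col]), r))
        for r in ranked[:k]:
            binary[r][col] = 1
    return binary

def extract_triplets(binary):
    """Index triples whose members are pairwise linked in both directions."""
    def mutual(a, b):
        return binary[a][b] == 1 and binary[b][a] == 1
    return [(i, j, p) for i, j, p in combinations(range(len(binary)), 3)
            if mutual(i, j) and mutual(j, p) and mutual(i, p)]

# Toy similarity matrix: samples 0-2 are strongly related, as are 3 and 4.
sim = [
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.2, 0.0],
    [0.1, 0.0, 0.2, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
]
binary = binarize_top_k(sim, k=2)
triplets = extract_triplets(binary)
```

The exhaustive enumeration over all triples is cubic in the number of samples and is used here only for readability; restricting the search to the preserved neighbors of each sample avoids this cost.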
Initializing Clusters: In the -th iteration, we first determine an initial triplet (termed as ) from to initialize the cluster , followed by merging the most correlated samples of into each .
Following , we initialize a new cluster using with highest local density:
where calculates the local density defined in Definition 5. The high local density of the triplet reflects the most connections between and other triplets, which produces the most connections between and other samples in .
Once the initialized triplet is determined, we iteratively extend the initialized cluster by fusing the most confident triplets. For each triplet in , we calculate the sum of the connection score regarding the samples in to greedily determine whether the samples in should be assigned into or not:
where denote the sets of indices for the samples in and , respectively. We iteratively update the auxiliary sets , , and in the iterations.
Terminating: We terminate the process of estimating the number of clusters and get clusters if and only if satisfies:
Specifically, if the samples in are of high frequency in , i.e., the triplet with the highest local density in is already contained in , we consider that the clusters are sufficient for modeling the intrinsic subspaces.
Avoiding Over-Segmentation: We also introduce an alternative step to check the redundancy among initialized clusters to avoid over-segmentation. We calculate the connection scores for small-scale clusters against others, and merge the highly correlated clusters and if we have
where denotes the number of samples in . We then get the initialized clusters , where is the estimated number of clusters and .
Assigning the Remaining Samples: Given , we assign each of the remaining samples into the cluster which evokes an optimal fusion reward. For , we find its optimal cluster by the following equation:
where is the fusion reward defined by (6).
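The assignment step can be illustrated with a stand-in reward. Since the fusion reward is described in terms of the nearest neighbors of a sample and the nearest neighbors of the samples in a cluster, the sketch below scores a cluster by how many of the sample's neighbors are members of the cluster, plus a weighted term for how many lie in the cluster's neighborhood. This overlap-based score is an assumption made for illustration and is not the reward defined in (6); the names `fusion_score`, `assign_remaining` and `alpha` are hypothetical.

```python
def fusion_score(sample_neighbors, cluster, cluster_neighbors, alpha=0.5):
    """Stand-in reward: neighbor overlap with a cluster and its neighborhood."""
    if not sample_neighbors:
        return 0.0
    direct = len(sample_neighbors & cluster) / len(sample_neighbors)
    indirect = len(sample_neighbors & cluster_neighbors) / len(sample_neighbors)
    return direct + alpha * indirect

def assign_remaining(neighbors, clusters, remaining, alpha=0.5):
    """Assign each remaining sample to the cluster with the largest score."""
    assignment = {}
    for s in remaining:
        best, best_score = None, -1.0
        for idx, cluster in enumerate(clusters):
            # Union of the neighbor sets of all samples already in the cluster.
            cluster_neighbors = set().union(*(neighbors[m] for m in cluster))
            score = fusion_score(neighbors[s], cluster, cluster_neighbors, alpha)
            if score > best_score:
                best, best_score = idx, score
        assignment[s] = best
    return assignment

# Toy neighbor sets and two initialized clusters (assumed data).
neighbors = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5},
             4: {3, 5}, 5: {3, 4}, 6: {0, 1}, 7: {4, 5}}
clusters = [{0, 1, 2}, {3, 4, 5}]
assignment = assign_remaining(neighbors, clusters, remaining=[6, 7])
```

For simplicity the clusters are kept fixed while the remaining samples are assigned; updating them after each assignment would be a straightforward variation.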
III-F An Extension: Neighboring based AutoSC Algorithm
In Definition 1, we collect nearest neighbors according to the magnitude of similarities between sample and all other samples in . These similarities are depicted in by optimizing (1) which is composed of a reconstruction loss term and a regularization term. In this subsection, we extend the autoSC with an alternative technique to find for each based on greedy search.
For each data sample , we let be the subspace spanned by and its neighbors in the -th iteration, where the neighbor set is initialized as , and . In each iteration, we measure the projected similarity between and other non-neighbor samples by calculating the orthonormal coordinates in the spanned subspace. For example, to calculate the similarity between and in the -th iteration, we have
where denotes the Frobenius norm and . Consequently, for in the -th iteration, we find the closest neighbor and update as follows:
Here, we find one neighbor in each iteration and update the spanned subspace accordingly. The newly spanned subspace reflects more local structure of the ambient space which is assumed to cover the current sample. The neighbor set is also updated by adding the new neighbor which is found in the -th iteration. Finally, with iterations for each sample, we get an alternative nearest neighbor set .
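The greedy neighbor search described above can be sketched in plain Python. In each iteration the sketch projects every non-neighbor sample onto the current spanned subspace, selects the one with the largest projection, and extends an orthonormal basis of the subspace by a Gram-Schmidt step. The use of unit-normalized samples, the squared projection norm as the similarity, and all function names are assumptions made for illustration; samples are assumed to be non-zero vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def greedy_subspace_neighbors(samples, i, k, eps=1e-10):
    """Greedily pick k neighbors of samples[i] via projections onto a growing subspace."""
    unit = [normalize(x) for x in samples]
    basis = [unit[i]]          # orthonormal basis of the current spanned subspace
    neighbors = []
    for _ in range(k):
        best, best_score = None, -1.0
        for j in range(len(unit)):
            if j == i or j in neighbors:
                continue
            # Squared norm of the projection of sample j onto the subspace.
            score = sum(dot(unit[j], b) ** 2 for b in basis)
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        neighbors.append(best)
        # Gram-Schmidt: keep the part of the new neighbor orthogonal to the basis.
        residual = list(unit[best])
        for b in basis:
            c = dot(residual, b)
            residual = [r - c * bb for r, bb in zip(residual, b)]
        if math.sqrt(dot(residual, residual)) > eps:
            basis.append(normalize(residual))
    return neighbors

# Samples 0-2 lie in the x-y plane; samples 3 and 4 are close to the z axis.
samples = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 1.0, 0.0),
           (0.0, 0.0, 1.0), (0.1, 0.0, 0.9)]
found = greedy_subspace_neighbors(samples, i=0, k=2)
```

In this toy example the second neighbor is found only because the subspace has been extended by the first one, which illustrates why the spanned subspace is updated in every iteration.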
Given the neighbor matrix , we propose the neighboring based autoSC algorithm (autoSC-N) to directly discover the triplet relationship among data samples, followed by optimizing both model selection and fusion rewards for clustering. The main steps of autoSC-N are summarized in Algorithm 2.
III-G Computational Complexity Analysis
In a traditional subspace clustering system, the calculation of the self-representation requires solving convex optimization problems over constraints. Spectral clustering is based on an eigen-decomposition of the Laplacian matrix followed by K-means on the eigenvectors, both of which are time-consuming, involving a complex algebraic decomposition and an iterative optimization, respectively [68, 71]. The overall computational complexity can be more than . For the proposed autoSC, it takes to collect the triplet relationships for samples in the space spanned by the nearest neighbors. Here, since we have , the complexity of collecting the triplet relationships is . The optimization of both model selection and fusion rewards takes where the number of triplets has the same order of magnitude as . Specifically, we have , and thus the complexity of clustering is .
For the extension, i.e., autoSC-N, collecting the neighbor matrix takes where the basic operation is a dot product of the -dimensional tensors. This avoids the calculation of any convex optimization problem.
|Clustering||Metrics||extended Yale B||COIL-20|
|LSR ||SCAMS [13, 29]||NC||5.21||14.00||17.12||21.25||23.00||4.36||9.00||18.32||21.00|
|SMR ||SCAMS [13, 29]||NC||9.26||23.60||41.39||76.22||81.00||8.48||19.72||32.40||37.00|
|LSR ||DP ||NC||7.90||98.38||127.92||308.00||341.00||10.90||14.70||301.05||228.00|
|SMR ||DP ||NC||3.06||7.84||14.62||24.76||29.00||2.22||5.30||9.72||11.00|
|LSR ||SVD ||NC||7.00||9.42||21.04||41.23||44.00||2.76||9.00||12.05||14.00|
|SMR ||SVD ||NC||2.40||9.06||11.65||24.00||28.00||0.48||2.58||8.36||12.00|
IV-A Experimental Setup
In the experiments, we compare the automatic methods on the benchmark datasets, i.e., the extended Yale B  and the COIL-20  dataset, followed by verifying the robustness of the proposed method to different derived from various self-representation schemes along with combinations of different methods for estimating the number of clusters and segmenting the samples. We design comprehensive evaluation metrics to validate the clustering performance, i.e., the error rate of the number of clusters and the triplets. For all experiments on subsets, the reported results are the average of trials. We also conduct experiments on a motion segmentation task using the Hopkins 155 dataset.
The extended Yale B  dataset is a widely used face clustering dataset which contains face images with different illumination of subjects; each subject has images.
The COIL-20  dataset consists of different real subjects, including cups, bottles and so on. For each subject, there are images with different camera viewpoints.
The Hopkins 155 dataset  consists of video sequences. For each video sequence, there are or motions.
IV-A2 Comparative Methods
, a singular value decomposition based method (SVD) and DP-space . Besides, we utilize the following subspace representation methods to generate different coefficient matrices : LRR , CASS , LSR , SMR  and ORGEN . The similarity matrix is then used to calculate the triplet relationships for autoSC.
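To make the last step concrete, the following sketch enumerates triplets from a similarity matrix. It rests on a loudly stated assumption: three samples are treated as a triplet when each of them lies among the `k` most similar samples of the other two. This mutual-neighbor reading is an illustration only, and Definition 1 remains the authoritative statement. The function names are hypothetical.

```python
from itertools import combinations

def top_k_neighbors(similarity, k):
    """For each sample, the set of its k most similar other samples."""
    n = len(similarity)
    result = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: similarity[i][j], reverse=True)
        result.append(set(order[:k]))
    return result

def find_triplets(similarity, k):
    """Assumed rule: (i, j, m) is a triplet if all three are mutual top-k neighbors."""
    nbrs = top_k_neighbors(similarity, k)
    triplets = []
    for i, j, m in combinations(range(len(similarity)), 3):
        if (j in nbrs[i] and m in nbrs[i] and
                i in nbrs[j] and m in nbrs[j] and
                i in nbrs[m] and j in nbrs[m]):
            triplets.append((i, j, m))
    return triplets

# Toy block-diagonal similarity: samples 0-2 and 3-5 form two clusters.
S = [[0.9 if (a < 3) == (b < 3) else 0.1 for b in range(6)] for a in range(6)]
print(find_triplets(S, 2))
```

The brute-force loop over all index triples is cubic in the number of samples and is meant for illustration only. Restricting the search to each sample's neighbor set avoids enumerating all triples.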
IV-A3 Evaluation Metrics
To evaluate the performance of the proposed triplets, we define the error rate as follows:
where denotes the number of the triplets and is the counting function on the frequency that for all . Here the output of ranges from to . The dynamic set consists of samples in one subspace according to the ground truth, where contains as many samples in as possible.
We introduce the error rate of the number of clusters (NC) as the primary evaluation metric for the clustering methods which estimate the number of clusters automatically:
where is the real number of clusters, is the number of trials and is the estimated number of clusters in the -th trial. We also use the standard normalized mutual information (NMI)  to measure the similarity between two clustering distributions, i.e., the prediction and the ground truth. With respect to NMI, the entropy illustrates the nondeterminacy of one clustering to the other, and the mutual information quantifies the amount of information that one variable obtains from the other.
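The two clustering metrics can be illustrated with a short Python sketch. Two assumptions are made here and are not taken from the text. First, NC is computed as the mean absolute deviation between the estimated and the real number of clusters over the trials. Second, NMI is normalized by the geometric mean of the two entropies, which is one of several common conventions. The function names are hypothetical.

```python
import math
from collections import Counter

def nc_error(estimated_counts, true_count):
    """Assumed form of NC: mean absolute deviation over all trials."""
    return sum(abs(k - true_count) for k in estimated_counts) / len(estimated_counts)

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    joint = Counter(zip(a, b))
    pa, pb = Counter(a), Counter(b)
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())

def nmi(pred, truth):
    """NMI with geometric-mean normalization (an assumed convention)."""
    h_p, h_t = entropy(pred), entropy(truth)
    if h_p == 0.0 or h_t == 0.0:
        return 1.0 if h_p == h_t else 0.0
    return mutual_information(pred, truth) / math.sqrt(h_p * h_t)

truth = [0, 0, 1, 1]
print(nc_error([3, 5, 4], 4))      # three trials around a true count of 4
print(nmi([1, 1, 0, 0], truth))    # same partition, permuted label names
print(nmi([0, 1, 2, 3], truth))    # every sample in its own cluster
```

Under these assumptions, the last call returns about 0.71 even though four clusters are predicted for two true ones, while the same prediction would contribute an absolute deviation of 2 to NC.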
The parameter in Definition 1, i.e., the number of preserved neighbors for each sample, is related to the intrinsic dimension of the subspaces. We empirically evaluate the influence of on both the extended Yale B and COIL-20 datasets with 15 subjects, using subspace representations derived from SMR. The results are shown in Table III. The proposed method achieves the best performance when we have for most cases. Moreover, the method is robust to this parameter, since the performance is stable when .
|extended Yale B||COIL-20|
IV-B Comparisons among Automatic Clustering
We conduct experiments on the extended Yale B and COIL-20 datasets with different numbers of subjects, and compare four methods with the proposed autoSC and autoSC-N on the metrics of NC and NMI. For SCAMS [13, 29], DP , SVD  and our autoSC, the optimization module in SMR  is employed to generate the similarity matrix . The DP-space method simultaneously estimates and finds the subspaces without requiring a similarity matrix. All parameters of the compared methods are tuned to provide the best performance.
Fig. 2 and Table II report the performance. As shown in Table II, when combined with SMR, the averaged NC of autoSC is smaller than that of the other comparative methods on all experimental configurations, indicating that it gives a close estimate of the number of clusters. For example, the estimated on the extended Yale B with subjects has a deviation of less than , and produces an NMI higher than . The autoSC-N achieves the second best performance on most configurations, which demonstrates the effectiveness of both the triplet relationship and the reward optimization. In contrast, SVD achieves comparable results on the small-scale configuration of each dataset, but its performance becomes poor when the number of samples increases. This is mainly because the largest gap between a pair of singular values decreases as the number of clusters grows. When combined with SMR, SCAMS performs comparably according to NMI on both datasets. However, as illustrated in Fig. 2 (a) and Table II, it provides a much larger than the ground truth, e.g., when on the extended Yale B dataset. NMI does not strongly penalize over-segmentation, which makes NC the primary metric for evaluating the SCAMS method. The DP-space performs well on NC, but has poor performance on NMI. This is because most samples are assigned to one cluster, while the other clusters are small. In addition, when combined with LSR, as shown in Table II, the performance of all methods decreases, while the proposed autoSC still achieves the best performance on most configurations. This demonstrates the generalization ability of our autoSC.
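The sensitivity of the SVD baseline can be illustrated with a generic largest-gap estimator. This sketch assumes that the baseline picks the position of the largest gap between consecutive sorted singular values. The matrix that is decomposed is not specified in this section, so the function simply takes the singular values as input. The name `estimate_num_clusters` and the toy values are hypothetical.

```python
def estimate_num_clusters(singular_values):
    """Return (estimated count, gap size) from the largest consecutive gap."""
    s = sorted(singular_values, reverse=True)
    gaps = [s[i] - s[i + 1] for i in range(len(s) - 1)]
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    # Values before the largest gap are treated as the dominant ones.
    return largest + 1, gaps[largest]

# Toy spectrum with three dominant values followed by small ones.
k, gap = estimate_num_clusters([5.0, 4.8, 4.5, 0.3, 0.2, 0.1])
print(k, gap)
```

When the dominant gap shrinks toward the size of the other gaps, small perturbations of the spectrum can move the arg-max, which is consistent with the degradation described above for larger numbers of clusters.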
IV-C Robustness to Self-Representation Schemes
The methods including SCAMS [13, 29], DP , SVD  and the proposed autoSC require the similarity matrix as input. Also, for DP , the distance among samples needs to be calculated. We calculate the distance between samples and by rather than the simple Euclidean distance. To verify the robustness of the proposed autoSC regarding various subspace representations, we calculate the similarity matrix using subspace representation modules, followed by the combinations with the methods which automatically estimate the number of clusters and segment the samples.
Table IV shows the evaluation results of on both datasets with the combinations of subspace representations, while the NC and NMI on the extended Yale B dataset with subjects are reported in Fig. 3. Moreover, we visualize the similarity matrix derived from subspace representation modules in Fig. 4. We can see from Fig. 3 that the SCAMS, DP and SVD methods are sensitive to the choice of the subspace representation module. For example, DP estimates as a relatively close value to the ground truth when combined with SMR (NC), but generates a totally wrong estimation when combined with LRR (NC). Different subspace representation modules generate coefficient matrices with various intrinsic properties , thus the parameter for truncation error needs to be tuned carefully.
The proposed autoSC is stable across the different combinations with respect to the metrics NC and , which demonstrates the complementary ability of the proposed method. For all combinations, the error rate of the triplets obtained from (10) is less than , which guarantees the consistency of the proposed autoSC with different kinds of . Furthermore, it shows better performance when combined with CASS, LSR and SMR than with the other combinations on both metrics in Fig. 3. The reason lies in the guarantee of the mapping invariance, which is termed the grouping effect [52, 46, 32], together with the filtering of weak connections and the self-constraint among samples within triplets. As shown in Fig. 4 (b), (c), (d), the coefficient matrices are dense, while the matrix in Fig. 4 (d) shows a block-diagonal structure in which each block corresponds to one cluster. Therefore, the nearest neighbors which are used to generate the triplets can be chosen precisely. The performance decreases when combined with ORGEN since the similarity matrix derived from ORGEN is sparse, with fewer entries for constructing effective triplets.
IV-D Time Efficiency
Table V shows the run-time of the comparative methods using subsets from the extended Yale B dataset. The experiments are conducted on a machine with a GHz CPU and GB RAM. AutoSC-N requires the least run-time among all comparative methods for two reasons. First, autoSC-N explores the neighborhood relationship in the raw data space rather than solving a convex optimization problem. Second, it employs a greedy optimization scheme to estimate the number of clusters and calculate the clustering assignment rather than a complex optimization method such as computing the singular value decomposition. Note that the proposed autoSC method achieves the second best result among the comparative methods.
IV-E Real Application: Motion Segmentation
Motion segmentation refers to the task of segmenting a video sequence that contains multiple motions. The candidate video is composed of multiple foreground objects, which are rapidly moving and need to be clustered into spatiotemporal regions corresponding to specific motions. Following the traditional scheme , we consider the Hopkins 155 dataset  and solve the motion segmentation problem by first extracting a set of feature points for each frame and then clustering them based on their motions. Table VI reports the comparison against four automatic clustering methods. For SCAMS [13, 29], DP  and SVD , SMR  is first applied to calculate the similarity matrix. As shown in the table, the proposed autoSC achieves the best performance on both metrics, indicating that autoSC is effective at both estimating the number of motions (about error rate) and segmenting the feature points (obtaining an NMI of more than ). In addition, it shows favorable efficiency on the motion segmentation task. The autoSC-N is the most efficient method ( per sequence) with the second best performance on NC and NMI. The SVD method obtains the best result among the other comparative methods, but it consumes much more time (more than per sequence) due to the singular value decomposition process.
In this paper, we propose a joint model to estimate the number of clusters and segment the samples in a data set. Based on the self-representation of the dataset, we first design a hyper-correlation oriented meta-element termed the triplet relationship, which indicates a compact local structure among three samples. The triplet is more robust than pairwise relationships when partitioning samples near the intersection of two subspaces, owing to the complementarity of mutual restrictions. Accordingly, we propose the autoSC method to optimize two reward functions simultaneously, of which the model selection reward constrains the number of clusters and the fusion reward facilitates the clustering assignment of the samples. Both functions are greedily maximized during the clustering process. In addition, we provide an extension of autoSC which automatically calculates the neighboring relationship in the raw data space rather than in a similarity space spanned by self-representation. Experimental results on face clustering, synthetic dataset clustering and motion segmentation tasks demonstrate the effectiveness and efficiency of our approaches.
-  J. Yang, J. Liang, K. Wang, Y.-L. Yang, and M.-M. Cheng, “Automatic model selection in subspace clustering via triplet relationships,” in AAAI Conference on Artificial Intelligence, 2018.
-  X. Wang, X. Guo, Z. Lei, C. Zhang, and S. Z. Li, “Exclusivity-consistency regularized multi-view subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 923–931.
-  X. Peng, Z. Yu, Z. Yi, and H. Tang, “Constructing the L2-graph for robust subspace learning and subspace clustering,” IEEE Transactions on Cybernetics, vol. 47, no. 4, pp. 1053–1066, 2017.
-  R. Vidal, “Subspace clustering,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 52–68, 2011.
-  H. Jia and Y.-M. Cheung, “Subspace clustering of categorical and numerical data with an unknown number of clusters,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3308–3325, 2018.
-  X. Xu, Z. Huang, D. Graves, and W. Pedrycz, “A clustering-based graph laplacian framework for value function approximation in reinforcement learning,” IEEE Transactions on Cybernetics, vol. 44, no. 12, pp. 2613–2625, 2014.
-  P. Zhu, W. Zhu, Q. Hu, C. Zhang, and W. Zuo, “Subspace clustering guided unsupervised feature selection,” Pattern Recognition, vol. 66, pp. 364–374, 2017.
-  X. Cao, C. Zhang, C. Zhou, H. Fu, and H. Foroosh, “Constrained multi-view video face clustering,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4381–4393, 2015.
-  E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2790–2797.
-  Y. Yang, J. Feng, N. Jojic, J. Yang, and T. S. Huang, “ℓ0-sparse subspace clustering,” in European Conference on Computer Vision, 2016, pp. 731–747.
-  J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
-  Y. Wang and J. Zhu, “DP-space: Bayesian nonparametric subspace clustering with small-variance asymptotics,” in International Conference on Machine Learning, 2015, pp. 862–870.
-  Z. Li, S. Yang, L. F. Cheong, and K. C. Toh, “Simultaneous clustering and model selection for tensor affinities,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5347–5355.
-  S. Javed, A. Mahmood, T. Bouwmans, and S. K. Jung, “Background-foreground modeling based on spatiotemporal sparse subspace clustering,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5840–5854, 2017.
-  D. Kumar, J. C. Bezdek, M. Palaniswami, S. Rajasegarar, C. Leckie, and T. C. Havens, “A hybrid approach to clustering in big data,” IEEE Transactions on Cybernetics, vol. 46, no. 10, pp. 2372–2385, 2016.
-  E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
-  B. Nasihatkon and R. Hartley, “Graph connectivity in sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2137–2144.
-  K. Zhan, C. Zhang, J. Guan, and J. Wang, “Graph learning for multiview clustering,” IEEE Transactions on Cybernetics, no. 99, pp. 1–9, 2017.
-  A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
-  X. Peng, L. Zhang, and Z. Yi, “Scalable sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 430–437.
-  S. Gao, I. W. Tsang, and L. T. Chia, “Laplacian sparse coding, hypergraph laplacian sparse coding, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 92–104, 2013.
-  S. Kim, D. Y. Chang, S. Nowozin, and P. Kohli, “Image segmentation using higher-order correlation clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 9, pp. 1761–1774, 2014.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in International Conference on Computer Vision, 2015, pp. 815–823.
-  H. Wang, T. Li, T. Li, and Y. Yang, “Constraint neighborhood projections for semi-supervised clustering,” IEEE Transactions on Cybernetics, vol. 44, no. 5, pp. 636–643, 2014.
-  C.-G. Li, C. You, and R. Vidal, “Structured sparse subspace clustering: A joint affinity learning and subspace clustering framework,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2988–3001, 2017.
-  C. Zhang, H. Fu, Q. Hu, P. Zhu, and X. Cao, “Flexible multi-view dimensionality co-reduction,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 648–659, 2017.
-  C. You, D. Robinson, and R. Vidal, “Scalable sparse subspace clustering by orthogonal matching pursuit,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3918–3927.
-  Y. Cheng, Y. Wang, M. Sznaier, and O. Camps, “Subspace clustering with priors via sparse quadratically constrained quadratic programming,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5204–5212.
-  Z. Li, L.-F. Cheong, S. Yang, and K.-C. Toh, “Simultaneous clustering and model selection: Algorithm, theory and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1964–1978, 2018.
-  F. Wu, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Ordered subspace clustering with block-diagonal priors,” IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 3209–3219, 2016.
-  C. G. Li and R. Vidal, “Structured sparse subspace clustering: A unified optimization framework,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 277–286.
-  H. Hu, Z. Lin, J. Feng, and J. Zhou, “Smooth representation clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3834–3841.
-  C. You, C. G. Li, D. P. Robinson, and R. Vidal, “Oracle based active set algorithm for scalable elastic net subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3928–3937.
-  X. Fang, Y. Xu, X. Li, Z. Lai, and W. K. Wong, “Robust semi-supervised subspace clustering via non-negative low-rank representation,” IEEE Transactions on Cybernetics, vol. 46, no. 8, pp. 1828–1838, 2016.
-  M. Rahmani and G. Atia, “Innovation pursuit: A new approach to the subspace clustering problem,” in International Conference on Machine Learning, 2017, pp. 2874–2882.
-  E.-L. Dyer, A.-C. Sankaranarayanan, and R.-G. Baraniuk, “Greedy feature selection for subspace clustering,” Journal of Machine Learning Research, vol. 14, no. 1, pp. 2487–2517, 2013.
-  D. Park, C. Caramanis, and S. Sanghavi, “Greedy subspace clustering,” in Advances in Neural Information Processing Systems, 2014, pp. 2753–2761.
-  P. Purkait, T.-J. Chin, A. Sadri, and D. Suter, “Clustering with hypergraphs: the case for large hyperedges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1697–1711, 2017.
-  H. Liu, L. Latecki, and S. Yan, “Robust clustering as ensembles of affinity relations,” in Advances in Neural Information Processing Systems, 2010, pp. 1414–1422.
-  C.-G. Li and R. Vidal, “A structured sparse plus structured low-rank framework for subspace clustering and completion,” IEEE Transactions on Signal Processing, vol. 64, no. 24, pp. 6557–6570, 2016.
-  Y. Guo, J. Gao, and F. Li, “Spatial subspace clustering for drill hole spectral data,” Journal of Applied Remote Sensing, vol. 8, no. 1, p. 083644, 2014.
-  J. Feng, Z. Lin, H. Xu, and S. Yan, “Robust subspace segmentation with block-diagonal prior,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3818–3825.
-  B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin, “Product Grassmann manifold representation and its LRR models,” in AAAI Conference on Artificial Intelligence, 2016, pp. 2122–2129.
-  B. Liu, X.-T. Yuan, Y. Yu, Q. Liu, and D.-N. Metaxas, “Decentralized robust subspace clustering,” in AAAI Conference on Artificial Intelligence, 2016, pp. 3539–3545.
-  S. Xiao, W. Li, D. Xu, and D. Tao, “FaLRR: A fast low rank representation solver,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4612–4620.
-  C. Y. Lu, H. Min, Z. Q. Zhao, L. Zhu, D. S. Huang, and S. Yan, “Robust and efficient subspace segmentation via least squares regression,” in European Conference on Computer Vision, 2012, pp. 347–360.
-  G. Liu, H. Xu, J. Tang, Q. Liu, and S. Yan, “A deterministic analysis for LRR,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 417–430, 2016.
-  C. Xu, Z. Lin, and H. Zha, “A unified convex surrogate for the Schatten-p norm,” in AAAI Conference on Artificial Intelligence, 2017, pp. 926–932.
-  Y.-X. Wang, H. Xu, and C. Leng, “Provable subspace clustering: When LRR meets SSC,” in Advances in Neural Information Processing Systems, 2013, pp. 64–72.
-  H. Lai, Y. Pan, C. Lu, Y. Tang, and S. Yan, “Efficient k-Support matrix pursuit,” in European Conference on Computer Vision, 2014, pp. 617–631.
-  E. Kim, M. Lee, and S. Oh, “Robust Elastic-Net subspace representation,” IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4245–4259, 2016.
-  C. Lu, J. Feng, Z. Lin, and S. Yan, “Correlation adaptive subspace segmentation by Trace Lasso,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1345–1352.
-  J. Xu, K. Xu, K. Chen, and J. Ruan, “Reweighted sparse subspace clustering,” Computer Vision and Image Understanding, vol. 138, pp. 25–37, 2015.
-  A. A. Abin, “Querying beneficial constraints before clustering using facility location analysis,” IEEE Transactions on Cybernetics, vol. 48, no. 1, pp. 312–323, 2018.
-  Y. Guo, J. Gao, and F. Li, “Spatial subspace clustering for hyperspectral data segmentation,” in International Conference on Digital Information Processing and Communications, 2013, pp. 180–190.
-  J. Wang, X. Wang, F. Tian, C. H. Liu, and H. Yu, “Constrained low-rank representation for robust subspace clustering,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4534–4546, 2017.
-  Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, “Multitask spectral clustering by exploring intertask correlation,” IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 1083–1094, 2015.
-  G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 171–184, 2013.
-  P. Favaro, R. Vidal, and A. Ravichandran, “A closed form solution to robust subspace estimation and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1801–1807.
-  Z. Li, L. F. Cheong, and S. Z. Zhou, “SCAMS: Simultaneous clustering and model selection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 264–271.
-  M. Ester, H. P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
-  T. Beier, F. A. Hamprecht, and J. H. Kappes, “Fusion moves for correlation clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3507–3516.
-  C. Lu, J. Feng, Y. Chen, W. Liu, Z. Lin, and S. Yan, “Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2080–2088.
-  P. Purkait, T. J. Chin, H. Ackermann, and D. Suter, “Clustering with hypergraphs: The case for large hyperedges,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1697–1711, 2017.
-  X. Li, G. Cui, and Y. Dong, “Graph regularized non-negative low-rank matrix factorization for image clustering,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3840–3853, 2017.
-  B. Schölkopf, J. Platt, and T. Hofmann, “Learning with hypergraphs: Clustering, classification, and embedding,” in Advances in Neural Information Processing Systems, 2006, pp. 1601–1608.
-  M. Lee, J. Lee, H. Lee, and N. Kwak, “Membership representation for detecting block-diagonal structure in low-rank or sparse subspace clustering,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1648–1656.
-  M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Advances in Neural Information Processing Systems, 2002, pp. 585–591.
-  E. Elhamifar and R. Vidal, “Sparse manifold clustering and embedding,” in Advances in Neural Information Processing Systems, 2011, pp. 55–63.
-  C. Li, J. Guo, and H. Zhang, “Learning bundle manifold by double neighborhood graphs,” in Asian Conference on Computer Vision, 2009, pp. 321–330.
-  U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.
-  A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
-  S. A. Nene, S. K. Nayar, and H. Murase, “Columbia object image library (COIL-20),” Columbia University, Tech. Rep. CUCS-005-96, 1996.
-  R. Tron and R. Vidal, “A benchmark for the comparison of 3-D motion segmentation algorithms,” in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
-  M. Li, X. Chen, X. Li, and B. Ma, “Clustering by compression,” in IEEE International Symposium on Information Theory, vol. 51, no. 4, 2003, pp. 1523–1545.