In many computer vision applications, such as face recognition[2, 9], texture recognition  and motion segmentation [5, 7], visual data can be well characterized by subspaces. Moreover, the intrinsic dimension of high-dimensional data is often much smaller than the ambient dimension . This has motivated the development of subspace clustering techniques which simultaneously cluster the data into multiple subspaces and also locate a low-dimensional subspace for each class of data.
Many subspace clustering algorithms have been developed during the past decade, including algebraic [4, 24], iterative [1, 3], statistical [18, 20], and spectral clustering methods [5, 9, 11, 10, 12, 13, 14, 15, 16, 28]. Among these approaches, spectral clustering methods have been intensively studied due to their simplicity, theoretical soundness, and empirical success. These methods are based on the self-expressiveness property of data lying in a union of subspaces. This states that each point in a subspace can be written as a linear combination of the remaining data points in that subspace. Two typical methods falling into this category are sparse subspace clustering (SSC)  and low-rank representation (LRR) . SSC uses the norm to encourage the sparsity of the self-representation coefficient matrix. LRR uses nuclear norm minimization to make the coefficient matrix low-rank.
Motivated by SSC and LRR, some self-representation based methods have been developed, which use different regularization terms on the coefficient matrix. For example, least squares regression (LSR)  uses regularization on the coefficient matrix. Correlation adaptive subspace segmentation (CASS)  uses a mixture of and regularization. Low-rank sparse subspace clustering (LRSSC)  and non-negative low-rank sparse (NNLRS)  construct regularization term as a blend of
and the nuclear norms. Because the nuclear norm does not achieve the accuracy in estimating the rank of real world data, subspace clustering with log-determinant approximation (SCLA)
replaces the nuclear norm used in LRR by non-convex rank approximations. Feature selection embedded subspace clustering (FSC) reveals that not all features are equally important in the recovery of the low-dimensional subspaces. With feature selection both nuclear norm and non-convex rank approximations may give enhanced performance. Latent space sparse subspace clustering (LS3C)  seeks a linear projection of the data and learns a sparse representation in the projected latent low-dimensional space.
Despite the fact that SSC, LRR and their variants have achieved encouraging results in practice, they have critical limitations. In these approaches, the key idea is to learn a coefficient matrix which denotes the correlation between the data points. As the size of the coefficient matrix is for data points, the SVD decomposition operation for solving the coefficient matrix has computational complexity of . This is time consuming when the size of the data is large, so the efficiency of these approaches can not be guaranteed. Experiments in  and also in this paper show that some existing methods need to run for several hours on a normal computer when the number of test data reaches , which constrains the feasibility of these methods.
To overcome this limitation, we propose an efficient subspace clustering model based on the Kronecker product which achieves a significant reduction of computational complexity over quadratic . Using the fact that each data point in a subspace can be written as a linear combination of all other points in that subspace, we can obtain points lying in the same subspace by learning the sparsest combination. Hence, in our model, we first learn a self-representation coefficient matrix formed by the Kronecker product of a series of small sparse matrices. Then we can constract a similarity matrix based on the coefficient matrix. Finally, a segmentation of the data can be obtained by spectral clustering on the similarity matrix.
The main contributions of this paper are as follows:
We propose an efficient subspace clustering model based on the Kronecker product. Our model uses the Kronecker product of a set of small matrices to build the self-representation coefficient matrix, which leads to a significant reduction of space and computational complexity.
Experimental results on large scale synthetic data and real world public datasets show that our method leads to a significant improvement in the clustering efficiency compared with the state-of-the-art methods while also achieving competitive accuracy.
2 Related Work
In this section, we review some classical and state-of-the-art methods for subspace clustering.
2.1 Sparse Subspace Clustering (SSC)
Given a data matrix that contains data points drawn from subspaces . SSC  aims to find a sparse representation matrix showing the mutual similarity of the points, i.e., . Since each point in can be expressed in terms of the other points in , such a sparse representation matrix always exists. The SSC algorithm finds by solving the following optimization problem:
where eliminates the trivial solution.
2.2 Low-Rank Representation (LRR)
As pointed out in 
, SSC finds the sparsest representation of each data vector individually. There is no global constraint on its solution, so the SSC method may be inaccurate at capturing the global structures of data. Liuet al.  proposed that low rank can be a more appropriate criterion. Similar to SSC, LRR aims to find a low-rank representation of by solving the following optimization problem, since the nuclear norm is the best convex approximation of over the unit ball of matrices:
is the sum of the singular values of.
2.3 Thresholding Ridge Regression (TRR)
The SSC and LRR methods solve the robust subspace clustering problem by removing the errors from the original data space and obtaining a good affinity matrix based on a clean dataset. Thus they need prior knowledge of the structure of the errors, which usually is unknown in practice. Penget al. 
proposed a robust subspace clustering method which overcomes this limitation by eliminating the effect of errors from the projection space with a model based on thresholding ridge regression (TRR):
where is a balancing parameter and small values in are truncated to zero by thresholding.
Based on TRR, a 2D nonlinear variance regularized ridge regression (NVR3) was proposed to directly use 2D data, and thus the spatial information is maximally retained.
Each of these related works learns the coefficient matrix with computational complexity . This has limited the scalability of these methods on large scale datasets. Due to the effectiveness of the Kronecker product in reducing the computational complexity of matrix operations, we present a Kronecker product based subspace clustering model which can significantly improve the efficiency of the existing methods.
3 Kronecker Product Based Model
In this section, we describe our subspace clustering model based on the Kronecker product and develop an associated optimization scheme.
We first introduce the Kronecker product. Let , , the Kronecker product of matrices and is which is defined as:
where is the element at the -th row and -th column of .
3.1 Problem Statement and Formulation
Let be a collection of data points drawn from different subspaces. The goal of subspace clustering is to find the segmentation of the points according to the subspaces. Based on the self-expressiveness property of data lying in a union of subspaces, i.e., each point in a subspace can be written as a linear combination of the remaining points in that subspace, we can obtain points lying in the same subspace by learning the sparsest combination. Therefore, we need to learn a sparse self-representation coefficient matrix , where , and if the -th and -th data points are from different subspaces.
As our model aims to reduce the computational complexity with data size , we rewrite as , where denotes matrix transpose and is the -th dimension of the data points. Without loss of generality, we assume that the self-representation matrix is formed by the Kronecker product of two smaller matrices and , where and , where and . Here we use the important property that the Kronecker product of a block diagonal matrix with any other matrix is still a block diagonal matrix (as shown in Figure 1). We follow  to minimize the loss of self-representation. The optimization problem can be written as:
where is a balancing parameter, and is the Frobenius norm.
We solve problem (4) by updating each small matrix at a time, while keeping the other one fixed. Considering updating , while fixed, we start by rewriting as:
Since is a constant, let
then, the problem that minimizing is equivalent to minimizing .
According to the block property of Kronecker product :
where and forms a vector by column-wise stacking of the matrix into a vector, and reshapes a dimensional vector to a matrix by extracting column from the vector . Then
Let , . Then, using the property of trace that and ,
Since is a constant, the optimization objective function of can be written as:
where , . Eq. (8) is a well known ridge regression problem  whose optimal solution is . We can solve in a similar manner to , when is fixed. As , and , the computational complexity for this solution is .
When the number of small matrices is , we can also solve it by updating one small matrix at a time, while keeping the remaining matrices fixed. In this situation, the problem is the same as solved above. As , , then the computational complexity of the whole optimization is .
We have obtained the optimal solution of self-representation coefficient matrix , where if the -th and -th data points are from different subspaces. Hence, the affinity matrix can be defined as , where denotes the absolute value matrix of . Then the segmentation of the data in different subspaces can be obtained by applying a spectral clustering algorithm to the affinity matrix . The whole Kronecker product based subspace clustering model is summarized in Algorithm 1.
4 Theoretical Analysis
In this section, we give a theoretical analysis of our Kronecker product based model, including a) the adaptivity on different regularizations, b) theoretical convergence analysis, c) complexity analysis.
4.1 Adaptivity on Different Regularizations
Since many self-representation based methods use different regularizations on the coefficient matrix, we show that our model can be applied to a variety of different regularizations. We refer to our subspace clustering method described in Section 3 as KrTRR (Kronecker product based TRR). It utilizes the Frobenius norm to regularize the coefficient matrix. In Eq. (8), we simplify the sparsity constraint from to , using the Kronecker product lemma:
Let , then .
Assume is the -th column -th row element in , , , . Then .
Here we introduce two additional Kronecker product lemmas to show that our model can be applied to alternative regularizations.
Let , then .
Assume is the -th column -th row element in , , , . Then .
Let , then .
Assume the SVD decompositions of and are and , respectively. Then is the sum of nonzero entries in the diagonal matrix , is the sum of nonzero entries in the diagonal matrix . . Because is a diagonal matrix, then the SVD decomposition of is . So that is the sum of nonzero entries in the diagonal matrix which is the product of the sum of nonzero entries in the diagonal matrix and . Then .
Based on these two lemmas, the norm and nuclear norm regularizations on the coefficient matrix , can be simplified to and as shown in Eq. (8). So we can also utilize the norm and nuclear norm on the self-representation coefficient matrix with a manner similar to SSC and LRR, i.e.
We refer to these two methods as KrSSC and KrLRR. Following , we can preprocess the data by 2DPCA  to retain the spatial information in the 2D data. Then we can use the KrTRR method to learn the coefficient matrix as done in . We refer to this method as KrNVR3. The optimization of these variants of the Kronecker product based method are essentially the same as KrTRR.
In summary, we can leverage the Kronecker product to reduce the computational complexity of learning the coefficient matrix with different regularization options, e.g. Frobenius norm, norm and nuclear norm. We present four methods KrSSC, KrLRR, KrTRR and KrNVR3 based on different regularizations and compare them with baseline methods in Section 5.
4.2 Theoretical Convergence Analysis
Here, we prove the reliability of Kronecker product approximation using a theoretical convergence analysis.
According to the idea of mathematical induction, we consider the special condition that to approximate a matrix by , where is a matrix. The matrix is partitioned into matrices with dimension , i.e.
Then, we can denote the approximate loss function by:
where . Since
Let be the vector with non-duplicate elements of and , here is the duplication matrix. Then, the first differential of is
The first derivative is
Then, we obtain the first-order condition
This is an eigenvalue problem in terms of . The vector minimizing Eq. (11) must be proportional to the eigenvector corresponding to the largest eigenvalue of . In other words, for an arbitrary matrix with any dimension, we can partition it based on the dimensions of small matrices needed to approximate the large matrix via Kronecker product. moreover, the small matrices always have a convergent solution through the largest eigenvector of the partitioned large matrix. This means that the technique used to approximate the large self-representation matrix by the Kronecker product of small matrices in our model is reliable.
4.3 Complexity Analysis
Here we discuss the space memory requirement and computational complexity of our Kronecker product based methods and compare it to the relevant methods in the literature. When the data size is , methods in [5, 9, 17, 14] need to solve the self-representation coefficient matrix with the dimension , i.e., the memory space complexity of these methods is . But in our work, we leverage the Kronecker product of a set of small matrices to approximate the self-representation coefficient matrix . When the number of small matrices is , the size of small matrices is . Thus, the space complexity of our methods is .
For learning process the self-representation coefficient matrix with size , existing methods use a SVD decomposition operation whose computational complexity is . As our methods update one small matrix at a time, and the size of the small matrix is , we achieve computational complexity. Since when , there is significant reduction in both the memory space and computational complexity compared with the existing methods. This efficiency gain is achieved by using the Kronecker product.
We have conducted three sets of experiments on both real and synthetic datasets to verify the effectiveness of the proposed methods. Several state-of-the-art or classical spectral subspace clustering methods were taken as the baseline algorithms. These included sparse subspace clustering (SSC) , low-rank representation (LRR) , thresholding ridge regression (TRR) , and nonlinear variance regularized ridge regression (NVR3) . In the experiments, we used the codes provided by the respective authors for computing the self-representation matrix , where the parameters were tuned to give the best clustering accuracy. Then we applied the normalized spectral clustering in  to the affinity matrix .
Evaluation criteria: we used both the clustering accuracy and running time of the whole clustering process to evaluate the performance of the subspace clustering methods, where the clustering accuracy is calculated as
In all our experiments, the clustering accuracy and running time were averaged over 10 trials. All experiments were implemented with MATLAB code and ran on a PC with Intel Core-i7 3.6GHz CPU, 32GB RAM.
|No. Objects||5 Objects||10 Objects||20 Objects||40 Objects||60 Objects|
5.1 Face Clustering
As subspaces are commonly used to capture the appearance of faces under varying illuminations, we test the performance of our method on face clustering with the CMU PIE database . The CMU PIE database contains 41,368 images of 68 people under 13 different poses, 43 different illumination conditions, and 4 different expressions. In our experiment, we used the face images in five near frontal poses (P05, P07, P09, P27, P29). Then each people has 170 face images under different illuminations and expressions. Each image was manually cropped and normalized to a size of pixels. In each experiment, we randomly picked individuals to investigate the performance of the proposed method. For our models, we set the number of small matrices and . For different number of objects , we randomly chose people with 10 trials and took all the images of them as the subsets to be clustered. Then we conducted experiments on all 10 subsets and report the average running time and clustering accuracy with a different number of objects in Table 1.
In the original work, SSC, LRR, TRR, and NVR3 all test on a small subset which consists of no more than 1,000 data points. Because of the memory and time limit, these methods can not run on a dataset of size . In our experiment, the data size is in the range of , corresponding to 5-60 objects per face. As shown in Table 1, the efficiency of all alternative methods degrades drastically when increases. When (60 objects), the space and computational complexity of these methods are unacceptable for our PC. In contrast, the computational time of Kronecker product based methods is significantly lower compared with the corresponding approaches. Our methods can easily handle more than 10,000 data points with an acceptable computing time. Further, we can see from Table 1 that the Kronecker product based methods also obtain competitive clustering accuracy (down 2 percent at most). This suggests that our model is potentially more suitable than previous methods on large scale dataset for real world applications.
5.2 Handwritten Digit Clustering
Database of handwritten digits is also widely used in subspace learning and clustering. We test the proposed methods on handwritten digit clustering with the MNIST dataset . This dataset contains 10 clusters, including handwritten digits 0-9. Each cluster contains 6,000 images for training and 1,000 images for testing, with a size of pixels in each image. We used all the 70,000 handwritten digit images for subspace clustering. Different from the experimental settings for face clustering, we fixed the number of clusters and chose different number of data points for each cluster with 10 trials. Each cluster contains data points randomly chosen from corresponding 7,000 images, where , so that the number of points . Then we applied all methods on this dataset for comparison. For our models, we set the number of small matrices and . The average running time and clustering accuracy with different number of data points are shown in Table 2.
It can be seen that the efficiency of KrSSC, KrLRR, KrTRR and KrNVR3 significantly outperform the corresponding baseline methods, which indicates the effectiveness of the Kronecker product method proposed in this paper. Table 2 also shows that our method and its variants obtain competitive clustering accuracy compared with the corresponding baseline methods.
5.3 Large-Scale Experiment
To verify the scalability of our method on large scale datasets, we also ran experiments on synthetic data. Following , we randomly generated subspaces, each of dimension in an ambient space of dimension . Each subspace contains data points randomly generated on the unit sphere, where , so that the number of points . Due to the memory and time limit, SSC, LRR, TRR and NVR3 were run for . For our models, , the number of small matrices for and for . With different number of sample points, we conducted experiments on all methods and report the average running time and clustering accuracy in Table 3.
As shown in Table 3, the advantage of our method and its variants over the baseline methods is more marked on large scale datasets. When the dataset size reaches 10,000, the computational running time of the alternate methods under comparison are about two hours each, but our Kronecker product based methods only need a few thousand seconds even for 100,000 data points. From Table 3, it is also clear that when increases from 2 to 3 for , the running time decreases significantly. The clustering accuracy can also be guaranteed compared with existing methods. Due to the limitations of memory space and computational complexity, the alternative methods can not be applied to a dataset of larger than 10,000 points. This again suggests that our methods are potentially more suitable for large real world applications.
|average running time (seconds):|
|average clustering accuracy:|
5.4 Parameter Sensitivity
Here, we report experimental results on a synthetic dataset to illustrate the sensitivity of the Kronecker product based methods to parameter variations. As the parameters (number of the small matrices) and (the balancing parameter of Eq. (4)) in our model are both related to the dataset size , we fix . Table 4 shows the average running time and clustering accuracy of our methods with different . We can see that when increases, the running time significantly decreases but with the sacrifice of clustering accuracy. This implies that the number of small matrices should be determined by the size of dataset with a compromise between efficiency and accuracy. Figure 2 shows the clustering accuracy of our methods with different balance parameter . It is evident that the clustering accuracy is insensitive when .
We have presented a fast subspace clustering model based on the Kronecker product. Due to the property that the Kronecker product of a block diagonal matrix and any other matrix is still a block diagonal matrix, we learn the representation matrix of spectral clustering using the Kronecker product of a set of smaller matrices. Thanks to the superiority of the Kronecker product in reducing the computational complexity of matrix operations, the memory space and computational complexity of our methods achieve significant efficiency gain compared with several baseline approaches (SSC, LRR, TRR, and NVR3). We have presented four variants of the Kronecker product based method, namely KrSSC, KrLRR, KrTRR and KrNVR3. Experimental results on face clustering and handwriting digit clustering show that our methods achieve significantly improvement in efficiency compared with the state-of-the-art methods. Moreover, we have presented results on synthetic data which has verified the scalability of our methods on large scale datasets.
-  Agarwal, P.K., Mustafa, N.H.: K-means projective clustering. In: Symposium on Principles of Database Systems. pp. 155–165 (2004)
-  Basri, R., Jacobs, D.W.: Lambertian reflectance and linear subspaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(2), 218–233 (2003)
-  Bradley, P.S., Mangasarian, O.L.: K-plane clustering. Journal of Global Optimization 16(1), 23–32 (2000)
-  Costeira, J.P., Kanade, T.: A multibody factorization method for independently moving objects. International Journal of Computer Vision 29(3), 159–179 (1998)
-  Elhamifar, E., Vidal, R.: Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), 2765–2781 (2013)
-  Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
-  Kanatani, K.i.: Motion segmentation by subspace separation and model selection. In: International Conference on Computer Vision. vol. 2, pp. 586–591 (2001)
-  L cun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
-  Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 171–184 (2013)
-  Lu, C.Y., Min, H., Zhao, Z.Q., Zhu, L., Huang, D.S., Yan, S.: Robust and efficient subspace segmentation via least squares regression. In: European Conference on Computer Vision. pp. 347–360 (2012)
-  Lu, C., Feng, J., Lin, Z., Yan, S.: Correlation adaptive subspace segmentation by trace lasso. In: International Conference on Computer Vision. pp. 1345–1352 (2014)
-  Patel, V.M., Van Nguyen, H., Vidal, R.: Latent space sparse subspace clustering. In: International Conference on Computer Vision. pp. 225–232 (2013)
-  Patel, V.M., Vidal, R.: Kernel sparse subspace clustering. In: International Conference on Image Processing. pp. 2849–2853 (2014)
Peng, C., Kang, Z., Cheng, Q.: Subspace clustering via variance regularized ridge regression. In: Computer Vision and Pattern Recognition (2017)
-  Peng, C., Kang, Z., Li, H., Cheng, Q.: Subspace clustering using log-determinant rank approximation. In: International Conference on Knowledge Discovery and Data Mining. pp. 925–934 (2015)
-  Peng, C., Kang, Z., Yang, M., Cheng, Q.: Feature selection embedded subspace clustering. IEEE Signal Processing Letters 23(7), 1018–1022 (2016)
Peng, X., Yi, Z., Tang, H.: Robust subspace clustering via thresholding ridge regression. In: AAAI Conference on Artificial Intelligence. pp. 3827–3833 (2015)
-  Rao, S.R., Tron, R., Vidal, R., Ma, Y.: Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In: Computer Vision and Pattern Recognition. pp. 1–8 (2008)
-  Sim, T., Baker, S., Bsat, M.: The cmu pose, illumination, and expression (pie) database of human faces. Tech. Rep. CMU-RI-TR-01-02, Pittsburgh, PA (January 2001)
-  Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural computation 11(2), 443–482 (1999)
-  Van Loan, C.F.: The ubiquitous kronecker product. Journal of computational and applied mathematics 123(1), 85–100 (2000)
-  Van Loan, C.F., Pitsianis, N.: Approximation with kronecker products. In: Linear algebra for large scale and real-time applications, pp. 293–314. Springer (1993)
-  Vidal, R.: Subspace clustering. IEEE Signal Processing Magazine 28(2), 52–68 (2011)
Vidal, R., Ma, Y., Sastry, S.: Generalized principal component analysis (gpca). IEEE Transactions on Pattern Analysis and Machine Intelligence 27(12), 1945–1959 (2005)
-  Von Luxburg, U.: A tutorial on spectral clustering. Statistics and computing 17(4), 395–416 (2007)
-  Wang, Y.X., Xu, H., Leng, C.: Provable subspace clustering: when lrr meets ssc. In: Advances in Neural Information Processing Systems. pp. 64–72 (2013)
-  Yang, J., Zhang, D., Frangi, A.F., Yang, J.y.: Two-dimensional pca: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1), 131–137 (2004)
-  You, C., Robinson, D., Vidal, R.: Scalable sparse subspace clustering by orthogonal matching pursuit. In: Computer Vision and Pattern Recognition. pp. 3918–3927 (2016)
Zhuang, L., Gao, H., Lin, Z., Ma, Y., Zhang, X., Yu, N.: Non-negative low rank and sparse graph for semi-supervised learning. In: Computer Vision and Pattern Recognition. pp. 2328–2335 (2012)