High-dimensional data analysis is a widespread problem in many applications of machine learning, computer vision, and bioinformatics [1, 2, 3, 4, 5, 6]. However, in many real-world datasets the intrinsic dimension of high-dimensional data is much smaller than the dimension of the ambient space, and the data can be well represented as lying close to a union of low-dimensional subspaces. The problem of segmenting data according to the low-dimensional subspaces they are drawn from is known as subspace clustering. Thanks to their capability to handle arbitrarily shaped clusters and their well-defined mathematical principles, spectral-based methods [8, 9]
are widely used approaches to subspace clustering. These methods rely on spectral graph theory: they cluster the eigenvectors of the graph Laplacian matrix corresponding to its smallest eigenvalues.
One of the main challenges in subspace clustering is the construction of an affinity matrix that captures well the (dis)similarities between data points. Among the various approaches proposed in the literature, methods based on sparse [11, 12] and low-rank representations [13, 14, 15] have been among the most successful in many applications. These methods exploit the self-expressiveness property of the data and represent each data point as a linear combination of the other data points in the dataset. Low-rank representation (LRR) [13, 14, 17]
captures the global structure of the data by imposing a low-rank constraint on the data representation matrix. Low rank implies that the representation matrix is described by a weighted sum of a small number of outer products of left and right singular vectors. In order to ensure convexity of the related optimization problem, the rank minimization is relaxed to the nuclear or Schatten-1 norm minimization problem [18, 19, 20]. Different from LRR, Sparse Subspace Clustering (SSC) [21, 11] captures local linear relationships by constraining the representation matrix to be sparse. Using the tightest convex relaxation of the ℓ0 quasi-norm, the SSC model solves the sparsity maximization problem as an ℓ1-norm minimization problem [22, 23]. Both LRR and SSC guarantee exact clustering when the subspaces are independent, but the independence assumption is overly restrictive for many real-world datasets [24, 25]. Under appropriate conditions, SSC also succeeds for disjoint subspaces. However, when the subspace dimension is higher than three, SSC can face connectivity problems resulting in a disconnected graph within a subspace. A natural way to construct an adaptive model able to capture both the global and the local structure of the data is to constrain the representation matrix to be simultaneously low-rank and sparse. In [28, 29, 16, 30] that is done by combining the nuclear and ℓ1 norms as the measures of rank and sparsity, respectively. The motivation lies in the fact that minimization of these norms results in a convex optimization problem.
Although convex, the nuclear and ℓ1 norms are not exact measures of rank and sparsity. Therefore, the optimal solution of the nuclear and ℓ1 norm regularized objective is only an approximate solution of the original problem. The proximity operator associated with the nuclear norm overpenalizes large singular values, leading to biased results in low-rank constrained optimization problems [32, 33]. Similarly, in sparsity regularized problems the ℓ1-norm solution systematically underestimates high-amplitude components of sparse vectors. Nonconvex regularizations based on ℓp quasi-norms (0 ≤ p < 1) or their approximations have been proposed for various low-rank [32, 33, 35, 36, 37, 38, 39] and sparsity regularized problems [40, 34, 41, 42, 43, 44, 45, 46, 47]. Recently, nonconvex approximations of rank and sparsity have also been introduced in the subspace clustering problem [48, 49, 50, 51, 52]. Specifically, ℓ0-induced sparse subspace clustering is introduced in . The corresponding optimization problem is solved using proximal gradient descent, which converges to a critical point under assumptions on the sparse eigenvalues. In , the authors replaced the nuclear norm regularizer with the nonconvex Ky Fan norm and proposed a proximal iteratively reweighted optimization algorithm to solve the problem. In [50, 54] the rank is approximated using Schatten-p quasi-norm regularization (0 < p < 1). The optimization problem in  is solved using a generalized matrix soft thresholding algorithm. Schatten-p quasi-norm minimization with tractable values of p is proposed in . The Schatten-p (Sp) quasi-norm is equivalent to the ℓp quasi-norm on the vector of singular values. Compared to the nuclear norm, it makes a closer approximation of the rank function. In this regard, the S0 quasi-norm can be seen as an equivalent of the ℓ0 quasi-norm and stands for the definition of the rank function. Furthermore,  combines a Schatten-p regularizer for the low-rank constraint and an ℓq quasi-norm regularizer for the sparsity constraint. However, recent results in  show that in ℓp regularized least squares (0 < p < 1) smaller values of p lead to more accurate solutions. If the ℓ0 norm is also considered, the authors show that for large measurement noise ℓ1 regularization outperforms ℓp, p < 1. However, for small measurement noise, ℓ0 quasi-norm regularization outperforms ℓp, 0 < p ≤ 1.
Motivated by the limitations discussed above, we introduce two S0/ℓ0 quasi-norm based nonconvex regularizations for low-rank sparse subspace clustering (LRSSC). First, we propose a regularization based on the multivariate generalization of the minimax-concave penalty function (GMC), introduced in  for sparsity regularized linear least squares. Here, this approach is extended to the rank approximation. The GMC penalty makes it possible to maintain the convexity of the low-rank and sparsity constrained subproblems, while achieving a better approximation of rank and sparsity than the nuclear and ℓ1 norms. Importantly, this regularization is closely related to the continuous exact ℓ0 penalty, which contains the global minimizers of the ℓ0 quasi-norm regularized least squares objective [57, 34]. The GMC penalty yields solutions of the low-rank and sparsity constrained subproblems based on firm thresholding of the singular values and coefficients of the representation matrix, respectively. The firm thresholding function is defined, for γ > λ, as:

firm(x; λ, γ) = 0 for |x| ≤ λ;  sign(x) γ(|x| − λ)/(γ − λ) for λ < |x| ≤ γ;  x for |x| > γ.  (1)
Next, we propose the direct solution of the S0 and ℓ0 quasi-norm regularized objective function. The solution of the corresponding low-rank and sparsity constrained subproblems is based on iterative application of the hard thresholding operator [59, 60, 61] on the singular values and coefficients of the representation matrix, respectively. The hard thresholding function is defined as:

hard(x; λ) = 0 for |x| ≤ λ;  x for |x| > λ.  (2)
Simultaneous rank and sparsity regularization is handled using the proximal average method, introduced in  for convex problems and recently extended to nonconvex and nonsmooth functions [63, 64]. The proximal average allows us to approximate the proximal map of the joint solution by averaging the solutions obtained separately from the low-rank and sparsity subproblems, leading to a low computational cost in each iteration. Furthermore, the proximal average method enables us to establish a global convergence guarantee for the S0/ℓ0 regularized LRSSC.
The better approximation of rank and sparsity is a consequence of the properties of the firm and hard thresholding operators associated with the GMC and S0/ℓ0 regularizations. As opposed to them, the soft thresholding operator underestimates high-amplitude coefficients in the ℓ1-norm based sparsity regularized objective, as well as large singular values in the low-rank approximation problem. As an example, Fig. 1 shows the soft, firm and hard thresholding operators used in LRSSC, GMC-LRSSC and S0/ℓ0-LRSSC, respectively.
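As an illustration, the three operators can be sketched in a few lines of NumPy. The parameterization below, with lower threshold lam and upper threshold gamma, follows the standard firm-thresholding definition; it is an illustrative choice, not necessarily the exact parameterization used in our algorithms:

```python
import numpy as np

def soft(x, lam):
    # Soft thresholding: prox of the l1 norm; shrinks every magnitude by lam,
    # so large values are systematically underestimated.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def firm(x, lam, gamma):
    # Firm thresholding (gamma > lam): zero below lam, linear interpolation
    # between lam and gamma, identity above gamma, so large values are unbiased.
    out = np.where(np.abs(x) <= lam, 0.0,
                   np.sign(x) * gamma * (np.abs(x) - lam) / (gamma - lam))
    return np.where(np.abs(x) > gamma, x, out)

def hard(x, lam):
    # Hard thresholding: keeps values with magnitude above lam unchanged.
    return np.where(np.abs(x) > lam, x, 0.0)
```

Note that soft thresholding shifts every surviving value by lam, firm thresholding is continuous and leaves values above gamma untouched, and hard thresholding is discontinuous at the threshold.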
To solve the corresponding optimization problems, we derive algorithms based on the computationally efficient Alternating Direction Method of Multipliers (ADMM). Although ADMM has been successfully applied to many nonconvex problems [66, 67, 68], only recent theoretical results establish convergence of ADMM for certain nonconvex functions [69, 70, 71, 72]. For the GMC regularization, we show that the sequence generated by the algorithm is bounded and prove that any limit point of the iteration sequence is a stationary point. For the S0/ℓ0 regularization with the proximal average approach, the global convergence of the algorithm can be guaranteed based on the property that the S0 and ℓ0 quasi-norms belong to the class of semialgebraic functions and satisfy the Kurdyka-Łojasiewicz inequality.
Experimental results on synthetic data and four real-world datasets demonstrate that the proposed nonconvex low-rank sparse subspace clustering algorithms converge fast, and to a point with lower or similar clustering error compared to the convex approximations with the nuclear and ℓ1 norms. Compared to the state-of-the-art subspace clustering methods, the proposed algorithms perform better on four benchmark datasets.
The contributions of this paper are summarized as follows:
We introduce the nonconvex generalized minimax-concave penalty in the low-rank sparse subspace clustering problem, such that the global minimizers of the proposed objective coincide with those of a convex function defined using the continuous exact ℓ0 penalty. The introduced penalty maintains the convexity of the sparsity and low-rank constrained subproblems. The proximal operator of the GMC penalty function is the firm thresholding function.
We derive ADMM based optimization algorithms for LRSSC constrained either with the GMC penalty or with the S0/ℓ0 quasi-norms. Iterative firm or hard thresholding of the singular values and coefficients of the representation matrix is used to obtain the solution of the rank and sparsity constrained subproblems.
We prove that the sequence generated by the GMC regularized LRSSC algorithm is bounded and that any limit point of the iteration sequence is a stationary point that satisfies the Karush-Kuhn-Tucker (KKT) conditions.
We establish the convergence property of the S0/ℓ0 regularized approach with the proximal average and show that the algorithm converges regardless of the initialization. To the best of our knowledge, we are the first to show convergence with S0 and ℓ0 penalties in the low-rank and sparsity constrained optimization problem.
The remainder of this paper is organized as follows. Section II gives a brief overview of the related work. In Sections III and IV we introduce the GMC and S0/ℓ0 regularized low-rank sparse subspace clustering methods, respectively. We formulate the problem, present optimization algorithms, and analyze convergence and computational complexity. The experimental results on synthetic data and four real-world datasets are presented in Section V. Finally, Section VI concludes this paper.
I-B Main Notation
Scalars are denoted by lower-case letters, vectors by bold lower-case letters, matrices by bold capital letters, and subspaces by calligraphic letters. ‖·‖F denotes the Frobenius norm, defined as the square root of the sum of the squares of the matrix elements. ‖·‖1 denotes the ℓ1 norm, defined as the sum of the absolute values of the matrix elements. ‖·‖* denotes the nuclear norm, defined as the sum of the singular values of a matrix. The ℓ0 quasi-norm is denoted by ‖·‖0 and for a matrix X is defined as:

‖X‖0 = card({(i, j) : x_ij ≠ 0})
where card(·) denotes the cardinality function. The Schatten-p quasi-norm is denoted by ‖·‖Sp and defined as:

‖X‖Sp = (Σ_i σ_i(X)^p)^(1/p)

where σ_i(X) are the singular values obtained from the singular value decomposition (SVD) X = UΣVᵀ. Since the ℓ0 quasi-norm does not satisfy the homogeneity property, it is not a norm, but with a slight abuse of notation we will refer to it as the ℓ0 norm in the rest of the paper. The null vector is denoted by 0, and diag(·) is the vector of diagonal elements of a matrix. Table I summarizes some notations used in the paper.
|Number of data points|
|Dimension of data points|
|Number of subspaces|
|Singular value decomposition of a matrix|
|Vector of singular values of a matrix|
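As a quick reference, the norms and quasi-norms defined above can be computed as follows (an illustrative NumPy sketch; the Schatten-p exponent p is a free parameter):

```python
import numpy as np

def matrix_norms(X, p=0.5):
    # Frobenius norm: square root of the sum of squared entries.
    fro = np.sqrt(np.sum(X ** 2))
    # l1 norm: sum of absolute values of the entries.
    l1 = np.sum(np.abs(X))
    # l0 quasi-norm: number of non-zero entries (cardinality).
    l0 = np.count_nonzero(X)
    sv = np.linalg.svd(X, compute_uv=False)
    # Nuclear norm: sum of singular values.
    nuc = np.sum(sv)
    # Schatten-p quasi-norm (0 < p < 1): lp quasi-norm of the singular values.
    sp = np.sum(sv ** p) ** (1.0 / p)
    return fro, l1, l0, nuc, sp
```

For example, for X = diag(3, 4) the Frobenius norm is 5 and the nuclear and ℓ1 norms coincide at 7, since the matrix is diagonal with non-negative entries.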
Consider the data matrix whose columns are data points drawn from a union of linear subspaces of unknown dimensions. Each subspace contributes a submatrix of the data matrix whose rank equals the dimension of that subspace. Given the data matrix, subspace clustering segments the data points according to the low-dimensional subspaces they are drawn from. The first step is the construction of the affinity matrix
whose elements represent the similarity between data points. An ideal affinity matrix is block diagonal (up to a permutation): non-zero similarity is assigned to points in the same subspace and zero similarity to points from different subspaces. A spectral clustering algorithm [8, 9] is then applied to the affinity matrix to obtain the memberships of data points to the subspaces.
II-A Related Work
When data points are contaminated by additive white Gaussian noise (AWGN), the following minimization problem is solved:
where the matrices are partitioned according to the corresponding index sets of singular values, with σ_i denoting the i-th singular value of the matrix.
Sparse Subspace Clustering (SSC)  represents each data point as a sparse linear combination of other data points and solves the following convex optimization problem:
where the zero-diagonal constraint on the representation matrix is used to avoid the trivial solution of representing a data point as a linear combination of itself.
For data contaminated by the AWGN, the following minimization problem is solved to approximate the sparse representation matrix:
Low-Rank Sparse Subspace Clustering (LRSSC)  requires that the representation matrix is simultaneously low-rank and sparse. LRSSC solves the following problem:
where the regularization constants weight the rank and sparsity terms, respectively. For the AWGN corrupted data, the following problem needs to be solved to approximate the representation matrix:
After the representation matrix is estimated, the affinity matrix is calculated as follows:
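A minimal sketch of this post-processing step, assuming the common symmetrization W = |C| + |C|ᵀ of the representation matrix C and a standard normalized-Laplacian spectral embedding (the function names are ours):

```python
import numpy as np

def affinity_from_representation(C):
    # Symmetrize the representation matrix: the common construction W = |C| + |C|^T.
    return np.abs(C) + np.abs(C).T

def spectral_embedding(W, n_clusters):
    # Normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(W.shape[0]) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Eigenvectors of the smallest eigenvalues span the cluster-indicator space.
    vals, vecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return vecs[:, :n_clusters]
```

The rows of the returned embedding are then grouped, e.g. with k-means, to obtain the subspace memberships.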
In the next two sections, we introduce two nonconvex regularizers for low-rank sparse subspace clustering. We formulate the low-rank sparse subspace clustering problem in the following general form:
where the two functions measure, respectively, the rank and the sparsity of the data representation matrix. The convex formulation used in (10) corresponds to choosing the nuclear norm as the rank measure and the ℓ1 norm as the sparsity measure.
III GMC-LRSSC Algorithm
III-A Problem Formulation
We propose to regularize rank and sparsity using the multivariate GMC penalty function, introduced in  for sparsity regularized least squares. We start with some definitions and results that will be used throughout the paper.
Definition 1 ()
Let B ∈ R^(m×n) and x ∈ R^n. The GMC penalty function ψ_B : R^n → R is defined as:

ψ_B(x) = ‖x‖1 − S_B(x)  (13)

where S_B is the generalized Huber function defined as:

S_B(x) = inf_v { ‖v‖1 + ½ ‖B(x − v)‖₂² }  (14)
Lemma 1 ()
Let y ∈ R^m, A ∈ R^(m×n), and λ > 0. Define F : R^n → R as:

F(x) = ½ ‖y − Ax‖₂² + λ ψ_B(x)  (15)

where ψ_B is the GMC penalty. If AᵀA − λBᵀB is a positive semidefinite matrix, F is a convex function. The convexity condition is satisfied by setting:

B = √(γ/λ) A,  0 ≤ γ ≤ 1.  (16)
The parameter γ controls the nonconvexity of the penalty ψ_B: larger values of γ increase the nonconvexity. The ℓ1 norm can be seen as a special case of this penalty obtained by setting γ = 0.
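To make the penalty concrete, the scalar case can be evaluated numerically by approximating the infimum in the generalized Huber function over a fine grid (an illustrative sketch; the scalar parameter b and the grid range are our choices, valid for inputs well inside the grid). For b = 1 the scalar GMC penalty reduces to the minimax-concave penalty: it grows like |x| near zero and saturates at 1/2:

```python
import numpy as np

def gen_huber(x, b):
    # Generalized Huber S_b(x) = inf_v { |v| + (b^2 / 2) * (x - v)^2 },
    # scalar case, approximated by minimizing over a fine grid of v.
    v = np.linspace(-10.0, 10.0, 200001)
    return float(np.min(np.abs(v) + 0.5 * b ** 2 * (x - v) ** 2))

def gmc_penalty(x, b):
    # Scalar GMC (minimax-concave) penalty: psi_b(x) = |x| - S_b(x).
    return abs(x) - gen_huber(x, b)
```

Evaluating gmc_penalty over a range of x shows the characteristic shape: zero at the origin, close to |x| for small |x|, and flat for large |x|, which is why large coefficients are not overpenalized.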
Lemma 2 ()
A function f : R^n → R is an absolutely symmetric function if:

f(x₁, …, x_n) = f(|x_π(1)|, …, |x_π(n)|)

holds for any permutation π of the indices.
Let B be a diagonal matrix and ψ_B the GMC penalty function defined in (13). The subdifferential of the singular value function ψ_B ∘ σ of a matrix Z is given by the following equation:

∂(ψ_B ∘ σ)(Z) = { U diag(θ) Vᵀ : θ ∈ ∂ψ_B(σ(Z)) }

where Z = UΣVᵀ is the SVD of Z.
Proof: It follows from  that if B is a diagonal matrix, the GMC penalty is separable, comprising a sum of scalar MC penalties. Therefore, according to Definition 2, ψ_B is an absolutely symmetric function. The proof of the proposition then follows from the property of the singular value function f ∘ σ, where f is an absolutely symmetric function.
Proposition 1 allows us to use the GMC penalty for rank approximation. We now formulate the GMC penalty regularized objective for low-rank sparse subspace clustering. Let σ(·) denote the vector of singular values of the representation matrix. By choosing the GMC penalty of the singular values as the rank function, and the GMC penalty of the matrix coefficients as the sparsity function in equation (12), we define the following nonconvex objective function:
where ψ_B denotes the GMC penalty defined in (13), parameterized by the matrix B. In the next section we show that by solving the objective (23) with ADMM, both the sparsity and the low-rank subproblems can be reduced to equation (15) with diagonal B. In this case, the GMC penalty is closely related to the continuous exact ℓ0 penalty [34, 57], which approximates the convex hull of the ℓ0 regularized least squares objective. Furthermore, a diagonal B reduces both subproblems to the element-wise firm thresholding function defined in (1). In the low-rank minimization subproblem, the firm thresholding operator is applied to the vector of singular values.
III-B Optimization Algorithm
To solve the optimization problem in (23), we introduce auxiliary variables to split the variables and solve the subproblems independently. The reformulated objective for the GMC penalty in (23) is equivalent to:
The augmented Lagrangian function of (24) is:
where , are penalty parameters and , are Lagrange multipliers.
Update rule for : Given , , , , , , we minimize the Lagrangian function in (25) with respect to :
The optimal solution of (26) is given by the following update:
Update rule for : Given , , , we minimize the Lagrangian function in (25) with respect to :
It can be seen that (28) corresponds to the least squares problem in (15) with identity A, and therefore diagonal B. It follows from condition (16) that, in order to maintain convexity of the subproblem, the parameters must be set accordingly. Using Lemma 2 and Proposition 1, (28) can be solved by element-wise firm thresholding of the singular values of the corresponding matrix.
Specifically, let denote the SVD of matrix . The closed-form solution of (28) is given by:
where is the firm thresholding function defined in (1).
Update rule for : Given , , , we minimize the objective (25) with respect to :
with subtraction of the diagonal elements:
As in the update for the low-rank term, the matrix B is diagonal and we can ensure convexity of subproblem (30) by setting the parameters according to (16). Problem (30) is then solved by firm thresholding the elements of the corresponding matrix:
Update rules for Lagrange multipliers : Given , , , , , Lagrange multipliers are updated with the following equations:
The penalty parameters are updated in each step according to:
where the step size controls the adaptive increase of the penalty parameters. For numerical reasons the penalty parameters are bounded from above, while in the convergence proof we use the unbounded formulation.
The main steps of the proposed algorithm are summarized in Algorithm 1.
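A compact sketch of these iterations is given below. This is a simplified, illustrative implementation under several assumptions of ours: a plain least-squares data term, a single penalty parameter mu shared by both constraints, and firm thresholds fixed at lam/mu and 2*lam/mu; the actual Algorithm 1 may differ in these details:

```python
import numpy as np

def firm(x, lam, gamma):
    # Firm thresholding operator (gamma > lam).
    out = np.where(np.abs(x) <= lam, 0.0,
                   np.sign(x) * gamma * (np.abs(x) - lam) / (gamma - lam))
    return np.where(np.abs(x) > gamma, x, out)

def gmc_lrssc(X, lam1=0.1, lam2=0.1, mu=1.0, rho=1.5, mu_max=1e6, n_iter=50):
    # Simplified sketch of the GMC-regularized LRSSC ADMM iterations.
    N = X.shape[1]
    XtX = X.T @ X
    C = np.zeros((N, N)); J = np.zeros((N, N)); S = np.zeros((N, N))
    L1 = np.zeros((N, N)); L2 = np.zeros((N, N))  # Lagrange multipliers
    for _ in range(n_iter):
        # C-update: least-squares step coupling the two auxiliary variables.
        A = XtX + 2.0 * mu * np.eye(N)
        C = np.linalg.solve(A, XtX + mu * J - L1 + mu * S - L2)
        # J-update: firm thresholding of singular values (low-rank subproblem).
        U, sv, Vt = np.linalg.svd(C + L1 / mu, full_matrices=False)
        J = (U * firm(sv, lam1 / mu, 2.0 * lam1 / mu)) @ Vt
        # S-update: element-wise firm thresholding with zero diagonal (sparsity).
        S = firm(C + L2 / mu, lam2 / mu, 2.0 * lam2 / mu)
        np.fill_diagonal(S, 0.0)
        # Dual ascent and adaptive penalty increase (bounded by mu_max).
        L1 += mu * (C - J)
        L2 += mu * (C - S)
        mu = min(rho * mu, mu_max)
    return C, J, S
```

The returned representation matrix C would then be symmetrized into an affinity matrix and passed to spectral clustering.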
III-C Convergence Analysis
Although choosing the parameters according to (16) guarantees convexity of the low-rank and sparsity subproblems and convergence of the related subproblems, the overall objective in (23) is nonconvex. In this section, we analyze the convergence of the proposed method and show that any limit point of the iteration sequence satisfies the Karush-Kuhn-Tucker (KKT) conditions.
The sequences generated by Algorithm 1 are all bounded.
We now state the main theorem related to the convergence property of the GMC-LRSSC algorithm.
Let the iteration sequence be generated by Algorithm 1, and suppose that the penalty parameters increase without bound. Then, any accumulation point of the sequence satisfies the Karush-Kuhn-Tucker (KKT) conditions for problem (24). In particular, whenever the sequence converges, it converges to a point that satisfies the KKT conditions.
III-D Stopping Criteria and Computational Complexity
The steps in Algorithm 1 are repeated until convergence or until the maximum number of iterations is exceeded. We check convergence by verifying the following inequalities at each iteration: , , . We found that setting the error tolerance to  works well in practice. In each step we normalize the columns of the representation matrix. This normalization is frequently applied to stabilize the convergence of non-negative matrix factorization algorithms .
The computational complexity of Algorithm 1 is , where denotes the number of iterations. In the experiments, we set the maximal to , but on all datasets the algorithm converged within less than 15 iterations. Note that the computational complexity of spectral clustering step is .
IV S0/ℓ0-LRSSC Algorithm
IV-A Problem Formulation
In addition to the GMC penalty, we propose to directly use the S0 and ℓ0 quasi-norms as constraints for low-rank and sparsity. Specifically, by choosing the S0 quasi-norm as the rank function and the ℓ0 quasi-norm as the measure of sparsity in formulation (12), we obtain the following nonconvex optimization problem:
The proximity operator of the ℓ0 quasi-norm is defined entry-wise as:
IV-B Optimization Algorithm
To solve the minimization problem in (35), we split the original problem into two variables. That leads to the following objective function:
The augmented Lagrangian function of (37) is:
where  is the penalty parameter and  is the Lagrange multiplier.
Update rule for : Given , , , minimization of the Lagrangian function in (38) yields the following update:
Update rule for : Given , , , the following problem needs to be solved:
When and , the proximal map reduces to:
Let denote the SVD of matrix . The closed-form solution of (41) is given by:
where the hard thresholding function defined in (2) is applied entry-wise to the singular values. Similarly, the proximal map for the sparsity term is given by:
The closed-form solution of (43) is obtained by applying the hard thresholding operator entry-wise to the matrix:
The proximal average, introduced recently in  and generalized to the nonconvex and nonsmooth setting in [63, 64], allows us to efficiently solve the problem in (40) when both the low-rank and sparsity terms are present. In particular, given that the two proximal maps can be easily computed using the hard thresholding operator, we approximate the joint proximal map by averaging the solutions of the proximal maps of the low-rank and sparsity regularizers:
where the averaging weights are set to be nonnegative and to sum to one.
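A minimal sketch of this proximal average step, assuming equal averaging weights and the hard-thresholding closed forms derived above (the function names are ours):

```python
import numpy as np

def hard(x, lam):
    # Hard thresholding operator.
    return np.where(np.abs(x) > lam, x, 0.0)

def prox_rank(Y, lam):
    # Proximal step for the low-rank (S0) term: hard-threshold singular values.
    U, sv, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * hard(sv, lam)) @ Vt

def prox_sparse(Y, lam):
    # Proximal step for the sparsity (l0) term: hard-threshold the entries.
    return hard(Y, lam)

def prox_average(Y, lam, w_rank=0.5, w_sparse=0.5):
    # Proximal average: a convex combination of the two individual proximal
    # maps approximates the prox of the combined regularizer (weights sum to 1).
    return w_rank * prox_rank(Y, lam) + w_sparse * prox_sparse(Y, lam)
```

Each iteration thus costs one SVD plus an element-wise threshold, instead of solving a coupled low-rank-and-sparse proximal problem.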
Update rule for the Lagrange multiplier: Given , , , the Lagrange multiplier is updated with the following equation:
The main steps of the proposed algorithm are summarized in Algorithm 2.
IV-C Convergence Analysis
Let the sequence be generated by Algorithm 2. Then, for any sufficiently large penalty parameter, Algorithm 2 converges globally (that is, regardless of the initialization, it generates a bounded sequence that has at least one limit point which is a stationary point of (38)).