ℓ_0-Motivated Low-Rank Sparse Subspace Clustering

12/17/2018 · by Maria Brbic, et al. · Ruđer Bošković Institute

In many applications, high-dimensional data points can be well represented by low-dimensional subspaces. To identify the subspaces, it is important to capture the global and local structure of the data, which is achieved by imposing low-rank and sparseness constraints on the data representation matrix. In low-rank sparse subspace clustering (LRSSC), the nuclear and ℓ_1 norms are used to measure rank and sparsity. However, the use of the nuclear and ℓ_1 norms leads to an overpenalized problem and only approximates the original problem. In this paper, we propose two ℓ_0 quasi-norm based regularizations. First, we present a regularization based on the multivariate generalization of the minimax-concave penalty (GMC-LRSSC), which contains the global minimizers of the ℓ_0 quasi-norm regularized objective. Afterward, we introduce the Schatten-0 (S_0) and ℓ_0 regularized objective and approximate the proximal map of the joint solution using the proximal average method (S_0/ℓ_0-LRSSC). The resulting nonconvex optimization problems are solved using the alternating direction method of multipliers, with convergence conditions established for both algorithms. Results obtained on synthetic and four real-world datasets show the effectiveness of GMC-LRSSC and S_0/ℓ_0-LRSSC when compared to state-of-the-art methods.


I Introduction

High-dimensional data analysis is a widespread problem in many applications of machine learning, computer vision, and bioinformatics [1, 2, 3, 4, 5, 6]. However, in many real-world datasets the intrinsic dimension of high-dimensional data is much smaller than the dimension of the ambient space, and the data can be well represented as lying close to a union of low-dimensional subspaces. The problem of segmenting data according to the low-dimensional subspaces they are drawn from is known as subspace clustering [7]. Thanks to their capability to handle arbitrarily shaped clusters and their well-defined mathematical principles, spectral based methods [8, 9] are widely used approaches to subspace clustering. These methods solve the subspace clustering problem by relying on spectral graph theory and cluster the eigenvectors of the graph Laplacian matrix corresponding to its smallest eigenvalues [10].

One of the main challenges in subspace clustering is the construction of an affinity matrix that captures well the (dis)similarities between data points. Among various approaches proposed in the literature, methods based on sparse [11, 12] and low-rank representations [13, 14, 15] have been among the most successful in many applications [16]. These methods exploit the self-expressiveness property of the data and represent each data point as a linear combination of other data points in the dataset. Low-rank representation (LRR) [13, 14, 17] captures the global structure of the data by imposing a low-rank constraint on the data representation matrix. Low rank implies that the representation matrix is described by a weighted sum of a small number of outer products of left and right singular vectors. In order to ensure convexity of the related optimization problem, rank minimization is relaxed to nuclear (Schatten-1) norm minimization [18, 19, 20]. Different from LRR, Sparse Subspace Clustering (SSC) [21, 11] captures local linear relationships by constraining the representation matrix to be sparse. Using the ℓ_1 norm as the tightest convex relaxation of the ℓ_0 quasi-norm, the SSC model solves the sparsity maximization problem as an ℓ_1 norm minimization problem [22, 23]. Both LRR and SSC guarantee exact clustering when the subspaces are independent, but the independence assumption is overly restrictive for many real-world datasets [24, 25]. Under appropriate conditions [26], SSC also succeeds for disjoint subspaces. However, when the dimension of the subspaces is higher than three, SSC can face connectivity problems, resulting in a disconnected graph within a subspace [27]. A natural way to construct an adaptive model able to capture both the global and the local structure of the data is to constrain the representation matrix to be simultaneously low-rank and sparse. In [28, 29, 16, 30] this is done by combining the nuclear and ℓ_1 norms as the measures of rank and sparsity, respectively. The motivation lies in the fact that minimization of these norms results in a convex optimization problem.

Although convex, the nuclear and ℓ_1 norms are not exact measures of rank and sparsity. Therefore, the optimal solution of the nuclear and ℓ_1 norm regularized objective is only an approximate solution of the original problem [31]. The proximity operator associated with the nuclear norm overpenalizes large singular values, leading to biased results in low-rank constrained optimization problems [32, 33]. Similarly, in sparsity regularized problems the ℓ_1 norm solution systematically underestimates high amplitude components of sparse vectors [34]. Nonconvex regularizations based on ℓ_p quasi-norms (0 ≤ p < 1) or their approximations have been proposed for various low-rank [32, 33, 35, 36, 37, 38, 39] and sparsity regularized problems [40, 34, 41, 42, 43, 44, 45, 46, 47]. Recently, nonconvex approximations of rank and sparsity have also been introduced in the subspace clustering problem [48, 49, 50, 51, 52]. Specifically, ℓ_0-induced sparse subspace clustering is introduced in [48]. The corresponding optimization problem is solved using proximal gradient descent, which under assumptions on the sparse eigenvalues converges to a critical point. In [49] the authors replaced the nuclear norm regularizer with a nonconvex Ky Fan norm based relaxation [53] and proposed a proximal iteratively reweighted optimization algorithm to solve the problem. In [50, 54] rank is approximated using Schatten-p quasi-norm regularization (0 < p < 1). The optimization problem in [50] is solved using a generalized matrix soft thresholding algorithm [55]. Schatten-p quasi-norm minimization with tractable values p = 1/2 and p = 2/3 is proposed in [51]. The Schatten-p (S_p) quasi-norm for 0 < p < 1 is equivalent to the ℓ_p quasi-norm on the vector of singular values. Compared to the nuclear norm, it makes a closer approximation of the rank function. In this regard, the S_0 quasi-norm can be seen as an equivalent of the ℓ_0 quasi-norm and stands for the definition of the rank function. Furthermore, [52] combines a Schatten-p quasi-norm regularizer for the low-rank constraint and an ℓ_q quasi-norm regularizer for the sparsity constraint. However, recent results in [56] show that in ℓ_p regularized least squares (0 < p < 1) smaller values of p lead to more accurate solutions. If the ℓ_0 norm is also considered, the authors show that for large measurement noise ℓ_1 outperforms ℓ_p, p < 1. However, for small measurement noise ℓ_0 quasi-norm regularization outperforms ℓ_p, p ≤ 1.

Motivated by the limitations discussed above, we introduce two ℓ_0 quasi-norm motivated nonconvex regularizations for low-rank sparse subspace clustering (LRSSC). First, we propose a regularization based on the multivariate generalization of the minimax-concave penalty function (GMC), introduced in [34] for sparsity regularized linear least squares. Here, this approach is extended to rank approximation. The GMC penalty makes it possible to maintain the convexity of the low-rank and sparsity constrained subproblems, while achieving a better approximation of rank and sparsity than the nuclear and ℓ_1 norms. Importantly, this regularization is closely related to the continuous exact ℓ_0 penalty, which contains the global minimizers of the ℓ_0 quasi-norm regularized least-squares objective [57, 34]. The GMC penalty yields solutions of the low-rank and sparsity constrained subproblems based on firm thresholding of the singular values and of the coefficients of the representation matrix, respectively. The firm thresholding function is defined as [58]:

\Theta(y; \lambda, \mu) = \begin{cases} 0, & |y| \le \lambda, \\ \operatorname{sign}(y)\, \dfrac{\mu (|y| - \lambda)}{\mu - \lambda}, & \lambda \le |y| \le \mu, \\ y, & |y| \ge \mu, \end{cases} \qquad \mu > \lambda > 0.   (1)

Next, we propose the direct solution of the S_0 and ℓ_0 quasi-norm regularized objective function. The solution of the corresponding low-rank and sparsity constrained subproblems is based on iterative application of the hard thresholding operator [59, 60, 61] to the singular values and the coefficients of the representation matrix, respectively. The hard thresholding function is defined as [59]:

\Theta_{H}(y; \lambda) = \begin{cases} 0, & |y| \le \lambda, \\ y, & |y| > \lambda, \end{cases} \qquad \lambda > 0.   (2)

Simultaneous rank and sparsity regularization is handled using the proximal average method, introduced in [62] for convex problems and recently extended to nonconvex and nonsmooth functions [63, 64]. The proximal average allows us to approximate the proximal map of the joint solution by averaging the solutions obtained separately from the low-rank and sparsity subproblems, leading to a low computational cost in each iteration. Furthermore, using the proximal average method enables us to establish a global convergence guarantee for S_0/ℓ_0 regularized LRSSC.

The better approximation of rank and sparsity is a consequence of the properties of the firm and hard thresholding operators associated with the GMC and S_0/ℓ_0 regularizations. As opposed to them, the soft thresholding operator underestimates high amplitude coefficients in the ℓ_1 norm based sparsity regularized objective, as well as large singular values in the nuclear norm based low-rank approximation problem. As an example, Fig. 1 shows the soft, firm and hard thresholding operators used in LRSSC, GMC-LRSSC and S_0/ℓ_0-LRSSC, respectively.

Fig. 1: Proximity operators for a fixed threshold value λ. (a) The soft-thresholding operator, associated with the ℓ_1 norm. (b) The firm-thresholding operator defined in (1), associated with the scaled MC penalty and used in the GMC-LRSSC formulation; the second threshold μ is fixed for visualization purposes. (c) The hard-thresholding operator defined in (2), associated with the ℓ_0 quasi-norm.
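To make the distinction concrete, the following minimal NumPy sketch implements the three proximity operators of Fig. 1 as element-wise functions (the threshold values passed in the example are placeholders for illustration, not values used in the paper).

```python
import numpy as np

def soft_threshold(y, lam):
    # Proximity operator of the l1 norm: shrinks every entry toward zero by lam.
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def firm_threshold(y, lam, mu):
    # Firm thresholding (1): zero below lam, identity above mu, linear in between.
    assert mu > lam > 0
    out = np.where(np.abs(y) <= lam, 0.0,
                   np.sign(y) * mu * (np.abs(y) - lam) / (mu - lam))
    return np.where(np.abs(y) >= mu, y, out)

def hard_threshold(y, lam):
    # Hard thresholding (2): keeps entries whose magnitude exceeds lam unchanged.
    return np.where(np.abs(y) > lam, y, 0.0)

y = np.linspace(-3, 3, 7)
print(soft_threshold(y, 1.0))        # biased: large values shrunk by 1
print(firm_threshold(y, 1.0, 2.0))   # unbiased for |y| >= 2
print(hard_threshold(y, 1.0))        # unbiased but discontinuous at |y| = 1
```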

To solve the corresponding optimization problems, we derive algorithms based on the computationally efficient Alternating Direction Method of Multipliers (ADMM) [65]. Although ADMM has been successfully applied to many nonconvex problems [66, 67, 68], only recent theoretical results establish convergence of ADMM for certain nonconvex functions [69, 70, 71, 72]. For GMC regularization, we show that the sequence generated by the algorithm is bounded and prove that any limit point of the iteration sequence is a stationary point. For S_0/ℓ_0 regularization with the proximal average approach, the global convergence of the algorithm can be guaranteed based on the property that the S_0 and ℓ_0 quasi-norms belong to the class of semialgebraic functions and satisfy the Kurdyka-Łojasiewicz inequality [73].

Experimental results on synthetic data and four real-world datasets demonstrate that the proposed ℓ_0-motivated low-rank sparse subspace clustering algorithms converge fast and to a point with lower or similar clustering error compared to the convex approximations with the nuclear and ℓ_1 norms. Compared to state-of-the-art subspace clustering methods, the proposed algorithms perform better on four benchmark datasets.

I-A Contributions

The contributions of this paper are summarized as follows:

  1. We introduce the nonconvex generalized minimax-concave penalty into the low-rank sparse subspace clustering problem, such that the global minimizers of the proposed objective coincide with those of a convex function defined using the continuous exact ℓ_0 penalty [57]. The introduced penalty maintains the convexity of the sparsity and low-rank constrained subproblems. The proximal operator of the related GMC penalty function is the firm thresholding function [34].

  2. We introduce S_0 and ℓ_0 pseudo-norm regularizations for LRSSC. Using the proximal average method [63, 62], we average the solutions of the proximal maps of the low-rank and sparsity subproblems, with the hard thresholding function as the proximity operator of the related penalties [59].

  3. We derive ADMM based optimization algorithms for LRSSC constrained either with the GMC penalty or with the S_0/ℓ_0 quasi-norms. Iterative firm or hard thresholding of the singular values and the coefficients of the representation matrix is used to obtain the solution of the rank and sparsity constrained subproblems.

  4. We prove that the sequence generated by the GMC regularized LRSSC algorithm is bounded and that any limit point of the iteration sequence is a stationary point that satisfies Karush-Kuhn-Tucker (KKT) conditions.

  5. We establish the convergence property of the S_0/ℓ_0 regularized approach with the proximal average and show that the algorithm converges regardless of the initialization. To the best of our knowledge, we are the first to show convergence with S_0 and ℓ_0 penalties in the low-rank and sparsity constrained optimization problem.

The remainder of this paper is organized as follows. Section II gives a brief overview of the related work. In Sections III and IV we introduce the GMC and S_0/ℓ_0 regularized low-rank sparse subspace clustering methods, respectively. We formulate the problems, present the optimization algorithms, and analyze convergence and computational complexity. Experimental results on synthetic data and four real-world datasets are presented in Section V. Finally, Section VI concludes the paper.

I-B Main Notation

Scalars are denoted by lower-case letters, vectors by bold lower-case letters, matrices by bold capital letters, and subspaces by calligraphic letters. ||·||_F denotes the Frobenius norm, defined as the square root of the sum of the squares of the matrix elements. ||·||_1 denotes the ℓ_1 norm, defined as the sum of the absolute values of the matrix elements. ||·||_* denotes the nuclear norm, defined as the sum of the singular values of a matrix. The ℓ_0 quasi-norm is denoted by ||·||_0 and for a matrix C is defined as:

\|C\|_{0} = \#\{ (i, j) : c_{ij} \ne 0 \},

where #{·} denotes the cardinality function. The Schatten-0 (S_0) quasi-norm is denoted by ||·||_{S_0} and defined as:

\|C\|_{S_{0}} = \| \sigma(C) \|_{0} = \#\{ i : \sigma_{i}(C) \ne 0 \},

where C = U diag(σ(C)) V^T is the singular value decomposition (SVD) of the matrix C and σ(C) is the vector of its singular values. Since the ℓ_0 quasi-norm does not satisfy the homogeneity property it is not a norm, but with a slight abuse of notation we will refer to it as the ℓ_0 norm in the rest of the paper. The null vector is denoted by 0, and diag(C) is the vector of diagonal elements of a matrix. Table I summarizes the main notation used in the paper.

Notation            Definition
N                   Number of data points
D                   Dimension of data points
n                   Number of subspaces
X ∈ R^{D×N}         Data matrix
C ∈ R^{N×N}         Representation matrix
A ∈ R^{N×N}         Affinity matrix
X = U Σ V^T         Singular value decomposition of X
σ(X)                Vector of singular values of X
TABLE I: Notation and abbreviations
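As a quick illustration of the two counting measures defined above, the following sketch computes ||C||_0 and ||C||_{S_0} for a small matrix (the numerical tolerance used to decide whether a value is zero is an implementation detail of this sketch, not specified in the paper).

```python
import numpy as np

def l0_quasi_norm(C, tol=1e-10):
    # Number of (numerically) nonzero entries of C.
    return int(np.sum(np.abs(C) > tol))

def schatten0_quasi_norm(C, tol=1e-10):
    # Number of (numerically) nonzero singular values of C, i.e. rank(C).
    sigma = np.linalg.svd(C, compute_uv=False)
    return int(np.sum(sigma > tol))

C = np.outer([1.0, 2.0, 0.0], [3.0, 0.0, 1.0])  # rank-1 matrix with 4 nonzero entries
print(l0_quasi_norm(C), schatten0_quasi_norm(C))  # prints: 4 1
```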

II Background

Consider the data matrix X ∈ R^{D×N} whose columns are data points drawn from a union of n linear subspaces of unknown dimensions in R^D. Let X_i denote the submatrix of X containing the points that lie in the i-th subspace. Given the data matrix X, subspace clustering segments the data points according to the low-dimensional subspaces. The first step is the construction of the affinity matrix A whose elements represent the similarity between data points. An ideal affinity matrix is block diagonal (up to a permutation): a nonzero affinity is assigned to points in the same subspace and a zero affinity to points from different subspaces. A spectral clustering algorithm [8, 9] is then applied to the affinity matrix to obtain the memberships of the data points to the subspaces.

II-A Related Work

Low-Rank Representation (LRR) [13, 14] aims to find a low-rank representation matrix C for the input data matrix X by solving the following convex optimization problem:

\min_{C} \|C\|_{*} \quad \text{s.t.} \quad X = XC,   (3)

where the nuclear norm ||C||_* is used to approximate the rank of C. Let X = U Σ V^T be the SVD of X. The closed-form solution of problem (3) is given by [14]:

C^{*} = V V^{T}.   (4)

When data points are contaminated by additive white Gaussian noise (AWGN), the following minimization problem is solved:

\min_{C} \tau \|C\|_{*} + \frac{1}{2} \|X - XC\|_{F}^{2},   (5)

where τ > 0 is the rank regularization constant. The optimal solution of problem (5) is given by [17, 74]:

C^{*} = V_{1} \left( I - \tau \Sigma_{1}^{-2} \right) V_{1}^{T},   (6)

where U = [U_1 U_2], Σ = diag(Σ_1, Σ_2) and V = [V_1 V_2]. The matrices are partitioned according to the sets I_1 = {i : σ_i(X) > √τ} and I_2 = {i : σ_i(X) ≤ √τ}, where σ_i(X) denotes the i-th singular value of X.
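The shrinkage-on-singular-values idea behind (6) can be sketched in a few lines of NumPy. This is an illustrative implementation under the formulation written in (5) above; the placement of the regularization constant τ follows that convention and may differ from other LRR variants.

```python
import numpy as np

def lrr_closed_form(X, tau):
    # Solve min_C tau*||C||_* + 0.5*||X - XC||_F^2 in closed form, eq. (6).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > np.sqrt(tau)                 # partition I_1 = {i : sigma_i > sqrt(tau)}
    V1 = Vt[keep].T                         # right singular vectors indexed by I_1
    shrink = 1.0 - tau / (s[keep] ** 2)     # diagonal of I - tau * Sigma_1^{-2}
    return V1 @ np.diag(shrink) @ V1.T

X = np.random.randn(30, 100)                # 30-dimensional ambient space, 100 points
C = lrr_closed_form(X, tau=0.5)
print(C.shape, np.linalg.matrix_rank(C))
```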

Sparse Subspace Clustering (SSC) [11] represents each data point as a sparse linear combination of the other data points and solves the following convex optimization problem:

\min_{C} \|C\|_{1} \quad \text{s.t.} \quad X = XC, \ \operatorname{diag}(C) = 0,   (7)

where the constraint diag(C) = 0 is used to avoid the trivial solution of representing a data point as a linear combination of itself.

For data contaminated by AWGN, the following minimization problem is solved to approximate the sparse representation matrix C:

\min_{C} \lambda \|C\|_{1} + \frac{1}{2} \|X - XC\|_{F}^{2} \quad \text{s.t.} \quad \operatorname{diag}(C) = 0,   (8)

where λ > 0 is the sparsity regularization constant. This problem can be solved efficiently using the ADMM optimization procedure [65, 11].

Low-Rank Sparse Subspace Clustering (LRSSC) [29] requires that the representation matrix is simultaneously low-rank and sparse. LRSSC solves the following problem:

\min_{C} \lambda_{1} \|C\|_{*} + \lambda_{2} \|C\|_{1} \quad \text{s.t.} \quad X = XC, \ \operatorname{diag}(C) = 0,   (9)

where λ_1 and λ_2 are the rank and sparsity regularization constants, respectively. For AWGN corrupted data, the following problem needs to be solved to approximate C:

\min_{C} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \|C\|_{*} + \lambda_{2} \|C\|_{1} \quad \text{s.t.} \quad \operatorname{diag}(C) = 0.   (10)

After the representation matrix C is estimated, the affinity matrix A is calculated as follows:

A = |C| + |C^{T}|.   (11)
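Once C is available from any of the models above, the final two steps of the pipeline are mechanical. The sketch below builds the symmetric affinity (11) and feeds it to an off-the-shelf spectral clustering routine; scikit-learn is used here purely for illustration and is not the implementation used in the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_representation(C, n_subspaces):
    # Symmetric, nonnegative affinity matrix A = |C| + |C^T|, eq. (11).
    A = np.abs(C) + np.abs(C).T
    model = SpectralClustering(n_clusters=n_subspaces,
                               affinity="precomputed",
                               assign_labels="kmeans",
                               random_state=0)
    return model.fit_predict(A)

# Example usage (C computed by LRR, SSC, LRSSC, or the proposed methods):
# labels = cluster_from_representation(C, n_subspaces=5)
```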

In the next two sections, we introduce two nonconvex regularizers for low-rank sparse subspace clustering. We formulate the low-rank sparse subspace clustering problem in the following general form:

\min_{C} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} f(C) + \lambda_{2} g(C) \quad \text{s.t.} \quad \operatorname{diag}(C) = 0,   (12)

where f(C) and g(C) are functions that, respectively, measure the rank and sparsity of the data representation matrix C. The convex formulation used in (10) corresponds to f(C) = ||C||_* and g(C) = ||C||_1.

III GMC-LRSSC Algorithm

III-A Problem Formulation

We propose to regularize rank and sparsity using the multivariate GMC penalty function, introduced in [34] for sparsity regularized least squares. We start with some definitions and results that will be used throughout the paper.

Definition 1 ([34])

Let x ∈ R^n and B ∈ R^{m×n}. The GMC penalty function ψ_B : R^n → R is defined as:

\psi_{B}(x) = \|x\|_{1} - S_{B}(x),   (13)

where S_B : R^n → R is the generalized Huber function defined as:

S_{B}(x) = \min_{v \in \mathbb{R}^{n}} \left\{ \|v\|_{1} + \frac{1}{2} \|B(x - v)\|_{2}^{2} \right\}.   (14)
Lemma 1 ([34])

Let y ∈ R^m, A ∈ R^{m×n}, and λ > 0. Define F : R^n → R as:

F(x) = \frac{1}{2} \|y - Ax\|_{2}^{2} + \lambda \psi_{B}(x),   (15)

where ψ_B is the GMC penalty. If A^T A − λ B^T B is a positive semidefinite matrix, F is a convex function. The convexity condition is satisfied by setting:

B = \sqrt{\gamma / \lambda}\, A, \qquad 0 \le \gamma \le 1.   (16)

The parameter γ controls the nonconvexity of the penalty ψ_B. Larger values of γ increase the nonconvexity of the penalty. The ℓ_1 norm can be seen as a special case of this penalty obtained by setting γ = 0.

Lemma 2 ([34])

Let y ∈ R^m, A ∈ R^{m×n}, and λ > 0. If A^T A is diagonal with positive entries, A^T A = diag(α_1, …, α_n), and B is given by (16), then for 0 < γ < 1 the minimizer of F in (15) is given by element-wise firm thresholding. Formally, if

x^{*} = \arg\min_{x \in \mathbb{R}^{n}} \left\{ \frac{1}{2} \|y - Ax\|_{2}^{2} + \lambda \psi_{B}(x) \right\},   (17)

then

x^{*}_{i} = \Theta\!\left( \frac{[A^{T} y]_{i}}{\alpha_{i}};\ \frac{\lambda}{\alpha_{i}},\ \frac{\lambda}{\gamma \alpha_{i}} \right), \qquad i = 1, \ldots, n,   (18)

where Θ stands for the firm thresholding function [58] defined entry-wise in (1).
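Because the exact threshold parametrization in (18) depends on how the firm threshold (1) is parameterized, a quick numerical sanity check is useful. The sketch below compares the firm-threshold formula against brute-force minimization of the scalar GMC-regularized objective; it assumes the definitions (1), (13), (14) and (16) exactly as written above, so it verifies this reconstruction rather than the original paper's exact constants.

```python
import numpy as np

def scaled_mc_penalty(x, b):
    # Scalar MC penalty phi_b(x): |x| - b^2 x^2 / 2 for |x| <= 1/b^2, else 1/(2 b^2).
    return np.where(np.abs(x) <= 1.0 / b**2,
                    np.abs(x) - 0.5 * b**2 * x**2,
                    0.5 / b**2)

def firm(t, lam, mu):
    # Firm thresholding (1), applied element-wise.
    out = np.where(np.abs(t) <= lam, 0.0,
                   np.sign(t) * mu * (np.abs(t) - lam) / (mu - lam))
    return np.where(np.abs(t) >= mu, t, out)

alpha, lam, gamma, t = 2.0, 1.0, 0.5, 1.7   # scalar case: A^T A = alpha, A^T y = t
b = np.sqrt(gamma * alpha / lam)            # from B = sqrt(gamma/lambda) * A
grid = np.linspace(-5, 5, 200001)
F = 0.5 * alpha * grid**2 - t * grid + lam * scaled_mc_penalty(grid, b)
x_brute = grid[np.argmin(F)]
x_firm = firm(t / alpha, lam / alpha, lam / (gamma * alpha))
print(x_brute, x_firm)   # the two values agree to grid resolution (0.7)
```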

Definition 2 ([75, 76])

A function f : R^n → R is an absolutely symmetric function if:

f(x_{1}, \ldots, x_{n}) = f(|x_{\pi(1)}|, \ldots, |x_{\pi(n)}|)   (19)

holds for any permutation π of (1, …, n).

Proposition 1

Let B be a diagonal matrix and ψ_B be the GMC penalty function defined in (13). The subdifferential of the singular value function ψ_B ∘ σ of a matrix Z is given by the following equation:

\partial (\psi_{B} \circ \sigma)(Z) = \left\{ U \operatorname{diag}(d)\, V^{T} : d \in \partial \psi_{B}(\sigma(Z)) \right\},   (20)

where Z = U diag(σ(Z)) V^T is the SVD of Z.

Proof: It follows from [34] that if B is a diagonal matrix with entries b_1, …, b_n, the GMC penalty is separable, comprising a sum of scalar MC penalties:

\psi_{B}(x) = \sum_{i=1}^{n} \phi_{b_{i}}(x_{i}),   (21)

where φ_b is the scaled MC penalty [77, 34] defined as:

\phi_{b}(x) = \begin{cases} |x| - \dfrac{b^{2} x^{2}}{2}, & |x| \le \dfrac{1}{b^{2}}, \\ \dfrac{1}{2 b^{2}}, & |x| \ge \dfrac{1}{b^{2}}. \end{cases}   (22)

Therefore, according to Definition 2, ψ_B is an absolutely symmetric function. The proof of the proposition then follows from the property of the singular value function f ∘ σ [76], where f is an absolutely symmetric function.

Proposition 1 allows us to use the GMC penalty for rank approximation. We now formulate the GMC penalty regularized objective for low-rank sparse subspace clustering. Let C ∈ R^{N×N}, and let σ(C) denote the vector of singular values of C. By choosing ψ_{B_1}(σ(C)) as a rank function and ψ_{B_2}(C) as a sparsity function in equation (12), we define the following nonconvex objective function:

\min_{C} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \psi_{B_{1}}(\sigma(C)) + \lambda_{2} \psi_{B_{2}}(C) \quad \text{s.t.} \quad \operatorname{diag}(C) = 0,   (23)

where ψ_B denotes the GMC penalty defined in (13), parameterized by the matrix B. In the next section we show that, when the objective (23) is solved with ADMM, both the sparsity and the low-rank subproblems reduce to the least squares problem (15) with diagonal A^T A. In this case, the GMC penalty is closely related to the continuous exact ℓ_0 penalty [34, 57], which approximates the convex hull of the ℓ_0 regularized least squares objective. Furthermore, diagonal A^T A reduces both subproblems to the element-wise firm thresholding function defined in (1). In the low-rank minimization subproblem, the firm thresholding operator is applied to the vector of singular values.

III-B Optimization Algorithm

To solve the optimization problem in (23), we introduce auxiliary variables J_1 and J_2 to split the variables and solve the subproblems independently. The reformulated objective for the GMC penalty in (23) is equivalent to:

\min_{C, J_{1}, J_{2}} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \psi_{B_{1}}(\sigma(J_{1})) + \lambda_{2} \psi_{B_{2}}(J_{2}) \quad \text{s.t.} \quad C = J_{1}, \ C = J_{2}, \ \operatorname{diag}(J_{2}) = 0.   (24)

The augmented Lagrangian function of (24) is:

\mathcal{L}(C, J_{1}, J_{2}, \Lambda_{1}, \Lambda_{2}) = \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \psi_{B_{1}}(\sigma(J_{1})) + \lambda_{2} \psi_{B_{2}}(J_{2}) + \frac{\mu_{1}}{2} \left\| C - J_{1} + \frac{\Lambda_{1}}{\mu_{1}} \right\|_{F}^{2} + \frac{\mu_{2}}{2} \left\| C - J_{2} + \frac{\Lambda_{2}}{\mu_{2}} \right\|_{F}^{2},   (25)

where μ_1, μ_2 > 0 are penalty parameters and Λ_1, Λ_2 are Lagrange multipliers.

Update rule for C: Given J_1^k, J_2^k, Λ_1^k, Λ_2^k, μ_1^k, μ_2^k, we minimize the Lagrangian function in (25) with respect to C:

C^{k+1} = \arg\min_{C} \frac{1}{2} \|X - XC\|_{F}^{2} + \frac{\mu_{1}^{k}}{2} \left\| C - J_{1}^{k} + \frac{\Lambda_{1}^{k}}{\mu_{1}^{k}} \right\|_{F}^{2} + \frac{\mu_{2}^{k}}{2} \left\| C - J_{2}^{k} + \frac{\Lambda_{2}^{k}}{\mu_{2}^{k}} \right\|_{F}^{2}.   (26)

The optimal solution of (26) is given by the following update:

C^{k+1} = \left( X^{T} X + (\mu_{1}^{k} + \mu_{2}^{k}) I \right)^{-1} \left( X^{T} X + \mu_{1}^{k} J_{1}^{k} - \Lambda_{1}^{k} + \mu_{2}^{k} J_{2}^{k} - \Lambda_{2}^{k} \right).   (27)

Update rule for J_1: Given C^{k+1}, Λ_1^k, μ_1^k, we minimize the Lagrangian function in (25) with respect to J_1:

J_{1}^{k+1} = \arg\min_{J_{1}} \lambda_{1} \psi_{B_{1}}(\sigma(J_{1})) + \frac{\mu_{1}^{k}}{2} \left\| J_{1} - \left( C^{k+1} + \frac{\Lambda_{1}^{k}}{\mu_{1}^{k}} \right) \right\|_{F}^{2}.   (28)

It can be seen that (28) corresponds to the least squares problem in (15) with A = √(μ_1^k) I and, therefore, diagonal A^T A. It follows from condition (16) that, in order to maintain convexity of the subproblem, we need to set B_1 = √(γ μ_1^k / λ_1) I, 0 ≤ γ ≤ 1. Using Lemma 2 and Proposition 1, (28) can be solved by element-wise firm thresholding of the singular values of the matrix C^{k+1} + Λ_1^k / μ_1^k.

Specifically, let U diag(σ) V^T denote the SVD of the matrix C^{k+1} + Λ_1^k / μ_1^k. The closed-form solution of (28) is given by:

J_{1}^{k+1} = U \operatorname{diag}\!\left( \Theta\!\left( \sigma;\ \frac{\lambda_{1}}{\mu_{1}^{k}},\ \frac{\lambda_{1}}{\gamma \mu_{1}^{k}} \right) \right) V^{T},   (29)

where Θ is the firm thresholding function defined in (1).
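A sketch of the singular-value firm-thresholding step (29) in NumPy; the threshold values follow the reconstruction above and should be treated as indicative.

```python
import numpy as np

def firm(y, lam, mu):
    # Firm thresholding (1), applied element-wise.
    out = np.where(np.abs(y) <= lam, 0.0,
                   np.sign(y) * mu * (np.abs(y) - lam) / (mu - lam))
    return np.where(np.abs(y) >= mu, y, out)

def svd_firm_threshold(M, lam1, mu1, gamma):
    # Update (29): firm-threshold the singular values of M = C + Lambda_1 / mu_1.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_thr = firm(s, lam1 / mu1, lam1 / (gamma * mu1))
    return (U * s_thr) @ Vt     # U @ diag(s_thr) @ Vt without forming the diagonal matrix

# Example: J1 = svd_firm_threshold(C + Lambda1 / mu1, lam1=0.1, mu1=1.0, gamma=0.6)
```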

Update rule for J_2: Given C^{k+1}, Λ_2^k, μ_2^k, we minimize the objective (25) with respect to J_2:

\widehat{J}_{2} = \arg\min_{J_{2}} \lambda_{2} \psi_{B_{2}}(J_{2}) + \frac{\mu_{2}^{k}}{2} \left\| J_{2} - \left( C^{k+1} + \frac{\Lambda_{2}^{k}}{\mu_{2}^{k}} \right) \right\|_{F}^{2},   (30)

with subtraction of the diagonal elements of \widehat{J}_{2}:

[J_{2}^{k+1}]_{ij} = [\widehat{J}_{2}]_{ij} \ (i \ne j), \qquad [J_{2}^{k+1}]_{ii} = 0.   (31)

Similarly to the update for J_1, the matrix A^T A = μ_2^k I is diagonal and we can ensure convexity of the subproblem (30) by setting B_2 = √(γ μ_2^k / λ_2) I, 0 ≤ γ ≤ 1. The problem (30) is then solved by firm thresholding the elements of the matrix C^{k+1} + Λ_2^k / μ_2^k:

\widehat{J}_{2} = \Theta\!\left( C^{k+1} + \frac{\Lambda_{2}^{k}}{\mu_{2}^{k}};\ \frac{\lambda_{2}}{\mu_{2}^{k}},\ \frac{\lambda_{2}}{\gamma \mu_{2}^{k}} \right).   (32)

Update rules for the Lagrange multipliers Λ_1, Λ_2: Given C^{k+1}, J_1^{k+1}, J_2^{k+1}, μ_1^k, μ_2^k, the Lagrange multipliers are updated with the following equations:

\Lambda_{1}^{k+1} = \Lambda_{1}^{k} + \mu_{1}^{k} \left( C^{k+1} - J_{1}^{k+1} \right), \qquad \Lambda_{2}^{k+1} = \Lambda_{2}^{k} + \mu_{2}^{k} \left( C^{k+1} - J_{2}^{k+1} \right).   (33)

The penalty parameters μ_1, μ_2 are updated in each step according to:

\mu_{1}^{k+1} = \min\left( \rho\, \mu_{1}^{k},\ \mu_{\max} \right), \qquad \mu_{2}^{k+1} = \min\left( \rho\, \mu_{2}^{k},\ \mu_{\max} \right),   (34)

where ρ ≥ 1 is the step size for adaptively changing μ_1 and μ_2. For numerical reasons μ_1, μ_2 are bounded by μ_max, while in the convergence proof we use the formulation μ_1^{k+1} = ρ μ_1^k, μ_2^{k+1} = ρ μ_2^k.

The main steps of the proposed algorithm are summarized in Algorithm 1.

0:  Data points as columns of X, regularization parameters λ_1, λ_2, γ
0:  Assignment of the data points to clusters
1:  Initialize: C = J_1 = J_2 = 0, Λ_1 = Λ_2 = 0, μ_1 > 0, μ_2 > 0
2:  Compute X^T X for later use
3:  while not converged do
4:     Update C by (27)
5:     Normalize columns of C to unit norm
6:     Update J_1 by (29)
7:     Update J_2 by (32) and (31)
8:     Update Λ_1, Λ_2 by (33)
9:     Update μ_1, μ_2 by (34)
10:  end while
11:  Calculate affinity matrix A = |C| + |C^T|
12:  Apply spectral clustering [9] to A
Algorithm 1 GMC-LRSSC by ADMM optimization
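The following is a compact NumPy sketch of Algorithm 1 under the reconstruction given above; the variable names, default parameter values and stopping rule are this sketch's own choices rather than the authors' reference implementation. Spectral clustering on the resulting affinity is shown in the snippet after (11).

```python
import numpy as np

def firm(y, lam, mu):
    out = np.where(np.abs(y) <= lam, 0.0,
                   np.sign(y) * mu * (np.abs(y) - lam) / (mu - lam))
    return np.where(np.abs(y) >= mu, y, out)

def gmc_lrssc(X, lam1=0.1, lam2=0.1, gamma=0.6, mu1=1.0, mu2=1.0,
              rho=1.5, mu_max=1e6, eps=1e-4, max_iter=100):
    D, N = X.shape
    XtX = X.T @ X                                    # precompute once (step 2)
    C = J1 = J2 = np.zeros((N, N))
    L1 = L2 = np.zeros((N, N))
    for _ in range(max_iter):
        # C update (27): regularized least squares
        rhs = XtX + mu1 * J1 - L1 + mu2 * J2 - L2
        C = np.linalg.solve(XtX + (mu1 + mu2) * np.eye(N), rhs)
        C /= np.maximum(np.linalg.norm(C, axis=0, keepdims=True), 1e-12)
        # J1 update (29): firm thresholding of singular values
        U, s, Vt = np.linalg.svd(C + L1 / mu1, full_matrices=False)
        J1 = (U * firm(s, lam1 / mu1, lam1 / (gamma * mu1))) @ Vt
        # J2 update (32), (31): element-wise firm thresholding, zero diagonal
        J2 = firm(C + L2 / mu2, lam2 / mu2, lam2 / (gamma * mu2))
        np.fill_diagonal(J2, 0.0)
        # Multiplier and penalty updates (33), (34)
        L1 = L1 + mu1 * (C - J1)
        L2 = L2 + mu2 * (C - J2)
        mu1, mu2 = min(rho * mu1, mu_max), min(rho * mu2, mu_max)
        if max(np.abs(C - J1).max(), np.abs(C - J2).max()) < eps:
            break
    return C
```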

III-C Convergence Analysis

Although choosing B according to (16) guarantees convexity of the low-rank and sparsity subproblems and convergence of the related subproblems, the objective in (23) is nonconvex. In this section, we analyze the convergence of the proposed method and show that any limit point of the iteration sequence satisfies the Karush-Kuhn-Tucker (KKT) conditions [78].

Proposition 2

The sequences generated by Algorithm 1 are all bounded.

We now state the main theorem related to the convergence property of the GMC-LRSSC algorithm.

Theorem 1

Let {(C^k, J_1^k, J_2^k, Λ_1^k, Λ_2^k)} be a sequence generated by Algorithm 1. Suppose that the successive differences J_i^{k+1} − J_i^k, i = 1, 2, converge to zero. Then, any accumulation point of the sequence satisfies the Karush-Kuhn-Tucker (KKT) conditions for problem (24). In particular, whenever the sequence converges, it converges to a point that satisfies the KKT conditions.

The proofs of Proposition 2 and Theorem 1 are given in the Appendix.

III-D Stopping Criteria and Computational Complexity

The steps in Algorithm 1 are repeated until convergence or until the maximum number of iterations is exceeded. We check convergence by verifying at each iteration k that the constraint residuals C^k − J_1^k and C^k − J_2^k fall below an error tolerance ε; a small fixed tolerance works well in practice. In each step we normalize the columns of the matrix C to unit norm. This normalization is frequently applied to stabilize the convergence of non-negative matrix factorization algorithms [79].

The computational complexity of Algorithm 1 is O(T N^3), where T denotes the number of iterations and the per-iteration cost is dominated by the SVD of an N × N matrix. In the experiments we set a maximal number of iterations, but on all datasets the algorithm converged within less than 15 iterations. Note that the computational complexity of the spectral clustering step is O(N^3).

IV S_0/ℓ_0-LRSSC Algorithm

IV-A Problem Formulation

In addition to the GMC penalty, we propose to directly use the S_0 and ℓ_0 quasi-norms as constraints for low rank and sparsity. Specifically, by choosing ||C||_{S_0} as a rank function and ||C||_0 as a measure of sparsity in formulation (12), we obtain the following nonconvex optimization problem:

\min_{C} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \|C\|_{S_{0}} + \lambda_{2} \|C\|_{0} \quad \text{s.t.} \quad \operatorname{diag}(C) = 0.   (35)

The proximity operator of the ℓ_0 quasi-norm is defined entry-wise as:

\operatorname{prox}_{\lambda \|\cdot\|_{0}}(y) = \arg\min_{x} \left\{ \lambda \|x\|_{0} + \frac{1}{2} (x - y)^{2} \right\}.   (36)

The closed-form solution of (36), with threshold √(2λ), is the hard thresholding function defined in (2). The proximity operator of the S_0 quasi-norm is the hard thresholding function applied entry-wise to the vector of singular values [60, 61].
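A sketch of the two proximal maps just described: the entry-wise ℓ_0 prox and the S_0 prox obtained by hard thresholding singular values (the √(2λ) threshold follows the standard closed form of (36) and is an assumption of this sketch).

```python
import numpy as np

def prox_l0(Y, lam):
    # Entry-wise prox of lam*||.||_0: keep entries with |y| > sqrt(2*lam).
    return np.where(np.abs(Y) > np.sqrt(2.0 * lam), Y, 0.0)

def prox_s0(Y, lam):
    # Prox of lam*||.||_{S_0}: hard-threshold the singular values of Y.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.where(s > np.sqrt(2.0 * lam), s, 0.0)) @ Vt
```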

IV-B Optimization Algorithm

To solve the minimization problem in (35), we split the original problem into two variables, C and J. That leads to the following objective function:

\min_{C, J} \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \|J\|_{S_{0}} + \lambda_{2} \|J\|_{0} \quad \text{s.t.} \quad C = J, \ \operatorname{diag}(J) = 0.   (37)

The augmented Lagrangian function of (37) is:

\mathcal{L}(C, J, \Lambda) = \frac{1}{2} \|X - XC\|_{F}^{2} + \lambda_{1} \|J\|_{S_{0}} + \lambda_{2} \|J\|_{0} + \frac{\mu}{2} \left\| C - J + \frac{\Lambda}{\mu} \right\|_{F}^{2},   (38)

where μ > 0 is the penalty parameter and Λ is the Lagrange multiplier.

Update rule for C: Given J^k, Λ^k, μ^k, minimization of the Lagrangian function in (38) with respect to C yields the following update:

C^{k+1} = \left( X^{T} X + \mu^{k} I \right)^{-1} \left( X^{T} X + \mu^{k} J^{k} - \Lambda^{k} \right).   (39)

Update rule for J: Given C^{k+1}, Λ^k, μ^k, the following problem needs to be solved:

J^{k+1} = \arg\min_{J} \lambda_{1} \|J\|_{S_{0}} + \lambda_{2} \|J\|_{0} + \frac{\mu^{k}}{2} \left\| J - \left( C^{k+1} + \frac{\Lambda^{k}}{\mu^{k}} \right) \right\|_{F}^{2}.   (40)

When λ_2 = 0 and λ_1 > 0, the proximal map reduces to:

J_{S_{0}}^{k+1} = \arg\min_{J} \lambda_{1} \|J\|_{S_{0}} + \frac{\mu^{k}}{2} \left\| J - \left( C^{k+1} + \frac{\Lambda^{k}}{\mu^{k}} \right) \right\|_{F}^{2}.   (41)

Let U diag(σ) V^T denote the SVD of the matrix C^{k+1} + Λ^k / μ^k. The closed-form solution of (41) is given by:

J_{S_{0}}^{k+1} = U \operatorname{diag}\!\left( \Theta_{H}\!\left( \sigma;\ \sqrt{2 \lambda_{1} / \mu^{k}} \right) \right) V^{T},   (42)

where Θ_H is the hard thresholding function defined in (2), applied entry-wise to σ. Similarly, when λ_1 = 0 and λ_2 > 0 the proximal map is given by:

J_{\ell_{0}}^{k+1} = \arg\min_{J} \lambda_{2} \|J\|_{0} + \frac{\mu^{k}}{2} \left\| J - \left( C^{k+1} + \frac{\Lambda^{k}}{\mu^{k}} \right) \right\|_{F}^{2}.   (43)

The closed-form solution of (43) is obtained by the hard thresholding operator applied entry-wise to the matrix C^{k+1} + Λ^k / μ^k:

J_{\ell_{0}}^{k+1} = \Theta_{H}\!\left( C^{k+1} + \frac{\Lambda^{k}}{\mu^{k}};\ \sqrt{2 \lambda_{2} / \mu^{k}} \right).   (44)

The proximal average, introduced recently in [62] and generalized to the nonconvex and nonsmooth setting in [63, 64], allows us to efficiently solve the problem in (40) when λ_1 > 0 and λ_2 > 0. In particular, given that the proximal maps (41) and (43) can be easily solved using the hard thresholding operator, we approximate the proximal map of the joint solution by averaging the solutions of the proximal maps of the low-rank and sparsity regularizers:

J^{k+1} = \omega_{1} J_{S_{0}}^{k+1} + \omega_{2} J_{\ell_{0}}^{k+1},   (45)

where the parameters ω_1 > 0 and ω_2 > 0 are set such that ω_1 + ω_2 = 1.

Furthermore, since the S_0 and ℓ_0 norms belong to the class of semi-algebraic functions [73], the proximal average function is also a semi-algebraic function [63].
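The proximal-average step (45) in code: compute the two proximal maps separately and blend them with weights summing to one (the default weights and the √(2λ/μ) thresholds follow the reconstruction above and are illustrative).

```python
import numpy as np

def hard(Y, thr):
    return np.where(np.abs(Y) > thr, Y, 0.0)

def prox_average(M, lam1, lam2, mu, w1=0.5, w2=0.5):
    # Approximate prox of lam1*||.||_{S_0} + lam2*||.||_0 at M, eq. (45).
    assert abs(w1 + w2 - 1.0) < 1e-12
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    J_s0 = (U * hard(s, np.sqrt(2.0 * lam1 / mu))) @ Vt   # eq. (42)
    J_l0 = hard(M, np.sqrt(2.0 * lam2 / mu))              # eq. (44)
    return w1 * J_s0 + w2 * J_l0
```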

Update rule for the Lagrange multiplier Λ: Given C^{k+1}, J^{k+1}, μ^k, the Lagrange multiplier is updated with the following equation:

\Lambda^{k+1} = \Lambda^{k} + \mu^{k} \left( C^{k+1} - J^{k+1} \right).   (46)

The main steps of the proposed algorithm are summarized in Algorithm 2.

0:  Data points as columns of X, regularization parameters λ_1, λ_2, ω_1, ω_2
0:  Assignment of the data points to clusters
1:  Initialize: C = J = 0, Λ = 0, μ > 0
2:  Compute X^T X for later use
3:  while not converged do
4:     Update C by (39)
5:     Normalize columns of C to unit norm
6:     Calculate the rank regularized proximal map by (42)
7:     Calculate the sparsity regularized proximal map by (44)
8:     Update J by the proximal average (45) of the solutions of (41) and (43)
9:     Update Λ by (46)
10:     Update μ
11:  end while
12:  Calculate affinity matrix A = |C| + |C^T|
13:  Apply spectral clustering [9] to A
Algorithm 2 S_0/ℓ_0-LRSSC by ADMM optimization
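For completeness, a NumPy sketch of Algorithm 2 mirroring the GMC-LRSSC sketch above; the penalty-parameter schedule, default values and stopping rule are illustrative choices, not values from the paper.

```python
import numpy as np

def hard(Y, thr):
    return np.where(np.abs(Y) > thr, Y, 0.0)

def s0l0_lrssc(X, lam1=0.1, lam2=0.1, w1=0.5, w2=0.5, mu=1.0,
               rho=1.5, mu_max=1e6, eps=1e-4, max_iter=100):
    D, N = X.shape
    XtX = X.T @ X
    C = J = np.zeros((N, N))
    L = np.zeros((N, N))
    for _ in range(max_iter):
        # C update (39)
        C = np.linalg.solve(XtX + mu * np.eye(N), XtX + mu * J - L)
        C /= np.maximum(np.linalg.norm(C, axis=0, keepdims=True), 1e-12)
        # J update: proximal average (45) of (42) and (44)
        M = C + L / mu
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        J_s0 = (U * hard(s, np.sqrt(2.0 * lam1 / mu))) @ Vt
        J_l0 = hard(M, np.sqrt(2.0 * lam2 / mu))
        J = w1 * J_s0 + w2 * J_l0
        np.fill_diagonal(J, 0.0)
        # Multiplier (46) and penalty updates
        L = L + mu * (C - J)
        mu = min(rho * mu, mu_max)
        if np.abs(C - J).max() < eps:
            break
    return C
```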

IV-C Convergence Analysis

Theorem 2

Let {(C^k, J^k, Λ^k)} be a sequence generated by Algorithm 2. Then, for any sufficiently large μ, Algorithm 2 converges globally¹ (¹That is, regardless of the initialization, it generates a bounded sequence that has at least one limit point which is a stationary point of (38).).

Proof: The results in [63] guarantee convergence of the proximal average method. To guarantee global convergence of Algorithm 2, we rewrite the problem (37) in the following more general form:

\min_{C, J} f(C) + g(J) \quad \text{s.t.} \quad AC + BJ = 0,   (47)

where f(C) = \frac{1}{2} \|X - XC\|_{F}^{2}, g(J) = \lambda_{1} \|J\|_{S_{0}} + \lambda_{2} \|J\|_{0}, A = I, B = -I,