 # Self-Expressive Decompositions for Matrix Approximation and Clustering

Data-aware methods for dimensionality reduction and matrix decomposition aim to find low-dimensional structure in a collection of data. Classical approaches discover such structure by learning a basis that can efficiently express the collection. Recently, "self expression", the idea of using a small subset of data vectors to represent the full collection, has been developed as an alternative to learning. Here, we introduce a scalable method for computing sparse SElf-Expressive Decompositions (SEED). SEED is a greedy method that constructs a basis by sequentially selecting incoherent vectors from the dataset. After forming a basis from a subset of vectors in the dataset, SEED then computes a sparse representation of the dataset with respect to this basis. We develop sufficient conditions under which SEED exactly represents low rank matrices and vectors sampled from a unions of independent subspaces. We show how SEED can be used in applications ranging from matrix approximation and denoising to clustering, and apply it to numerous real-world datasets. Our results demonstrate that SEED is an attractive low-complexity alternative to other sparse matrix factorization approaches such as sparse PCA and self-expressive methods for clustering.

## Code Repositories

### SEED

Form a sparse a low rank factorization of a dataset using samples from the data

### SEED

sparse self-expressive decompositions

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Data-driven methods for sparse matrix factorization, such as sparse PCA (SPCA)  and dictionary learning [2, 3], approximate data vectors as sparse linear combinations of a small set of basis elements. While these simple approaches provide an extremely efficient representation of a dataset, the bases learned often “mix” points from different low-dimensional geometric structures and thus lead to degraded classification/clustering performance .

An alternative approach for revealing low-dimensional structure is to let the data “express itself”—to represent each element in a dataset in terms of a small subset of samples. Self-expression has already been successfully used in the context of classification , clustering [6, 7, 8], and low-rank matrix approximation [4, 9].

In contrast to learning a basis, self expression provides a provable means by which low-dimensional subspace structures can be discovered [6, 10, 11, 8]. The idea underlying self-expressive approaches for clustering, such as sparse subspace clustering (SSC)  and low-rank representations (LRR) , is to represent the dataset in terms of all of the signals in the collection. For a dataset , containing data vectors of dimensions, one computes a representation , where is a sparse matrix with zeros along the diagonal. The resulting sparse matrix

can be interpreted as an affinity matrix, where data vectors that lie in the same subspace (or cluster) are presumed to use one another in their sparse representations. The dataset is then clustered by applying spectral clustering methods

 to the graph Laplacian of .

While self-expressive methods like SSC and LRR provide a principled segmentation of into low-dimensional subspaces (subspace clustering), applying these methods to big datasets is challenging. Both SSC and LRR require the construction and storage of an affinity matrix for a dataset of size . Even when low-complexity greedy methods are used to populate the affinity matrix, as in SSC-OMP 

, clustering the data requires solving an eigenvalue problem for the entire affinity matrix, which is intractable for large

As such, the development of efficient solutions for decomposing large datasets is essential for both clustering and discovering low-dimensional structures in the data.

### I-a Our contributions: SEED

In this paper, we develop a scalable approach for sparse matrix factorization that is built upon the idea of using samples from the data to “express itself”. Our approach, which we refer to as a SElf-Expressive Decomposition (SEED), consists of two main steps. In the first step, we select data samples by sequentially selecting columns from that are incoherent (uncorrelated) from columns selected at previous iterations. To do this, we use a method called oASIS (Accelerated Sequential Incoherence Selection) . oASIS operates on a subset of the Gram matrix and can thus be used to quickly select columns from without computing the entire Gram matrix. In the second step, we use the vectors selected in the first step as a basis by which we compute a sparse representation of the dataset using a faster variant of the orthogonal matching pursuit (OMP) method . We describe SEED in detail in Sec. III and provide pseudocode in Alg. 2.

We demonstrate that SEED provides an effective strategy for matrix approximation, both in theory (Thm. 2) and in practice (Fig. 3). In particular, Thm. 2 provides a sufficient condition for oASIS to return a subset of columns that captures the full range of the data, i.e., . This condition, called exact matrix recovery

, highlights two attractive properties of our proposed approach for column selection: (i) it naturally selects linearly independent columns and thus provides a highly efficient representation of low rank matrices, and (ii) it provides an estimate of the representation error remaining in the dataset after each iteration. This error estimate can be used to stop the algorithm when explicit evaluation of the error is intractable.

Following our analysis, we demonstrate how SEED can be applied to aid in and/or solve numerous problems including: matrix approximation, clustering, denoising, and outlier detection (Sec.

V). We evaluate the performance of SEED for these applications on several real-world datasets, including three image datasets and a collection of neural signals from the motor cortex. Our results demonstrate that SEED provides a scalable alternative to other sparse decomposition methods such as SPCA, SSC, and nearest neighbor (NN) methods at a fraction of the computational cost.

### I-B Paper Organization

This paper is organized as follows. In Sec. II we provide background on column subset selection, sparse recovery, and sparse subspace clustering. In Sec. III, we introduce SEED and then provide motivating examples and a complexity analysis. In Sec. IV, we develop a sufficient condition for exact matrix recovery with SEED for low rank matrices and datasets living on unions of independent subspaces. In Sec. V, we study the performance of SEED for four applications: matrix approximation, (ii) denoising, (iii) sparse representation-based learning, and (iv) outlier detection. Finally, we end with concluding remarks (Sec. VI) and further details of our approach for column selection in Appendix A.

### I-C Notation and Preliminaries

We denote matrices with uppercase bold script and vectors with lowercase bold script. We write the entry of a matrix as . Let denote the column-wise concatenation of and . Let denote the left pseudoinverse of . The orthogonal projection of onto the span of the columns indexed by is defined as . The Frobenius norm is defined as . The support of a vector , , indexes its nonzero elements. The sparsity of equals . We denote the columns of not indexed by the set as . We denote entry-wise multiplication of and by and “” returns the sums of the columns of its argument.

We say that a collection of signals of dimension lie on a union of subspaces in when each signal , where the dimension of the subspace equals . The matrix contains independent subspaces when and is the subspace dimension. If , then the subspaces are overlapping.

## Ii Background and Related Work

### Ii-a Column Subset Selection

Consider a dataset of vectors in each represented by a column of The task of identifying columns that best represent the entire matrix is referred to as column subset selection (CSS). The CSS problem is formulated as follows:

 (CSS)min|S|=L  ∥X−πS(X)∥F.

The CSS objective aims to find a set of columns from (indexed by the set ) that best approximate in the least-squares sense. For a collection of signals that lie on a -dimensional subspace, all invertible sub matrices of will yield exact matrix recovery, i.e.,

 ∥X−πS(X)∥F=0.

Unfortunately, (CSS) is believed to be NP-hard, since it requires a brute force search for the sub-matrix of that provides the best approximation. However, a large body of literature in random and adaptive column selection has emerged over the past few years [15, 9]. While uniform random sampling is the easiest and most well-studied sampling method, a number of adaptive selection criteria have been proposed to reduce the number of samples required to achieve a target approximation error. Adaptive approaches include: (i) leverage-based sampling [4, 16] and (ii) sequential error-based selection (SES) approaches . Leverage sampling requires computing a low-rank SVD of the data matrix to determine which columns exert the most influence over its low-rank approximation. After computing the so-called “leverage scores” for the columns of the dataset, columns are drawn randomly based upon their leverage score. SES strategies select columns based upon how well they are approximated by the current sample set 

: the probability of selecting the

column is proportional to . While SES strategies are highly effective in practice, these methods are very costly, because they require computing a residual error matrix at each selection step. In contrast, the proposed column selection strategy (oASIS) requires computing and operating on only a matrix at each step, where is the iteration number and .

### Ii-B Convex Approach for Subset Selection

Recently, a convex approach was proposed to find representative columns from . This self-expressive method selects representative columns from by solving:

 minV∈RN×N∥V∥2,1subject toX=XV,

where is the sum of the -norms of the rows of . By penalizing the rows of in this way, we minimize the number of non-zero rows in which in turn minimizes the number of columns of needed to represent the dataset. This approach is known to reveal representative columns from collections of data [17, 18] and also aid in hyperspectral unmixing . However, this approach requires solving for matrix and cannot be used directly to enforce sparsity in the entries of as in SEED because group sparsity norms are known to produce dense estimates within each group.

### Ii-C Greedy Sparse Recovery

Sparse recovery methods aim to form an approximation consisting of a small number of nonzero coefficients in . Greedy methods for sparse recovery, such as orthogonal matching pursuit (OMP) , select columns from (atoms) iteratively, subtracting the contribution of each selected atom from the current signal residual. This selection process is then repeated until a stopping criterion is satisfied: either a target sparsity is reached (Sparse), or the residual magnitude becomes smaller than a pre-specified value (Error). Pseudocode for the OMP algorithm is given in Alg. 1.

### Ii-D Sparse Subspace Clustering

A number of methods for learning multiple subspaces from data (subspace clustering) use the idea of self-expression to represent the data in terms of other signals in the collection; such methods lead to state-of-the-art clustering performance for unions of subspaces [10, 7]. For instance, sparse subspace clustering (SSC)  factorizes the dataset by solving the following -minimization problem:

 minV∈RN×N  ∥V∥1  s.t.  diag(V)=0, ∥X−XV∥F≤ϵ,

where is a user set parameter which controls the error in the self-expressive approximation. The idea underlying SSC is that that each datapoint can be represented as a linear combination of a small number of points in the dataset.

The coefficient matrix computed via SSC can be interpreted as a graph, where the entry of the matrix represents the edge between the and point in the dataset; the strength of each edge represents the likelihood that two points live in the same subspace. After forming a symmetric affinity matrix , spectral clustering is then performed on the graph Laplacian of the affinity matrix to obtain labels (indicating the subspace membership) for all the points in the dataset .

The motivation underlying SSC is that the sparse representation of a signal under consideration will consist of other signals from the same subspace. In fact, the SSC procedure leads to provable guarantees of

exact feature selection

(EFS) — a condition in which every data point is represented using only data from within its own subspace [10, 11, 8, 21]. Guarantees for EFS require that there exists at least linearly independent columns in that span each -dimensional subspace in the dataset. When this occurs, we say that provides a complete reference set for a subspace. In Sec. IV-C, we show that when lies on a union of independent subspaces, SEED can be guaranteed to return a set of columns that provides a complete reference set for all of the subspaces present in the dataset.

## Iii Sparse Self-Expressive Decomposition (SEED)

In this Section, we provide a description, pseudocode (Alg. 2), and complexity analysis of SEED.

### Iii-a SEED Method

As we described in the Introduction, SEED aims to form an approximation such that is sparse and contains as few normalized columns of as necessary to represent the data. Our solution consists of two steps, which we now describe in detail.

#### Iii-A1 Step 1: Column Selection

In Step 1 of SEED, we select columns from that form a good low-dimensional approximation to the dataset. To do this, we employ a method called Accelerated Sequential Incoherence Selection (oASIS) , an adaptive strategy that selects columns that are incoherent

(uncorrelated) from one another. oASIS was originally designed to compute low rank factorizations of positive semidefinite kernel matrices used in a wide range of machine learning applications. Here, we show how oASIS can be used in a novel way, for column selection, by finding a low rank approximation to the Gram matrix

.

As a motivating example, we show the samples (images) selected via oASIS and random sampling for a dataset consisting of faces under various illumination conditions (Fig. 1). We observe that oASIS returns a set of images that are highly varied in terms of their illumination (left) because we select columns that are incoherent from those selected at previous iterations. In contrast, random sampling (right) returns a set of images with highly redundant illumination conditions and is thus less efficient at expressing the range of the dataset. We provide details (Appendix A) and pseudocode (Alg. 3) for our implementation of oASIS.

To motivate the selection criterion used in oASIS, suppose that we have already selected columns from , indexed by the set . Without loss of generality, we reorder . Let and . The least-squares approximation of in terms of the subsampled columns (i.e., the Nyström approximation ) is given by

 ˆGk=CW+CT.

We assume that contains linearly independent columns and thus inversion of is possible. In Sec. IV-A, we will show that oASIS is guaranteed to only select linearly independent columns and thus this assumption is justified. Fig. 1: Incoherence sampling from face images. Face images from one subject selected via oASIS (left) and random sampling (right). The images selected via oASIS represent a wide range of diverse illumination conditions whereas random sampling selects a number of similar (redundant) illumination conditions.

Now, we would like to determine which column to add to our current approximation (Fig. 2). If we consider a new column , the resulting upper left by block of the approximation can be written as

 ˆGk+1 =[XTk+1Xk+1bibTidi][W†k000][XTk+1Xk+1bibTidi] =[XTSXSbibTibTiW−1bi],

where , , and .

If a new column lies in the span of , then the projection . However, the discrepancy between these two scalar quantities provides a measure of how poorly represents a candidate column . Thus, without computing the entire column and measuring its projection onto the span , we can instead approximate the influence that will have on the current approximation by measuring the discrepancy between our estimate , and . Using this insight, we employ the following greedy strategy to decide which column to add to our approximation at the iteration:

1. Permute the rows and columns of such that the first columns correspond to the columns that we have already sampled and form .

2. Let denote the first entries of column , and let denote the diagonal entry in this column. For each unsampled column, calculate the lower bound on how much it changes the Nyström approximation:

 Δi=di−bTiW+kbi.
3. Select the unsampled column with maximum and add it to .

4. If the selected value of is smaller than a user set threshold, then terminate. Otherwise, let and return to Step 1.

At each iteration, one needs to compute the coefficient matrix (represented by above). We can speed up the naïve implementation above by performing rank- updates to the coefficient matrix every time a column is added. For the sake of completeness, we provide a brief description and pseudocode for oASIS in the Appendix. See  for a full discussion of oASIS and its application to low rank kernel matrix approximation. Fig. 2: Column selection with oASIS. At each step of oASIS, we project b (in green) onto the span of W (upper left red block) and select the column that produces the largest deviation when compared its corresponding diagonal entry d, i.e., we select the column that has maximal deviation Δ=|d−bTW−1b|. Our proposed column selection strategy only depends on knowledge of the red shaded regions (a subset of columns/rows of G) and does not depend on the gray shaded region containing the inner products between X and the unsampled columns X−S in question.

#### Iii-A2 Step 2: Greedy Sparse Recovery

In Step 2 of SEED, we compute the sparse representations of the columns of in terms of the set of columns selected in Step 1, given by . To do this, we first normalize all the columns in to have unit -norm; let denote the corresponding matrix of normalized datapoints. Without loss of generality, we reorder and compute its sparse decomposition as , where and is a vector containing the -norm of the column in in its entry. The columns of are computed by solving an accelerated version of OMP designed to efficiently compute the sparse representations of a batch of signals, called batch orthogonal matching pursuit (OMP) . Unlike convex optimization-based approaches for sparse recovery that use the -norm , one can constrain either the total approximation error for each column (Error) or constrain the sparsity (Sparse).

### Iii-B Variant of Alg. 2

We now introduce a variant of SEED that can be used for outlier detection and clustering applications. This variant modifies the way in which we compute the sparse representation of the sampled signals . If we reorder , we can write the sparse matrix , where contains the sparse representations of the unsampled signals in its columns. Rather than simply setting to be a diagonal matrix containing the norm of the sampled signals (as in Alg. 2), we can set to be the solution of the following objective function:

 minW∈RL×L  ∥W∥0  s.t.  ∥XS−DW∥F≤ϵ,  diag(W)=0. (1)

The idea behind this variant of SEED is to form a sparse representation of each sampled column in terms of other selected columns.

### Iii-C Complexity of SEED

The runtime complexity required to select columns in Step 1 of SEED (oASIS) is , requiring storage of a matrix at step . In Step 2 of SEED, we must compute a sparse approximation of each datapoint, where the runtime complexity of OMP for a matrix is . Thus the complexity of computing equals , which for small , is roughly . Thus, the total complexity of both steps is given by .

To contrast the complexity of SEED with other approaches for clustering, the complexity of computing nearest neighbors for a collection of data points is . The complexity of SSC-OMP  is dominated by forming a sparse representation of each column of with respect to a matrix which has runtime complexity.

## Iv Results for Exact Matrix Recovery

In this Section, we develop sufficient conditions for exact matrix recovery: this occurs when the projection of onto the subspace spanned by the subset gives us back exactly , i.e., . To prove our main result for exact matrix recovery (Thm. 2), we begin by first proving that at each iteration of oASIS, the algorithm selects samples that are linearly independent from those selected at previous iterations (Lem. 1). We then show how the application of oASIS to the Gram matrix of is guaranteed to provide exact matrix recovery.

### Iv-a Independent Selection Property of oASIS

If we assume that is rank , then exact matrix recovery occurs when contains at least linearly independent columns from . In Lemma 1 below, we provide a sufficient condition that describes when oASIS will return a set of linearly independent columns.

###### Lemma 1

At each step of oASIS, the column of the Gram matrix is linearly independent from the previously selected columns provided that .

Proof. We proceed by induction. Let denote the set of columns from already selected at the previous iterations and let denote the square matrix consisting of the entries of at the selected row and column indices after columns have been selected. Assume that is invertible since consists of linearly independent columns from . Now, consider selecting the column of and forming a new given by

 Wk+1=[Wkbk+1bTk+1dk+1],

where is a column vector corresponding to the inner products between the newly selected column and the previously selected columns (indexed by ) and is equal to . This matrix is invertible provided the Schur complement of is non-zero. The Schur complement is Thus, if is nonzero, then contains linearly independent columns, and thus the column of from which is drawn must also be linearly independent. As long as we initialize oASIS with a set of linearly independent columns, it is guaranteed to select linearly independent columns provided that for all corresponding to unselected columns. Our result follows by induction.

Remark. Lemma 1 guarantees that oASIS will return a set of linearly independent columns in steps as long as the selection criterion holds before exact reconstruction occurs. Unfortunately, in the pathological case in which the algorithm fails with before columns have been selected, the algorithm may terminate early. While it is possible to construct pathological matrices where this occurs, we have not observed this early termination in practice. The following theorem shows that when the entries of the Gram matrix are drawn from a continuous random distribution, the algorithm succeeds with probability .

###### Theorem 1

Suppose that the entries of the Gram matrix are drawn from a continuous random distribution. Assume that oASIS is initialized by randomly selecting fewer than columns from . Then oASIS succeeds in generating linearly independent columns with probability .

Proof. We begin by noting that the randomly chosen initialization columns of the Gram matrix have full rank with probability . This is because the matrix

is a random matrix drawn from a continuous distribution, and the set of singular matrices has positive co-dimension and thus measure 0. The probability of choosing a set of linearly dependent vectors by chance is thus zero.

Suppose now that columns have already been selected, where . We wish to show that it is always possible to select column The result then follows by induction.

Observe that the algorithm fails to choose the vector only if

 di=bTiW−1k−1bi,∀i∈{1,2,⋯,N}, (2)

where denotes the diagonal entry in column By construction, condition (2) holds for (the approximation used by oASIS on iteration perfectly represents these columns). For columns through the quantity

is known (i.e., it is not a random variable because it can be computed using only values of the sampled columns

through ). However, for such columns,

is continuous random variable, and thus the probability of (

2) holding for is 0.

### Iv-B Exact Matrix Recovery

Using Lemma 1, we now prove that oASIS returns a sample subset that yields exact matrix recovery. To do this, we use the fact that the Gram matrix from which oASIS selects columns spans the same space as . Thus, when we select linearly independent columns from , this also guarantees that the same set of columns from are linearly independent. This idea is made precise in the following.

###### Lemma 2

Let be a rank matrix and be the set of columns selected via oASIS from the corresponding Gram matrix. Exact matrix recovery of occurs when .

Proof. Recall that the Gram matrix is by definition of the same rank as . The columns of indexed by equals and the Thus, when , this implies that , which by assumption equals the rank of the full dataset . Therefore, when , exact recovery of is guaranteed.

To state our main result for matrix recovery with oASIS, we must make the following assumption.

###### Assumption 1

For a rank matrix, oASIS returns a set of columns before terminating with .

We are now equipped to state our main result.

###### Theorem 2 (Exact Recovery Condition)

Let be a rank matrix. Exact matrix recovery occurs as long as Assumption 1 is satisfied.

Proof. To prove Thm. 2, we must simply combine Lemma 1 with Lemma 2. To be precise, Lemma 1 states that oASIS will return a set of linearly independent columns from provided that . Thus, as long as the algorithm does not terminate with , this implies that we return a set of linearly independent columns from indexed by the index set . Now, using Lemma 2, we have that when oASIS returns a subset of columns such that the , then exact recovery is guaranteed for based upon the corresponding subset .

### Iv-C Exact Recovery from Unions of Independent Subspaces

Guarantees for EFS for SSC  and SSC-OMP  rely on the assumption that the dataset , provides a complete reference set for each low-dimensional subspace present in the data (see Sec. II-D). To make this precise, assume that the points in are drawn from a union of subspaces, where and the . We say that provides a complete reference set for if contains at least points from for all .

It follows from the definition of exact matrix recovery that, if is a union of independent subspaces and , then whenever yields exact matrix recovery we are guaranteed that also provides a complete reference set for . This is the main condition required to prove EFS in [8, 11]; thus, when SEED returns a subset of the data with at least columns for each of its -dimensional subspaces, we can compute the corresponding covering radius of each subspace (see  for further details). Thus, as long as exact recovery occurs, we can apply the theory in [8, 11] to produce guarantees that EFS occurs for the decomposition obtained via SEED. This result follows from combining Thm. 1 in  with our condition for exact recovery in Thm. 2. Fig. 3: Approximation error versus size of factorization. The relative approximation error is displayed as a function of the factorization size (L/N) for SEED, error-based sampling (SES), random sampling for: (a) Faces , (b) Neuro, (c) MNIST, (d) HS, and (e) UoS. For the smaller datasets in (a,b,e), we also display the error for leverage sampling, PCA, and SPCA.

## V Numerical Experiments

In this Section, we first introduce the datasets used in our evaluations and then evaluate the performance of SEED for matrix approximation, clustering, denoising, and outlier detection.

### V-a Datasets and Evaluation Setup

We now describe the datasets used in our evaluations.

• The Face dataset consists of images (each with pixels) of ten subjects faces under various illumination conditions, resulting in a dataset of size .

• The hyperspectral (HS) dataset consists of images from the Salinas scene, where each image contains spatial information about the scene at a different spectral band. Each image is pixels and after removing pixels without labels, the total dataset is of size . This dataset consists of types of vegetation (classes) and the spectral signatures associated with each class are very low-dimensional (approximately dimensions).

• The MNIST dataset contains images of handwritten digits ( pixels), which results in a dataset of size .

• The Neuro dataset consists of the firing rates of neurons in motor area (M1) collected at time points, when a monkey is performing a center-out reach task moving from a center position to one of targets . The resulting dataset produces a data matrix of size .

• The union-of-subspaces UoS dataset is a synthetic dataset consisting of signals drawn from a union of two subspaces of dimension with a -dimensional overlap (three coordinates are fully shared by both subspaces). We then add a collection of outliers created by generating random Gaussian vectors. This results in a dataset of size with points in the first subspace, points in the second subspace, and outlier points.

For our evaluations on the HS and MNIST datasets, we used an OpenMPI implementation of SEED written in C++. This parallelizes both oASIS for faster column selection and batch OMP for faster sparse representation computation. The experiments utilized 72 total processor cores with 4 GB of RAM per core. For the smaller datasets, we ran all of our evaluations in MATLAB with a single desktop processor.

### V-B Matrix Approximation

A well-studied application of CSS is the approximation of low rank matrices . As SEED utilizes a fast sequential algorithm for column selection (oASIS), our approach provides an effective strategy for matrix approximation. As we will show in Sec. IV-A, oASIS selects linearly independent columns at each step and thus obtains exact recovery (to machine precision) of rank matrices using samples (this is observed in practice in Fig. 3). In contrast, random and leverage-based sampling exhibit a significantly slower decay in their approximation error.

#### V-B1 Results for Matrix Approximation

To evaluate the performance of SEED for matrix approximation, we compute the approximation error as a function of decomposition size. We compare the error when sampling the dataset with: (i) oASIS, (ii) sequential error selection (SES) , (iii) uniform random sampling, and when possible (iv) leverage score sampling . In addition, we also compute the error for PCA and SPCA using the generalized power method in . The relative approximation error is

 err(X,D)=∥X−DD+X∥2F∥X∥2F.

Figure 3 displays the approximation error for all five datasets as a function the relative factorization size oASIS achieves exact matrix recovery for (b-e) when the number of points sampled equals the rank. This is in agreement with our result for matrix recovery in Thm. 2. This result suggests that our guarantee for exact matrix recovery is not overly restrictive and that SEED produces linearly independent sample sets in a wide range of real and synthetic datasets.

Interestingly, we observe similar decay in the approximation error for both the Neuro and synthetic UoS datasets: (i) the error achieved for SEED, PCA, and SPCA are roughly equivalent (all three achieve exact recovery with the number of factors/samples equal to the rank), (ii) SES trails behind SEED but also quickly achieves exact recovery, and (iii) random and leverage sampling flatline and do not achieve exact recovery. This suggests that the Neuro dataset is likely to contain both low rank structures as well as outliers that make random sampling significantly less effective for matrix approximation than SEED.

### V-C Sparse Representation-Based Learning

Sparse representation-based approaches to classification  and clustering , leverage the fact that signals from the same class (or subspace) will use one another in their sparse representations. Using this fact, the sparsity patterns of a self-expression decomposition such as SSC can be used to cluster the data using either a spectral clustering or consensus method  (see Sec. II-D for more details). In fact, recent studies have shown that, when the dataset lives on a union of subspaces, the sparse representation of a point from a subspace will only consist of points from the same subspace [6, 10, 11, 21, 8].

Just as in SSC, we can use the decomposition provided by SEED to cluster the data using the sparsity patterns in . However, because is rectangular, standard spectral clustering approaches for square affinity matrices cannot be used. Rather, we can think of as representing the edges of a bi-partite graph and thus co-clustering methods can be used in place of standard graph clustering methods. The spectral co-clustering algorithm introduced in  provides an elegant relaxation of the problem of finding a minimum cut through a bi-partite graph. An interesting consequence of using a spectral co-clustering approach is that, when we eventually solve an eigenvalue problem to find the minimum cut, we compute the second largest singular vector of a matrix rather than the second smallesteigenvector of a matrix. This enables us to exploit simple iterative methods for leading singular vectors rather than computing the entire SVD for a matrix.

#### V-C1 Results for Sparse Representation-Based Learning

In Figure 4, we show a visualization of the embedding of a union of five overlapping subspaces; we show the first three coordinates of the embedding and plot the projection of the unsampled signals as red dots and the projection of sampled signals as blue stars. This result provides evidence that: (i) co-clustering provides a feasible and efficient strategy for clustering the data with SEED and (ii) our proposed sampling strategy is capable of separating the data with far fewer samples than random sampling.

To quantify the performance of SEED for sparse representation-based clustering, we compute the cost of a normalized cut as we vary the size of the factorization . The cost of a normalized cut is a measure of how easy it is to cluster a bi-partite graph into its correct classes. This cost is defined as follows : Let index the rows of corresponding to points in the class, index the set of columns corresponding to the points in the class, and index all points in the dataset. The cost of the normalized cut for the class is

 ncut(V)=∑i∈Rk,j∈Ck|Vij|∑i∈Rk,j∈Ω|Vij|+∑i∈Rk,j∈Ck|Vij|∑i∈Ω,j∈Ck|Vij|.

In our subsequent experiments, we compare the average cost of a normalized cut for column sampling-based approaches with SSC-OMP and the NN graph. We note that both SSC-OMP and NN exhibit complexity, which make them impractical for large datasets.

For the Neuro dataset, we achieve lower cut ratios than SSC () and NN () for all of the column sampling-based approaches after sampling of the samples. With of the samples, the column sampling-based methods achieve cut ratios of (SEED), (SES), (Lev), (Rand). For these experiments, we set and . For higher values of and lower values of , we observe that SSC-OMP provides the best cut ratios. Our results suggest that, while leverage and random sampling provide poor schemes for matrix approximation (as evidenced by Fig. 3), all of the column sampling approaches provide comparable performance in terms of their cut ratios. Fig. 4: Visualization of co-clustering with SEED. Visualization of the embedding of V for a union of five overlapping 20-dimensional subspaces in R200. Pairs of subspaces have at most a 10-dimensional intersection and the rank of the dataset is r=150. On the left, we show the embedding for L={100,200,300} samples selected via random sampling and on the right, we show the embedding for SEED. To aid in visualization, we draw ellipses around samples from each subspace (each cluster) and display the sampled points as blue (∗) and unsampled points as red (∘).

In Fig. 5, we display the normalized cut ratios for the Faces dataset for six (a) and twenty (b) different subjects. We observe similar decay in the cut ratios for both datasets, where SEED and SES achieve normalized cuts less than NN and SSC-OMP when we sample of the dataset. The gap between SSC-OMP and SEED grows as we increase the number of samples to . The performance of leverage and random sampling appears to flatline just above the cut ratio for SSC and NN methods.

In many of the datasets that we tested, we observe that subsampling-based approaches can produce smaller cut ratios than SSC-OMP and NN methods. This is likely due to the fact that, as the size of the ensemble grows relative to the dimension of the underlying cluster, the graphs generated by SSC and NN become weakly connected and thus produce smaller cut ratios [30, 31]. In contrast, the sparse representations produced by SEED are built upon self-expressive bases containing incoherent columns, and thus we observe that SEED produces sparse graphs that are more well connected (within cluster) and thus produce smaller cut ratios (Sec. V).

### V-D Denoising

SEED computes a sparse approximation of each column of in terms of a small number of columns from the same dataset. By constraining the sparsity level of each column of in OMP (solving Sparse), we obtain an approximation to the column as . This approximation scheme provides a denoised version of the original dataset that is similar in spirit to NN-based denoising. However, rather than finding the nearest neighbors and applying a simple averaging procedure (NN-denoising), SEED finds both an optimal set of “neighbors” and weights to use at each datapoint. As a motivating example (Fig. 6), we show the performance of SEED for clustering hyperspectral image data after denoising the data with (d) SEED, (c) a random subset of samples, and after (b) applying -means to the original data. Here, we observe a significant improvement in clustering after denoising the data with only samples: the -means clustering error before and after denoising the data with SEED is and respectively.

#### V-D1 Results for Denoising

Due to the noisy nature of hyperspectral data, we find that SEED is a highly effective strategy for denoising such datasets. To do this, we apply SEED to compute the approximation , where and the error tolerance for OMP to . After denoising the data, we then apply a simple -means clustering algorithm to the columns in . In Fig. 6, we display the results from clustering a subset of the image with only points selected from the entire dataset (). With only a small sample of points, we observe nearly perfect clustering of the image with SEED. We compare the performance of SEED with a random sampling-based approach (identical to the setup for SEED except the sample sets are selected randomly) and clustering the original data without denoising. The clustering error is (-means applied to original data), (denoising with random samples from dataset), and (SEED). While we do not show the clustering results obtained by denoising the data with PCA, the clustering error for PCA-based denoising based upon principal components is (significantly higher than clustering the raw data). Fig. 5: Normalized cut ratios for face image database. Normalized cut ratios vs. size of factorization for collections of (a) six subject’s faces under sixty different illumination conditions (M=4032,N=360) and (b) twenty subject’s faces under sixty different illumination conditions (M=4032,N=1200). In both cases, the data is full rank, i.e., r=360 and r=1200 for (a) and (b), respectively.

### V-E Outlier Detection

When the dataset lies on a single or multiple low-dimensional subspaces, we can use SEED to discover outliers. The idea behind using SEED for outlier detection is that we try to sparely represent a data point that lies in a low-dimensional subspace, a small number of data points are required (i.e., the representation is sparse). In contrast, when we try to sparsely represent an outlier, a large number of data points are needed (i.e., the representation is dense). For instance, when a collection of signals lie on a union of -dimensional subspaces, the sparsity of each signal in a -dimensional subspace is bounded by . We can exploit this rank revealing property of SEED to determine whether a signal lies on one of the low-dimensional subspaces in the ensemble or whether it is an outlier.

To find outliers in the dataset, we form a self-expressive basis and then form the decomposition , where . Rather than setting to a diagonal matrix as in Alg. 2, the sparse coefficients are obtained by solving the SSC objective in (1). To compute both and , we utilize batch OMP to solve (Error) by providing an error tolerance to the algorithm. Constraining the error (rather than the sparsity) is important, because our goal is to use the sparsity level of each column to determine whether it is an outlier. Once we compute a sparse factorization, we compute the number of nonzeros in each column of (sparsity level) and segment the columns of based upon a user set threshold. When a column of is dense we declare it an outlier and when is sufficiently sparse, we declare it an inlier. In some cases, setting a threshold to segment the data can be straightforward (when the sparsity levels admit a bi- or multi-modal distribution). However, in cases where setting the threshold is difficult, one can use -means to learn a threshold to split the data. Fig. 6: Clustering hyperspectral images (HSI). We display the results of clustering a section of the HSI data: (a) ground truth (b) and k-means applied to the original data, (c) denoised data from a random selection of L=30 columns, and (d) denoised data obtained via SEED for L=30 columns. The clustering error is 31% (b), 15% (c), and 0.68% (d).

#### V-E1 Results for Outlier Detection

In Fig. 7, we demonstrate the rank revealing property of SEED when applied to the UoS dataset that has been corrupted with outliers. Along the bottom, we show the sparse coefficient matrices obtained for (a) SEED (), (b) Random sampling (), and (c) SPCA (). Above these coefficient matrices, we show the number of nonzeros (sparsity) of each column. In the case of SPCA, we set because, as we increase , we observe an even smaller gap in the sparsity level between signals living in low-dimensional subspaces and the outliers. For all of the methods, we compute the sparse coefficients using OMP, where we set and do not constrain the maximum sparsity level. In this case, we can clearly separate inliers from outliers: the sparsity level of inliers and outliers is around and , respectively.

Our results suggest that SEED can provide a strategy for outlier detection by simply thresholding columns based upon their sparsity level. In general, determining an appropriate threshold to segment low rank structures from outliers can be challenging. However, in practice, we observe that the distribution of column sparsity is multi-modal; thus, instead of setting the threshold explicitly, a -means algorithm can be employed to find a good threshold to segment outliers. Fig. 7: Demonstration of rank revealing property of SEED. The sparsity level (top row) and sparse coefficient matrices (bottom row) for SEED with (a) SEED (L=160), (b) Random sampling (L=160), and (c) SPCA (L=60).

## Vi Conclusions

This paper introduced SEED, a scalable method for sparse matrix factorization that couples a new and provable method for column selection (oASIS) with greedy sparse recovery (OMP). We have demonstrated how SEED can be applied to either assist or solve numerous signal processing and machine learning problems, ranging from matrix approximation and denoising, to clustering and outlier detection. In addition, we have developed a sufficient condition for SEED to achieve exact matrix recovery when we sample the same number of columns as the rank of the dataset (Thm. 2). In numerical experiments, we have shown that this result holds for a number of real-world datasets, i.e., we obtain exact recovery after sampling a number of columns equal to the matrix rank. This is in stark contrast to random sampling, where exact recovery cannot be guaranteed.

Column sampling has been explored extensively in the machine learning literature for the task of approximating low rank matrices [22, 15, 9]

. In this paper, we applied column selection to a new class of problems, namely sparse representation-based clustering/classification and subspace clustering. Thus, an important contribution of our work is showing how self-expressive approaches used in signal processing and computer vision can benefit from column selection approaches.

We have demonstrated that SEED provides self-expressive bases amenable to solving sparse representation-based learning and subspace clustering. In the case where the dataset lies on a union of independent subspaces, we have shown that our condition for exact recovery (Thm. 2) also implies that we select at least linearly independent columns from each -dimensional subspace in the dataset. However, providing a bound on the covering radius (how well each subspace is sampled) with column sampling methods is an open problem that must be solved to prove stronger results for feature selection similar to those in [10, 11, 8] for SSC. Extending our analysis to the case of approximately low rank matrices, unions of overlapping subspaces, and noisy settings are all interesting directions for future work.

## Appendix A Accelerated Sequential Incoherent Selection (oASIS)

We now provide a detailed description of our implementation of oASIS for column sampling. Pseudocode is provided in Alg. 3.

### A-a Accelerated Sampling

A naïve implementation of the column sampling approach described in Sec. III-A1 is inefficient, because each step requires a matrix inversion to form in addition to calculating the errors Fortunately, this can be done efficiently by updating the results from the previous step using block matrix inversion formulas and rank-1 updates. We now provide a derivation of the algorithm and pseudocode in Alg. 3.

We first consider the calculation of after column is added to the approximation. Let denote the first rows of column of the column of and denote its diagonal. Using a block inversion formula, we obtain

 W−1k+1 =[Wkbk+1bTk+1dk+1]−1 (3) =[W−1k+sk+1qk+1qTk+1,−sk+1qk+1−sk+1qTk+1,sk+1], (4)

where is the (scalar valued) Schur complement and is a column vector. This update formula allows to be formed by updating and only requires inexpensive vector-vector multiplication. Note that is invertible as long as (the Schur complement) is non-zero, which is guaranteed by our sampling rule: the algorithm terminates if in which case our approximation is exact.

We now consider the calculation of for all Note that on step of the method, we have We can evaluate all values of simultaneously by computing the entry-wise product of with the matrix and then summing the resulting columns. If we have already formed and on iteration then the matrix needed on the next iteration is obtained by applying Eqn. 3 to to obtain

 Rk+1 =[Rk+sk+1qk+1(qTk+1CTk−cTk+1)sk+1(−qTk+1CT+cTk+1)].

The update formula above forms by updating the matrix from the previous iteration. The update requires only matrix-vector and vector-vector products. The application of this fast update rule to perform incoherent sampling yields Alg. 3. We use this accelerated version of oASIS in all of our numerical expeditions.

## Acknowledgment

The authors would like to thank Azalia Mirhoseini, Ebrahim Songhori, and Farinaz Koushanfar for helpful discussions and their assistance in developing MPI code for running SEED on large datasets. Thanks also to Matt Perich, Lee Miller, and Mohammad Azar for collecting and sharing the neural data used in our evaluations. ELD, TAG, RJP, and RGB were funded by NSF CCF-0926127, NSF CCF-1117939, ONR N00014-12-1-0579, ONR N00014-11-1-0714, and ARO MURI W911NF-09-1-0383. ELD and KPK were funded by R01MH103910. ELD was also funded by NSF GRFP 0940902 and a Texas Instruments Distinguished Graduate Fellowship.

## References

• 

H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,”

J. Comp. Graph. Stat., vol. 15, no. 2, pp. 265–286, 2006.
•  B. Olshausen et al., “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.
•  M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, 2006.
•  M. Mahoney and P. Drineas, “CUR matrix decompositions for improved data analysis,” Proc. Natl. Acad. Sci., vol. 106, no. 3, pp. 697–702, 2009.
• 

J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,”

Proc. IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
•  E. Elhamifar and R. Vidal, “Sparse subspace clustering,” in Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 2790–2797, June 2009.
•  G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in Proc. Int. Conf. Mach. Learn., pp. 663–670, 2010.
•  E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk, “Greedy feature selection for subspace clustering,” J. Mach. Learn. Res., vol. 14, pp. 2487–2517, Sep 2013.
•  A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, “Matrix approximation and projective clustering via volume sampling,” in ACM-SIAM Symp. Disc. Alg., pp. 1117–1126, ACM, 2006.
•  E. Elhamifar and R. Vidal, “Sparse subspace clustering: algorithms, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2012.
•  M. Soltanolkotabi and E. Candès, “A geometric analysis of subspace clustering with outliers,” Ann. Stat., vol. 40, no. 4, pp. 2195–2238, 2012.
•  J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, pp. 888–905, August 2000.
•  R. Patel, T. Goldstein, E. Dyer, A. Mirhoseini, and R. Baraniuk, “oASIS: Adaptive column sampling for kernel matrix approximation,” Rice University Electrical & Computer Engineering Dept. Technical Report, Oct. 2014.
•  R. Rubinstein, M. Zibulevsky, and M. Elad, “Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit,” CS Technion, vol. 40, no. 8, pp. 1–15, 2008.
•  P. Drineas and M. W. Mahoney, “On the Nyström  method for approximating a gram matrix for improved kernel-based learning.,” J. Mach. Learn. Res., vol. 6, pp. 2153–2175, 2005.
•  M. Mahoney, P. Drineas, M. Magdon-Ismail, and D. Woodruff, “Fast approximation of matrix coherence and statistical leverage,” in Proc. Int. Conf. Mach. Learn., 2012.
•  E. Elhamifar, G. Sapiro, and R. Vidal, “See all by looking at a few: Sparse modeling for finding representative objects,” in Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 1600–1607, 2012.
•  E. Elhamifar, G. Sapiro, and R. Vidal, “Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery,” in Proc. Adv. in Neural Processing Systems, pp. 19–27, 2012.
•  X. Fu, W. Ma, T. Chan, J. Bioucas-Dias, and M. Iordache, “Greedy algorithms for pure pixels identification in hyperspectral unmixing: A multiple-measurement vector viewpoint,” in Proc. Europ. Sig. Processing Conf., pp. 1–5, IEEE, 2013.
•  G. Davis, S. Mallat, and Z. Zhang, “Adaptive time-frequency decompositions,” SPIE J. Opt. Engin., vol. 33, no. 7, pp. 2183–2191, 1994.
•  Y. Wang and H. Xu, “Noisy sparse subspace clustering,” in Proc. Int. Conf. Mach. Learn., pp. 89–97, 2013.
•  C. Williams and M. Seeger, “Using the Nyström  method to speed up kernel machines,” in Proc. Adv. in Neural Processing Systems, pp. 682–688, 2001.
•  S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comp., vol. 20, no. 1, pp. 33–61, 1998.
• 

A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,”

IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, 2001.
•  Y. LeCun, L. D. Jackel, et al., “Comparison of learning algorithms for handwritten digit recognition,” in

Int. Conf. on Artificial Neural Networks

, pp. 53–60, 1995.
•  I. Stevenson, A. Cherian, et al., “Statistical assessment of the stability of neural movement representations,” J. Neurophysiol., vol. 106, no. 2, pp. 764–774, 2011.
•  M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, “Generalized power method for sparse principal component analysis,” J. Mach. Learn. Res., vol. 11, pp. 517–553, 2010.
•  B. Gowreesunker and A. Tewfik, “Learning sparse representation using iterative subspace identification,” IEEE Trans. Signal Process., vol. 58, no. 6, pp. 3055–3065, 2010.
•  I. Dhillon, “Co-clustering documents and words using bipartite spectral graph partitioning,” in Proc. of ACM SIGKDD, pp. 269–274, ACM, 2001.
•  E. Dyer, C. Studer, and R. Baraniuk, “Subspace clustering with dense representations,” in Proc. IEEE Int. Conf. Acoust Speech Signal Process., 2013.
•  B. Nasihatkon and R. Hartley, “Graph connectivity in sparse subspace clustering,” in Proc. IEEE Conf. Comp. Vis. Patt. Recog., pp. 2137–2144, IEEE, 2011.