Low Rank Matrix Recovery with Simultaneous Presence of Outliers and Sparse Corruption

02/07/2017 ∙ by Mostafa Rahmani, et al. ∙ University of Central Florida

We study a data model in which the data matrix D can be expressed as D = L + S + C, where L is a low rank matrix, S an element-wise sparse matrix and C a matrix whose non-zero columns are outlying data points. To date, robust PCA algorithms have solely considered models with either S or C, but not both. As such, existing algorithms cannot account for simultaneous element-wise and column-wise corruptions. In this paper, a new robust PCA algorithm that is robust to both types of corruption simultaneously is proposed. Our approach hinges on the sparse approximation of a sparsely corrupted column, so that the sparse expansion of a column with respect to the other data points is used to distinguish a sparsely corrupted inlier column from an outlying data point. We also develop a randomized design which provides a scalable implementation of the proposed approach. The core idea of sparse approximation is analyzed analytically, where we show that the underlying ℓ_1-norm minimization can obtain the representation of an inlier in the presence of sparse corruption.

I Introduction

Standard tools such as Principal Component Analysis (PCA) have been routinely used to reduce dimensionality by finding linear projections of high-dimensional data onto lower dimensional subspaces. The basic idea is to project the data along the directions where it is most spread out so that the residual information loss is minimized. This has been the basis for much progress in a broad range of data analysis problems, including problems in computer vision, communications, image processing, machine learning and bioinformatics [1, 2, 3, 4, 5].

PCA is notoriously sensitive to outliers, which prompted substantial effort in developing robust algorithms that are not unduly affected by outliers. Two distinct robust PCA problems were considered in prior work depending on the underlying data corruption model, namely, the low rank plus sparse matrix decomposition [6, 7] and the outlier detection problem [8].

I-A Low rank plus sparse matrix decomposition

In this problem, the data matrix is a superposition of a low rank matrix representing the low dimensional subspace, and a sparse component with arbitrary support, whose entries can have arbitrarily large magnitude modeling element-wise data corruption [7, 6, 9, 10, 11, 12], i.e.,

$\mathbf{D} = \mathbf{L} + \mathbf{S}. \qquad (1)$

For instance, [7] assumes a Bernoulli model for the support of S, in which each element of S is non-zero with a certain small probability. Given the arbitrary support, all the columns/rows of L may be affected by the sparse corruption.

may be affected by the outliers. The cutting-edge Principal Component Pursuit (PCP) approach developed in [6] and [7] directly decomposes into its low rank and sparse components by solving a convex program that minimizes a weighted combination of the nuclear norm

(sum of singular values), and the

-norm ,

(2)

If the column and row spaces of are sufficiently incoherent and the non-zero elements of sufficiently diffused, (2) can provably recover the exact low rank and sparse components [6, 7].
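Below is a minimal Python sketch of the convex program (2) using cvxpy. The weight λ = 1/sqrt(max(n1, n2)) follows the standard PCP recommendation and the solver choice is an assumption; this is an illustrative sketch rather than the implementation used in the paper.

# Sketch of Principal Component Pursuit (PCP), the convex program in (2).
import numpy as np
import cvxpy as cp

def pcp(D, lam=None):
    n1, n2 = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))   # standard PCP weight
    L = cp.Variable((n1, n2))
    S = cp.Variable((n1, n2))
    objective = cp.Minimize(cp.normNuc(L) + lam * cp.sum(cp.abs(S)))
    problem = cp.Problem(objective, [L + S == D])
    problem.solve(solver=cp.SCS)           # any SDP-capable solver works
    return L.value, S.value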

I-B Outlier detection

In the outlier detection problem, outliers only affect a portion of the columns of D, i.e., the corruption is column-wise. The given data is modeled as D = L + C. A set of the columns of the outlier matrix C, the so-called outliers, are non-zero and they do not lie in the Column Space (CS) of L. In this problem, it is required to retrieve the CS of L or locate the outlying columns.

Many approaches were developed to address this problem, including [13, 14, 15, 8, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. In [8], it is assumed that C is a column-sparse matrix (i.e., only few data points are outliers) and a matrix decomposition algorithm is proposed to decompose the data into low rank and column-sparse components. In [18], the ℓ_2-norm in the PCA optimization problem is replaced with an ℓ_1-norm to grant robustness against outliers. In [26], we leveraged the low mutual coherence between the outlier data points and the other data points to set them apart from the inliers. An alternative approach relies on the observation that small subsets of the columns of L are linearly dependent (since L lies in a low-dimensional subspace) while small subsets of the outlier columns are not, given that outliers do not typically follow low-dimensional structures. Several algorithms exploit this feature to locate the outliers [23, 15].

I-C Notation

Capital and small letters are used to denote matrices and vectors, respectively. For a matrix A, a^i is its i-th row, a_i its i-th column, and N(A) its null space, i.e., N(A) is the complement of the row space of A. ‖A‖ denotes its spectral norm, ‖A‖_* its nuclear norm, which is the sum of the singular values, and ‖A‖_1 its ℓ_1-norm, given by the sum of the absolute values of its elements. In addition, the matrix A_{-i} denotes the matrix A with its i-th column removed. In an N-dimensional space, e_i is the i-th vector of the standard basis. For a given vector a, ‖a‖_p denotes its ℓ_p-norm. The element-wise functions sgn(·) and |·| are the sign and absolute value functions, respectively.

I-D The identifiability problem

In this paper, we consider a generalized data model which incorporates simultaneous element-wise and column-wise data corruption. In other words, the given data matrix can be expressed as D = L + S + C, where L is the low rank matrix, C contains the outliers, and S is an element-wise sparse matrix with an arbitrary support. Without loss of generality, we assume that the columns of L and S corresponding to the non-zero columns of C are equal to zero. We seek a robust PCA algorithm that can exactly decompose the given data matrix into L, S, and C. Without any further assumptions, this decomposition problem is clearly ill-posed. Indeed, there are many scenarios where a unique decomposition of D may not exist. For instance, the low rank matrix L can be element-wise sparse, or the non-zero columns of C can be sparse, or the sparse matrix S can be low rank. In the following, we briefly discuss various identifiability issues.

1. Distinguishing L from S: The identifiability of the low rank plus sparse decomposition problem [7, 6], in which D = L + S, was studied in [6]. This problem was shown to admit a unique decomposition as long as the column and row spaces of L are sufficiently incoherent with the standard basis and the non-zero elements of S are sufficiently diffused (i.e., not concentrated in few columns/rows). These conditions are intuitive in that they essentially require the low rank matrix to be non-sparse and the sparse matrix not to be of low rank.

2. Distinguishing outliers from inliers: Consider the outlier detection problem in which D = L + C. Much research was devoted to studying different versions of this problem, and various requirements on the distributions of the inliers and outliers were provided to warrant successful detection of outliers. The authors in [8] considered a scenario where C is column-sparse, i.e., only few data columns are actually outliers, and established guarantees for unique decomposition when the rank of L and the number of non-zero columns of C are sufficiently small. The approach presented in [23] does not necessitate column sparsity but requires that small sets of outliers be linearly independent. Under this assumption on the distribution of the outliers, exact decomposition can be guaranteed even if a substantial portion of the data columns are outliers. In this paper, we make the same assumption about the distribution of the outliers, namely, we assume that an outlier cannot be obtained as a linear combination of few other outliers.

3. Distinguishing a sparse matrix from an outlier matrix: Suppose D = S + C and assume that the columns of S corresponding to the non-zero columns of C are equal to zero. Thus, if the columns of S are sufficiently sparse and the non-zero columns of C are sufficiently dense, one should be able to locate the outlying columns by examining the sparsity of the columns of D. For example, suppose the support of S follows the Bernoulli model with parameter ρ and that the non-zero elements of C are sampled from a zero mean normal distribution. If the number of rows of D is sufficiently large, the fraction of non-zero elements of a non-zero column of S concentrates around ρ, while all the elements of a non-zero column of C are non-zero with very high probability.
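The following short Python snippet illustrates this concentration argument numerically; the dimension and the value of ρ are arbitrary choices for this sketch, not values used in the paper.

# For a large number of rows, the fraction of non-zero entries in a sparse
# column concentrates around rho, while a dense outlier column drawn from
# the unit sphere has no zero entries almost surely.
import numpy as np

rng = np.random.default_rng(0)
n1, rho = 2000, 0.02
s_col = rng.standard_normal(n1) * (rng.random(n1) < rho)   # sparsely corrupted column
c_col = rng.standard_normal(n1)
c_col /= np.linalg.norm(c_col)                              # outlier column on the sphere

print(np.count_nonzero(s_col) / n1)   # close to rho = 0.02
print(np.count_nonzero(c_col) / n1)   # 1.0 almost surely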

I-E Data model

To the best of our knowledge, this is the first work to account for the simultaneous presence of both sources of corruption. In the numerical examples presented in this paper, we utilize the following data model.

Data Model 1.

The given data matrix D follows the following model.
1. The data matrix can be expressed as

$\mathbf{D} = \mathbf{L} + \mathbf{S} + \mathbf{C}. \qquad (3)$

2. The rank of L is equal to r.
3. The matrix C has K non-zero columns. The non-zero columns of C are i.i.d. random vectors uniformly distributed on the unit sphere in the ambient space. Thus, a non-zero column of C does not lie in the CS of L with overwhelming probability.
4. The non-zero elements of S follow the Bernoulli model with parameter ρ, i.e., each element of S is non-zero independently with probability ρ.
5. Without loss of generality, it is assumed that the columns of L and S corresponding to the non-zero columns of C are equal to zero.
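A minimal Python sketch that generates synthetic data according to Data Model 1 is given below. The dimensions, rank, number of outliers and Bernoulli rate are illustrative assumptions, not the values used in the paper's experiments, and the placement of the outlying columns at the end is arbitrary.

# Synthetic data following Data Model 1 (illustrative parameter choices).
import numpy as np

def data_model_1(n1=100, n2=400, r=5, K=40, rho=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Low rank component L (zero on the outlying columns).
    L = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2 - K))
    L = np.hstack([L, np.zeros((n1, K))])
    # Element-wise sparse component S with Bernoulli(rho) support (zero on outlying columns).
    S = rng.standard_normal((n1, n2)) * (rng.random((n1, n2)) < rho)
    S[:, n2 - K:] = 0.0
    # Outlying columns drawn uniformly from the unit sphere.
    C = np.zeros((n1, n2))
    C[:, n2 - K:] = rng.standard_normal((n1, K))
    C[:, n2 - K:] /= np.linalg.norm(C[:, n2 - K:], axis=0)
    return L + S + C, L, S, C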

Remark 1.

The uniform distribution of the outlying columns (the non-zero columns of C) over the unit sphere is not a necessary requirement for the proposed methods. We have made this assumption in the data model to ensure the following requirements are satisfied with high probability:

  • The non-zero columns of C do not lie in the CS of L.

  • The non-zero columns of C are not sparse vectors.

  • A small subset of the non-zero columns of C is linearly independent.

Similarly, the Bernoulli distribution of the non-zero elements of the sparse matrix S is not a necessary requirement. This assumption is used here to ensure that the support is not concentrated in some columns/rows. This is needed for S to be distinguishable from the outlier matrix C and for ensuring that the sparse matrix is not low rank with high probability.

The proposed data model is pertinent to many applications of machine learning and data analysis. Below, we provide two scenarios motivating the data model in (3).

I. Facial images with different illuminations were shown to lie in a low dimensional subspace [1]. Now suppose a given dataset consists of some sparsely corrupted face images along with a few images of random objects (e.g., buildings, cars, cities, etc.). The images of the random objects cannot be modeled as face images with sparse corruption, which calls for means to recover the face images while being robust to the presence of the random images.

II. A user rating matrix in recommender systems can be modeled as a low rank matrix owing to the similarity between people's preferences for different products. To account for natural variability in user profiles, the low rank plus sparse matrix model can better represent the data. However, profile-injection attacks, captured by the matrix C, may introduce outliers in the user rating databases to promote or suppress certain products. The model (3) captures both element-wise and column-wise abnormal ratings.

I-F Motivating scenarios

The low rank plus sparse matrix decomposition algorithms – which only consider the presence of S – are not applicable to our generalized data model given that C is not necessarily a sparse matrix. Also, when C is column-sparse, it may well be low rank, which violates the identifiability conditions of the PCP approach for the low rank plus sparse matrix decomposition [6]. As an illustrative example, assume D follows Data Model 1. We apply the decomposition method (2) to D and learn the CS of L from the obtained low rank component. Define the recovery error as $\| (\mathbf{I} - \hat{\mathbf{U}} \hat{\mathbf{U}}^T) \mathbf{U} \|_F / \| \mathbf{U} \|_F$, where U is an orthonormal basis for the CS of L and Û is the learned basis. Fig. 1 shows the log recovery error versus the number of outlying columns K. Clearly, (2) cannot yield correct subspace recovery in the presence of outliers.

On the other hand, robust PCA algorithms that solely consider the column-wise corruption are bound to fail in the presence of the sparse corruption, since a crucial requirement of such algorithms is that a set of the columns of D lies in the CS of L. However, in the presence of the sparse corruption matrix S, even the columns of D corresponding to the zero columns of C might not lie in the CS of L. For instance, assume D follows Data Model 1. Fig. 2 shows the log recovery error versus ρ. In this example, the robust PCA algorithm presented in [14] is utilized for subspace recovery. It is clear that the algorithm cannot yield correct subspace recovery as ρ increases. The work of this paper is motivated by the preceding shortcomings of existing approaches.

At first thought, one may attempt to tackle the simultaneous presence of sparse corruption and outliers by solving

$\min_{\mathbf{L},\, \mathbf{S},\, \mathbf{C}} \ \| \mathbf{L} \|_* + \lambda_1 \| \mathbf{S} \|_1 + \lambda_2 \| \mathbf{C} \|_{1,2} \quad \text{subject to} \quad \mathbf{L} + \mathbf{S} + \mathbf{C} = \mathbf{D}, \qquad (4)$

where λ_1 and λ_2 are regularization parameters and ‖C‖_{1,2} denotes the sum of the ℓ_2-norms of the columns of C. This formulation combines the norms used in the algorithms in [8] and [6]. However, this method requires tuning two parameters and, more importantly, inherits the limitations of [8]. Specifically, the approach in [8] requires the rank of L to be substantially smaller than the dimension of the data, and fails when there are too many outliers. Also, our experiments have shown that (4) does not yield an accurate decomposition of the data matrix. For illustration, consider D following Data Model 1. Let D̂ denote the columns of D indexed by the complement of the column support of C, and Ŝ the corresponding sparse component recovered by (4). Table I shows the error in recovery of the sparse component versus the rank r. As shown, the convex program in (4), which combines (2) and [8], cannot yield an accurate decomposition, knowing that in the absence of column outliers (2) does recover the sparse component with recovery error below 0.01 for all values of r in Table I.
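For concreteness, the following Python sketch sets up the convex program (4) with cvxpy; the weights lam1 and lam2 are hypothetical and would need tuning, which is part of the drawback discussed above.

# Sketch of the combined convex program (4): PCP plus a column-sparse penalty on C.
import cvxpy as cp

def combined_decomposition(D, lam1, lam2):
    n1, n2 = D.shape
    L = cp.Variable((n1, n2))
    S = cp.Variable((n1, n2))
    C = cp.Variable((n1, n2))
    objective = cp.Minimize(cp.normNuc(L)
                            + lam1 * cp.sum(cp.abs(S))            # element-wise sparsity
                            + lam2 * cp.sum(cp.norm(C, axis=0)))  # column sparsity
    problem = cp.Problem(objective, [L + S + C == D])
    problem.solve(solver=cp.SCS)
    return L.value, S.value, C.value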

I-G Summary of contributions

In this paper, we develop a new robust PCA approach, dubbed the sparse approximation approach, which can account for both types of corruption simultaneously. Below, we provide a summary of contributions.

  • The sparse approximation approach: In this approach, we put forth an ℓ_1-norm minimization formulation that allows us to find a sparse approximation for the columns of D using a sparse representation. This idea is used to locate the outlying columns of D which, once identified, reduces the primary problem to one of low rank plus sparse matrix decomposition.

    r                2       5       10      15
    Recovery error   0.04    0.26    0.51    0.65
    TABLE I: Recovery error in the sparse component using the convex program (4).
  • We develop a new randomized design which provides a scalable implementation of the proposed method. In this design, the CS of L is learned using a few randomly sampled data columns. Subsequently, the outliers are located using a few randomly sampled rows of the data.

  • We provide a mathematical analysis of the sparse approximation idea underlying our approach, where we prove that the ℓ_1-norm minimization can yield the linear representation of a sparsely corrupted inlier.

Fig. 1: The subspace recovery error of (2) versus the number of outliers.
Fig. 2: The subspace recovery error of the robust PCA algorithm presented in [14] versus ρ.

I-H Paper organization

The rest of the paper is organized as follows. In section II, the idea of sparse approximation is explained. Section III presents the proposed robust PCA method and Section IV exhibits the numerical experiments. The proofs of all the theoretical results are provided in the appendix along with additional theoretical investigations of the sparse approximation problem.

II Sparse approximation of sparsely corrupted data

Suppose a vector d lies in the CS of a matrix A, i.e., d = Ac, where the vector c is the linear representation of d with respect to the columns of A. In a least-squares sense, this representation can be obtained as the minimizer of ‖d − Ac‖_2 over c. The main question that this section seeks to address is whether we can recover such a representation using a convex optimization formulation when both d and A are sparsely corrupted. In this section, we propose an ℓ_1-norm minimization problem and prove that it can yield the underlying representation. We refer to this approach as “sparse approximation”.
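The toy Python comparison below motivates why an ℓ_1 fit is used instead of least squares: with sparse, large-magnitude corruption on the observed vector, least squares breaks while the ℓ_1 residual fit stays close to the true representation. Only the observed vector is corrupted here for simplicity, and the program is an illustrative stand-in rather than the exact formulation analyzed in this section.

# Least squares vs. l1 residual fit under sparse corruption (toy example).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 30))
c_true = rng.standard_normal(30)
d = A @ c_true
d_corrupt = d.copy()
idx = rng.choice(200, size=10, replace=False)
d_corrupt[idx] += 10 * rng.standard_normal(10)          # sparse, large corruption

c_ls = np.linalg.lstsq(A, d_corrupt, rcond=None)[0]     # least-squares fit
c_l1 = cp.Variable(30)
cp.Problem(cp.Minimize(cp.sum(cp.abs(d_corrupt - A @ c_l1)))).solve()

print(np.linalg.norm(c_ls - c_true))        # noticeably off
print(np.linalg.norm(c_l1.value - c_true))  # close to zero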

II-A Theoretical result

The following definition presents the notion of sparse approximation.

Definition 1.

Suppose can be expressed as , where is a sparse matrix, , and ( is the rank of ). We say that , where , is a sparse approximation of if . Thus, if is a sparse approximation of , then .

The reason that we refer to as the sparse approximation of is that is a sparse vector if and are sufficiently small.

Assume that the CS of A does not include sparse vectors, i.e., the CS of A is not coherent with the standard basis. According to Definition 1, if the rank and the sparsity level are small enough and d lies in the CS of A, then a sparse approximation of d is obtained from the optimal point of

(5)

because the span of A does not contain sparse vectors, so the only way to obtain a sparse linear combination is to cancel out the low rank component. Moreover, we can show that if the ℓ_0-norm in (5) is relaxed to an ℓ_1-norm, we are still able to obtain the sparse approximation. The following lemma establishes that the sparse approximation can be recovered by solving a convex ℓ_1-norm minimization problem if the sparse component is sufficiently sparse. In order to obtain concise sufficient conditions, Lemma 1 assumes a randomized model for the distribution of the rows of the low rank component in its row space. In the appendix, we present deterministic sufficient conditions for a more general optimization problem.

Assumption 1.

The rows of the low rank component are i.i.d. random vectors uniformly distributed on the intersection of its row space and the unit sphere.

Before we state the lemma, we define as the optimal point of the following oracle optimization problem

(6)

In addition, define and as the cardinalities of and , respectively, where and are defined as

(7)

Thus, is the number of non-zero rows of and is the number of non-zero rows of which are orthogonal to . If the support of follows the random model, is much smaller than . In addition, if and is small enough, will be much smaller than .

Lemma 1.

Suppose the matrix D is a full rank matrix which can be expressed as D = L + S, where L follows Assumption 1. Define the quantities appearing in (8), and define orthonormal bases for the row space and the null space of L, respectively. If

(8)

then (the optimal point of (6)) is the optimal point of

(9)

with probability at least , for all .

Remark 2.

Suppose the support of follows the Bernoulli model with parameter . If and are small enough, . Thus, the order of is roughly . Define such that . The vector cannot be simultaneously orthogonal to too many non-zero rows of . Thus, is much smaller than . Therefore, the order of the RHS of (8) is roughly . If we assume that the row space of is a random -dimensional subspace in the -dimensional space, then

Accordingly, the sufficient conditions in Lemma 1 amount to a requirement that is sufficiently larger than . In our problem, and . Thus, the sufficient conditions of Lemma 1 are naturally satisfied.

Input: Data matrix D.

1. Outlying Columns Detection
1.1 For each data column d_i, obtain the optimal point of

(10)

If the corresponding residual vector is not sufficiently sparse, it is concluded that the i-th column of D is an outlying column. Form the index set of the detected outlying columns.
1.2 Form the matrix equal to D with the detected outlying columns removed.

2. Matrix Decomposition
Obtain the low rank and sparse components of this matrix as the optimal point of

(11)

Output: The set of identified outlying columns, and the low rank and sparse components of the non-outlying columns of D.

Algorithm 1 Sparse Approximation Approach
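A hedged Python sketch of step 1 of Algorithm 1 is given below. The exact form of (10) is not reproduced here; the sketch uses one plausible reading consistent with the text, namely an ℓ_1 penalty on both the representation vector and the residual, and the weight lam, the tolerance tol and the sparsity threshold are assumptions for illustration only.

# Outlying column detection in the spirit of step 1 of Algorithm 1.
import numpy as np
import cvxpy as cp

def detect_outliers(D, lam=0.5, frac_threshold=0.4, tol=0.1):
    n1, n2 = D.shape
    outliers = []
    for i in range(n2):
        D_minus_i = np.delete(D, i, axis=1)          # D with the i-th column removed
        z = cp.Variable(n2 - 1)
        residual = D_minus_i @ z - D[:, i]
        cp.Problem(cp.Minimize(lam * cp.sum(cp.abs(z))
                               + cp.sum(cp.abs(residual)))).solve()
        r = np.abs(residual.value)
        # Declare an outlier when the residual is not sufficiently sparse.
        if np.mean(r > tol * r.max()) > frac_threshold:
            outliers.append(i)
    return outliers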

III Proposed sparse approximation method

In this section, the sparse approximation (SA) method is presented. We also present a randomized design which significantly reduces the computational complexity of the proposed method. Algorithm 1 summarizes the proposed approach. The ℓ_1-norm functions are utilized to enforce sparsity in both the representation vector and the residual vector [27, 28, 29, 30, 31]. The main idea is to first locate the non-zero columns of C to reduce the problem to that of a low rank plus sparse matrix decomposition. In order to identify the outliers, we attempt to find a sparse approximation for each data column d_i using a sparse linear combination of the columns of D_{-i}. If for certain columns such an approximation cannot be found, we identify these columns as outliers.

The key idea underlying this approach is that the sparsity of the vector obtained in (10) can be used to certify whether d_i is an outlier or a sparsely corrupted inlier. Before providing some insight, we consider an illustrative example in which D follows Data Model 1 and the first 50 columns of C are non-zero. Fig. 3 shows the element values of the obtained vectors for an outlying column and for a non-outlying column. The outlying column is clearly distinguishable.

Insight for the SA method: To gain more insight, consider the scenario where the i-th column of C is zero, so that the i-th data column is a sparsely corrupted inlier. In this case, if the regularization parameter is chosen appropriately, (10) can identify a sparse vector (whose sparsity is promoted by the ℓ_1-norm regularizer) such that the resulting residual is also sparse. The ℓ_1-norm functions force (10) to place the non-zero values of the representation vector on the columns of D such that a linear combination of their low rank components cancels out the low rank component of d_i (i.e., they provide a sparse approximation for d_i) and the linear combination of their sparse components yields a sparse vector. In other words, the residual is a sparse vector since it is a linear combination of few sparse vectors, and the algorithm automatically places the non-zero values of the representation vector on the columns such that the residual is roughly as sparse as possible.

On the other hand, if the i-th column of C is non-zero, i.e., d_i is an outlying column, the residual is not likely to be sparse for a sparse representation vector, since small subsets of outlying columns are linearly independent and an outlying column is unlikely to admit a sparse representation in the sparsely corrupted columns of D; that is, linear combinations of few sparsely corrupted columns of D are unlikely to approximate an outlying column.

Fig. 3: The element values of the obtained vectors for two data columns; one is an outlying column and the other is not an outlying column.

III-A Randomized implementation of the proposed method

Randomized techniques have been utilized to reduce the sample and computational complexities of robust PCA algorithms [32, 33, 34, 35, 36, 37, 38, 39]. Algorithm 1 solves a high-dimensional optimization problem to identify all the outlying columns. However, here we show that this problem can be simplified to a low-dimensional subspace learning problem. Let $\mathbf{L} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$ be the compact singular value decomposition (SVD) of L, where U, Σ and V contain the left singular vectors, the singular values and the right singular vectors of L, respectively. We can rewrite (3) as

$\mathbf{D} = \mathbf{U} \mathbf{Q} + \mathbf{S} + \mathbf{C}, \qquad (12)$

where $\mathbf{Q} = \boldsymbol{\Sigma} \mathbf{V}^T$. We call Q the representation matrix. Algorithm 2 details the randomized implementation of the proposed SA method. In the randomized implementation, first the CS of L is obtained using a random subset of the data columns. The sampled matrix consists of randomly sampled columns of D, and the proposed outlier detection approach is applied to it to identify its outlying columns. The sampled matrix can be expressed as

(13)

where the three components are the corresponding columns sampled from L, S and C, respectively. If the number of sampled columns is sufficiently large, the sampled low rank component has the same CS as L. Thus, if we remove the outlying columns of the sampled matrix and decompose the resulting matrix, the obtained low rank component yields the CS of L.

Suppose the CS of L is learned correctly and assume that the i-th column of C is equal to zero. Thus, the i-th column of D can be represented as

(14)

It was shown in [33] that this representation can be obtained as the optimal point of

(15)

if the number of randomly sampled rows is sufficiently large and some mild sufficient conditions are satisfied. The resulting residual consists of randomly sampled elements of the sparse vector s_i (the i-th column of S). Accordingly, if the i-th column of C is equal to zero, the residual is a sparse vector. However, if the i-th column of C is not equal to zero and the number of sampled rows is sufficiently large, it is highly unlikely that the residual is a sparse vector, since the learned CS cannot cancel out the component of d_i that does not lie in the CS of L.

Remark 3.

In the CS learning step, we identify the outlying columns via the sparsity of the obtained vectors. According to our investigations, if the regularization parameter is chosen appropriately, the number of non-zero elements of the obtained representation vector is small, mostly smaller than 3. Thus, the corresponding residual has few non-zero elements, since it is a linear combination of few sparse vectors; in practice, it is much sparser than the worst case since the optimization searches for the sparsest linear combination. In step 3 of the algorithm, the outlying columns are located by examining the sparsity of the columns of the resulting residual matrix. If the i-th column of C is equal to zero, the CS is correctly recovered, and the number of sampled rows is sufficiently large, then the i-th column of the residual matrix is a sparse vector. Accordingly, if we set an appropriate threshold on the number of dominant non-zero elements, the outlying columns are correctly identified.

Input: Data matrix D.
1. Initialization: Form the column sampling matrix and the row sampling matrix.
2. CS Learning
2.1 Column sampling: Sample a random subset of the columns of the given data matrix.
2.2 Sampled outlying columns detection: For each sampled column, obtain the optimal point of

(16)

If the resulting vector is not a sparse vector, the corresponding sampled column is identified as an outlying column.
2.3 Obtain the low rank and sparse components of the sampled data as the optimal point of

(17)

where the sampled matrix with its detected outlying columns removed is used.
2.4 CS recovery: Form an orthonormal basis for the CS of the obtained low rank component.
3. Learning the Representation Matrix and Locating the Outlying Columns
3.1 Row sampling: Sample a random subset of the rows of the given data matrix.
3.2 Learning the representation matrix: Obtain the representation matrix as the optimal point of

(18)

3.3 Outlying column detection: Form the index set of the non-sparse columns of the resulting residual matrix.
4. Obtaining the Low Rank and Sparse Components
Form the low rank component from the learned basis and representation matrix, with the columns indexed by the detected set equal to zero. Form the sparse component as the corresponding residual, with its columns indexed by the detected set equal to zero.
Output: The obtained low rank and sparse components and the set of indices of the identified outlying columns.

Algorithm 2 Randomized Implementation of the Sparse Approximation Method
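Below is a rough Python scaffold of the sampling pipeline in Algorithm 2, reusing the pcp and detect_outliers sketches given earlier. The sample sizes m1 and m2, the use of a column-wise ℓ_1 regression as a stand-in for (18), and the residual-sparsity test are illustrative assumptions rather than the paper's exact formulation.

# Randomized sketch: learn the CS from sampled columns, then flag outliers
# from sampled rows.
import numpy as np
import cvxpy as cp

def randomized_sa(D, m1, m2, r, frac_threshold=0.4, tol=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n1, n2 = D.shape
    # Step 2: sample m2 columns, drop detected outliers, learn the CS of L.
    cols = rng.choice(n2, size=m2, replace=False)
    Ds1 = D[:, cols]
    keep = [j for j in range(m2) if j not in set(detect_outliers(Ds1))]
    L_hat, _ = pcp(Ds1[:, keep])
    U = np.linalg.svd(L_hat, full_matrices=False)[0][:, :r]   # CS basis
    # Step 3: sample m1 rows and fit every data column in the sampled basis.
    rows = rng.choice(n1, size=m1, replace=False)
    Us, Ds2 = U[rows, :], D[rows, :]
    H = cp.Variable((r, n2))
    cp.Problem(cp.Minimize(cp.sum(cp.abs(Ds2 - Us @ H)))).solve()
    R = np.abs(Ds2 - Us @ H.value)
    flagged = [i for i in range(n2)
               if np.mean(R[:, i] > tol * R[:, i].max()) > frac_threshold]
    return U, flagged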

IV Numerical Simulations

In this section, we present a set of numerical experiments to study the performance of the proposed approach. First, we validate the idea of sparse approximation for outlier detection and study its requirements. Second, we provide a set of phase transition plots to demonstrate the requirements of the randomized implementation, i.e., the sufficient number of randomly sampled columns/rows. Finally, we study the sparse approximation approach for outlier detection with real world data.

IV-A The idea of outlier detection

Suppose the given data follows Data Model 1, the first 20 columns of C are non-zero, the support of S follows the Bernoulli model, and the rank of L is equal to 5. For each data column, we solve (10) and compute the corresponding residual vector. We then define a vector whose i-th entry is equal to the number of elements of the i-th residual with magnitude above a fixed threshold; thus, the i-th element of this vector is the number of dominant non-zero elements of the i-th residual. Fig. 4 shows the elements of this vector. As shown, the indices corresponding to the outlying columns are clearly distinguishable.

IV-B Phase transition

In the presented theoretical analysis, we have shown that if the rank of the low rank component is sufficiently small and the sparse component is sufficiently sparse, the ℓ_1-norm optimization problem can yield the sparse approximation (Lemma 1 and Theorem 2). In this section, we assume the data follows Data Model 1 and study the phase transition of Algorithm 1 (which uses sparse approximation for outlier detection) in the 2D plane of the rank r and the sparsity parameter ρ. The first 200 columns of the given data are outlying columns. For each data column, we define a vector of dominant non-zero elements as before, and we classify the i-th column as an outlier if more than 40 percent of the elements of the corresponding residual are greater than 0.1. Fig. 5 shows the phase transition in the plane of r and ρ. For each pair (r, ρ), we generate 10 random realizations. In this figure, white designates that all outliers are detected correctly and no inlier is misclassified as an outlier. One can observe that when the rank is small, we can correctly identify all the outliers even for relatively large values of ρ.

In practice, the proposed method can handle larger values of the rank and higher sparsity levels (i.e., more non-zero elements) because the columns of the low rank matrix typically exhibit additional structures such as clustering structures [28, 40, 41]. In the simulation corresponding to Fig. 5, the low rank matrix was generated as the product of two matrices whose elements are drawn from a zero mean normal distribution. Thus, the columns of L are distributed randomly in the CS of L. Accordingly, the representation vector must have at least r non-zero elements to yield the sparse approximation. However, if the columns of L lie in a union of, say, 1-dimensional subspaces, a single non-zero element can be sufficient. Thus, if the data exhibits a clustering structure, the algorithm can tolerate higher rank and sparsity levels. As an example, assume D follows Data Model 1 and the last 40 columns are outlying columns. Fig. 6 shows the sorted elements of the obtained vectors. In the left plot, the columns of L lie in one 10-dimensional subspace, whereas in the right plot, the columns of L lie in a union of ten 1-dimensional subspaces. As shown, the proposed method yields a better output if the data admits a clustering structure.
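The following short Python snippet generates the two types of low rank matrices compared in Fig. 6; the sizes are illustrative choices, not the ones used in the paper.

# Columns spread over one 10-dimensional subspace vs. a union of ten
# 1-dimensional subspaces (a clustering structure).
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r = 100, 200, 10
# One r-dimensional subspace: columns are random combinations of r directions.
L_single = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
# Union of r one-dimensional subspaces: each column is a scaled basis direction.
U = rng.standard_normal((n1, r))
labels = rng.integers(0, r, size=n2)
L_union = U[:, labels] * rng.standard_normal(n2)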

Fig. 4: The number of dominant non-zero elements of the residual for each data column.
Fig. 5: Phase transition of the outlier detector in the plane of r and ρ.
Fig. 6: The sorted entries of the obtained vectors for different numbers of clusters of the columns of L. In the right plot, the columns of L lie in a union of ten 1-dimensional subspaces. In the left plot, the columns of L lie in one 10-dimensional subspace.
Fig. 7: The phase transition plots of Algorithm 2 versus the number of randomly sampled columns and rows for different values of the rank and the number of outliers.
Fig. 8: The phase transition plots of Algorithm 2 versus the number of randomly sampled columns and rows for different sizes of the given data matrix.
Fig. 9: A few samples of the face images with different illuminations.
Fig. 10: Random examples of the images in the Caltech101 database.

IV-C Phase transition for the randomized implementation

In this section, the requirements of Algorithm 2 are studied. The data matrix follows Data Model 1. The phase transition shows the performance of Algorithm 2 as a function of the number of randomly sampled columns and rows. A trial is considered successful if the rank of the obtained low rank component is equal to the rank of L, step 3.3 identifies the outlying columns correctly, and

(19)

Fig. 7 shows the phase transition plots for different values of the rank and the number of outliers. One can see that the required numbers of sampled columns and rows increase if the rank or the number of outliers increases; they grow roughly linearly with the rank [33, 39]. Fig. 8 shows the phase transition for different dimensions of D. Interestingly, the required numbers of sampled columns and rows are nearly independent of the size of D.

IV-D The proposed approach with real data

The Extended Yale Face Database [42] consists of face images of 38 human subjects under different illuminations. We select 50 images of one human subject. Fig. 9 shows a few samples of the face images. It has been observed that these images roughly follow the low rank plus sparse model, and low rank plus sparse matrix decomposition algorithms have been successfully applied to such images to remove shadows and specularities [7]. In addition, we add an element-wise sparse matrix to the face images. We form a data matrix with 32256 rows (32256 is the number of pixels per image) consisting of these 50 sparsely corrupted face images plus 50 randomly sampled images from the Caltech101 database [43] as outlying data points (50 images from random subjects). Fig. 10 shows a subset of the images sampled from the Caltech101 database. The first 50 columns are the face images and the last 50 columns are the random images. We have found that the average number of elements with absolute value greater than 0.1 in the vectors obtained for the face images is substantially smaller than the corresponding average for the non-face images. Thus, the non-face images can be identified with a proper threshold on the number of dominant non-zero elements.

V Appendix

In this section, we study a more general theoretical problem, dubbed the null space learning problem, of which the sparse approximation problem is a special case. Similar to the model used in Section II, assume that the matrix D is a full rank matrix that can be expressed as D = L + S, where S is a sparse matrix and the columns of L are linearly dependent. If the CS of L is not coherent with the standard basis, we expect the optimal point of

(20)

to lie in the null space of L because the CS of L does not contain sparse vectors and the optimal point should cancel out the component corresponding to L. Accordingly, the optimal point of (20) is equal to the optimal point of

(21)

Therefore, the ℓ_0-norm minimization problem can learn a direction in the null space of L. In fact, the optimization problem (20) finds the most sparse vector in the CS of D. (Interestingly, finding the most sparse vector in a linear subspace has bearing on, and has been effectively used in, other machine learning problems, including dictionary learning and spectral estimation [44, 45, 46].)

The optimization problem (20) is non-convex. We relax the cost function using an ℓ_1-norm and replace the quadratic constraint with a linear constraint as

(22)

where the constraint vector is fixed. We refer to (22) as the convex null space learning optimization problem. With an appropriate choice of the constraint vector, the null space learning optimization problem is equivalent to the sparse approximation optimization problem (9); hence, the latter is a special case of the former. Although (9) and (22) are stated with different optimization variables, the equivalence stems from the fact that the optimal point of (22) can be constructed directly from the optimal point of (9) by placing its elements in the appropriate positions.
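For concreteness, one plausible reading of the pair (20) and (22), consistent with the description above, is sketched in LaTeX below; the symbols D (the data matrix), ĉ (the optimization variable) and f (the fixed constraint vector) are our own notation, and the exact form is an assumption rather than a quotation of the paper's equations.

% One plausible reading of the non-convex problem (20) and its convex
% relaxation (22); the notation (D, \hat{c}, f) is ours.
\begin{align}
&\min_{\hat{c}} \ \| \mathbf{D} \hat{c} \|_0
  \quad \text{subject to} \quad \| \hat{c} \|_2 = 1, \tag{20'} \\
&\min_{\hat{c}} \ \| \mathbf{D} \hat{c} \|_1
  \quad \text{subject to} \quad \hat{c}^T \mathbf{f} = 1. \tag{22'}
\end{align}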

The following theorem establishes sufficient conditions for the optimal point of the ℓ_1-norm minimization problem (22) to lie in the null space of L. Before we state the theorem, we define the optimal point of the following oracle optimization problem

(23)

The sets analogous to those in (7) are defined similarly.

Theorem 2.

Suppose the matrix D is a full rank matrix that can be expressed as D = L + S. Define the optimal point of (23), and define

(24)

If

(25)

where the vectors appearing in (25) are the i-th rows of L and S, respectively, and the subspace is the row space of L, then the optimal point of (23) is also the optimal point of (22).

The sufficient conditions (25) reveal some interesting properties which merit the intuitive explanation provided next. According to Theorem 2, the following are important factors to ensure that the optimal point of (22) lies in the null space of L:
1. The CS of L should not be coherent with the standard basis: If the support of S follows the Bernoulli model and the rank and the sparsity parameter are sufficiently small, the LHS of (25) will approximate the permeance statistic [25] – a measure of how well the rows of L are distributed in the row space of L. The permeance statistic increases if the rows are more uniformly distributed in the row space. But, if they are aligned along some specific directions, the permeance statistic tends to be smaller, wherefore the CS of L will be more coherent with the standard basis. This is in agreement with our initial intuition since linear combinations of the columns of L are more likely to form sparse vectors when L is highly coherent. In other words, the coherence of L would imply that a linear combination of the columns of D could be sparse even if the combining vector does not lie in the null space of L.

2. The matrix S should be sufficiently sparse: Per the first inequality of (25), S should be sufficiently sparse; otherwise the cardinality of the corresponding set will not be sufficiently large. This requirement also confirms our initial intuition, because the optimal point of (22) cannot yield a sparse linear combination, even if it lies in the null space of L, unless S is sparse.

3. The constraint vector should not be too incoherent with the null space of L: Recalling that (25) involves orthonormal bases for the null space and the row space of L, the factor on the LHS of the second inequality of (25) unveils that the constraint vector should be sufficiently coherent with the null space of L in order to ensure that the optimal point of (22) lies in it. The intuition is that if the constraint vector has a very small projection on the null space of L while the optimal point of (22) lies in that null space, then the optimal point of (22) must have a large Euclidean norm to satisfy the linear constraint of (22). In that sense, points lying in the null space of L would be unlikely to attain the minimum of the objective function in (22). In addition, this coherency requirement implies that the optimal point of (22) is more likely to lie in the null space of L when the rank of L is smaller, because the null space would then have a higher dimension, which makes it more likely to be coherent with the constraint vector.

V-A Proof of Lemma 1

Lemma 1 is a special case of Theorem 2. In order to prove Lemma 1, we make use of the following lemmas from [47, 25] to lower-bound and upper-bound the terms in (25).

Lemma 3.

(Lower-bound on the permeance statistic from [25]) Suppose that are i.i.d. random vectors uniformly distributed on the unit sphere in . When ,

(26)

When , for all ,

(27)

with probability at least .

Lemma 4.

If are i.i.d. random vectors uniformly distributed on the unit sphere in , then

(28)

for all .

V-B Proof of Theorem 2

We want to show that

(29)

Define as

(30)

Since (22) is a convex optimization problem, it suffices to check that the objective does not decrease for every sufficiently small non-zero perturbation such that

(31)

The conditions on the perturbation ensure that the perturbed point is a feasible point of (22). Since the reference point is the optimal point of (23), the cost function of (23) increases when we move from the optimal point along a feasible perturbation direction. Observe that the perturbed point is a feasible point of (23) if and only if the perturbation satisfies

(32)

where the subspace is the corresponding null space. Therefore, for any non-zero perturbation which satisfies (32),

(33)

When , we can rewrite (33) as