Sparse Representation Classification Beyond L1 Minimization and the Subspace Assumption

Cencheng Shen et al., Johns Hopkins University — 02/04/2015

The sparse representation classifier (SRC) proposed in Wright et al. (2009) has recently gained much attention from the machine learning community. It makes use of L1 minimization, and is known to work well for data satisfying a subspace assumption. In this paper, we use the notion of class dominance as well as a principal angle condition to investigate and validate the classification performance of SRC, without relying on L1 minimization and the subspace assumption. We prove that SRC can still work well using faster subset regression methods such as orthogonal matching pursuit and marginal regression, and its applicability is not limited to data satisfying the subspace assumption. We illustrate our theorems via various real data sets including face images, text features, and network data.


1 Introduction

Recently there has been a surge in utilizing sparse representation and regularized regression for many machine learning tasks in computer vision and pattern recognition. Applications include BrucksteinDonohoElad2009, WrightYangGaneshMa2009, WrightMa2010, YinEtAl2012, YangZhouGanesh2013, ElhamifarVidal2013, among many others. In this paper, we concentrate on one specific but profound application – sparse representation classification (SRC), which was proposed by WrightYangGaneshMa2009 and exhibits state-of-the-art performance for robust face recognition.

For the classification task, denote by y the testing observation and by X the matrix of training data with all columns pre-scaled to unit norm. Each column of X is denoted as x_i for i = 1, …, n, representing a training observation with a known class label l_i ∈ {1, …, K}. The sparse representation classifier consists of two steps: for each testing observation y, first it finds a sparse representation β̂ such that y ≈ Xβ̂; then the class of the testing observation is determined by g(y) = arg min_k ‖y − Xδ_k(β̂)‖_2, where g is the classifier, and δ_k(β̂) takes the values from β̂ that are associated with data of class k, i.e., δ_k(β̂)_i = β̂_i if l_i = k, and 0 otherwise. Denoting the true but unknown class of y as l_y, SRC correctly finds the true label if g(y) = l_y. This classifier has been numerically shown to work well and be robust against occlusion and contamination on face images, and argued to be better than nearest-neighbor and nearest-subspace rules in WrightYangGaneshMa2009.

Clearly, finding an appropriate sparse representation is the crucial step of SRC, which is intrinsically subset regression: apply a certain method to select a subset X_S of the training data, and then take the corresponding regression vector as the sparse representation β̂. Most works on sparse representation have been using regularized regression methods to achieve the sparsity, for which ℓ1 minimization/Lasso are very popular choices due to their theoretical justifications Tibshirani1996, DonohoHuo2001, DonohoElad2003, EfronHastie2004, CandesTao2005, Donoho2006, CandesTao2006, CandesRombergTao2006, ZhaoYu2006, MeinshausenYu2009, Wainwright2009, etc. The literature on ℓ1 minimization and Lasso is more than abundant, and usually emphasizes how the regularization can help recover the sparsest model. But how the regularization may help the subsequent inference is usually a difficult question to answer in practice, and the role of ℓ1 minimization is not entirely clear for this particular classification task.

The initial motivation for WrightYangGaneshMa2009 to use ℓ1 minimization is its equivalence to ℓ0 minimization (i.e., sparse model recovery) under various conditions, such as the incoherence condition DonohoElad2003 or the restricted isometry property CandesTao2005. Namely, if the testing observation does have a unique and correct most sparse representation (correct in the sense that its nonzero entries are associated with data of the true class) with respect to the training data, then, assuming proper conditions are satisfied, ℓ1 minimization is an ideal choice in the SRC framework. But the sample training data are usually correlated, which violates many theoretical conditions including incoherence and restricted isometry; furthermore, the classification task only requires the recovered sparse representation to be mostly associated with data of the correct class rather than to be one uniquely correct solution, so there usually exist infinitely many correct representations which are not the most sparse solution.

Towards this direction, it is argued in WrightYangGaneshMa2009 that if data of the same class lie in the same subspace while data of different classes lie in different subspaces (called the subspace assumption henceforth), then most data selected by ℓ1 minimization should be from the correct class, thus yielding good classification performance in SRC. Since face image data under varying lighting and expression roughly satisfy the subspace assumption BelhumeurHespandaKriegman1997, BasriJacobs2003, they further argue that SRC is applicable to face images. Indeed, based on the subspace assumption, ElhamifarVidal2013 derives a theoretical condition for ℓ1 minimization to do perfect variable selection, i.e., all selected training data are from the correct class. This indicates that sparse representation is a valuable tool with ℓ1 minimization under the subspace assumption. But the subspace assumption imposes a low-dimensional structure on data of the same class, which does not always hold and is difficult to validate in practice.

Despite many applications of and investigations into sparse representation, the intrinsic properties and mechanisms of SRC are still not well understood, and there exists evidence RigamontiBrownLepetit2011, ZhangYangFeng2011, ShiEriksson2011, ChiPorikli2013 that neither ℓ1 minimization nor the subspace assumption is necessary in the SRC framework. In particular, ZhangYangFeng2011 and ChiPorikli2013 argue that it is actually the classification step (namely the minimal-residual rule g) that is most effective in SRC; they call it the collaborative representation, and support their claims through many numerical examples. Our previous work applies SRC to vertex classification ChenShenVogelsteinPriebe2015, which also achieves good performance for network data without ℓ1 minimization or the subspace assumption.

To deepen our understanding, in this paper we target two important questions related to SRC. First, is the subspace assumption a necessity for SRC to perform well? And if not, when and how is SRC applicable with theoretical performance guarantees? Second, despite the popularity of ℓ1 minimization, is it the optimal approach to do variable selection for SRC? Can we use other, faster subset regression methods such as orthogonal matching pursuit (OMP) DavisMallatAvellaneda1997, Tropp2004 and marginal regression WassermanRoeder2009, GenoveseEtAl2012?

With these two target questions in mind, this paper is organized as follows. In Section 2 we review the SRC framework and three subset regression methods: homotopy, OMP, and marginal regression. Section 3 is the main section. In subsection 3.1 we first relate SRC to a notion we call class dominance on the sample data. Then, based on class dominance, in subsection 3.2 we state a principal angle condition on the data distribution that is sufficient for the classification consistency of SRC. In particular, our theorems largely explain the success of SRC, remain valid when ℓ1 minimization is replaced by OMP or marginal regression, and can help identify data models for SRC to work well without requiring the subspace assumption. Our results make SRC more appealing in terms of theoretical foundation, computational complexity, and general applicability, and are illustrated via various real data sets including face images, text features, and network data in Section 4. We conclude the paper in Section 5, with all proofs relegated to Section 6.

2 Sparse Representation Review

2.1 The SRC Algorithm

We first summarize the SRC algorithm using ℓ1 minimization in Algorithm 1, which consists of the subset regression step and the classification step.

Input: An m × n matrix X = [x_1, …, x_n], where each column x_i represents a training observation with a known label l_i. An m × 1 testing observation y with its true label l_y being unknown. Unless mentioned otherwise, we always assume each column of X and y are pre-scaled to unit norm, and y is not orthogonal to X (otherwise the sparse representation is always the zero vector).
1. Find a sparse representation β̂ of y by ℓ1 minimization:
(1)  β̂ = arg min ‖β‖_1 subject to ‖y − Xβ‖_2 ≤ ε.
2. Classify y by the sparse representation β̂:
(2)  g(y) = arg min_{k = 1, …, K} ‖y − Xδ_k(β̂)‖_2;
break ties deterministically. For each entry i of δ_k(β̂), δ_k(β̂)_i = β̂_i if l_i = k, and 0 otherwise.
Output: The assigned label g(y).
Algorithm 1 Sparse representation classification by ℓ1 minimization
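For concreteness, the following is a minimal NumPy sketch of the classification step (Equation 2), assuming a sparse representation has already been obtained by some subset regression method; the function and variable names (src_classify, X, labels, beta) are illustrative choices, not taken from the paper.

```python
import numpy as np

def src_classify(X, labels, y, beta):
    """Step 2 of SRC: assign y to the class whose portion of the
    sparse representation best reconstructs it (smallest residual).

    X      : (m, n) training matrix, columns scaled to unit norm
    labels : length-n array of class labels for the columns of X
    y      : (m,) testing observation, scaled to unit norm
    beta   : (n,) sparse representation of y with respect to X
    """
    classes = np.unique(labels)
    residuals = []
    for k in classes:
        delta_k = np.where(labels == k, beta, 0.0)   # keep only class-k coefficients
        residuals.append(np.linalg.norm(y - X @ delta_k))
    return classes[int(np.argmin(residuals))]        # ties broken by the first minimum
```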

Solving Equation 1 by ℓ1 minimization is the only computationally costly part of SRC. There are many possible methods to solve ℓ1 minimization, see ChenDonoho2001, DonohoTsaig2008, YangZhouGanesh2013, ElhamifarVidal2013, among which we use the homotopy method for subsequent analysis and numerical experiments. This method is based on a polygonal solution path OsbornePresnellTurlach2000a, OsbornePresnellTurlach2000b and can also be used for Lasso and least angle regression Tibshirani1996, EfronHastie2004.

Alternatively, OMP is a greedy approximation of ℓ0 minimization and is equivalent to forward stepwise regression; it gains its popularity in sparse recovery due to its reduced running time and certain theoretical guarantees Tropp2004, TroppGilbert2007, Zhang2009, CaiWang2011. Furthermore, OMP is quite similar to homotopy in implementation, and there exist many extensions of OMP NeedleVershynin2009, NeedleVershynin2010, DonohoTsaigDroriStarck2012.

As for marginal regression, it is probably the simplest and fastest way to do subset regression, and it has been studied and applied successfully in many areas. Despite its simplicity, it has been shown to work well for variable selection in high-dimensional data compared to Lasso WassermanRoeder2009, GenoveseEtAl2012, KolarLiu2012, is particularly popular in ultra-high-dimensional screening FanLv2008, FanSamworthWu2009, FanFengSong2011, and has been applied to sparse representation as well BalasubramanianYuLebanon2013. We can always use OMP or marginal regression to find the sparse representation in step 1, rather than solving Equation 1 by ℓ1 minimization. In the next subsection we compare homotopy, OMP, and marginal regression in more detail.

Note that the inequality constraint ‖y − Xβ‖_2 ≤ ε in Equation 1 can be replaced by the equality y = Xβ in a noiseless setting, but ε > 0 is usually required in order to achieve a more parsimonious model when dealing with high-dimensional or noisy data. This model selection problem, i.e., the choice of ε or, more generally, the sparsity level of subset regression, is a difficult problem intrinsic to most subset regression methods. We will explain this issue from the algorithmic point of view in the next subsection.

2.2 Homotopy, OMP, and Marginal Regression

As homotopy can be treated as an extension of OMP, and marginal regression is very simple, we only list the OMP algorithm in detail in Algorithm 2.

Input: The training data X, the testing observation y, and a specified iteration limit s and/or a residual limit ε.
Initialization: The residual r_0 = y, iteration count t = 1, and the selected data X_S empty.
1. Find the index i_t = arg max_i |x_i^T r_{t−1}|, where x_i is the ith column of X and ^T is the transpose sign. Break ties deterministically, and add x_{i_t} into the selected data so that X_S = [X_S, x_{i_t}].
2. Update the regression vector β̂ with respect to X_S, i.e., calculate the orthogonal projection matrix P = X_S (X_S^T X_S)^− X_S^T with (·)^− being the pseudo-inverse, and let β̂ = (X_S^T X_S)^− X_S^T y. Then update the regression residual as r_t = y − Py.
3. If t = s, or ‖r_t‖_2 ≤ ε, or |X^T r_t| ≤ ε entry-wise, stop; else increment t and return to step 1.
Output: X_S and β̂. Note that the sparse representation can be enlarged from a t × 1 vector to an n × 1 vector based on the relative positions of the columns of X_S in X.
Algorithm 2 Use orthogonal matching pursuit to solve Step 1 of SRC
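A minimal NumPy sketch of Algorithm 2 follows; it refits by ordinary least squares on the selected columns rather than forming an explicit pseudo-inverse, and the names (omp_select, max_sparsity, tol) are our own illustrative choices.

```python
import numpy as np

def omp_select(X, y, max_sparsity, tol=1e-10):
    """Greedy OMP: repeatedly pick the column most correlated with the
    current residual, then refit by least squares on the selected columns."""
    m, n = X.shape
    residual = y.copy()
    selected = []                       # indices of the chosen columns of X
    beta_s = np.zeros(0)
    for _ in range(max_sparsity):
        corrs = np.abs(X.T @ residual)
        corrs[selected] = -np.inf       # never reselect a column
        i = int(np.argmax(corrs))
        if corrs[i] <= tol:             # residual nearly orthogonal to all of X
            break
        selected.append(i)
        X_s = X[:, selected]
        beta_s, *_ = np.linalg.lstsq(X_s, y, rcond=None)
        residual = y - X_s @ beta_s
        if np.linalg.norm(residual) <= tol:
            break
    beta = np.zeros(n)                  # expand to an n-vector for classification
    beta[selected] = beta_s
    return selected, beta
```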

The idea of OMP is the same as forward selection: at each iteration OMP finds the column that is most correlated with the current residual, and then re-calculates the regression vector by projecting y onto the selected sub-matrix X_S. When the iteration limit is reached, or the residual is small enough, or the residual is almost orthogonal to the training data, OMP stops.

The homotopy method is the same as OMP in terms of the data selection, but it has an extra data deletion step and a different updating scheme. Conceptually, the homotopy path seeks the solution iteratively by reducing a regularization parameter λ from a positive number to 0, which is proved to solve the ℓ1 minimization problem and can also be used for the Lasso regression. More details can be found in EfronHastie2004, DonohoTsaig2008. Our experiments use the homotopy algorithm implemented by S. Asif and J. Romberg (http://users.ece.gatech.edu/~sasif/homotopy/).
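For readers who prefer an off-the-shelf path solver, the sketch below uses scikit-learn's lars_path with method="lasso" as a stand-in for the homotopy implementation cited above; this is an illustrative substitution, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import lars_path

def homotopy_sparse_representation(X, y, max_iter=50):
    """Trace the Lasso/LARS path as the regularization decreases toward zero
    and return the coefficients at the last breakpoint reached."""
    alphas, active, coefs = lars_path(X, y, method="lasso", max_iter=max_iter)
    return coefs[:, -1]   # each column of coefs is the solution at one path breakpoint
```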

The marginal regression method does not involve any iteration; it simply chooses the s columns of X that are most correlated with the testing observation y, and calculates β̂ to be the regression vector with respect to the selected X_S. Because marginal regression is a non-iterative method, it enjoys a superior running-time complexity compared to the others: the data selection step requires only a single pass of correlations between y and the columns of X, while OMP needs one such pass per iteration; and for small s marginal regression is much faster than full regression (i.e., the usual least squares regression using the full training data).
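Marginal regression admits an equally short sketch: rank the columns by absolute correlation with y in a single pass, then run one least squares fit on the chosen columns (function and variable names are again illustrative).

```python
import numpy as np

def marginal_select(X, y, s):
    """Pick the s columns most correlated with y, then regress y on them."""
    corrs = np.abs(X.T @ y)                    # one pass over the training data
    selected = np.argsort(corrs)[::-1][:s]     # indices of the s largest correlations
    X_s = X[:, selected]
    beta_s, *_ = np.linalg.lstsq(X_s, y, rcond=None)
    beta = np.zeros(X.shape[1])
    beta[selected] = beta_s                    # expand back to an n-vector
    return selected.tolist(), beta
```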

Clearly the three subset regression methods may yield different X_S and thus different β̂, but they always coincide at sparsity level s = 1, which is an important fact for the later proofs. Another useful observation is that X_S is always of full rank when using homotopy or OMP (otherwise they stop), but this is not necessarily the case when using marginal regression after a certain s. In the main section we will show that under a principal angle condition on the data model, all three methods can have the same asymptotic inferential effect, even though their sparse representations may be different.

Note that the model selection problem is inherent in the stopping criteria, and the stopping criteria used in Algorithm 2 are commonly applied in subset regression. For example, TroppGilbert2007 only specifies the iteration limit to stop OMP, which is suitable when the testing observation is perfectly recoverable; WrightYangGaneshMa2009 stops ℓ1 minimization at a small residual, which is more practical for real data, but a good choice of ε may be data-dependent; the almost-orthogonal criterion (i.e., |X^T r_t| ≤ ε entry-wise) has been used in Zhang2009, CaiWang2011 for OMP to work well for sparse recovery; and other stopping criteria are also possible, such as Mallows's C_p. As model selection does not affect the main theoretical results, we do not delve into this topic; but its finite-sample inference effect for real data is often difficult to quantify, so in the numerical experiments we always plot the SRC error with respect to various sparsity levels while setting ε to be effectively zero, in order to give a fair evaluation of SRC for all possible models up to a certain limit.

3 Main Results

Let us introduce some notation before proceeding: X denotes the training data matrix of size m × n; X_S denotes the selected sub-matrix of size m × s obtained by subset regression; X_{S,k} denotes the sub-matrix of X_S whose columns are associated with class k; X_{S,−k} denotes the sub-matrix of X_S whose columns are not of class k. Furthermore, β̂ represents the regression vector or sparse representation with respect to X_S or X, which may be an s × 1 vector or an n × 1 vector depending on the context, i.e., we use the two interchangeably, where the former is the regression vector and the latter is the sparse representation; they only differ in zero entries. δ_k(β̂) equals β̂ except that every entry not associated with class k is 0, and δ_{−k}(β̂) = β̂ − δ_k(β̂); similar to β̂, their sizes may differ depending on the context by shrinking or expanding the zero entries.

3.1 Class Dominance in the Regression Vector

We first define class dominance and positive class dominance for a given regression vector and given sample data. They are not only important catalysts between the principal angle condition and the theoretical SRC optimality, but also crucial components underlying the empirical success of SRC, as shown in the numerical section.

Definition.

Given and the testing observation and the training data , we say class dominates if and only if .

We say that class positively dominates the regression vector if and only if for all .

Note that we say (positive) class dominance holds if and only if the correct class l_y (positively) dominates the sparse representation of the testing observation.

For any given β̂, class dominance and positive class dominance together are sufficient for correct classification by SRC, formulated as follows.

Theorem 1.

Given β̂, y and X_S, class dominance implies g(y) = l_y for SRC if class l_y also positively dominates β̂.

If positive class dominance does not hold, class dominance by itself is not sufficient for g(y) = l_y.

Although class dominance cannot guarantee correct classification in SRC, it is closely related to positive class dominance and can lead to the latter in many scenarios. The next corollary is an example.

Corollary 0.

Suppose K = 2, or the data are non-negative and the regression vector is constrained to be non-negative.

Then, given β̂ and the sample data, class dominance implies positive class dominance, in which case class dominance alone is sufficient for g(y) = l_y in SRC.

Despite the limitations of Corollary 2, two-class classification problems are common; real data are often non-negative; and the non-negativity constraint is very useful in subset regression, such as in non-negative OMP BrucksteinEladZibulevsky2008 and non-negative least squares SlawskiHein2013, Meinshausen2013. In fact, the condition in Corollary 2 can be further relaxed. For example, if the dominance magnitude is large enough and the negative entries of β̂ are properly bounded, then class dominance still implies positive class dominance and is sufficient for g(y) = l_y in SRC.

This indicates that class dominance is usually sufficient for correct classification, unless the negative entries of β̂ are too large. Indeed, in the numerical section we observe that class dominance nearly always implies g(y) = l_y, even though the non-negative constraint is not used in the subset regression; and the class dominance error is usually close to the classification error. In the next subsection, we make use of class dominance to identify a principal angle condition on the data model, so that SRC can be a consistent classifier without requiring ℓ1 minimization and the subspace assumption.

Note that the concept of class dominance appears similar to block sparsity and block coherence EldarMishali2009 , EldarKuppingerBolcskei2010 , ElhamifarVidal2012 . But block sparsity and block coherence are used to guarantee that the fewest number of blocks/classes of data are used in the sparse representation, which is not directly related to correct classification; while our class dominance is defined for the correct class of data to dominate the sparse representation, which can lead to correct classification.

3.2 Classification Consistency of SRC

In this subsection we formalize the probabilistic setting of classification based on DevroyeGyorfiLugosiBook. Suppose (Y, L) ~ F_{YL}, where (Y, L) is the random variable pair generating the testing observation y and its class l_y, and (Y_i, L_i) ~ F_{YL} are the random variables generating the training pair (x_i, l_i) for i = 1, …, n. Note that the prior probability of the data being in each class k = 1, …, K should be nonzero.

The SRC error is denoted as L_n for the SRC classifier g trained on n observations. We always have L_n ≥ L*, where L* is the optimal Bayes error. For SRC to achieve consistent classification, it is equivalent to identify a sufficient condition on F_{YL} so that L_n → L* as n → ∞. We henceforth consider the case L* = 0, so that L_n → 0 implies SRC is asymptotically optimal.

Based on this probabilistic setting and the previous subsection on class dominance, the SRC error can be decomposed by conditioning on class dominance:

(3)  L_n = p_D · e_D + (1 − p_D) · e_{D^c},

where p_D denotes the class dominance probability, e_D denotes the conditional classification error given class dominance, and e_{D^c} denotes the conditional classification error when class dominance fails. Clearly the class dominance probability depends on both F_{YL} and the subset regression method; moreover, Corollary 2 indicates that e_D = 0 when the data are non-negative and β̂ is derived under the non-negative constraint, which approximately holds throughout our numerical experiments even without the non-negative constraint.

So for SRC to perform well, it suffices to find a condition on F_{YL} so that p_D is close to 1; then the SRC error can be close to e_D. And for SRC to be optimal beyond ℓ1 minimization and the subspace assumption, the condition should be as simple and as general as possible, not requiring the subspace assumption, yet still achieving class dominance almost surely for most subset regression methods.

First we state an auxiliary condition to ensure class dominance for a given X_S of full rank, which serves as a starting point for the later results.

Theorem 3.

Given β̂, y and any selected data matrix X_S of full rank, class dominance holds if and only if

(4)

where θ(·, ·) denotes the principal angle between the testing observation and the column span of the corresponding sub-matrix.

Therefore, when Equation 4 holds for the selected sub-matrix X_S, the correct class dominates the sparse representation. We can convert this condition into the probabilistic setting as follows.

Theorem 4.

Under the probabilistic setting, we define the principal angle condition as follows: for fixed s, there exists a constant c such that the within-class principal angle is small enough almost surely and the between-class principal angle is large enough almost surely.

Denote by p the probability that the principal angle condition holds for the testing pair. Then the class dominance probability is asymptotically no less than p, for β̂ derived by ℓ1 minimization at any given sparsity level s.

Namely, class dominance holds if the within-class data are close while the between-class data are far away in terms of the principal angle. By Equation 3 and Corollary 2, it is clear that the principal angle condition can lead to SRC optimality, which we state as a corollary.

Corollary 0.

Suppose both the principal angle condition in Theorem 4 and the condition in Corollary 2 hold with probability p for the testing pair. Then the SRC error using ℓ1 minimization satisfies L_n ≤ 1 − p asymptotically.

Thus if p = 1 (i.e., all possible realizations in the support of F_{YL} satisfy the principal angle condition), SRC is asymptotically optimal with L_n → 0.

Thus this condition does not explicitly rely on the subspace assumption, yet it still leads to optimal classification and can be used to validate SRC applicability on general data models. At s = 1, the principal angle condition can be easily validated by the nearest neighbor based on principal angle/correlation. But for large s, the between-class principal angle is more difficult to check: if the subspace assumption holds, the principal angle between one observation and observations of other classes is usually bounded below, so the condition holds as long as the within-class angle is small; if the subspace assumption does not hold, the principal angle condition at large s may not hold even for well-separated data.
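As an illustration of such a check, the cosine of the principal angle between a unit-norm observation and the span of a sub-matrix can be computed from an orthonormal basis of that span; the sketch below (with illustrative names) is one standard way to do so.

```python
import numpy as np

def cos_principal_angle(y, X_sub):
    """Cosine of the principal angle between the vector y and span(X_sub):
    the largest correlation between y and any unit vector in that span."""
    Q, _ = np.linalg.qr(X_sub)                 # orthonormal basis of the column span
    return np.linalg.norm(Q.T @ y) / np.linalg.norm(y)

# A cosine near 1 (small angle) to the within-class data and a cosine near 0
# (large angle) to the between-class data is the behaviour the condition asks for.
```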

Therefore it is sometimes useful to prove the principal angle condition together with the non-negative constraint in Corollary 2. Because the cosine of the between-class principal angle equals the correlation between y and some linear combination of the between-class data, we can instead require the correlation between y and any non-negative linear combination of the between-class data to be small, rather than requiring the between-class principal angle to be large; then Corollary 5 still holds. One such application is illustrated in ChenShenVogelsteinPriebe2015 for the adjacency matrix.

The proof of Theorem 4 can be adapted to any of ℓ1 minimization, OMP, and marginal regression, which yields the next corollary.

Corollary 0.

When ℓ1 minimization is replaced by OMP in the SRC framework, Theorem 4 and Corollary 5 still hold.

Furthermore, if we constrain the sparsity level s such that the X_S selected by marginal regression is of full rank (which is always possible up to a certain s), or the original data matrix X itself is of full rank, then Theorem 4 and Corollary 5 also hold for SRC using marginal regression or full regression.

Therefore, not only can OMP and marginal regression be used in SRC, but so can full regression. However, for real data it is quite common that the full training data matrix X is either rank deficient or very close to rank deficient (i.e., has singular values very close to 0).

So far our principal angle condition in Theorem 4 is quite restrictive, especially the requirement that the within-class principal angle be small almost surely, as it requires data of the correct class to always be close to the testing observation. This can be relaxed as long as some data of the correct class are close enough to the testing observation, at the cost of treating far-away data of the correct class as data of another class.

Corollary 0.

Under the probabilistic setting, suppose we extend the principal angle condition as follows: for fixed s, there exists a constant c such that

(5)

where 1{·} is the indicator function, and the between-class part of the principal angle condition holds almost surely as before.

Then Theorem 4, Corollary 5 and Corollary 6 still hold. Note that the previous principal angle condition in Theorem 4 is now a special case.

Overall, our results in this subsection can be interpreted as demonstrating that, for any given model, if the within-class principal angle can be small while the between-class principal angle is always large, then the correct class is likely to dominate the sparse representation, and SRC will succeed in the classification task. The principal angle condition here is similar to the condition in ElhamifarVidal2013: their condition is applied to given sample data while we focus more on the distribution; and their condition explicitly requires the subspace assumption and ℓ1 minimization while we do not.

Furthermore, we have addressed the two questions regarding SRC in the introduction: our principal angle condition can be used to check whether SRC is applicable to a given data model without the subspace assumption, for which class dominance plays a crucial role for correct classification; the theorems also indicate that SRC should perform similarly for any of the aforementioned three subset regression methods. They are all reflected in the numerical section.

4 Numerical Experiments

In this section we apply the sparse representation classifier to various simulated and real data sets using homotopy, OMP, and marginal regression, and illustrate how our theoretical derivation of SRC is closely related to its numerical performance.

All experiments are carried out by hold-out validation, and for each data set we always randomly split the data in half for training and testing. We then estimate the SRC error, the class dominance error, the SRC error given class dominance, and the SRC error when class dominance fails, i.e., the estimates of L_n, 1 − p_D, e_D and e_{D^c} in Equation 3. To give a fair evaluation and account for possible early termination by various model selection criteria, the errors are always calculated against the sparsity level s up to a limit, i.e., we re-calculate the regression vector and re-classify the testing observation for each s.
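A hedged sketch of this evaluation loop, reusing the illustrative helpers from Section 2 (omp_select, src_classify) and re-running subset regression from scratch at each sparsity level for clarity, is given below; an actual implementation would instead grow the OMP or homotopy path once per testing observation and classify at every prefix.

```python
import numpy as np

# Assumes the omp_select and src_classify sketches defined earlier in this document.

def src_error_vs_sparsity(X_train, train_labels, X_test, test_labels, max_sparsity):
    """Hold-out estimate of the SRC error at each sparsity level s = 1..max_sparsity,
    re-fitting the sparse representation and re-classifying at every s."""
    n_test = X_test.shape[1]
    errors = np.zeros(max_sparsity)
    for j in range(n_test):
        y = X_test[:, j]                                   # one testing observation (a column)
        for s in range(1, max_sparsity + 1):
            _, beta = omp_select(X_train, y, s)            # or marginal_select / a homotopy path
            pred = src_classify(X_train, train_labels, y, beta)
            errors[s - 1] += (pred != test_labels[j])
    return errors / n_test                                 # mean error at each sparsity level
```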

We also add k-nearest-neighbor (kNN) and linear discriminant analysis (LDA) as benchmarks for the classification error. They are calculated against the projection dimension, i.e., we linearly project the data into dimension d by principal component analysis (or spectral embedding if the input is a dissimilarity/similarity matrix), and apply k-nearest-neighbor (the specific k is just an arbitrary choice) and LDA on the projected data.

In all examples, the above procedure is repeated over independent Monte Carlo replicates, with the mean errors presented.

4.1 Face Images

We first apply SRC to two face image data sets, one of which is also used by WrightYangGaneshMa2009 to show the empirical advantage of SRC.

The Extended Yale B database contains face images of multiple individuals under various poses and lighting conditions GeorphiadesBelheumeurKriegman2001, LeeHoKriegman2005. The images are further re-sized for our experiment. Half of the data is used for training and the other half for testing. We show the mean errors over the Monte Carlo runs in Figure 1.

The CMU PIE database contains images of multiple individuals under various poses, illuminations and expressions SimBakerBsat2003. We again use re-sized images for classification, splitting half of the data for training and the other half for testing. The mean errors are shown in Figure 2.

The top left panel of each figure shows the SRC error, and we observe that the error rates for the different subset regression methods are very similar. The best error for the Extended Yale B database is achieved by OMP, and the best error for CMU PIE by ℓ1 minimization. SRC by full regression achieves a slightly worse mean error than subset regression on both data sets. As for kNN and LDA, their error rates are always greater than the SRC errors in both data sets, and are not shown in order to better compare the SRC errors.

For both data sets, the top right panel shows the class dominance error (the estimate of 1 − p_D), which is slightly higher than the SRC errors; their difference is small. The classification error given class dominance is nearly zero for all three subset regression methods and all sparsity levels, and is not shown in the figures. The bottom panel shows the error when class dominance fails, which is much higher than both the SRC error and the class dominance error.

Those additional panels demonstrate that class dominance largely explains the success of SRC for face images, and the testing data can only be misclassified due to the failure of class dominance. Since all three subset regression methods achieve almost zero errors given class dominance, it is also the main reason that all methods have similar classification errors in the top left panel.

Figure 1: SRC for Extended Yale B Database
Figure 2: SRC for CMU PIE Database

4.2 Wikipedia Data

Next we apply SRC to our Wikipedia documents with text and network features. We collect English documents from Wikipedia based on the 2-neighborhood of the English article “algebraic geometry”, then form an adjacency matrix based on the documents’ hyperlinks and a text-feature distance matrix based on latent semantic analysis DeerwesterDumais1990 and cosine distance. The data is available on our website (http://www.cis.jhu.edu/~cshen/); other examples of applying SRC to vertex classification in graphs can be found in ChenShenVogelsteinPriebe2015.

There are five classes in total for the documents, and both data sets are square matrices of the same size (because the network data is an adjacency matrix, and the text feature data is a cosine distance matrix). We split half of the columns for training and the other half for testing. The numerical performance is shown in Figure 3 and Figure 4 for the text and network data respectively. As the input data is a dissimilarity/similarity matrix, we use spectral embedding for projection prior to applying kNN and LDA.

The overall interpretation is similar to that for the face images: SRC performs quite well and is stable across different sparsity levels and different subset regression methods; the class dominance error is higher than the SRC error (for the text data they are quite close, while for the network data they are less close as the sparsity level increases); the error given class dominance is again nearly zero in this example, indicating that class dominance is crucial for correct classification; and the error when class dominance fails is close to the chance line and much higher than the SRC error and the class dominance error.

Note that the SRC classification errors for text features are lower than the network counterparts, because the text features should be more informative than the network data; we also observe that SRC becomes slightly inferior to LDA at large projection dimension for text features, which is not the case for the adjacency matrix. This is probably because the cosine distance is a particularly suitable distance measure for text data SaltonBuckley1988 , Singhai2001 , thus allowing LDA to do better at proper projection dimensions. This phenomenon also holds for the face images in the previous subsection: even though LDA performs much worse than SRC on the raw data, LDA can achieve similar error rates as SRC for appropriate transformations of those images HeEtAl2005 , CaiEtAl2007 .

Figure 3: SRC for Wikipedia English Documents Text Feature
Figure 4: SRC for Wikipedia English Documents Network Feature

5 Conclusion

In this paper we investigate the sparse representation classifier, which has recently become very popular due to its empirical success on real data. In order to better understand the theory behind this method, we focus on its regression and classification steps, and develop the notions of class dominance and the principal angle condition. Our derivation establishes a theoretical foundation of sparse representation from a point of view different from the current literature, and implies that ℓ1 minimization and the subspace assumption may not be crucial for SRC, which allows faster subset regression methods and easier data model verification for this method. Our results are illustrated by various real data analyses, including face images that roughly satisfy the subspace assumption, as well as text and network data that do not satisfy this assumption.

6 Proofs

6.1 Theorem 1 and Corollary 2

Proof.

Assume that class l_y dominates β̂, so the corresponding dominance inequality holds; furthermore, if positive class dominance also holds, the corresponding inequalities hold for all other classes k ≠ l_y.

Note that we can always express the testing observation as y = X_S β̂ + r, where r is the regression residual orthogonal to both X_{S,k} and X_{S,−k} for each class k, and r is fixed throughout for given y, X_S and β̂.

Thus, given class dominance and positive class dominance, we have ‖X_S δ_{−l_y}(β̂)‖_2 < ‖X_S δ_{−k}(β̂)‖_2 for all k ≠ l_y. Because r is orthogonal to X_S, by the Pythagorean theorem we immediately have ‖y − Xδ_{l_y}(β̂)‖_2 < ‖y − Xδ_k(β̂)‖_2 for all k ≠ l_y.

Therefore, the class-l_y residual is the smallest, and g(y) = l_y for the SRC classifier.

Clearly, if positive class dominance does not hold, there exist counterexamples in which SRC fails to find the correct class. However, if there are only two classes (i.e., K = 2), or X and β̂ are always non-negative (i.e., all observations are non-negative and the regression coefficients are constrained to be non-negative), then class dominance guarantees that ‖X_S δ_{−l_y}(β̂)‖_2 cannot be larger than ‖X_S δ_{−k}(β̂)‖_2 for any k ≠ l_y; the last inequality in the corresponding chain of inequalities easily follows when K = 2 or when X and β̂ are always non-negative. Therefore, in this case class dominance implies positive class dominance, and is sufficient for correct classification of SRC. ∎

6.2 Theorem 3

Proof.

We first decompose the testing observation as y = X_S β̂ + r, which is essentially the same as in the previous proof with a different notation for easier presentation. Note that the regression residual r is orthogonal to each column of X_S.

Next we consider the principal angle . By assuming all involved entities are positive, we have

where the first equality holds because is a vector, the second equality follows by decomposing , and the third and fourth equalities hold when there are no negative terms involved.

Similarly, we have .

Because is always smaller than (if it is , is a vector in the same direction as , in which case cannot be full rank), it is trivial to observe that if and only if .

When the involved entities are not always positive, the only other possible scenario is that one absolute term negates the positive sign, e.g., . This can only happen when , in which case we also have .

Therefore, class l_y dominates the regression vector if and only if Equation 4 holds, assuming X_S is of full rank. ∎

6.3 Theorem 4

Proof.

It suffices to prove that when the testing pair satisfies the principal angle condition, the class dominance probability converges to 1.

We proceed by first assuming that X_{S,l_y} is non-empty when using homotopy. Note that X_S is always of full rank when it is selected by homotopy.

As almost surely for some , we always have . And as almost surely, we have .

Therefore, with probability 1 we have Equation 4 and hence class dominance, as long as X_{S,l_y} is non-empty. So it remains only to justify that X_{S,l_y} is non-empty asymptotically.

We claim that under the principal angle condition, X_{S,l_y} is asymptotically non-empty when using homotopy. First, as the prior probability of class l_y cannot be zero, the training data contain data of class l_y with probability converging to 1 as n → ∞. Next, conditioning on the event that X contains some data of class l_y, the first datum selected by homotopy must be of class l_y (it is the one most correlated with the testing observation under the principal angle condition). But the first entered element may get deleted on the homotopy solution path, and it seems possible that X_{S,l_y} is empty at some s.

Let us prove this is not possible by contradiction. Suppose that at a certain step, the homotopy path deletes an element so that X_{S,l_y} becomes empty. Because the first added element makes X_{S,l_y} non-empty, to make X_{S,l_y} empty from a certain step onwards, the deleted element must be the only datum of class l_y in the active set.

However, because the principal angle condition guarantees that the correlation between the testing observation and the class-l_y data is large while the correlation with data of other classes is small, deleting the only class-l_y datum increases both the residual term and the overall objective value, and thus can never minimize the penalized objective for any value of the regularization parameter λ (which is the objective function on the homotopy path). Thus if there is only one observation of class l_y remaining in the active set X_S, that datum can never be deleted on the homotopy solution path. Thus X_{S,l_y} is almost surely asymptotically non-empty under the principal angle condition.

Therefore, given the principal angle condition, with probability converging to 1 we have class dominance. Then, if the principal angle condition holds with probability p under F_{YL}, the class dominance probability is asymptotically no less than p for ℓ1 minimization as n → ∞. ∎

6.4 Corollary 5

Proof.

Given the principal angle condition, class dominance holds with probability 1 asymptotically. So if the condition in Corollary 2 also holds, i.e., class dominance implies positive class dominance so that class dominance alone is sufficient for correct classification, we have g(y) = l_y with probability 1 asymptotically.

Therefore, if those two conditions hold with probability p, the SRC error satisfies L_n ≤ 1 − p asymptotically. Furthermore, if p = 1, SRC is asymptotically optimal with L_n → 0. ∎

6.5 Corollary 6

Proof.

We now consider replacing ℓ1 minimization by the other subset regression methods.

When homotopy is replaced by OMP, the only difference in our proof of Theorem 4 concerns whether X_{S,l_y} is still non-empty when using OMP. At s = 1, OMP adds the same element into X_S as homotopy does, and the principal angle condition guarantees the first entered element is of class l_y almost surely. Unlike homotopy, OMP never deletes any element on its solution path; thus all other parts of the proofs of Theorem 4 and Corollary 5 remain the same, and OMP can achieve SRC optimality.

When homotopy is replaced by marginal regression, the first element to enter X_S coincides with that of homotopy and OMP. Therefore the principal angle condition still guarantees that X_{S,l_y} is almost surely non-empty for any given s. However, as marginal regression only picks the training observations that are most correlated with the testing observation, it is possible that X_S is no longer of full rank after a certain s. Thus for the proof of Theorem 4 to work, we need to constrain s so that X_S is of full rank.

Finally, for full regression, i.e., when we use X directly to derive the regression vector by least squares, the class-l_y portion of the training data is almost surely asymptotically non-empty, as the prior probability of class l_y should be nonzero. Therefore all proofs of Theorem 4 and Corollary 5 remain the same, and full regression can also achieve SRC optimality, as long as X itself is of full rank. ∎

6.6 Corollary 7

Proof.

As the extended condition only requires some data of the correct class to be close to the testing observation, we may treat the far-away data of class l_y as data from an additional class, and keep the close data still in class l_y.

Then the extended principal angle condition leads to the same class dominance result as Theorem 4, and Corollary 5 and Corollary 6 easily follow with essentially the same proofs. ∎

Acknowledgment

This work was partially supported by Johns Hopkins University Human Language Technology Center of Excellence (JHU HLT COE) and the XDATA program of the Defense Advanced Research Projects Agency (DARPA) administered through Air Force Research Laboratory contract FA8750-12-2-0303. The authors are also supported by the Acheson J. Duncan Fund for the Advancement of Research in Statistics, which allows us to present preliminary results of the paper at Joint Statistical Meeting, Boston, August 2014.

References

  • (1) J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
  • (2) A. Bruckstein, D. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM REVIEW, vol. 51, no. 1, pp. 34–81, 2009.
  • (3) J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, “Sparse representation for computer vision and pattern recognition,” Proceedings of IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
  • (4) J. Yin, Z. Liu, Z. Jin, and W. Yang, “Kernel sparse representation based classification,” Neurocomputing, vol. 77, no. 1, pp. 120–128, 2012.
  • (5) A. Yang, Z. Zhou, A. Ganesh, S. Sastry, and Y. Ma, “Fast l1-minimization algorithms for robust face recognition,” IEEE Transactions on Image Processing, vol. 22, no. 8, pp. 3234–3246, 2013.
  • (6) E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2765–2781, 2013.
  • (7) R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267–288, 1996.
  • (8) D. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Transactions on Information Theory, vol. 47, pp. 2845–2862, 2001.
  • (9) D. Donoho and M. Elad, “Optimal sparse representation in general (nonorthogonal) dictionaries via l1 minimization,” Proceedings of National Academy of Science, pp. 2197–2202, 2003.
  • (10) B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
  • (11) E. Candes and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
  • (12) D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm near solution approximates the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 10, pp. 907–934, 2006.
  • (13) E. Candes and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies?,” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, 2006.
  • (14) E. Candes, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207–1233, 2006.
  • (15) P. Zhao and B. Yu, “On model selection consistency of lasso,” Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.
  • (16) N. Meinshausen and B. Yu, “Lasso-type recovery of sparse representations for high-dimensional data,” Annals of Statistics, vol. 37, no. 1, pp. 246–270, 2009.
  • (17) M. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1 - constrained quadratic programming (lasso),” IEEE Transactions on Information Theory, vol. 55, no. 5, pp. 2183–2202, 2009.
  • (18) P. Belhumeur, J. Hespanda, and D. Kriegman, “Eigenfaces versus fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
  • (19) R. Basri and D. Jacobs, “Lambertian reflection and linear subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 3, pp. 218–233, 2003.
  • (20) R. Rigamonti, M. Brown, and V. Lepetit, “Are sparse representations really relevant for image classification?,” in Computer Vision and Pattern Recognition (CVPR), 2011.
  • (21) L. Zhang, M. Yang, and X. Feng, “Sparse representation or collaborative representation: which helps face recognition?,” in International Conference on Computer Vision (ICCV), 2011.
  • (22) Q. Shi, A. Eriksson, A. Hengel, and C. Shen, “Is face recognition really a compressive sensing problem?,” in Computer Vision and Pattern Recognition (CVPR), 2011.
  • (23) Y. Chi and F. Porikli, “Classification and boosting with multiple collaborative representations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1519–1531, 2013.
  • (24) L. Chen, C. Shen, J. Vogelstein, and C. E. Priebe, “Robust vertex classification,” submitted, http://arxiv.org/abs/1311.5954.
  • (25) G. Davis, S. Mallat, and M. Avellaneda, “Greedy adaptive approximation,” Constructive Approximation, vol. 13, pp. 57–98, 1997.
  • (26) J. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
  • (27) L. Wasserman and K. Roeder, “High dimensional variable selection,” Annals of statistics, vol. 37, no. 5A, pp. 2178–2201, 2009.
  • (28) C. Genovese, J. Lin, L. Wasserman, and Z. Yao, “A comparison of the lasso and marginal regression,” Journal of Machine Learning Research, vol. 13, pp. 2107–2143, 2012.
  • (29) S. Chen, D. Donoho, and M. Saunders, “Atomic decomposition by basis pursuit,” SIAM Review, vol. 43, no. 1, pp. 129–159, 2001.
  • (30) D. Donoho and Y. Tsaig, “Fast solution of l1-norm minimization problems when the solution may be sparse,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4789–4812, 2008.
  • (31) M. Osborne, B. Presnell, and B. Turlach, “A new approach to variable selection in least squares problems,” IMA Journal of Numerical Analysis, vol. 20, pp. 389–404, 2000.
  • (32) M. Osborne, B. Presnell, and B. Turlach, “On the lasso and its dual,” Journal of Computational and Graphical Statistics, vol. 9, pp. 319–337, 2000.
  • (33) J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
  • (34) T. Zhang, “On the consistency of feature selection using greedy least squares regression,” Journal of Machine Learning Research, vol. 10, pp. 555–568, 2009.
  • (35) T. Cai and L. Wang, “Orthogonal matching pursuit for sparse signal recovery with noise,” IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4680–4688, 2011.
  • (36) D. Needell and R. Vershynin, “Uniform uncertainty principle and signal recovery via regularized orthogonal matching pursuit,” Foundations of Computational Mathematics, vol. 9, pp. 317–334, 2009.
  • (37) D. Needell and R. Vershynin, “Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matching pursuit,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, pp. 310–316, 2010.
  • (38) D. Donoho, Y. Tsaig, I. Drori, and J.-L. Starck, “Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit,” IEEE Transactions on Information Theory, vol. 58, no. 2, pp. 1094–1121, 2012.
  • (39) M. Kolar and H. Liu, “Marginal regression for multitask learning,” Journal of Machine Learning Research W & CP, vol. 22, pp. 647–655, 2012.
  • (40) J. Fan and J. Lv, “Sure independence screening for ultrahigh dimensional feature space,” Journal of the Royal Statistical Society: Series B, vol. 70, no. 5, pp. 849–911, 2008.
  • (41) J. Fan, R. Samworth, and Y. Wu, “Ultrahigh dimensional feature selection: beyond the linear model,” Journal of Machine Learning Research, vol. 10, pp. 2013–2038, 2009.
  • (42) J. Fan, Y. Feng, and R. Song, “Nonparametric independence screening in sparse ultra-high-dimensional additive models,” Journal of the American Statistical Association, vol. 106, no. 494, pp. 544–557, 2011.
  • (43) K. Balasubramanian, K. Yu, and G. Lebanon, “Smooth sparse coding via marginal regression for learning sparse representations,” Journal of Machine Learning Research W & CP, vol. 28, no. 3, pp. 289–297, 2013.
  • (44) A. Bruckstein, M. Elad, and M. Zibulevsky, “On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4813–4820, 2008.
  • (45) M. Slawski and M. Hein, “Non-negative least squares for high-dimensional linear models: consistency and sparse recovery without regularization,” Electronic Journal of Statistics, vol. 7, pp. 3004–3056, 2013.
  • (46) N. Meinshausen, “Sign-constrained least squares estimation for high-dimensional regression,” Electronic Journal of Statistics, vol. 7, pp. 1607–1631, 2013.
  • (47) Y. Eldar and M. Mishali, “Robust recovery of signals from a structured union of subspaces,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 5302–5316, 2009.
  • (48) Y. Eldar, P. Kuppinger, and H. Bolcskei, “Compressed sensing of block-sparse signals: Uncertainty relations and efficient recovery,” IEEE Transactions on Signal Processing, vol. 58, no. 6, pp. 3042–3054, 2010.
  • (49) E. Elhamifar and R. Vidal, “Block-sparse recovery via convex optimization,” IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4094–4107, 2012.
  • (50) L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, 1996.
  • (51) A. Georghiades, P. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
  • (52) K. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
  • (53) T. Sim, S. Baker, and M. Bsat, “The cmu pose, illumination, and expression database,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1615–1618, 2003.
  • (54) S. Deerwester, S. Dumais, T. Landauer, G. Furnas, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society of Information Science, vol. 41, no. 6, pp. 391–407, 1990.
  • (55) G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
  • (56) A. Singhal, “Modern information retrieval: A brief overview,” Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, vol. 24, no. 4, pp. 35–43, 2001.
  • (57) X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang, “Face recognition using Laplacianfaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 328–340, 2005.
  • (58) D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a spatially smooth subspace for face recognition,” in Proceedings of IEEE Conference Computer Vision and Pattern Recognition Machine Learning (CVPR’07), 2007.