Classification via Incoherent Subspaces

05/10/2010 ∙ by Karin Schnass, et al. ∙ Austrian Academy of Sciences 0

This article presents a new classification framework that can extract individual features per class. The scheme is based on a model of incoherent subspaces, each one associated to one class, and a model on how the elements in a class are represented in this subspace. After the theoretical analysis an alternate projection algorithm to find such a collection is developed. The classification performance and speed of the proposed method is tested on the AR and YaleB databases and compared to that of Fisher's LDA and a recent approach based on on ℓ_1 minimisation. Finally connections of the presented scheme to already existing work are discussed and possible ways of extensions are pointed out.



There are no comments yet.


page 7

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

A general approach in classification is to select features of the signal at hand and to get a decision by comparing them to the equivalent features of already labelled signals with a simple classifier like nearest neighbour, e.g.

[3], or nearest subspace, cp [9]

. This of course raises the question which features to take. For face recognition, which is the example we will use here, some classic and simple, because linear, features are Eigen,

[19], Fisher, [6], or Laplace features, [7]. However, as these classifiers are very simple and the features not adjusted to them, their performance is somehow disappointing, and researchers turned to the development of more complicated nonlinear features and kernel methods, [10, 15].
Here we start from the point of view that the potential of linear methods and simple classifiers is not exhausted. In order to achieve better results, we propose to give up the uniformity of features over classes and mix the feature selection with the classifier. To motivate the idea of class specific features let us have a look at classical nearest neighbour (NN) and nearest subspace (NS) classification using linearly selected features and give it a new interpretation.
Assume we have already labelled training signals belonging to classes, where each class contains elements, i.e. . We denote the -th signal in class as , . For each class we collect all its training signals as columns in the class matrix , i.e. , and these class matrices in turn are combined into a big data matrix . Given a new signal the goal is to decide which class it belongs to with the help of the already labelled training signals.
The classical first step is to select relevant features from

via a linear transform

, where is a matrix of rank .


The exact shape of the transform is determined by the training signals and their labels. For instance for Fisher’s LDA is chosen as the orthogonal projection that maximises the ratio of between-class scatter to that of within-class scatter, [6].
In the second step these features are compared to the features of the training signals . In case of the nearest neighbour classifier this means that the new signal will get the label of the training signal which has features that maximally correlate with the features of the new signal, ie.


If by analogy we define the feature matrix for a class as we can rewrite the expression, for which we are seeking the maximal argument, and combine the feature selection with the labelling step,


where the matrix denotes the transpose of , the

-norm of a vector is defined by

for and and the -norm of a matrix by . Thus another way of looking at the classification procedure is to say that for every class we have a set of sensing signals and the new signal belongs to the class which has the sensing signal closest to it. From this point of view we also see that the scheme will work stably only if two conditions are fullfilled. Every new signal is well represented by one vector in its class, i.e. a lot of its energy is captured by the projection on one vector, and no two sensing vectors from different classes are the same or close to each other, i.e.


Let us do the same analysis for the nearest subspace classifier. Again the features of the new signal are compared to those of the training signals. For each class the features of the training signals in it span a subspace and the new signal will get the label of the class for which the orthogonal projection of the features of the new signal on the corresponding subspace has the highest energy. Let be an orthonormal system111

can for instance be found via a (reduced) qr-decomposition of

spanning the subspace for class , then i.e.


Again we can combine the feature selection with the labelling by manipulating the expression, we want to maximise,

If we compare to NN classification we see that again for every class we get a set of sensing signals, the columns of the matrix , and that the new signal belongs to the class for which the sensing signals can take out the most energy (or for which the biorthogonal system to provides the best representation). Again this leads to two conditions for the classification to work, which are however more complex. First every new signal should be comparatively well represented in the biorthogonal system determined by its class and second no signal which is in the span of the sensing signals of one class should be well representable in the biorthogonal system of another class.


Summarising our findings for both nearest neighbour and nearest subspace classification we see that in both cases for every class we have a set of sensing signals or a subspace defined by the feature selection transform. There is a model how signals from this class are represented by this subspace, which implicitly determines which norm is used for the classification, for NN, for NS, and at the same requires that the interaction of subspaces measured by a corresponding matrix norm is small, i.e. that they are incoherent.
The classification scheme presented in this paper is based on the following idea. We give up the restriction that the subspaces associated to each class are generated canonically as a function of the feature selection transform and the training samples, i.e.

in the case of nearest neighbour classification, but generate them individually. This idea can also be motivated using the example of face recognition, to which we will apply our scheme later. Uniform feature extraction would mean realising that in general the most relevant parts of a face are the regions of the eyes, nose and mouth. Thus in order to classify a person we would focus on the eyes, nose and mouth regions while ignoring the hairstyle and comparing them to the eyes, nose and mouth regions of all the candidates. While this makes sense in general it will fail as soon as the set of candidates contains identical twins which can only be distinguished by the birth mark one has on his cheek. So while for most people the cheek is not a very distinguishing feature for the twins it is and it would be better to remember for them the cheek instead of for instance the nose. Even without the extreme example of the identical twins individual features are natural considering that the people we meet every day all have eyes, mouths and noses but not all of them have distinguishing eyes, mouths and noses. Instead they may have distinguishing birthmarks, scars, chins, etc. and a representation using these features will characterise them well but nobody else.

In the next section we will introduce the mathematical framework on which we base our classification scheme. It consists of a model of subspaces associated to each class and a model of how the elements in this class are represented in this subspace, which together lead to a natural choice of the norm we have to use for the classification and an incoherence requirement on the subspaces. In Section III we will develop a comparatively simple algorithm to learn these subspaces from the training signal, which we will use to classify faces in Section IV. In the last section we summarise our findings, point out connections to related approaches and outline possibilities for future work.

Ii Class Model

The most general model for the subspaces we can think of is to ascribe to every class a set of vectors , , which are collected as columns in the matrix . These correspond to the features that characterise elements of this class, so every element in class can be written as a combination of these class specific features with coefficients and some residual , orthogonal to the feature span,


The condition that features of a class well characterise the elements in it translates into a property of the coefficients , i.e. when measuring their strength in some norm it is higher than the strength of the coefficients we would obtain trying to represent the element by features of the wrong class. Since without further restrictions on the set of features per class it is not straightforward to calculate the coefficients of the best representation of signal in a class, we will sacrifice generality for simplicity and for the rest of the analysis assume that for every class we have the same number of features and that they form an orthonormal system, i.e. . We will point out how to deal with the more general situation in the last section. Given that the features of each class form an orthonormal system we can easily calculate the coefficients of the best representation of a general signal in class as . The question is now how should we measure these coefficients in order to correctly classify our images, i.e. which norm should we choose such that for all in class we have


To answer this question we will introduce three models on the coefficients, each leading to a certain p-norm as optimal measure.

Ii-a Sparse Coefficients

Assume that all signals we want to classify can be well represented by one element of one class, i.e.


where counts the number of non-zeros entries. An example for this situation would be trying to sort pictures of monkeys, snails, cucumbers and broccoli into animal and vegetable pictures. Even though monkeys and snails are both animals their shapes are very different, meaning that we can think of them as orthogonal, and the same goes for the shapes of cucumbers and broccoli in the other class. Let be the absolute value of the only non-zero component of the coefficients . We immediately see that whatever p-norm we apply using the correct class the response is always equal to ,

. Therefore to find out which p-norm is best we will use a trick that involves estimating the ratio we need to be smaller than 1 for successful classification with the triangular equation and a matrix norm bound. So for general

we get,


In the special case where the coefficients are sparse and thus , this means that

The smallest norm of a matrix is obtained when and . Then it corresponds to the maximal absolute entry of the matrix , i.e. the maximal absolute correlation between to features from different classes. Since in that case also the response from the residual is minimal we get the best bound choosing the -norm for the classification. Summarising our findings we see that in case of a sparse model on the coefficients, the norm is optimal and that the incoherence requirement we get for classification to work stably is that no two features from two different classes are too similar, but it does not matter if a feature is moderately close to all features in a different class or even representable by them. Thinking to the example of the animal vs. vegetable pictures this means that even though you can approximate the shape of a snail combining the shape of the cucumber and the broccoli, classification using the norm will work well because no animal shape alone closely resembles a vegetable shape and vice versa.

Ii-B Flat Coefficients

Let us now assume the completely opposite distribution of the coefficients, i.e. to represent one element in a class we need to combine all features of that class with equal magnitudes, i.e.


An example would be trying to label pictures of national flags and with the corresponding countries. For simplicity assume that the only flags in question are those of the Netherlands, Germany, Estonia, Lithuania and Gabon, which all consist of three horizontal stripes in various colours, i.e. red, white and blue for the Netherlands, black, red and yellow for Germany, blue, black and white for Estonia, yellow, green and red for Lithuania and green, yellow and blue for Gabon, cp. Figure 1. Good features in this example are the colours of the stripes. Each national flag has its three distinctive colours which appear in an equal amount but are not exclusive to this flag.

Netherlands Germany Estonia Lithuania Gabon
Fig. 1: National Flags

In this case we get the maximal response from the correct class when choosing , i.e. . Also from Inequality (10) we see that using and gives a very beneficial bound.

Remembering that is smaller than the absolute sum of all the correlations between features in one class and features in another class we get a less sharp version of the above bound,

which shows that for the case of flat coefficients we have a quite different coherence constraint. Even if a few features in a class are very close to features in another class or actually the same this is not a problem as long as the majority of features from two different classes are not very correlated.
In the example of the flags this means that even though two different national flags might share up to two colours, as long as we take into account that all three colours have to appear to the same degree, we can still identify the country from a picture of the flag.

Ii-C Unstructured Coefficients

The last case we are going to discuss is probably the most common and concerns coefficients which follow neither of the two extreme distributions discussed above or where the exact distribution is unknown. An example is the task of face recognition, i.e. identifying a person from a picture. Obvious features in this case are noses, eyes and mouths. In any picture most of these features will be visible but their strength will largely depend on the facial expression and lighting conditions. To choose a good p-norm for the classification in this case, we again bound the norm ratio we need to be small.


Since we do not have information about the shape of the coefficients, the first term on the right hand side can be as big as . Taking into account the orthogonality of the features in the matrices , we see that for this term can only be equal to one if two classes overlap, meaning that there is a signal whose features in its own class can be represented by features in a different class. For , however, the corresponding term is equal to the maximum absolute column/row sum of the and it can be easily seen that this can be larger than one, even if for no signal the features in its own class can be fully represented by features in a different class. Similar results hold for all other , thus making the best choice in this case. Observe also that corresponds to measuring the energy captured by the features of a class. Thus if the features are well chosen also the second term in Inequality (12) can be expected to be small.
Finally we see that choosing puts the following incoherence constraint on the feature spaces. No signal that can be constructed from features in one class should be well representable by features in another class. This constraint is the strongest we have encountered so far, which is only natural since we do not have an assumption on coefficient distribution. Coming back to our example it also corresponds quite naturally to what one would expect from face recognition, ie. that in all pictures enough distinctive features are visible and no matter the lighting condition or facial expression two people can always be uniquely identified from their features.

Of course there is ample opportunity to develop more class models, assuming different distributions on the coefficients and using more exotic norms. Also one could use different assumptions on the features, i.e. non-orthogonal. However, in this paper we will focus on finding a practical way to calculate sensing or feature matrices for classification based on the three main models.

Iii Finding Feature/Sensing Matrices

From the analysis in the last section we can derive two types of conditions that the collection of features or subspaces needs to satisfy. The first type describes how features from different classes should interact, i.e. the interplay measured in the appropriate matrix norm should be small, and the second type how the features should interact with the training data, i.e. the ratio of the response without to within class should be small. The problem with both kinds of conditions is they are not linear and difficult to handle. For instance calculating the

-norm is equivalent to finding the largest singular value and calculating the

-norm is even NP-hard. We will therefore start with a very simple approach that will lead to a reasonably fast algorithm, and in the last section point out how to extend it to include more complicated constraints. Instead of requiring explicitly that the interplay between features from different classes is small, hereby avoiding to investigate what small means quantitatively, we use the intuition that this should come as free side effect from regulating the interaction with the training data, and simply ask that is a collection of orthonormal systems each of rank . What we would actually like to do about the interaction of the features with the training data is to minimise the ratio between the response of the training data without to within class. However, a constraint involving the ratio is not linear and very hard to handle. We will therefore split it into two constraints that guarantee that the ratio is small if they are fulfilled. The first constraint is that the response within class is equal to a constant which we choose to be the maximally achievable value given the rank of the orthonormal systems and . The second constraint is that the response without class is smaller than a constant , whose dependence on is more complicated and will be discussed later. Define the two sets and as


then our problems could be summarised as finding a matrix in the intersection of the two sets, i.e. . However, since this intersection might be empty, we should rather look for a pair of matrices, each belonging to one set, with minimal distance to each other measured in some matrix norm, eg. the Frobenius norm, denoted by 222We use this notation instead of the more common variant to avoid confusion.,


One line of attack is to use an alternate projection method, i.e. we fix a maximal number of iterations, an initialisation for and then in each iterative step do:

  • find a matrix

  • check if is smaller than the distance of any previous pair and if yes store

  • find a matrix

  • check if is smaller than the distance of any previous pair and if yes store

If both sets are convex, the outlined algorithm is known as Projection onto Convex Sets (POCS) and guaranteed to converge. Non convexity of possibly both sets, as is the case here, results in much more complex behaviour. Instead of converging, the algorithm just creates a sequence with at least one accumulation point. We will not discuss all the possible difficulties here but refer to [18], where all details, proofs and background information can be found and wherein the authors conclude that alternate projection is a valid strategy for solving the posed problem.
To keep the flow of the paper, we will not discuss the two minimisation problems that need to be alternatively solved here. The interested reader can find them, including the exact parameter settings in the simulations of the next section, in the appendix. Instead we will discuss how to set the parameters and possible choices for the initialisation .
As mentioned above we choose to be the maximally achievable value. An orthonormal system of feature vectors can maximally take out all the energy of a signal,


As the signals are assumed to have unit norm, this energy is at most one and we set . The maximal 1-norm of the vector of length with energy 1 is . This is attained when all features of one class take out the same energy, i.e. the absolute values of the entries in are all equal to . This leads to . The infinity norm corresponds to the maximal inner product between one of the feature vectors and the signal. As both the feature vector and the signals are normalised, this can be at most one and so we set .
From the discussion in the last section we see that the parameter reflects the incoherence we require between features from different classes. If we have , it is theoretically possible to have subspaces of dimension which are mutually orthogonal to each other, and could be zero. As soon as the above inequality is reversed, because for instance the actual dimension of the span of all features, i.e. , is smaller than , not all subspaces corresponding to the different classes can be orthogonal but will have to overlap. How the size of this overlap, i.e. coherence, should be measured, is determined by the choice of -norm for classification. For instance for the coherence was measured by and from theory about Grassmannian manifolds, see [18], we know that the maximal coherence between two of subspaces of dimension embedded in the space can be lower bounded by


The problem with setting as above is that we are not controlling the interaction between the sets of features directly but only indirectly over the training data. There the worst case might not be assumed and so as above would be too large. Also for the cases we do not have a similar bound. Therefore instead of trying to analyse theoretically how to set , where we have to deal with too many unknowns, we use the above bound as an indication of order of magnitude and, when testing our scheme on real data, vary the parameter . Lastly for the initialisation for each class we choose the orthogonal system that maximises the energy taken from this class opposed to the energy taken from the other classes, i.e.


This problem can be easily solved, by considering the rewritten version of the function to minimise,



is an eigenvalue decomposition of the symmetric (Hermitian) matrix

, then the minimum is attained for consisting of the eigenvectors corresponding to the largest eigenvalues.

Iv Testing

To test the proposed scheme we use two face databases, the AR-database, [13] and the extended Yale B database, [1]. First we will test the validity of all three approaches on the AR-database, even though it is intuitively clear that the most appropriate model for faces corresponds to . Using the experience from the AR-database we will then run similar tests on the extended Yale B database using only the most appropriate model .

Iv-a AR-Database

For the test we used a subset of images from the AR-database. For each of the 126 people there are 26 frontal images of size taken in two separate sessions. The images include changes in illumination, facial expression and disguises. For the experiment we selected 50 male and 50 female subjects and for each of them took the 14 images with just variations in illumination and facial expression, neutral, light from the right and left, front light, angry, happy, sleepy. The all together 700 images from the first session were used as training data and the 699 images333700 minus corrupted image w-027-14.bmp from the second session for testing. Every image was converted to grayscale and then stored as a 19800 dimensional column vector. The images from the first session were stored in the matrix and those from the second in the matrix . All images (columns) in were re-scaled to have unit norm. In order to speed up the calculations, we first applied a unitary transform, which does not change the geometry of the problem, but reduces the size of the matrices, i.e. we did a reduced -factorisation decomposing into the matrix with orthogonal columns and the upper triangular matrix and set and .
We tested the proposed scheme for all three choices of and varying values of scaling from to of and number of features per class varying from 1 to 7. The choice of the maximal outside-class contribution was inspired by the bound in (16). If we take as effective signal dimension and assume that the space should not only accommodate the 100 different people in our training set but all people, i.e. we let go to infinity, the bound approaches which is if and if . The maximal number of features per class is 7, since we only have 7 test images and so it does not make sense to look for spaces of higher dimension containing all test images. Note also that for the three schemes are the same, so the results are only displayed once. For each set of parameters we calculated the corresponding feature matrix using the algorithm described in the last section on the images from the first session. We then classified the images from the second session using the appropriate -norm. The results are shown in Tables I, II and III.

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
2 60 56 56 57 60 58 60 61 66 64 69
3 52 46 48 46 51 51 53 58 62 61 61
4 62 52 54 55 55 56 56 54 55 57 61
5 64 59 56 56 55 58 61 63 66 68 68
6 61 54 57 54 56 59 62 58 61 71 71
7 57 55 57 55 59 57 58 62 61 68 69
TABLE I: Number of misclassified images on the AR-database for and varying values and .
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
1 57 58 59 58 60 59 59 58 58 58 62
2 51 49 51 51 51 55 57 57 59 58 56
3 47 42 45 50 53 53 54 61 62 61 64
4 46 42 41 41 47 48 51 62 63 61 63
5 48 43 40 44 50 51 52 55 55 59 61
6 49 45 42 45 49 48 51 54 54 57 58
7 45 43 43 43 45 45 48 53 51 54 52
TABLE II: Number of misclassified images on the AR-database for and varying values and .
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1
2 55 62 59 54 56 52 54 61 63 64 62
3 55 63 58 56 60 58 59 63 65 69 69
4 55 64 60 57 59 58 58 61 67 70 67
5 55 60 59 55 58 57 57 60 66 71 69
6 55 61 59 54 57 56 56 65 67 72 69
7 55 61 59 55 56 54 55 66 66 71 70
TABLE III: Number of misclassified images on the AR-database for and varying values and .

As we can see we get the best performance for , followed by and . This comes as no surprise when considering the structure of our data. Intuitively the important features of a face are eyes, nose and mouth. Since the people in the pictures have different facial expression, usually not all of these features will be active explaining why is not the most appropriate model. On the other hand we can expect to have more than one feature active at the same time even if not to the same extent. Using we lose the information given by these secondary active features while with we still incorporate it into the final decision.
We can also see that of as maximally allowed outside class ’energy’ seemed to have been a good choice as we can always see a small decrease and large increase of the error going from to , with the best range for and between and and for between and . For we get better performance for the lower dimensions, which seems reasonable because there the equal energy distribution over the features is easier achieved. For on the other hand the better performance is achieved with higher dimensions, which are able to capture more important side details. Finally for the results seem equal for all dimensions. A possible explanation is given by the initialisation, which ensures that for all dimensions the first, most promising direction is included.
Still in all three cases in the most promising ranges the proposed scheme outperforms a standard method like Fisher’s LDA, [6]. The best result by LDA is obtained when using the original (not-normalised) images and the highest possible number of discriminant axes . In this case nearest neighbour classification, corresponding to but with non orthogonal features, fails to identify 59 images, and nearest subspace classification, corresponding to fails to identify images. When concentrating on the results for , which is the most sensible choice given the structure of the data, , we also see that the scheme performs well in comparison to a recent, successful method based on minimisation, [20]. The best result reported there is a success rate of , meaning misclassified images, which is 5 images better than our best case of errors.
Encouraged by the promising results we now turn to testing our scheme on the extended Yale B database.

Iv-B Extended Yale B Database

From the extended Yale B database we used the 2414 frontal face images, about 64 images taken under varying illumination conditions for each of the 38 people. For the test we randomly split the set of images per person into an equal number of training and test images, using one more training than test image in case of an odd number of images per class. We then ran our classification scheme with the number of features per class varying from 2 to 5 and thanks to the experience gained from the AR-database with the values of

running only from 0 to 0.05. For the computation of the feature matrices we used the same simplifications as described for the AR-database. For comparison we ran Fisher’s LDA with 37 and 30 discriminative axes in combination with the nearest neighbour classifier. This procedure was repeated 19 times and the mean of all 20 runs was computed.

0 0.01 0.02 0.03 0.04 0.05
2 19.80 5.74 20.30 5.80 22.25 7.20 23.85 6.81 25.25 6.66 26.25 6.61
3 14.15 4.37 13.60 4.22 13.85 3.73 15.85 5.25 16.40 4.78 17.55 6.00
4 15.75 3.82 14.05 3.49 13.95 3.55 15.35 3.95 16.45 4.10 16.95 4.30
5 15.70 4.78 15.00 4.91 14.45 4.30 15.30 3.34 17.60 4.65 17.65 4.55
TABLE IV: Mean standard deviation of misclassified images on the Extended Yale B database for and varying values and .

The results of our method can be found in Table IV. While Fisher’s LDA on average missclassified 23.30 6.42 images (success rate of 98.07 0.53%) using 37 discriminant axes and 231.55 23.48 images (success rate 80.78 1.95%) using 30 discriminant axes, our method in the best case only misclassified 13.60 4.22 images (success rate 98.87 0.35%). In general it outperformed Fisher’s LDA for a wide range of values for and .
Comparison to the -minimisation scheme in [20] is harder, as it seems that there only a single run was used. However, their best success rate of 98.26%, achieved at the same time as Fisher’s LDA with 30 discriminant axes achieved 87.57% (the maximal rate for Fisher’s LDA we encountered in 20 runs was 84.73%), is still below our best average rate of 98.87%.

To illustrate the results in Figure 2 and to confirm the motivation in the introduction for using different features for different classes, we show what happens to the training images of two different subjects when projected on the features of their own class and the other subject’s class. As expected the projections on features of their own class nicely filter out common traits like eyes, mouths and noses, but on top of that the features of the first subject capture the very distinctive birth mark on his right cheek. The projections on the wrong class on the other hand are not only much weaker (note the difference in scale) but also less clear. Two overlapping sets of features seems to appear at the same time, the ones that belong to the subject in the image and the ones that the projection is trying to filter out.

(a) (b) (c)
(d) (e) (f)
Fig. 2: Images of two subjects, original (a) & (d), projected onto the span of features from their own class (b) & (e), projected onto the span of features of the wrong class (c) & (f)

Summarising the results, we can say that our method outperforms a classic scheme like Fisher’s LDA. In comparison to the -minimisation scheme in [20] it performs slightly worse on the AR-database but seem to be better on the YaleB-database. However it has one big advantage over the -minimisation scheme, which is its low computational complexity. Not taking the calculation of the feature matrices into account, as this is part of the pre-processing, basically all that has to be done to classify a new data vector is to multiply it with the feature matrix and calculate some statistics on the resulting vector. The minimisation method on the other hand requires on top of extracting the features the solution of a convex optimisation problem


where in this case is the matrix containing the features of all the training data. For comparison in [20] the authors state that the classification of one image takes a few seconds on a typical 3 GHz Pc. At the same time for classifying 1205 images of size , using our method with 4 feature dimensions per class, MATLAB takes less than half a minute on a Dual 1.8Ghz PowerPC G5, which is less than 25ms per image.

In the next section we will introduce the mathematical framework on which we base our classification scheme. It consists of a model of subspaces, one associated to each class, and a model on how the elements in this class are represented in this subspace, which together lead to a natural choice of the norm we have to use for the classification and an incoherence requirement on the subspaces.

V Discussion

We have presented a classification scheme based on a model of incoherent subspaces, each one associated to one class, and a model on how the elements in a class are represented in this subspace. From a more practical viewpoint we have developed an algorithm to calculate these subspaces, i.e. the feature matrices, and shown that the scheme gives promising results on the AR database and even outperforms a state of the art method like the -minimisation scheme in [20] on the YaleB-database. The idea that each class should have its own representative system, learned from the training data can already be found in [17]. There frames or dictionaries for texture classification are learned, such that each provides a sparse representation for its texture class. The new texture then gets the label of the texture frame providing the sparsest representation. In [12], the same basic idea is used but the learning is guided by the principle that the dictionaries should also be discriminant, while in [16] both learning principles are combined, i.e. the dictionaries should be discriminant and approximative. This third scheme can be considered as a more general and more complicated version of our approach. Alternatively our approach can be considered to be a hybrid of Nearest Subspace respectively Nearest Neighbour and the discriminative and approximative frame scheme, in so far as it is linear but has individual features for every class.
The idea to use a collection of subspaces for data analysis can also be found in [11]

, where the subspaces are used to model homogenous subsets of high-dimensional data which together can capture the heterogenous structures.

For the future there remain some interesting directions to explore. Firstly the possibilities of the subspace classification approach do not seem exhausted using the proposed algorithm. Ironically this fact revealed itself through a bug in the minimisation procedure, resulting in matrix pairs with distances larger than the optimal ones, and sensing matrices giving better classification results, i.e. in the best case an error of only misclassified images. The main difference of these fake optimal matrices to the sensing matrices corresponding to the actual minima, seemed to be that, while capturing approximately the same ’energy’ within class, they were more accurate in respecting the without class energy bound, i.e. less overshooting of the maximally allowed value . This overshooting for the real minima is a result of imposing not only but also , which forces the optimal feature matrix to balance the error incurred by not attaining within class and the error incurred by being larger than without class. A promising idea to avoid the overshooting would be to change the problem formulation and ask to maximise the ’energy’ within class subject to keeping the ’energy’ without class small, i.e. in the case solve,


Lastly our approach allows to impose additional constraints on , like incoherence of the subspaces between each other, e.g. for , or low rank of the whole feature matrix to reduce the cost of calculating . Another possibility to reduce computational cost if and are very large, especially in the training step, would be to first take random samples of the training data, which reduce their dimension but very likely preserve the geometrical structure, as described in [2] and used in [20]. Alternatively to reduce the dimension of one can apply our scheme on classical features, like Eigen or Laplace features, instead of directly on the raw training data.

Appendix A Solution Sketches for the Minimisation Problems

In order to use the alternate projection method for calculating the feature matrices we need to find the projection of a matrix onto and onto in the three cases . We will start with the easier of the two problems


Since the minimisation problem is invariant under squaring of the objective function and thus equivalent to


it splits into independent problems


The solution of these problems is straightforward. If

has the reduced singular value decomposition

then the orthonormal system of same rank closest to it is , see e.g. [8].
The second minimisation problem

is more complicated to solve. Assume that the number of training signals is larger than the dimension of the signals and span the whole space, so that the matrix has rank . If not we embed the training signals into a lower dimensional space corresponding to the rank of via a reduced -decomposition of and set before starting the alternating projection procedure. Afterwards we set , where is the feature matrix calculated from the lower dimensional embedded data. Since has rank we have and can reformulate the problem to solve as


The advantage of this formulation is that it is in terms of , which is also used to describe . To further exploit this property we define the set , which is of the form with . To characterise the set we assume the following notation. Let refer to the submatrix that corresponds to inside and denote the -th column of by . We can then define


Set then the problem in (23) is equivalent to


To attack this problem we will use resolvents or proximity operators which are a generalisation of projection operators. Given a Hilbertspace and a function from to that is lower semicontinuous, convex and not identical to , i.e. belonging to the proximity operator is defined by

Proximity operators were first studied by Moreau in [14], who developed a theory of proximal calculus, and recently have been used to solve optimisation problems in signal processing, [4]. Here we will use the forward backward splitting approach as described in [5]. Assume that we can write the function to minimise as the sum of two functions in , i.e.


If is differentiable with a -Lipschitz continuous gradient for then the sequence generated by fixing and iterating


converges weakly to a minimum of (26) if .
To apply the concept to our problem we take as Hilbert space the set of all matrices equipped with the Frobenius norm and define the indicator function of the set by

The we can replace problem (25) by


The slight imperfection of this approach is that the set is not convex, therefore is not convex and the sequence generated applying (27) is not guaranteed to converge. Finding only a local minimum is however not such a big problem, since the procedure is only part of a bigger iterative scheme, as long as in each step we get some improvement.
What remains to be done is to calculate the proximity operators for , the gradient of and decide about the initialisation and the step sizes . A straightforward calculation shows that . Since is an indicator function the proximity operator is simply the orthogonal projection onto , i.e.

Because of the structure of , see (A), the problem above splits into the smaller problems


In other words for we need to solve problems of the form


The solutions are collected in the following Theorem.

Theorem 1

Denote by the minimal argument of the first problem and by the minimal argument of the second problem in (29).
Set if and else, and denote by the length of the , then

If set . Otherwise set and iteratively shrink

until with is found and set .

Let be the index of (one of) the largest absolute component of then

Lastly as initialisation we choose the projection of onto , i.e. . Finding the correct step-sizes is usually a matter or trial and error. For the application considered here we used , which worked better for small , and , which worked better for large . The iteration was stopped when the relative improvement in each step was below . The number of iterations for the alternative projections was set to 10.

Thanks: We would like to thank John Wright for helping us getting the cropped version of the AR-database faces.


  • [1] Georghiades A., P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
  • [2] D. Achlioptas. Database-friendly random projections. In Proc. 20th Annual ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, pages 274–281, 2001.
  • [3] R. Brunelli and T. Poggio. Face recognition: Features vs. templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(10):1042–1053, October 1993.
  • [4] P.L. Combettes and Pesquet J.-C. A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564–574, December 2007.
  • [5] P.L. Combettes and V.R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
  • [6] R.A. Fisher. The use of multiple measures in taxonomic problems. Ann. Eugenics, 7:179–188, 1936.
  • [7] X. He, S. Yan, Y. Hu, P. Niyogi, and H. Zhang. Face recognition using laplacianfaces. In Proc. IEEE CVPR, 2005.
  • [8] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.
  • [9] K. Lee, J. Ho, and D.J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5):684–698, 2005.
  • [10] C. Liu. Capitalize on dimensionality increasing techniques for improving face recognition grand challenge performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006.
  • [11] Y. Ma, A. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413–458, August 2008.
  • [12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. IMA Preprint Series 2212, University of Minnesota, 2008.
  • [13] A.M. Martinez and Benavente R. The AR face database. Technical Report 24, CVC, 1998.
  • [14] J.-J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris Ser. A. Math., 255:2897–2899, 1962.
  • [15] P. Phillips, W. Scruggs, A. O’Tools, P. Flynn, K. Bowyer, C. Schott, and M. Sharpe. FRVT 2006 and ICE 2006 large-scale results. NISTIR 7408, NIST, 2007.
  • [16] F. Rodriguez and G. Sapiro. Sparse representations for image classification: Learning discriminative and reconstructive non-parametric dictionaries. IMA Preprint Series 2213, University of Minnesota, 2008.
  • [17] K. Skretting and J.H. Husoy. Texture classification using sparse frame-based representations. EURASIP Journal on Applied Signal Processing, 2006:11, 2006.
  • [18] J. Tropp, I. Dhillon, R Heath Jr, and T. Strohmer. Designing structured tight frames via an alternating projection method. IEEE Transactions on Information Theory, 51(1):188–209, January 2005.
  • [19] M. Turk and A. Pentland. Eigenfaces for recognition. In Proc. IEEE CVPR, 1991.
  • [20] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 2009.