1 Introduction
A number of real-world applications model data as being sampled from a union of independent subspaces. These applications include image representation and compression [6], systems theory [12], image segmentation [15], motion segmentation [13], face clustering [7, 5] and texture segmentation [8], to name a few. Dimensionality reduction is generally applied before these methods because most of them optimize expensive loss functions (nuclear norm, regularization terms, etc.). Most of these applications simply apply off-the-shelf dimensionality reduction techniques or resize images (in the case of image data) as a preprocessing step.

The union of independent subspaces model can be thought of as a generalization of the traditional approach of representing a given set of data points using a single low-dimensional subspace (e.g. Principal Component Analysis). For algorithms that model the data at hand with this independence assumption to remain applicable, the subspace structure of the data needs to be preserved after dimensionality reduction. Although a number of existing dimensionality reduction techniques [10, 3, 1, 4] try to preserve the spatial geometry of any given data, to the best of our knowledge no prior work has tried to explicitly preserve the independence between subspaces.

In this paper, we propose a novel dimensionality reduction technique that preserves independence between multiple subspaces. To achieve this, we first show that for any two disjoint subspaces of arbitrary dimensionality, there exists a two-dimensional subspace such that both subspaces collapse to form two lines when projected onto it. We then extend this idea to the multiclass case and show that $2K$ projection vectors are sufficient for preserving the subspace structure of a $K$ class dataset. Further, we design an efficient algorithm that finds projection vectors with the aforementioned properties while being able to handle corrupted data.

Footnote †: This work was partially funded by the National Science Foundation under the grant number CNS1314803.
2 Preliminaries
Let $S_1, S_2, \ldots, S_K$ be $K$ subspaces in $\mathbb{R}^n$. We say that these $K$ subspaces are independent if there does not exist any nonzero vector in $S_i$ which is a linear combination of vectors in the other $K-1$ subspaces. Let the columns of the matrix $B_i \in \mathbb{R}^{n \times d_i}$ denote the support of the $i^{th}$ subspace of $d_i$ dimensions. Then any vector in this subspace can be represented as $x = B_i w$ for some $w \in \mathbb{R}^{d_i}$. Now we define the notion of margin between two subspaces.
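As a concrete reading of this definition, a set of subspaces is independent exactly when stacking their bases side by side loses no rank, i.e. the dimension of the sum equals the sum of the dimensions. The following is a minimal NumPy sketch; the function name `are_independent` and the tolerance are illustrative assumptions, not part of the paper.

```python
import numpy as np

def are_independent(bases, tol=1e-10):
    """Return True if the subspaces spanned by the columns of each matrix in
    `bases` are independent: dim(sum of subspaces) == sum of dimensions."""
    dims = [np.linalg.matrix_rank(B, tol=tol) for B in bases]
    stacked = np.hstack(bases)
    return bool(np.linalg.matrix_rank(stacked, tol=tol) == sum(dims))

# The three coordinate axes of R^3 are independent ...
e = np.eye(3)
axes = [e[:, [0]], e[:, [1]], e[:, [2]]]

# ... whereas three distinct lines in R^2 are pairwise disjoint
# but not independent (the third is a combination of the other two).
lines = [np.array([[1.0], [0.0]]),
         np.array([[0.0], [1.0]]),
         np.array([[1.0], [1.0]])]

print(are_independent(axes), are_independent(lines))
```

The second case illustrates why independence is a stronger requirement than pairwise disjointness.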
Definition 1
(Subspace Margin) Subspaces $S_i$ and $S_j$ are separated by margin $\gamma_{ij}$ if
$$\gamma_{ij} = \max_{u \in S_i,\, v \in S_j} \frac{\langle u, v \rangle}{\|u\|_2 \, \|v\|_2} \qquad (1)$$
Thus the margin between any two subspaces is defined as the maximum dot product between two unit vectors ($u$, $v$), one from either subspace. Such a vector pair ($u$, $v$) is known as the principal vector pair between the two subspaces, while the angle between these vectors is called the principal angle.
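The margin and the principal vector pairs can be computed from bases of the two subspaces with one SVD, which is the standard way to obtain principal angles. The sketch below assumes both input matrices have full column rank; the helper name `principal_pairs` and the example subspaces in $\mathbb{R}^4$ are illustrative choices.

```python
import numpy as np

def principal_pairs(A, B):
    """Principal cosines and principal vector pairs between span(A) and span(B).
    Assumes A and B have full column rank."""
    Qa, _ = np.linalg.qr(A)              # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)              # orthonormal basis of span(B)
    U, s, Vt = np.linalg.svd(Qa.T @ Qb)  # singular values = principal cosines
    k = len(s)
    return s, Qa @ U[:, :k], Qb @ Vt.T[:, :k]

# Two planes in R^4 whose principal angles are 30 and 60 degrees.
a, b = np.deg2rad(30.0), np.deg2rad(60.0)
E = np.eye(4)
A = E[:, :2]
B = np.column_stack([np.cos(a) * E[:, 0] + np.sin(a) * E[:, 2],
                     np.cos(b) * E[:, 1] + np.sin(b) * E[:, 3]])

cosines, Ua, Ub = principal_pairs(A, B)
margin = cosines[0]  # largest cosine, i.e. the margin of Definition 1
print(cosines)
```

Column `i` of `Ua` and column `i` of `Ub` form the $i^{th}$ principal vector pair, and their dot product equals `cosines[i]`.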
With these definitions of independent subspaces and margin, assume that we are given a dataset which has been sampled from a union of independent linear subspaces. Specifically, each class in this dataset lies along one such independent subspace. Then our goal is to reduce the dimensionality of this dataset such that after projection, each class continues to lie along a linear subspace and each such subspace is independent of all others. Formally, let $X$ be a $K$ class dataset in $\mathbb{R}^n$ such that vectors from class $i$ ($x \in X_i$) lie along subspace $S_i$. Then our goal is to find a projection matrix ($P \in \mathbb{R}^{n \times m}$) such that the projected data vectors ($\{P^T x : x \in X_i\}$) belong to a linear subspace ($\bar{S}_i$ in $\mathbb{R}^m$). Further, each subspace $\bar{S}_i$ is independent of all others.
3 Proposed Approach
In this section, we propose a novel subspace learning approach applicable to labeled datasets that theoretically guarantees independent subspace structure preservation. The number of projection vectors required by our approach is not only independent of the size of the dataset but is also fixed, depending only on the number of classes. Specifically, we show that for any $K$ class labeled dataset with independent subspace structure, only $2K$ projection vectors are required for structure preservation.
The idea of finding a fixed number of projection vectors for the structure preservation of a $K$ class dataset is motivated by theorem 1. This theorem states a useful property of any pair of disjoint subspaces.
Theorem 1
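Let unit vectors $v_1$ and $v_2$ be the $i^{th}$ principal vector pair for any two disjoint subspaces $S_1$ and $S_2$ in $\mathbb{R}^n$. Let the columns of the matrix $P \in \mathbb{R}^{n \times 2}$ be any two orthonormal vectors in the span of $v_1$ and $v_2$. Then for all vectors $x \in S_j$, $P^T x = \alpha t_j$ ($j \in \{1, 2\}$), where $\alpha \in \mathbb{R}$ depends on $x$ and $t_j \in \mathbb{R}^2$ is a fixed vector independent of $x$. Further, $\frac{t_1^T t_2}{\|t_1\|_2 \|t_2\|_2} = v_1^T v_2$.

Proof: We use the notation $(M)_j$ to denote the $j^{th}$ column vector of matrix $M$ for any arbitrary matrix $M$. We claim that $t_j = P^T v_j$ ($j \in \{1, 2\}$). Also, without any loss of generality, assume that $j = 1$. Then in order to prove theorem 1, it suffices to show that $P^T x = \alpha t_1$, $\forall x \in S_1$. By symmetry, $P^T x$, $\forall x \in S_2$, will also lie along a line in the subspace spanned by the columns of $P$.
Let the columns of $B_1 \in \mathbb{R}^{n \times d_1}$ and $B_2 \in \mathbb{R}^{n \times d_2}$ be the (orthonormal) support of $S_1$ and $S_2$ respectively, where $d_1$ and $d_2$ are the dimensionality of the two subspaces. Then we can represent $v_1$ and $v_2$ as $v_1 = B_1 w_1$ and $v_2 = B_2 w_2$ for some $w_1 \in \mathbb{R}^{d_1}$ and $w_2 \in \mathbb{R}^{d_2}$. Let $x$ be any arbitrary vector in $S_1$ where $x = B_1 w$, $w \in \mathbb{R}^{d_1}$. Then we need to show that $P^T B_1 w = \alpha P^T v_1$. Because the columns of $P$ lie in the span of $v_1$ and $v_2$, it is enough to show that $[v_1 \; v_2]^T B_1 w = \alpha \, [v_1 \; v_2]^T v_1$. Notice that,
$$[v_1 \; v_2]^T B_1 w = \begin{bmatrix} w_1^T B_1^T B_1 w \\ w_2^T B_2^T B_1 w \end{bmatrix} = \begin{bmatrix} w_1^T w \\ w_2^T B_2^T B_1 w \end{bmatrix} \qquad (2)$$
Let $U \Sigma V^T$ be the svd of $B_1^T B_2$. Then $w_1$ and $w_2$ are the $i^{th}$ columns of $U$ and $V$ respectively, and $v_1^T v_2$ is the $i^{th}$ diagonal element of $\Sigma$ if $v_1$ and $v_2$ are the $i^{th}$ principal vectors of $S_1$ and $S_2$. Thus,
$$w_2^T B_2^T B_1 w = (V)_i^T V \Sigma^T U^T w = (v_1^T v_2) \, w_1^T w, \quad \text{so} \quad [v_1 \; v_2]^T B_1 w = (w_1^T w) \begin{bmatrix} 1 \\ v_1^T v_2 \end{bmatrix} = (w_1^T w) \, [v_1 \; v_2]^T v_1 \qquad (3)$$
which is the claim with $\alpha = w_1^T w$.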
Geometrically, this theorem says that after projection on the plane ($P$) defined by any one of the principal vector pairs between subspaces $S_1$ and $S_2$, both subspaces collapse entirely to just two lines, such that points from $S_1$ lie along one line while points from $S_2$ lie along the second line. Further, the angle that separates these lines is equal to the angle between the $i^{th}$ principal vector pair between $S_1$ and $S_2$ if the span of the $i^{th}$ principal vector pair is used as $P$.
We apply theorem 1 to a three-dimensional example as shown in figure 1. In figure 1 (a), the first subspace (the yz plane) is shown in red while the second subspace is the black line in the xy plane. Notice that for this setting, the xy plane (shown in blue) is the span of the first (and only) principal vector pair between the two subspaces. After projecting both subspaces onto the xy plane, we get two lines (figure 1 (b)) as stated in the theorem.
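The three-dimensional example can be reproduced numerically. The sketch below is an illustration under assumptions: the direction of the line in the xy plane is taken to be $(1, 1, 0)$, which is our choice since the figure itself is not available here, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Subspace 1: the yz plane. Subspace 2: a line in the xy plane
# (direction (1, 1, 0) is an assumed example). Both bases are orthonormal.
B1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B2 = np.array([[1.0], [1.0], [0.0]]) / np.sqrt(2.0)

# Principal vector pair from the SVD of B1^T B2.
U, s, Vt = np.linalg.svd(B1.T @ B2)
v1 = B1 @ U[:, 0]
v2 = B2 @ Vt[0, :]

# Orthonormal basis P of span{v1, v2}; here this span is the xy plane.
P, _ = np.linalg.qr(np.column_stack([v1, v2]))

# Project random points of each subspace onto the plane.
X1 = B1 @ rng.standard_normal((2, 50))
X2 = B2 @ rng.standard_normal((1, 50))
Y1, Y2 = P.T @ X1, P.T @ X2

# Each projected point cloud has rank 1, i.e. it lies along a single line.
rank1 = np.linalg.matrix_rank(Y1, tol=1e-10)
rank2 = np.linalg.matrix_rank(Y2, tol=1e-10)

# The cosine between the two lines equals the principal cosine.
t1, t2 = P.T @ v1, P.T @ v2
cos_lines = abs(t1 @ t2) / (np.linalg.norm(t1) * np.linalg.norm(t2))
print(rank1, rank2, cos_lines, s[0])
```

Although the first subspace is two-dimensional, its projection has rank 1, which is the collapse to a line that the theorem describes.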
Finally, we now show that for any $K$ class dataset with independent subspace structure, $2K$ projection vectors are sufficient for structure preservation.
Theorem 2
Let $X$ be a $K$ class dataset in $\mathbb{R}^n$ with Independent Subspace structure. Let $P = [P_1 \ldots P_K] \in \mathbb{R}^{n \times 2K}$ be a projection matrix for $X$ such that the columns of the matrix $P_k \in \mathbb{R}^{n \times 2}$ consist of orthonormal vectors in the span of any principal vector pair between subspaces $S_k$ and $\sum_{j \neq k} S_j$. Then the Independent Subspace structure of the dataset $X$ is preserved after projection on the $2K$ vectors in $P$.
Before stating the proof of this theorem, we first state lemma 1 which we will use later in our proof. This lemma states that if two vectors are separated by a nonzero angle, then after augmenting these vectors with any arbitrary vectors, the new vectors remain separated by some nonzero angle as well. This straightforward idea will help us extend the two subspace case in theorem 1 to multiple subspaces.
Lemma 1
Let , be any two fixed vectors of same dimensionality with respect to each other such that . Let , be any two arbitrary vectors of same finite dimensionality with respect to each other such that . Then there exists a constant such that vectors and are also separated such that . Here the equality in only holds when .
Proof:
(4) 
Expanding the denominator, we get,
(5) 
Thus,
(6) 
On the other hand, squaring the numerator yields,
(7) 
Finally, since arithmetic mean is at least equal to geometric mean, we have that
(8) 
which implies , thus proving the claim.
Proof of theorem 2:
For the proof of theorem 2, it suffices to show that data vectors from subspaces $S_k$ and $\sum_{j \neq k} S_j$ (for any $k \in \{1, \ldots, K\}$) are separated by a margin less than 1 after projection using $P$. Let $x$ and $y$ be any vectors in $S_k$ and $\sum_{j \neq k} S_j$ respectively, and let the columns of the matrix $P_k$ (the $k^{th}$ two-column block of $P$) be in the span of the $i^{th}$ (say) principal vector pair between these subspaces. Using theorem 1, the projected vectors $P_k^T x$ and $P_k^T y$ are separated by an angle equal to the angle between the $i^{th}$ principal vector pair between $S_k$ and $\sum_{j \neq k} S_j$. Let the cosine of this angle be $\gamma$. Then, using lemma 1, the vectors $P^T x$ and $P^T y$, formed by adding dimensions to $P_k^T x$ and $P_k^T y$, are also separated by some margin $\bar{\gamma} < 1$. As the same argument holds for vectors from all classes, the Independent Subspace Structure of the dataset remains preserved after projection.
For any two disjoint subspaces, theorem 1 tells us that there is a two-dimensional plane in which the entire projected subspaces form two lines. It can be argued that after adding arbitrary-valued finite dimensions to the basis of this plane, the two projected subspaces will also remain disjoint (see proof of theorem 2). Theorem 2 simply applies this argument to each subspace and the sum of the remaining subspaces, one at a time. Thus for $K$ subspaces, we get $2K$ projection vectors.
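The construction of the projection matrix, one two-column block per class, can be sketched as follows. This is an illustration under assumptions: it uses known subspace bases rather than data, it picks the top principal pair although the theorem allows any pair, and the helper names `top_principal_pair` and `build_projection` are ours.

```python
import numpy as np

def top_principal_pair(Ba, Bb):
    """Top principal vector pair between span(Ba) and span(Bb)."""
    Qa, _ = np.linalg.qr(Ba)
    Qb, _ = np.linalg.qr(Bb)
    U, s, Vt = np.linalg.svd(Qa.T @ Qb)
    return Qa @ U[:, 0], Qb @ Vt[0, :]

def build_projection(bases):
    """Stack one orthonormal two-column block per class into an n x 2K matrix.
    Block k spans a principal pair between S_k and the sum of the others."""
    blocks = []
    for k, Bk in enumerate(bases):
        rest = np.hstack([B for j, B in enumerate(bases) if j != k])
        v1, v2 = top_principal_pair(Bk, rest)
        Pk, _ = np.linalg.qr(np.column_stack([v1, v2]))
        blocks.append(Pk)
    return np.hstack(blocks)

rng = np.random.default_rng(0)
n, d, K = 30, 3, 4
bases = [rng.standard_normal((n, d)) for _ in range(K)]
P = build_projection(bases)

# Within block 0, class 0 and the sum of the remaining classes
# each collapse to a line (rank 1), as theorem 1 states.
P0 = P[:, :2]
rest0 = np.hstack(bases[1:])
rank_own = np.linalg.matrix_rank(P0.T @ bases[0], tol=1e-8)
rank_rest = np.linalg.matrix_rank(P0.T @ rest0, tol=1e-8)
print(P.shape, rank_own, rank_rest)
```

The output dimensionality depends only on the number of classes `K`, not on the number of samples or on the subspace dimensions.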
Finally, our approach projects data to $2K$ dimensions, which could be a concern if the original feature dimension itself is less than $2K$. However, since we are only concerned with data that has an underlying independent subspace assumption, notice that the feature dimension must be at least $K$. This is because each class must lie on at least 1 dimension which is linearly independent of the others. However, this is too strict an assumption and it is straightforward to see that if we relax this assumption to 2 dimensions for each class, the feature dimension is already at $2K$.
3.1 Implementation
A naive approach to finding projection vectors (say for a binary class case) would be to compute the SVD of the matrix $X_1^T X_2$, where the columns of $X_1$ and $X_2$ contain vectors from class 1 and class 2 respectively. For large datasets this would not only be computationally expensive but also be incapable of handling noise. Thus, even though theorem 2 guarantees the structure preservation of the dataset $X$ after projection using $P$ as specified, this does not solve the problem of dimensionality reduction. The reason is that, given a labeled dataset sampled from a union of independent subspaces, we do not have any information about the basis or even the dimensionality of the underlying subspaces. Under these circumstances, constructing the projection matrix $P$ as specified in theorem 2 itself becomes a problem. To solve this problem, we propose an algorithm that tries to find the underlying principal vector pair between subspaces $S_k$ and $\sum_{j \neq k} S_j$ (for $k = 1$ to $K$) given the labeled dataset $X$. The assumption behind this attempt is that samples from each subspace (class) are not heavily corrupted and that the underlying subspaces are independent.
Notice that we are not specifically interested in a particular principal vector pair between any two subspaces for the computation of the projection matrix. This is because we have assumed independent subspaces and so each principal vector pair is separated by some margin $\gamma < 1$. Hence we need an algorithm that computes any arbitrary principal vector pair, given data from two independent subspaces. These vectors can then be used to form one of the $K$ submatrices in $P$ as specified in theorem 2. For computing the submatrix $P_k$, we need to find a principal vector pair between subspaces $S_k$ and $\sum_{j \neq k} S_j$. In terms of the dataset $X$, we estimate the vector pair using data in $X_k$ and $\bar{X}_k$, where $\bar{X}_k := X \setminus \{X_k\}$. We repeat this process for each class to finally form the entire matrix $P$. Our approach is stated in algorithm 1. For each class $k$, the idea is to start with a random vector in the span of $\bar{X}_k$ and find the vector in $X_k$ closest to this vector. Then fix this vector and search for the closest vector in $\bar{X}_k$. Repeating this process until the cosine between these vectors converges leads to a principal vector pair. In order to estimate the closest vector from the opposite subspace, we use a quadratic program in algorithm 1 that minimizes the reconstruction error of the fixed vector (of one subspace) using vectors from the opposite subspace. The regularization in the optimization is to handle noise in data.
3.2 Justification
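Definition 1 for the margin between two subspaces $S_1$ and $S_2$ can be equivalently expressed as
$$\min_{w_1, w_2} \; \frac{1}{2} \|B_1 w_1 - B_2 w_2\|_2^2 \quad \text{s.t.} \quad \|w_1\|_2 = 1, \; \|w_2\|_2 = 1 \qquad (9)$$
where the columns of $B_1 \in \mathbb{R}^{n \times d_1}$ and $B_2 \in \mathbb{R}^{n \times d_2}$ are the basis of the subspaces $S_1$ and $S_2$ respectively such that $B_1^T B_1$ and $B_2^T B_2$ are both identity matrices.
Proposition 1
Let $B_1$ and $B_2$ be the basis of two disjoint subspaces $S_1$ and $S_2$. Then for any principal vector pair ($u_i$, $v_i$) between the subspaces $S_1$ and $S_2$, the corresponding vector pair ($w_1$, $w_2$), s.t. $u_i = B_1 w_1$ and $v_i = B_2 w_2$, is a local minimum of the objective in equation (9).
Proof: The Lagrangian function for the above objective is:
$$L(w_1, w_2, \eta_1, \eta_2) = \frac{1}{2} \|B_1 w_1 - B_2 w_2\|_2^2 + \frac{\eta_1}{2} \left( \|w_1\|_2^2 - 1 \right) + \frac{\eta_2}{2} \left( \|w_2\|_2^2 - 1 \right) \qquad (10)$$
Then setting the gradient w.r.t. $w_1$ to zero we get
$$\nabla_{w_1} L = B_1^T (B_1 w_1 - B_2 w_2) + \eta_1 w_1 = (1 + \eta_1) w_1 - B_1^T B_2 w_2 = 0 \qquad (11)$$
Let $U \Sigma V^T$ be the SVD of $B_1^T B_2$ and let $w_1$ and $w_2$ be the $i^{th}$ columns of $U$ and $V$ respectively. Then equation (11) becomes
$$(1 + \eta_1) (U)_i - U \Sigma V^T (V)_i = (1 + \eta_1 - \Sigma_{ii}) (U)_i = 0 \qquad (12)$$
Thus the gradient w.r.t. $w_1$ is zero when $\eta_1 = \Sigma_{ii} - 1$. Similarly, it can be shown that the gradient w.r.t. $w_2$ is zero when $\eta_2 = \Sigma_{ii} - 1$. Thus the gradient of the Lagrangian is $0$ w.r.t. both $w_1$ and $w_2$ for every corresponding principal vector pair. Thus the vector pair ($w_1$, $w_2$) corresponding to any of the principal vector pairs between subspaces $S_1$ and $S_2$ is a local minimum of the objective in equation (9).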
Since the pair ($w_1$, $w_2$) corresponding to any principal vector pair between two disjoint subspaces forms a local minimum of the objective given by equation (9), one can alternately minimize equation (9) w.r.t. $w_1$ and $w_2$ and reach one of the local minima. Thus, by assuming independent subspace structure for all the $K$ classes in algorithm 1 and setting $\lambda$ to zero, it is straightforward to see that the algorithm yields a projection matrix that satisfies the criteria specified by theorem 2.
Finally, real-world data do not strictly satisfy the independent subspace assumption in general, and even a slight corruption in data may easily lead to the violation of this independence. In order to tackle this problem, we add a regularization ($\lambda$) term while solving for the principal vector pair in algorithm 1. If we assume that the corruption is not heavy, reconstructing a sample using vectors belonging to another subspace would require a large coefficient over those vectors. The regularization avoids reconstructing data from one class using slightly corrupted vectors from another class by assigning such vectors small coefficients.
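Algorithm 1 itself is not reproduced in this text, so the following is only a sketch of the alternating procedure that Sections 3.1 and 3.2 describe. It assumes that the quadratic program is a ridge ($\ell_2$) regularized least-squares fit; the function names, default parameters and stopping rule are illustrative assumptions rather than the authors' exact algorithm.

```python
import numpy as np

def ridge_fit(X, v, lam):
    """Point in the span of the columns of X closest to v, with an l2 penalty
    lam on the combination coefficients (assumed form of the quadratic program)."""
    G = X.T @ X + lam * np.eye(X.shape[1])
    w = np.linalg.solve(G, X.T @ v)
    return X @ w

def estimate_principal_pair(Xk, Xrest, lam=1e-6, tol=1e-10, max_iter=1000, seed=0):
    """Alternate between the two sample sets until the cosine converges."""
    rng = np.random.default_rng(seed)
    v2 = Xrest @ rng.standard_normal(Xrest.shape[1])  # random vector in span(Xrest)
    gamma_old = -np.inf
    for _ in range(max_iter):
        v1 = ridge_fit(Xk, v2, lam)
        v1 /= np.linalg.norm(v1)
        v2 = ridge_fit(Xrest, v1, lam)
        v2 /= np.linalg.norm(v2)
        gamma = float(v1 @ v2)
        if abs(gamma - gamma_old) < tol:
            break
        gamma_old = gamma
    return v1, v2, gamma

# Two planes in R^6 with known principal cosines 0.8 and 0.3.
a, b = np.arccos(0.8), np.arccos(0.3)
E = np.eye(6)
B1 = E[:, :2]
B2 = np.column_stack([np.cos(a) * E[:, 0] + np.sin(a) * E[:, 2],
                      np.cos(b) * E[:, 1] + np.sin(b) * E[:, 3]])
rng = np.random.default_rng(1)
X1 = B1 @ rng.standard_normal((2, 20))  # 20 noiseless samples per subspace
X2 = B2 @ rng.standard_normal((2, 20))

v1, v2, gamma = estimate_principal_pair(X1, X2)
print(gamma)
```

With noiseless data and a very small `lam`, each step is close to an orthogonal projection, so the iteration behaves like a power method and, from a generic start, settles on the pair with the largest cosine. A larger `lam` trades this exactness for robustness to corrupted samples.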
3.3 Complexity
Solving algorithm 1 requires solving an unconstrained quadratic program within a while-loop. Assume that we run this while-loop for $T$ iterations and that we use conjugate gradient descent to solve the quadratic program in each iteration. Also, it is known that for any matrix $A \in \mathbb{R}^{p \times q}$ and vector $y \in \mathbb{R}^{p}$, conjugate gradient applied to a problem of the form
$$\arg\min_{w} \|A w - y\|_2^2 \qquad (13)$$
takes time $O(pq\sqrt{\kappa})$, where $\kappa$ is the condition number of $A^T A$. Thus it is straightforward to see that the time required to compute the projection matrix for a $K$ class problem in our case is $O(KTnN\sqrt{\kappa})$, where $n$ is the dimensionality of the feature space, $N$ is the total number of samples and $\kappa$ is the condition number of the matrix ($\bar{X}_k^T \bar{X}_k + \lambda I$). Here $I$ is the identity matrix. Note that the quadratic program (the bottleneck of our algorithm) can also be solved using optimization techniques such as Stochastic Coordinate Descent or Stochastic Gradient Descent in the case of very large dimensionality or dataset size, and hence our algorithm is scalable.
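To make the cost argument concrete, the sketch below solves a regularized least-squares problem by conjugate gradient on its normal equations, using only matrix-vector products with $A$ and $A^T$. The function name `cg_ridge`, the tolerance and the test sizes are illustrative assumptions; the paper does not specify an implementation.

```python
import numpy as np

def cg_ridge(A, y, lam=0.0, tol=1e-10, max_iter=1000):
    """Minimize ||A w - y||^2 + lam ||w||^2 by conjugate gradient on
    (A^T A + lam I) w = A^T y. A^T A is never formed explicitly, so each
    iteration costs two matrix-vector products."""
    w = np.zeros(A.shape[1])
    r = A.T @ y              # residual of the normal equations at w = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A.T @ (A @ p) + lam * p
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
y = rng.standard_normal(40)
w_cg = cg_ridge(A, y, lam=0.1)
w_direct = np.linalg.solve(A.T @ A + 0.1 * np.eye(10), A.T @ y)
print(np.max(np.abs(w_cg - w_direct)))
```

The number of iterations needed grows with the square root of the condition number of the system matrix, which is where the $\sqrt{\kappa}$ factor in the complexity comes from.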
4 Empirical Analysis
In this section, we present empirical evidence to support our theoretical analysis of our subspace learning approach. For real world data, we use the following datasets:
1. Extended Yale dataset B [2]: It consists of frontal face images of 38 individuals () with images per person. These images were taken under constrained but varying illumination conditions.
2. AR dataset [9]: This dataset consists of more than frontal face images of individuals with images per person. These images were taken under varying illumination, expression and facial disguise. For our experiments, similar to [14], we use images from individuals () with males and females. We further use only images per class which correspond to illumination and expression changes. This corresponds to images from Session and rest from Session 2.
3. PIE dataset [11]: The pose, illumination, and expression (PIE) database is a subset of CMU PIE dataset consisting of images of people ().
We crop all the images to , and concatenate all the pixel intensity to form our feature vectors. Further, we normalize all data vectors to have unit norm.
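The preprocessing described above amounts to flattening each cropped image into a vector and scaling it to unit length. A minimal sketch follows, assuming the images are already cropped and stored as an array of shape (N, h, w); the function name and the 4 x 4 toy size are illustrative only.

```python
import numpy as np

def images_to_features(images):
    """Flatten images of shape (N, h, w) into an (h*w, N) matrix whose
    columns are pixel-intensity vectors scaled to unit Euclidean norm."""
    X = images.reshape(images.shape[0], -1).T.astype(float)
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    # Guard against division by zero for an all-black image.
    return X / np.maximum(norms, np.finfo(float).tiny)

toy_images = np.arange(1, 5 * 4 * 4 + 1, dtype=float).reshape(5, 4, 4)
X = images_to_features(toy_images)
print(X.shape)
```

Storing one sample per column matches the convention used for the data matrices in Section 3.1.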
4.1 Qualitative Analysis
4.1.1 Two Subspaces-Two Lines
We test both the claim of theorem 1 and the quality of approximation achieved by algorithm 1 in this section. We perform these tests on both synthetic and real data.
1. Synthetic Data: We generate two random subspaces in of dimensionality and
(notice that these subspaces will be independent with probability 1). We randomly generate data vectors from each subspace and normalize them to have unit length. We then compute the principal vector pair between the two subspaces using their basis vectors by performing SVD of $B_1^T B_2$, where $B_1$ and $B_2$ are the basis of the two subspaces. We orthonormalize the vector pair to form a first projection matrix. Next, we use the labeled dataset of generated points to form a second projection matrix by applying algorithm 1. The entire dataset is then projected onto each of the two projection matrices separately and plotted in figure 3. The green and red points denote data from either subspace. The results not only substantiate our claim in theorem 1 but also suggest that the proposed algorithm for estimating the projection matrix is a good approximation.
2. Real Data: Here we use Extended Yale dataset B for analysis. Since we are interested in the projection of two class data in this experimental setup, we randomly choose different pairs of classes from the dataset and use the labeled data from each pair to generate the two-dimensional projection matrix (for that pair) using algorithm 1. The resulting projected data from the pairs can be seen in figure 3. As is evident from the figure, the projected two class data for each pair approximately lie along two different lines.
4.1.2 Multiclass separability
We analyze the separation between the $K$ classes of a given $K$ class dataset after dimensionality reduction. First, we compute the projection matrix for that dataset using our approach and project the data. Second, we compute the top principal vector for each class separately from the projected data. This gives us $K$ vectors. Let the columns of the matrix $Z \in \mathbb{R}^{2K \times K}$ contain these vectors. Then in order to visualize interclass separability, we simply take the dot product of the matrix $Z$ with itself, i.e. $Z^T Z$. Figure 4 shows this visualization for the three face datasets. The diagonal elements represent self dot products; thus their value is 1 (white). The off-diagonal elements represent interclass dot products, and these values are consistently small (dark) for all three datasets, reflecting between-class separability.
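The separability matrix can be computed in a few lines. The sketch below makes two assumptions that the text does not state: the top principal vector of a class is taken to be the top left singular vector of its uncentered projected data, and absolute values are used because the sign of a singular vector is arbitrary. The function name `class_gram` and the toy data are ours.

```python
import numpy as np

def class_gram(Y, labels):
    """Y: (m, N) projected data, one column per sample; labels: length-N array.
    Returns the K x K matrix of absolute dot products between the top
    principal (left singular) vectors of the classes."""
    classes = np.unique(labels)
    Z = np.column_stack([
        np.linalg.svd(Y[:, labels == c], full_matrices=False)[0][:, 0]
        for c in classes
    ])
    return np.abs(Z.T @ Z)

# Toy projected data: class 0 lies along the first axis, class 1 along the second.
Y = np.zeros((4, 6))
Y[0, :3] = [1.0, 2.0, 3.0]
Y[1, 3:] = [1.0, -1.0, 2.0]
labels = np.array([0, 0, 0, 1, 1, 1])

G = class_gram(Y, labels)
print(G)
```

For perfectly separated classes, as in this toy case, the matrix is the identity; values near 1 off the diagonal would indicate classes whose dominant directions nearly coincide.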
4.2 Quantitative Analysis
In order to evaluate theorem 2, we perform a classification experiment on all three real-world datasets mentioned above after projecting the data vectors using different dimensionality reduction techniques. We compare our quantitative results against PCA, Linear discriminant analysis (LDA), Regularized LDA and Random Projections (RP) (see footnote 1). We make use of sparse coding [14] for classification.

Footnote 1: We also used LPP (Locality Preserving Projections) [3], NPE (Neighborhood Preserving Embedding) [4], and Laplacian Eigenmaps [1] for dimensionality reduction on Extended Yale B dataset. However, because the best performing of these reduction techniques yielded a result of only 73% compared to the close to 98% accuracy from our approach, we do not report results from these methods.
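For Extended Yale dataset B, we use all 38 classes for evaluation with train-test split 1 and train-test split 2. Since our method is randomized, we perform 50 runs of computing the projection matrix using algorithm 1 and report the mean accuracy with standard deviation. Similarly for RP, we generate different random matrices and then perform classification. Since all other methods are deterministic, there is no need for multiple runs.

Table 1: Extended Yale dataset B, train-test split 1 (accuracy in %, mean ± standard deviation for randomized methods)

| Method | Ours | PCA | LDA | RegLDA | RP |
|---|---|---|---|---|---|
| dim | 76 | 76 | 37 | 37 | 76 |
| acc | 98.06 ± 0.18 | 92.54 | 83.68 | 95.77 | 93.78 ± 0.48 |

Table 2: Extended Yale dataset B, train-test split 2

| Method | Ours | PCA | LDA | RegLDA | RP |
|---|---|---|---|---|---|
| dim | 76 | 76 | 37 | 37 | 76 |
| acc | 99.45 ± 0.20 | 93.98 | 93.85 | 97.47 | 94.72 ± 0.66 |

Table 3: AR dataset

| Method | Ours | PCA | LDA | RegLDA | RP |
|---|---|---|---|---|---|
| dim | 200 | 200 | 99 | 99 | 200 |
| acc | 92.18 ± 0.08 | 85.00 | - | 88.71 | 84.76 ± 1.36 |

Table 4: PIE dataset, all classes

| Method | Ours | PCA | LDA | RegLDA | RP |
|---|---|---|---|---|---|
| dim | 136 | 136 | 67 | 67 | 136 |
| acc | 93.65 ± 0.08 | 87.76 | 86.71 | 92.59 | 90.46 ± 0.93 |

Table 5: PIE dataset, subset of the first classes

| Method | Ours | PCA | LDA | RegLDA | RP |
|---|---|---|---|---|---|
| dim | 20 | 20 | 9 | 9 | 20 |
| acc | 99.07 ± 0.09 | 97.06 | 95.88 | 97.25 | 95.03 ± 0.41 |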
For AR dataset, we take the images from Session for training and the images from Session for testing. The results are shown in table 3. The result using LDA is not reported because we found that the summed within class covariance was degenerate and hence LDA was not applicable. It can be clearly seen that our approach significantly outperforms other dimensionality reduction methods.
Finally for PIE dataset, we perform experiments on two different subsets. First, we take all the classes and for each class, we randomly choose images for training and for testing. The performance for this subset is shown in table 4. Second, we take only the first classes of the dataset and of all the images per class, we randomly split the data into traintest set. The performance for this subset is shown in table 5.
Evidently, our approach consistently yields the best performance on all the three datasets compared to other dimensionality reduction methods.
5 Conclusion
We presented a theoretical analysis of the preservation of independence between multiple subspaces. We showed that for $K$ independent subspaces, $2K$ projection vectors are sufficient for independence preservation (theorem 2). This result is motivated by our observation that for any two disjoint subspaces of arbitrary dimensionality, there exists a two-dimensional plane such that after projection, the entire subspaces collapse to just two lines (theorem 1). Building on this analysis, we proposed an efficient iterative algorithm (algorithm 1) that tries to exploit these properties for learning a projection matrix for dimensionality reduction that preserves independence between multiple subspaces. On three real-world datasets, our approach yields state-of-the-art classification accuracy compared to popular dimensionality reduction methods.
References
[1] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, June 2003.
[2] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
[3] X. He and P. Niyogi. Locality preserving projections (LPP). Proc. of the NIPS, Advances in Neural Information Processing Systems. Vancouver: MIT Press, 103, 2004.
[4] Xiaofei He, Deng Cai, Shuicheng Yan, and Hong-Jiang Zhang. Neighborhood preserving embedding. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1208–1213, Oct 2005.
[5] Jeffrey Ho, Ming-Hsuan Yang, Jongwoo Lim, Kuang-Chih Lee, and David Kriegman. Clustering appearances of objects under varying illumination conditions. In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, volume 1, pages I–11–I–18. IEEE, 2003.
[6] Wei Hong, John Wright, Kun Huang, and Yi Ma. Multiscale hybrid linear models for lossy image representation. Image Processing, IEEE Transactions on, 15(12):3655–3671, 2006.
[7] Guangcan Liu, Zhouchen Lin, and Yong Yu. Robust subspace segmentation by low-rank representation. In ICML, 2010.
[8] Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariate mixed data via lossy coding and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3, 2007.
[9] Aleix Martínez and Robert Benavente. AR Face Database, 1998.
[10] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, December 2000.
[11] Terence Sim, Simon Baker, and Maan Bsat. The CMU pose, illumination, and expression (PIE) database. In Automatic Face and Gesture Recognition, 2002. Proceedings. Fifth IEEE International Conference on, pages 46–51. IEEE, 2002.
[12] René Vidal, Stefano Soatto, Yi Ma, and Shankar Sastry. An algebraic geometric approach to the identification of a class of linear hybrid systems. In Decision and Control, 2003. Proceedings. 42nd IEEE Conference on, volume 1, pages 167–172. IEEE, 2003.
[13] René Vidal, Roberto Tron, and Richard Hartley. Multiframe motion segmentation with missing data using PowerFactorization and GPCA. International Journal of Computer Vision, 79(1):85–105, 2008.
[14] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Yi Ma. Robust face recognition via sparse representation. IEEE TPAMI, 31(2):210–227, Feb. 2009.
[15] Allen Y. Yang, John Wright, Yi Ma, and S. Shankar Sastry. Unsupervised segmentation of natural images via lossy data compression. Computer Vision and Image Understanding, 110(2):212–225, 2008.