High-dimensional data are often characterized by rich and diverse information, which enables us to classify or recognize targets more effectively and to analyze data attributes more easily, but which inevitably introduces drawbacks (e.g. information redundancy, complex noise effects, high storage cost) due to the curse of dimensionality. A general way to address this problem is to learn a low-dimensional and highly discriminative feature representation, which is also known as dimensionality reduction or subspace learning. In the past decades, a large number of subspace learning techniques have been developed in the machine learning community, with successful applications to biometrics, image/video analysis, visualization, and hyperspectral data analysis (e.g., dimensionality reduction and unmixing).
These subspace learning techniques are generally categorized into linear or nonlinear methods. Theoretically, nonlinear approaches are capable of capturing curved data structure more effectively. They provide, however, no explicit mapping function (poor explainability), make it relatively hard to embed out-of-sample data into the learned subspace (weak generalization), and incur a high computational cost (lack of cost-effectiveness). Additionally, for multi-label classification, classic subspace learning techniques, such as principal component analysis (PCA), linear discriminant analysis (LDA), local Fisher discriminant analysis (LFDA), manifold learning (e.g. Laplacian eigenmaps (LE), locally linear embedding (LLE)) and their linearized methods (e.g. locality preserving projection (LPP), neighborhood preserving embedding (NPE)), are commonly applied as a separate feature learning step before classification. Their limitation mainly lies in the weak connection between the features obtained by subspace learning and the label space (see the top panel of Fig. 1): it is unknown which learned features (or subspace) can improve the classification.
Recently, a feasible solution to the above problems has been generalized as a joint learning framework that simultaneously considers linearized subspace learning and classification, as illustrated in the middle panel of Fig. 1. Following it, more advanced methods have been proposed and applied in various fields, including supervised dimensionality reduction (e.g. least-squares dimensionality reduction (LSDR) and its variants, such as least-squares quadratic mutual information derivative (LSQMID)), multi-modal data matching and retrieval [28, 27], and heterogeneous feature learning for activity recognition [15, 16]. In these works, the learned features (or subspace) and the label information are effectively connected by regression techniques (e.g. linear regression) to adaptively estimate a latent and discriminative subspace. Despite this, they still fail to find an optimal subspace, as a single linear projection is hardly enough to represent the complex transformation from the original data space to the potential optimal subspace.
Motivated by the aforementioned studies, we propose a novel joint and progressive learning strategy (J-Play) to linearly find an optimal subspace for general multi-label classification, as illustrated in the bottom panel of Fig. 1. We extend the existing joint learning framework by learning a series of subspaces instead of a single subspace, aiming to progressively convert the original data space into a potentially optimal subspace through multi-coupled intermediate transformations. Theoretically, by increasing the number of subspaces, the variations between coupled subspaces are gradually narrowed down to a very small range that can be represented effectively by a linear transformation. This makes it easier to find a good solution, especially when the model is complex and non-convex. We also contribute to structure learning in each latent subspace by locally embedding the manifold structure.
The main highlights of our work can be summarized as follows:
A linearized progressive learning strategy is proposed to describe the variations from the original data space to a potentially optimal subspace, tending to find a better solution.
A joint learning framework that simultaneously estimates subspace projections (connecting the original space and the latent subspaces) and a property-labeled projection (connecting the learned latent subspaces and the label space) is considered to find a discriminative subspace where samples are expected to be better classified.
Structure learning with local manifold regularization is performed in each latent subspace.
Based on the above techniques, a novel joint and progressive learning strategy (J-Play) is developed for multi-label classification.
An iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model.
2 Joint & Progressive Learning Strategy (J-Play)
Let be a data matrix with dimensions and samples, and the matrix of corresponding class labels be . The th column of is whose each element can be defined as follows:
In our task, we aim to learn a set of coupled projections and a property-labeled projection , where stands for the number of subspace projections and are defined as the dimensions of those latent subspaces respectively, while is specified as the dimension of .
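As a minimal sketch of the label encoding described above, the following snippet builds a binary label matrix with one column per sample. The names `labels`, `Y` and `one_hot_labels`, and the convention that an entry is 1 when a sample belongs to the corresponding class and 0 otherwise, are assumptions made for illustration, since the original notation is not reproduced in this text.

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Return a (num_classes x N) binary matrix; column k encodes sample k."""
    labels = np.asarray(labels)
    Y = np.zeros((num_classes, labels.size))
    # Set a single 1 per column, in the row of the sample's class index.
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

# Toy example (hypothetical data): four samples drawn from three classes.
labels = np.array([0, 2, 1, 2])
Y = one_hot_labels(labels, num_classes=3)
```

Storing samples as columns keeps the label matrix aligned with a data matrix whose columns are samples, which is the layout the later sketches also assume.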
2.2 Basic Framework of J-Play from the View of Subspace Learning
Subspace learning is to find a low-dimensional space where we expect to maximize certain properties of the original data, e.g. variance (PCA), discriminative ability (LDA), and graph structure (manifold learning). Yan et al. summarized these subspace learning methods in a general graph embedding framework.
Given an undirected similarity graph with the vertices and the adjacency matrix , we can intuitively measure the similarities among the data. By preserving these similarity relationships, the high-dimensional data can be well embedded into the low-dimensional space, which can be formulated by denoting the low-dimensional data representation as () in the following
where is a diagonal matrix, is a Laplacian matrix defined by , and is the identity matrix. In our case, we aim at learning multi-coupled linear projections to find an optimal mapping; therefore, a linearized subspace learning problem can be reformulated on the basis of Eq. (2) by substituting for
which can be solved by generalized eigenvalue decomposition.
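The linearized graph embedding step can be illustrated with a small NumPy sketch. It assumes the standard graph-embedding form (minimize tr(Θ X L Xᵀ Θᵀ) subject to Θ X D Xᵀ Θᵀ = I, with L = D − W and D the diagonal degree matrix), a simple 0/1 k-nearest-neighbour adjacency, and a small ridge term for numerical stability. The names `knn_adjacency`, `linear_graph_embedding`, `k` and `ridge` are hypothetical and not taken from the paper.

```python
import numpy as np

def knn_adjacency(X, k=5):
    """Symmetric 0/1 k-nearest-neighbour adjacency for the columns of X (d x N)."""
    sq = np.sum(X**2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # squared distances
    np.fill_diagonal(dist, np.inf)                     # exclude self-matches
    idx = np.argsort(dist, axis=1)[:, :k]
    N = X.shape[1]
    W = np.zeros((N, N))
    W[np.repeat(np.arange(N), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                          # make the graph undirected

def linear_graph_embedding(X, W, dim, ridge=1e-6):
    """Solve (X L X^T) a = lambda (X D X^T + ridge I) a for the smallest eigenvalues."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + ridge * np.eye(X.shape[0])
    C = np.linalg.cholesky(B)                          # B = C C^T
    Cinv = np.linalg.inv(C)
    M = Cinv @ A @ Cinv.T                              # equivalent standard problem
    M = (M + M.T) / 2.0                                # remove round-off asymmetry
    _, vecs = np.linalg.eigh(M)                        # eigenvalues in ascending order
    Theta = (Cinv.T @ vecs[:, :dim]).T                 # rows are projection directions
    return Theta, B

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 30))                       # toy data: 4 features, 30 samples
W = knn_adjacency(X, k=5)
Theta, B = linear_graph_embedding(X, W, dim=2)
Z = Theta @ X                                          # 2 x 30 embedding
```

Reducing the generalized problem to a standard symmetric one through a Cholesky factor is one common choice; a dedicated generalized eigensolver would serve equally well.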
Different from the previously mentioned subspace learning methods, a regression-based joint learning model can explicitly bridge the learned latent subspace and the labels, which can be formulated in a general form:
where is the error term defined as , represents the Frobenius norm, and are the corresponding penalty parameters. and denote regularization functions, which might be norm, norm, norm or manifold regularization. Herein, the variable is called the intermediate transformation, and the corresponding subspace generated by is called the latent subspace, where the features can be further structurally learned and represented in a more suitable way.
On the basis of Eq. (5), we further extend the framework by following a progressive learning strategy:
where is specified as and represent a set of intermediate transformations.
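For orientation, the step from a single projection to the progressive strategy can be sketched in LaTeX as follows. The symbols are assumptions introduced here because the original notation is not reproduced in this text: $X$ is the data matrix, $Y$ the label matrix, $P$ the property-labeled projection, $\Theta_1,\dots,\Theta_m$ the intermediate transformations, $\Phi$ and $\Psi$ the regularization functions, and $\beta,\gamma$ the penalty parameters. The factors of $\tfrac{1}{2}$ are a convention rather than a confirmed detail.

```latex
\begin{aligned}
\text{single projection:}\quad
 & \min_{P,\,\Theta}\ \tfrac{1}{2}\,\lVert Y - P\,\Theta X \rVert_F^2
   + \tfrac{\beta}{2}\,\Phi(\Theta) + \tfrac{\gamma}{2}\,\Psi(P), \\
\text{progressive:}\quad
 & \min_{P,\,\{\Theta_l\}_{l=1}^{m}}\ \tfrac{1}{2}\,\lVert Y - P\,\Theta_m \cdots \Theta_1 X \rVert_F^2
   + \tfrac{\beta}{2}\,\Phi\bigl(\{\Theta_l\}_{l=1}^{m}\bigr) + \tfrac{\gamma}{2}\,\Psi(P).
\end{aligned}
```

Under these assumptions the only structural change is that the single transformation is replaced by a product of transformations, each mapping one latent subspace to the next.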
2.3 Problem Formulation
Following the general framework given in Eq.(6), the proposed J-Play can be formulated as the following constrained optimization problem:
where is assigned to , while , , and are three penalty parameters that balance the importance of the different terms. Fig. 2 illustrates the J-Play framework. Since Eq. (7) is a typical ill-posed problem, reasonable assumptions or priors need to be introduced to search effectively for a solution in a narrowed range. More specifically, we cast Eq. (7) as a least-squares regression problem with a reconstruction loss term (), a prediction loss term () and two regularization terms ( and ). We detail these terms one by one as follows.
1) Reconstruction Loss Term: Without any constraints or priors, directly estimating multi-coupled projections in J-Play becomes increasingly difficult as the number of estimated projections grows. This can be explained by vanishing gradients between neighboring variables during optimization: the variations between neighboring projections become tiny or even zero. In particular, when the number of projections increases beyond a certain extent, most of the learned projections tend to zero and become meaningless. To this end, we adopt an autoencoder-like scheme that requires the learned subspace to be projected back to the original space as faithfully as possible. The benefits of this scheme are, on the one hand, to prevent over-fitting to some extent, especially by keeping excessive noise from being modeled, and, on the other hand, to establish an effective link between the original space and the subspace, making the learned subspace more meaningful. Therefore, the resulting expression is
In our case, to fully utilize the advantages of this term, we consider it in each latent subspace as shown in Eq.(8).
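The autoencoder-like term applied in each latent subspace can be sketched as below. This is a minimal illustration under the assumption that each layer decodes with the transpose of its own projection (tied weights), so that the per-layer loss is the squared Frobenius norm of X_{l-1} − Θ_lᵀ Θ_l X_{l-1}. The function name `reconstruction_loss` and the toy projections are hypothetical.

```python
import numpy as np

def reconstruction_loss(X, thetas):
    """Sum over layers of ||X_{l-1} - Theta_l^T Theta_l X_{l-1}||_F^2,
    with X_0 = X and X_l = Theta_l X_{l-1} (samples stored as columns)."""
    loss, X_prev = 0.0, X
    for Theta in thetas:
        X_next = Theta @ X_prev                     # encode into the next subspace
        loss += np.linalg.norm(X_prev - Theta.T @ X_next, "fro") ** 2
        X_prev = X_next                             # the next layer starts from here
    return loss

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
# Toy projections that simply keep the leading coordinates (5 -> 3 -> 2).
thetas = [np.eye(5)[:3], np.eye(3)[:2]]
loss = reconstruction_loss(X, thetas)
```

With these coordinate-selecting projections the loss equals the energy of the discarded coordinates, which makes the behaviour of the term easy to verify by hand.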
2) Prediction Loss Term: This term minimizes the empirical risk between the original data and the corresponding labels through multi-coupled projections in a progressive way, which can be formulated as
3) Local Manifold Regularization: As introduced in the literature, a manifold structure is an important prior for subspace learning. Compared with vector-based feature learning, such as artificial neural networks (ANN), a manifold structure can effectively capture the intrinsic structure between samples. To facilitate structure learning in J-Play, we apply local manifold regularization to each latent subspace. Specifically, this term can be expressed by
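A graph-Laplacian regularizer of this kind has a well-known pairwise interpretation, which the following sketch checks numerically. It assumes the regularizer takes the standard form tr(Z L Zᵀ) with L = D − W for embedded samples Z stored as columns and a symmetric adjacency W. The function names and the random toy graph are hypothetical.

```python
import numpy as np

def manifold_regularizer(Z, W):
    """tr(Z L Z^T) with L = D - W, for embedded samples stored as columns of Z."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(Z @ L @ Z.T)

def pairwise_form(Z, W):
    """Equivalent form for symmetric W: 0.5 * sum_ij W_ij ||z_i - z_j||^2."""
    diff = Z[:, :, None] - Z[:, None, :]            # shape (dim, N, N)
    return 0.5 * np.sum(W * np.sum(diff**2, axis=0))

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 6))                     # 6 samples in a 3-D subspace
W = rng.random((6, 6))
W = (W + W.T) / 2.0                                 # symmetric similarity weights
np.fill_diagonal(W, 0.0)
value = manifold_regularizer(Z, W)
```

The pairwise form shows why the term preserves local structure: samples connected by a large weight are penalized for lying far apart in the latent subspace.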
4) Regression Coefficient Regularization: This regularization term helps us derive a more reasonable solution with reliable generalization, and can be written as
Moreover, a non-negativity constraint on each learned dimension-reduced feature (e.g. ) is considered, since we aim to obtain a meaningful low-dimensional feature representation similar to the original image data, which are acquired in non-negative units. In addition to the non-negativity constraint, we also impose a norm constraint on each sample of every subspace: and .
2.4 Model Optimization
Considering the complexity and non-convexity of our model, we pretrain it to obtain an initial approximation of the subspace projections, as this can greatly reduce the training time and make it easier to find an optimal solution. This is a common tactic that has been successfully employed in deep autoencoders. Inspired by this trick, we propose a pre-training model with respect to by simplifying Eq. (7) as
which is named auto-reconstructing unsupervised learning (AutoRULe). Given the outputs of AutoRULe, the problem of Eq. (7) can be solved more effectively by an alternating minimization strategy that separately solves two subproblems with respect to and . The global algorithm of J-Play is summarized in Algorithm 1, where AutoRULe is initialized by LPP.
The pre-training method (AutoRULe) can be effectively solved via the ADMM-based framework. Following this, we consider an equivalent form of Eq. (12) by introducing multiple auxiliary variables , , and to replace , , and , respectively, where denotes an operator that converts each component of the matrix to its absolute value and is a proximal operator for solving the constraint of , written as follows
The augmented Lagrangian version of Eq. (13) is
where are Lagrange multipliers and is the penalty parameter. The two terms and represent two kinds of projection operators, respectively. That is, is defined as
while is a vector-based operator defined by
where is the th column of matrix . Algorithm 2 details the procedures of AutoRULe.
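The two constraint-handling operators can be sketched as simple projections. This sketch assumes that they correspond to the non-negativity constraint and to the per-sample norm constraint of at most one, respectively. Note that it uses element-wise clipping at zero, which is the Euclidean projection onto the non-negative orthant and is a swapped-in choice that differs from the absolute-value operator mentioned in the text. The function names are hypothetical.

```python
import numpy as np

def project_nonnegative(Q):
    """Euclidean projection onto the non-negative orthant (element-wise clipping)."""
    return np.maximum(Q, 0.0)

def project_unit_columns(S):
    """Rescale every column whose l2 norm exceeds 1 back onto the unit ball;
    columns already inside the ball are left unchanged."""
    norms = np.linalg.norm(S, axis=0)
    return S / np.maximum(norms, 1.0)

Q = np.array([[1.5, -0.2], [-3.0, 0.4]])
S = np.array([[3.0, 0.3], [4.0, 0.4]])              # first column has norm 5
Q_proj = project_nonnegative(Q)
S_proj = project_unit_columns(S)
```

In an ADMM-style scheme, projections like these would typically be applied to the auxiliary variables after each update, so that the constraints are enforced without complicating the smooth subproblems.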
Algorithm 2: Auto-reconstructing unsupervised learning (AutoRULe)
The two subproblems in Algorithm 1 can be optimized alternately as follows:
Optimization with respect to : This is a typical least-squares regression problem, which can be written as
which has a closed-form solution
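A regularized least-squares subproblem of this type admits a ridge-regression solution, sketched below. The sketch assumes the objective is (α/2)‖Y − P V‖²_F + (γ/2)‖P‖²_F, where V stands for the data after all intermediate transformations have been applied. The symbols `P`, `V`, `alpha`, `gamma` and the function `solve_P` are assumptions for illustration.

```python
import numpy as np

def solve_P(Y, V, alpha, gamma):
    """Minimize alpha/2 ||Y - P V||_F^2 + gamma/2 ||P||_F^2 over P.
    Setting the gradient to zero gives P (alpha V V^T + gamma I) = alpha Y V^T."""
    d = V.shape[0]
    A = alpha * V @ V.T + gamma * np.eye(d)         # symmetric positive definite
    B = alpha * Y @ V.T
    return np.linalg.solve(A, B.T).T                # P = B A^{-1} without an explicit inverse

rng = np.random.default_rng(0)
V = rng.standard_normal((3, 20))                    # latent features, 20 samples
Y = rng.standard_normal((2, 20))                    # toy targets with 2 outputs
alpha, gamma = 1.0, 0.1
P = solve_P(Y, V, alpha, gamma)
```

Solving the linear system directly rather than forming the inverse is a numerical-stability choice; both routes give the same minimizer in exact arithmetic.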
Optimization with respect to : The variables can be individually optimized, and hence the optimization problem of each can be generally formulated by
which can be solved by following the framework of Algorithm 2. The only difference lies in the optimization subproblem with respect to whose solution can be obtained by solving the following problem:
The analytical solution of Eq. (20) is given by
Finally, we repeat these optimization procedures until a stopping criterion is satisfied. Please refer to Algorithm 1 and Algorithm 2 for more explicit steps.
In this section, we conduct classification experiments to quantitatively evaluate the performance of the proposed method (J-Play) using three popular and advanced classifiers, namely the nearest neighbor (NN) classifier based on the Euclidean distance, kernel support vector machines (KSVM) and canonical correlation forest (CCF), in comparison with previous state-of-the-art methods. Overall accuracy (OA) is used to quantify the classification performance.
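The simplest of the three evaluation pipelines, a nearest-neighbour classifier scored by overall accuracy, can be sketched as follows. The sketch assumes samples are stored as columns and that OA is the fraction of correctly labeled test samples. The function names and the toy data are hypothetical.

```python
import numpy as np

def nn_predict(train_X, train_y, test_X):
    """1-nearest-neighbour prediction with Euclidean distance (samples as columns)."""
    d2 = (np.sum(test_X**2, axis=0)[:, None]
          + np.sum(train_X**2, axis=0)[None, :]
          - 2.0 * test_X.T @ train_X)               # (n_test x n_train) squared distances
    return np.asarray(train_y)[np.argmin(d2, axis=1)]

def overall_accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy data: two well-separated classes in two dimensions.
train_X = np.array([[0.0, 0.0, 5.0, 5.0],
                    [0.0, 1.0, 5.0, 6.0]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.2, 5.1],
                   [0.1, 5.4]])
pred = nn_predict(train_X, train_y, test_X)
oa = overall_accuracy([0, 1], pred)
```

In the experimental protocol described here, the same routine would be applied to the features produced by each dimensionality reduction method rather than to the raw inputs.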
3.1 Data Description
The experiments are performed on two different types of datasets, hyperspectral datasets and face datasets, as both easily suffer from information redundancy and require features with improved representative ability. We have used the following two hyperspectral datasets and two face datasets:
1) Indian Pines AVIRIS Image: The first hyperspectral cube was acquired by the AVIRIS sensor with the size of , which consists of class of vegetation. More specific classes and the arrangement of training and test samples can be found in . The first image of Fig. 3 shows a false color image of Indian Pines data.
2) University of Houston Image: The second hyperspectral cube was provided for the IEEE GRSS data fusion contest acquired by ITRES-CASI sensor with size of . The information regarding classes and corresponding train and test samples can be found in . A false color image of the study scene is shown in the first image of Fig. 4.
3) Extended Yale-B Dataset: We only choose a subset of this dataset with the frontal pose and different illuminations of subjects ( images in total), which is widely used in evaluating the performance of subspace learning. These images were aligned and cropped to the size of , that is, a -dimensional vector-based representation. Each individual has near-frontal images under different illuminations.
4) AR Dataset: Similar to , we choose a subset of AR under varying illumination and expressions, which comprises subjects. Each person has images, with seven from Session as the training set and the others from Session as testing samples. The images are resized to .
3.2 Experimental Setup
As fixed training and testing samples are given for the hyperspectral datasets, subspace learning techniques can be performed directly on the training set to learn an optimal subspace in which the testing set can be simply classified by NN, KSVM, and CCF. For the face datasets, since there are no standard training and testing sets, ten replications are performed with randomly selected training and testing samples. A random subset with facial images per individual is chosen with labels as the training set, and the rest is used as the testing set. Furthermore, we compare the performance of the proposed method (J-Play) with the baseline (original features without dimensionality reduction) and six popular and advanced methods (PCA, LPP, LDA, LFDA, LSDR, and LSQMID). By learning different numbers of coupled projections, the proposed method can be successively specified as J-Play,…,J-Play,…,J-Play, . To investigate the trend of OAs, are uniformly set up to on the four datasets.
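The random splitting protocol for the face datasets can be sketched as below. The number of training images per individual is not given in this text, so `n_train=4` and the toy label vector are placeholder values chosen purely for illustration. The function name `split_per_class` is hypothetical.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Pick n_train random samples per class for training; the rest form the test set."""
    labels = np.asarray(labels)
    train_idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(members, size=n_train, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

# Toy data: 3 individuals with 10 images each (placeholder sizes).
labels = np.repeat(np.arange(3), 10)
rng = np.random.default_rng(0)
# Ten replications with independent random draws.
splits = [split_per_class(labels, n_train=4, rng=rng) for _ in range(10)]
```

Accuracies from the ten replications would then typically be averaged, so that the reported figure does not depend on one particular random draw.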
Table 1 (column headers): Methods | Indian Pines dataset | Houston dataset
3.3 Results of Hyperspectral Data
Initially, we conduct a -fold cross-validation for the different algorithms on the training set in order to estimate the optimal parameters which can be selected from . Table 1 lists classification performances of the different methods with the optimal subspace dimensions obtained by cross-validation using three different classifiers. Correspondingly, the classification maps are given in Figs. 3 and 4 to intuitively highlight the difference.
Overall, PCA performs similarly to the baseline with the three different classifiers on the two datasets. LPP, owing to its sensitivity to noise, yields poor performance on the first dataset, while on the relatively high-quality second dataset it steadily outperforms the baseline and PCA. Among the supervised algorithms, owing to the limited number of training samples and limited discriminative power, the classification accuracies of classic LDA are overall lower than those of the previously mentioned methods. With a more powerful discriminative criterion, LFDA obtains more competitive results by locally focusing on discriminative information, and these results are generally better than those of the baseline, PCA, LPP, and LDA. However, the features learned by LFDA are sensitive to noise and to the number of neighbors, resulting in unstable performance, particularly across the different classifiers. LSDR and LSQMID aim to find a linear projection by maximizing the mutual information between input and output from a statistical viewpoint. By fully considering the mutual information, they achieve good performance on the two hyperspectral datasets.
Remarkably, the performance of the proposed method (J-Play) is superior to that of the other methods on the two hyperspectral datasets. This indicates that J-Play tends to learn a better feature representation and is robust against noise. Moreover, with the increase of , the performance of J-Play steadily increases to its best with around or layers for the first dataset and or layers for the second one, and then gradually decreases with slight perturbations, since our model is only trained on the training set.
Table 2 (column headers): Methods | Extended Yale-B dataset | AR dataset
3.4 Results of Face Images
As J-Play is proposed as a general subspace learning framework for multi-label classification, we additionally use two popular face datasets to further assess its generalization capability. Similarly, cross-validation on the training set is conducted to estimate the optimal parameter combination on the Extended Yale-B and AR datasets. Considering the high dimensionality of the vectorized face images, we first apply PCA to the face images to roughly reduce the feature redundancy, and the results are then passed to the dimensionality reduction methods, following previous work on face recognition (e.g. LDA (Fisherfaces) and LPP (Laplacianfaces)). Table 2 gives the corresponding OAs of the different methods on the two face datasets.
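The PCA preprocessing step applied to the face images can be sketched with a singular value decomposition. The sketch assumes samples are stored as columns and that the projection is fitted on the training set only and then reused for the test set. The retained dimension and the function names `pca_fit` and `pca_transform` are hypothetical.

```python
import numpy as np

def pca_fit(X, dim):
    """Fit PCA on X (d x N, samples as columns).
    Returns the mean (d x 1) and a (dim x d) projection with orthonormal rows."""
    mean = X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :dim].T                       # leading principal directions

def pca_transform(X, mean, components):
    """Center with the training mean, then project onto the principal directions."""
    return components @ (X - mean)

rng = np.random.default_rng(0)
X_train = rng.standard_normal((10, 40))             # toy stand-in for vectorized images
X_test = rng.standard_normal((10, 15))
mean, components = pca_fit(X_train, dim=3)
Z_train = pca_transform(X_train, mean, components)
Z_test = pca_transform(X_test, mean, components)
```

Fitting on the training set alone keeps the test samples out of the preprocessing, which matches the cross-validation setup described for the parameter search.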
By comparison, the performance of PCA and LPP is steadily superior to that of the baseline, and PCA is even better than LPP. Among the supervised approaches, LDA performs better than the baseline, PCA, LPP and even LFDA, showing an impressive result. Owing to the smaller number of training samples in the face datasets, LSDR and LSQMID are limited in their ability to estimate the mutual information between the training samples and labels, resulting in performance degradation compared to the hyperspectral data. The proposed method outperforms the other algorithms, which indicates that it can effectively learn an optimal mapping from the original space to the label space, further improving the classification accuracy. Likewise, the proposed method follows a similar trend with the increase of : J-Play obtains its best OAs with around or layers, and more layers lead to performance degradation. We also visualize each column of the learned projection, as shown in Fig. 5, where high-level or semantically meaningful features, i.e. face features under different poses and illuminations, are learned well, making the faces easier to identify.
To effectively find an optimal subspace where the samples can be semantically represented and thereby better classified or recognized, we proposed a novel linearized subspace learning framework (J-Play), which learns the feature representation from high-dimensional data in a joint and progressive way. Extensive multi-label classification experiments are conducted on two types of datasets, hyperspectral images and face images, in comparison with previously proposed state-of-the-art methods. The promising results obtained with J-Play demonstrate its superiority and effectiveness. In the future, we will build a unified framework based on J-Play by extending it to semi-supervised learning, transfer learning, or multi-task learning.
- [1] Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
- [2] Cai, D., He, X., Han, J.: Spectral regression: A unified approach for sparse subspace learning. In: International Conference on Data Mining (ICDM). pp. 73–82 (2007)
- [3] Chung, F.R.K.: Spectral graph theory. American Mathematical Society (1997)
- [4] He, X., Cai, D., Yan, S., Zhang, H.J.: Neighborhood preserving embedding. In: International Conference on Computer Vision (ICCV). vol. 2, pp. 1208–1213 (2005)
- [5] He, X., Hu, S., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(3), 328–340 (2005)
- [6] He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems (NIPS). pp. 153–160 (2004)
- [7] Heide, F., Heidrich, W., Wetzstein, G.: Fast and flexible convolutional sparse coding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5135–5143 (2015)
- [8] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
- [9] Hong, D., Liu, W., Su, J., Pan, Z., Wang, G.: A novel hierarchical approach for multispectral palmprint recognition. Neurocomputing 151, 511–521 (2015)
- [10] Hong, D., Liu, W., Wu, X., Pan, Z., Su, J.: Robust palmprint recognition based on the fast variation Vese–Osher model. Neurocomputing 174, 999–1012 (2016)
- [11] Hong, D., Yokoya, N., Zhu, X.: The k-LLE algorithm for nonlinear dimensionality reduction of large-scale hyperspectral data. In: IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). pp. 1–5. IEEE (2016)
- [12] Hong, D., Yokoya, N., Zhu, X.: Local manifold learning with robust neighbors selection for hyperspectral dimensionality reduction. In: IEEE International Conference on Geoscience and Remote Sensing Symposium (IGARSS). pp. 40–43. IEEE (2016)
- [13] Hong, D., Yokoya, N., Zhu, X.: Learning a robust local manifold representation for hyperspectral dimensionality reduction. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS) 10(6), 2960–2975 (2017)
- [14] Hong, D., Yokoya, N., Chanussot, J., Zhu, X.X.: Learning a low-coherence dictionary to address spectral variability for hyperspectral unmixing. In: Image Processing (ICIP), 2017 IEEE International Conference on. pp. 235–239. IEEE (2017)
- [15] Hu, J., Zheng, W., Lai, J., Zhang, J.: Jointly learning heterogeneous features for RGB-D activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5344–5352 (2015)
- [16] Hu, J., Zheng, W., Lai, J., Zhang, J.: Jointly learning heterogeneous features for RGB-D activity recognition (2016)
- [17] Ji, S., Ye, J.: Linear dimensionality reduction for multi-label classification. In: International Joint Conference on Artificial Intelligence (IJCAI). vol. 9, pp. 1077–1082 (2009)
- [18] Kan, M., Shan, S., Chang, H., Chen, X.: Stacked progressive auto-encoders (SPAE) for face recognition across poses. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1883–1890 (2014)
- [19] Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: Advances in Neural Information Processing Systems (NIPS). pp. 801–808 (2007)
- [20] Martínez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(2), 228–233 (2001)
- [21] Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
- [22] Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research (JMLR) 4, 119–155 (2003)
- [23] Sugiyama, M.: Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research (JMLR) 8, 1027–1061 (2007)
- [24] Suzuki, T., Sugiyama, M.: Sufficient dimension reduction via squared-loss mutual information estimation. Neural Computation 25(3), 725–758 (2013)
- [25] Tangkaratt, V., Sasaki, H., Sugiyama, M.: Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. Neural Computation 29(8), 2076–2122 (2017)
- [26] Tosato, D., Farenzena, M., Spera, M., Murino, V., Cristani, M.: Multi-class classification on Riemannian manifolds for video surveillance. In: European Conference on Computer Vision (ECCV). pp. 378–391 (2010)
- [27] Wang, K., He, R., Wang, L., Wang, W., Tan, T.: Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2010–2023 (2016)
- [28] Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: International Conference on Computer Vision (ICCV). pp. 2088–2095 (2013)
- [29] Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2(1), 37–52 (1987)
- [30] Yan, S., Xu, D., Zhang, B., Zhang, H.J., Yang, Q., Lin, S.: Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(1), 40–51 (2007)
- [31] Yang, M., Zhang, L., Yang, J., Zhang, D.: Robust sparse coding for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 625–632 (2011)
- [32] Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: which helps face recognition? In: International Conference on Computer Vision (ICCV). pp. 471–478 (2011)