1 Introduction
High-dimensional data are often characterized by rich and diverse information, which enables us to classify or recognize targets more effectively and analyze data attributes more easily, but inevitably introduces drawbacks (e.g., information redundancy, complex noise effects, high storage cost) due to the curse of dimensionality. A general way to address this problem is to learn a low-dimensional and highly discriminative feature representation, commonly referred to as dimensionality reduction or subspace learning. In the past decades, a large number of subspace learning techniques have been developed in the machine learning community, with successful applications to biometrics
[20, 5, 9, 10], image/video analysis [26], visualization [22], and hyperspectral data analysis (e.g., dimensionality reduction and unmixing) [12, 13, 14]. These subspace learning techniques can generally be categorized into linear and nonlinear methods. Theoretically, nonlinear approaches are capable of capturing the curved structure of the data more effectively. There is, however, no explicit mapping function (poor explainability); meanwhile, it is relatively hard to embed out-of-sample data into the learned subspace (weak generalization), and the computational cost is high (lack of cost-effectiveness). Additionally, for a multi-label classification task, classic subspace learning techniques, such as principal component analysis (PCA)
[29], linear discriminant analysis (LDA) [20], local Fisher discriminant analysis (LFDA) [23], manifold learning (e.g., Laplacian eigenmaps (LE) [1] and locally linear embedding (LLE) [21]) and their linearized variants (e.g., locality preserving projections (LPP) [6] and neighborhood preserving embedding (NPE) [4]), are commonly applied as a disjoint feature learning step before classification. Their limitation mainly lies in a weak connection between the features produced by subspace learning and the label space (see the top panel of Fig. 1); it is unknown which learned features (or subspace) can improve the classification.

Recently, a feasible solution to the above problems has been generalized as a joint learning framework [17] that simultaneously considers linearized subspace learning and classification, as illustrated in the middle panel of Fig. 1. Following it, more advanced methods have been proposed and applied in various fields, including supervised dimensionality reduction (e.g., least-squares dimensionality reduction (LSDR) [24] and its variants, such as the least-squares quadratic mutual information derivative (LSQMID) [25]), multi-modal data matching and retrieval [28, 27], and heterogeneous feature learning for activity recognition [15, 16]. In these works, the learned features (or subspace) and the label information are effectively connected by regression techniques (e.g., linear regression) to adaptively estimate a latent and discriminative subspace. Despite this, they still fail to find an optimal subspace, as a single linear projection is hardly enough to represent the complex transformation from the original data space to the potentially optimal subspace.
Motivated by the aforementioned studies, we propose a novel joint and progressive learning strategy (J-Play) to linearly find an optimal subspace for general multi-label classification, as illustrated in the bottom panel of Fig. 1. We practically extend the existing joint learning framework by learning a series of subspaces instead of a single subspace, aiming at progressively converting the original data space to a potentially optimal subspace through multi-coupled intermediate transformations [18]. Theoretically, by increasing the number of subspaces, the coupled subspace variations are gradually narrowed down to a very small range that can be represented effectively via a linear transformation. This makes it easier to find a good solution, especially when the model is complex and non-convex. We also contribute to structure learning in each latent subspace by locally embedding the manifold structure.
The main highlights of our work can be summarized as follows:


A linearized progressive learning strategy is proposed to describe the variations from the original data space to a potentially optimal subspace, tending to find a better solution.

A joint learning framework that simultaneously estimates subspace projections (connecting the original space and the latent subspaces) and a property-labeled projection (connecting the learned latent subspaces and the label space) is considered to find a discriminative subspace where samples are expected to be better classified.

Structure learning with local manifold regularization is performed in each latent subspace.

Based on the above techniques, a novel joint and progressive learning strategy (J-Play) is developed for multi-label classification.

An iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model.
2 Joint & Progressive Learning Strategy (J-Play)
2.1 Notations
Let $\mathbf{X} \in \mathbb{R}^{d_0 \times N}$ be a data matrix with $d_0$ dimensions and $N$ samples, and let $\mathbf{Y} \in \{0,1\}^{L \times N}$ be the matrix of corresponding class labels, where $L$ is the number of classes. The $j$-th column of $\mathbf{Y}$ is $\mathbf{y}_j$, whose each element can be defined as follows:

$$\mathbf{Y}_{i,j} = \begin{cases} 1, & \text{if } \mathbf{x}_j \text{ belongs to the } i\text{-th class}, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$
In our task, we aim to learn a set of coupled projections $\{\mathbf{\Theta}_l\}_{l=1}^{m}$ with $\mathbf{\Theta}_l \in \mathbb{R}^{d_l \times d_{l-1}}$ and a property-labeled projection $\mathbf{P} \in \mathbb{R}^{L \times d_m}$, where $m$ stands for the number of subspace projections and $\{d_l\}_{l=1}^{m}$ are defined as the dimensions of the latent subspaces, respectively, while $L$ is the dimension of $\mathbf{Y}$.
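As a minimal illustration of the label matrix in Eq. (1), the one-hot encoding can be sketched in a few lines of numpy (the function and variable names here are our own, not from the paper):

```python
import numpy as np

def one_hot_labels(classes, num_classes):
    """Build the L x N label matrix Y of Eq. (1): Y[i, j] = 1
    iff sample j belongs to class i, and 0 otherwise."""
    Y = np.zeros((num_classes, len(classes)))
    Y[classes, np.arange(len(classes))] = 1
    return Y

# Four samples with class labels 0, 2, 1, 2 -> a 3 x 4 label matrix.
Y = one_hot_labels([0, 2, 1, 2], num_classes=3)
```

Each column of the resulting matrix is the one-hot label vector of one sample.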
2.2 Basic Framework of J-Play from the View of Subspace Learning
Subspace learning aims to find a low-dimensional space in which certain properties of the original data are maximized, e.g., variance (PCA), discriminative ability (LDA), or graph structure (manifold learning). Yan et al. [30] summarized these subspace learning methods in a general graph embedding framework. Given an undirected similarity graph $\mathcal{G} = \{\mathbf{X}, \mathbf{W}\}$ with vertices $\mathbf{X}$ and adjacency matrix $\mathbf{W}$, we can intuitively measure the similarities among the data. By preserving these similarity relationships, the high-dimensional data can be well embedded into the low-dimensional space. Denoting the low-dimensional data representation as $\mathbf{V} \in \mathbb{R}^{d \times N}$ ($d \ll d_0$), this can be formulated as
$$\min_{\mathbf{V}} \operatorname{tr}(\mathbf{V} \mathbf{L} \mathbf{V}^{\mathrm{T}}) \quad \text{s.t.} \quad \mathbf{V} \mathbf{D} \mathbf{V}^{\mathrm{T}} = \mathbf{I}, \qquad (2)$$
where $\mathbf{D}$ is a diagonal degree matrix and $\mathbf{L} = \mathbf{D} - \mathbf{W}$ is the Laplacian matrix defined by [3], while $\mathbf{I}$ is the identity matrix. In our case, we aim at learning multi-coupled linear projections to find an optimal mapping; therefore, a linearized subspace learning problem can be reformulated on the basis of Eq. (2) by substituting $\mathbf{\Theta} \mathbf{X}$ for $\mathbf{V}$:

$$\min_{\mathbf{\Theta}} \operatorname{tr}(\mathbf{\Theta} \mathbf{X} \mathbf{L} \mathbf{X}^{\mathrm{T}} \mathbf{\Theta}^{\mathrm{T}}) \quad \text{s.t.} \quad \mathbf{\Theta} \mathbf{X} \mathbf{D} \mathbf{X}^{\mathrm{T}} \mathbf{\Theta}^{\mathrm{T}} = \mathbf{I}, \qquad (3)$$

which can be solved by generalized eigenvalue decomposition.
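The generalized eigenvalue decomposition behind Eq. (3) can be sketched in numpy via a Cholesky reduction to a standard symmetric eigenproblem. This is only an illustrative sketch under our own naming conventions (including the small diagonal regularizer, which is our assumption for numerical stability, not part of the model):

```python
import numpy as np

def linearized_graph_embedding(X, W, dim):
    """Sketch of Eq. (3): solve the generalized eigenproblem
    X L X^T v = lambda X D X^T v and keep the eigenvectors with the
    smallest eigenvalues. X: (d0, N) data; W: (N, N) symmetric adjacency."""
    D = np.diag(W.sum(axis=1))                   # degree matrix
    Lap = D - W                                  # graph Laplacian L = D - W
    A = X @ Lap @ X.T
    B = X @ D @ X.T + 1e-6 * np.eye(X.shape[0])  # regularized for stability
    # Reduce to a standard symmetric eigenproblem via Cholesky: B = C C^T.
    C = np.linalg.cholesky(B)
    Ci = np.linalg.inv(C)
    eigvals, U = np.linalg.eigh(Ci @ A @ Ci.T)   # ascending eigenvalues
    Theta = (Ci.T @ U[:, :dim]).T                # (dim, d0) projection
    return Theta
```

The returned rows satisfy the constraint of Eq. (2)/(3) in the sense that $\mathbf{\Theta} (\mathbf{X}\mathbf{D}\mathbf{X}^{\mathrm{T}}) \mathbf{\Theta}^{\mathrm{T}} \approx \mathbf{I}$.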
Different from the previously mentioned subspace learning methods, a regression-based joint learning model [17] can explicitly bridge the learned latent subspace and the labels, which can be formulated in a general form:

$$\min_{\mathbf{P}, \mathbf{\Theta}} \frac{1}{2} \lVert \mathbf{E} \rVert_{\mathrm{F}}^{2} + \alpha \Phi(\mathbf{\Theta}) + \beta \Omega(\mathbf{P}), \qquad (4)$$

where $\mathbf{E}$ is the error term defined as $\mathbf{E} = \mathbf{Y} - \mathbf{P} \mathbf{\Theta} \mathbf{X}$, $\lVert \cdot \rVert_{\mathrm{F}}$ represents the Frobenius norm, and $\alpha$ and $\beta$ are the corresponding penalty parameters. $\Phi$ and $\Omega$ denote regularization functions, which might be an $\ell_1$ norm, $\ell_2$ norm, $\ell_{2,1}$ norm, or manifold regularization. Herein, the variable $\mathbf{\Theta}$ is called the intermediate transformation, and the corresponding subspace generated by $\mathbf{\Theta}$ is called the latent subspace, in which the features can be further structurally learned and represented in a more suitable way [16].
On the basis of Eq. (4), we further extend the framework by following a progressive learning strategy:

$$\min_{\mathbf{P}, \{\mathbf{\Theta}_l\}_{l=1}^{m}} \frac{1}{2} \lVert \mathbf{Y} - \mathbf{P} \mathbf{\Theta}_m \cdots \mathbf{\Theta}_1 \mathbf{X} \rVert_{\mathrm{F}}^{2} + \alpha \sum_{l=1}^{m} \Phi(\mathbf{\Theta}_l) + \beta \Omega(\mathbf{P}), \qquad (5)$$

where $\mathbf{X}_l$ is specified as $\mathbf{X}_l = \mathbf{\Theta}_l \mathbf{X}_{l-1}$ with $\mathbf{X}_0 = \mathbf{X}$, and $\{\mathbf{\Theta}_l\}_{l=1}^{m}$ represent a set of intermediate transformations.
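The progressive mapping of Eq. (5) amounts to chaining the coupled projections and then applying the property-labeled projection. A minimal sketch (names are ours):

```python
import numpy as np

def progressive_forward(X, Thetas, P):
    """Apply the chain of coupled projections of Eq. (5):
    X_l = Theta_l @ X_{l-1} with X_0 = X, then predict labels as P @ X_m."""
    X_l = X
    latents = []
    for Theta in Thetas:
        X_l = Theta @ X_l        # map into the next latent subspace
        latents.append(X_l)
    return P @ X_l, latents
```

With identity projections the chain reduces to plain linear regression on the original features, which is the single-projection special case the paper extends.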
2.3 Problem Formulation
Following the general framework given in Eq. (5), the proposed J-Play can be formulated as the following constrained optimization problem:

$$\min_{\mathbf{P}, \{\mathbf{\Theta}_l\}_{l=1}^{m}} \frac{1}{2} \Psi(\{\mathbf{\Theta}_l\}) + \frac{\alpha}{2} \Upsilon(\mathbf{P}, \{\mathbf{\Theta}_l\}) + \frac{\beta}{2} \Phi(\{\mathbf{\Theta}_l\}) + \frac{\gamma}{2} \Omega(\mathbf{P}) \quad \text{s.t.} \quad \mathbf{X}_l = \mathbf{\Theta}_l \mathbf{X}_{l-1} \succeq 0, \; \lVert \mathbf{x}_l^{k} \rVert_2 \le 1, \qquad (6)$$

where $\mathbf{X}_0$ is assigned to $\mathbf{X}$, while $\alpha$, $\beta$, and $\gamma$ are three penalty parameters corresponding to the different terms, which aim at balancing the importance of the terms. Fig. 2 illustrates the J-Play framework. Since Eq. (6) is a typically ill-posed problem, reasonable assumptions or priors need to be introduced to search for a solution effectively in a narrowed range. More specifically, we cast Eq. (6) as a least-squares regression problem with a reconstruction loss term ($\Psi$), a prediction loss term ($\Upsilon$), and two regularization terms ($\Phi$ and $\Omega$). We detail these terms one by one as follows.
1) Reconstruction Loss Term $\Psi$: Without any constraints or priors, directly estimating the multi-coupled projections in J-Play becomes increasingly difficult as the number of estimated projections grows. This can be reasonably explained by missing gradients between two neighboring variables estimated in the process of optimization. That is, the variations between these neighboring projections become tiny and even zero. In particular, when the number of projections increases to a certain extent, most of the learned projections tend to be zero and become meaningless. To this end, we adopt an autoencoder-like scheme that requires the learned subspace to be projected back to the original space as much as possible. The benefits of this scheme are, on the one hand, to prevent overfitting to some extent, especially avoiding excessive noise from being absorbed; on the other hand, to establish an effective link between the original space and the subspace, making the learned subspace more meaningful. Therefore, the resulting expression is

$$\Psi(\{\mathbf{\Theta}_l\}) = \sum_{l=1}^{m} \lVert \mathbf{X}_{l-1} - \mathbf{\Theta}_l^{\mathrm{T}} \mathbf{\Theta}_l \mathbf{X}_{l-1} \rVert_{\mathrm{F}}^{2}. \qquad (7)$$

In our case, to fully utilize the advantages of this term, we consider it in each latent subspace, as shown in Eq. (7).
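The autoencoder-like reconstruction loss of Eq. (7) can be sketched directly in numpy, layer by layer (a minimal sketch with our own naming, not the authors' implementation):

```python
import numpy as np

def reconstruction_loss(X, Thetas):
    """Autoencoder-like loss of Eq. (7): each latent code Theta_l @ X_{l-1}
    should project back (via Theta_l^T) close to its input X_{l-1}."""
    loss, X_l = 0.0, X
    for Theta in Thetas:
        loss += np.linalg.norm(X_l - Theta.T @ (Theta @ X_l), 'fro') ** 2
        X_l = Theta @ X_l        # advance to the next latent subspace
    return loss
```

Note that the loss is exactly zero when every $\mathbf{\Theta}_l$ is orthonormal and square, and grows as the projections discard information.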
2) Prediction Loss Term $\Upsilon$: This term minimizes the empirical risk between the original data and the corresponding labels through the multi-coupled projections in a progressive way, which can be formulated as

$$\Upsilon(\mathbf{P}, \{\mathbf{\Theta}_l\}) = \lVert \mathbf{Y} - \mathbf{P} \mathbf{\Theta}_m \cdots \mathbf{\Theta}_1 \mathbf{X} \rVert_{\mathrm{F}}^{2}. \qquad (8)$$
3) Local Manifold Regularization $\Phi$: As introduced in [27], the manifold structure is an important prior for subspace learning. In contrast to vector-based feature learning, such as that in artificial neural networks (ANNs), a manifold structure can effectively capture the intrinsic structure between samples. To facilitate structure learning in J-Play, we apply local manifold regularization to each latent subspace. Specifically, this term can be expressed as

$$\Phi(\{\mathbf{\Theta}_l\}) = \sum_{l=1}^{m} \operatorname{tr}(\mathbf{\Theta}_l \mathbf{X}_{l-1} \mathbf{L} \mathbf{X}_{l-1}^{\mathrm{T}} \mathbf{\Theta}_l^{\mathrm{T}}). \qquad (9)$$
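For a single latent subspace, the trace term of Eq. (9) can be computed as below; it equals the familiar pairwise form $\tfrac{1}{2}\sum_{i,j} \mathbf{W}_{ij} \lVert \mathbf{v}_i - \mathbf{v}_j \rVert_2^2$ for a symmetric adjacency matrix (the function name is ours):

```python
import numpy as np

def manifold_regularizer(X_prev, Theta, W):
    """Local manifold term of Eq. (9) for one latent subspace:
    tr(Theta X L X^T Theta^T), with L = D - W the graph Laplacian."""
    L = np.diag(W.sum(axis=1)) - W
    V = Theta @ X_prev           # latent representation of the samples
    return np.trace(V @ L @ V.T)
```

The pairwise interpretation makes the regularizer's effect explicit: strongly connected samples are pulled together in each latent subspace.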
4) Regression Coefficient Regularization $\Omega$: This regularization term helps us derive a more reasonable solution with reliable generalization for our model, and can be written as

$$\Omega(\mathbf{P}) = \lVert \mathbf{P} \rVert_{\mathrm{F}}^{2}. \qquad (10)$$
Moreover, a nonnegativity constraint on each learned dimension-reduced feature (i.e., $\mathbf{X}_l \succeq 0$) is considered, since we aim to obtain a meaningful low-dimensional feature representation, similar to original image data acquired in nonnegative units. In addition to the nonnegativity constraint, we also impose an $\ell_2$-norm constraint on each sample-wise column of each subspace: $\lVert \mathbf{x}_l^{k} \rVert_2 \le 1$ (regarding this constraint, please refer to [19] for more details).
2.4 Model Optimization
Considering the complexity and non-convexity of our model, we pretrain it to obtain an initial approximation of the subspace projections, as this can greatly reduce the model's training time and also helps find an optimal solution more easily. This is a common tactic that has been successfully employed in deep autoencoders [8]. Inspired by this trick, we propose a pretraining model with respect to each $\mathbf{\Theta}_l$ by simplifying Eq. (6) as

$$\min_{\mathbf{\Theta}} \frac{1}{2} \lVert \mathbf{X} - \mathbf{\Theta}^{\mathrm{T}} \mathbf{\Theta} \mathbf{X} \rVert_{\mathrm{F}}^{2} + \frac{\beta}{2} \operatorname{tr}(\mathbf{\Theta} \mathbf{X} \mathbf{L} \mathbf{X}^{\mathrm{T}} \mathbf{\Theta}^{\mathrm{T}}) \quad \text{s.t.} \quad \mathbf{\Theta} \mathbf{X} \succeq 0, \; \lVert (\mathbf{\Theta} \mathbf{X})^{k} \rVert_2 \le 1, \qquad (11)$$

which is named auto-reconstructing unsupervised learning (AutoRULe). Given the outputs of AutoRULe, the problem of Eq. (6) can be solved more effectively by an alternating minimization strategy that separately solves two subproblems with respect to $\mathbf{P}$ and $\{\mathbf{\Theta}_l\}$. Therefore, the global algorithm of J-Play can be summarized in Algorithm 1, where AutoRULe is initialized by LPP.
The pretraining method (AutoRULe) can be effectively solved via an ADMM-based framework. Following this, we consider an equivalent form of Eq. (11) by introducing multiple auxiliary variables to replace the constrained terms: an operator that converts each component of a matrix to its absolute value (enforcing nonnegativity) and a proximal operator for handling the $\ell_2$-norm constraint [7], written as follows
(12)  
The augmented Lagrangian version of Eq. (12) is
(13)  
where $\{\mathbf{\Lambda}_i\}$ are the Lagrange multipliers and $\mu$ is the penalty parameter. The two terms $\mathcal{P}_{+}(\cdot)$ and $\mathcal{P}_{\mathcal{B}}(\cdot)$ represent two kinds of projection operators, respectively. That is, $\mathcal{P}_{+}$ is defined component-wise as

$$[\mathcal{P}_{+}(\mathbf{G})]_{ij} = \lvert \mathbf{G}_{ij} \rvert, \qquad (14)$$

while $\mathcal{P}_{\mathcal{B}}$ is a vector-based operator defined by

$$\mathcal{P}_{\mathcal{B}}(\mathbf{g}_k) = \frac{\mathbf{g}_k}{\max(\lVert \mathbf{g}_k \rVert_2, 1)}, \qquad (15)$$

where $\mathbf{g}_k$ is the $k$-th column of the matrix $\mathbf{G}$. Algorithm 2 details the procedures of AutoRULe.
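The two projection operators are cheap element-wise and column-wise maps; a numpy sketch (names are ours, following the text's descriptions: component-wise absolute value for nonnegativity, and a column-wise projection onto the unit $\ell_2$ ball):

```python
import numpy as np

def proj_nonneg(G):
    """Component-wise operator of Eq. (14): map each entry of G
    to its absolute value, enforcing nonnegativity."""
    return np.abs(G)

def proj_l2_ball(G):
    """Column-wise operator of Eq. (15): rescale any column whose
    l2 norm exceeds 1 back onto the unit ball; shorter columns pass through."""
    norms = np.maximum(np.linalg.norm(G, axis=0), 1.0)
    return G / norms
```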
Auto-reconstructing unsupervised learning (AutoRULe)
The two subproblems in Algorithm 1 can be alternately optimized as follows:
Optimization with respect to $\mathbf{P}$: This is a typical least-squares regression problem, which can be written as

$$\min_{\mathbf{P}} \frac{\alpha}{2} \lVert \mathbf{Y} - \mathbf{P} \mathbf{X}_m \rVert_{\mathrm{F}}^{2} + \frac{\gamma}{2} \lVert \mathbf{P} \rVert_{\mathrm{F}}^{2}, \qquad (16)$$

which has a closed-form solution

$$\mathbf{P} = \alpha \mathbf{Y} \mathbf{X}_m^{\mathrm{T}} \left( \alpha \mathbf{X}_m \mathbf{X}_m^{\mathrm{T}} + \gamma \mathbf{I} \right)^{-1}, \qquad (17)$$

where $\mathbf{X}_m = \mathbf{\Theta}_m \cdots \mathbf{\Theta}_1 \mathbf{X}$.
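The closed-form ridge update of Eq. (17) is a one-liner in numpy (a sketch; parameter names and default values are ours):

```python
import numpy as np

def solve_P(Y, X_m, alpha=1.0, gamma=0.1):
    """Closed-form solution of Eq. (17):
    P = alpha * Y X_m^T (alpha * X_m X_m^T + gamma * I)^(-1)."""
    d = X_m.shape[0]
    return alpha * Y @ X_m.T @ np.linalg.inv(
        alpha * X_m @ X_m.T + gamma * np.eye(d))
```

Correctness can be checked by verifying the first-order optimality condition $\alpha(\mathbf{P}\mathbf{X}_m - \mathbf{Y})\mathbf{X}_m^{\mathrm{T}} + \gamma \mathbf{P} = \mathbf{0}$.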
Optimization with respect to $\{\mathbf{\Theta}_l\}$: The variables $\{\mathbf{\Theta}_l\}_{l=1}^{m}$ can be individually optimized, and hence the optimization problem of each $\mathbf{\Theta}_l$ can be generally formulated as
(18)
which can be basically deduced by following the framework of Algorithm 2. The only difference lies in the optimization subproblem with respect to $\mathbf{\Theta}_l$, whose solution can be obtained by solving the following problem:
(19)
The analytical solution of Eq. (19) is given by
(20) 
Finally, we repeat these optimization procedures until a stopping criterion is satisfied. Please refer to Algorithm 1 and Algorithm 2 for more explicit steps.
3 Experiments
In this section, we conduct classification experiments to quantitatively evaluate the performance of the proposed method (J-Play) using three popular and advanced classifiers, namely the nearest-neighbor (NN) classifier based on the Euclidean distance, kernel support vector machines (KSVM), and canonical correlation forests (CCF), in comparison with previous state-of-the-art methods. The overall accuracy (OA) is reported to quantify the classification performance.
3.1 Data Description
The experiments are performed on two different types of datasets, hyperspectral datasets and face datasets, as both easily suffer from information redundancy and benefit from a more representative feature set. We use the following two hyperspectral datasets and two face datasets:
1) Indian Pines AVIRIS Image: The first hyperspectral cube was acquired by the AVIRIS sensor and consists of several vegetation classes. The specific classes and the arrangement of training and test samples can be found in [11]. The first image of Fig. 3 shows a false-color image of the Indian Pines data.
2) University of Houston Image: The second hyperspectral cube, acquired by the ITRES CASI sensor, was provided for the IEEE GRSS data fusion contest. Information regarding the classes and the corresponding training and test samples can be found in [13]. A false-color image of the study scene is shown in the first image of Fig. 4.
3) Extended YaleB Dataset: We only choose a subset of this dataset with frontal poses of the subjects under different illuminations, which has been widely used in evaluating the performance of subspace learning [32, 2]. The images were aligned and cropped to a fixed size, yielding a vector-based representation; each individual has near-frontal images under different illuminations.
4) AR Dataset: Similar to [31], we choose a subset of AR acquired under varying illumination and expression conditions, comprising a set of subjects. For each person, seven images from Session 1 are used as the training set and the others from Session 2 as testing samples. The images are resized to a uniform resolution.
3.2 Experimental Setup
As fixed training and testing samples are given for the hyperspectral datasets, the subspace learning techniques can directly be performed on the training set to learn an optimal subspace, in which the testing set can simply be classified by NN, KSVM, and CCF. For the face datasets, since there are no standard training and testing sets, ten replications are performed with randomly selected training and testing samples: a random subset of facial images per individual is chosen with labels as the training set, and the rest is considered the testing set. Furthermore, we compare the performance of the proposed method (J-Play) with the baseline (original features without dimensionality reduction) and six popular and advanced methods (PCA, LPP, LDA, LFDA, LSDR, and LSQMID). Learning different numbers of coupled projections, the proposed method can be successively specified as J-Play$_1$, ..., J-Play$_l$, ..., J-Play$_m$; to investigate the trend of the OAs, the maximum $m$ is set uniformly on the four datasets.
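One replication of the face-data protocol above (a random fixed-size subset per individual for training, the rest for testing) can be sketched as follows; the function name and signature are our own:

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Pick n_train random samples per class for training and use the
    remaining samples of that class for testing."""
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```

Repeating this with ten different random seeds and averaging the OAs reproduces the ten-replication protocol described in the text.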
Table 1: Overall accuracies of the different methods (Baseline, PCA, LPP, LDA, LFDA, LSDR, LSQMID, and J-Play with different numbers of coupled projections) on the Indian Pines and Houston datasets using the NN, KSVM, and CCF classifiers.
3.3 Results of Hyperspectral Data
Initially, we conduct cross-validation for the different algorithms on the training set in order to estimate the optimal parameters from a predefined candidate set. Table 1 lists the classification performance of the different methods with the optimal subspace dimensions obtained by cross-validation using the three different classifiers. Correspondingly, classification maps are given in Figs. 3 and 4 to intuitively highlight the differences.
Overall, PCA performs similarly to the baseline with the three different classifiers on the two datasets. LPP, due to its sensitivity to noise, yields poor performance on the first dataset, while on the relatively high-quality second dataset it steadily outperforms the baseline and PCA. Among the supervised algorithms, owing to the limited training samples and discriminative power, the classification accuracies of classic LDA are holistically lower than those of the previously mentioned methods. With a more powerful discriminative criterion, LFDA obtains more competitive results by locally focusing on discriminative information, and these are generally better than those of the baseline, PCA, LPP, and LDA. However, the features learned by LFDA are sensitive to noise and to the number of neighbors, resulting in unstable performance, particularly across the different classifiers. LSDR and LSQMID aim to find a linear projection by maximizing the mutual information between input and output from a statistical point of view; by fully considering the mutual information, they achieve good performance on the two given hyperspectral datasets.
Remarkably, the performance of the proposed method (J-Play) is superior to the other methods on the two hyperspectral datasets. This indicates that J-Play tends to learn a better feature representation and is robust against noise. On the other hand, with the increase of $m$, the performance of J-Play steadily increases, reaching its best with only a few layers on each dataset, and then gradually decreases with slight perturbations, since our model is only trained on the training set.
Table 2: Overall accuracies of the different methods on the Extended YaleB and AR datasets using the NN, KSVM, and CCF classifiers.
3.4 Results of Face Images
As J-Play is proposed as a general subspace learning framework for multi-label classification, we additionally use two popular face datasets to further assess its generalization capability. Similarly, cross-validation on the training set is conducted to estimate the optimal parameter combination on the Extended YaleB and AR datasets. Considering the high-dimensional vector-based face images, we first apply PCA to the face images in order to roughly reduce the feature redundancy, and the resulting features are then fed into the dimensionality reduction methods, following previous work on face recognition (e.g., LDA (Fisherfaces) [20] and LPP (Laplacianfaces) [5]). Table 2 gives the corresponding OAs of the different methods on the two face datasets.

By comparison, the performance of PCA and LPP is steadily superior to that of the baseline, while PCA is even better than LPP. Among the supervised approaches, LDA performs better than the baseline, PCA, LPP, and even LFDA, showing an impressive result. Due to the smaller number of training samples in the face datasets, LSDR and LSQMID cannot effectively estimate the mutual information between the training samples and labels, resulting in performance degradation compared to the hyperspectral data. The proposed method outperforms the other algorithms, which indicates that it can effectively learn an optimal mapping from the original space to the label space, further improving the classification accuracy. Likewise, there is a similar trend for the proposed method with the increase of $m$: J-Play basically obtains the optimal OAs with only a few layers, and more layers lead to performance degradation. We also characterize and visualize each column of the learned projection, as shown in Fig. 5, where high-level, semantically meaningful features, i.e., face features under different poses and illuminations, are learned well, making the faces easier to identify.
4 Conclusions
To effectively find an optimal subspace in which samples can be semantically represented and thereby better classified or recognized, we proposed a novel linearized subspace learning framework (J-Play) which learns the feature representation from high-dimensional data in a joint and progressive way. Extensive multi-label classification experiments are conducted on two types of datasets, hyperspectral images and face images, in comparison with previously proposed state-of-the-art methods. The promising results of J-Play demonstrate its superiority and effectiveness. In the future, we will build a unified framework based on J-Play by extending it to semi-supervised learning, transfer learning, and multi-task learning.
References
 [1] Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
 [2] Cai, D., He, X., Han, J.: Spectral regression: A unified approach for sparse subspace learning. In: International Conference on Data Mining (ICDM). pp. 73–82 (2007)
 [3] Chung, F.R.K.: Spectral graph theory. American Mathematical Society (1997)

 [4] He, X., Cai, D., Yan, S., Zhang, H.J.: Neighborhood preserving embedding. In: International Conference on Computer Vision (ICCV). vol. 2, pp. 1208–1213 (2005)
 [5] He, X., Hu, S., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(3), 328–340 (2005)
 [6] He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems (NIPS). pp. 153–160 (2004)

 [7] Heide, F., Heidrich, W., Wetzstein, G.: Fast and flexible convolutional sparse coding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5135–5143 (2015)
 [8] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
 [9] Hong, D., Liu, W., Su, J., Pan, Z., Wang, G.: A novel hierarchical approach for multispectral palmprint recognition. Neurocomputing 151, 511–521 (2015)
 [10] Hong, D., Liu, W., Wu, X., Pan, Z., Su, J.: Robust palmprint recognition based on the fast variation Vese–Osher model. Neurocomputing 174, 999–1012 (2016)
 [11] Hong, D., Yokoya, N., Zhu, X.: The K-LLE algorithm for nonlinear dimensionality reduction of large-scale hyperspectral data. In: IEEE Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). pp. 1–5. IEEE (2016)
 [12] Hong, D., Yokoya, N., Zhu, X.: Local manifold learning with robust neighbors selection for hyperspectral dimensionality reduction. In: IEEE International Conference on Geoscience and Remote Sensing Symposium (IGARSS). pp. 40–43. IEEE (2016)
 [13] Hong, D., Yokoya, N., Zhu, X.: Learning a robust local manifold representation for hyperspectral dimensionality reduction. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS) 10(6), 2960–2975 (2017)
 [14] Hong, D., Yokoya, N., Chanussot, J., Zhu, X.X.: Learning a low-coherence dictionary to address spectral variability for hyperspectral unmixing. In: IEEE International Conference on Image Processing (ICIP). pp. 235–239. IEEE (2017)
 [15] Hu, J., Zheng, W., Lai, J., Zhang, J.: Jointly learning heterogeneous features for RGB-D activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5344–5352 (2015)
 [16] Hu, J., Zheng, W., Lai, J., Zhang, J.: Jointly learning heterogeneous features for RGB-D activity recognition (2016)
 [17] Ji, S., Ye, J.: Linear dimensionality reduction for multi-label classification. In: International Joint Conference on Artificial Intelligence (IJCAI). vol. 9, pp. 1077–1082 (2009)
 [18] Kan, M., Shan, S., Chang, H., Chen, X.: Stacked progressive auto-encoders (SPAE) for face recognition across poses. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1883–1890 (2014)
 [19] Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: Advances in Neural Information Processing Systems (NIPS). pp. 801–808 (2007)
 [20] Martínez, A.M., Kak, A.C.: PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(2), 228–233 (2001)
 [21] Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
 [22] Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research (JMLR) 4, 119–155 (2003)
 [23] Sugiyama, M.: Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis. Journal of Machine Learning Research (JMLR) 8, 1027–1061 (2007)
 [24] Suzuki, T., Sugiyama, M.: Sufficient dimension reduction via squaredloss mutual information estimation. Neural Computation 25(3), 725–758 (2013)
 [25] Tangkaratt, V., Sasaki, H., Sugiyama, M.: Direct estimation of the derivative of quadratic mutual information with application in supervised dimension reduction. Neural Computation 29(8), 2076–2122 (2017)
 [26] Tosato, D., Farenzena, M., Spera, M., Murino, V., Cristani, M.: Multi-class classification on Riemannian manifolds for video surveillance. In: European Conference on Computer Vision (ECCV). pp. 378–391 (2010)

 [27] Wang, K., He, R., Wang, L., Wang, W., Tan, T.: Joint feature selection and subspace learning for cross-modal retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(10), 2010–2023 (2016)
 [28] Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for cross-modal matching. In: International Conference on Computer Vision (ICCV). pp. 2088–2095 (2013)
 [29] Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2(1), 37–52 (1987)
 [30] Yan, S., Xu, D., Zhang, B., Zhang, H.J., Yang, Q., Lin, S.: Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(1), 40–51 (2007)
 [31] Yang, M., Zhang, L., Yang, J., Zhang, D.: Robust sparse coding for face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 625–632 (2011)
 [32] Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: which helps face recognition?. In: International Conference on Computer Vision (ICCV). pp. 471–478 (2011)