The rapid development of imaging technology allows us to capture a single object with different sensors or from different views. Consequently, a single object may have multi-view representations [1, 2, 3]. Although more information is provided, there is tremendous variation and diversity among different views (i.e., within-class samples from different views might have less similarity than between-class samples from the same view). This unfortunate fact poses an emerging yet challenging problem: the classification of an object when the gallery and probe data come from different views, also known as cross-view classification [4, 5, 6, 7].
To tackle this problem, substantial efforts have been made in recent years. These include, for example, the multi-view subspace learning (MvSL) based methods (e.g., [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]) that attempt to remove view discrepancy by projecting all samples into a common subspace. As one of the best-known MvSL based methods, Canonical Correlation Analysis (CCA) [21, 22] learns two view-specific mapping functions such that the projected representations from the two views are maximally correlated. Multi-view Canonical Correlation Analysis (MCCA) [23, 24] was later developed as an extension of CCA to the multi-view scenario. Although the second-order pairwise correlations are maximized, the neglect of discriminant information may severely deteriorate the classification performance of CCA and MCCA. To deal with this problem, more advanced methods, such as the Generalized Multi-view Analysis (GMA) and the Multi-view Modular Discriminant Analysis (MvMDA), were proposed to improve classification performance by taking into consideration either intra-view or inter-view discriminant information. Despite promising results on some applications, these methods are only capable of discovering the intrinsic geometric structure of data lying on linear or near-linear manifolds [25, 26], and their performance cannot be guaranteed when the data is non-linearly embedded in a high-dimensional observation space or suffers from heavy outliers [27, 28, 29].
To circumvent the linear limitation, a straightforward way is to extend these methods with the famed kernel trick. However, it is non-trivial to design a favorable kernel, and kernel methods are inefficient at handling the out-of-sample problem. Moreover, to the best of our knowledge, the majority of current kernel extensions, such as Kernel Canonical Correlation Analysis (KCCA) [31, 32], can only deal with two-view classification. On the other hand, several recent methods (e.g., Multi-view Deep Network (MvDN), Deep Canonical Correlation Analysis (DCCA) [34, 35] and Multimodal Deep Autoencoder) suggest using deep neural networks (DNNs) to eliminate the nonlinear discrepancy between pairwise views. As with other DNNs that have been widely used in machine learning and computer vision tasks, the optimal selection of network topology remains an open question, and the network cannot be trained well, theoretically or practically, when the training data is insufficient.
This work also deals with the nonlinearity challenge. However, different from the aforementioned work that is either incorporated into a kernel framework in a brute-force manner or built upon an over-complicated DNN that is hard to train, our general idea is to enforce the latent representation to preserve the local geometry of intra-view and inter-view samples as much as possible, thus significantly improving its representation power. Despite its simplicity, this idea is difficult to implement in cross-view classification. This is because traditional local geometry preservation methods (e.g., Locally Linear Embedding (LLE) and Locality Preserving Projections (LPP)) require that the manifolds be locally linear or smooth. Unfortunately, this condition is not met in the multi-view scenario, especially considering the fact that discontinuity could easily occur at the junction of pairwise views.
To this end, we propose a novel cross-view classification algorithm with the Divide-and-Conquer strategy [41, chapter 4]. Instead of building a sophisticated “unified” model, we partition the problem of cross-view classification into three subproblems and build one model to solve each subproblem. Specifically, the first model (i.e., low-dimensional embedding of paired samples or LE-paired) aims to remove view discrepancy, whereas the second model (i.e., local discriminant embedding of intra-view samples or LDE-intra) attempts to discover nonlinear embedding structure and to increase discriminability in intra-view samples (i.e., samples from the same view). By contrast, the third model (i.e., local discriminant embedding of inter-view samples or LDE-inter) attempts to discover nonlinear embedding structure and to increase discriminability in inter-view samples (i.e., samples from different views). We term the combined model that integrates LE-paired, LDE-intra and LDE-inter Multi-view Hybrid Embedding (MvHE).
To summarize, our main contributions are threefold:
1) Motivated by the Divide-and-Conquer strategy, a novel method, named MvHE, is proposed for cross-view classification. Three subproblems associated with cross-view classification are pinpointed, and one model is developed to solve each of them.
2) The part optimization and whole alignment framework, which generalizes the majority of prevalent single-view manifold learning algorithms, has been extended to the multi-view scenario, thus enabling us to precisely preserve the local geometry of inter-view samples.
3) Experiments on four benchmark datasets demonstrate that our method can effectively discover the intrinsic structure of multi-view data lying on nonlinear manifolds. Moreover, compared with previous MvSL based counterparts, our method is less sensitive to outliers.
The rest of this paper is organized as follows. Sect. II introduces the related work. In Sect. III, we describe MvHE and its optimization in detail. The kernel extension and complexity analysis of MvHE are also conducted. Experimental results on four benchmark datasets are presented in Sect. IV. Finally, Sect. V concludes this paper.
II. Related Work and Preliminary Knowledge
In this section, the key notations used in this paper are summarized and the most relevant MvSL based cross-view classification methods are briefly reviewed. We also present basic knowledge on the part optimization and whole alignment modules, initiated in Discriminative Locality Alignment (DLA), to familiarize interested readers with our method, which will be elaborated in Sect. III.
II-A. Multi-view Subspace Learning based Approaches
Suppose we are given $v$-view data $\{X_i\}_{i=1}^{v}$, where $X_i \in \mathbb{R}^{d_i \times n}$ denotes the data matrix from the $i$-th view. Here, $d_i$ is the feature dimensionality and $n$ is the number of samples. We then suppose $x_{ij}$ is a sample from the $j$-th object under the $i$-th view, and $x_{ij} \in c_k$ denotes that the sample is from the $k$-th class, where $1 \leq k \leq c$ and $c$ is the number of classes. Let $\{x_{ij}\}_{i=1}^{v}$ refer to the paired samples from the $j$-th object.
MvSL based approaches aim to learn $v$ mapping functions $\{f_i\}_{i=1}^{v}$, one for each view, to project data into a latent subspace $\mathcal{Y}$, where $f_i$ is the mapping function of the $i$-th view. For ease of presentation, let $\mu_i^{(k)}$ denote the mean of all samples of the $k$-th class under the $i$-th view in $\mathcal{Y}$, $\mu^{(k)}$ denote the mean of all samples of the $k$-th class over all views in $\mathcal{Y}$, and $\mu$ denote the mean of all samples over all views in $\mathcal{Y}$; let $n_i^{(k)}$ denote the number of samples of the $k$-th class from the $i$-th view and $n^{(k)}$ denote the number of samples of the $k$-th class over all views. Also let $\mathrm{tr}(\cdot)$ denote the trace operator, $I_m$ denote an $m$-dimensional identity matrix, and $\mathbf{1}$ indicate a column vector with all elements equal to one.
We analyze different MvSL based cross-view classification methods below.
II-A1. CCA, KCCA and MCCA
where the two kernel matrices are computed with respect to the data of the two views, and the two atom matrices lie in their corresponding views. Similar to CCA, KCCA is limited to two-view data.
As baseline methods, CCA, KCCA and MCCA all suffer from the neglect of discriminant information.
II-A3. GMA and MULDA
Generalized Multi-view Analysis (GMA) offers an advanced avenue for generalizing CCA to a supervised algorithm by taking intra-view discriminant information into consideration, using the following objective:
where the two scatter matrices are the within-class and between-class scatter matrices of the $i$-th view, respectively.
Multi-view Uncorrelated Linear Discriminant Analysis (MULDA) was later proposed to learn discriminant features with minimal redundancy by embedding uncorrelated LDA (ULDA) into the CCA framework:
II-A4. MvDA and MvMDA
Multi-view Discriminant Analysis (MvDA) is the multi-view version of Linear Discriminant Analysis (LDA). It maximizes the ratio of the determinant of the between-class scatter matrix to that of the within-class scatter matrix:
where the between-class scatter matrix and the within-class scatter matrix are given by:
A similar method to MvDA is the recently proposed Multi-view Modular Discriminant Analysis (MvMDA)  that aims to separate various class centers across different views. The objective of MvMDA can be formulated as:
where the between-class scatter matrix and the within-class scatter matrix are given by:
Different from GMA, MvDA and MvMDA incorporate inter-view discriminant information. However, all these methods are incapable of discovering the nonlinear manifolds embedded in multi-view data due to their global essence.
II-B. Part Optimization and Whole Alignment
The LDE-intra and LDE-inter are expected to precisely preserve local discriminant information in either intra-view or inter-view samples. A promising solution is the part optimization and whole alignment framework that generalizes the majority of prevalent single-view manifold learning methods (e.g., LDA, LPP and DLA) by first constructing a patch for each sample and then optimizing over the sum of all patches. Despite its strong representation power and elegant flexibility, one should note that part optimization and whole alignment cannot be directly applied to the multi-view scenario. Before elaborating our solutions in Sect. III, we first introduce the part optimization and whole alignment framework below for completeness.
Part optimization. Assume that $X = [x_1, x_2, \ldots, x_n]$ denotes a single-view dataset embedded in a $d$-dimensional space, where $n$ is the number of samples. Considering $x_i$ and its $k$ nearest neighbors, a patch $X_i = [x_i, x_{i_1}, \ldots, x_{i_k}]$ with respect to $x_i$ is obtained. Given a mapping (or dimensionality reduction) function $f$, for each patch $X_i$, we project it to a latent representation $Y_i = [y_i, y_{i_1}, \ldots, y_{i_k}]$, where $y_i = f(x_i)$. Then, the part optimization is given by:
where $\mathcal{L}(Y_i)$ is the objective function of the $i$-th patch, and it varies with different algorithms. For example, the patch objective for LPP can be expressed as:
where $(w_i)_j$ refers to the $j$-th element of the heat-kernel weight vector $w_i$, with $(w_i)_j = \exp(-\|x_i - x_{i_j}\|^2 / t)$, and $t$ is a tuning parameter of the heat kernel.
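For illustration, the heat-kernel weights used by LPP within a single patch can be computed as follows; the function name and the toy neighbor array are our own illustrative choices, not notation from the paper:

```python
import numpy as np

def heat_kernel_weights(x_i, neighbors, t=1.0):
    """Weight of each patch neighbor: w_j = exp(-||x_i - x_j||^2 / t)."""
    sq_dists = np.sum((neighbors - x_i) ** 2, axis=1)
    return np.exp(-sq_dists / t)

x_i = np.array([0.0, 0.0])
nbrs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
w = heat_kernel_weights(x_i, nbrs, t=1.0)
# a coincident neighbor gets weight 1; farther neighbors decay toward 0
```

Larger $t$ flattens the weights toward 1, while small $t$ concentrates the objective on the closest neighbors.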
Similarly, for DLA, $\mathcal{L}(Y_i)$ becomes:
where $\beta$ is a scaling factor, $k_1$ is the number of within-class nearest neighbors of $x_i$, and $k_2$ is the number of between-class nearest neighbors of $x_i$.
Whole alignment. The local coordinate $Y_i$ can be seen as selected from a global coordinate $Y$ using a selection matrix $S_i$:
where $S_i$ is the mentioned selection matrix, $(S_i)_{pq}$ denotes its element in the $p$-th row and $q$-th column, and $F_i$ stands for the set of indices for the $i$-th patch. Then, Eq. (11) can be rewritten as:
By summing over all patches, the whole alignment (also the overall manifold learning objective) can be expressed as:
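The whole-alignment step above can be sketched as follows: each patch contributes a small alignment matrix that is scattered back into a global matrix at the patch's index set. This is a minimal sketch in which fancy indexing plays the role of the selection matrices; all names are our own:

```python
import numpy as np

def whole_alignment(patch_indices, patch_mats, n):
    """Accumulate each patch's alignment matrix L_i into the global n x n
    matrix L at the rows/columns given by the patch's index set F_i.
    Fancy indexing realizes the effect of the selection matrix S_i."""
    L = np.zeros((n, n))
    for F_i, L_i in zip(patch_indices, patch_mats):
        L[np.ix_(F_i, F_i)] += L_i
    return L

# two overlapping 2-sample patches on a 3-sample dataset
L = whole_alignment([[0, 1], [1, 2]], [np.eye(2), np.eye(2)], n=3)
```

Sample 1 belongs to both patches, so its diagonal entry accumulates two contributions, which is exactly how overlapping patches are fused into one global objective.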
III. Multi-view Hybrid Embedding (MvHE) and Its Kernel Extension
In this section, we first detail the motivation and general idea of the proposed MvHE, and then give its optimization procedure. The kernel extension of MvHE and a computational complexity analysis are also presented.
III-A. Multi-view Hybrid Embedding
The view discrepancy in multi-view data disrupts the local geometry preservation, posing a new challenge of handling view discrepancy, discriminability and nonlinearity simultaneously [33, 6, 49]. Inspired by the Divide-and-Conquer strategy, we partition the general problem of cross-view classification into three subproblems:
Subproblem I: Remove view discrepancy;
Subproblem II: Increase discriminability and discover the intrinsic nonlinear embedding in intra-view samples;
Subproblem III: Increase discriminability and discover the intrinsic nonlinear embedding in inter-view samples.
Three models (i.e., LE-paired, LDE-intra and LDE-inter), each for one subproblem, are developed and integrated in a joint manner. We term the combined model MvHE. An overview of MvHE is shown in Fig. 1.
Since paired samples are collected from the same object, it makes sense to assume that they share common characteristics [5, 50], and extracting view-invariant feature representations could effectively remove view discrepancy. Motivated by this idea, we require that the paired samples converge to a small region (or even a point) in the latent subspace. Therefore, the objective of LE-paired can be represented as:
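A direct way to read the LE-paired objective is as the total squared distance between the projections of the same object across all view pairs. The sketch below assumes row-aligned latent matrices, one per view; this is our own illustrative rendering, not the paper's exact matrix formulation:

```python
import numpy as np

def le_paired_loss(Z_views):
    """Z_views: list of (n x k) latent matrices, one per view, where row j of
    every matrix is the projection of the j-th object. The loss penalizes the
    spread of each object's paired projections, pulling them together."""
    loss = 0.0
    v = len(Z_views)
    for a in range(v):
        for b in range(a + 1, v):
            loss += np.sum((Z_views[a] - Z_views[b]) ** 2)
    return loss

Z = np.zeros((4, 2))
# perfectly aligned views incur zero loss; shifting one view increases it
```

Minimizing this quantity over the mapping functions drives the paired samples toward a common point in the latent subspace, which is the stated goal of LE-paired.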
To improve intra-view discriminability, LDE-intra attempts to unify intra-view within-class samples and separate intra-view between-class samples. A naive objective to implement this idea is given by:
where a scaling factor is introduced to unify the different measurements of the within-class samples and the between-class samples.
One should note that Eq. (III-A2) is incapable of discovering the nonlinear structure embedded in the high-dimensional observation space due to its global essence. Thus, to extend Eq. (III-A2) to its nonlinear formulation, inspired by the part optimization and whole alignment framework illustrated in Sect. II-B, for any given sample we build an intra-view local patch containing the sample itself, its $k_1$ within-class nearest samples, and its $k_2$ between-class nearest samples (see Fig. 2 for more details), thus formulating the part discriminator as:
where $x_{i_j}^{w}$ is the $j$-th within-class nearest sample of $x_i$ and $x_{i_j}^{b}$ is the $j$-th between-class nearest sample of $x_i$. It is worth mentioning that each sample is associated with such a local patch. By summing over all the part optimizations described in Eq. (21), we obtain the whole alignment (also the overall objective) of LDE-intra as:
The intra-view discriminant information and local geometry of observed data can be learned by minimizing Eq. (22).
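Constructing the intra-view patch of a sample amounts to collecting its $k_1$ within-class and $k_2$ between-class nearest neighbors inside one view. A minimal sketch under Euclidean distance, where the function name and interface are our own:

```python
import numpy as np

def intra_view_patch(X, y, i, k1, k2):
    """Return indices of the k1 nearest same-class and the k2 nearest
    different-class samples of X[i] within a single view (Euclidean)."""
    dist = np.sum((X - X[i]) ** 2, axis=1)
    dist[i] = np.inf                        # exclude the sample itself
    same = np.flatnonzero(y == y[i])
    same = same[same != i]
    diff = np.flatnonzero(y != y[i])
    within = same[np.argsort(dist[same])][:k1]
    between = diff[np.argsort(dist[diff])][:k2]
    return within, between

X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0, 0, 1, 1])
within, between = intra_view_patch(X, y, i=0, k1=1, k2=1)
```

The returned index sets are exactly what the part discriminator of LDE-intra operates on: within-class neighbors are pulled toward the sample and between-class neighbors are pushed away.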
Similar to our operations on intra-view samples in Eq. (18), a reliable way to improve inter-view discriminability is to unify inter-view within-class samples and separate inter-view between-class samples. Thus, a naive objective can be represented as:
Similarly, Eq. (III-A3) fails to preserve the local geometry. Hence, we design LDE-inter by generalizing the part optimization and whole alignment framework to the multi-view scenario. In the part optimization phase, each sample is associated with $v-1$ inter-view local patches, one for each of the other views, and each local patch includes $k_1$ nearest samples of the same class and $k_2$ nearest samples of different classes. Nevertheless, it is infeasible to directly calculate the similarity between a sample and heterogeneous samples due to the large view discrepancy. As an alternative, we make use of the paired samples of the given sample in different views: the local patch of $x_{ij}$ in the $p$-th ($p \neq i$) view is constructed using its paired sample $x_{pj}$, the $k_1$ within-class nearest samples with respect to $x_{pj}$, and the $k_2$ between-class nearest samples also with respect to $x_{pj}$ (see Fig. 2 for more details). We thus formulate the part discriminator as:
By summing over all the part optimization terms described in Eq. (24), we obtain the whole alignment (also the overall objective) of LDE-inter as:
The local discriminant embedding of inter-view samples can be obtained by minimizing Eq. (25).
III-B. Solution to MvHE
where $x_{i_j}^{w}$ denotes the $j$-th within-class nearest neighbor of $x_i$ and $x_{i_j}^{b}$ stands for the $j$-th between-class nearest neighbor of $x_i$.
where $f_i$ ($1 \leq i \leq v$) is the mapping function of the $i$-th view, $X_i$ denotes the data matrix from the $i$-th view, and $Y_i$ is the projection of $X_i$ in the latent subspace.
This way, the matrix form of the objective of LE-paired can be expressed as (see supplementary material for more details):
where the corresponding coefficient matrix is defined as below:
On the other hand, the matrix form of the objective of LDE-intra is given by (see the supplementary material for more details):
where the corresponding coefficient matrix is defined as below:
In particular, each local term encodes the objective function for the local patch with respect to a given sample; the selection matrix picks out the set of indices for that local patch, and the corresponding representation matrix collects the sample together with its local patch in the common subspace.
Similarly, the matrix form of the objective of LDE-inter can be written as:
where the corresponding coefficient matrix is defined as below:
III-C. Kernel Multi-view Hybrid Embedding (KMvHE)
KMvHE performs MvHE in a Reproducing Kernel Hilbert Space (RKHS). Given a kernel function $\kappa$, there exists a nonlinear mapping $\phi$ such that $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$. For the data matrix $X_i$ of the $i$-th view, its kernel matrix $K_i$ can be obtained with $(K_i)_{pq} = \kappa(x_p, x_q)$. We denote the mapping function of the $i$-th view in the RKHS as $g_i$. Assuming the atoms of $g_i$ lie in the space spanned by $\phi(X_i)$, we have:
where $A_i$ is an atom matrix of the $i$-th view. Then, the objective can be rewritten as:
where $K$ is the kernel matrix of all samples from all views. The objective of KMvHE, in analogy to Eq. (31), can then be expressed as:
Obviously, Eq. (III-C) can be optimized in the same way as MvHE.
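For concreteness, the RBF kernel matrix used in such kernel extensions (and later in the experiments with KCCA and KMvHE) can be formed as below; `sigma` plays the role of the kernel size tuned by cross validation, and the helper name is our own:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[p, q] = exp(-||x_p - x_q||^2 / (2 * sigma^2)) for row-wise samples X."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
K = rbf_kernel_matrix(X, sigma=1.0)
# the diagonal is exactly 1 and the matrix is symmetric
```

The `np.maximum(..., 0.0)` guard simply clamps tiny negative squared distances caused by floating-point cancellation.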
III-D. Computational Complexity Analysis
For convenience, suppose that all views have equal dimensionality. We first analyze the computational complexity of MvHE. Since Eq. (III-B) is a standard eigenvalue decomposition problem, its computational complexity is $O(m^3)$, where $m$ is the order of the square matrix being decomposed. On the other hand, the computational complexities of the coefficient matrices of LE-paired, LDE-intra and LDE-inter in Eq. (III-B) can be easily derived from the sizes of the selection matrices and the patch representations. Summing the complexities of LE-paired, LDE-intra and LDE-inter yields the overall computational complexity of MvHE.
By contrast, the computational complexities of MvDA and MvMDA can be derived analogously. In practice, the number of samples dominates the remaining factors, so the complexities of MvHE, MvDA and MvMDA can be further simplified; under this simplification, our method does not exhibit any advantage in computational complexity. Nevertheless, the extra computational cost is still acceptable considering the performance gain that will be demonstrated in the next section. The experimental results in Sect. IV-C also corroborate our computational complexity analysis.
III-E. Relation and Difference to DLA
We specify the relation and difference of MvHE to DLA. Both MvHE and DLA are designed to deal with nonlinearity by preserving the local geometry of data. However, MvHE differs from DLA in the following aspects. First, as a well-known single-view algorithm, DLA is not designed for multi-view data; hence it does not reduce the view discrepancy in cross-view classification. By contrast, MvHE is an algorithm designed specifically for multi-view data, and it removes the view discrepancy by performing LE-paired. Second, DLA only takes into consideration the discriminant information and nonlinearity in intra-view samples. By contrast, MvHE incorporates discriminant information and nonlinear embedding structure in both intra-view and inter-view samples by performing LDE-intra and LDE-inter, respectively. One should note that inter-view discriminability and nonlinearity are especially important for cross-view classification, since the task aims to distinguish inter-view samples.
IV. Experiments
In this section, we evaluate the performance of our methods on four benchmark datasets: the Heterogeneous Face Biometrics (HFB) dataset, the CUHK Face Sketch FERET (CUFSF) database (http://mmlab.ie.cuhk.edu.hk/archive/cufsf/), the CMU Pose, Illumination, and Expression database (CMU PIE) and the Columbia University Image Library (COIL-100) (http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php). We first specify the experimental setting in Sect. IV-A. Following this, we demonstrate that our methods achieve state-of-the-art performance on two-view datasets (i.e., HFB and CUFSF) in Sect. IV-B. We then verify that the superiority of our methods also holds on multi-view data, including CMU PIE in Sect. IV-C and COIL-100 in Sect. IV-D. Finally, we conduct a sensitivity analysis of our methods with respect to different parameters in Sect. IV-E.
IV-A. Experimental Setting
We compare our methods with eight baselines and five state-of-the-art methods. The first two baselines, PCA and LDA, are classical classification methods for single-view data. The next four baselines are CCA, KCCA, MCCA and PLS, i.e., the most popular unsupervised methods for multi-view data, which aim to maximize the (pairwise) correlations or covariance between multiple latent representations. Due to the close relationship between our methods and DLA, as mentioned in Sect. III-E, we also list DLA as a baseline. Moreover, we extend DLA to the multi-view scenario (termed MvDLA). Note that, to decrease the discrepancy among views, we have to set a large $k_1$ (the number of within-class nearest neighbors). This straightforward extension, although seemingly reliable, leads to some issues, which we discuss in detail in the experimental part. The state-of-the-art methods selected for comparison include GMA, MULDA, MvDA, MvDA-VC and MvMDA. GMA and MULDA are supervised methods that jointly consider view discrepancy and intra-view discriminability, whereas MvDA, MvDA-VC and MvMDA combine intra-view discriminability and inter-view discriminability in a unified framework.
For a fair comparison, we repeat experiments 10 times independently by randomly dividing the given data into training and test sets at a certain ratio, and report the average results. Furthermore, the hyper-parameters of all methods are determined by 5-fold cross validation. To reduce dimensionality, PCA is first applied for all methods, and the PCA dimensions are empirically set to achieve the best classification accuracy by traversing possible dimensions, as conducted in previous work. In the test phase, we first project new samples into the subspace with the learned mapping functions and then categorize their latent representations with a 1-NN classifier. For KCCA and KMvHE, the radial basis function (RBF) kernel is selected and the kernel size is tuned with cross validation. All experiments in this paper are conducted in MATLAB 2013a on a machine with an Intel Core i5-6500 CPU and 16.0 GB of memory.
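The evaluation protocol described above (project new samples, then match probe against gallery with a 1-NN classifier) can be sketched as follows; the function and variable names are our own, not from the paper:

```python
import numpy as np

def cross_view_accuracy(Z_gallery, y_gallery, Z_probe, y_probe):
    """1-NN matching in the learned subspace: each probe sample (one view) is
    assigned the label of its nearest gallery sample (another view)."""
    d = np.sum((Z_probe[:, None, :] - Z_gallery[None, :, :]) ** 2, axis=2)
    pred = y_gallery[np.argmin(d, axis=1)]
    return float(np.mean(pred == y_probe))

Zg = np.array([[0.0], [1.0]])           # gallery projections (view A)
Zp = np.array([[0.1], [0.9]])           # probe projections (view B)
acc = cross_view_accuracy(Zg, np.array([0, 1]), Zp, np.array([0, 1]))
```

Because matching happens entirely in the common subspace, this protocol directly measures how well a method removes view discrepancy while keeping classes separable.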
IV-B. The Efficacy of (K)MvHE on Two-view Data
We first compare (K)MvHE with all the other competing methods on HFB and CUFSF to demonstrate the superiority of our methods on two-view data. HFB [54, 55] contains 100 individuals, each with 4 visible light images and 4 near-infrared images. This dataset is used to evaluate visible (VIS) light image versus near-infrared (NIR) image heterogeneous classification. CUFSF contains 1194 subjects, with 1 photo and 1 sketch per subject. This dataset is used to evaluate photo versus sketch face classification. For HFB, 60 individuals are selected as training data and the remaining individuals are used for testing. For CUFSF, 700 subjects are selected as training data and the remaining subjects are used for testing. The results are listed in Table I. Note that the performance of GMA and MvMDA is omitted on CUFSF, because there is only one sample in each class under each view.
As can be seen, PCA, LDA and DLA fail to provide reasonable classification results, as expected. Note that the performance of DLA is the same as that of MvDLA on the CUFSF dataset, because each sample on this dataset has only one within-class nearest neighbor. CCA and PLS perform the worst among MvSL based methods due to their naive treatment of view discrepancy. KCCA, GMA and MULDA improve upon CCA and PLS by incorporating the kernel trick or by taking intra-view discriminant information into consideration, respectively. On the other hand, MvDA and MvDA-VC achieve a large performance gain over GMA by further considering inter-view discriminant information. Although MvMDA is a supervised method, its performance is very poor. One possible reason is that there is a large class-center discrepancy between views on HFB, and the within-class scatter matrix (defined in Eq. (II-A5)) in MvMDA does not consider inter-view variations. Although MvDLA achieves performance comparable to MvDA, there is still a performance gap between MvDLA and our methods. This result confirms our concern that a large $k_1$ in MvDLA makes it ineffective at preserving the local geometry of data. Our methods, MvHE and KMvHE, perform the best on these two databases by properly utilizing local geometric discriminant information in both intra-view and inter-view samples.
[Table: ablation configurations indicating which of LE-paired, LDE-intra and LDE-inter are enabled in Cases 1–8]