Traditional computer vision techniques are mainly based on single feature representations, either global or local. For local methods, descriptors such as SIFT [17] are computed at each detected or densely sampled point, and the Bag-of-Words scheme or one of its improved versions is then employed to aggregate these local features into a single representation. Local feature based methods tend to be robust and effective in challenging scenarios, but this kind of representation is often less precise and informative because of the quantization error introduced during codebook construction and the loss of structural relationships among local features. Global representations [18, 10], on the other hand, describe the image as a whole. Unfortunately, global methods are sensitive to shift, scaling, occlusion and clutter, which are common in realistic images.
Notwithstanding the remarkable results achieved by both local and global methods in some cases, most of them are still based on a single view (feature representation). In realistic applications, variations in lighting conditions, intra-class differences, complex backgrounds and viewpoint and scale changes all lead to obstacles for robust feature extraction. Naturally, single representations cannot handle realistic tasks to a satisfactory extent.
In practice, a typical sample can be represented by different views/features, e.g., gradient, shape, color, texture and motion. Generally speaking, views from different feature spaces maintain their own statistical characteristics. Accordingly, it is desirable to incorporate these heterogeneous feature descriptors into one compact representation, which leads to multiview learning approaches. These techniques have been designed for multiview data classification, clustering and feature selection [28]. For such multiview learning tasks, the feature representation of each view is usually very high-dimensional. However, little effort has been devoted to learning low-dimensional, compact representations for multiview computer vision tasks. Obtaining an effective low-dimensional embedding that captures the discriminative information from all views is therefore a worthy research topic, since the effectiveness and efficiency of most methods degrade rapidly as the dimensionality increases, which is commonly referred to as the curse of dimensionality.
Existing multiview dimensionality reduction methods, e.g., multiview spectral embedding (MSE) [24] and multiview stochastic neighbor embedding (m-SNE) [27], have explored the locality information and the probability distributions, respectively, for the fusion of multiview data. Recently, Han et al. [12] proposed a sparse unsupervised dimensionality reduction method to obtain a sparse representation for multiview data. However, these methods are defined only on the training data, and it remains unclear how to embed new test data because of their nonlinearity. In other words, they suffer from the out-of-sample problem [3], which heavily restricts their applicability in realistic and large-scale vision tasks.
In this paper, to tackle the out-of-sample problem, we propose a novel unsupervised multiview subspace learning method called kernelized multiview projection (KMP), which learns a projection that encodes different features with different weights to achieve a semantically meaningful embedding. KMP simultaneously considers the different probabilistic distributions of the data points and the locality information among them. In contrast to the locality measures used in locality preserving projections (LPP) [26] and locally linear embedding (LLE) [20], an $\ell_1$-graph [9, 15] is applied to generate the similarity matrix; it has been shown to be more robust to data noise and is automatically sparse. Moreover, the $\ell_1$-graph can adaptively discover the natural neighborhood of each data point.
Instead of using the multiview features directly, the kernel matrices from multiple views enable KMP to normalize the scales and the dimensions of different features. In fact, we show that the fusion of multiple kernels is actually the concatenation of features in the high-dimensional reproducing kernel Hilbert space (RKHS), while the learning phase of KMP remains in the low-dimensional space. Having obtained kernels for each view in RKHS, KMP can not only fuse the views by exploring the complementary property of different views as multiple kernel learning (MKL) [14, 11, 23], but also find a common low-dimensional subspace where the distribution of each view is sufficiently smooth and discriminative. Note that multiview learning techniques are used to fuse different views/features while MKL is used to combine different kernel functions.
2 Related Work
A simple multiview embedding framework is to concatenate the feature vectors from different views into a new representation and to apply an existing dimensionality reduction method directly to the concatenated vector to obtain the final multiview representation. Nonetheless, this kind of concatenation is not physically meaningful because each view has its own specific characteristics. Moreover, the relationship between different views is ignored, and the complementary nature of the intrinsic data structure of different views is not sufficiently explored.
One feasible solution, distributed spectral embedding (DSE), is proposed in [16]. In DSE, a spectral embedding scheme is first performed on each view separately, producing individual low-dimensional representations. A common compact embedding is then learned so that it is as similar as possible to all of the single-view representations. Although the spectral structure of each view can be effectively considered for learning a multiview embedding via DSE, the complementarity between different views is still neglected.
To effectively and efficiently learn the complementary nature of different views, multiview spectral embedding (MSE) is introduced in [24]. The main advantage of MSE is that it learns a low-dimensional embedding over all views simultaneously, rather than learning each view separately as in DSE. Additionally, MSE is more effective at fusing different views in the learning phase.
However, both DSE and MSE are nonlinear embedding methods, which leads to high computational complexity and to the out-of-sample problem [3]. In particular, when they are applied to classification or retrieval tasks, the low-dimensional embedding has to be re-learned whenever new test data arrive. Because of their nonlinear nature, this incurs heavy computational costs and can become impractical for realistic, large-scale scenarios.
To solve the out-of-sample problem for multiview embedding, we propose an unsupervised projection method, namely KMP. As a linear method, KMP learns a projection using all of the training data. Unlike the nonlinear approaches, once the learning phase finishes, the projection is fixed and can be applied directly to embed any new test sample without re-training.
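To make the contrast with re-training concrete, the following minimal sketch embeds unseen samples by evaluating a kernel against the training set and applying a fixed projection matrix. It is a single-view illustration under stated assumptions: the function names `rbf_kernel` and `embed_new_samples`, the bandwidth `gamma`, and the random matrix standing in for a learned projection `P` are hypothetical and are not taken from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel values between the rows of A (n x d) and the rows of B (m x d)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def embed_new_samples(X_train, X_new, P, gamma):
    """Embed unseen samples with a fixed projection P (N x d); nothing is re-trained."""
    K_new = rbf_kernel(X_train, X_new, gamma)  # N x m kernel columns for the new samples
    return P.T @ K_new                         # d x m low-dimensional codes

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))  # 20 training samples with 5 features
P = rng.normal(size=(20, 3))        # stand-in for a learned projection (assumption)
X_new = rng.normal(size=(4, 5))     # 4 unseen test samples
Y_new = embed_new_samples(X_train, X_new, P, gamma=0.5)
```

Each test sample is encoded independently of the others, so the cost per new sample is one kernel evaluation against the training set followed by one matrix-vector product.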
3 Kernelized Multiview Projection
Given $N$ training samples and $M$ different descriptors for multiview feature extraction, $x_i^j$ represents the feature vector for the $i$-th view and $j$-th sample. Since the dimensions of the various descriptors differ, kernel matrices $K_1, \ldots, K_M \in \mathbb{R}^{N \times N}$ are constructed with kernel functions such as the RBF kernel and the polynomial kernel, so that the different views can be fused on the same scale. Our task is to output an optimal projection matrix $P$ and weights $\alpha = (\alpha_1, \ldots, \alpha_M)$ satisfying $\sum_{i=1}^{M} \alpha_i = 1$ for the kernel matrices, such that the fused feature matrix can represent the original multiview data comprehensively.
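As a rough illustration of how kernel matrices put views of different dimensionality on a common footing, the sketch below builds one kernel matrix per view. The toy data, the bandwidth choice `gamma = 1 / dimension`, and the polynomial degree are assumptions made only for this example; the two view dimensions (225 and 192) mirror the HOG and ColorHist dimensions reported later.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma):
    """N x N RBF kernel matrix for the rows of X."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def polynomial_kernel_matrix(X, degree=2, coef0=1.0):
    """N x N polynomial kernel matrix (x . y + coef0) ** degree."""
    return (X @ X.T + coef0) ** degree

rng = np.random.default_rng(0)
N = 6
# Two views of the same N samples with different feature dimensions.
views = [rng.normal(size=(N, 225)), rng.normal(size=(N, 192))]
kernels = [
    rbf_kernel_matrix(views[0], gamma=1.0 / 225),
    polynomial_kernel_matrix(views[1]),
]
```

Whatever the original dimension of a view, its kernel matrix is always N x N, which is what allows the views to be combined with simple weights.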
3.2 Formulation of KMP
The projection learning of KMP is based on the similarity matrix $W_i$ for the $i$-th view, $i = 1, \ldots, M$. For each view, we measure the similarity of each sample pair using the neighbors of each point. The construction of $W_i$ is described below via the $\ell_1$-graph [9], which has been demonstrated to be robust to data noise, automatically sparse and adaptive to the neighborhood.
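For each $x_i^j$, we find the coefficients $\beta_{jk}$ such that $x_i^j = \sum_{k \neq j} \beta_{jk} x_i^k$, i.e., each sample is expressed as a linear combination of the other samples in the same view. Considering the noise effect, we can rewrite it as $x_i^j = B_i^j \beta_i^j$, where $B_i^j = (x_i^1, \ldots, x_i^{j-1}, x_i^{j+1}, \ldots, x_i^N, I)$ and $\beta_i^j$ collects the coefficients $\beta_{jk}$ together with the noise term. Thus, seeking the sparse representation for $x_i^j$ leads to the following optimization problem:
$$\min_{\beta_i^j} \|\beta_i^j\|_1 \quad \text{s.t.} \quad \|x_i^j - B_i^j \beta_i^j\|_2 < \varepsilon, \tag{1}$$
where $\varepsilon$ is a parameter with a small value. This problem can be solved by orthogonal matching pursuit [19].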
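The greedy idea behind orthogonal matching pursuit can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the authors' code: the function name `omp`, the tolerance `eps`, and the identity dictionary used in the toy call are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def omp(B, x, eps=1e-6, max_atoms=None):
    """Greedy sparse coding: find a sparse beta with ||x - B @ beta||_2 <= eps."""
    n_atoms = B.shape[1]
    max_atoms = n_atoms if max_atoms is None else max_atoms
    norms = np.linalg.norm(B, axis=0)      # assumes no all-zero dictionary column
    support = []                           # indices of the selected atoms
    coef = np.zeros(0)
    residual = x.astype(float).copy()
    while np.linalg.norm(residual) > eps and len(support) < max_atoms:
        corr = np.abs(B.T @ residual) / norms
        if support:
            corr[support] = -np.inf        # never re-select an atom
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients by least squares (the "orthogonal" step).
        coef, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ coef
    beta = np.zeros(n_atoms)
    if support:
        beta[support] = coef
    return beta

# Toy dictionary (identity) and a signal that uses two of its atoms.
B = np.eye(5)
x = np.array([0.0, 2.0, 0.0, -3.0, 0.0])
beta = omp(B, x)
```

In the setting described above, the columns of the dictionary would be the other samples of the same view (plus an identity block for the noise term) rather than an identity matrix alone.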
Considering the different probabilistic distributions that exist over the data points and the natural locality information of the data, we first apply the Gaussian mixture model (GMM) to the training data of each view. On the one hand, data in a high-dimensional space do not always follow a single distribution, but are naturally clustered into several groups. On the other hand, realistic data distributions basically follow the same form, i.e., the Gaussian distribution. Clusters are therefore obtained by unsupervised GMM clustering for each view. We can then solve problem (1) using only the data from the same cluster to represent each point, rather than all of the data points, which also alleviates the computational complexity of problem (1).
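In particular, for each $x_i^j$, we first set $\beta_{jk} = 0$ if $x_i^j$ and $x_i^k$ are in different clusters, where $\beta_{jk}$ denotes the coefficient of $x_i^k$ in the sparse representation of $x_i^j$, and then solve problem (1). The similarity matrix $W_i$ can now be defined as $(W_i)_{jk} = |\beta_{jk}|$ if $j \neq k$, and $(W_i)_{jk} = 0$ if $j = k$. To ensure symmetry, we update $W_i \leftarrow (W_i + W_i^T)/2$. Then we set the diagonal matrix $D_i$ with $(D_i)_{jj} = \sum_k (W_i)_{jk}$ and the Laplacian matrix $L_i = D_i - W_i$ for each view $i$.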
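The construction of the per-view graph matrices can be sketched as follows. The helper name `graph_matrices` and the toy coefficient matrix are hypothetical, and taking absolute values of the sparse coefficients is an assumption that follows the usual $\ell_1$-graph convention.

```python
import numpy as np

def graph_matrices(coeffs):
    """coeffs[j, k] is the coefficient of sample k in the sparse code of sample j."""
    W = np.abs(np.asarray(coeffs, dtype=float))
    np.fill_diagonal(W, 0.0)    # a sample is not its own neighbor
    W = 0.5 * (W + W.T)         # symmetrize the similarity matrix
    D = np.diag(W.sum(axis=1))  # diagonal degree matrix
    L = D - W                   # graph Laplacian
    return W, D, L

# Toy sparse codes for three samples of one view.
coeffs = [[0.0, 0.5, 0.0],
          [-0.2, 0.0, 0.3],
          [0.0, 0.1, 0.0]]
W, D, L = graph_matrices(coeffs)
```

For instance, the similarity between the first two samples is the average of |0.5| and |-0.2|, and every row of the resulting Laplacian sums to zero.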
Multiview kernel fusion
Due to the complementary nature of different descriptors, we assign different weights to different views. The goal of KMP is to find the basis of a subspace in which the lower-dimensional representation preserves the intrinsic structure of the original data. Therefore, we impose a set of nonnegative weights $\alpha = (\alpha_1, \ldots, \alpha_M)$ on the similarity matrices, which gives the fused similarity matrix $W = \sum_{i=1}^{M} \alpha_i W_i$, the fused diagonal matrix $D = \sum_{i=1}^{M} \alpha_i D_i$ and the fused Laplacian matrix $L = D - W = \sum_{i=1}^{M} \alpha_i L_i$.
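For the kernel matrix, we also define the fused kernel matrix $K = \sum_{i=1}^{M} \alpha_i K_i$. In fact, suppose $\phi_i$ is the feature map associated with kernel $K_i$, i.e., $K_i(x, y) = \langle \phi_i(x), \phi_i(y) \rangle$; then the fused kernel value is computed from the feature vector obtained by concatenating the mapped vectors scaled via $\sqrt{\alpha_i}$, since we have
$$\sum_{i=1}^{M} \alpha_i K_i(x_i^j, x_i^k) = \sum_{i=1}^{M} \langle \sqrt{\alpha_i}\, \phi_i(x_i^j), \sqrt{\alpha_i}\, \phi_i(x_i^k) \rangle = \langle \phi(x^j), \phi(x^k) \rangle,$$
where $\phi(x^j) = (\sqrt{\alpha_1}\, \phi_1(x_1^j), \ldots, \sqrt{\alpha_M}\, \phi_M(x_M^j))$ is the fused feature map and $x^j = (x_1^j, \ldots, x_M^j)$ is the $M$-tuple consisting of features from all the views.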
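The equivalence between a weighted sum of kernels and an inner product of concatenated, rescaled feature maps can be checked numerically. The sketch below uses explicit finite-dimensional feature maps (random matrices `Phi1` and `Phi2`) as stand-ins for the generally implicit maps, and the weights in `alpha` are arbitrary; all of these are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
Phi1 = rng.normal(size=(N, 3))  # mapped features of view 1 (one row per sample)
Phi2 = rng.normal(size=(N, 4))  # mapped features of view 2
alpha = np.array([0.3, 0.7])    # nonnegative view weights that sum to one

# Per-view kernels are inner products of the mapped features.
K1 = Phi1 @ Phi1.T
K2 = Phi2 @ Phi2.T
K_fused = alpha[0] * K1 + alpha[1] * K2

# Concatenate the mapped features, each scaled by the square root of its weight.
Phi = np.hstack([np.sqrt(alpha[0]) * Phi1, np.sqrt(alpha[1]) * Phi2])
K_concat = Phi @ Phi.T
```

Both routes give the same N x N matrix, while the fused kernel never needs the concatenated features to be formed explicitly.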
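To preserve the fused locality information, we need to find the optimal projection for the following optimization problem:
$$\arg\min_{v} \sum_{j,k=1}^{N} \| v^T \phi^j - v^T \phi^k \|^2 W_{jk}, \tag{2}$$
where $\phi^j$ is the fused mapped feature of the $j$-th sample, i.e., $\phi^j = \phi(x^j)$, and $W_{jk}$ is an entry of the fused similarity matrix $W$. Through simple algebraic derivation, the above optimization problem can be transformed to the following form:
$$\arg\min_{v} v^T \Phi L \Phi^T v, \tag{3}$$
where $\Phi = (\phi^1, \ldots, \phi^N)$ and $L$ is the fused Laplacian matrix.
With the constraint $v^T \Phi D \Phi^T v = 1$, where $D$ is the fused diagonal matrix, minimizing the objective function in Eq. (3) is to solve the following generalized eigenvalue problem:
$$\Phi L \Phi^T v = \lambda\, \Phi D \Phi^T v. \tag{4}$$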
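Note that each solution $v$ of problem (4) is a linear combination of $\phi^1, \ldots, \phi^N$, so there exists an $N$-tuple $\mathbf{p} = (p_1, \ldots, p_N)^T$ such that $v = \sum_{j=1}^{N} p_j \phi^j = \Phi \mathbf{p}$, where $\Phi = (\phi^1, \ldots, \phi^N)$. For the matrix $V$ consisting of all the linearly independent solutions of problem (4), there exists a matrix $P$ such that $V = \Phi P$. Therefore, using the fused kernel matrix $K = \Phi^T \Phi$ and the additional constraint $\sum_{i=1}^{M} \alpha_i = 1$, we can formulate the new objective function as follows:
$$\arg\min_{P, \alpha} \operatorname{tr}\left(P^T K L K P\right) \quad \text{s.t.} \quad P^T K D K P = I, \quad \sum_{i=1}^{M} \alpha_i = 1, \quad \alpha_i \geq 0, \tag{5}$$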
or in the form associated with the norm constraint:
3.3 Alternate Optimization via Relaxation
In this section, we employ a procedure of alternate optimization [4] to derive the solution of the optimization problem. To the best of our knowledge, it is difficult to find the optimal solution directly, especially for the weights in (6).
First, for a fixed $\alpha$, finding the optimal projection $P$ reduces to solving the generalized eigenvalue problem
$$K L K \mathbf{p} = \lambda\, K D K \mathbf{p},$$
and the columns of $P$ are set to the eigenvectors corresponding to the $d$ smallest eigenvalues, where $d$ is the target dimension, based on the Ky-Fan theorem [5].
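A generalized symmetric eigenvalue problem of this kind can be solved with plain NumPy by whitening with a Cholesky factor. The sketch below is one possible approach under stated assumptions: the helper name `smallest_generalized_eigvecs`, the small ridge term `reg`, and the random toy kernel and similarity matrices are all hypothetical and chosen to be well conditioned.

```python
import numpy as np

def smallest_generalized_eigvecs(A, B, d, reg=1e-10):
    """Solve A p = lambda B p for symmetric A and positive-definite B (d smallest)."""
    n = A.shape[0]
    C = np.linalg.cholesky(B + reg * np.eye(n))  # B is approximately C @ C.T
    C_inv = np.linalg.inv(C)
    S = C_inv @ A @ C_inv.T                      # equivalent standard symmetric problem
    S = 0.5 * (S + S.T)                          # remove round-off asymmetry
    eigvals, Z = np.linalg.eigh(S)               # eigenvalues in ascending order
    P = C_inv.T @ Z[:, :d]                       # map eigenvectors back
    return eigvals[:d], P

rng = np.random.default_rng(0)
N, d = 8, 3
X = rng.normal(size=(N, 4))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq)                                  # toy RBF kernel matrix

W = rng.uniform(size=(N, N))                     # toy nonnegative similarities
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W

A = K @ L @ K
B = K @ D @ K
eigvals, P = smallest_generalized_eigvecs(A, B, d)
```

The returned columns are normalized so that `P.T @ B @ P` is close to the identity, which matches the usual constraint attached to such problems.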
Next, to optimize the weights $\alpha$, we derive a relaxed objective function from the original problem. The output of the relaxed function ensures that the value of the objective function in (6) lies in a small neighborhood of the true minimum.
We fix the projection $P$ and update $\alpha$ individually. Without loss of generality, we first consider the case $M = 2$, i.e., there are only two views. Then the optimization problem (6) is reduced to
For simplicity, we denote and , . Then we can simply find that and .
With the Cauchy-Schwarz inequality , the relaxation for the objective function in (9) is shown in Eq. (3.3), where is the coefficient of and . In this way, the objective function in (9) is relaxed to a weighted sum of . Thus, minimizing the weighted sum of the right-hand-side in (3.3) can lower the objective function value in (9). Note that
and then the weights without containing and are always smaller than a constant. Therefore, we only ensure that a part of the terms in the weighted sum is minimized, i.e., to solve the following optimization problem:
Since and are the functions of , we first find the optimal weights without parameters . To avoid trivial solution, we assign an exponent for each weight. By denoting and , the relaxed optimization will be
For (11), we have the Lagrangian function with the Lagrangian multiplier :
We only need to set the derivatives of with respect to , and to zeros as follows:
Then and can be calculated by
With the constraint , we can easily find that
Hence, for the general -view situation, we also have the corresponding relaxed problems:
The coefficients and can be obtained in similar forms:
Although the weight obtained in the above procedure is not the global minimum, the objective function is ensured in a range of small values. We let and be the objective functions in (6) and (19), respectively, and let
We can find that and if there exists for some , then . During the alternate procedure, for optimizing , is minimized, and for optimizing , is minimized. Denote and , then we have
and we can define the following nonnegative continuous function:
Note that is independent of , thus for any , there exists , such that . If we impose the above alternate optimization on , is nonincreasing and therefore converges. Though does not converge to a fixed point, the value of is reduced into a small district, which is smaller than plus a constant. It is also worthwhile to note that is actually the weighted sum of the objective functions for preserving each view’s locality information. However, the optimization for still learns information from each view separately, i.e., the locality similarity is not fused. We summarize the KMP in Algorithm 1.
4 Experiments and Results
In this section, we evaluate our Kernelized Multiview Projection (KMP) on three image datasets: CMU PIE, CIFAR10 and SUN397. The CMU PIE face dataset contains 41,368 images from 68 subjects (people). Following the settings in previous work, we select front face images, which are manually aligned and cropped. A subset of these images is used as the training set and the remaining images are used for testing. The CIFAR10 dataset is a labeled subset of the 80-million tiny images collection. It consists of a total of 60,000 color images in 10 classes. The entire dataset is partitioned into two parts: a training set and a test set. The SUN397 dataset [25] contains 108,754 scene images in total from 397 well-sampled categories with at least 100 images per category. We randomly select samples from each category to construct the training set, and the remaining samples form the test set.
Table 1: Dimensions of the original feature representations.
|Feature||Dimension|
|Histogram of oriented gradients (HOG)||225|
|Local binary pattern (LBP)||256|
|Color histogram (ColorHist)||192|
4.1 Compared Methods and Settings
For image classification, each image can usually be described by different feature representations, i.e., a multiview representation, in high-dimensional feature spaces. In this paper, we adopt four different feature representations, HOG [10], LBP [1], ColorHist and GIST [18], to describe each image. Table 1 lists the original dimensions of these features.
We compare our proposed KMP with two related multi-kernel fusion methods. In particular, RBF kernels for each view are adopted in the proposed KMP method (our approach can work with any legitimate kernel function, though we focus on the popular RBF kernel in this paper):
$$K_{\mathrm{KMP}} = \sum_{i=1}^{M} \alpha_i K_i,$$
where the weight $\alpha_i$ is obtained via alternate optimization. AM indicates that the kernels are combined by arithmetic mean:
$$K_{\mathrm{AM}} = \frac{1}{M} \sum_{i=1}^{M} K_i,$$
and GM denotes the combination of kernels through geometric mean, taken element-wise:
$$K_{\mathrm{GM}} = \Big( \prod_{i=1}^{M} K_i \Big)^{1/M}.$$
Besides, we also include the best performance of the single-view-based spectral projection (BSP), the average performance of the single-view-based spectral projection (ASP) and the concatenation of single-view-based embeddings (CSP) in our comparison. In particular, AM and GM are incorporated into the proposed KMP framework, while BSP, ASP and CSP are based on the kernelized extension of the Discriminative Partition Sparsity Analysis (DPSA) [15] technique. In addition, two nonlinear embedding methods, distributed spectral embedding (DSE) and multiview spectral embedding (MSE), are included in our comparison. In DSE and MSE, the Laplacian eigenmap (LE) [2] is adopted. For all the compared embedding methods, the RBF-SVM is used to evaluate the final performance.
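The arithmetic-mean and geometric-mean kernel baselines described above, together with a weighted combination, can be sketched as follows. The function names and the two toy kernel matrices are hypothetical, and the geometric mean is assumed to be taken element-wise, which is well defined here because RBF kernel entries are nonnegative.

```python
import numpy as np

def arithmetic_mean_kernel(kernels):
    """Element-wise arithmetic mean of a list of N x N kernel matrices."""
    return np.mean(np.stack(kernels), axis=0)

def geometric_mean_kernel(kernels):
    """Element-wise geometric mean; assumes nonnegative kernel entries."""
    return np.prod(np.stack(kernels), axis=0) ** (1.0 / len(kernels))

def weighted_kernel(kernels, alpha):
    """Weighted sum of kernels with one weight per view."""
    return np.tensordot(alpha, np.stack(kernels), axes=1)

# Two toy 2 x 2 kernels with unit diagonals.
K1 = np.array([[1.0, 0.25], [0.25, 1.0]])
K2 = np.array([[1.0, 0.64], [0.64, 1.0]])
K_am = arithmetic_mean_kernel([K1, K2])
K_gm = geometric_mean_kernel([K1, K2])
K_w = weighted_kernel([K1, K2], np.array([0.5, 0.5]))
```

With equal weights the weighted kernel coincides with the arithmetic mean, so the benefit of learned weights comes only from departing from the uniform choice.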
All of the above methods are evaluated on seven different code lengths. Under the same experimental setting, all the parameters of the compared methods are chosen strictly according to their original papers. For KMP and MSE, the optimal balance parameter for each dataset is the candidate value that yields the best performance under 10-fold cross-validation on the training set. The number of GMM clusters in KMP is selected over a range with a step of 10 via cross-validation on the training data, and the sparsity hyperparameter is selected by the same procedure. The best smoothing parameter in the construction of the RBF kernel and the RBF-SVM is also chosen by cross-validation on the training data. Since the clustering procedure is stochastic, all experiments are repeated five times and each result reported in the following section is the average of five runs.
In Table 2, we first report the performance of the original single-view representations on all three datasets. In detail, we extract the original feature representation of a single view and feed it directly to the SVM for classification. From the comparison, we observe that the GIST features consistently outperform the other descriptors on the CMU PIE and CIFAR10 datasets, whereas HOG performs best on the SUN397 dataset. The lowest accuracy is always obtained by ColorHist. Furthermore, we include in this comparison the long representation obtained by concatenating all four original feature representations. In most cases the concatenated representation performs better than the single-view representations, but it is always significantly worse than the proposed KMP. Additionally, the results of multiple kernel learning based on SVM (MKL-SVM), using the same four feature descriptors, are listed in Table 2. Specifically, the best accuracies achieved by KMP are 99.5%, 89.7% and 40.5% on CMU PIE, CIFAR10 and SUN397, respectively.
In Fig. 1, seven different embedding schemes are compared with the proposed KMP on all three datasets. In this comparison, the proposed KMP always leads to the best performance for image classification. Meanwhile, arithmetic mean (AM) and the single-view-based spectral projection (BSP) generally achieve higher accuracies than the best performance of geometric mean (GM) and the average performance of the single-view-based spectral projection (ASP). The concatenation of single-view-based embeddings (CSP) achieves competitive performance compared with BSP on all three datasets. DSE always performs worse than MSE and sometimes even obtains lower results than CSP. However, DSE performs better than GM and ASP, since it adopts a more meaningful multiview combination scheme. Beyond that, the final results clearly differ considerably across target dimensions. Fig. 2 plots the low-dimensional embeddings obtained by AM, GM, KMP, DSE and MSE on the CIFAR10 dataset. Our proposed KMP separates the different categories well, since it takes the semantically meaningful data structure of different views into consideration for embedding.
In addition, we can observe that with the increase of the dimension, all the curves of compared methods on the CIFAR10 and SUN397 datasets are climbing up except for DSE and MSE, both of which have a slight decrease on SUN397 when the dimension exceeds . However, on the CMU PIE dataset, the results in comparison always climb up then go down for almost every compared method except for DSE when the length of dimension increases (see Fig. 1). For instance, the highest accuracy on the CMU PIE dataset is on the dimension of and the best performance on CIFAR10 and SUN397 happens when and , respectively.
Furthermore, some parameter sensitivity analysis is carried out. Table 3 illustrates the performance variation of KMP with respect to the parameter on the CMU PIE dataset; the target dimensionality of the low-dimensional embedding is fixed at with a step of 10, respectively. By adopting the 10-fold cross-validation scheme on the training data, it is demonstrated that higher dimensions prefer a larger in our KMP. Finally, Fig. 3 shows the variation of parameters and on all three datasets. The general tendency of these curves is consistently shown as “rise-then-fall”. It can be also seen from this figure that a larger training set needs larger values of and , and vice versa.
|CMU PIE||Training time||1148.24||716.79||873.72||755.28|
|CIFAR 10||Training time||1683.70||1026.32||1098.97||991.54|
4.3 Time Consumption Analysis
In this section, we compare the training and coding time of the proposed KMP algorithm with other methods. As Table 4 shows, our method achieves training time competitive with the state-of-the-art multiview and multiple kernel learning methods. Since there is no embedding procedure in MKL, the coding time is not applicable to MKL. Due to the nature of DSE and MSE, they need to be re-trained when a new test sample is received. In contrast, once the projection and weights are learned by KMP, they are fixed for all test samples, so new samples can be embedded quickly. All the experiments are run in Matlab 2014a on a workstation with an i7 processor and 32GB RAM.
5 Conclusion
In this paper, we have presented an effective subspace learning framework called Kernelized Multiview Projection (KMP). As an unsupervised method, KMP encodes a variety of features with different weights to achieve a semantically meaningful embedding. Specifically, KMP explores the complementary property of different views and finds a low-dimensional subspace where the distribution of each view is sufficiently smooth and discriminative. KMP can be regarded as a fused dimensionality reduction method for multiview data. We have evaluated our approach on three datasets: CMU PIE, CIFAR10 and SUN397. The results show the effectiveness and the superiority of our algorithm compared with other multiview embedding methods. For future work, we plan to combine the current KMP approach with semi-supervised learning for other computer vision tasks.
References
- [1] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In European Conference on Computer Vision, 2004.
- [2] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2001.
- [3] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet. Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, pages 177–184, 2003.
- [4] J. C. Bezdek and R. J. Hathaway. Some notes on alternating optimization. In AFSS International Conference on Fuzzy Systems, 2002.
- [5] R. Bhatia. Matrix analysis. Springer-Verlag, 1997.
- [6] S. Bickel and T. Scheffer. Multi-view clustering. In International Conference on Data Mining, 2004.
- [7] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008.
- [8] D. Cai, X. He, and J. Han. Speed up kernel discriminant analysis. The VLDB Journal, 20(1):21–33, 2011.
- [9] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang. Learning with $\ell_1$-graph for image analysis. IEEE Transactions on Image Processing, 19(4):858–866, 2010.
- [10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
- [11] M. Gönen and E. Alpaydin. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268, 2011.
- [12] Y. Han, F. Wu, D. Tao, J. Shao, Y. Zhuang, and J. Jiang. Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology, 22(10):1485–1496, 2012.
- [13] G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, 1952.
- [14] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
- [15] L. Liu and L. Shao. Discriminative partition sparsity analysis. In International Conference on Pattern Recognition, 2014.
- [16] B. Long, P. S. Yu, and Z. M. Zhang. A general model for multiple view unsupervised learning. In International Conference on Data Mining, 2008.
- [17] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
- [18] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
- [19] Y. C. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40–44, 1993.
- [20] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, pages 2323–2326, 2000.
- [21] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
- [22] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
- [23] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object detection. In IEEE International Conference on Computer Vision, pages 606–613, 2009.
- [24] T. Xia, D. Tao, T. Mei, and Y. Zhang. Multiview spectral embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 40(6):1438–1446, 2010.
- [25] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
- [26] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems, 2004.
- [27] B. Xie, Y. Mu, D. Tao, and K. Huang. m-SNE: Multiview stochastic neighbor embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 41(4):1088–1096, 2011.
- [28] Z. Zhao and H. Liu. Multi-source feature selection via geometry-dependent covariance analysis. In Third Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pages 36–47, 2008.
- [29] A. Zien and C. S. Ong. Multiclass multiple kernel learning. In International Conference on Machine Learning, 2007.